1 Introduction
The choice of neural network architecture is an important consideration in the deep learning community. Among a plethora of options, Residual Neural Networks (ResNets) He et al. (2016) have emerged as an important subclass of models, as they mitigate the gradient issues Balduzzi et al. (2017) arising when training deep neural networks by adding skip connections between successive layers. Besides architectural advancements inspired by the original scheme Zagoruyko and Komodakis (2016); Xie et al. (2017), Neural Ordinary Differential Equation (Neural ODE) models Chen et al. (2018b); E (2017); Lu et al. (2018); Haber and Ruthotto (2017) have recently been proposed as a continuous-depth analog of ResNets. While Neural ODEs do not necessarily improve upon the sheer predictive performance of ResNets, they allow the vast knowledge of ODE theory to be applied to deep learning research. For instance, the authors of Yan et al. (2020) discovered that for specific perturbations, Neural ODEs are more robust than convolutional neural networks. Moreover, inspired by the theoretical properties of the solution curves, they propose a regularizer which improves the robustness of Neural ODE models even further. However, if Neural ODEs are chosen for their theoretical advantages, it is essential that the effective model (the combination of the ODE problem and its solution via a particular numerical method) is a close approximation of the true analytical, but practically inaccessible, ODE solution.
In this work, we study the empirical risk minimization (ERM) problem

(1) $\min_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i),$

where $\{(x_i, y_i)\}_{i=1}^{N}$ is a set of training data, $\ell$ is a (nonnegative) loss function, and $f_\theta$ is a Neural ODE model with weights $\theta$, i.e.,

(2) $f_\theta = d_\theta \circ \Phi_\theta \circ u_\theta,$

where $u_\theta$ and $d_\theta$ are (suitable) neural networks and denote the upstream and downstream layers, respectively. $\Phi_\theta$ is defined to be the (analytical) flow of the dynamical system

(3) $\dot{z}(t) = g_\theta(z(t), t), \quad z(0) = u_\theta(x).$

As the vector field of the dynamical system is itself defined by a neural network, evaluating $\Phi_\theta$ is intractable and we have to resort to a numerical scheme $\hat{\Phi}$ to compute an approximation. $\hat{\Phi}$ belongs either to a class of fixed-step-size methods or is an adaptive-step-size solver as proposed in Chen et al. (2018b). For each initial value problem (IVP) of the Neural ODE, the trajectory computed using a particular solver with step size $h$ is denoted by $\hat{\Phi}^h$, where $h$ is uniquely defined for fixed-step solvers.
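As a concrete reference, the two fixed-step schemes used throughout this paper can be sketched in a few lines of NumPy; the function names and interface below are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def euler_step(f, z, t, h):
    """One explicit Euler step for z' = f(z, t)."""
    return z + h * f(z, t)

def rk4_step(f, z, t, h):
    """One classical 4th-order Runge-Kutta step for z' = f(z, t)."""
    k1 = f(z, t)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(z + h * k3, t + h)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def solve_fixed_step(f, z0, t0=0.0, t1=1.0, n_steps=10, step=euler_step):
    """Integrate z' = f(z, t) from t0 to t1 with n_steps equal steps.

    Returns the whole trajectory, shape (n_steps + 1, *z0.shape)."""
    h = (t1 - t0) / n_steps
    traj = [np.asarray(z0, dtype=float)]
    t = t0
    for _ in range(n_steps):
        traj.append(step(f, traj[-1], t, h))
        t += h
    return np.stack(traj)
```

On the scalar test problem $\dot z = z$ over $[0,1]$, 100 Euler steps recover $e$ to about two digits, while 100 rk4 steps are accurate to roughly ten digits, illustrating the first- versus fourth-order convergence of the two methods.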
Since numerical solvers play an essential role in approximating the solutions of an ODE, it is natural to ask: how does the choice of the numerical method affect the training of a Neural ODE model? Specifically, does the discretization step of the numerical solver impact the resulting flow of the ODE? To test this, we first train a Neural ODE model on a synthetic classification task using a fixed-step solver with a small step size. Figure 1 (a) shows that for the small-step-size model, the numerically computed trajectories of the individual IVPs do not cross, which is an indication that the learned flow approximates the true solution of an ODE. In contrast, the trajectories of the IVPs cross if the training is repeated with a larger step size (see Figure 1 (b)). This behavior clearly indicates that the numerical solutions for solvers with large step sizes do not always agree with the true solutions of the ODE. For the latter model, the discretization error of the solver is so large that the resulting numerical solution no longer maintains the properties of ODE solutions.
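The crossing behavior can be checked mechanically: for a scalar ODE, solution curves that are ordered at $t = 0$ stay ordered for all time, so a sign change in the difference of two numerically computed trajectories certifies that the discrete dynamics violate uniqueness. A minimal sketch, using the stiff toy field $f(z) = -10z$ instead of a trained network (all names illustrative):

```python
import numpy as np

def trajectories_cross(traj_a, traj_b):
    """For two scalar trajectories sampled on the same time grid: solutions of
    a well-posed scalar ODE can never swap order, so any sign change in their
    difference indicates a crossing, i.e. a violation of uniqueness."""
    diff = np.asarray(traj_a) - np.asarray(traj_b)
    return bool(np.any(np.sign(diff[0]) * diff < 0))

# Example: explicit Euler on z' = -10 z with two nearby initial values.
def f(z, t):
    return -10.0 * z

def euler_traj(z0, h, n):
    zs = [z0]
    for _ in range(n):
        zs.append(zs[-1] + h * f(zs[-1], 0.0))
    return zs

# Small step (h = 0.05): update factor 0.5, trajectories keep their order.
fine = trajectories_cross(euler_traj(1.0, 0.05, 20), euler_traj(2.0, 0.05, 20))
# Large step (h = 0.30): update factor -2, the discrete solutions oscillate
# and repeatedly swap order, so the "trajectories" cross.
coarse = trajectories_cross(euler_traj(1.0, 0.30, 20), euler_traj(2.0, 0.30, 20))
```

Here the crossing is caused purely by the discretization: the exact flow of $f(z) = -10z$ never produces crossing trajectories.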
If we are interested in extending the advances made in the ODE community to Neural ODE models, we have to ensure that the trained Neural ODE model indeed corresponds to a time-continuous dynamical system. Consequently, if the trained model corresponds to an ODE that is (qualitatively) reasonably well approximated by the applied discretization, it stands to reason that any discretization with similar or smaller discretization error should yield the same predictions. We observe that for the model trained with a small step size, testing with another solver with a smaller step size indeed achieves the same accuracy (Figure 1 (c)). However, the model trained with a large step size shows a sharp drop in test performance when using a solver with smaller discretization error (Figure 1 (d)).
In the training process of Neural ODEs, the neural network describing the vector field of the ODE is not trained directly. Instead, the numerical solution, in which the neural network is evaluated at discrete points in time, is optimized. Therefore, when training with large step sizes, the resulting model can no longer be described by a time-continuous ODE, but rather by a discrete finite difference equation. Hence, the model can no longer be viewed as being independent of a specific solver with a specific step size.
In this work we show that the training process of a Neural ODE yields a discrete ResNet without a valid ODE interpretation if the step size is chosen too large. Furthermore, our rigorous Neural ODE experiments on two synthetic datasets as well as on MNIST and CIFAR-10 show that for each dataset there exists a critical step size below which the ODE interpretation becomes valid. Based on this observation, we propose an algorithm that finds the coarsest discretization that still leads to a continuous dynamical system. Additionally, we introduce a difficult synthetic dataset where the classification problem directly stems from a true generating vector field.
2 Related Work
The connections between ResNets and ODEs have been discussed in E (2017); Lu et al. (2018); Haber and Ruthotto (2017); Sonoda and Murata (2019). The authors in Behrmann et al. (2018) use similar ideas to build an invertible ResNet. Likewise, additional knowledge about the ODE solvers can be exploited to create more stable and robust architectures with a ResNet backend Haber and Ruthotto (2017); Haber et al. (2019); Chang et al. (2018); Ruthotto and Haber (2019); Ciccone et al. (2018); Cranmer et al. (2020); Benning et al. (2019).
Continuous-depth deep learning was first proposed in Chen et al. (2018b); E (2017). Although ResNets are universal function approximators Lin and Jegelka (2018), Neural ODEs require specific architectural choices to be as expressive as their discrete counterparts Dupont et al. (2019); Zhang et al. (2019a); Li et al. (2019). In this direction, one common approach is to introduce a time-dependence for the weights of the neural network Zhang et al. (2019c); Thorpe and van Gennip (2018); Avelin and Nyström (2020). Other solutions include novel Neural ODE models Lu et al. (2020); Massaroli et al. (2020) with improved training behavior, and variants based on kernels Owhadi and Yoo (2019) and Gaussian processes Hegde et al. (2019). Adaptive ResNet architectures have been proposed in Veit and Belongie (2018); Chang et al. (2017). The dynamical systems view of ResNets has led to the development of methods using time step control as part of the ResNet architecture Yang et al. (2020); Zhang et al. (2019b).
3 Synthetic datasets
3.1 Concentric sphere dataset
For our experiments, we introduce a classification task based on the concentric sphere dataset proposed by Dupont et al. (2019). We use three concentric spheres, where the outer and the inner sphere correspond to the same class (see Figure 2 (c) for a two-dimensional example). Whether this dataset can be fully described by an ODE depends on the degrees of freedom introduced by combining the Neural ODE with additional downstream (and upstream) layers.
This dataset can easily be scaled up to arbitrarily high dimensions. We use the dataset with 2, 3, 10, and 900 dimensions. We chose the 900-dimensional variant because it (roughly) corresponds to the dimensionality of the MNIST and CIFAR-10 datasets.
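A generator for this dataset can be sketched as follows; the radii and class assignment are illustrative choices, see Dupont et al. (2019) for the original construction:

```python
import numpy as np

def sample_sphere(n, dim, radius, rng):
    """n points uniformly distributed on a (dim-1)-sphere of the given radius,
    obtained by normalizing Gaussian samples."""
    g = rng.standard_normal((n, dim))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def concentric_spheres(n_per_sphere=100, dim=2, radii=(1.0, 2.0, 3.0), seed=0):
    """Three concentric spheres; the inner and outer sphere share class 0,
    the middle sphere is class 1 (radii are illustrative)."""
    rng = np.random.default_rng(seed)
    xs = [sample_sphere(n_per_sphere, dim, r, rng) for r in radii]
    ys = [np.zeros(n_per_sphere, int),   # inner sphere  -> class 0
          np.ones(n_per_sphere, int),    # middle sphere -> class 1
          np.zeros(n_per_sphere, int)]   # outer sphere  -> class 0
    return np.concatenate(xs), np.concatenate(ys)
```

Because the construction only normalizes Gaussian samples, the same code produces the 2-, 3-, 10-, and 900-dimensional variants by changing `dim`.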
3.2 Particle in energy landscape
For common deep learning classification tasks, it is unclear whether Neural ODEs are a good prior, since they are not universal function approximators Dupont et al. (2019); Zhang et al. (2019a). Moreover, even if such a Neural ODE model exists, the question remains whether Neural ODEs are preferable or sensible. Therefore, in this work, we introduce a new synthetic classification task where the true generative process is an ODE, so that using Neural ODEs to model this data should be a natural choice.
This synthetic dataset describes the dynamics of a particle in a 1D potential landscape $U(x)$ including friction. Due to the potential, the particle experiences a force $F = -U'(x)$. Additionally, the particle experiences a friction force proportional to its velocity $\dot{x}$: $F_{\mathrm{fric}} = -\mu \dot{x}$, where $\mu$ is the friction coefficient. According to Newton’s second law of motion, the dynamics of the particle with unit mass can be described as:

(4) $\ddot{x} = -U'(x) - \mu \dot{x}.$
The problem is run until equilibrium, that is, until the particle reaches zero velocity and does not experience any force. Since the dynamics are supposed to describe a classification task, the potential needs to fulfill the following conditions:
- The potential needs to be confining, i.e., $U(x) \to \infty$ as $|x| \to \infty$
- The potential needs to have a predefined number of local minima
The first condition ensures that the particle cannot escape the potential landscape. The number of minima determines the number of classes of the classification task; each minimum is therefore assigned a unique category. For our experiments, we generated a dataset with three classes, which corresponds to a potential landscape with three local minima (shown in Figure 2 (a) and (b)). However, this problem can be generalized to an arbitrary number of classes and to higher dimensions.
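A minimal sketch of such a generating process, assuming an illustrative degree-six potential with minima at $x = -2, 0, 2$ (the paper's exact landscape may differ), labels each initial position by the minimum the damped particle settles into:

```python
import numpy as np

MINIMA = np.array([-2.0, 0.0, 2.0])  # local minima of the illustrative potential

def grad_U(x):
    """U'(x) for a confining potential with minima at -2, 0, 2
    and local maxima (class barriers) at -1 and 1."""
    return x * (x**2 - 1.0) * (x**2 - 4.0)

def settle(x0, mu=1.0, h=1e-3, max_steps=200_000, tol=1e-6):
    """Integrate x'' = -U'(x) - mu*x' (particle starting at rest) until the
    particle has numerically come to rest in one of the minima."""
    x, v = float(x0), 0.0
    for _ in range(max_steps):
        a = -grad_U(x) - mu * v
        v += h * a          # semi-implicit Euler keeps the oscillation stable
        x += h * v
        if abs(v) < tol and abs(grad_U(x)) < tol:
            break
    return x

def label(x0, mu=1.0):
    """Class = index of the minimum the particle settles into."""
    return int(np.argmin(np.abs(settle(x0, mu) - MINIMA)))
```

Since a particle released at rest cannot climb over a barrier, the class boundaries are exactly the local maxima of the potential, which makes the ground-truth labeling unambiguous.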
An appealing aspect of this dataset is that it allows us to compare the recovered vector field with that of the true ODE (comparative figures can be found in the Supplementary Material). The true vector field has an attracting sink for each class, which is noticeably absent in the vector field of the trained Neural ODE model, where the points belonging to different classes are simply pushed towards different regions of the latent space. This illustrates that vector field identification is ill-posed in the ERM setting, as many feasible models exist that achieve high accuracy.
4 Interaction of Neural ODE and ODE solver can lead to discrete dynamics
We test the dependency of Neural ODEs on the choice of solver by training different models on the energy landscape dataset with varying step sizes. We observe that training with a solver with a small step size results in smooth trajectories. On the other hand, a Neural ODE model trained with a large step size leads to solutions which cross each other. The trajectory crossing problem is characterized as follows:
If the trajectories found by the ODE solver are supposed to approximate the true solutions of the ODE, then the trajectories should not cross. We claim that for large step sizes, the Neural ODE can no longer be viewed as a time-continuous system. Instead, the combination of solver and vector field has to be viewed as a discrete dynamical system, which is no longer represented by an ODE but by a finite difference equation. Contrary to this, if the step size of the solver used for training the Neural ODE model is sufficiently small, the resulting vector field is a good enough numerical approximation to describe the energy landscape problem as a continuous dynamical system. Consequently, during inference with the model trained with a small step size, using different solvers with smaller discretization errors does not lead to a different test accuracy, as the numerical solutions maintain enough significant digits. How many digits are significant depends on the robustness of the classifier block $c$: there should exist a robustness threshold $\epsilon > 0$ such that $\|z - \tilde{z}\| < \epsilon$ implies $c(z) = c(\tilde{z})$. Thus, if two solvers compute the same solution up to $\epsilon$, the classifier identifies these solutions as the same class, and the result of the model is not affected by interchanging these solvers. To test this conjecture, we use a solver with a smaller numerical error for testing, which can be achieved by reducing the step size of the solver. If the underlying dynamics can indeed be described by a time-continuous ODE, then the accuracy should be independent of the solver used for testing, as long as that solver has a smaller numerical error. For the model trained with the small step size, the accuracy is indeed independent of the test solver (Figure 1 (c)). On the other hand, the model trained with the large step size shows a sharp drop in accuracy when using smaller step sizes during testing (Figure 1 (d)). In this case, the vector field has adapted to the specific step size used during training and, therefore, the model is tied to a specific solver.

4.1 Experiments
For our experiments, we use a fixed-step solver to analyze the dynamics of Neural ODEs. In order to perform gradient descent, reverse-mode automatic differentiation through the solver has to be carried out. We choose to backpropagate through the numerical solver and not use the adjoint method described in Chen et al. (2018b). These choices were made because we want to showcase that the interplay between vector field and ODE solver can lead to discrete dynamics even in the simplest settings. However, the results also pertain to adaptive solvers; see the figures in the Supplementary Material. Additionally, using a fixed-step solver makes analyzing the problem easy, as the numerical error of the method can be adjusted by simply changing the step size. For all our experiments, we do not use an upstream block, similar to the architectures proposed in Dupont et al. (2019). We chose such an architectural scheme to maximize the modeling contribution of the ODE block.
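The discrete-dynamics effect can already be reproduced in this simplest setting without any training, using a fixed linear vector field as a stand-in for the learned one: the coarse Euler discretization is unstable and flips the sign of the state, so a downstream sign classifier disagrees with the fine discretization. All names and the toy classifier below are illustrative:

```python
def euler_solve(f, z0, h, n):
    """Explicit Euler with n fixed steps of size h on z' = f(z)."""
    z = z0
    for _ in range(n):
        z = z + h * f(z)
    return z

def predict(z0, h, n, f=lambda z: -10.0 * z):
    """Toy 'model': integrate the field, then classify by the sign of the
    final state (a stand-in for the downstream classifier block)."""
    return int(euler_solve(f, z0, h, n) >= 0.0)

# Same total integration time T = 1.5, different discretizations:
coarse = predict(1.0, h=0.3, n=5)    # update factor (1 - 10h) = -2: unstable,
                                     # the state changes sign -> class 0
fine = predict(1.0, h=0.05, n=30)    # update factor 0.5: stable -> class 1
```

The exact solution $z(T) = e^{-15} > 0$ agrees with the fine discretization; the coarse one produces a prediction that is an artifact of the solver, which is exactly the solver-dependence we test for in the experiments below.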
To test whether the observation discussed above generalizes across different datasets, we train Neural ODE models on the two synthetic datasets as well as on MNIST and CIFAR-10. For training the Neural ODE, Euler’s method and a 4th-order Runge-Kutta (rk4) method were used (detailed descriptions of these methods can be found in Hairer et al. (1993)). The trained Neural ODE was then tested with different step sizes and solvers. For a Neural ODE trained with Euler’s method, the model was tested with Euler’s method, the midpoint method, and the rk4 method. The testing step size was chosen as a factor of 0.5, 0.75, 1, 1.5, or 2 of the original step size used for training. For rk4, we only tested using the rk4 method with different step sizes. We report an average over five runs, using an aggregation of seeds for which the Neural ODE model trained successfully (the results for all seeds are disclosed in the Supplementary Material). We did not tune all hyperparameters to reach the best performance for each step size. Rather, we focused on hyperparameters that worked well across the entire range of step sizes used for training (see the Supplementary Material for details on the choice of hyperparameters and the architecture of the Neural ODE).
Training and testing the model with the same step size, we observe that the test accuracy does not show any clear dependence on the step size on any of the four datasets. Since we did not tune the learning rate for each step size, any visible trends could be due to this choice. On all datasets, we observe similar behavior for the dependence of the test accuracy on the test solver: when using large step sizes for training, the Neural ODE shows a dependence on the solver used for testing. But there exists some critical step size below which the model shows no clear dependence on the test solver, as long as this test solver has equal or smaller numerical error than the solver used for training (see Figure 3 and Figure 4; for additional results on higher-dimensional versions of the concentric sphere dataset we refer to the Supplementary Material). We found that this critical step size differs between datasets. To our surprise, the critical step sizes of the synthetic datasets were smaller than the ones for MNIST and CIFAR-10, which indicates that these synthetic tasks were more difficult than the standard image classification problems. For higher-order solvers such as rk4, larger step sizes than those for Euler’s method still lead to a valid ODE vector field. In agreement with convergence theory, the difference in the critical step sizes of rk4 and Euler’s method is due to the difference in the discretization error of the two solvers.
5 Algorithm for step size adaption
Although a Neural ODE achieves good accuracy regardless of whether the vector field has adapted to the solver, it is paramount to find a solution corresponding to an ODE flow if theoretical results about ODEs are to be applicable to Neural ODEs. To ensure this, we propose an algorithm that checks whether the continuous-time interpretation is preserved and adapts the step size if necessary.
The algorithm chooses the initial step size using the procedure described by Hairer et al. (1993). This ensures that the Neural ODE starts with an appropriate step size for all neural networks and initializations. We found that the initial step size suggested by this procedure is not too small, which makes the algorithm useful in practice. The Neural ODE starts training with the proposed step size. After a predefined number of training steps, the algorithm checks whether the Neural ODE is still continuous: the accuracy is calculated over one batch with the solver used for training and with a test solver which has a smaller discretization error than the solver used for training. The crucial test of whether the solver has become a feature of the model is whether the higher-accuracy solver drops the performance significantly. If so, we decrease the step size and let the model train for a couple of iterations to regain time-continuous dynamics. If not, we cautiously increase the step size. Unlike in adaptive ODE solvers, the difference between train-solver and test-solver accuracy does not indicate by how much the step size needs to be adapted, so we choose a constant multiplicative factor that works well in practice (see Algorithm 1). The algorithm was robust against small changes to its constants.
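Our reading of one such check can be sketched as follows; the drop tolerance and the multiplicative factors are illustrative placeholders, not the constants from Algorithm 1:

```python
def adapt_step_size(n_steps, acc_train_solver, acc_test_solver,
                    drop_tol=0.05, shrink=2.0, grow=1.1, max_steps=1024):
    """One check of the step-adaptation loop (a sketch, not Algorithm 1
    verbatim): if a test solver with smaller discretization error loses
    significant accuracy, the dynamics have become solver-specific, so we
    refine the time grid (smaller step size); otherwise we cautiously
    coarsen it (larger step size). Returns the new number of solver steps."""
    if acc_train_solver - acc_test_solver > drop_tol:
        n_steps = min(int(n_steps * shrink), max_steps)   # decrease step size
    else:
        n_steps = max(int(n_steps / grow), 1)             # increase step size
    return n_steps
```

This check would be invoked every fixed number of training iterations on a single batch, so its cost is one extra forward pass with the test solver.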
5.1 Experiments
We test the step adaption algorithm on four different datasets: the two synthetic datasets, MNIST, and CIFAR-10. For training we use Euler’s method and for testing we use the midpoint method. On all datasets we observe that the number of steps taken by the solver fluctuates. The reason for this is that the algorithm increases the step size until the step size is too large, and training with this step size leads to an adaption of the vector field to this particular step size. This behavior is corrected by continuing the training with a smaller step size (see Figure 6).
To compare the results of the step adaption algorithm to the results of the grid search, we compare the accuracy as well as the average number of function evaluations (NFE) per iteration. For the grid search, we determine the critical number of steps using the same method as in the step adaption algorithm and report the two step sizes closest to the critical step size which were part of the grid search. For the step adaption algorithm, we calculate the NFE per iteration by including all function evaluations over the course of training (see Table 1). The accuracy and step size found by our algorithm are on par with the smallest step size above the critical threshold, thereby eliminating the need for a grid search.
Table 1: Comparison of grid search and the step adaption algorithm (NFE = average number of function evaluations per iteration).

Data set               | Grid search NFE | Grid search accuracy | Step adaption NFE | Step adaption accuracy
Concentric spheres 2d  | 65–129          | %                    | 100.5             | %
Three Minima           | 33–65           | %                    | 43.3              | %
MNIST                  | 9–16            | %                    | 9.8               | %
CIFAR-10               | 17–33           | %                    | 21.9              | %
6 Conclusion
We have shown that the step size of the fixed-step solver used for training Neural ODEs determines whether the resulting flow is an approximation to a time-continuous system or a system discrete in time. As a simple test that works well in practice, we conclude that the model only corresponds to an ODE flow if its performance does not depend on the exact solver configuration. We have verified this behavior on MNIST and CIFAR-10 as well as on two synthetic datasets. Based on these observations, we developed a step adaption algorithm which maintains the continuous dynamics of the ODE throughout training. At minimal cost in accuracy and step size efficiency, our algorithm eliminates an expensive grid search. In future work we plan to eliminate the oscillatory behavior of the step adaption algorithm.
Although we have focused on fixed-step solvers in this work, we also observed a transition from ODE-like dynamics to discrete dynamics for adaptive-step-size solvers (see the Supplementary Material for these results). In this case, the tolerance parameter of the solver determines the behavior of the model. Additionally, instead of keeping the integration time constant, one could consider using a constant step size and increasing the integration time; we have not investigated this. Extending the step size algorithm to adaptive-step-size solvers and to schemes where the integration time is adapted is left to future work.
Broader Impact
The majority of our contributions are theoretical insights into the relatively recent deep learning paradigm of Neural ODEs. We hope and expect that this better understanding will improve the robustness and interpretability of Neural ODE models. As such, the impact of this work will depend on the usage of such models. We anticipate a positive impact for applications where the robustness and interpretability of model predictions are crucial, e.g., in medical applications or autonomous vehicles. However, we must also consider malicious applications, e.g., in surveillance and autonomous weapons. As responsible engineers, it is our duty to advocate for beneficial applications, e.g., by educating and supporting policy makers, or to refuse cooperation if a malicious application is suspected.
As a concrete result of the theoretical insights, we have suggested an algorithm that aims to reduce computational load. As a positive consequence, we expect a reduction of energy consumption as this not only minimizes the computational load per trained network, but hopefully eliminates the necessity for grid search over optimal step size.
Acknowledgments
The authors thank Andreas Look, Kenichi Nakazato and Sho Sonoda for helpful discussions. PH is grateful for financial support by the Ministry of Science, Research and Arts of the State of Baden-Württemberg, and the European Research Council (ERC StG Action 757275 / PANAMA).
References
Avelin and Nyström [2020] B. Avelin and K. Nyström. Neural ODEs as the deep limit of ResNets with constant weights. Analysis and Applications, 2020. doi: 10.1142/S0219530520400023.

Balduzzi et al. [2017] D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, pages 342–350, 2017.
Behrmann et al. [2018] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
Benning et al. [2019] M. Benning, E. Celledoni, M. J. Ehrhardt, B. Owren, and C.-B. Schönlieb. Deep learning as optimal control problems: Models and numerical methods. Journal of Computational Dynamics, 6:171, 2019. ISSN 2158-2491. doi: 10.3934/jcd.2019009.
Chang et al. [2017] B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
Chang et al. [2018] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Chen et al. [2018a] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. C. Duke. Continuous-time flows for efficient inference and density estimation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 824–833, 2018a.
Chen et al. [2018b] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018b.
Ciccone et al. [2018] M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez. NAIS-Net: Stable deep networks from non-autonomous differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3025–3035, 2018.
 Cranmer et al. [2020] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho. Lagrangian neural networks. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
 Dupont et al. [2019] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural odes. In Advances in Neural Information Processing Systems, pages 3134–3144. 2019.
E [2017] W. E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017. doi: 10.1007/s40304-017-0103-z.
Finlay et al. [2020] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman. How to train your neural ODE. arXiv preprint arXiv:2002.02798, 2020.
 Grathwohl et al. [2019] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord: Freeform continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2019.
 Haber and Ruthotto [2017] E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
 Haber et al. [2019] E. Haber, K. Lensink, E. Treister, and L. Ruthotto. IMEXnet a forward stable deep neural network. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 2525–2534, 2019.
Hairer et al. [1993] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems. Springer, 2nd edition, 1993. ISBN 978-3-540-78862-1.
He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Hegde et al. [2019] P. Hegde, M. Heinonen, H. Lähdesmäki, and S. Kaski. Deep learning with differential Gaussian process flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1812–1821, 2019.
Huang et al. [2020] C.-W. Huang, L. Dinh, and A. Courville. Solving ODE with universal flows: Approximation theory for flow-based models. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
 Li et al. [2019] Q. Li, T. Lin, and Z. Shen. Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382, 2019.
Lin and Jegelka [2018] H. Lin and S. Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems 31, pages 6169–6178, 2018.
Lu et al. [2018] Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3276–3285, 2018.
 Lu et al. [2020] Y. Lu, C. Ma, Y. Lu, J. Lu, and L. Ying. A meanfield analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth. arXiv preprint arXiv:2003.05508, 2020.
 Massaroli et al. [2020] S. Massaroli, M. Poli, M. Bin, J. Park, A. Yamashita, and H. Asama. Stable neural flows. arXiv preprint arXiv:2003.08063, 2020.
Owhadi and Yoo [2019] H. Owhadi and G. R. Yoo. Kernel flows: From learning kernels from data into the abyss. Journal of Computational Physics, 389:22–47, 2019. ISSN 0021-9991. doi: 10.1016/j.jcp.2019.03.040.
Ruthotto and Haber [2019] L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pages 1–13, 2019.
Sonoda and Murata [2019] S. Sonoda and N. Murata. Transport analysis of infinitely deep neural network. Journal of Machine Learning Research, 20(2):1–52, 2019.
 Thorpe and van Gennip [2018] M. Thorpe and Y. van Gennip. Deep limits of residual neural networks. arXiv preprint arXiv:1810.11741, 2018.
 Veit and Belongie [2018] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In The European Conference on Computer Vision (ECCV), September 2018.
 Xie et al. [2017] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
 Yan et al. [2020] H. Yan, J. Du, V. Tan, and J. Feng. On robustness of neural ordinary differential equations. In International Conference on Learning Representations, 2020.
Yang et al. [2020] Y. Yang, J. Wu, H. Li, X. Li, T. Shen, and Z. Lin. Dynamical system inspired adaptive time stepping controller for residual network families. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
 Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zhang et al. [2019a] H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998, 2019a.
Zhang et al. [2019b] J. Zhang, B. Han, L. Wynter, B. K. H. Low, and M. Kankanhalli. Towards robust ResNet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4285–4291, 2019b.
 Zhang et al. [2019c] T. Zhang, Z. Yao, A. Gholami, J. E. Gonzalez, K. Keutzer, M. W. Mahoney, and G. Biros. Anodev2: A coupled neural ode framework. In Advances in Neural Information Processing Systems 32, pages 5151–5161, 2019c.
Appendix A Architecture and hyperparameters
We chose an architecture for our network similar to the architecture proposed by Dupont et al. [2019]. We tried to find hyperparameters which worked well for all step sizes. The same hyperparameters were used for the grid search and for training with the step adaption algorithm:
A.1 Architecture and hyperparameters used for MNIST
A.2 Architecture and hyperparameters used for CIFAR-10
Neural ODE Block

Conv2D(3, 128, Kernel 1x1, padding 0) + ReLU

Conv2D(128, 128, Kernel 3x3, padding 1) + ReLU

Conv2D(128, 3, Kernel 1x1, padding 0)
Classifier

Flatten + LinearLayer(3072,10) + SoftMax
Hyperparameters

Batch size: 256

Optimizer: Adam

Learning rate: 1e-3

Iterations used for training: 7800
A.3 Architecture used for Concentric Sphere 2D dataset
Neural ODE Block

Conv1D(1, 32, Kernel 1x1, padding 0) + ReLU

Conv1D(32, 32, Kernel 3x3, padding 1) + ReLU

Conv1D(32, 1, Kernel 1x1, padding 0)
Classifier

Flatten + LinearLayer(2,2) + SoftMax
Hyperparameters

Batch size: 128

Optimizer: Adam

Learning rate: 1e-4

Iterations used for training: 10000
A.4 Architecture used for Energy Landscape dataset
Neural ODE Block

LinearLayer(2, 48) + ReLU

LinearLayer(48, 48) + ReLU

LinearLayer(48, 2)
Classifier

Flatten + LinearLayer(2, 3) + SoftMax
Hyperparameters

Batch size: 128

Optimizer: Adam

Learning rate: 5e-4

Iterations used for training: 30000
For the classifier we used a plain linear layer and a softmax:

Flatten + LinearLayer(dim, out_dim) + SoftMax