When are Neural ODE Solutions Proper ODEs?

07/30/2020 ∙ by Katharina Ott, et al. ∙ Bosch ∙ Universität Tübingen

A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained model is supposed to be a flow generated from an ODE, it should be possible to choose another numerical solver with equal or smaller numerical error without loss of performance. We observe that if training relies on a solver with overly coarse discretization, then testing with another solver of equal or smaller numerical error results in a sharp drop in accuracy. In such cases, the combination of vector field and numerical method cannot be interpreted as a flow generated from an ODE, which arguably poses a fatal breakdown of the Neural ODE concept. We observe, however, that there exists a critical step size beyond which the training yields a valid ODE vector field. We propose a method that monitors the behavior of the ODE solver during training to adapt its step size, aiming to ensure a valid ODE without unnecessarily increasing computational cost. We verify this adaption algorithm on two common benchmark datasets as well as a synthetic dataset. Furthermore, we introduce a novel synthetic dataset in which the underlying ODE directly generates a classification task.

1 Introduction

The choice of neural network architecture is an important consideration in the deep learning community. Among a plethora of options, Residual Neural Networks (ResNets) He et al. (2016) have emerged as an important subclass of models, as they mitigate the gradient issues Balduzzi et al. (2017) that arise when training deep neural networks by adding skip connections between successive layers. Besides architectural advancements inspired by the original scheme Zagoruyko and Komodakis (2016); Xie et al. (2017), Neural Ordinary Differential Equation (Neural ODE) models Chen et al. (2018b); E (2017); Lu et al. (2018); Haber and Ruthotto (2017) have recently been proposed as an analog of continuous-depth ResNets. While Neural ODEs do not necessarily improve upon the sheer predictive performance of ResNets, they allow the vast body of ODE theory to be applied to deep learning research. For instance, the authors of Yan et al. (2020) discovered that, for specific perturbations, Neural ODEs are more robust than convolutional neural networks. Moreover, inspired by the theoretical properties of the solution curves, they propose a regularizer which improves the robustness of Neural ODE models even further. However, if Neural ODEs are chosen for their theoretical advantages, it is essential that the effective model, i.e., the combination of the ODE problem and its solution via a particular numerical method, is a close approximation of the true analytical, but practically inaccessible, ODE solution.

In this work, we study the empirical risk minimization (ERM) problem

$$\min_{\theta} \; \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \ell\big(f_\theta(x), y\big), \qquad (1)$$

where $\mathcal{D}$ is a set of training data, $\ell$ is a (non-negative) loss function, and $f_\theta$ is a Neural ODE model with weights $\theta$, i.e.,

$$f_\theta = d_\theta \circ \Phi_T \circ u_\theta, \qquad (2)$$

where $u_\theta$ and $d_\theta$ are (suitable) neural networks denoting the upstream and downstream layers, respectively. $\Phi_T$ is defined to be the (analytical) flow of the dynamical system

$$\dot{z}(t) = v_\theta\big(z(t), t\big), \qquad z(0) = u_\theta(x). \qquad (3)$$

As the vector field $v_\theta$ of the dynamical system is itself defined by a neural network, evaluating $\Phi_T$ is intractable, and we have to resort to a numerical scheme $\Psi$ to compute an approximation of $\Phi_T$. $\Psi$ belongs either to a class of fixed step size methods or is an adaptive step size solver as proposed in Chen et al. (2018b). For each initial value problem (IVP) of the Neural ODE, the trajectory computed using a particular solver $\Psi$ is given by the sequence of solver states $\{z_k\}_{k=0}^{N}$, where the discretization grid (and hence $N$) is uniquely defined for fixed-step solvers.
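
To make the composition in Eqs. (1)-(3) concrete, the following sketch shows how a fixed-step Euler approximation of the flow $\Phi_T$ can be composed with upstream and downstream layers. This is a minimal PyTorch illustration under our own naming (e.g. VectorField, odeint_euler), not the authors' implementation.

    import torch
    import torch.nn as nn

    class VectorField(nn.Module):
        """Neural network v_theta defining the right-hand side of Eq. (3)."""
        def __init__(self, dim, hidden=48):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

        def forward(self, t, z):
            return self.net(z)  # autonomous field: t is accepted but unused here

    def odeint_euler(f, z0, t0=0.0, t1=1.0, n_steps=32):
        """Fixed-step explicit Euler approximation of the flow Phi_T."""
        h = (t1 - t0) / n_steps
        z, t = z0, t0
        for _ in range(n_steps):
            z = z + h * f(t, z)
            t = t + h
        return z

    class NeuralODE(nn.Module):
        """f_theta = downstream o Phi_T o upstream, cf. Eq. (2)."""
        def __init__(self, dim, n_classes, n_steps=32):
            super().__init__()
            self.upstream = nn.Identity()          # no upstream block, as in Sec. 4.1
            self.vector_field = VectorField(dim)
            self.downstream = nn.Linear(dim, n_classes)
            self.n_steps = n_steps

        def forward(self, x):
            z0 = self.upstream(x)
            zT = odeint_euler(self.vector_field, z0, n_steps=self.n_steps)
            return self.downstream(zT)             # gradients flow through the unrolled solver

Back-propagating through the unrolled solver loop corresponds to the discretize-then-optimize choice made in Section 4.1 (no adjoint method).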

Figure 1: The Neural ODE was trained on a classification task with a small (a) and a large (b) step size. In (a) the trajectories look smooth and do not cross. In (b) the solutions found by the solver cross. Panels (c) and (d) show the test accuracy of the Neural ODE using different solver step sizes; the dark blue circle indicates the number of steps used for training.

Since numerical solvers play an essential role in approximating the solutions of an ODE, it is natural to ask: how does the choice of the numerical method affect the training of a Neural ODE model? Specifically, does the discretization step of the numerical solver impact the resulting flow of the ODE? To test this, we first train a Neural ODE model on a synthetic classification task using a fixed step solver with a small step size. Figure 1 (a) shows that for the small step size model, the numerically computed trajectories of the individual IVPs do not cross, which is an indication that the learned flow approximates the true solution of an ODE. In contrast, the trajectories of the IVPs cross if the training is repeated with a larger step size (see Figure 1 (b)). This behavior clearly indicates that the numerical solutions for solvers with large step sizes do not always agree with the true solutions of the ODE. For the latter model, the discretization error of the solver is so large that the resulting numerical solution no longer maintains the properties of ODE solutions.

If we are interested in extending the advances made in the ODE community to Neural ODE models, we have to ensure that the trained Neural ODE model indeed corresponds to a time continuous dynamical system. Consequently, if the trained model corresponds to an ODE that is (qualitatively) reasonably well approximated by the applied discretization, it also stands to reason that any discretization with similar or lesser discretization error should yield the same predictions. We observe that for the model trained with a small step size, using another solver with smaller step size for testing indeed achieves the same accuracy (Figure 1 (c)). However, the model trained with a large step size shows a sharp drop in the test performance when using a solver with smaller discretization error (Figure 1 (d)).

In the training process of Neural ODEs, the neural network describing the vector field of the ODE is not trained directly. Instead, the numerical solution, in which the neural network is evaluated at discrete points in time, is optimized. Therefore, when training with large step sizes, the resulting model can no longer be described by a time continuous ODE, but rather by a discrete finite difference equation. Hence, the model can no longer be viewed as being independent of a specific solver with a specific step size.
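
For illustration, with the notation of Eq. (3), a single explicit Euler step of size $h$ reads

$$z_{k+1} = z_k + h \, v_\theta(z_k, t_k),$$

which is exactly a residual (ResNet-style) update. For coarse step sizes, training therefore optimizes this finite difference recursion, i.e., a discrete ResNet, rather than the underlying continuous flow.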

In this work we show that the training process of a Neural ODE yields a discrete ResNet without a valid ODE interpretation if the step size is chosen too large. Furthermore, our rigorous Neural ODE experiments on two synthetic datasets as well as on MNIST and cifar10 show that for each dataset there exists a step size at which the ODE interpretation starts to be valid again. Based on this observation we propose an algorithm to find the coarsest discretization that still leads to a continuous dynamical system. Additionally, we introduce a difficult synthetic dataset where the classification problem directly stems from a true generating vector field.

2 Related Work

The connections between ResNets and ODEs have been discussed in E (2017); Lu et al. (2018); Haber and Ruthotto (2017); Sonoda and Murata (2019). The authors in Behrmann et al. (2018) use similar ideas to build an invertible ResNet. Likewise, additional knowledge about the ODE solvers can be exploited to create more stable and robust architectures with a ResNet backend Haber and Ruthotto (2017); Haber et al. (2019); Chang et al. (2018); Ruthotto and Haber (2019); Ciccone et al. (2018); Cranmer et al. (2020); Benning et al. (2019).

Continuous-depth deep learning was first proposed in Chen et al. (2018b); E (2017). Although ResNets are universal function approximators Lin and Jegelka (2018), Neural ODEs require specific architectural choices to be as expressive as their discrete counterparts Dupont et al. (2019); Zhang et al. (2019a); Li et al. (2019). In this direction, one common approach is to introduce a time-dependence for the weights of the neural network Zhang et al. (2019c); Thorpe and van Gennip (2018); Avelin and Nyström (2020). Other solutions include novel Neural ODE models Lu et al. (2020); Massaroli et al. (2020) with improved training behavior, as well as variants based on kernels Owhadi and Yoo (2019) and Gaussian processes Hegde et al. (2019). Adaptive ResNet architectures have been proposed in Veit and Belongie (2018); Chang et al. (2017). The dynamical systems view of ResNets has led to the development of methods using time step control as a part of the ResNet architecture Yang et al. (2020); Zhang et al. (2019b).

Neural ODEs have also been applied to generative modelling Chen et al. (2018b, a); Grathwohl et al. (2019). Applications of ODE theory to improve the training of generative models have been proposed in Finlay et al. (2020); Huang et al. (2020).

3 Synthetic datasets

3.1 Concentric sphere dataset

For our experiments, we introduce a classification task based on the concentric sphere dataset proposed by Dupont et al. (2019). We use three concentric spheres, where the outer and the inner sphere correspond to the same class (see Figure 2 (c) for a two dimensional example). Whether this dataset can be fully described by an ODE depends on the degrees of freedom introduced by combining the Neural ODE with additional downstream (and upstream) layers.

This dataset can easily be scaled up to arbitrarily high dimensions. We use the dataset with 2, 3, 10, and 900 dimensions. We chose the 900-dimensional variant because it (roughly) corresponds to the dimensionality of the MNIST and cifar10 datasets.
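
A data generator in the spirit of this dataset can be sketched as follows; the radii, sample counts, and the absence of noise are our own assumptions, as the exact configuration is not restated here.

    import numpy as np

    def sample_sphere(n, dim, radius, rng):
        """Sample n points uniformly from a (dim-1)-sphere of the given radius."""
        x = rng.normal(size=(n, dim))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        return radius * x

    def concentric_spheres(n_per_sphere=500, dim=2, radii=(0.5, 1.0, 1.5), seed=0):
        """Three concentric spheres; the inner and outer sphere share class 0 (assumed radii)."""
        rng = np.random.default_rng(seed)
        xs, ys = [], []
        for i, r in enumerate(radii):
            xs.append(sample_sphere(n_per_sphere, dim, r, rng))
            ys.append(np.full(n_per_sphere, 0 if i in (0, 2) else 1))
        return np.concatenate(xs), np.concatenate(ys)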

Figure 2: Synthetic datasets for Neural ODEs. (a) Potential landscape used to generate a classification task with three classes. The resulting distribution of the classes is shown in (b): the position of each dot corresponds to its initial position and initial velocity, and the color encodes the final position, with each color corresponding to a different local minimum, that is, a different class. (c) shows the concentric sphere dataset as a classification task.

3.2 Particle in energy landscape

For common deep learning classification tasks, it is unclear whether Neural ODEs are a good prior, since they are not universal function approximators Dupont et al. (2019); Zhang et al. (2019a). Moreover, even if such a Neural ODE model exists, the question remains whether Neural ODEs are preferable or sensible. Therefore, in this work we introduce a new synthetic classification task where the true generative process is an ODE, so that using Neural ODEs to model these data is a natural choice.

This synthetic dataset describes the dynamics of a particle in a 1D potential landscape $V(x)$, including friction. Due to the potential, the particle experiences a force $F_V = -\partial V / \partial x$. Additionally, the particle experiences a friction force proportional to its velocity $v = \dot{x}$: $F_f = -\gamma v$, where $\gamma$ is the friction coefficient. According to Newton’s second law of motion, the dynamics of the particle with unit mass can be described as

$$\ddot{x}(t) = -\frac{\partial V}{\partial x}\big(x(t)\big) - \gamma\, \dot{x}(t). \qquad (4)$$

The problem is run until equilibrium, that is, until the particle reaches zero velocity and no longer experiences any force. Since the dynamics are supposed to describe a classification task, the potential needs to fulfill the following conditions:

  1. The potential needs to confine the particle so that it cannot escape the landscape.
  2. The potential needs to have a predefined number of local minima.

The first condition ensures that the particle cannot escape the potential landscape. The number of minima determines the number of classes of the classification task; each minimum is therefore assigned a unique category. For our experiments, we generated a dataset with three classes, which corresponds to a potential landscape with three local minima (shown in Figure 2 (a) and (b)). However, this problem can be generalized to an arbitrary number of classes and to higher dimensions.
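
The data generation can be sketched as follows: the particle state is integrated according to Eq. (4) until it is approximately at rest, and the label is the index of the nearest local minimum. The particular potential (with minima at -2, 0, and 2), the friction coefficient, the sampling ranges of the initial conditions, and the stopping tolerance below are illustrative assumptions, not the exact values used in our experiments.

    import numpy as np

    MINIMA = np.array([-2.0, 0.0, 2.0])    # assumed positions of the three local minima
    GAMMA = 0.5                             # assumed friction coefficient

    def grad_potential(x):
        """Gradient of an illustrative confining potential with minima at MINIMA."""
        return x * (x ** 2 - 1.0) * (x ** 2 - 4.0)

    def settle(x0, v0, dt=0.01, max_steps=200_000, tol=1e-3):
        """Integrate Eq. (4) with semi-implicit Euler until the particle is near rest."""
        x, v = x0, v0
        for _ in range(max_steps):
            v += dt * (-grad_potential(x) - GAMMA * v)
            x += dt * v
            if abs(v) < tol and abs(grad_potential(x)) < tol:
                break
        return x

    def make_dataset(n=1000, seed=0):
        """Inputs are initial states (x0, v0); labels index the minimum reached."""
        rng = np.random.default_rng(seed)
        x0 = rng.uniform(-2.5, 2.5, size=n)   # assumed sampling range for x0
        v0 = rng.uniform(-2.0, 2.0, size=n)   # assumed sampling range for v0
        labels = np.array([np.argmin(np.abs(MINIMA - settle(x, v)))
                           for x, v in zip(x0, v0)])
        return np.stack([x0, v0], axis=1), labels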

An appealing aspect of this dataset is that it allows us to compare the recovered vector field with that of the true ODE (comparative figures can be found in the Supplementary Material). The true vector field has an attracting sink for each class; these sinks are noticeably absent in the vector field of the trained Neural ODE model, in which the points belonging to different classes are simply pushed towards different regions of the latent space. This illustrates that vector field identification is ill-posed in the ERM setting, as many feasible models exist that achieve high accuracy.

4 Interaction of Neural ODE and ODE solver can lead to discrete dynamics

We test the dependency of Neural ODEs on the choice of solver by training different models on the energy landscape dataset with varying step sizes. We observe that training with a solver using a small step size results in smooth trajectories. On the other hand, a Neural ODE model trained with a large step size leads to solutions which cross each other. The trajectory crossing problem is defined as follows:

If the trajectories found by the ODE solver are supposed to approximate the true solutions of the ODE, then the trajectories should not cross. We claim that for large step sizes, the Neural ODE can no longer be viewed as a time continuous system. Instead, the combination of solver and vector field has to be viewed as a discrete dynamical system, which is no longer represented by an ODE but by a finite difference equation. Contrary to this, if the step size of the solver used for training the Neural ODE model is sufficiently small, the resulting vector field is a good enough numerical approximation to describe the energy landscape problem as a continuous dynamical system. Consequently, during inference, using different solvers with smaller discretization errors for the model trained with a small step size does not lead to a different test accuracy, as the numerical solutions maintain enough significant digits. How many significant digits are needed depends on the robustness of the classifier block: for the classifier $d_\theta$, there should exist a robustness threshold $\epsilon > 0$ such that $d_\theta(z) = d_\theta(z + \delta)$ for all perturbations with $\|\delta\| \leq \epsilon$. Thus, if two solvers compute the same solution up to $\epsilon$, the classifier identifies these solutions as the same class, and the result of the model is not affected by interchanging these solvers. To test this conjecture, we use a solver with a smaller numerical error for testing, which can be achieved by reducing the step size of the solver. If the underlying dynamics can indeed be described by a time continuous ODE, then the accuracy should be independent of the solver used for testing, as long as this solver has a smaller numerical error. For the model trained with the small step size, the accuracy is indeed independent of the solver used for testing (Figure 1 (c)). On the other hand, the model trained with the large step size shows a sharp drop in accuracy when using smaller step sizes during testing (Figure 1 (d)). In this case, the vector field has adapted to the specific step size used during training and, therefore, the model is tied to a specific solver.
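
This consistency check can be phrased as a small evaluation sweep. In the sketch below, predict is a hypothetical callable that runs a trained Neural ODE with a given solver and number of steps and returns class logits; the tolerance of 0.05 is an assumption.

    import numpy as np

    def accuracy(logits, labels):
        return float(np.mean(np.argmax(logits, axis=1) == labels))

    def solver_consistency_check(predict, X, y, train_solver="euler",
                                 train_steps=16, tol=0.05):
        """Compare the training configuration against solvers with equal or
        smaller numerical error; a large accuracy drop indicates that the
        learned vector field is tied to the training discretization."""
        ref_acc = accuracy(predict(X, train_solver, train_steps), y)  # predict is an assumed helper
        sweep = {}
        for solver in ("euler", "midpoint", "rk4"):
            for factor in (1, 2, 4):                  # equal or finer discretization
                steps = train_steps * factor
                sweep[(solver, steps)] = accuracy(predict(X, solver, steps), y)
        is_ode_like = all(ref_acc - acc <= tol for acc in sweep.values())
        return is_ode_like, ref_acc, sweep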

4.1 Experiments

Figure 3: A Neural ODE was trained with different step sizes (plotted in different colors) on the two dimensional concentric sphere dataset (a), (b) and on the potential landscape dataset (c), (d). The model was tested with different solvers and different step sizes. In (a), (c) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). For high resolution versions, see the Supplementary Material.

Figure 4: A Neural ODE was trained with different step sizes (plotted in different colors) on MNIST (a), (b) and cifar10 (c), (d). The model was tested with different solvers and different step sizes. In (a), (c) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). For high resolution versions, see the Supplementary Material.

For our experiments, we use a fixed step solver to analyze the dynamics of Neural ODEs. In order to perform gradient descent, reverse-mode auto-differentiation through the solver has to be carried out. We choose to back-propagate directly through the numerical solver and do not use the adjoint method described in Chen et al. (2018b). These choices were made because we want to showcase that the interplay between vector field and ODE solver can lead to discrete dynamics even in the simplest settings. However, the results also pertain to adaptive solvers (see the figures in the Supplementary Material). Additionally, using a fixed step solver makes analyzing the problem easy, as the numerical error of the method can be adjusted by simply changing the step size. In all our experiments we do not use an upstream block, similar to the architectures proposed in Dupont et al. (2019). We chose such an architectural scheme to maximize the modeling contribution of the ODE block.

To test whether the observation discussed above generalizes across different datasets, we train Neural ODE models on the two synthetic datasets as well as on MNIST and cifar10. For training the Neural ODE, Euler’s method and a 4th order Runge-Kutta (rk4) method were used (detailed descriptions of these methods can be found in Hairer et al. (1993)). The trained Neural ODE was then tested with different step sizes and solvers. A Neural ODE trained with Euler’s method was tested with Euler’s method, the midpoint method, and the rk4 method; the testing step size was chosen as a factor of 0.5, 0.75, 1, 1.5, and 2 of the original step size used for training. For rk4, we only tested using the rk4 method with different step sizes. We report an average over five runs, using only those seeds for which the Neural ODE model trained successfully (the results for all seeds are disclosed in the Supplementary Material). We did not tune all hyper-parameters to reach the best performance for each step size. Rather, we focused on hyper-parameters that worked well across the entire range of step sizes used for training (see the Supplementary Material for more details on the choice of hyper-parameters and the architecture of the Neural ODE).
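
For reference, the fixed-step schemes used in these experiments can each be written in a few lines. The sketch below is our own generic code (not the authors' implementation); it operates on any array-like state and takes the learned vector field as a callable f(t, z).

    def euler_step(f, t, z, h):
        return z + h * f(t, z)

    def midpoint_step(f, t, z, h):
        k1 = f(t, z)
        return z + h * f(t + 0.5 * h, z + 0.5 * h * k1)

    def rk4_step(f, t, z, h):
        k1 = f(t, z)
        k2 = f(t + 0.5 * h, z + 0.5 * h * k1)
        k3 = f(t + 0.5 * h, z + 0.5 * h * k2)
        k4 = f(t + h, z + h * k3)
        return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    def integrate(step_fn, f, z0, t0=0.0, t1=1.0, n_steps=16):
        """Fixed-step integration from t0 to t1; training back-propagates
        directly through this unrolled loop (no adjoint method)."""
        h = (t1 - t0) / n_steps
        z, t = z0, t0
        for _ in range(n_steps):
            z = step_fn(f, t, z, h)
            t = t + h
        return z

Euler's method has global error of order $h$, the midpoint method of order $h^2$, and rk4 of order $h^4$, which is consistent with the observation below that rk4 tolerates larger critical step sizes.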

Training and testing the model with the same step size, we observe that the test accuracy does not show any clear dependence on the step size on any of the four datasets. Since we did not tune the learning rate for each step size, any visible trends could be due to this choice. On all datasets, we observe similar behavior for the dependence of the test accuracy on the test solver: when using large step sizes for training, the Neural ODE shows a dependence on the solver used for testing. But there exists some critical step size below which the model shows no clear dependence on the test solver, as long as this test solver has equal or smaller numerical error than the solver used for training (see Figure 3 and Figure 4; for additional results on higher dimensional versions of the concentric sphere dataset we refer to the Supplementary Material). We found that this critical step size differs between datasets. To our surprise, the critical step sizes for the synthetic datasets were smaller than those for MNIST and cifar10, which indicates that these tasks are, in this respect, more difficult than the standard image classification problems. For higher order solvers, such as rk4, bigger step sizes than those for Euler’s method still lead to a valid ODE vector field. In agreement with convergence theory, the difference in the critical step sizes of rk4 and Euler is due to the difference in the discretization error of the two solvers.

5 Algorithm for step size adaption

Although the Neural ODE achieves good accuracy regardless of whether the vector field has adapted to the solver, it is paramount to find a solution corresponding to an ODE flow if theoretical results about ODEs are to be applicable to Neural ODEs. To ensure this, we propose an algorithm that checks whether the continuous-time property is preserved and adapts the step size if necessary.

The algorithm chooses the initial step size using the scheme described by Hairer et al. (1993), which ensures that the Neural ODE starts with an appropriate step size for any network and initialization. We found that the suggested initial step size is not overly small, which makes the algorithm useful in practice. The Neural ODE starts training with the proposed step size using a solver of a given order. After a predefined number of iterations (we chose 50, see Algorithm 1), the algorithm checks whether the Neural ODE is still continuous: the accuracy is calculated over one batch with the solver used for training and with a test solver that has a smaller discretization error than the training solver. The crucial indicator of whether the solver has become a feature of the model is whether the higher-accuracy solver drops the performance significantly. If so, we decrease the step size and let the model train for a number of iterations to regain time-continuous dynamics. If not, we cautiously increase the step size. Unlike in adaptive ODE solvers, the difference between train and test accuracy does not tell us by how much the step size needs to be adapted, so we choose constant multiplicative factors that work well in practice (see Algorithm 1). The algorithm is robust against small changes to its constants.

Result: An algorithm that adapts the step size of the ODE solver to achieve time-continuous dynamics.
initialize starting step_size according to (Hairer et al., 1993, p. 169);
while Training do
    if Iteration % 50 == 0 then
        test_acc = calculate_accuracy_higher_order_solver();
        if |train_acc - test_acc| > 0.1 then
            new_step_size = 0.5 * step_size;
        else
            new_step_size = 1.1 * step_size;
        end if
    end if
end while
Algorithm 1: Step adaption algorithm
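
A training-loop sketch of Algorithm 1 is given below. The helpers initial_step_size, train_step, and batch_accuracy are hypothetical placeholders for the initial step size selection of Hairer et al. (1993), one optimization step with the training solver, and the accuracy of the current batch under a given solver, respectively; the constants 50, 0.1, 0.5, and 1.1 are those used in Algorithm 1.

    from itertools import cycle

    # initial_step_size, train_step, and batch_accuracy are assumed helpers (see text).
    def train_with_step_adaption(model, train_loader, optimizer, n_iterations,
                                 check_every=50, acc_gap=0.1):
        """Train a Neural ODE while adapting the solver step size so that the
        learned vector field stays consistent with a higher-order solver."""
        step_size = initial_step_size(model)      # Hairer et al. (1993), p. 169
        for it, (x, y) in enumerate(cycle(train_loader)):
            if it >= n_iterations:
                break
            train_acc = train_step(model, x, y, optimizer,
                                   solver="euler", step_size=step_size)
            if it % check_every == 0:
                # Re-evaluate the same batch with a solver of smaller numerical error.
                test_acc = batch_accuracy(model, x, y,
                                          solver="midpoint", step_size=step_size)
                if abs(train_acc - test_acc) > acc_gap:
                    step_size = 0.5 * step_size   # shrink to regain continuous dynamics
                else:
                    step_size = 1.1 * step_size   # cautiously coarsen
        return model, step_size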

5.1 Experiments

We test the step adaption algorithm on four different datasets: the two synthetic datasets, MNIST, and cifar10. For training we use Euler’s method and for testing we use the midpoint method. On all datasets we observe that the number of steps taken by the solver fluctuates. The reason for this is that the algorithm increases the step size until it becomes too large, and training with such a step size leads to an adaption of the vector field to this particular step size. Continuing training with a smaller step size corrects this behavior (see Figure 6).

Figure 5: Using the step adaption algorithm for training on MNIST (a), (b) and cifar10 (c), (d). (a), (c) show the test accuracy over the course of training for five different seeds. (b), (d) show the number of steps chosen by the algorithm over the course of training.

Figure 6: Behavior of the step adaption algorithm. (a) shows the test accuracy. At certain points in time (also marked in (b)), the model is evaluated with solvers of smaller discretization error (orange, green, and red data points). Triangles correspond to a 4th order Runge-Kutta method, crosses to the midpoint method. (b) shows the number of steps chosen by the algorithm.

To compare the results of the step adaption algorithm with those of the grid search, we compare the accuracy as well as the average number of function evaluations (NFE) per iteration. For the grid search, we determine the critical number of steps using the same criterion as in the step adaption algorithm and report the two step sizes of the grid search closest to the critical step size. For the step adaption algorithm, we calculate the NFE per iteration by including all function evaluations over the course of training (see Table 1). The accuracy and step size found by our algorithm are on par with the smallest step size above the critical threshold, thereby eliminating the need for a grid search.

                        Grid search            Step adaption algorithm
Data set                NFE        Accuracy    NFE        Accuracy
Concentric spheres 2d   65-129     %           100.5      %
Three Minima            33-65      %           43.3       %
MNIST                   9-16       %           9.8        %
Cifar10                 17-33      %           21.9       %
Table 1: Results for the accuracy and the number of function evaluations to achieve time continuous dynamics using a grid search and the proposed step adaption algorithm. For the grid search, we report the highest accuracy of all the runs which were identified as ODE-like.

6 Conclusion

We have shown that the step size of the fixed step solver used for training Neural ODEs determines whether the resulting flow is an approximation of a time continuous system or of a system discrete in time. As a simple test that works well in practice, we conclude that the model only corresponds to an ODE flow if its performance does not depend on the exact solver configuration. We have verified this behavior on MNIST and cifar10 as well as on two synthetic datasets. Based on these observations, we developed a step adaption algorithm which maintains the continuous dynamics of the ODE throughout training. At minimal cost in accuracy and step size efficiency, our algorithm eliminates the need for a massive grid search. In future work we plan to eliminate the oscillatory behavior of the step adaption algorithm.

Although we have focused on fixed step solvers in this work, we also observed a transition from ODE-like dynamics to discrete dynamics for adaptive step size solvers (see Supplementary Material for these results). In this case, the tolerance parameter of the solver determines the behavior of the ODE. Additionally, instead of keeping the integration time constant, one could consider using a constant step size and increasing the integration time. We have not investigated this. Extending the step size algorithm to adaptive step size solvers and schemes where the integration time can be adapted is left to future work.

Broader Impact

The majority of our contributions are theoretical insights into the relatively recent deep learning paradigm of Neural ODEs. We hope and expect that this better understanding will improve the robustness and interpretability of Neural ODE models. As such, the impact of this work will depend on the usage of such models. We anticipate a positive impact for applications where the robustness and interpretability of model predictions are crucial, e.g., in medical applications or autonomous vehicles. However, we must also consider malicious applications, e.g., in surveillance and autonomous weapons. As responsible engineers, it is our duty to advocate beneficial applications, e.g., by educating and supporting policy makers or by refusing cooperation if a malicious application is suspected.

As a concrete result of the theoretical insights, we have suggested an algorithm that aims to reduce computational load. As a positive consequence, we expect a reduction of energy consumption, as this not only minimizes the computational load per trained network but hopefully also eliminates the necessity for a grid search over the optimal step size.

Acknowledgments

The authors thank Andreas Look, Kenichi Nakazato and Sho Sonoda for helpful discussions. PH is grateful for financial support by the Ministry of Science, Research and Arts of the State of Baden-Württemberg, and the European Research Council (ERC StG Action 757275 / PANAMA).

References

  • Avelin and Nyström [2020] B. Avelin and K. Nyström. Neural odes as the deep limit of resnets with constant weights. Analysis and Applications, 2020. doi: 10.1142/S0219530520400023.
  • Balduzzi et al. [2017] D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, pages 342–350, 2017.
  • Behrmann et al. [2018] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
  • Benning et al. [2019] M. Benning, E. Celledoni, M. J. Ehrhardt, B. Owren, and C.-B. Schönlieb. Deep learning as optimal control problems: Models and numerical methods. Journal of Computational Dynamics, 6:171, 2019. ISSN 2158-2491. doi: 10.3934/jcd.2019009.
  • Chang et al. [2017] B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
  • Chang et al. [2018] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Chen et al. [2018a] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. C. Duke. Continuous-time flows for efficient inference and density estimation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 824–833, 2018a.
  • Chen et al. [2018b] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583. 2018b.
  • Ciccone et al. [2018] M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez. Nais-net: Stable deep networks from non-autonomous differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3025–3035. 2018.
  • Cranmer et al. [2020] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho. Lagrangian neural networks. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
  • Dupont et al. [2019] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural odes. In Advances in Neural Information Processing Systems, pages 3134–3144. 2019.
  • E [2017] W. E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 3 2017. doi: 10.1007/s40304-017-0103-z.
  • Finlay et al. [2020] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman. How to train your neural ode. arXiv preprint arXiv:2002.02798, 2020.
  • Grathwohl et al. [2019] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2019.
  • Haber and Ruthotto [2017] E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
  • Haber et al. [2019] E. Haber, K. Lensink, E. Treister, and L. Ruthotto. IMEXnet a forward stable deep neural network. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 2525–2534, 2019.
  • Hairer et al. [1993] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems. Springer, 2 edition, 1993. ISBN 978-3-540-78862-1.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hegde et al. [2019] P. Hegde, M. Heinonen, H. Lähdesmäki, and S. Kaski. Deep learning with differential gaussian process flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1812–1821, 2019.
  • Huang et al. [2020] C.-W. Huang, L. Dinh, and A. Courville. Solving {ode} with universal flows: Approximation theory for flow-based models. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
  • Li et al. [2019] Q. Li, T. Lin, and Z. Shen. Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382, 2019.
  • Lin and Jegelka [2018] H. Lin and S. Jegelka. Resnet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems 31, pages 6169–6178. 2018.
  • Lu et al. [2018] Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3276–3285, 2018.
  • Lu et al. [2020] Y. Lu, C. Ma, Y. Lu, J. Lu, and L. Ying. A mean-field analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth. arXiv preprint arXiv:2003.05508, 2020.
  • Massaroli et al. [2020] S. Massaroli, M. Poli, M. Bin, J. Park, A. Yamashita, and H. Asama. Stable neural flows. arXiv preprint arXiv:2003.08063, 2020.
  • Owhadi and Yoo [2019] H. Owhadi and G. R. Yoo. Kernel flows: From learning kernels from data into the abyss. Journal of Computational Physics, 389:22 – 47, 2019. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2019.03.040.
  • Ruthotto and Haber [2019] L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pages 1–13, 2019.
  • Sonoda and Murata [2019] S. Sonoda and N. Murata. Transport analysis of infinitely deep neural network. Journal of Machine Learning Research, 20(2):1–52, 2019.
  • Thorpe and van Gennip [2018] M. Thorpe and Y. van Gennip. Deep limits of residual neural networks. arXiv preprint arXiv:1810.11741, 2018.
  • Veit and Belongie [2018] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In The European Conference on Computer Vision (ECCV), September 2018.
  • Xie et al. [2017] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  • Yan et al. [2020] H. Yan, J. Du, V. Tan, and J. Feng. On robustness of neural ordinary differential equations. In International Conference on Learning Representations, 2020.
  • Yang et al. [2020] Y. Yang, J. Wu, H. Li, X. Li, T. Shen, and Z. Lin. Dynamical system inspired adaptive time stepping controller for residual network families. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhang et al. [2019a] H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998, 2019a.
  • Zhang et al. [2019b] J. Zhang, B. Han, L. Wynter, B. K. H. Low, and M. Kankanhalli. Towards robust resnet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4285–4291, 2019b.
  • Zhang et al. [2019c] T. Zhang, Z. Yao, A. Gholami, J. E. Gonzalez, K. Keutzer, M. W. Mahoney, and G. Biros. Anodev2: A coupled neural ode framework. In Advances in Neural Information Processing Systems 32, pages 5151–5161, 2019c.

Appendix A Architecture and hyper-parameters

We chose the architecture for our network similar to the architecture proposed by Dupont et al. [2019]. We tried to find hyper-parameters that worked well for all step sizes. The same hyper-parameters were used for the grid search and for training with the step adaption algorithm:


A.1 Architecture and hyper-parameters used for MNIST

Neural ODE Block

  • Conv2D(1, 96, Kernel 1x1, padding 0) + ReLu

  • Conv2D(96, 96, Kernel 3x3, padding 1) + ReLu

  • Conv2D(96, 1, Kernel 1x1, padding 0)

Classifier

  • Flatten + LinearLayer(784,10) + SoftMax

Hyper-parameters

  • Batch size: 256

  • Optimizer: SGD

  • Learning rate: 1e-2

  • Iterations used for training: 7020
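
Expressed in PyTorch, the MNIST vector field and classifier listed above might look as follows. This is a sketch of the listed layers only; how the time variable enters the convolutional block is not specified in the listing, so the autonomous form is our assumption.

    import torch.nn as nn

    class MNISTVectorField(nn.Module):
        """Convolutional network defining the ODE right-hand side on 1x28x28 states."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 96, kernel_size=1, padding=0), nn.ReLU(),
                nn.Conv2d(96, 96, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(96, 1, kernel_size=1, padding=0),
            )

        def forward(self, t, z):        # t unused: autonomous vector field (assumption)
            return self.net(z)

    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 10),
        nn.Softmax(dim=1),              # as listed; with CrossEntropyLoss one would omit this
    )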

A.2 Architecture and hyper-parameters used for cifar10

Neural ODE Block

  • Conv2D(3, 128, Kernel 1x1, padding 0) + ReLu

  • Conv2D(128, 128, Kernel 3x3, padding 1) + ReLu

  • Conv2D(128, 3, Kernel 1x1, padding 0)

Classifier

  • Flatten + LinearLayer(3072,10) + SoftMax

Hyper-parameters

  • Batch size: 256

  • Optimizer: Adam

  • Learning rate: 1e-3

  • Iterations used for training: 7800

A.3 Architecture used for the Concentric Sphere 2D dataset

Neural ODE Block

  • Conv1D(1, 32, Kernel 1x1, padding 0) + ReLu

  • Conv1D(32, 32, Kernel 3x3, padding 1) + ReLu

  • Conv1D(32, 1, Kernel 1x1, padding 0)

Classifier

  • Flatten + LinearLayer(2,2) + SoftMax

Hyper-parameters

  • Batch size: 128

  • Optimizer: Adam

  • Learning rate: 1e-4

  • Iterations used for training: 10000

A.4 Architecture used for the Energy Landscape dataset

Neural ODE Block

  • LinearLayer(2, 48) + ReLu

  • LinearLayer(48, 48) + ReLu

  • LinearLayer(48, 2)

Classifier

  • Flatten + LinearLayer(2, 3) + SoftMax

Hyper-parameters

  • Batch size: 128

  • Optimizer: Adam

  • Learning rate: 5e-4

  • Iterations used for training: 30000

For the classifier we used a plain linear layer and a softmax

  • Flatten + LinearLayer(dim, out_dim) + SoftMax

Appendix B Extended results

B.1 Concentric Sphere 2D

Figure 7: A Neural ODE was trained with different step sizes ((a), (b) 1 step, (c), (d) 2 steps, (e), (f) 4 steps) on the two dimensional concentric sphere dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

Figure 8: A Neural ODE was trained with different step sizes ((a), (b) 8 steps, (c), (d) 16 steps, (e), (f) 32 steps) on the two dimensional concentric sphere dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

Figure 9: A Neural ODE was trained with different step sizes ((a), (b) 64 steps, (c), (d) 128 steps, (e) 256 steps) on the two dimensional concentric sphere dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

B.2 Energy Landscape

Figure 10: A Neural ODE was trained with different step sizes ((a), (b) 1 step, (c), (d) 2 steps, (e), (f) 4 steps) on the energy landscape dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

Figure 11: A Neural ODE was trained with different step sizes ((a), (b) 8 steps, (c), (d) 16 steps, (e), (f) 32 steps) on the energy landscape dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

Figure 12: A Neural ODE was trained with different step sizes ((a), (b) 64 steps, (c) 128 steps, (d) 256 steps) on the energy landscape dataset. The model was tested with different solvers and different step sizes. In (a), (c), (d) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

B.3 MNIST

Figure 13: A Neural ODE was trained with different step sizes ((a), (b) 1 step, (c), (d) 2 steps, (e), (f) 4 steps) on the MNIST dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

Figure 14: A Neural ODE was trained with different step sizes ((a), (b) 8 steps, (c), (d) 16 steps, (e), (f) 32 steps, (g) 64 steps) on the MNIST dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e), (g) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

B.4 cifar10

Figure 15: A Neural ODE was trained with different step sizes ((a), (b) 1 step, (c), (d) 2 steps, (e), (f) 4 steps) on the cifar10 dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d), (f) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

Figure 16: A Neural ODE was trained with different step sizes ((a), (b) 8 steps, (c), (d) 16 steps, (e) 32 steps, (f) 64 steps) on the cifar10 dataset. The model was tested with different solvers and different step sizes. In (a), (c), (e), (f) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles). Excluded seeds are shown as grey circles.

B.5 Results for the n-dimensional concentric sphere dataset

Figure 17: A Neural ODE was trained with different step sizes (plotted in different colors) on the 3 dimensional concentric sphere dataset (a), (b) and on the potential landscape dataset (c), (d). The model was tested with different solvers and different step sizes. In (a), (c) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles).

Figure 18: A Neural ODE was trained with different step sizes (plotted in different colors) on the 10 dimensional concentric sphere dataset (a), (b) and on the potential landscape dataset (c), (d). The model was tested with different solvers and different step sizes. In (a), (c) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles).

Figure 19: A Neural ODE was trained with different step sizes (plotted in different colors) on the 900 dimensional concentric sphere dataset (a), (b) and on the potential landscape dataset (c), (d). The model was tested with different solvers and different step sizes. In (a), (c) the model was trained using Euler’s method. Results obtained by using the same solver for training and testing are marked by dark circles. Light markers indicate different step sizes used for testing. Circles correspond to Euler’s method, crosses to the midpoint method, and triangles to a 4th order Runge-Kutta method. In (b), (d) a 4th order Runge-Kutta method was used for training (dark circles) and testing (light circles).

B.6 Results with adaptive step size solvers

Figure 20: Training a Neural ODE on the Concentric Sphere 2D dataset using the Fehlberg2(1) method with different tolerances. The model was then tested with different tolerances. For a high tolerance (a), (b), the dynamics of the model depend on this specific solver configuration. For small tolerances (c), (d), the performance of the model is independent of the tolerance of the solver, as long as it is smaller than the tolerance the model was trained with.

Figure 21: Training a Neural ODE on the Concentric Sphere 2D dataset using the Dopri5(4) method with different tolerances. The model was then tested with different tolerances. For a high tolerance (a), (b), the dynamics of the model depend on this specific solver configuration. For small tolerances (c), (d), the performance of the model is independent of the tolerance of the solver, as long as it is smaller than the tolerance the model was trained with.

B.7 Trajectories of the Neural ODE for the energy landscape dataset

Figure 22: True solutions and solutions found by the Neural ODE. (a) shows the true underlying dynamics of the energy landscape problem. (b), (c), (d), (e), (f) show the solutions found by the Neural ODE for the energy landscape problem when started with different seeds and using 256 steps.