 # ANODEV2: A Coupled Neural ODE Evolution Framework

It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time stepping. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Residual networks [1, 2] have enabled training of very deep neural networks (DNNs). Recent work has shown an interesting connection between residual blocks and Ordinary Differential Equations (ODEs), showing that a residual network can be viewed as a discretization of a continuous ODE operator [3, 4, 5, 6, 7, 8]. These formulations are commonly called Neural ODEs, and here we follow the same convention. Neural ODEs provide a general framework that connects discrete DNNs to continuous dynamical systems theory, as well as discretization and optimal control of ODEs, all subjects with very rich theory.

A basic Neural ODE formulation and its connection to residual networks (for a single block in a network) is the following:

 (1a) z1 =z0+f(z0,θ) \small ResNet, (1b) z1 =z0+∫10f(z(t),θ)dt \small ODE, (1c) z1 =z0+f(z0,θ) \small ODE forward Euler.

Here, is the input to the network and is the output activation;

is the vector of network weights (independent of time); and

is the nonlinear operator defined by this block. (Here, we have written the ODE in terms of its solution at .) We can see that a single-step of forward Euler discretization of the ODE is identical to a traditional residual block. Alternatively, we could use a different time-stepping scheme or, more interestingly, use more time steps. Once the connection to ODEs was identified, several groups have incorporated the Neural ODE structure in neural networks and evaluated its performance on several different learning tasks.

A major challenge with training Neural ODEs is that backpropogating through ODE layers requires storage of all the intermediate activations (i.e. ) in time. In principle, the memory footprint of ODE layers has a cost of (where is the number of time steps to solve the ODE layer), which is prohibitive. The recent work of  proposed an adjoint based method to address this, with a training strategy that required only storage of the activation at the end of the ODE layer. All the intermediate activations were then “re-computed” by solving the ODE layers backwards. However, it has been recently shown that such an approach could lead to incorrect gradients, due both to numerical instability of backward ODE solve, and also to inconsistencies that relate to optimizing infinite dimensional operators . This is basically the well-known Discretize-Then-Optimize (DTO) versus Optimize-Then-Discretize issue. The checkpointing-based method ANODE  was proposed to solve the incorrect gradient problem of Neural ODE. More importantly, it was observed that using other discretization schemes such as RK2 or RK4, or using more time steps, does not affect the generalization performance of Neural ODEs as compared to baseline networks, despite the common belief . This is a very important challenge, as lack of any performance gain obviates the need for using Neural ODEs. In this paper, building on the latter approach of , we propose ANODEV2, a more general Neural ODE framework that addresses this problem. The key idea of ANODEV2 is that it allows the evolution of both weights and activations with a coupled system of ODEs:

 (2) {z(1)=z(0)+∫10f(z(t),θ(t))dt              parent network'',θ(t)=θ(0)+∫t0q(θ(t),p)dt,   θ(0)=θ0   weight network''.

Here, is a nonlinear operator (essentially controlling the dynamics of the network parameters in time); and are the corresponding parameters for the weight network. Our approach allows to be time dependent: is parameterized by the learnable dynamics of . This, in turn, is parameterized by and . In other words, instead of optimizing for a constant , we optimize for and . During inference, both weights and activations are forward-propagated in time by solving Eq. 2. Observe that if we set then we recover the Neural ODE approach proposed by [3, 4, 5, 6, 7, 8]. Eq. 2 replaces the problem of designing appropriate neural network blocks () with the problem of choosing appropriate function () in an ODE to model the changes of parameter (the weight network).

In summary, our main contributions are the following.

• We provide a general framework that extends Neural ODEs to a system of coupled ODEs which allows the coupled evolution of both model parameters and activations. This coupled formulation addresses the challenge with Neural ODEs, in that using more time steps or different discretization schemes do not affect model’s generalization performance .

• We derive the optimality conditions for how backpropagation should be performed for the coupled ODE formulation, using the so-called Karush–Kuhn–Tucker (KKT) conditions. In particular, we implement the corresponding DTO approach, along with a checkpointing scheme presented in

.

• We test the framework using multiple different residual models on CIFAR-10 by considering different coupled formulations. In particular, we show examples illustrating how a bio-physically motivated reaction-diffusion-advection (RDA) ODE could be used to model the evolution of the neural network parameters.

Our work fits into a rich literature on neural evolution research [10, 11, 12, 13, 14, 15, 16]. For example, several approaches similar to ANODEV2

have been taken in the line of evolutionary computing, where an auxiliary “child” network is used to generate the parameters for a “parent” network. This approach permits the restriction of the effective depth that the activations must go through, since the parent network could have smaller weight space than the child network. One example is HyperNEAT

, which uses “Compositional Pattern Producing Networks” (CPRNs) to evolve the model parameters [18, 19]. A similar approach using “Compressed Weight Search” was proposed in . A follow up work extended this approach by using differentiable CPRNs . The authors show that neural network parameters could be encoded through a fully connected architecture. Another seminal work in this direction is [22, 23]

, where an auxiliary network learns to produce “context-aware” weights in a recurrent neural network model.

A similar recent approach is taken in Hypernetworks , where model parameters are evolved through an auxiliary learnable neural network. This approach is a special case of ANODEV2, which could be derived by using a single time step discretization of Eq. 2, with a neural network for the evolution operator (denoted by and introduced in the next section).

Our framework is a generalization of these evolutionary algorithms, and it provides more flexibility for modeling the evolution of the model parameters in time. For instance, we will show how RDA operators could be used for the evolution operator

, with negligible increase in the model parameter size.

## 2 Methodology

In this section, we discuss the formulation for the coupled ODE-based neural network model described above, and we derive the corresponding optimality conditions. For a typical learning problem, the goal is to minimize the empirical risk over a set of training examples. Given a loss function

, we seek to find weights, , such that:

 (3) minθ1NN∑i=1ℓ(z(θ;xi,yi))+R(θ),

where is a regularization operator, is the training sample and its label, and the number of training samples. The loss function depends implicitly on through the network activation vector

. This problem is typically solved using Stochastic Gradient Descent (SGD) and backpropagation to compute the gradient of

with respect to .

### 2.1 Neural ODE

Consider the following notation for a residual block: , where is the input activation,

is the neural network kernel (e.g., comprising a series of convolutional blocks with non-linear or linear activation functions), and

is the output activation. As discussed above, an alternative view of a residual network is the following continuous-time formulation: , with (we will use both and to denote activation at time ). In the ODE-based formulation, this neural network has a continuous depth. In this case, we need to solve the following constrained optimization problem (Neural ODE):

 (4) minθ1NN∑i=1ℓ(z1;xi,yi)+R(θ)subject to:dzdt=f(z(t),  θ),z(0)=z0.

Note that in this formulation the neural network parameters do not change with respect to time. In fact, it has been observed that using adaptive time stepping or higher order discretization methods such as Runge-Kutta does not result in any gains in generalization performance using the above framework . To address this, we extend the Neural ODEs by considering a system of coupled ODEs, where both the model parameters and the activations evolve in time. This formulation is slightly more general than what we described in the introduction. We introduce an auxiliary dynamical system for , which we use to define . This allows for more general evolution of the model parameters and activations, which will be discussed in §2.2. In particular, we propose the following formulation:

 (5a) minp,w0J(z(1)) =1NN∑i=1ℓ(z(1);xi,yi)+R(w0,p), (5b) dzdt (5c) ∂w∂t (5d) θ(t) =∫t0K(t−τ)w(τ)dτ.

Note that here is a function of time, and it is parameterized by the whole dynamics of and a time convolution kernel (which in the simplest form could be a Dirac delta function so that ). Also,

can be a general function, e.g., another neural network, a linear operator, or even a discretized Partial Differential Equation (PDE) based operator. The latter is useful if we consider the

as a function , where parameterizes the signal space (e.g., 2D pixel space for images). This formulation allows for rich variations of , while using a lower dimensional parameterization: notice that implicitly we have that . Also, this formulation permits novel regularization techniques. For instance, instead of regularizing , we can regularize and .

A crucial question here is: how should one perform backpropagation for this formulation? It is instructive to compute the actual derivatives to illustrate the structure of the problem. To derive the optimality conditions for this constrained optimization problem, we need first to form the Lagrangian operator, and then we derive the KKT conditions. Here is the Lagrangian:

 (6) L=J(z1)+∫10α(t)⋅(dzdt−f(z(t),θ(t)))dt+∫10β(t)⋅(∂w∂t−q(w;p))dt+∫10γ(t)⋅(θ(t)−∫t0K(t−τ)w(τ)dτ)dt.

Here, , , and are the corresponding adjoint variables (Lagrange multiplier vector functions) for the constraints in Eq. 5. The solution to the optimization problem of Eq. 5 could be found by computing the stationary points of the Lagrangian (the KKT conditions), which are the gradient of with respect to and the adjoints . The variations of with respect to the three adjoint functions just result in the ODE constraints in Eq. 5. The remaining variations of are the most interesting and are given below (please see Appendix E for the derivation):

 (7a) ∂J(z1)∂z1+α1=0,    −∂α∂t−(∂f∂z)Tα(t)=0;(∂Lz) (7b) (7c) (7d) (7e)

To compute the gradients and (Eq. 7d and Eq. 7e), we proceed as follows. Given and , we forward propagate to compute and then . Then, using we can compute the activations .

Afterward, we need to solve the first adjoint equation for using the terminal condition of . Having we can compute the second adjoint variable from Eq. 7b. Lastly, we need to plug in into Eq. 7c to solve for which is the term needed to compute the gradients (by plugging it in in Eq. 7d, Eq. 7e). We use the Discretize-Then-Optimize (DTO) method to find the gradients .

Notice that if we set then we will derive the optimality conditions for the Neural ODE without any dynamics for the model parameters, which was the model presented in . The benefit of our more general framework is that we can encapsulate the time dynamics of the model parameter without increasing the memory footprint of the model. In fact, this approach only requires storing initial conditions for the parameters, which are parameterized by , along with the parameters of the control operator , which are denoted by . As we show in the results section (§3) the latter have negligible memory footprint, but yet allow rich representation of model parameter dynamics.

PDE-inspired formulation. There are several different models one could consider for the , the evolution function for the neural network parameters. One possibility is to use an auxiliary neural network such as the approach used in HyperNetworks . However, this may increase the total number of parameters. Inspired by Turing’s reaction-diffusion-advection partial differential equation models for pattern formation, we view a convolutional filter as a time-varying pattern, where the NN parameters evolve in time . To illustrate this, we consider a PDE-based model for the control operator , as follows:

 (8) dwdt=σ(dΔw+υ⋅∇w+ρw),

where is control diffusion (), controls the advection/transport (), controls the reaction (), and is a nonlinear activation (such as sigmoid or tanh). Here, we are viewing the weights as a time series signal, starting from the initial signal, . This initial condition is then evolved in time to produce . In fact, one can show that this formulation can evolve the initial parameters, , to any arbitrary weights , if there exists a diffeomorphic transformation of between the two distributions (i.e., if there exists a velocity field such is the solution of Eq. 8, with initial condition  ).

Although this operator is mainly used as an example control block (i.e., ANODEV2

is not limited to this model), the RDA operator can capture interesting dynamics for model parameter evolution. For instance, consider a Gaussian operator for a convolutional kernel with unit variance as shown in Figure

3

. A diffusion operator can simulate multiple different Gaussian distributions with different variances in time. This requires storing only a single diffusion parameter (i.e., the diffusion parmaeter

in Eq. 8). Another interesting operator is the advection/transport operator which models species transport. For the Gaussian case, this operator could transport the center of the Gaussian to different positions other than the center, as shown in second row of Figure 3. Finally, the reaction operator could allow growth or decay of the intensity of the convolution filters (third row of Figure 3). The full RDA operator could encapsulate more complex dynamics of the neural network parameters in time. An synthetic example is shown in Figure 3 in the appendix, and a real example ( convolutional kernel of AlexNet) is shown in Figure 1. Figure 1: Illustration of how different convolutional operators are evolved in time during the coupled neural ODE solve (through the evolution operator q in Eq. 5c). The figure corresponds to the first channel of the first convolution kernel parameters of AlexNet. These filters will be applied to activation in different time steps (through the f operator in the coupled formulation in Eq. 5b). This is schematically shown in Figure 2 for three of the filters (the filters are denoted by different shades of brown bars denoted by θ). As the time step increases, the kernel turns out to focus on some specific part on the activation map. Similar illustrations for ResNet-4 and ResNet-10 are shown in Figure 4 and 5 in the appendix.

### 2.2 Two methods used in this paper

The motivation to introduce the auxiliary variable in Eq. 5 is to enable different coupling configurations between model parameters, , and activations, . We use two different coupling configurations of ANODEV2 as described below.

#### Configuration 1

Here, we use multiple time steps to solve for both and in the network, instead of just one time step as in the original ResNet. In this setting, we will alternatively update the value of and according to the following equation (for details, see Appendix B):

 (9) zt0+δt=zt0+δtf(zt0;θt0);   θt0+δt=σ(F−1(exp((−dk2+ikυ+ρ)δt)F(θt0))).

where

is the discretization time scale, and F is Fast Fourier Transform (FFT) operator. where

in the forward solver.

If we use time steps to solve the equation, the computational cost for an ODE block will be roughly times more expensive, compared to that for the original residual block (note the approach presented in  also increases computational cost by the same factor). This network can be viewed as applying different residual blocks in the network but with neural network weights that evolve in time. Note that this configuration does not increase the parameter size of the original network, except for a slight overhead of and .

The configuration 1 is shown in Figure 2 (top), where the model parameters and activations are solved with the same discretization scheme. This is similar to the Neural ODE framework of , except that the model parameters are evolved in time, whereas in  the same model parameters are applied to the activations (and only time horizon is changed). The dynamics of the model parameters are illustrated by different colors used for the convolution kernels in top of Figure 2. This configuration is equivalent to using the Dirac delta function for the function in Eq. 5d.

In short, this configuration allows evolution of both the model parameters and activations, but both have the same time discretization. While this is a simple configuration, but it is not very general, since the model parameters may require more time to evolve and prematurely applying them to input activation may not be optimal. The next configuration relaxes this constraint to addresses this limitation.

#### Configuration 2

ANODEV2 supports different coupling configurations between the dynamics of activations and model parameters. For example, it is possible to not restrict the dynamics of and to align in time, which is the second configuration that we consider. Here, we allow model parameters to evolve and only apply to activations after a fixed number of time steps. For instance, consider the Gaussian example illustrated in Figure 3 in Appendix A. In configuration 1, a residual block is created for each of the three time steps. However, in configuration 2, we only apply the first and last time evolutions of the parameters (i.e., we only use and to apply to activations as shown in Figure 2). This configuration allows sufficient time for the model parameters to evolve, and importantly it limits the depth of the network that activations go through. In this case, the depth of the network is increased by a factor of two, instead of as in configuration 1 (which is the approach used in [8, 9]).

Both configuration 1 and configuration 2 are supported in ANODEV2, and we will present preliminary results for both settings. Figure 2: Illustration of different configurations in ANODEV2. The top figure shows configuration 1, where both the activation z and weights θ are evolved through a coupled system of ODEs. During inference, we solve both of these ODEs forward in time. Blue boxes in the figure represent activation with multiple channels; all the bars (in different shades of brown) represent the convolution kernel evolving in time. The convolution weights θ are computed by solving an auxiliary ODE. That is, at every time step, we solve both z and θ forward concurrently. The bottom figure shows configuration 2, where first the weights are evolved in time before applying them to the activations. Comparing to configuration 1, only the first and last weights are applied.

## 3 Results

In this section, we report the results of ANODEV2 for the two configurations discussed in section 2, on the CIFAR-10 dataset, which consists of 60,000 3232 colour images in 10 classes. The framework is developed as a library in Pytorch, and it uses the checkpointing method proposed in , along with the DTO formulation of the optimality conditions shown in Eq. 7.

We test ANODEV2

on AlexNet with residual connections, as well as two different ResNets. See Appendix

B and Appendix C for the details of training settings and model architectures. We consider the two coupling configurations between the evolution of the activations and model parameters, as discussed next.

### 3.1 Configuration 1

We first start with configuration 1, which is the same as the setting used in [8, 9]. The model parameters and activations are evolved in time concurrently and for each time step, as shown in Figure 2 (top). All the experiments were repeated five times, and we report both the min/max accuracy as well as the average of these five runs. The results are shown in Table 1.

Note that the coupled ODE based approach outperforms the baseline in all three of the statistical properties (i.e., min/max/average accuracy). For example, on ResNet-10 the coupled ODE network achieves 89.04% average test accuracy, as compared to 88.10% of baseline, which is 0.94% better. Meanwhile, a noticeable observation is that the minimum performance of the coupled ODE based network is comparable or even better than the maximum performance of baseline. The coupled ODE based AlexNet has 88.59% minimum accuracy which is 1.44% higher than the best performance of baseline out of five runs. Hence, the generalization performances of the coupled ODE based network are consistently better than those of the baseline. It is important to note that the model parameter size of the coupled ODE approach in ANODEV2 is the same as that of the baseline. This is because the size of the evolution parameters in Eq. 5 is negligible (please see Table 3).

The dynamics of how the neural network parameters evolve in this configuration are illustrated in Figure 1, where we extract the first convolution of AlexNet and show how it evolves in time. Here, represents how long evolves in time, i.e., shows the result of and shows the result of . It can be clearly seen that the coupled ODE based method encapsulates more complex dynamics of in time. Similar illustrations for ResNet-4 and ResNet-10 are shown in Figure 4 and 5 in Appendix C.

### 3.2 Configuration 2

Here, we test configuration 2, where the evolution of the parameters and the activations could have different time steps. This means the parameter is applied only after a certain number of time steps of evolution, i.e., not at every time step, which was the case in configuration 1. This effectively reduces the network depth and the computational cost, and it allows sufficient time for the neurons to be evolved, instead of naively applying them at each time step. An illustration for this configuration is shown in Figure

2 (bottom). The results on AlexNet, ResNet4, and ResNet10 are shown in Table 2, where we again report the min/max/average accuracy over five runs. As in the previous setting (configuration 1), here in configuration 2, the coupled ODE based network performs better in all cases. The minimum performance of the coupled ODE based network still is comparable or even better than the maximum performance of the baseline. Although the overall performance of this configuration 2 is slightly worse than that of configuration 1, the computational cost is much less, due to the smaller effective depth of the network that the activations go through.

## 4 Ablation Study

In this section, we compare ANODEV2 to models with the same number of parameters. In particular, we compare with the Neural ODE approach of . As mentioned before, the approach used in this paper to save memory results in numerical instability, and to allow for fair comparison, we also compare with ANODE presented in , which addresses the instability problem. Precisely, we use two time steps for the activation ODE (Eq. 5b) and ten time steps for the evolution of the model parameters (Eq. 5c). The results are shown in Table 4.

As one can see, there is indeed benefit in allowing the model parameters to evolve in time. This is not surprising, since it gives more flexibility to the neural network to evolve the model parameters. Also note that the performance of the Neural ODE approach used in  is significantly worse than ANODE and ANODEV2. The results remained the same despite hyper-parameter tuning. However, this is expected as the Neural ODE method in  may result in incorrect gradient information . However, even addressing this instability with ANODE results in suboptimal performance as compared to ANODEV2. Also note that evolving model parameters has a negligible computational cost, since we can actually use analytical solutions for solving the RDA, which is discussed in Appendix B.

We also present the parameter sizes of the both configurations of ANODEV2 as well as the Neural ODE and ANODE models tested above. Table 3 summarizes all the results. It can be clearly seen that the model sizes of both configurations are roughly the same as those of the baseline models. In fact, configuration 1 grows the parameter sizes of AlexNet, ResNet-4, and ResNet-10 by only to , as compared to those of baseline models. In configuration 2, the parameter size increases from to compared to baseline model. (Note that we even count the additional batch norm parameters for fair comparison.)

## 5 Conclusions

The connection between residual networks and ODEs has been discussed in recent work. Here, motivated by work in neural evolution, we propose ANODEV2, which is a more general extension of this approach, obtained by introducing a coupled ODE based framework. The framework allows dynamical evolution of both the residual parameters as well as the activations in a coupled ODE formulation. This provides more flexibility to the neural network to adjust the parameters to achieve better generalization performance. We derived the optimality conditions for this coupled formulation and presented preliminary empirical results using two different configurations, and we showed that we can indeed train such models using our differential framework. The results on three Neural Networks (AlexNet, ResNet-4, and ResNet-10) all show accuracy gains across five different trials. In fact, the worst accuracy of the coupled ODE formulation was better than the best performance of the baseline. This is achieved with negligible change in the model parameter size. To the best of the our knowledge, this is the first coupled ODE formulation that allows for the evolution of the model parameters in time along with the activations. We are working on extending the framework for other learning tasks. The source code will be released as open source software to the public.

## Acknowledgments

This work was supported by a gracious fund from Intel corporation, Berkeley Deep Drive (BDD), and Berkeley AI Research (BAIR) sponsors. We would like to thank the Intel VLAB team for providing us with access to their computing cluster. We also gratefully acknowledge the support of NVIDIA Corporation for their donation of two Titan Xp GPU used for this research. MWM would also like to acknowledge ARO, DARPA, NSF, ONR, and Intel for providing partial support of this work.

## References

•  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in

Proceedings of the IEEE conference on computer vision and pattern recognition

, pp. 770–778, 2016.
•  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision, pp. 630–645, Springer, 2016.
• 

E. Weinan, “A proposal on machine learning via dynamical systems,”

Communications in Mathematics and Statistics, vol. 5, no. 1, pp. 1–11, 2017.
•  E. Haber and L. Ruthotto, “Stable architectures for deep neural networks,” Inverse Problems, vol. 34, no. 1, p. 014004, 2017.
•  L. Ruthotto and E. Haber, “Deep neural networks motivated by partial differential equations,” arXiv preprint arXiv:1804.04272, 2018.
•  Y. Lu, A. Zhong, Q. Li, and B. Dong, “Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations,” in International Conference on Machine Learning, pp. 3282–3291, 2018.
•  M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez, “NAIS-Net: stable deep networks from non-autonomous differential equations,” in Advances in Neural Information Processing Systems, pp. 3025–3035, 2018.
•  T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.
•  A. Gholami, K. Keutzer, and G. Biros, “ANODE: Unconditionally accurate memory-efficient gradients for neural odes,” 2019 International Joint Conference on Artificial Intelligence, (arXiv:1902.10298), 2019.
•  A. Lindenmayer, “Mathematical models for cellular interactions in development i. filaments with one-sided inputs,” Journal of theoretical biology, vol. 18, no. 3, pp. 280–299, 1968.
•  A. M. Turing, “The chemical basis of morphogenesis,” Bulletin of mathematical biology, vol. 52, no. 1-2, pp. 153–197, 1990.
•  R. K. Belew and T. E. Kammeyer, “Evolving aesthetic sorting networks using developmental grammars.,” in ICGA, p. 629, Citeseer, 1993.
•  P. Bentley and S. Kumar, “Three ways to grow designs: A comparison of embryogenies for an evolutionary design problem,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation-Volume 1, pp. 35–43, Morgan Kaufmann Publishers Inc., 1999.
•  F. Dellaert and R. D. Beer, “A developmental model for the evolution of complete autonomous agents,” in Proceedings of the fourth international conference on simulation of adaptive behavior, pp. 393–401, MIT Press Cambridge, MA, 1996.
•  P. Eggenberger, “Evolving morphologies of simulated 3d organisms based on differential gene expression,” in Proceedings of the fourth european conference on Artificial Life, pp. 205–213, 1997.
•  G. S. Hornby and J. B. Pollack, “Creating high-level components with a generative representation for body-brain evolution,” Artificial life, vol. 8, no. 3, pp. 223–246, 2002.
•  K. O. Stanley, D. B. D’Ambrosio, and J. Gauci, “A hypercube-based encoding for evolving large-scale neural networks,” Artificial life, vol. 15, no. 2, pp. 185–212, 2009.
•  K. O. Stanley, “Exploiting regularity without development,” in Proceedings of the AAAI Fall Symposium on Developmental Systems, p. 37, AAAI Press Menlo Park, CA, 2006.
•  K. O. Stanley, “Compositional pattern producing networks: A novel abstraction of development,” Genetic programming and evolvable machines, vol. 8, no. 2, pp. 131–162, 2007.
•  J. Koutnik, F. Gomez, and J. Schmidhuber, “Evolving neural networks in compressed weight space,” in Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 619–626, ACM, 2010.
•  C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra, “Convolution by evolution: Differentiable pattern producing networks,” in Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 109–116, ACM, 2016.
•  J. Schmidhuber, “Learning to control fast-weight memories: An alternative to dynamic recurrent networks,” Neural Computation, vol. 4, no. 1, pp. 131–139, 1992.
•  J. Schmidhuber, “A ‘self-referential’weight matrix,” in International Conference on Artificial Neural Networks, pp. 446–450, Springer, 1993.
•  D. Ha, A. Dai, and Q. Le, “HyperNetworks,” in International Conference on Learning Representations, 2017.
•  L. Younes, Shapes and diffeomorphisms, vol. 171. Springer Science & Business Media, 2010.

## Appendix A Reaction-diffusion-advection (RDA) Simulation

In this section, we provide details for the reaction-diffusion-advection (RDA) solver as well as an exemplary simulation, shown in Figure 3. For an illustration of the idea, we set the initial distribution of to be a unit Gaussian centered in the middle of the figure. In the first row, we show how this single modal Gaussian changes in time when only a diffusion operator is used in the control operator. As shown in the first row of the figure, the diffusion operator allows the parameters to evolve from a Gaussian with unit variance to a Gaussian with higher variance. In the second row, we show how this single modal Gaussian changes with an advection operator. Notice how the advection operator allows modeling of different filters centered at different locations with the same variance (since advection operator does not diffuse filters but transports them). In the third row, we show how this single modal Gaussian changes when we only use an exponential growth operator for the reaction part. Notice how this operator could allow the kernel to increase/decrease its intensities at different pixels in time. Finally, in the last row, we show a more complex example, where we use all three operators together. Figure 3: Illustration of how different convolution maps could be encoded through the parameter PDE solver. Here, we show an exemplary convolution at time t=0 (left images), as well as its evolution through time, when we apply the RDA PDE with different settings for the model parameters. Note that with this PDE based encoding, we only need to store the initial condition for the parameters (i.e., t=0). The rest of the model parameters could be computed using this initial condition.

## Appendix B Numerical Method

In this section, we provide more details on our numerical techniques. For simplicity, we set to be a Dirac delta function, and we use the above RDA function for . In this case, we have

 (10) {dzdt=f(z;θ),dwdt=σ(dΔw+υ⋅∇w+ρw).

Here, we discuss how can we solve the ODE system, given in Eq. 10. For the evolution of , we follow  and use forward Euler method to solve . For example, if we set time step to be 2, then

 z1/2=z0+12f(z0;θ0);   z1=z1/2+12f(z1/2;θ1/2).

It is not hard to see that if , then the output is the same as the original ResNet. For the evolution of without non-linearity, i.e., when is the identity map, there exists an analytic solution in the frequency domain. Applying the Fast Fourier Transform (FFT) from Eq. 8, we will get:

 (11) F(w)t=F(dΔw+υ⋅∇w+ρw),

where denotes FFT operator. Since the diffusion, advection, and reaction coefficients are constant, we can find the analytical solution in the frequency domain. That is:

 (12) wt0+δt=F−1(exp(−δtdk2+ikδtυ+δtρ)F(wt0)),

where is inverse FFT. Note that due to the existence of this analytical solution, the computational cost of solving the evolution for becomes negligible, which is an important benefit of this approach.

When a non-linearity is applied, we use an approximation to solve Eq. 8, obtaining

 (13) wt0+δt=σ(F−1(exp(−δtdk2+ikδtυ+δtρ)F(wt0))).

This means we first apply FFT and its inverse to solve the linear system, and then we apply the non-linear function . Here, means the time scale to compute . In this paper, we set the non-linearity function to be tanh, but other non-linearities could also be used.

For configuration 1, we use . For configuration 2, we use to solve and to solve . In this case, the FLOPS will be only that of the original baseline network. Upon this condition, the process can be formulated as,

 z1/2=z0+12f(z0;θ0);   z1=z1/2+12f(z1/2;θ1);

where is generated with .

## Appendix C Model Configuration

In this section, we provide the architecture we used for the tests in section 3. The AlexNet, ResNet-4, and ResNet-10 we are using are described in following sections.

### c.1 AlexNet

We used a 2-layer convolution with residual connection added to the second convolution. Thus, we can transform the second convolution into an ODE. Table 5

explains detailed structure layer by layer. For simplicity, we omit the batch normalization and ReLU layer added after each convolution.

#### Training details

We train AlexNet for 120 epochs with initial learning rate 0.1. The learning rate decays by a factor of 10 at epoch 40, 80 and 100. Data augmentation is implemented. Also, the batch size used for training is 256. Note that the setting is the same for all experiments, i.e., baseline, Neural ODE, and

ANODEV2.

### c.2 ResNet-4 and ResNet-10

Here, we provide the architecture of ResNet-4 and ResNet-10 used section 3. We omit the batch normalization and ReLU, for simplicity. Detailed structure is provided in Table 6.

#### Training details

We train ResNet-4/10 for 350 epochs with initial learning rate 0.1. The learning rate decays by a factor of 10 at epoch 150, and 300. Data augmentation is implemented. Also, the batch size used for training is 256. Note that the setting is the same for all experiments, i.e., baseline, Neural ODE, and ANODEV2.

## Appendix D Convolution kernel Evolution Example

In this section, we show some examples of how the model parameters are evolved in time. Results for ResNet-4 and ResNet-10 are shown in Figure 4 and Figure 5, respectively. Figure 4: Illustration of how different convolutional operators are evolved in time during the coupled neural ODE solve (through the evolution operator q in Eq. 5c). The figure corresponds to the first channel of the first convolution kernel parameters of ResNet-4. These filters will be applied to activation in different time steps (through the f operator in the coupled formulation in Eq. 5b). This is schematically shown in Figure 2 for three of the filters (the filters are denoted by different shades of brown bars denoted by θ). Similar pattern can be observed as Figure 1. Figure 5: Illustration of how different convolutional operators are evolved in time during the coupled neural ODE solve (through the evolution operator q in Eq. 5c). The figure corresponds to the first channel of the first convolution kernel parameters of ResNet-10. These filters will be applied to activation in different time steps (through the f operator in the coupled formulation in Eq. 5b). This is schematically shown in Figure 2 for three of the filters (the filters are denoted by different shades of brown bars bars denoted by θ). Similar pattern can be observed as Figure 1.

## Appendix E Derivation of Optimality Conditions

In this section, we present a detailed derivation of the optimality conditions corresponding to Eq. 5. We need to find the so-called KKT conditions, which can be found by finding stationary points of the corresponding Lagrangian, defined as:

 (14) L=J(z1)+∫10α(t)⋅(dzdt−f(z(t),θ(t)))dt+∫10β(t)⋅(∂w∂t−q(w,p))dt+∫10γ(t)⋅(θ(t)−∫t0K(t−τ)w(τ)dτ)dt

In order to derive the optimality conditions, we first take variations with respect to , , and . This basically results in the “Activation ODE”, the “Evolution ODE”, and the relation between and , shown in Eq. 5. Taking variations with respect to will result in a backward-in-time ODE for the , which is continuous equivalent to backpropagation (i.e., Optimize-Then-Discretize). Taking variations with respect to will result in an algebraic relation between and . Taking variations with respect to will be split in two parts: variations with respect to for ; and variations with respect to , which is in fact one of our unknown parameters. The split is performed by first integrating by parts the term to expose a term that reads , and then taking variations with respect to . Finally, we also need to take variations with respect to the vector . An important technical detail is that to take the variations of the with respect to can be done easily by converting it to . The details are given below.

In order to satisfy the first optimality condition on we have:

 (∂L∂z)T^z=0,

where this equality must hold for any variation in space and time. We have:

 (15) (∂L∂z)T^z=(∂J(z1)∂z1)T^z1+αT1^z1+∫10(−∂α∂t−∂f(z,θ)∂zTα)T^zdt=0.

Imposing this condition holds for all variation will result in the first adjoint equation as follows:

 (16) ∂J(z1)∂z1+α1=0,    −∂α∂t−(∂f∂z)Tα=0.

For , the following equation needs to be satisfied:

 (∂L∂θ)T^θ=0.

We have

 (17) (∂L∂θ)T^θ=∫10(−∂f(z,θ)∂θ)TαT^θdt+∫10γT^θdt.

This further implies:

 (18) −∂f(z,θ)∂θTα+γ=0.

Finally, the inversion equation on could be found by performing variation on :

 (∂L∂w)T^w=0.

We have

 (19) (∂L∂w)T^w=∫10−(∂β∂t−∂q(w;p)∂wTβ)T^wdt+βT1^w1+∫10∫t0−H(τ−t)KT(t−τ)γdτ^wdt=∫10−(∂β∂t−∂q(w;p)∂wTβ)T^wdt+βT1^w1+∫10∫t0−H(τ−t)KT(t−τ)γdτ^wdt,

where is the scalar Heaviside function. Imposing that this condition holds for all variation will result in the inversion equation as follows,

 (20) −∂β∂t−∂q(w;p)∂wTβ+∫1t−KT(t−τ)γdτ,   β1=0.

The gradient of with respect to can be computed as,

 (21) gw0=∂L∂w0=∂R(w0,p)∂w0−β0.

Finally, the gradient of with respect to can be computed as,

 (22) gp=∂L∂p=∂R(w0,p)∂p−∫10∂q(w,p)∂pTβ(t)dt.

Note that if optimality conditions are achieved with respect to and , then

 (23) gw0=0,   gp=0.