1 Introduction
Residual networks [1, 2] have enabled training of very deep neural networks (DNNs). Recent work has shown an interesting connection between residual blocks and Ordinary Differential Equations (ODEs), showing that a residual network can be viewed as a discretization of a continuous ODE operator [3, 4, 5, 6, 7, 8]. These formulations are commonly called Neural ODEs, and here we follow the same convention. Neural ODEs provide a general framework that connects discrete DNNs to continuous dynamical systems theory, as well as discretization and optimal control of ODEs, all subjects with very rich theory.
A basic Neural ODE formulation and its connection to residual networks (for a single block in a network) is the following:
(1a)  
(1b)  
(1c) 
Here, is the input to the network and is the output activation;
is the vector of network weights (independent of time); and
is the nonlinear operator defined by this block. (Here, we have written the ODE in terms of its solution at .) We can see that a singlestep of forward Euler discretization of the ODE is identical to a traditional residual block. Alternatively, we could use a different timestepping scheme or, more interestingly, use more time steps. Once the connection to ODEs was identified, several groups have incorporated the Neural ODE structure in neural networks and evaluated its performance on several different learning tasks.A major challenge with training Neural ODEs is that backpropogating through ODE layers requires storage of all the intermediate activations (i.e. ) in time. In principle, the memory footprint of ODE layers has a cost of (where is the number of time steps to solve the ODE layer), which is prohibitive. The recent work of [8] proposed an adjoint based method to address this, with a training strategy that required only storage of the activation at the end of the ODE layer. All the intermediate activations were then “recomputed” by solving the ODE layers backwards. However, it has been recently shown that such an approach could lead to incorrect gradients, due both to numerical instability of backward ODE solve, and also to inconsistencies that relate to optimizing infinite dimensional operators [9]. This is basically the wellknown DiscretizeThenOptimize (DTO) versus OptimizeThenDiscretize issue. The checkpointingbased method ANODE [9] was proposed to solve the incorrect gradient problem of Neural ODE. More importantly, it was observed that using other discretization schemes such as RK2 or RK4, or using more time steps, does not affect the generalization performance of Neural ODEs as compared to baseline networks, despite the common belief [9]. This is a very important challenge, as lack of any performance gain obviates the need for using Neural ODEs. In this paper, building on the latter approach of [9], we propose ANODEV2, a more general Neural ODE framework that addresses this problem. The key idea of ANODEV2 is that it allows the evolution of both weights and activations with a coupled system of ODEs:
(2) 
Here, is a nonlinear operator (essentially controlling the dynamics of the network parameters in time); and are the corresponding parameters for the weight network. Our approach allows to be time dependent: is parameterized by the learnable dynamics of . This, in turn, is parameterized by and . In other words, instead of optimizing for a constant , we optimize for and . During inference, both weights and activations are forwardpropagated in time by solving Eq. 2. Observe that if we set then we recover the Neural ODE approach proposed by [3, 4, 5, 6, 7, 8]. Eq. 2 replaces the problem of designing appropriate neural network blocks () with the problem of choosing appropriate function () in an ODE to model the changes of parameter (the weight network).
In summary, our main contributions are the following.

We provide a general framework that extends Neural ODEs to a system of coupled ODEs which allows the coupled evolution of both model parameters and activations. This coupled formulation addresses the challenge with Neural ODEs, in that using more time steps or different discretization schemes do not affect model’s generalization performance [9].

We derive the optimality conditions for how backpropagation should be performed for the coupled ODE formulation, using the socalled Karush–Kuhn–Tucker (KKT) conditions. In particular, we implement the corresponding DTO approach, along with a checkpointing scheme presented in
[9]. 
We test the framework using multiple different residual models on CIFAR10 by considering different coupled formulations. In particular, we show examples illustrating how a biophysically motivated reactiondiffusionadvection (RDA) ODE could be used to model the evolution of the neural network parameters.
Our work fits into a rich literature on neural evolution research [10, 11, 12, 13, 14, 15, 16]. For example, several approaches similar to ANODEV2
have been taken in the line of evolutionary computing, where an auxiliary “child” network is used to generate the parameters for a “parent” network. This approach permits the restriction of the effective depth that the activations must go through, since the parent network could have smaller weight space than the child network. One example is HyperNEAT
[17], which uses “Compositional Pattern Producing Networks” (CPRNs) to evolve the model parameters [18, 19]. A similar approach using “Compressed Weight Search” was proposed in [20]. A follow up work extended this approach by using differentiable CPRNs [21]. The authors show that neural network parameters could be encoded through a fully connected architecture. Another seminal work in this direction is [22, 23], where an auxiliary network learns to produce “contextaware” weights in a recurrent neural network model.
A similar recent approach is taken in Hypernetworks [24], where model parameters are evolved through an auxiliary learnable neural network. This approach is a special case of ANODEV2, which could be derived by using a single time step discretization of Eq. 2, with a neural network for the evolution operator (denoted by and introduced in the next section).
Our framework is a generalization of these evolutionary algorithms, and it provides more flexibility for modeling the evolution of the model parameters in time. For instance, we will show how RDA operators could be used for the evolution operator
, with negligible increase in the model parameter size.2 Methodology
In this section, we discuss the formulation for the coupled ODEbased neural network model described above, and we derive the corresponding optimality conditions. For a typical learning problem, the goal is to minimize the empirical risk over a set of training examples. Given a loss function
, we seek to find weights, , such that:(3) 
where is a regularization operator, is the training sample and its label, and the number of training samples. The loss function depends implicitly on through the network activation vector
. This problem is typically solved using Stochastic Gradient Descent (SGD) and backpropagation to compute the gradient of
with respect to .2.1 Neural ODE
Consider the following notation for a residual block: , where is the input activation,
is the neural network kernel (e.g., comprising a series of convolutional blocks with nonlinear or linear activation functions), and
is the output activation. As discussed above, an alternative view of a residual network is the following continuoustime formulation: , with (we will use both and to denote activation at time ). In the ODEbased formulation, this neural network has a continuous depth. In this case, we need to solve the following constrained optimization problem (Neural ODE):(4) 
Note that in this formulation the neural network parameters do not change with respect to time. In fact, it has been observed that using adaptive time stepping or higher order discretization methods such as RungeKutta does not result in any gains in generalization performance using the above framework [9]. To address this, we extend the Neural ODEs by considering a system of coupled ODEs, where both the model parameters and the activations evolve in time. This formulation is slightly more general than what we described in the introduction. We introduce an auxiliary dynamical system for , which we use to define . This allows for more general evolution of the model parameters and activations, which will be discussed in §2.2. In particular, we propose the following formulation:
(5a)  
(5b)  
(5c)  
(5d) 
Note that here is a function of time, and it is parameterized by the whole dynamics of and a time convolution kernel (which in the simplest form could be a Dirac delta function so that ). Also,
can be a general function, e.g., another neural network, a linear operator, or even a discretized Partial Differential Equation (PDE) based operator. The latter is useful if we consider the
as a function , where parameterizes the signal space (e.g., 2D pixel space for images). This formulation allows for rich variations of , while using a lower dimensional parameterization: notice that implicitly we have that . Also, this formulation permits novel regularization techniques. For instance, instead of regularizing , we can regularize and .A crucial question here is: how should one perform backpropagation for this formulation? It is instructive to compute the actual derivatives to illustrate the structure of the problem. To derive the optimality conditions for this constrained optimization problem, we need first to form the Lagrangian operator, and then we derive the KKT conditions. Here is the Lagrangian:
(6) 
Here, , , and are the corresponding adjoint variables (Lagrange multiplier vector functions) for the constraints in Eq. 5. The solution to the optimization problem of Eq. 5 could be found by computing the stationary points of the Lagrangian (the KKT conditions), which are the gradient of with respect to and the adjoints . The variations of with respect to the three adjoint functions just result in the ODE constraints in Eq. 5. The remaining variations of are the most interesting and are given below (please see Appendix E for the derivation):
(7a)  
(7b)  
(7c)  
(7d)  
(7e) 
To compute the gradients and (Eq. 7d and Eq. 7e), we proceed as follows. Given and , we forward propagate to compute and then . Then, using we can compute the activations .
Afterward, we need to solve the first adjoint equation for using the terminal condition of . Having we can compute the second adjoint variable from Eq. 7b. Lastly, we need to plug in into Eq. 7c to solve for which is the term needed to compute the gradients (by plugging it in in Eq. 7d, Eq. 7e). We use the DiscretizeThenOptimize (DTO) method to find the gradients [9].
Notice that if we set then we will derive the optimality conditions for the Neural ODE without any dynamics for the model parameters, which was the model presented in [8]. The benefit of our more general framework is that we can encapsulate the time dynamics of the model parameter without increasing the memory footprint of the model. In fact, this approach only requires storing initial conditions for the parameters, which are parameterized by , along with the parameters of the control operator , which are denoted by . As we show in the results section (§3) the latter have negligible memory footprint, but yet allow rich representation of model parameter dynamics.
PDEinspired formulation. There are several different models one could consider for the , the evolution function for the neural network parameters. One possibility is to use an auxiliary neural network such as the approach used in HyperNetworks [24]. However, this may increase the total number of parameters. Inspired by Turing’s reactiondiffusionadvection partial differential equation models for pattern formation, we view a convolutional filter as a timevarying pattern, where the NN parameters evolve in time [11]. To illustrate this, we consider a PDEbased model for the control operator , as follows:
(8) 
where is control diffusion (), controls the advection/transport (), controls the reaction (), and is a nonlinear activation (such as sigmoid or tanh). Here, we are viewing the weights as a time series signal, starting from the initial signal, . This initial condition is then evolved in time to produce . In fact, one can show that this formulation can evolve the initial parameters, , to any arbitrary weights , if there exists a diffeomorphic transformation of between the two distributions (i.e., if there exists a velocity field such is the solution of Eq. 8, with initial condition [25]).
Although this operator is mainly used as an example control block (i.e., ANODEV2
is not limited to this model), the RDA operator can capture interesting dynamics for model parameter evolution. For instance, consider a Gaussian operator for a convolutional kernel with unit variance as shown in Figure
3. A diffusion operator can simulate multiple different Gaussian distributions with different variances in time. This requires storing only a single diffusion parameter (i.e., the diffusion parmaeter
in Eq. 8). Another interesting operator is the advection/transport operator which models species transport. For the Gaussian case, this operator could transport the center of the Gaussian to different positions other than the center, as shown in second row of Figure 3. Finally, the reaction operator could allow growth or decay of the intensity of the convolution filters (third row of Figure 3). The full RDA operator could encapsulate more complex dynamics of the neural network parameters in time. An synthetic example is shown in Figure 3 in the appendix, and a real example ( convolutional kernel of AlexNet) is shown in Figure 1.2.2 Two methods used in this paper
The motivation to introduce the auxiliary variable in Eq. 5 is to enable different coupling configurations between model parameters, , and activations, . We use two different coupling configurations of ANODEV2 as described below.
Configuration 1
Here, we use multiple time steps to solve for both and in the network, instead of just one time step as in the original ResNet. In this setting, we will alternatively update the value of and according to the following equation (for details, see Appendix B):
(9) 
where
is the discretization time scale, and F is Fast Fourier Transform (FFT) operator. where
in the forward solver.If we use time steps to solve the equation, the computational cost for an ODE block will be roughly times more expensive, compared to that for the original residual block (note the approach presented in [8] also increases computational cost by the same factor). This network can be viewed as applying different residual blocks in the network but with neural network weights that evolve in time. Note that this configuration does not increase the parameter size of the original network, except for a slight overhead of and .
The configuration 1 is shown in Figure 2 (top), where the model parameters and activations are solved with the same discretization scheme. This is similar to the Neural ODE framework of [8], except that the model parameters are evolved in time, whereas in [8] the same model parameters are applied to the activations (and only time horizon is changed). The dynamics of the model parameters are illustrated by different colors used for the convolution kernels in top of Figure 2. This configuration is equivalent to using the Dirac delta function for the function in Eq. 5d.
In short, this configuration allows evolution of both the model parameters and activations, but both have the same time discretization. While this is a simple configuration, but it is not very general, since the model parameters may require more time to evolve and prematurely applying them to input activation may not be optimal. The next configuration relaxes this constraint to addresses this limitation.
Configuration 2
ANODEV2 supports different coupling configurations between the dynamics of activations and model parameters. For example, it is possible to not restrict the dynamics of and to align in time, which is the second configuration that we consider. Here, we allow model parameters to evolve and only apply to activations after a fixed number of time steps. For instance, consider the Gaussian example illustrated in Figure 3 in Appendix A. In configuration 1, a residual block is created for each of the three time steps. However, in configuration 2, we only apply the first and last time evolutions of the parameters (i.e., we only use and to apply to activations as shown in Figure 2). This configuration allows sufficient time for the model parameters to evolve, and importantly it limits the depth of the network that activations go through. In this case, the depth of the network is increased by a factor of two, instead of as in configuration 1 (which is the approach used in [8, 9]).
Both configuration 1 and configuration 2 are supported in ANODEV2, and we will present preliminary results for both settings.
3 Results
In this section, we report the results of ANODEV2 for the two configurations discussed in section 2, on the CIFAR10 dataset, which consists of 60,000 3232 colour images in 10 classes. The framework is developed as a library in Pytorch, and it uses the checkpointing method proposed in [9], along with the DTO formulation of the optimality conditions shown in Eq. 7.
We test ANODEV2
on AlexNet with residual connections, as well as two different ResNets. See Appendix
B and Appendix C for the details of training settings and model architectures. We consider the two coupling configurations between the evolution of the activations and model parameters, as discussed next.3.1 Configuration 1
We first start with configuration 1, which is the same as the setting used in [8, 9]. The model parameters and activations are evolved in time concurrently and for each time step, as shown in Figure 2 (top). All the experiments were repeated five times, and we report both the min/max accuracy as well as the average of these five runs. The results are shown in Table 1.
Note that the coupled ODE based approach outperforms the baseline in all three of the statistical properties (i.e., min/max/average accuracy). For example, on ResNet10 the coupled ODE network achieves 89.04% average test accuracy, as compared to 88.10% of baseline, which is 0.94% better. Meanwhile, a noticeable observation is that the minimum performance of the coupled ODE based network is comparable or even better than the maximum performance of baseline. The coupled ODE based AlexNet has 88.59% minimum accuracy which is 1.44% higher than the best performance of baseline out of five runs. Hence, the generalization performances of the coupled ODE based network are consistently better than those of the baseline. It is important to note that the model parameter size of the coupled ODE approach in ANODEV2 is the same as that of the baseline. This is because the size of the evolution parameters in Eq. 5 is negligible (please see Table 3).
AlexNet  ResNet4  ResNet10  
Min / Max  Avg  Min / Max  Avg  Min / Max  Avg  
Baseline  86.84% / 87.15%  87.03%  76.47% / 77.35%  76.95%  87.79% / 88.52%  88.10% 
ANODEV2  88.59% / 88.96%  88.78%  77.27% / 78.58%  78.11%  88.67% / 89.39%  89.04% 
Imp.  1.75% / 1.81%  1.75%  0.80% / 1.23%  1.16%  0.88% / 0.87%  0.94% 
The dynamics of how the neural network parameters evolve in this configuration are illustrated in Figure 1, where we extract the first convolution of AlexNet and show how it evolves in time. Here, represents how long evolves in time, i.e., shows the result of and shows the result of . It can be clearly seen that the coupled ODE based method encapsulates more complex dynamics of in time. Similar illustrations for ResNet4 and ResNet10 are shown in Figure 4 and 5 in Appendix C.
3.2 Configuration 2
Here, we test configuration 2, where the evolution of the parameters and the activations could have different time steps. This means the parameter is applied only after a certain number of time steps of evolution, i.e., not at every time step, which was the case in configuration 1. This effectively reduces the network depth and the computational cost, and it allows sufficient time for the neurons to be evolved, instead of naively applying them at each time step. An illustration for this configuration is shown in Figure
2 (bottom). The results on AlexNet, ResNet4, and ResNet10 are shown in Table 2, where we again report the min/max/average accuracy over five runs. As in the previous setting (configuration 1), here in configuration 2, the coupled ODE based network performs better in all cases. The minimum performance of the coupled ODE based network still is comparable or even better than the maximum performance of the baseline. Although the overall performance of this configuration 2 is slightly worse than that of configuration 1, the computational cost is much less, due to the smaller effective depth of the network that the activations go through.AlexNet  ResNet4  ResNet10  
Min / Max  Avg  Min / Max  Avg  Min / Max  Avg  
Baseline  86.84% / 87.15%  87.03%  76.47% / 77.35%  76.95%  87.79% / 88.52%  88.10% 
ANODEV2  88.1% / 88.33%  88.26%  77.23% / 78.28%  77.73%  88.65% / 89.19%  88.93% 
Imp.  1.26% / 1.18%  1.23%  0.76% / 0.93%  0.78%  0.86% / 0.67%  0.83% 
4 Ablation Study
In this section, we compare ANODEV2 to models with the same number of parameters. In particular, we compare with the Neural ODE approach of [8]. As mentioned before, the approach used in this paper to save memory results in numerical instability, and to allow for fair comparison, we also compare with ANODE presented in [9], which addresses the instability problem. Precisely, we use two time steps for the activation ODE (Eq. 5b) and ten time steps for the evolution of the model parameters (Eq. 5c). The results are shown in Table 4.
As one can see, there is indeed benefit in allowing the model parameters to evolve in time. This is not surprising, since it gives more flexibility to the neural network to evolve the model parameters. Also note that the performance of the Neural ODE approach used in [8] is significantly worse than ANODE and ANODEV2. The results remained the same despite hyperparameter tuning. However, this is expected as the Neural ODE method in [8] may result in incorrect gradient information [9]. However, even addressing this instability with ANODE results in suboptimal performance as compared to ANODEV2. Also note that evolving model parameters has a negligible computational cost, since we can actually use analytical solutions for solving the RDA, which is discussed in Appendix B.
We also present the parameter sizes of the both configurations of ANODEV2 as well as the Neural ODE and ANODE models tested above. Table 3 summarizes all the results. It can be clearly seen that the model sizes of both configurations are roughly the same as those of the baseline models. In fact, configuration 1 grows the parameter sizes of AlexNet, ResNet4, and ResNet10 by only to , as compared to those of baseline models. In configuration 2, the parameter size increases from to compared to baseline model. (Note that we even count the additional batch norm parameters for fair comparison.)
AlexNet  ResNet4  ResNet10  
Baseline  1756.68K  7.71K  44.19K 
Neural ODE [8]  1757.13K  7.96K  44.95K 
ANODE [9]  1757.13K  7.96K  44.95K 
ANODEV2 conf. 1  1757.51K  8.23K  45.77K 
ANODEV2 conf. 2  1757.13K  7.99K  45.05K 
AlexNet  ResNet4  ResNet10  
Min / Max  Avg  Min / Max  Avg  Min / Max  Avg  
Baseline  86.84% / 87.15%  87.03%  76.47% / 77.35%  76.95%  87.79% / 88.52%  88.10% 
NeuralODE [8]  74.54% / 76.78%  75.67%  44.73% / 49.91%  47.37%  64.7% / 70.06%  67.94% 
ANODE [9]  87.86% / 88.14%  88.02%  76.92% / 77.45%  77.30%  88.48% / 88.75%  88.60% 
ANODEV2 Conf. 2  88.1% / 88.33%  88.26%  77.23% / 78.28%  77.73%  88.65% / 89.19%  88.93% 
5 Conclusions
The connection between residual networks and ODEs has been discussed in recent work. Here, motivated by work in neural evolution, we propose ANODEV2, which is a more general extension of this approach, obtained by introducing a coupled ODE based framework. The framework allows dynamical evolution of both the residual parameters as well as the activations in a coupled ODE formulation. This provides more flexibility to the neural network to adjust the parameters to achieve better generalization performance. We derived the optimality conditions for this coupled formulation and presented preliminary empirical results using two different configurations, and we showed that we can indeed train such models using our differential framework. The results on three Neural Networks (AlexNet, ResNet4, and ResNet10) all show accuracy gains across five different trials. In fact, the worst accuracy of the coupled ODE formulation was better than the best performance of the baseline. This is achieved with negligible change in the model parameter size. To the best of the our knowledge, this is the first coupled ODE formulation that allows for the evolution of the model parameters in time along with the activations. We are working on extending the framework for other learning tasks. The source code will be released as open source software to the public.
Acknowledgments
This work was supported by a gracious fund from Intel corporation, Berkeley Deep Drive (BDD), and Berkeley AI Research (BAIR) sponsors. We would like to thank the Intel VLAB team for providing us with access to their computing cluster. We also gratefully acknowledge the support of NVIDIA Corporation for their donation of two Titan Xp GPU used for this research. MWM would also like to acknowledge ARO, DARPA, NSF, ONR, and Intel for providing partial support of this work.
References

[1]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in
Proceedings of the IEEE conference on computer vision and pattern recognition
, pp. 770–778, 2016.  [2] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision, pp. 630–645, Springer, 2016.

[3]
E. Weinan, “A proposal on machine learning via dynamical systems,”
Communications in Mathematics and Statistics, vol. 5, no. 1, pp. 1–11, 2017.  [4] E. Haber and L. Ruthotto, “Stable architectures for deep neural networks,” Inverse Problems, vol. 34, no. 1, p. 014004, 2017.
 [5] L. Ruthotto and E. Haber, “Deep neural networks motivated by partial differential equations,” arXiv preprint arXiv:1804.04272, 2018.
 [6] Y. Lu, A. Zhong, Q. Li, and B. Dong, “Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations,” in International Conference on Machine Learning, pp. 3282–3291, 2018.
 [7] M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez, “NAISNet: stable deep networks from nonautonomous differential equations,” in Advances in Neural Information Processing Systems, pp. 3025–3035, 2018.
 [8] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.
 [9] A. Gholami, K. Keutzer, and G. Biros, “ANODE: Unconditionally accurate memoryefficient gradients for neural odes,” 2019 International Joint Conference on Artificial Intelligence, (arXiv:1902.10298), 2019.
 [10] A. Lindenmayer, “Mathematical models for cellular interactions in development i. filaments with onesided inputs,” Journal of theoretical biology, vol. 18, no. 3, pp. 280–299, 1968.
 [11] A. M. Turing, “The chemical basis of morphogenesis,” Bulletin of mathematical biology, vol. 52, no. 12, pp. 153–197, 1990.
 [12] R. K. Belew and T. E. Kammeyer, “Evolving aesthetic sorting networks using developmental grammars.,” in ICGA, p. 629, Citeseer, 1993.
 [13] P. Bentley and S. Kumar, “Three ways to grow designs: A comparison of embryogenies for an evolutionary design problem,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary ComputationVolume 1, pp. 35–43, Morgan Kaufmann Publishers Inc., 1999.
 [14] F. Dellaert and R. D. Beer, “A developmental model for the evolution of complete autonomous agents,” in Proceedings of the fourth international conference on simulation of adaptive behavior, pp. 393–401, MIT Press Cambridge, MA, 1996.
 [15] P. Eggenberger, “Evolving morphologies of simulated 3d organisms based on differential gene expression,” in Proceedings of the fourth european conference on Artificial Life, pp. 205–213, 1997.
 [16] G. S. Hornby and J. B. Pollack, “Creating highlevel components with a generative representation for bodybrain evolution,” Artificial life, vol. 8, no. 3, pp. 223–246, 2002.
 [17] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci, “A hypercubebased encoding for evolving largescale neural networks,” Artificial life, vol. 15, no. 2, pp. 185–212, 2009.
 [18] K. O. Stanley, “Exploiting regularity without development,” in Proceedings of the AAAI Fall Symposium on Developmental Systems, p. 37, AAAI Press Menlo Park, CA, 2006.
 [19] K. O. Stanley, “Compositional pattern producing networks: A novel abstraction of development,” Genetic programming and evolvable machines, vol. 8, no. 2, pp. 131–162, 2007.
 [20] J. Koutnik, F. Gomez, and J. Schmidhuber, “Evolving neural networks in compressed weight space,” in Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 619–626, ACM, 2010.
 [21] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra, “Convolution by evolution: Differentiable pattern producing networks,” in Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 109–116, ACM, 2016.
 [22] J. Schmidhuber, “Learning to control fastweight memories: An alternative to dynamic recurrent networks,” Neural Computation, vol. 4, no. 1, pp. 131–139, 1992.
 [23] J. Schmidhuber, “A ‘selfreferential’weight matrix,” in International Conference on Artificial Neural Networks, pp. 446–450, Springer, 1993.
 [24] D. Ha, A. Dai, and Q. Le, “HyperNetworks,” in International Conference on Learning Representations, 2017.
 [25] L. Younes, Shapes and diffeomorphisms, vol. 171. Springer Science & Business Media, 2010.
Appendix A Reactiondiffusionadvection (RDA) Simulation
In this section, we provide details for the reactiondiffusionadvection (RDA) solver as well as an exemplary simulation, shown in Figure 3. For an illustration of the idea, we set the initial distribution of to be a unit Gaussian centered in the middle of the figure. In the first row, we show how this single modal Gaussian changes in time when only a diffusion operator is used in the control operator. As shown in the first row of the figure, the diffusion operator allows the parameters to evolve from a Gaussian with unit variance to a Gaussian with higher variance. In the second row, we show how this single modal Gaussian changes with an advection operator. Notice how the advection operator allows modeling of different filters centered at different locations with the same variance (since advection operator does not diffuse filters but transports them). In the third row, we show how this single modal Gaussian changes when we only use an exponential growth operator for the reaction part. Notice how this operator could allow the kernel to increase/decrease its intensities at different pixels in time. Finally, in the last row, we show a more complex example, where we use all three operators together.
Appendix B Numerical Method
In this section, we provide more details on our numerical techniques. For simplicity, we set to be a Dirac delta function, and we use the above RDA function for . In this case, we have
(10) 
Here, we discuss how can we solve the ODE system, given in Eq. 10. For the evolution of , we follow [9] and use forward Euler method to solve . For example, if we set time step to be 2, then
It is not hard to see that if , then the output is the same as the original ResNet. For the evolution of without nonlinearity, i.e., when is the identity map, there exists an analytic solution in the frequency domain. Applying the Fast Fourier Transform (FFT) from Eq. 8, we will get:
(11) 
where denotes FFT operator. Since the diffusion, advection, and reaction coefficients are constant, we can find the analytical solution in the frequency domain. That is:
(12) 
where is inverse FFT. Note that due to the existence of this analytical solution, the computational cost of solving the evolution for becomes negligible, which is an important benefit of this approach.
When a nonlinearity is applied, we use an approximation to solve Eq. 8, obtaining
(13) 
This means we first apply FFT and its inverse to solve the linear system, and then we apply the nonlinear function . Here, means the time scale to compute . In this paper, we set the nonlinearity function to be tanh, but other nonlinearities could also be used.
For configuration 1, we use . For configuration 2, we use to solve and to solve . In this case, the FLOPS will be only that of the original baseline network. Upon this condition, the process can be formulated as,
where is generated with .
Appendix C Model Configuration
In this section, we provide the architecture we used for the tests in section 3. The AlexNet, ResNet4, and ResNet10 we are using are described in following sections.
c.1 AlexNet
We used a 2layer convolution with residual connection added to the second convolution. Thus, we can transform the second convolution into an ODE. Table 5
explains detailed structure layer by layer. For simplicity, we omit the batch normalization and ReLU layer added after each convolution.
Training details
We train AlexNet for 120 epochs with initial learning rate 0.1. The learning rate decays by a factor of 10 at epoch 40, 80 and 100. Data augmentation is implemented. Also, the batch size used for training is 256. Note that the setting is the same for all experiments, i.e., baseline, Neural ODE, and
ANODEV2.Name  output size  Channel In / Out  Kernel Size  Residual 
conv1  3232  3 / 64  55  No 
max pool  1616  64 / 64     
conv2  1616  64 / 64  55  Yes 
max pool  88  64 / 64     
Name  input size  output size  
fc1  4096  384  
fc2  384  192  
fc3  192  10 
c.2 ResNet4 and ResNet10
Here, we provide the architecture of ResNet4 and ResNet10 used section 3. We omit the batch normalization and ReLU, for simplicity. Detailed structure is provided in Table 6.
Training details
We train ResNet4/10 for 350 epochs with initial learning rate 0.1. The learning rate decays by a factor of 10 at epoch 150, and 300. Data augmentation is implemented. Also, the batch size used for training is 256. Note that the setting is the same for all experiments, i.e., baseline, Neural ODE, and ANODEV2.
Name  output size  Channel In / Out  Kernel Size  Residual  Blocks(ResNet4 / ResNet10) 
conv1  3232  3 / 16  33  No  1 / 1 
layer1_1  3232  16 / 16  [ 33 ]  Yes  1 / 1 
33  
layer1_2  3232  16 / 16  [ 33 ]  Yes  0 / 1 
33  
layer2_1  1616  16 / 32  [ 33 ]  Yes  0 / 1 
33  
layer2_2  1616  32 / 32  [ 33 ]  Yes  0 / 1 
33  
Name  Kernel Size  Stride  Output Size (ResNet4/ResNet10)  
max pool  88  8  44 / 22  
Name  input size (ResNet4/ResNet10)  output size  
fc  256 / 128  10 
Appendix D Convolution kernel Evolution Example
Appendix E Derivation of Optimality Conditions
In this section, we present a detailed derivation of the optimality conditions corresponding to Eq. 5. We need to find the socalled KKT conditions, which can be found by finding stationary points of the corresponding Lagrangian, defined as:
(14) 
In order to derive the optimality conditions, we first take variations with respect to , , and . This basically results in the “Activation ODE”, the “Evolution ODE”, and the relation between and , shown in Eq. 5. Taking variations with respect to will result in a backwardintime ODE for the , which is continuous equivalent to backpropagation (i.e., OptimizeThenDiscretize). Taking variations with respect to will result in an algebraic relation between and . Taking variations with respect to will be split in two parts: variations with respect to for ; and variations with respect to , which is in fact one of our unknown parameters. The split is performed by first integrating by parts the term to expose a term that reads , and then taking variations with respect to . Finally, we also need to take variations with respect to the vector . An important technical detail is that to take the variations of the with respect to can be done easily by converting it to . The details are given below.
In order to satisfy the first optimality condition on we have:
where this equality must hold for any variation in space and time. We have:
(15) 
Imposing this condition holds for all variation will result in the first adjoint equation as follows:
(16) 
For , the following equation needs to be satisfied:
We have
(17) 
This further implies:
(18) 
Finally, the inversion equation on could be found by performing variation on :
We have
(19) 
where is the scalar Heaviside function. Imposing that this condition holds for all variation will result in the inversion equation as follows,
(20) 
The gradient of with respect to can be computed as,
(21) 
Finally, the gradient of with respect to can be computed as,
(22) 
Note that if optimality conditions are achieved with respect to and , then
(23) 
Comments
There are no comments yet.