1 What do differential equations have to do with machine learning?
The first question someone not familiar with the field might ask is, why are differential equations important in this context? The simple answer is that a differential equation is a way to specify an arbitrary nonlinear transform by mathematically encoding prior structural assumptions.
Let’s unpack that statement a bit. There are three common ways to define a nonlinear transform: direct modeling, machine learning, and differential equations. Directly writing down the nonlinear function only works if you know the exact functional form that relates the input to the output. However, in many cases, such exact relations are not known a priori. So how do you do nonlinear modeling if you don’t know the nonlinearity?
One way to address this is to use machine learning. In a typical machine learning problem, you are given some input and you want to predict an output . This generation of a prediction from is a machine learning model (let’s call it ). During training, we attempt to adjust the parameters of so that it generates accurate predictions. We can then use for inference (i.e., produce s for novel inputs ). This is just a nonlinear transformation . The reason
is interesting is because its form is basic but adapts to the data itself. For example, a simple neural network (in design matrix form) with sigmoid activation functions is simply matrix multiplications followed by application of sigmoid functions. Specifically,
is a threelayer deep neural network, where are learnable parameters. You then choose such that reasonably fits the function you wanted it to fit. The theory and practice of machine learning confirms that this is a good way to learn nonlinearities. For example, the Universal Approximation Theorem states that, for enough layers or enough parameters (i.e. sufficiently large matrices), can approximate any nonlinear function sufficiently close (subject to common constraints).So great, this always works! But it has some caveats, the main being that it has to learn everything about the nonlinear transform directly from the data. In many cases we do not know the full nonlinear equation, but we may know details about its structure. For example, the nonlinear function could be the population of rabbits in the forest, and we might know that their rate of births is dependent on the current population. Thus instead of starting from nothing, we may want to use this known a priori relation and a set of parameters that defines it. For the rabbits, let’s say that we want to learn
(1) 
In this case, we have prior knowledge of the rate of births being dependent on the current population. The way to mathematically state this structural assumption is via a differential equation. Here, what we are saying is that the birth rate of the rabbit population at a given time point increases when we have more rabbits. The simplest way of encoding that is
(2) 
where is some learnable constant. If you know your calculus, the solution here is exponential growth from the starting point with a growth rate : . But notice that we didn’t need to know the solution to the differential equation to validate the idea: we encoded the structure of the model and mathematics itself then outputs what the solution should be. Because of this, differential equations have been the tool of choice in most science. For example, physical laws tell you how electrical quantities emit forces (Maxwell’s Equations). These are essentially equations of how things change and thus "where things will be" is the solution to a differential equation. But in recent decades this application has gone much further, with fields like systems biology learning about cellular interactions by encoding known biological structures and mathematically enumerating our assumptions or in targeted drug dosage through PK/PD modelling in systems pharmacology.
So as our machine learning models grow and are hungry for larger and larger amounts of data, differential equations have become an attractive option for specifying nonlinearities in a learnable (via the parameters) but constrained form. They are essentially a way of incorporating prior domainspecific knowledge of the structural relations between the inputs and outputs. Given this way of looking at the two, both methods trade off advantages and disadvantages, making them complementary tools for modeling. It seems like a clear next step in scientific practice to start putting them together in new and exciting ways!
2 What is the Neural Ordinary Differential Equation (ODE)?
The neural ordinary differential equation is one of many ways to put these two subjects together. The simplest way of explaining it is that, instead of learning the nonlinear transformation directly, we wish to learn the structures of the nonlinear transformation. Thus instead of doing , we put the machine learning model on the derivative, , and now solve the ODE. Why would you ever do this? Well, one motivation is that defining the model in this way and then solving the ODE using the simplest and most error prone method, the Euler method, what you get is equivalent to a residual neural network. The way the Euler method works is based on the fact that , thus
(3) 
which implies that
(4) 
This looks similar in structure to a ResNet, one of the most successful image processing models. The insight of the the Neural ODEs paper was that increasingly deep and powerful ResNetlike models effectively approximate a kind of "infinitely deep" model as each layer tends to zero. Rather than adding more layers, we can just model the differential equation directly and then solve it using a purposebuilt ODE solver. Numerical ODE solvers are a science that goes all the way back to the first computers, and modern ones can adaptively choose step sizes and use high order approximations to dratically reduce the number of actual steps required. And as it turns out, this works well in practice, too.
3 How do you solve an ODE?
First, how do you numerically specify and solve an ODE? If you’re new to solving ODEs, you may want to watch our video tutorial on solving ODEs in Julia and look through the ODE tutorial of the DifferentialEquations.jl documentation. The idea is that you define an ODEProblem via a derivative equation u’=f(u,p,t), and provide an initial condition u0, and a timespan tspan to solve over, and specify the parameters p.
For example, the LotkaVolterra equations describe the dynamics of the population of rabbits and wolves. They can be written as:
(5)  
(6) 
and encoded in Julia like:
Then to solve the differential equations, you can simply call solve on the prob:
One last thing to note is that we can make our initial condition (u0) and time spans (tspans) to be functions of the parameters (the elements of p). For example, we can define the ODEProblem:
In this form, everything about the problem is determined by the parameter vector (
p, referred to as in associated literature). The utility of this will be seen later.DifferentialEquations.jl has many powerful options for customising things like accuracy, tolerances, solver methods, events and more; check out the docs for more details on how to use it in more advanced ways.
4 Let’s Put an ODE Into a Neural Net Framework!
To understand embedding an ODE into a neural network, let’s look at what a neural network layer actually is. A layer is really just a differentiable function which takes in a vector of size n and spits out a new vector of size m. That’s it! Layers have traditionally been simple functions like matrix multiply, but in the spirit of differentiable programming people are increasingly experimenting with much more complex functions, such as ray tracers and physics engines.
Turns out that differential equations solvers fit this framework, too: A solve takes in some vector p (which might include parameters like the initial starting point), and outputs some new vector, the solution. Moreover it’s differentiable, which means we can put it straight into a larger differentiable program. This larger program can happily include neural networks, and we can keep using standard optimisation techniques like ADAM to optimise their weights.
DiffEqFlux.jl makes it convenient to do just this; let’s take it for a spin. We’ll start by solving an equation as before, without gradients.
Let’s plot (t,A) over the ODE’s solution to see what we got:
The most basic differential equation layer is diffeq_rd, which does the same thing with a slightly altered syntax. diffeq_rd takes in parameters p for the integrand, puts it in the differential equation defined by prob, and solves it with the chosen arguments (solver, tolerance, etc). For example:
The nice thing about diffeq_rd
is that it takes care of the type handling necessary to make it compatible with the neural network framework (here Flux). To show this, let’s define a neural network with the function as our single layer, and then a loss function that is the squared distance of the output values from
1. In Flux, this looks like:Now we tell Flux to train the neural network by running a 100 epoch to minimise our loss function (
loss_rd()) and thus obtain the optimized parameters:(Animation omitted from the paper. Please see the original blog post)
Flux finds the parameters of the neural network (p) which minimize the cost function, i.e. it trains the neural network: it just so happens that the forward pass of the neural network includes solving an ODE. Since our cost function put a penalty whenever the number of rabbits was far from 1, our neural network found parameters where our population of rabbits and wolves are both constant 1.
Now that we have solving ODEs as just a layer, we can add it anywhere. For example, the multilayer perceptron is written in Flux as
and if we had an appropriate ODE which took a parameter vector of the right size, we can stick it right in there:
or we can stick it into a convolutional neural network, where the previous layers define the initial condition for the ODE:
As long as you can write down the forward pass, we can take any parameterised, differentiable program and optimise it. The world is your oyster.
5 Why is a full ODE solver suite necessary for doing this well?
Where we have combined an existing solver suite and deep learning library, the excellent torchdiffeq
project takes an alternative approach, instead implementing solver methods directly in PyTorch, including an adaptive Runge Kutta 45 (
dopri5) and an AdamsBashforthMoulton method (adams). However, while their approach is very effective for certain kinds of models, not having access to a full solver suite is limiting.Consider the following example, the ROBER ODE. The most welltested (and optimized) implementation of an AdamsBashforthMoulton method is the CVODE integrator in the C++ package SUNDIALS (a derivative of the classic LSODE). Let’s use DifferentialEquations.jl to call CVODE with its Adams method and have it solve the ODE for us:
(For those familiar with solving ODEs in MATLAB, this is similar to ode113)
Both this and the dopri method from Ernst Hairer’s Fortran Suite stall and fail to solve the equation. This happens because the ODE is stiff, and thus methods with "smaller stability regions" will not be able to solve it appropriately (for more details, I suggest reading Hairer’s Solving Ordinary Differential Equations II). On the other hand Rodas5() to this problem, the equation is solved in a blink of an eye:
This is just one example of subtlety in integration: Stabilizing explicit methods via PIadaptive controllers, step prediction in implicit solvers, etc. are all intricate details that take a lot of time and testing to become efficient and robust. Different problems require different methods: Symplectic integrators are required to adequately handle physical many problems without drift, and tools like IMEX integrators are required to handle ODEs which come from partial differential equations. Building a productionquality solver is thus an enormous undertaking and relatively few exist.
Rather than building an MLspecific solver suite in parallel to one suitable for scientific computing, in Julia they are one and the same, meaning you can take advantage of all of these methods today.
6 What kinds of differential equations are there?
Ordinary differential equations are only one kind of differential equation. There are many additional features you can add to the structure of a differential equation. For example, the amount of bunnies in the future isn’t dependent on the number of bunnies right now because it takes a nonzero amount of time for a parent to come to term after a child is incepted. Thus the birth rate of bunnies is actually due to the amount of bunnies in the past. Using a lag term in a differential equation’s derivative makes this equation known as a delay differential equation (DDE). Since DifferentialEquations.jl handles DDEs through the same interface as ODEs, it can be used as a layer in Flux as well. Here’s an example:
The full code for this example, including generating an animation, can be found in the modelzoo
Additionally we can add randomness to our differential equation to simulate how random events can cause extra births or more deaths than expected. This kind of equation is known as a stochastic differential equation (SDE). Since DifferentialEquations.jl handles SDEs (and is currently the only library with adaptive stiff and nonstiff SDE integrators), these can be handled as a layer in Flux similarly. Here’s a neural net layer with an SDE:
And we can train the neural net to watch it in action and find parameters to make the amount of bunnies close to constant:
(Animation omitted from the paper. Please see the original blog post).
And we can keep going. We can make neural versions of each of these. For example, the model zoo contains an example training a neural SDE. There are differential equations which are piecewise constant used in biological simulations, or jump diffusion equations from financial models, and the solvers map right over to the Flux neural network framework through DiffEqFlux.jl. DiffEqFlux.jl uses only around 100 lines of code to pull this all off.
7 Implementing the Neural ODE layer in Julia
Let’s go all the way back for a second and now implement the neural ODE layer in Julia. Remember that this is simply an ODE where the derivative function is defined by a neural network itself. To do this, let’s first define the neural net for the derivative. In Flux, we can define a multilayer perceptron with 1 hidden layer and a tanh activation function like:
To define a neural_ode layer, we then just need to give it a timespan and use the neural_ode function:
As a side note, to run this on the GPU, it is sufficient to make the initial condition and neural network be on the GPU. This will cause the entire ODE solver’s internal operations to take place on the GPU without extra data transfers in the integration scheme. This looks like:
8 Understanding the Neural ODE layer behavior by example
Now let’s use the neural ODE layer in an example to find out what it means. First, let’s generate a time series of an ODE at evenly spaced time points. We’ll use the test equation from the Neural ODE paper.
Now let’s pit a neural ODE against this data. To do so, we will define a single layer neural network which just has the same neural ODE as before (but lower the tolerances to help it converge closer, makes for a better animation!):
Notice that the neural_ode has the same timespan and saveat as the solution that generated the data. This means that given an x (and initial value), it will generate a guess for what it thinks the time series will be where the dynamics (the structure) is predicted by the internal neural network. Let’s see what time series it gives before we train the network. Since the ODE has twodependent variables, we will simplify the plot by only showing the first. The code for the plot is:
But now let’s train our neural network. To do so, define a prediction function like before, and then define a loss between our prediction and data:
And now we train the neural network and watch as it learns how to predict our time series:
(Animation omitted from the paper. Please see the original blog post).
This code can be found in the modelzoo
Notice that we are not learning a solution to the ODE. Instead, what we are learning is the tiny ODE system from which the ODE solution is generated. I.e., the neural network inside the neural_ode layer learns this function:
Thus it learned a compact representation of how the time series works, and it can easily extrapolate to what would happen with different starting conditions. Not only that, it’s a very flexible method for learning such representations. For example, if your data is unevenly spaced at time points t, just pass in saveat=t and the ODE solver takes care of it.
As you could probably guess by now, the DiffEqFlux.jl has all kinds of extra related goodies like Neural SDEs (
neural_msde) for you to explore in your applications.9 The core technical challenge: backpropagation through differential equation solvers
Let’s end by explaining the technical issue that needed a solution to make this all possible. The core to any neural network framework is the ability to backpropagate derivatives in order to calculate the gradient of the loss function with respect to the network’s parameters. Thus if we stick an ODE solver as a layer in a neural network, we need to backpropagate through it.
There are multiple ways to do this. The most common is known as (adjoint) sensitivity analysis. Sensitivity analysis defines a new ODE whose solution gives the gradients to the cost function w.r.t. the parameters, and solves this secondary ODE. This is the method discussed in the neural ordinary differential equations paper, but actually dates back much further, and popular ODE solver frameworks like FATODE, CASADI, and CVODES have been available with this adjoint method for a long time (CVODES came out in 2005!). DifferentialEquations.jl has sensitivity analysis implemented too
The efficiency problem with adjoint sensitivity analysis methods is that they require multiple forward solutions of the ODE. As you would expect, this is very costly. Methods like the checkpointing scheme in CVODES reduce the cost by saving closer time points to make the forward solutions shorter at the cost of using more memory. The method in the neural ordinary differential equations paper tries to eliminate the need for these forward solutions by doing a backwards solution of the ODE itself along with the adjoints. The issue with this is that this method implicitly makes the assumption that the ODE integrator is reversible. Sadly, there are no reversible adaptive integrators for firstorder ODEs, so with no ODE solver method is this guaranteed to work. For example, here’s a quick equation where a backwards solution to the ODE using the Adams method from the paper has >1700% error in its final point, even with solver tolerances of 1e12:
(Here we once again use the CVODE C++ solvers from SUNDIALS since they are a close match to the SciPy integrators used in the neural ODE paper.)
This inaccuracy is the reason why the method from the neural ODE paper is not implemented in software suites, but it once again highlights a detail. Not all ODEs will have a large error due to this issue. And for ODEs where it’s not a problem, this will be the most efficient way to do adjoint sensitivity analysis. And this method only applies to ODEs. Not only that, it doesn’t even apply to all ODEs. For example, ODEs with discontinuities (events) are excluded by the assumptions of the derivation. Thus once again we arrive at the conclusion that one method is not enough.
In DifferentialEquations.jl have implemented many different methods for computing the derivatives of differential equations with respect to parameters. We have a recent preprint detailing some of these results [4]. One of the things we have found is that direct use of automatic differentiation can be one of the most efficient and flexible methods. Julia’s ForwardDiff.jl [5], Flux, and ReverseDiff.jl can directly be applied to perform automatic differentiation on the native Julia differential equation solvers themselves, and this can increase performance while giving new features. Our findings show that forwardmode automatic differentiation is fastest when there are less than 100 parameters in the differential equations, and that for >100 number of parameters adjoint sensitivity analysis is the most efficient. Even then, we have good reason to believe that the next generation reversemode automatic differentiation via sourcetosource AD, Zygote.jl [6], will be more efficient than all of the adjoint sensitivity implementations for large numbers of parameters.
Altogether, being able to switch between different gradient methods without changing the rest of your code is crucial for having a scalable, optimized, and maintainable framework for integrating differential equations and neural networks. And this is precisely what DiffEqFlux.jl gives the user direct access to. There are three functions with a similar API:

diffeq_rd uses Flux’s reversemode AD through the differential equation solver.

diffeq_fd uses ForwardDiff.jl’s forwardmode AD through the differential equation solver.

diffeq_adjoint uses adjoint sensitivity analysis to "backprop the ODE solver"
Therefore, to switch from a reversemode AD layer to a forwardmode AD layer, one simply has to change a single character. Since Juliabased automatic differentiation works on Julia code, the native Julia differential equation solvers will continue to benefit from advances in this field.
10 Conclusion
Machine learning and differential equations are destined to come together due to their complementary ways of describing a nonlinear world. In the Julia ecosystem we have merged the differential equation and deep learning packages in such a way that new independent developments in the two domains can directly be used together. We are only beginning to understand the possibilities that have opened up with this software. We hope that future blog posts will detail some of the cool applications which mix the two disciplines, such as embedding our coming pharmacometric simulation engine PuMaS.jl [7] into the deep learning framework. With access to the full range of solvers for ODEs, SDEs, DAEs, DDEs, PDEs, discrete stochastic equations, and more, we are interested to see what kinds of next generation neural networks you will build with Julia.
References
 [1] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural Ordinary Differential Equations. arXiv:1806.07366 [cs, stat], June 2018. arXiv: 1806.07366.
 [2] Mike Innes. Flux: Elegant Machine Learning with Julia. Journal of Open Source Software, 2018.
 [3] Christopher Rackauckas and Qing Nie. DifferentialEquations.jl  A Performant and FeatureRich Ecosystem for Solving Differential Equations in Julia. Journal of Open Research Software, 5:15, 2017.
 [4] Christopher Rackauckas, Yingbo Ma, Vaibhav Dixit, Xingjian Guo, Mike Innes, Jarrett Revels, Joakim Nyberg, and Vijay Ivaturi. A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions. CoRR, abs/1812.01892, 2018.
 [5] Jarrett Revels, Miles Lubin, and Theodore Papamarkou. ForwardMode Automatic Differentiation in Julia. arXiv:1607.07892 [cs], July 2016. arXiv: 1607.07892.
 [6] Michael Innes. Don’t Unroll Adjoint: Differentiating SSAForm Programs. eprint arXiv:1810.07951, page arXiv:1810.07951, October 2018.
 [7] Christopher Rackacukas, Joakim Nyberg, and Vijay Ivaturi. PuMaS: A Pharmaceutical Modeling and Simulation Engine for drug development and clinical therapeutics in Julia. Abstracts for the Ninth American Conference on Pharmacometrics (ACoP9). Journal of Pharmacokinetics and Pharmacodynamics, 45(1):3–134, October 2018.
Comments
There are no comments yet.