Structure preserving numerical schemes have their roots in geometric integration [hairer2006geometric], and numerical schemes that build on characterisations of PDEs as metric gradient flows [ambrosio2008gradient], just to name a few. The overarching aim of structure preserving numerics is to preserve certain properties of the continuous model, e.g mass or energy conservation, in its discretisation. But structure preservation is not just restricted to play a role in classical numerical analysis of ODEs and PDEs. Indeed, through the advent of continuum interpretations of neural networks [haber2017stable, E2017deepode, E2018, Ruthotto2019pde], structure preservation is also entering the field of deep learning. Here, the main objectives are to use the continuum model and structure preserving schemes to derive stable and converging neural networks and associated training procedures, and algorithms, e.g. neural networks which generalise well.
1.1. Neural Networks
Neural networks are a rich class of machine learning models that can be leveraged for many different tasks including regression, classification, natural language processing, reinforcement learning and image generation[lecun2015deep]. While it is difficult to provide an all-encompassing definition for neural networks, they can generally be characterised as a combination of simple, parametric functions between feature spaces. These functions act as individual building blocks (commonly called the layers of the network). The main mechanism for combining these layers, which we will adopt in this work, is by function composition.
For any , let
denote a vector space (ourfeature space). While in most applications, these are simply finite-dimensional Euclidean spaces, we will assume more general structures (such as Banach spaces) when appropriate. With this, we then define a generic layer
where is the set of possible parameter values of this layer. A neural network
can then be defined via the iteration
such that and , where denotes the entirety of the network’s parameters. The first layer is commonly referred to as the neural network’s input layer, the final layer as the neural network’s output layer, and all of the remaining layers are called hidden layers.
While the above definitions are still quite general, in practice several standard layer types are employed. Most ubiquitous are those that can be written as a learnable affine combination of the input, followed by a simple, nonlinear function: the layer’s activation function. The quintessential example are fully-connected layers
(ReLU). For classification networks, the most common choice for the output layer’s activation function is the softmax activation function given by
such that the neural network’s output’s entries can be regarded as the individual class membership probabilities of the input. An important extension to the concept of fully-connected layers lie inconvolutional layers [lecun1989backpropagation] where the matrix-vector product is replaced by the application of a (multi-channel) convolution operator. These are the main building block of neural networks used in imaging applications.
Suppose we are given a set of paired training data , which is the case for predictive tasks like regression or classification. Training the model then amounts to solving the optimisation problem
Here is the loss for a specific data point where
is a general loss function which usually satisfiesand if and only if . The function is usually smooth on its effective domain and convex. acts as a regulariser which penalises and constrains unwanted solutions. In this setting, solving (4) is a form of empirical risk minimisation [shalev2014understanding]
. Typically, variants of stochastic gradient descent are employed to solve this task. The calculation of the necessary gradients is performed using the famousbackpropagation algorithm, which can be understood as an application of reverse-mode auto-differentiation [linnainmaa1970representation].
1.2. Residual Networks and Differential Equations
In the following, we will discuss artificial neural networks architectures that arise from the numerical discretisation of time and parameter dependent differential equations. Differential equations have a long history in the mathematical treatment of neural networks. Initially, neural networks were motivated by biological neurons in the brain. Mathematical models for their interactions are based on nonlinear, time-dependent differential equations. These have inspired some famous artificial neural networks such as the Hopfield networks[hopfield1982neural].
On a time interval , the values of the approximation to the solution of the differential equation at different discrete times , e.g. and , corresponding to the different layers of the network architecture. For a fixed final time , the existence of an underlying continuous model guarantees the existence of a continuous limit as the number of layers goes to infinity and goes to zero.
In the spirit of geometric numerical integration [hairer2006geometric], we discuss structural properties of the ANN as arising from the structure preserving discretisation of a differential equation with appropriate qualitative features such as having an underlying symmetry, an energy function or a Lyapunov function, preserving a volume form or a symplectic structure.
Contrary to the ’classical’ design principle for layers of affine transformations followed by activation functions (cf. (3)), so-called residual layers are a variation to this principle, which has risen to become a standard design concept of neural networks. Here, the output of one such layer is again added to its input, which again defines a network, the ResNet [He2016resnet], that is now given by the iteration
if . If on the other hand the output space differs from , it is common to add another layer on top, which defines a network . This is for example a common scenario in classification, where the dimensionality of is determined by the number of classes.
It is easy to see that (5) corresponds to a particular discretisation of an ODE. To make the connection more precise, denote by samples of three functions and . With these notations we can write the ResNet (5) as with
if and thus . For general , it can be readily seen that (6) corresponds to the forward Euler discretisation of
Thus, there is a natural connection of the discrete deep learning problem (4)+(5) with the optimal control formulation (4)+(7). The dynamics of the ResNet (5) as a discretisation of (7) is depicted in Figure 1.
From here on we will suppress the dependence on whenever it is clear from the context.
It is a well-known fact that the optimal control formulation can be phrased as a closed dynamical system by using Pontryagin’s principle and this results in a constrained Hamiltonian boundary value problem [Benning2019ode], the Hamiltonian of this system is given as
where is a vector of Lagrange multipliers. The dynamical system is then given by
subject to the constraint
The adjoint equation for can be expressed as
In what follows we will review some of the guiding principles of structure preserving deep learning, and in particular recent contributions for new neural networks architectures as discretisations for ODEs and PDEs in Section 2 and the interpretation of the training of neural networks as an optimal control problem in Section 3, invertible neural networks in Section 4, equivariant neural networks in Section 5, and structure-preserving numerical schemes for the training of neural networks in Section 6.
2. Neural networks inspired by differential equations
2.1. Structure preserving ODE formulations
In this section, we look at how the ODE formulation (7) can be restricted or extended in order to ensure that its flow has favourable properties. We are zooming in on the forward problem itself, assuming that the parameters are fixed. By abuse of notaton, we will in this section write for in that we consider a known function of . It is perhaps not so obvious what the desirable properties of the flow should be, and to some extent we here lean on prior work by Haber and Ruthotto [haber2017stable] as well as Chang et al. [chang2018reversible]. It seems desirable that small perturbations in the data should not lead to large differences in the end result. Preferably, the forward model should have good stability properties, for instance in the sense that for any two nearby solutions and to (7)
for a moderately sized constant
. It is well-known that this type of estimate can be obtained in several different ways, depending on the properties of the underlying vector field in (7). If is Lipschitz in its second argument with constant , then (13) holds with . Looking at (8), one can use where is a Lipschitz constant for the activation function .
Stability can also be studied in terms of Lyapunov functions, that is, functions that are non-increasing along solution trajectories. Functions that are constant along solutions are called first integrals, and a particular instance is the energy function of autonomous Hamiltonian systems.
For stability of nonlinear ODEs one may for instance consider growth and contractivity in the -norm using a one-sided Lipschitz condition. This is similar to the analysis proposed in [Zhang2020resnet]. We assume that there exists a constant such that for all admissible , and we have
In this case it is easily seen that for any two solutions , to (7) one has
where the supremum is taken over all diagonal matrices with diagonal entries in . Some care should be taken here: It is not
so that the sign of the eigenvalues ofis invariant under the set of all positive diagonal matrices . In particular, even if
is skew-symmetric, the vector field (8) may still have a positive one sided Lipschitz constant , but if we assume for instance that , it holds that . Haber and Ruthotto [haber2017stable] suggest to use the eigenvalues of the linearised ODE vector field to analyse the stability behaviour of the forward problem. If the eigenvalues of the resulting Jacobian matrix have only non-positive real parts, it may be an indication of stability, yet for non-autonomous systems such an analysis may lead to wrong conclusions.
Clearly, the set of all vector fields of the form (8) include both stable and unstable cases. There are different ways of ensuring that a vector field has stable or contractive trajectories. One could be to put restrictions on the family of matrices , another option is to alter the form of (8) either by adding symmetry to it or by embedding it into a larger space where it can be given desirable geometric properties, e.g. by doubling the dimension, it is possible in several different ways to obtain a non-autonomous Hamiltonian vector field. In [chang2018reversible] and [haber2017stable] several models are suggested, and we mention a few of them in Section 2.1.2.
2.1.1. Dissipative models and gradient flows.
For different reasons it is desirable to use vector fields with good stability properties. This will ensure that data which are close to each other in a chosen norm initially remain close as the features propagate through the neural network. A model suggested in [haber2017stable] was to consider weight matrices of the form
where is arbitrary and is a small dissipation parameter. Flows of vector fields of the form exactly preserve the -norm of the flow when is skew-symmetric. The non-linearity will generally alter this behaviour, but adding a small amount of dissipation can improve the stability. It is however not guaranteed to reduce the one-sided Lipschitz condition.
Zhang and Schaeffer [Zhang2020resnet] analyse the stability of a model where the activation function is always chosen to be the Rectified Linear Unit (ReLU) function. The form of the discretised model is such that in the limit when tends to zero it must be replaced by a differential inclusion rather than an ODE of the form discussed above, meaning that belongs to some specified subdomain of and their model vector field is
Growth and stability estimates are derived for this class of vector fields as well as for cases where restrictions are imposed on the parameters, such as having positive elements or the case and . For this last case, we consider for simplicity the ODE
which is a gradient system in the sense that with where and is the vector of ones.
Let be twice differentiable and convex in the second argument. Then the gradient vector field satisfies a one-sided Lipschitz condition (14) with .
Proof. (1) We compute
Therefore, by the convexity of
(2) Let and be vectors in . Using (16) while suppressing the -dependence in the parameters, we find
For real scalars and we have
and since a.e. we have
Using this inequality in (17) we obtain
Since the result follows.
Remark. In Theorem 2.1 we restricted the class of activation functions to be absolutely continuous with . This is true for many of the activation functions proposed in the literature, in particular for the ReLU function and the sigmoid .
2.1.2. Hamiltonian vector fields.
One may take inspiration from mechanical systems and introduce the Hamiltonian framework. Separable Hamiltonian systems are defined in terms of kinetic and potential energy functions and
The leads to differential equations of the form
There are different ways to construct a Hamiltonian system from (8). In [chang2018reversible] the following model is suggested
Let be such that . The corresponding Hamiltonian is
where . A simple case is obtained by choosing , , and which after eliminating yields the second order ODE
A second example considered in [chang2018reversible] is obtained by setting .
From the outset, it might not be so clear which geometric features such a non-autonomous Hamiltonian system has. There seem to be at least two ways to understand this problem [marthinsen16gio]. Let us assume that with “positions” and “momenta” forming the phase space. We have the natural symplectic form on the phase space . This form can be represented by the Darboux-matrix as
and the Hamiltonian vector field is defined via .
Let us now introduce as phase space by including the time as a dependent variable. The natural projection is . The space can then be furnished with a contact structure defined as
The first term is just the original form, ignoring the -component of the tangents, the second form is only concerned with the “-direction”. One can write the matrix of by adding a row and a column to coming from the second term
The resulting vector field is then and can be expressed through the equations
where stands for the interior derivative of the form by , e.g. applied to the two-form is the one-form such that for all vector fields . The form is preserved along the flow, but is not.
Extended phase space. One can in fact recover an autonomous Hamiltonian problem from the non-autonomous one by adding two extra dependent variables, say and . We do this by considering the manifold which can be identified with . One needs a new projection and we can define an extended (autonomous) Hamiltonian on as
with corresponding two-form
The corresponding matrix, , is just the original Darboux-matrix where each of the identity matrices has been replaced by corresponding identity matrices. The extended Hamiltonian is exactly preserved along the flow so that the new conjugate momentum variable will keep track of how is varying along solutions. The resulting ODE vector feld can be written as and in coordinates the ODE system becomes
We see at once that since the equations for do not depend on and since the solution for is explicitly given, we solve the same problem as before. After solving for and , we obtain independently by integration. The second thing one may observe is that if a numerical method has the property that111It is common to assume that a given numerical integrator is defined on systems in any dimension
then what we obtain by just considering the first components of the numerical solution to the extended system is precisely the same as what we would have obtained applying the same method to the original non-autonomous problem. This observation was used by Asorey et al. [asorey83gct] to define what is meant by canonical transformations in the non-autonomous case, and we refer to this paper for further results on the structural connections between the two systems. In applications to deep learning, one should note that geometric properties of the solution can mostly be deduced from the extended system rather than the original non-autonomous one, there are numerical methods which preserve energy or the symplectic form and rigorous results can be proved for the long time behaviour of such integrators [hairer2006geometric].
2.2. Structure preserving numerical methods for the ODE model
The rationale behind proposing ODE formulations with geometric structure is to enforce a controlled behaviour of the solution as it is propagated through the hidden layers of the network. It is therefore also important that when the ODE is approximated by a numerical method, this behaviour should be retained by the numerical scheme.
2.2.1. Numerical methods preserving dissipativity
When solving numerically ODEs which satisfy the one-sided Lipschitz condition (14) a desirable property of the numerical scheme would be that it contracts two nearby numerical solutions whenever . That is, it should satisfy
in each time step , preferably without too severe restrictions on the time step . Methods which can achieve this for any step size exist, and are called B-stable. There are many B-stable methods, and for Runge–Kutta methods B-stability can be easily checked by the condition of algebraic stability. Examples of B-stable methods are the implicit Euler method, and all implicit Runge–Kutta methods within the classes, Gauss, Radau IA, Radau IIA and Lobatto IIIC [hairer2010ode2]. In deep learning algorithms it has been more usual to consider explicit schemes since they are much cheaper per step than implicit ones, but are subject to restrictions on the time step used. Note for instance that the explicit Euler method is unstable for any step size when applied even to linear constant coefficient systems where there are eigenvalues of the coefficient matrix on the imaginary axis. To consider contractivity of explicit schemes, we need to replace (14) by a different monotonicity condition
where we assume . For every Runge–Kutta method with strictly positive weights there is a constant such that the numerical solution is contractive whenever the step size satisfies
The value of can be calculated for every Runge–Kutta method with positive weights, and e.g. for the explicit Euler method as well as for the classical 4th order method of Kutta one has [dahlquist79gdo].
2.2.2. Numerical methods preserving energy or symplecticity
For autonomous Hamiltonian systems there are two important geometric properties which are conserved. One is the energy or Hamiltonian the other one is the symplectic structure, a closed non-degenerate differential two-form. These two properties are also the main targets for structure preserving numerical methods.
All Hamiltonian deep learning models presented here can be extended to a separable, autonomous canonical system, i.e. a system of the form (18)-(19). Such systems preserve the symplectic two-form and there are many examples of explicit numerical methods that also preserve this same form in the sense that . The simplest example of such a scheme is the symplectic Euler method, defined for the variables and as
The symplectic Euler method is explicit for separable Hamiltonian systems and is an example of a splitting method [mclachlan02sm]. Many other examples and a comprehensive treatment of symplectic integrators can be found in [hairer2006geometric]. When applying symplectic integrators to Hamiltonian problems one has the important notion of backward error analysis. The numerical approximations obtained can be interpreted as the exact flow of a perturbed system with Hamiltonian 222This is a divergent asymptotic series, but truncation is possible at the expense of an exponentially small remainder term. This partly explains the popularity of symplectic integrators, since many of the characteristics of Hamiltonian flows are inherited by the numerical integrator.
There exist many numerical methods which preserve energy exactly for autonomous problems, for instance there is a large class of schemes based on discrete gradients [mclachlan99giu]. A discrete gradient of a function is a continuous map which satisfies the following two conditions
For a Hamiltonian problem, it is easily seen that the method defined as
will be energy preserving in the sense that for all . There are many choices of discrete gradients, but most of them lead to implicit schemes and therefore have the disadvantage of being computationally expensive.
Another disadvantage is that even if it makes sense to impose energy conservation for the extended autonomised system explained above for deep learning models, it is not clear what that would mean for the original problem. It remains an open problem to understand the potential for and the benefits of using energy preserving schemes for non-autonomous Hamiltonian systems in deep learning.
2.2.3. Splitting methods and shears
Splitting methods are very popular time-integration methods that can be easily applied to preserve geometric structure of the underlying ODE problems, e.g. symplecticity. The idea of splitting and composition is simply to split the vector field in the sum of two (or more) vector fields, to integrate separately each of the parts and compose the corresponding flows. For example splitting a Hamiltonian vector field in the sum of two Hamiltonian vector fields and composing their flows results into a symplectic integrator. If the individual parts are easy to integrate exactly, the resulting time-integration method has often low computational cost. We refer to [mclachlan02sm] for an overview on splitting methods. An ODE on is a shear if there exist a basis of in which the ODE takes the form
A diffeomorphism on is called a shear if it is the flow of a shear. Splitting vector fields into the sum of shears allows to compute their individual flows exactly simply applying the forward Euler method.
Consider the shears :
In the case of autonomous, separable, Hamiltonian systems, the symplectic Euler method (22) can be seen as the composition of two such maps where
Another popular example is the Störmer-Verlet integrator or leapfrog integrator which is also the composition of two shears. It is possible to represent quite complicated functions with just two shears. The “standard map” (also known as Chirikov–Taylor map) is an area preserving, chaotic shear map much-studied in dynamics and accelerator physics.
Shears are useful tools to construct numerical time-integration methods that preserve geometric features. As already mentioned symplecticity is one such property (if and are gradients), another is the preservation of volume and, as we will see in section 4, shears can be used to obtain invertible neural networks.
2.3. Features evolving on Lie groups or homogeneous manifolds
If the data belongs naturally to a differentiable manifold of finite dimension and in (7) is a vector field on for all then
. Concrete examples of data sets in this category are manifold valued images and signals. One example is diffusion tensor imaging consisting of tensor data which at each pointin space corresponds to a matrix symmetric and positive definite, [cook06cos]. If is the number of voxels in the image then . Another example is InSAR imaging data taking values on the unit circle , and where , [rocca97aoo]
. In both these examples neural networks, e.g. denoising autoencoders[vincent10sda], can be used to learn image denoising. The loss function (4) can take the form
where is the noisy image is the source image and is a distance function on or respectively. For example in the case of with , a possible choice of Riemannian metric is
where and are symmetric and is symmetric and positive definite, and with denoting the trace of matrices. With this metric is a Riemannian symmetric space and is geodesically complete [weinmann14tvr].
A third example concerns classification tasks where the data are Lie group valued curves for activity recognition, here with the number of points where the curve is sampled and is the group of rotations, [CMU]. A loss function can be built using for example the following distance between two sampled curves , ,
where is the Lie algebra of , is a norm deduced from an inner product on , and denotes the matrix logarithm. An important feature of this distance is that it is re-parametrisation invariant and taking the infimum over all (discrete) parametrisations of the second curve one obtains a well defined distance for curves independent on their parametrisation, see [celledoni16sao] for details.
The ODE (7) should in this setting be adapted to be an ODE on . If is a submanifold of , , then the ODE (7) can be modified adding constraints, alternatively one could consider intrinsic manifold formulations which allow to represent the data with degrees of freedom instead of . The numerical integration of this ODE must then respect the manifold structure.
2.4. Open problems
2.4.1. Geometric properties of Hamiltonian models
Autonomous Hamiltonian systems and their numerical solution are by now very well understood. For such models, one has conservation of energy that is attractive when considering stability of the neural network map. The same cannot, to our knowledge, be said about the non-autonomous case. One can approach this problem in different ways. One is to consider canonicity in the sense of [asorey83gct] and study the geometric properties of canonical transformations via the extended autonomous Hamiltonian system. This is however not so straightforward for a number of reasons. One issue is that every level set of the extended Hamiltonian will be non-compact. Another issue is that the added conjugate momentum variable, denoted above, is artificial and is only used to balance the time varying energy function.
One should also consider the effect of regularisation, since one may expect that smoothing of the parameters will cause the system to behave more similarly to the autonomous problem. See subsection 3.2 of this paper.
2.4.2. Measure preserving models
The most prevalent example of measure to preserve is the phase space volume. In invertible networks some attention is given to volume preserving maps, see [rezendenormalizingflows, dinh2014nice] and also section 4 of this paper. For the ODE model this amounts to the vector field being divergence free, and there are several ways to parametrise such vector fields. All Hamiltonian vector fields are volume preserving, but the converse is not true. Volume preserving integrators can be obtained, for example, via splitting and composition methods [mclachlan02sm].
2.4.3. Manifold based models
A generalisation of most of the approaches presented in this section to the manifold setting is still missing. For data evolving on manifolds the ODE (7) should be adapted to be an ODE on . Just as an example, a gradient flow on analogous to (16) can be considered starting from defining a function using the antiderivative of the activation function , and the Riemannian metric. The Hamiltonian formulations of section 2.1.2 could be also generalised to manifolds in a similar way, starting from the definition of the Hamiltonian function.
An appropriate numerical time-discretisation of the ODEs for deep learning algorithms must guarantee that also the evolution through the layers remains on the manifold so that one can make use of the Riemannian structure of and obtain improved convergence of the gradient descent optimisation, see also section 6.2. The numerical time discretisation of this ODE must then respect the manifold structure and there is a vast literature on this subject, see for example [hairer2006geometric, Ch. IV]. For numerical integration methods on Lie groups and homogeneous manifolds including symplectic and energy-preserving integrators see [iserles00lgm, celledoni14ait].
3. Deep Learning meets Optimal Control
As outlined in the introduction supervised deep learning with the ResNet architecture can be seen as solving a discretised optimal control problem. This observation has been made in [haber2017stable, E2017deepode, Ciccone2018ode, Lu2018ode] with further extensions to PDEs are described in [Ruthotto2019pde].
where the neural network is either defined by a recursion like (5) or the solution at the final time of an ODE (7). Another approach is to view the training as an optimisation problem over where is the space of the dynamics , i.e.
3.1. Training algorithms
The discrete or continuous optimal control problem can be solved in multiple ways, some of which we will discuss next.
3.1.1. Derivative-based algorithms
The advantage of the reduced problem formulation (23) is that it is a smooth and unconstrained optimisation problem such that if is a Hilbert space, derivative-based algorithms such as ”gradient” descent are applicable. Due to the nonlinearity of we can at most expect to converge to a critical point with a derivative-based algorithm. The most basic algorithm to train neural networks is stochastic gradient descent (SGD) [robbins1951stochastic]. Given an initial estimate , SGD consists of iterating two simple steps. First, sample a set of data points and then iterate
Other first-order algorithms used for deep learning are the popular Adam [kingma2014adam] but also the Gauss–Newton method has been used [haber2017stable]. A discussion on the convergence of SGD is out of the scope of this paper. Instead we focus on how to compute the derivatives in the continuous model which is the central building block for all first-order methods. This following theorem is very common and the main idea dates back to Pontryagin [pontryagin1987mathematical]. This formulation is inspired by [Bonnans2019, Lemma 2.47].
Assume that and are of class and that is Lipschitz with respect to . Let be the solution of the ODE (24b) with initial condition and the solution of the adjoint equation
Then the Fréchet derivative of at is the linear map ,
For finite dimensional a similar theorem can be proven which dates back to Grönwall in 1919, see [Gronwall1919ode, Hairer1993ode1]. Defining neural networks via ODEs and computing gradients via continuous formulas similar to Theorem 3.1 has been first proposed in [chen2018neural] with extensions in [Gholami2019anode] and [Dupont2019anode]. In the deep learning community this is being referred to as Neural ODEs.
Similar to Theorem 3.1 a discrete version can be derived when the ODE (7) is discretised with a Runge–Kutta method is given in [Benning2019ode], see also [hager2000runge] for a related discussion about this topic. For simplicity we just state the special case of the explicit Euler discretisation (ResNet (5)) here.
Theorem 3.2 ([Benning2019ode]).
Let be the solution of the ResNet (5) with initial condition . If satisfies the recursion
then the derivative of is given by .
Some readers will spot the similarity between Theorem 3.2 and what is called backpropagation in the deep learning community. This observation was already made in the 1980’s, e.g. [lecun1988theoretical].
If all functions in (27) are discretised by constant functions on , then the gradient of the discrete problem (28) approximates the Fréchet derivative of the continuous problem (27). In more detail, let be supported on and with if and 0 else. If we denote by the data fit using the ODE solution (7) and the data fit with the ResNet (5), then
In other words, in this case of piecewise constant functions, discretise-then-optimise is the same as the optimise-then-discretise.
3.1.2. Method of successive approximation
Another strategy to train neural networks has been recently proposed by [Li2018pmpdeep] and is not based on the gradients of the reduced formulation (23) but on the necessary conditions of (24) also known as Pontryagin’s maximum principle [pontryagin1987mathematical] instead. Given an initial estimate of , each iteration of the method of successive approximation(MSA) has three simple steps. First, solve (24b) with which we denote by , i.e.
Then solve the adjoint equation (26) with and denote the solution by , i.e.