Approximation Capabilities of Neural Ordinary Differential Equations

Neural Ordinary Differential Equations have been recently proposed as an infinite-depth generalization of residual networks. Neural ODEs provide out-of-the-box invertibility of the mapping realized by the neural network, and can lead to networks that are more efficient in terms of computational time and parameter space. Here, we show that a Neural ODE operating on a space with dimensionality increased by one compared to the input dimension is a universal approximator for the space of continuous functions, at the cost of losing invertibility. We then turn our focus to invertible mappings, and we prove that any homeomorphism on a p-dimensional Euclidean space can be approximated by a Neural ODE operating on a (2p+1)-dimensional Euclidean space.


1 Introduction

A neural network block is a function $f_\Theta$ that maps an input vector $x$ to an output vector, and is parameterized by a vector $\Theta$. We require that $f_\Theta$ is almost everywhere differentiable with respect to both of its arguments, allowing the use of gradient-based methods for tweaking the weights based on training data and some optimization criterion, and for passing the gradient to preceding network layers.

One type of neural building block that has received attention in recent years is a residual block [HZRS16], where $x_{t+1} = x_t + f_{\Theta_t}(x_t)$, with $f_{\Theta_t}$ being some differentiable, nonlinear, possibly multi-layer transformation. Input and output dimensionality of a residual block are the same, and such blocks are usually stacked in a sequence, $x_0, x_1, \dots, x_T$. Often, the functional form of $f_{\Theta_t}$ is the same for all blocks in the sequence. Then, we can represent the sequence through $x_{t+1} - x_t = f_\Theta(x_t, t)$, where $\Theta$ consists of trainable parameters for all blocks in the sequence; the second argument, $t$, allows us to pick the proper subset of parameters, $\Theta_t$. If we allow arbitrary $f_\Theta$, a sequence of residual blocks can, in principle, model arbitrary mappings $x_0 \to x_T$, where we define $x_T$ to be the result of applying the sequence of residual blocks to the initial input $x_0$. A recent result [LJ18] shows that a linear layer preceded by a deep sequence of residual blocks with only one neuron in the hidden layer is a universal approximator for Lebesgue-integrable functions.

1.1 Neural Ordinary Differential Equations

Neural ODEs (ODE-Nets) [CRBD18] are a recently proposed class of differentiable neural network building blocks. ODE-Nets were formulated by observing that processing an initial input vector $x_0$ through a sequence of residual blocks can be seen as evolution of $x_t$ in time $t$. Then, a residual block (eq. 1) is a discretization of a continuous-time system of ordinary differential equations (eq. 2)

$$x_{t+1} - x_t = f_\Theta(x_t, t), \tag{1}$$

$$\frac{dx_t}{dt} = \lim_{\delta_t \to 0} \frac{x_{t+\delta_t} - x_t}{\delta_t} = f_\Theta(x_t, t). \tag{2}$$

The transformation taking $x_0$ into $x_T$ realized by an ODE-Net for some chosen, fixed time $T$ is not specified directly through a functional relationship $x_T = f(x_0)$ for some neural network $f$, but indirectly, through the solutions to the initial value problem (IVP) of the ODE

$$x_T = \phi_T(x_0) = x_0 + \int_0^T f_\Theta(x_t, t)\,dt \tag{3}$$

involving some underlying neural network $f_\Theta(x_t, t)$ with trainable parameters $\Theta$. By a $p$-ODE-Net we denote an ODE-Net that takes a $p$-dimensional sample vector on input, and produces a $p$-dimensional vector on output. The underlying network $f_\Theta$ must match those dimensions on its input and output, but in principle can have arbitrary internal architecture.
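A minimal numerical sketch of the relationship between eq. 1 and eq. 2, under stated assumptions: the scalar field `f` below (toy dynamics dx/dt = -0.5x) is a hypothetical stand-in for the network $f_\Theta$, not anything taken from [CRBD18]. It illustrates that a stack of residual blocks coincides with forward Euler integration at unit step, and that shrinking the step moves the result toward the exact ODE solution.

```python
import math

def f(x, t):
    # Hypothetical stand-in for f_Theta(x, t); toy dynamics dx/dt = -0.5 * x.
    return -0.5 * x

def residual_stack(x0, num_blocks):
    # Eq. 1: x_{t+1} = x_t + f(x_t, t), one residual block per unit of time.
    x = x0
    for t in range(num_blocks):
        x = x + f(x, float(t))
    return x

def euler_ode(x0, T, steps):
    # Forward Euler discretization of eq. 2 with step dt = T / steps.
    dt = T / steps
    x = x0
    for k in range(steps):
        x = x + dt * f(x, k * dt)
    return x

x0, T = 1.0, 4
exact = x0 * math.exp(-0.5 * T)   # closed-form solution of the toy ODE
coarse = residual_stack(x0, T)    # four residual blocks
fine = euler_ode(x0, T, 4000)     # many small Euler steps

print(coarse, fine, exact)
```

With four unit steps the residual stack gives 0.0625, while the exact solution of this toy ODE at T = 4 is exp(-2), roughly 0.135; the fine Euler run lands close to the latter.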

The adjoint sensitivity method [PMBG62], based on reverse-time integration of an expanded ODE, allows for finding gradients of the IVP solutions with respect to parameters $\Theta$ and the initial values $x_0$. This allows training ODE-Nets using gradient descent, as well as combining them with other neural network blocks.

Benefits of ODE-Nets compared to residual blocks include improved memory and parameter efficiency, ease of modeling phenomena with continuous time dynamics, out-of-the-box invertibility ($x_0 = x_T + \int_T^0 f_\Theta(x_t, t)\,dt$), and simplified computations of normalizing flows [CRBD18]. Since their introduction, ODE-Nets have seen improved implementations [RIM19] and enhancements in training and stability [GKB19, ZYG19]. The question of their approximation capabilities remains, however, unresolved.

1.2 Limitations of Neural ODEs

Unlike a residual block, a Neural ODE on its own does not have universal approximation capability. Consider a continuous, differentiable, invertible function $f(x) = -x$ on $\mathbb{R}$. There is no ODE defined on $\mathbb{R}$ that would result in $x_T = \phi_T(x_0) = -x_0$. Informally, in ODEs, paths between the initial value and final value have to be continuous and cannot intersect in the space-time domain $\mathbb{R} \times [0, T]$ for two different initial values, and paths corresponding to $x_0 \to -x_0$ and $-x_0 \to x_0$ would need to intersect. By contrast, in a residual block sequence, a discrete dynamical system on $\mathbb{R}$, we do not have continuous paths, only points at unit-time intervals, with an arbitrary transformation between points; finding a ResNet for $f(x) = -x$ is easy.
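The non-intersection argument can be illustrated numerically. The sketch below is an illustration under assumptions, not a proof: the vector field `v` is an arbitrary made-up Lipschitz function on the real line, and the target to rule out is taken to be an order-reversing map such as x ↦ -x. Because one-dimensional trajectories cannot cross, the flow keeps initial values in the same order, so it can never reverse them.

```python
import math

def v(x, t):
    # Hypothetical Lipschitz-continuous, time-dependent vector field on R.
    return math.sin(3.0 * x) - 0.5 * x + math.cos(t)

def flow(x0, T, steps=2000):
    # Forward Euler approximation of phi_T(x0); the step is small relative
    # to the Lipschitz constant of v, so each step is an increasing map.
    dt = T / steps
    x = x0
    for k in range(steps):
        x = x + dt * v(x, k * dt)
    return x

starts = [-2.0, -1.0, -0.1, 0.1, 1.0, 2.0]
ends = [flow(x, 2.0) for x in starts]

# Trajectories never cross, so the ordering of the start points survives.
order_preserved = all(a < b for a, b in zip(ends, ends[1:]))
print(order_preserved)
```

Any other Lipschitz field substituted for `v` should give the same ordering result, which is why no choice of right-hand side on the real line can realize an order-reversing map.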

1.3 Our Contribution

We analyze the approximation capabilities of ODE-Nets. The results most closely related to ours have been recently provided by the authors of ANODE [DDT19], who focus on a $p$-ODE-Net followed by a linear layer. They provide counterexamples showing that such an architecture is not a universal approximator of functions. However, they show empirical evidence indicating that expanding the dimensionality, using an ODE-Net of dimensionality higher than $p$ instead of a $p$-ODE-Net, has a positive impact on training.

Here, we prove that increasing the dimensionality by one is enough to turn a Neural ODE followed by a linear layer into a universal approximator. Next, we turn our attention to invertible functions, that is, homeomorphisms, by exploring pure ODE-Nets, not capped by a linear layer. We go beyond the one-dimensional example, and show a class of invertible mappings that cannot be expressed by Neural ODEs defined on $\mathbb{R}^p$. Our main result is a proof that any homeomorphism on $\mathbb{R}^p$ can be modeled by a Neural ODE operating on a Euclidean space of dimensionality $2p+1$ that embeds $\mathbb{R}^p$ as a linear subspace.

2 Neural ODEs are Universal Approximators

We show, through a simple construction, that a Neural ODE followed by a linear layer can approximate functions as well as any traditional feed-forward neural network. Since networks with shallow-but-wide fully-connected architecture [Cyb89, Hor91], or narrow-but-deep ResNet-based architecture [LJ18], are universal approximators, so are ODE-Nets.

Theorem 1.

Consider a neural network $F$ that maps $\mathbb{R}^p$ to $\mathbb{R}^r$ and approximates a Lebesgue-integrable function defined on a compact subset of $\mathbb{R}^p$. Then there exists a linear layer-capped $(p+r)$-ODE-Net that can perform the mapping $F$.

Proof.

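Set the dimensionality of the ODE-Net to $p + r$. Let $G$ be a neural network that takes input vectors $[x^{(p)}, y^{(r)}]$ (we use a parenthesized superscript to denote dimensionality of vectors; that is, $x^{(p)} \in \mathbb{R}^p$) and produces $(p+r)$-dimensional output vectors $[0^{(p)}, F(x^{(p)})]$, where $F$ is the desired transformation. $G$ is constructed as follows: use $F$ to produce $F(x^{(p)})$, ignore $y^{(r)}$, and always output $0^{(p)}$ in the first $p$ dimensions. Consider a $(p+r)$-ODE-Net defined through $dx_t/dt = G(x_t)$. Let the initial value be $x_0 = [x^{(p)}, 0^{(r)}]$. The ODE will not alter the first $p$ dimensions throughout time, hence for any $t$, $G(x_t) = [0^{(p)}, F(x^{(p)})]$. After time $T = 1$, we will have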

$$x_T = x_0 + \int_0^1 G(x_t)\,dt = [x^{(p)}, 0^{(r)}] + \int_0^1 [0^{(p)}, y^{(r)}]\,dt = [x^{(p)}, F(x^{(p)})].$$

Thus, for any $x^{(p)}$, the output $F(x^{(p)})$ can be recovered from the output of the ODE-Net by a simple, sparse linear layer that ignores all dimensions except the last $r$, which it returns. ∎
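The construction in the proof can be sketched in a few lines. This is a sketch under assumptions: the function `F` below is a made-up stand-in for a trained network from two inputs to one output, and the names `G`, `ode_net`, and `linear_cap` are illustrative only. Since the velocity is constant along each trajectory, forward Euler integrates this particular ODE exactly up to rounding.

```python
import math

P, R = 2, 1  # input and output dimensionality of F (assumed for illustration)

def F(x):
    # Hypothetical stand-in for the network to be reproduced, R^2 -> R^1.
    return [math.tanh(x[0]) - x[1] ** 2]

def G(state):
    # Right-hand side of the ODE: zero velocity in the first P dimensions,
    # F(x) in the last R dimensions; the current last R values are ignored.
    return [0.0] * P + F(state[:P])

def ode_net(x, steps=10):
    # Integrate from the augmented initial value [x, 0] for time T = 1.
    state = list(x) + [0.0] * R
    dt = 1.0 / steps
    for _ in range(steps):
        vel = G(state)
        state = [s + dt * w for s, w in zip(state, vel)]
    return state

def linear_cap(state):
    # Sparse linear layer that returns only the last R dimensions.
    return state[P:]

x = [0.3, -1.2]
print(linear_cap(ode_net(x)), F(x))
```

The first `P` coordinates pass through unchanged, which is what makes the mapping trivially correct but also removes any efficiency benefit.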

ODE-Nets have two main advantages compared to traditional architectures: improved computational and space efficiency, and out-of-the-box invertibility. The construction above nullifies both, and thus is of theoretical interest only. This introduces two new open problems: can Neural ODEs be universal approximators while showing improved efficiency compared to traditional architectures, and can Neural ODEs model any invertible function $h$, assuming $h$ and $h^{-1}$ are continuous? The main focus of this work is to address the second problem.

3 Background on ODEs, Flows, and Embeddings

This section recapitulates standard material; for details, see [Utz81, Lee01, BS02, You10].

3.1 Flows

A mapping is a homeomorphism if is a one-to-one mapping of onto itself, and both and its inverse are continuous. Here, we will assume that for some , and we will use the term -homeomorphism where dimensionality matters.

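A topological transformation group or a flow [Utz81] is an ordered triple $(X, G, \Phi)$ involving an additive group $G$ with neutral element 0, and a mapping $\Phi: X \times G \to X$ such that $\Phi(x, 0) = x$ and $\Phi(\Phi(x, s), t) = \Phi(x, s + t)$ for all $x \in X$, all $s, t \in G$. Further, mapping $\Phi$ is assumed to be continuous with respect to the first argument. The mapping $\Phi$ gives rise to a parametric family of homeomorphisms $\phi_t: X \to X$ defined as $\phi_t(x) = \Phi(x, t)$, with the inverse being $\phi_t^{-1} = \phi_{-t}$.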

Given a flow, an orbit or a trajectory associated with is a subspace . Given , either or ; two orbits are either identical or disjoint, they never intersect. A point is a fixed point if .

A discrete flow is defined by setting . For arbitrary homeomorphism of onto itself, we easily get a corresponding discrete flow, an iterated discrete dynamical system, , , .

A type of flow relevant to Neural ODEs is a continuous flow, defined by setting , and adding an assumption that the family of homeomorphisms, the function , is differentiable with respect to its second argument, , with continuous . The key difference compared to a discrete flow is that the flow at time , , is now defined for arbitrary , not just for integers. We will use the term -flow to indicate that .

Informally, in a continuous flow the orbits are continuous, and the property that orbits never intersect has consequences for what homeomorphisms can result from a flow. Unlike in the discrete case, for a given homeomorphism there may not be a continuous flow such that for some . We cannot just set , what is required is a continuous family of homeomorphisms such that and is identity – such family may not exist for some . In such case, a Neural ODE would not be able to model the mapping .

3.2 Correspondence between Flows and ODEs

Given a continuous flow one can define a corresponding ODE operating on by defining a vector for every such that . Then, the ODE

$$\frac{dx_t}{dt} = V(x_t), \qquad \phi_{(T-S)}(x_S) = x_S + \int_S^T V(x_t)\,dt,$$

corresponds to continuous flow . Indeed, is identity, and . Thus, for any homeomorphism family defining a continuous flow, there is a corresponding ODE that, integrated for time , models the flow at time , .

The vectors of derivatives for all are continuous over and are constant in time, and define a continuous vector field over . The ODEs evolving according to such a time-invariant vector field, where the right-hand side of eq. 2 depends on but not directly on time , are called autonomous ODEs, and take the form of .

Any time-dependent ODE (eq. 2) can be transformed into an autonomous ODE by removing time $t$ from being a separate argument of $f_\Theta$, and adding it as part of the vector $x$. Specifically, we add an additional dimension $x[p+1]$ to vector $x$ (we use $x[i]$ to denote the $i$-th component of vector $x$). We equate it with time, $x[p+1] = t$, by including $dx[p+1]/dt = 1$ in the definition of how $f_\Theta$ acts on $x$, and including $x[p+1] = 0$ in the initial value $x_0$. In defining $f_\Theta$, explicit use of $t$ as a variable is replaced by using the component $x[p+1]$ of vector $x$. The result is an autonomous ODE.
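The time-augmentation trick can be checked with a short sketch. The field `f` below is a made-up time-dependent example, and the helper names are illustrative assumptions; the point is only that appending a coordinate with derivative 1 and initial value 0 reproduces the time-dependent solution.

```python
import math

def f(x, t):
    # Hypothetical time-dependent vector field on R^1.
    return [math.cos(t) - x[0]]

def g(z):
    # Autonomous version: z = [x, s], where s plays the role of time.
    # The extra component has derivative 1, so s(t) = t when s(0) = 0.
    x, s = z[:-1], z[-1]
    return f(x, s) + [1.0]

def euler_time_dependent(x0, T, steps):
    dt = T / steps
    x = list(x0)
    for k in range(steps):
        vel = f(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, vel)]
    return x

def euler_autonomous(x0, T, steps):
    dt = T / steps
    z = list(x0) + [0.0]  # initial value of the time component is 0
    for _ in range(steps):
        vel = g(z)
        z = [zi + dt * vi for zi, vi in zip(z, vel)]
    return z

x_td = euler_time_dependent([0.5], 2.0, 1024)
z_auto = euler_autonomous([0.5], 2.0, 1024)
print(x_td, z_auto)
```

The last entry of `z_auto` ends at the integration time, and the remaining entries match the time-dependent solution.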

Given time $T$ and an ODE defined by $dx_t/dt = f_\Theta(x_t)$, the flow at time $T$, $\phi_T$, may not be well defined, for example if $x_t$ diverges to infinity along the way. However, if $f_\Theta$ is well-behaved, the flow will exist at least locally around the initial value. Specifically, the Picard–Lindelöf theorem states that if an ODE is defined by a Lipschitz-continuous function $f_\Theta$, then there exists $\varepsilon > 0$ such that the flow at time $T$, $\phi_T$, is well-defined and unique for $-\varepsilon < T < \varepsilon$. If $\phi_T$ exists, it is a homeomorphism, since the inverse exists and is continuous; simply, $\phi_{-T}$ is the inverse of $\phi_T$.
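The statement that the reverse-time flow inverts the forward flow can be checked numerically. The sketch below makes two assumptions: the vector field is a simple pendulum, chosen only because it is Lipschitz-continuous, and a hand-written fourth-order Runge-Kutta integrator stands in for whatever solver an ODE-Net implementation would use.

```python
import math

def v(x):
    # Assumed example: pendulum dynamics, a Lipschitz-continuous autonomous field.
    return [x[1], -math.sin(x[0])]

def rk4_flow(x0, T, steps=1000):
    # Approximates the flow at time T; a negative T integrates backward in time.
    dt = T / steps
    x = list(x0)
    for _ in range(steps):
        k1 = v(x)
        k2 = v([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
        k3 = v([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
        k4 = v([xi + dt * ki for xi, ki in zip(x, k3)])
        x = [xi + dt * (a + 2 * b + 2 * c + d) / 6.0
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x

x0 = [1.0, 0.5]
xT = rk4_flow(x0, 2.0)        # forward flow
x_back = rk4_flow(xT, -2.0)   # reverse-time flow applied to the result
round_trip_error = max(abs(a - b) for a, b in zip(x0, x_back))
print(round_trip_error)
```

The round trip recovers the initial value only up to the integration error of the solver, which is small here because the step is small.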

3.3 Flow Embedding Problem for Homeomorphisms

Given a -flow, we can always find a corresponding ODE. Given an ODE, under mild conditions, we can find a corresponding flow at time , , and it necessarily is a homeomorphism. Is the class of -flows equivalent to the class of -homeomorphisms, or only to its subset? That is, given a homeomorphism , does a -flow such that exist? This question is referred to as the problem of embedding the homeomorphism into a flow.

For a homeomorphism $h$, its restricted embedding into a flow is a flow on the same domain such that $\phi_T = h$ for some $T$; the flow is restricted to be on the same domain as the homeomorphism. Studies of homeomorphisms on simple domains such as a 1D segment [For55] or a 2D plane [And65] already showed that a restricted embedding does not always exist.

An unrestricted embedding into a flow [Utz81] is a flow on some space of dimensionality higher than . It involves a homeomorphism that maps into some subset , such that the flow on results in mappings on that are equivalent to on for some , that is, . While a solution to the unrestricted embedding problem always exists, it involves a smooth, non-Euclidean manifold . For a homeomorphism , the manifold , variously referred to as the twisted cylinder [Utz81], or a suspension under a ceiling function [BS02], or a mapping torus [Bro66], is a quotient space defined through the equivalence relation . The flow that maps at to at and at involves trajectories in in the following way: for going from 0 to 1, the trajectory tracks in a straight line from to , which in the quotient space is equivalent to . Then, for going from 1 to 2, the trajectory proceeds from to .

The fact that the solution to the unrestricted embedding problem involves a flow on a non-Euclidean manifold makes applying it in the context of gradient-trained ODE-Nets difficult.

4 Approximation of Homeomorphisms by Neural ODEs

In exploring the approximation capabilities of Neural ODEs for $p$-homeomorphisms, we will assume that the neural network on the right hand side of the ODE is a universal approximator and thus can be made large enough to approximate an arbitrary function arbitrarily well. Thus, our concern is with what flows can be modeled assuming the ODE-Net can have arbitrary internal dimensionality, depth, and architecture. We only care about the input-output dimensionality of the ODE-Net. We consider two scenarios: an ODE-Net operating on the same $p$-dimensional space as the homeomorphism, and an ODE-Net operating on a space of higher dimensionality.

4.1 Restricting the Dimensionality Limits Capabilities of Neural ODEs

We show a class of functions that a Neural ODE cannot model, a class that generalizes the one-dimensional example.

Theorem 2.

Let , and let be a set that partitions into two or more disjoint, connected subsets , for . Consider a mapping that

• is an identity transformation on , that is, ,

• maps some into , for .

Then, no -ODE-Net can model .

Proof.

A -ODE-Net can model if a restricted flow embedding of exists. Suppose that it does, a continuous flow can be found for such that the trajectory of is continuous on with and for some , for all .

If maps some into , for , the trajectory from to crosses – there is such that for some . From uniqueness and reversibility of ODE trajectories, we then have . From additive property of flows, we have .

Since is identity over and , thus . That is, the trajectory over time is a closed curve starting and ending at , and for any . Specifically, . Thus, . We arrive at a contradiction with the assumption that and are in two disjoint subsets of separated by . Thus, no -ODE-Net can model .

The result above shows that Neural ODEs applied in the most natural way, with , are severely restricted in the way distinct regions of the input space can be rearranged in order to learn and generalize from the training set, and the restrictions go well beyond requiring invertibility and continuity.

4.2 Neural ODEs with Extra Dimensions are Universal Approximators for Homeomorphisms

If we allow the Neural ODE to operate on a Euclidean space of dimensionality higher than $p$, we can approximate an arbitrary $p$-homeomorphism $h$, as long as the dimensionality is high enough. Here, we show that it suffices to take dimensionality $2p+1$. We construct a mapping from the original problem space, $\mathbb{R}^p$, into $\mathbb{R}^{2p+1}$ that

• preserves $\mathbb{R}^p$ as a $p$-dimensional linear subspace consisting of vectors $[x, 0^{(p+1)}]$,

• leads to an ODE on $\mathbb{R}^{2p+1}$ that maps $[x, 0^{(p+1)}] \to [h(x), 0^{(p+1)}]$.

Thus, we provide a solution with a structure that is convenient for out-of-the-box training and inference using Neural ODEs: it is sufficient to add $p+1$ dimensions, all zeros, to the input vectors. Our main result is the following.

Theorem 3.

For any homeomorphism $h$ on $\mathbb{R}^p$, there exists a $(2p+1)$-ODE-Net $\phi_T$ for $T = 1$ such that $\phi_T([x, 0^{(p+1)}]) = [h(x), 0^{(p+1)}]$ for any $x$.

Proof.

We prove the existence in a constructive way, by showing a vector field in $\mathbb{R}^{2p+1}$, and thus an ODE, with the desired properties. Let $\delta_x$ and $z_x$ be defined as

$$\delta_x = h(x) - x, \qquad z_x = r(x),$$

where is bounded away from zero, and is a smooth, strictly monotonic function. It is applied to a vector entry-wise; in Fig. 1 we used .

We start with the extended space with a variable $\tau$ corresponding to time added as the last dimension, as in the construction of an autonomous ODE from a time-dependent ODE. We then define a mapping $y$ from this extended space into $\mathbb{R}^{2p+1}$. For $\tau \in [0, 1]$, the mapping (see Fig. 1) is defined through

$$y(x,\tau) = \left[\, x + \frac{1 - \cos \pi\tau}{2}\,\delta_x,\;\; z_x\,(1 - \cos 2\pi\tau),\;\; \sin 2\pi\tau \,\right]. \tag{4}$$

The mapping indeed just adds $p+1$ dimensions of 0 to $x$ at time $\tau = 0$, and at time $\tau = 1$ it gives the result of the homeomorphism applied to $x$, again with $p+1$ dimensions of 0

$$y(x,0) = [x, 0^{(p)}, 0], \qquad y(x,1) = [x + \delta_x, 0^{(p)}, 0] = [h(x), 0^{(p)}, 0] = y(h(x), 0).$$

We can use these properties to define the mapping for , by setting ; for example, . Intuitively, the mapping will provide the position in of the time evolution for duration of an ODE on starting from a position corresponding to .

For $x \neq x'$, for any given $\tau$, we have $y(x,\tau) \neq y(x',\tau)$, since $z_x = r(x)$ is a one-to-one mapping; it was defined by a strictly monotonic function $r$. Thus, in $\mathbb{R}^{2p+1}$, paths starting from two distinct points do not intersect at the same point in time. Intuitively, we have added enough dimensions to the original space so that we can reroute all trajectories without intersections.

We have correspond directly to time, that is, and for . The mapping has continuous derivative with respect to , defining a vector field over the image of , a subset of

$$\frac{dy}{dt} = \left[\, \frac{\pi \delta_x}{2} \sin \pi t,\;\; 2\pi z_x \sin 2\pi t,\;\; 2\pi \cos 2\pi t \,\right].$$

We can verify that the vector field defined through derivatives of $y$ with respect to time has the same values for $t = 0$ and $t = 1$, for any $x$:

$$\frac{dy}{dt}(x,0) = [0^{(p)}, 0^{(p)}, 2\pi], \qquad \frac{dy}{dt}(x,1) = [0^{(p)}, 0^{(p)}, 2\pi].$$

Thus,

$$\frac{dy}{dt}(x,1) = \frac{dy}{dt}(h(x),0),$$

the vector field is well-behaved at $t = 1$; it is continuous over the whole image of $y$. The vector field above is defined over a closed subset of $\mathbb{R}^{2p+1}$, and can be (see [Lee01], Lemma 8.6) extended to the whole $\mathbb{R}^{2p+1}$. A $(2p+1)$-ODE-Net with a universal approximator network on the right hand side can be designed to approximate the vector field arbitrarily well. The resulting ODE-Net approximates the mapping $[x, 0^{(p+1)}] \to [h(x), 0^{(p+1)}]$. ∎
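The mapping of Eq. 4 can be explored numerically for the smallest case, p = 1. The sketch below rests on assumptions that are not from the paper: the homeomorphism is taken to be the reflection h(x) = -x, and the strictly monotonic function is taken to be r(x) = 2 + tanh(x), which stays away from zero. It checks the endpoint identities, shows that the extra coordinates separate paths that coincide in the original coordinate, and compares the time derivative at the two ends.

```python
import math

def h(x):
    # Assumed example homeomorphism on R (p = 1): reflection about the origin.
    return -x

def r(x):
    # Assumed choice of r: smooth, strictly increasing, values in (1, 3).
    return 2.0 + math.tanh(x)

def y(x, tau):
    # Eq. 4 for p = 1: a path in R^3 from [x, 0, 0] to [h(x), 0, 0].
    delta, z = h(x) - x, r(x)
    return [x + 0.5 * (1.0 - math.cos(math.pi * tau)) * delta,
            z * (1.0 - math.cos(2.0 * math.pi * tau)),
            math.sin(2.0 * math.pi * tau)]

def dy(x, t):
    # Time derivative of y, component by component.
    delta, z = h(x) - x, r(x)
    return [0.5 * math.pi * delta * math.sin(math.pi * t),
            2.0 * math.pi * z * math.sin(2.0 * math.pi * t),
            2.0 * math.pi * math.cos(2.0 * math.pi * t)]

a, b = 0.7, -1.3
mid_a, mid_b = y(a, 0.5), y(b, 0.5)
# Halfway through, both paths have first coordinate 0 (they would collide on
# the real line), but their second coordinates differ because r is one-to-one.
print(mid_a, mid_b)
print(y(a, 1.0), y(h(a), 0.0))
```

For the reflection, every path passes through first coordinate 0 at the midpoint, which is exactly the collision that rules out a one-dimensional flow; the second coordinate keeps the paths apart.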

Based on the above result, we now have a simple method for training a Neural ODE to approximate a given continuous, invertible mapping $h$ and, for free, also obtain its continuous inverse $h^{-1}$. On input, each sample is augmented with $p+1$ zeros. For a given $x$, the output of the ODE-Net is split into two parts. The first $p$ dimensions are connected to a loss function that penalizes deviation from $h(x)$. The remaining $p+1$ dimensions are connected to a loss function that penalizes any deviation from 0. Once the network is trained, we can get $h^{-1}$ by using an ODE-Net with $-f_\Theta$ instead of $f_\Theta$ used in the trained ODE-Net.
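The input augmentation and the two-part loss described above can be sketched as follows. This is a sketch under assumptions: squared error is used for both parts, the two terms are weighted equally, and the helper names `augment` and `split_loss` are hypothetical; the text does not prescribe a particular loss or weighting.

```python
def augment(x, extra):
    # Append `extra` zeros to the sample; for a (2p+1)-ODE-Net, extra = p + 1.
    return list(x) + [0.0] * extra

def split_loss(output, target, p):
    # First p dimensions: penalize deviation from the target h(x).
    fit = sum((o - t) ** 2 for o, t in zip(output[:p], target))
    # Remaining dimensions: penalize any deviation from zero.
    pad = sum(o ** 2 for o in output[p:])
    return fit + pad

p = 2
x_aug = augment([1.0, 2.0], p + 1)           # input to the ODE-Net
perfect = [1.0, 2.0, 0.0, 0.0, 0.0]          # made-up output matching target
imperfect = [1.0, 2.5, 0.0, 0.1, 0.0]        # made-up output with two errors
print(x_aug, split_loss(perfect, [1.0, 2.0], p), split_loss(imperfect, [1.0, 2.0], p))
```

In a real training loop the `output` vector would come from integrating the ODE-Net on `x_aug`, and the loss would be differentiated with respect to the network parameters.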

Acknowledgments

T.A. is supported by NSF grant IIS-1453658.

Appendix A

We briefly note that the quotient space from Section 3.3, the twisted cylinder, can be smoothly embedded in an as its submanifold, and the flow on then extended to a flow on that Euclidean space. The twisted cylinder is a smooth manifold. By virtue of the strong Whitney embedding theorem [Whi44], it can be embedded in -dimensional Euclidean space. To obtain a smooth embedding that additionally preserves as a linear subspace involving the first dimensions, , we can reuse the construction from Theorem 3, with one change. We need to be one-to-one, that is, , instead of a weaker condition . This can be achieved by re-defining to be a mapping, such that and are not only different for , but also not co-linear. It can be easily achieved by keeping the mapping as before for the first dimensions, and adding some nonlinear, smooth, positive-valued function of the first dimensions of as the dimension, for example the squared norm. If and are co-linear in the first dimensions, they will not be co-linear in the last dimension. Since now and are not co-linear, multiplying them by a trigonometric function as is done in Eq. 4 does not make them equal anywhere except for . But at , the first dimensions of are just , and are different for . Hence in one-to-one smooth mapping, as required by the conditions for a smooth embedding. The rest of the proof proceeds as in Theorem 3.