A neural network block is a function $f_\Theta$ that maps an input vector $x$ to an output vector and is parameterized by a vector $\Theta$. We require that $f_\Theta$ is almost everywhere differentiable with respect to both of its arguments, $x$ and $\Theta$. This allows the use of gradient-based methods for adjusting the weights based on training data and some optimization criterion, and for passing the gradient to preceding network layers.
One type of neural building block that has received attention in recent years is the residual block [HZRS16], which computes $x \mapsto x + g_\Theta(x)$, with $g_\Theta$ being some differentiable, nonlinear, possibly multi-layer transformation. Input and output dimensionality of a residual block are the same, and such blocks are usually stacked in a sequence, $x_{t+1} = x_t + g_{\Theta_t}(x_t)$. Often, the functional form of $g_{\Theta_t}$ is the same for all blocks in the sequence, $t \in \{1, \dots, T\}$. Then, we can represent the sequence through $x_{t+1} - x_t = g_\Theta(x_t, t)$, where $\Theta$ consists of trainable parameters for all blocks in the sequence; the second argument, $t$, allows us to pick the proper subset of parameters, $\Theta_t$. If we allow arbitrary $g_\Theta$, a sequence of residual blocks can, in principle, model arbitrary mappings $x_0 \mapsto x_T$, where we define $x_T$ to be the result of applying the sequence of residual blocks to the initial input $x_0$. A recent result [LJ18]
shows that a linear layer preceded by a deep sequence of residual blocks with only one neuron in the hidden layer is a universal approximator for Lebesgue-integrable functions.
1.1 Neural Ordinary Differential Equations
Neural ODEs (ODE-Nets) [CRBD18] are a recently proposed class of differentiable neural network building blocks. ODE-Nets were formulated by observing that processing an initial input vector $x_0$ through a sequence of residual blocks can be seen as evolution of $x_t$ in time $t \in \{1, \dots, T\}$. Then, a residual block (eq. 1) is a discretization of a continuous-time system of ordinary differential equations (eq. 2):

$$x_{t+1} - x_t = g_\Theta(x_t, t), \tag{1}$$

$$\frac{dx_t}{dt} = g_\Theta(x_t, t). \tag{2}$$
The transformation taking into realized by an ODE-Net for some chosen, fixed time is not specified directly through a functional relationship for some neural network , but indirectly, through the solutions to the initial value problem (IVP) of the ODE
involving some underlying neural network with trainable parameters . By a -ODE-Net we denote an ODE-Net that takes a -dimensional sample vector on input, and produces a -dimensional vector on output. The underlying network must match those dimensions on its input and output, but in principle can have arbitrary internal architecture.
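The relation between a stack of residual blocks and the initial value problem can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function `g` is a hypothetical stand-in for the underlying network, and forward Euler is used as the simplest possible solver, not as the solver used in [CRBD18]. With a unit step, forward Euler reproduces the residual stack exactly; with finer steps it approaches the IVP solution.

```python
import numpy as np

# Hypothetical stand-in for the underlying network g(x, t); any smooth,
# bounded function would serve the same illustrative purpose.
A = np.array([[0.0, -1.0], [1.0, 0.0]])

def g(x, t):
    return np.tanh(A @ x) * (1.0 + 0.1 * t)

def residual_stack(x0, num_blocks):
    # x_{t+1} = x_t + g(x_t, t): one residual block per unit of time.
    x = np.asarray(x0, dtype=float)
    for t in range(num_blocks):
        x = x + g(x, float(t))
    return x

def euler_ivp(x0, T, steps_per_unit):
    # Forward Euler for dx/dt = g(x, t) on [0, T], step size 1/steps_per_unit.
    x = np.asarray(x0, dtype=float)
    h = 1.0 / steps_per_unit
    for k in range(int(T * steps_per_unit)):
        x = x + h * g(x, k * h)
    return x

x0 = np.array([1.0, 0.5])
res_out = residual_stack(x0, 3)        # three residual blocks
euler_unit = euler_ivp(x0, 3, 1)       # Euler with unit step: same computation
euler_fine = euler_ivp(x0, 3, 1000)    # finer discretizations of the same ODE
euler_finer = euler_ivp(x0, 3, 2000)
```

The residual stack and the continuous-time solution generally differ; the point is only that the former is the coarsest discretization of the latter.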
The adjoint sensitivity method [PMBG62] based on reverse-time integration of an expanded ODE allows for finding gradients of the IVP solutions with respect to parameters and the initial values . This allows training ODE-Nets using gradient descent, as well as combining them with other neural network blocks.
Benefits of ODE-Nets compared to residual blocks include improved memory and parameter efficiency, ease of modeling phenomena with continuous time dynamics, out-of-the-box invertibility (), and simplified computations of normalizing flows [CRBD18]. Since their introduction ODE-Nets have seen improved implementations [RIM19] and enhancements in training and stability [GKB19, ZYG19]. The question of their approximation capabilities remains, however, unresolved.
1.2 Limitations of Neural ODEs
Unlike a residual block, a Neural ODE on its own does not have universal approximation capability. Consider a continuous, differentiable, invertible function on . There is no ODE defined on that would result in . Informally, in ODEs, paths between the initial value and final value have to be continuous and cannot intersect in for two different initial values, and paths corresponding to and would need to intersect. By contrast, in a residual block sequence, a discrete dynamical system on , we do not have continuous paths, only points at unit-time intervals, with an arbitrary transformation between points; finding a ResNet for is easy.
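The non-intersection argument can be checked numerically in one dimension. The sketch below makes several assumptions that are not in the text: the sign flip $x \mapsto -x$ is taken as an example of a continuous, differentiable, invertible map on the real line, `vector_field` is an arbitrary Lipschitz right-hand side chosen for illustration, and a hand-written fixed-step RK4 integrator stands in for an ODE solver. Because one-dimensional trajectories cannot cross, the flow preserves the ordering of initial values, so it cannot realize the sign flip, whereas a single residual block can.

```python
import numpy as np

def rk4_flow(f, x0, T, steps=2000):
    # Fixed-step classical Runge-Kutta integration of dx/dt = f(x, t).
    x = np.array(x0, dtype=float)
    h = T / steps
    for k in range(steps):
        t = k * h
        k1 = f(x, t)
        k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x

def vector_field(x, t):
    # Arbitrary Lipschitz scalar field (an assumption for illustration),
    # applied elementwise so each entry of x is an independent 1-D trajectory.
    return np.sin(3.0 * x) + np.cos(t) - 0.5 * x

x0 = np.linspace(-2.0, 2.0, 41)                 # increasing initial values
xT = rk4_flow(vector_field, x0, T=1.0)
order_preserved = bool(np.all(np.diff(xT) > 0))  # trajectories never cross

def residual_block(x):
    # x + g(x) with g(x) = -2x realizes the sign flip in a single block.
    return x + (-2.0 * x)

flipped = residual_block(x0)
```

An order-reversing map such as the sign flip would require `order_preserved` to be false for some vector field, which uniqueness of ODE solutions rules out.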
1.3 Our Contribution
We analyze the approximation capabilities of ODE-Nets. The results most closely related to ours have been recently provided by the authors of ANODE [DDT19], who focus on a -ODE-Net followed by a linear layer. They provide counterexamples showing that such an architecture is not a universal approximator of functions. However, they show empirical evidence indicating that expanding the dimensionality and using -ODE-Net for instead of a -ODE-Net has positive impact on training.
Here, we prove that setting is enough to turn a Neural ODE followed by a linear layer into a universal approximator. Next, we turn our attention to invertible functions – homeomorphisms – by exploring pure -ODE-Nets, not capped by a linear layer. We go beyond the example, and show a class of invertible mappings that cannot be expressed by Neural ODEs defined on . Our main result is a proof that any homeomorphism , for , can be modeled by a Neural ODE operating on a Euclidean space of dimensionality that embeds as a linear subspace.
2 Neural ODEs are Universal Approximators
We show, through a simple construction, that a Neural ODE followed by a linear layer can approximate functions as well as any traditional feed-forward neural network. Since networks with a shallow-but-wide fully-connected architecture [Cyb89, Hor91] or a narrow-but-deep ResNet-based architecture [LJ18] are universal approximators, so are ODE-Nets.
Theorem 1. Consider a neural network that approximates a Lebesgue-integrable function , with being a compact subset. For any , , there exists a linear layer-capped -ODE-Net that can perform the mapping .
Proof. Set . Let be a neural network that takes input vectors (footnote 1: we use a superscript to denote dimensionality of vectors; that is, ) and produces -dimensional output vectors , where is the desired transformation. is constructed as follows: use to produce , ignore , and always output . Consider a -ODE-Net defined through . Let the initial value be . The ODE will not alter the first dimensions throughout time, hence for any , . After time , we will have
Thus, for any , the output can be recovered from the output of the ODE-Net by a simple, sparse linear layer that ignores all dimensions except the last one, which it returns. ∎
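The construction in the proof can be sketched in a few lines. Everything named below is a hypothetical choice made for illustration: `h` is an arbitrary target function standing in for the approximating network, `p = 2` is an arbitrary input dimensionality, and a fixed-step Euler loop stands in for the ODE solver. The vector field reads only the first `p` coordinates, leaves them unchanged, and accumulates the target value in the one extra coordinate, which a sparse linear layer then reads out.

```python
import numpy as np

p = 2  # input dimensionality (arbitrary choice for this sketch)

def h(x):
    # Hypothetical target function from R^p to R.
    return np.sin(x[0]) + x[1] ** 2

def g(z):
    # Vector field on R^(p+1): zero derivative for the first p coordinates,
    # and h(x) as the derivative of the last coordinate.
    x = z[:p]
    return np.concatenate([np.zeros(p), [h(x)]])

def integrate(z0, T=1.0, steps=10):
    # Forward Euler; exact here because g is constant along each trajectory.
    z = np.array(z0, dtype=float)
    dt = T / steps
    for _ in range(steps):
        z = z + dt * g(z)
    return z

x = np.array([0.3, -1.2])
z0 = np.concatenate([x, [0.0]])   # initial value: the input with one zero appended
zT = integrate(z0)                # state after time T = 1

w = np.array([0.0, 0.0, 1.0])     # sparse linear layer keeping the last coordinate
y = w @ zT
```

As the text notes, the output is computed entirely by the underlying network, so the sketch demonstrates expressiveness only, not efficiency or invertibility.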
ODE-Nets have two main advantages compared to traditional architectures: improved computational and space efficiency, and out-of-the-box invertibility. The construction above nullifies both, and thus is of theoretical interest only. This introduces two new open problems: can Neural ODEs be universal approximators while showing improved efficiency compared to traditional architectures, and can Neural ODEs model any invertible function , assuming and are continuous. The main focus of this work is to address the second problem.
3 Background on ODEs, Flows, and Embeddings
A mapping is a homeomorphism if is a one-to-one mapping of onto itself, and both and its inverse are continuous. Here, we will assume that for some , and we will use the term -homeomorphism where dimensionality matters.
A topological transformation group or a flow [Utz81] is an ordered triple involving an additive group with neutral element 0, and a mapping such that and for all , all . Further, mapping is assumed to be continuous with respect to the first argument. The mapping gives rise to a parametric family of homeomorphisms defined as , with the inverse being .
Given a flow, an orbit or a trajectory associated with is a subspace . Given , either or ; two orbits are either identical or disjoint, they never intersect. A point is a fixed point if .
A discrete flow is defined by setting . For arbitrary homeomorphism of onto itself, we easily get a corresponding discrete flow, an iterated discrete dynamical system, , , .
A type of flow relevant to Neural ODEs is a continuous flow, defined by setting , and adding an assumption that the family of homeomorphisms, the function , is differentiable with respect to its second argument, , with continuous . The key difference compared to a discrete flow is that the flow at time , , is now defined for arbitrary , not just for integers. We will use the term -flow to indicate that .
Informally, in a continuous flow the orbits are continuous, and the property that orbits never intersect has consequences for what homeomorphisms can result from a flow. Unlike in the discrete case, for a given homeomorphism there may not be a continuous flow such that for some . We cannot just set ; what is required is a continuous family of homeomorphisms such that and is identity, and such a family may not exist for some . In such a case, a Neural ODE would not be able to model the mapping .
3.2 Correspondence between Flows and ODEs
Given a continuous flow one can define a corresponding ODE operating on by defining a vector for every such that . Then, the ODE
corresponds to continuous flow . Indeed, is identity, and . Thus, for any homeomorphism family defining a continuous flow, there is a corresponding ODE that, integrated for time , models the flow at time , .
The vectors of derivatives for all are continuous over and are constant in time, and define a continuous vector field over . The ODEs evolving according to such a time-invariant vector field, where the right-hand side of eq. 2 depends on but not directly on time , are called autonomous ODEs, and take the form of .
Any time-dependent ODE (eq. 2) can be transformed into an autonomous ODE by removing time from being a separate argument of , and adding it as part of the vector . Specifically, we add an additional dimension (footnote 2: we use to denote the -th component of vector ) to vector , with . We equate it with time, , by including in the definition of how acts on , and including in the initial value . In defining , explicit use of as a variable is replaced by using the component of vector . The result is an autonomous ODE.
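The transformation into an autonomous ODE can be written as a small wrapper. In this sketch the names `f`, `make_autonomous`, and `euler` are illustrative assumptions, `f` is an arbitrary time-dependent right-hand side, and forward Euler stands in for a real solver. The wrapper appends one coordinate whose derivative is 1 and whose initial value is 0, so that it tracks time, and passes it to `f` in place of the explicit time argument.

```python
import numpy as np

def f(x, t):
    # Hypothetical time-dependent right-hand side on R^2.
    return np.array([-x[0] + np.sin(t), x[0] * np.cos(t)])

def make_autonomous(f):
    # Returns g(z) acting on z = (x, tau), where the last coordinate tau
    # plays the role of time: d(tau)/dt = 1 and tau(0) = 0.
    def g(z):
        x, tau = z[:-1], z[-1]
        return np.concatenate([f(x, tau), [1.0]])
    return g

def euler(rhs, state0, T, steps, time_dependent):
    # Forward Euler for either dx/dt = rhs(x, t) or dz/dt = rhs(z).
    state = np.array(state0, dtype=float)
    h = T / steps
    for k in range(steps):
        step = rhs(state, k * h) if time_dependent else rhs(state)
        state = state + h * step
    return state

x0 = np.array([1.0, -0.5])
T = 2.0
x_T = euler(f, x0, T, 400, time_dependent=True)
z0 = np.concatenate([x0, [0.0]])   # time coordinate starts at 0
z_T = euler(make_autonomous(f), z0, T, 400, time_dependent=False)
```

Both integrations follow the same trajectory in the original coordinates, and the extra coordinate of `z_T` ends at `T`.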
Given time and an ODE defined by , , the flow at time , may not be well defined, for example if diverges to infinity along the way. However, if is well-behaved, the flow will exist at least locally around the initial value. Specifically, Picard–Lindelöf theorem states that if an ODE is defined by a Lipschitz-continuous function , then there exists such that the flow at time , , is well-defined and unique for . If exists, is a homeomorphism, since the inverse exists and is continuous; simply, is the inverse of .
3.3 Flow Embedding Problem for Homeomorphisms
Given a -flow, we can always find a corresponding ODE. Given an ODE, under mild conditions, we can find a corresponding flow at time , , and it necessarily is a homeomorphism. Is the class of -flows equivalent to the class of -homeomorphisms, or only to its subset? That is, given a homeomorphism , does a -flow such that exist? This question is referred to as the problem of embedding the homeomorphism into a flow.
For a homeomorphism , its restricted embedding into a flow is a flow such that for some ; the flow is restricted to be on the same domain as the homeomorphism. Studies of homeomorphisms on simple domains such as a 1D segment [For55] or a 2D plane [And65] already showed that a restricted embedding does not always exist.
An unrestricted embedding into a flow [Utz81] is a flow on some space of dimensionality higher than . It involves a homeomorphism that maps into some subset , such that the flow on results in mappings on that are equivalent to on for some , that is, . While a solution to the unrestricted embedding problem always exists, it involves a smooth, non-Euclidean manifold . For a homeomorphism , the manifold , variously referred to as the twisted cylinder [Utz81], or a suspension under a ceiling function [BS02], or a mapping torus [Bro66], is a quotient space defined through the equivalence relation . The flow that maps at to at and at involves trajectories in in the following way: for going from 0 to 1, the trajectory tracks in a straight line from to , which in the quotient space is equivalent to . Then, for going from 1 to 2, the trajectory proceeds from to .
The fact that the solution to the unrestricted embedding problem involves a flow on a non-Euclidean manifold makes applying it in the context of gradient-trained ODE-Nets difficult.
4 Approximation of Homeomorphisms by Neural ODEs
In exploring the approximation capabilities of Neural ODEs for -homeomorphisms, we will assume that the neural network on the right hand side of the ODE is a universal approximator and thus can be made large enough to approximate arbitrary function arbitrarily well. Thus, our concern is with what flows can be modeled assuming ODE-Net can have arbitrary internal dimensionality, depth, and architecture. We only care about the input-output dimensionality of the -ODE-Net. We consider two scenarios, , and .
4.1 Restricting the Dimensionality Limits Capabilities of Neural ODEs
We show a class of functions that a Neural ODE cannot model, a class that generalizes the one-dimensional example.
Theorem 2. Let , and let be a set that partitions into two or more disjoint, connected subsets , for . Consider a mapping that
- is an identity transformation on , that is, ,
- maps some into , for .
Then, no -ODE-Net can model .
Proof. A -ODE-Net can model if a restricted flow embedding of exists. Suppose that it does, so that a continuous flow can be found for such that the trajectory of is continuous on with and for some , for all .
If maps some into , for , the trajectory from to crosses – there is such that for some . From uniqueness and reversibility of ODE trajectories, we then have . From additive property of flows, we have .
Since is identity over and , thus . That is, the trajectory over time is a closed curve starting and ending at , and for any . Specifically, . Thus, . We arrive at a contradiction with the assumption that and are in two disjoint subsets of separated by . Thus, no -ODE-Net can model .
The result above shows that Neural ODEs applied in the most natural way, with , are severely restricted in the way distinct regions of the input space can be rearranged in order to learn and generalize from the training set, and the restrictions go well beyond requiring invertibility and continuity.
4.2 Neural ODEs with Extra Dimensions are Universal Approximators for Homeomorphisms
If we allow the Neural ODE to operate on a Euclidean space of dimensionality , we can approximate an arbitrary -homeomorphism , as long as is high enough. Here, we show that it suffices to take . We construct a mapping from the original problem space, into that
- preserves as a -dimensional linear subspace consisting of vectors ,
- leads to an ODE on that maps .
Thus, we provide a solution with a structure that is convenient for out-of-the-box training and inference using Neural ODEs – it is sufficient to add dimensions, all zeros, to the input vectors. Our main result is the following.
Theorem 3. For any homeomorphism , , there exists a -ODE-Net for such that for any .
Proof. We prove the existence in a constructive way, by showing a vector field in , and thus an ODE, with the desired properties. Let be defined as
where is bounded away from zero, and is a smooth, strictly monotonic function. It is applied to a vector entry-wise; in Fig. 1 we used .
We start with the extended space with a variable corresponding to time added as the last dimension, as in the construction of an autonomous ODE from a time-dependent ODE. We then define a mapping . For , the mapping (see Fig. 1) is defined through
The mapping indeed just adds dimensions of 0 to at time , and at time it gives the result of the homeomorphism applied to , again with dimensions of 0.
We can use these properties to define the mapping for , by setting ; for example, . Intuitively, the mapping will provide the position in of the time evolution for duration of an ODE on starting from a position corresponding to .
For , for any given , we have , since is a one-to-one mapping – it was defined by a strictly monotonic function . Thus, in , paths starting from two distinct points do not intersect at the same point in time. Intuitively, we have added enough dimensions to the original space so that we can reroute all trajectories without intersections.
We have correspond directly to time, that is, and for . The mapping has continuous derivative with respect to , defining a vector field over the image of , a subset of
We can verify that the vector field defined through derivatives of with respect to time has the same values for and for any
the vector field is well-behaved at – it is continuous over the whole image of . The vector field above is defined over a closed subset of , and can be (see [Lee01], Lemma 8.6) extended to the whole . A -ODE-Net with a universal approximator network on the right hand side can be designed to approximate the vector field arbitrarily well. The resulting ODE-Net approximates to . ∎
Based on the above result, we now have a simple method for training a Neural ODE to approximate a given continuous, invertible mapping and, for free, also obtain its continuous inverse . On input, each sample is augmented with zeros. For a given , the output of the ODE-Net is split into two parts. The first dimensions are connected to a loss function that penalizes deviation from the value of the target mapping for that sample. The remaining dimensions are connected to a loss function that penalizes any deviation from 0. Once the network is trained, we can get by using an ODE-Net with instead of used in the trained ODE-Net.
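The input augmentation and the two-part loss described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the helper names `augment` and `split_loss` are hypothetical, squared error is an assumed choice for both penalties, and the number of appended zero dimensions is left as a parameter because it depends on the dimensionality chosen in the theorem. The ODE-Net itself is not modeled here; `output` stands for whatever the network returns for the augmented batch.

```python
import numpy as np

def augment(x, extra_dims):
    # Append extra_dims zero-valued coordinates to every sample in the batch.
    x = np.atleast_2d(np.asarray(x, dtype=float))
    zeros = np.zeros((x.shape[0], extra_dims))
    return np.hstack([x, zeros])

def split_loss(output, target, p):
    # First p output coordinates: penalize deviation from the target values.
    fit = np.mean(np.sum((output[:, :p] - target) ** 2, axis=1))
    # Remaining coordinates: penalize any deviation from zero.
    pad = np.mean(np.sum(output[:, p:] ** 2, axis=1))
    return fit, pad

# Toy batch with p = 2 and two appended dimensions (illustrative sizes only).
x = np.array([[1.0, 2.0], [3.0, 4.0]])
z = augment(x, extra_dims=2)

target = 2.0 * x                              # placeholder target values
ideal_output = np.hstack([target, np.zeros((2, 2))])
ideal_losses = split_loss(ideal_output, target, p=2)
shifted_losses = split_loss(ideal_output + 1.0, target, p=2)
```

In training, the two terms would typically be summed, possibly with a weighting factor; the choice of that weighting is not specified in the text.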
T.A. is supported by NSF grant IIS-1453658.
- [And65] Stephen A Andrea. On homeomorphisms of the plane, and their embedding in flows. Bulletin of the American Mathematical Society, 71(2):381–383, 1965.
- [Bro66] William Browder. Manifolds with $\pi_1 = \mathbb{Z}$. Bulletin of the American Mathematical Society, 72(2):238–244, 1966.
- [BS02] Michael Brin and Garrett Stuck. Introduction to dynamical systems. Cambridge university press, 2002.
- [CRBD18] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
- [Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- [DDT19] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
- [For55] Marion Kirkland Fort. The embedding of homeomorphisms in flows. Proceedings of the American Mathematical Society, 6(6):960–967, 1955.
- [GKB19] Amir Gholami, Kurt Keutzer, and George Biros. ANODE: unconditionally accurate memory-efficient gradients for neural ODEs. arXiv preprint arXiv:1902.10298, 2019.
- [Hor91] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
- [HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [Lee01] John M Lee. Introduction to smooth manifolds. Springer, 2001.
- [LJ18] Hongzhou Lin and Stefanie Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems, pages 6169–6178, 2018.
- [PMBG62] Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathematical theory of optimal processes. 1962.
- [RIM19] Chris Rackauckas, Mike Innes, Yingbo Ma, Jesse Bettencourt, Lyndon White, and Vaibhav Dixit. DiffEqFlux.jl-a Julia library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
- [Utz81] WR Utz. The embedding of homeomorphisms in continuous flows. In Topology Proc, volume 6, pages 159–177, 1981.
- [Whi44] Hassler Whitney. The singularities of a smooth $n$-manifold in $(2n-1)$-space. Ann. of Math, 45(2):247–293, 1944.
- [You10] Laurent Younes. Shapes and diffeomorphisms, volume 171. Springer, 2010.
- [ZYG19] Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019.
We briefly note that the quotient space from Section 3.3, the twisted cylinder, can be smoothly embedded in an as its submanifold, and the flow on then extended to a flow on that Euclidean space. The twisted cylinder is a smooth manifold. By virtue of the strong Whitney embedding theorem [Whi44], it can be embedded in -dimensional Euclidean space. To obtain a smooth embedding that additionally preserves as a linear subspace involving the first dimensions, , we can reuse the construction from Theorem 3, with one change. We need to be one-to-one, that is, , instead of the weaker condition . This can be achieved by re-defining to be a mapping, such that and are not only different for , but also not co-linear. This is easily achieved by keeping the mapping as before for the first dimensions, and adding some nonlinear, smooth, positive-valued function of the first dimensions of as the dimension, for example the squared norm. If and are co-linear in the first dimensions, they will not be co-linear in the last dimension. Since now and are not co-linear, multiplying them by a trigonometric function as is done in Eq. 4 does not make them equal anywhere except for . But at , the first dimensions of are just , and are different for . Hence is a one-to-one smooth mapping, as required by the conditions for a smooth embedding. The rest of the proof proceeds as in Theorem 3.