1 Introduction and Problem Formulation
Despite the empirical success of deep learning, one outstanding challenge is to develop a useful theoretical framework to understand its effectiveness by capturing the effect of sequential function composition in deep neural networks. In some sense, this is a distinguishing feature of deep learning that separates it from traditional machine learning methodologies.
One candidate for such a framework is the dynamical systems approach [weinan2017proposal, Haber2017, li2017maximum], which regards deep neural networks as a discretization of an ordinary differential equation. Consequently, the latter can be regarded as the object of analysis in place of the former. An advantage of this idealization is that a host of mathematical tools from dynamical systems, optimal control and differential equations can then be brought to bear on various issues faced in deep learning and, more importantly, shed light on the role of composition in function approximation and learning.
Since its introduction, the dynamical systems approach has led to much progress in terms of novel algorithms [Parpas2019, Zhang2019], architectures [Haber2017, Chang2018, Ruthotto2018, pmlr-v80-lu18d, Wang2018, Tao2018, Sun2018] and emerging applications [Zhang2018a, Zhang2018b, Chen2018, Lu2019]. In contrast, the present work is focused on the theoretical underpinnings of this approach. From the optimization perspective, it has been established that learning in this framework can be recast as a mean-field optimal control problem [li2017maximum, weinan2019mean], and local and global characterizations can be derived based on generalizations of the classical Pontryagin's maximum principle and the Hamilton-Jacobi-Bellman equation. Other theoretical developments include continuum limits and connections to optimal transport [Sonoda2017, Thorpe2018, Sonoda2019]. Nevertheless, other fundamental questions in this approach remain largely unexplored, especially when it comes to the function approximation properties of these continuous-time idealizations of deep neural networks. In this paper, we establish some basic results in this direction.
The Dynamical Systems Viewpoint.
To set the context, we first introduce the dynamical systems viewpoint of deep learning [weinan2017proposal, li2017maximum, Ruthotto2018, weinan2019mean]. For simplicity, we will discuss fully-connected residual neural networks for this exposition, but our main results will largely not depend on such explicit architectures.
Essentially, supervised learning seeks to approximate some function $F:\mathcal{X}\rightarrow\mathcal{Y}$, which we call the oracle, from samples of data from this oracle function. The set $\mathcal{X}$ is the input space (e.g. images, time series) and $\mathcal{Y}$ is the space of outputs (e.g. labels of images, next value of the time series). As is typical in machine learning, to do this we define a hypothesis space
$$\mathcal{H}=\{F_\theta:\mathcal{X}\rightarrow\mathcal{Y}\;\mid\;\theta\in\Theta\} \tag{1}$$
and learning amounts to finding a particular $\hat{F}\in\mathcal{H}$ which closely approximates $F$ in some sense. For example, $\hat{F}$ can be found as a solution of the optimization problem $\inf_{G\in\mathcal{H}}\|F-G\|$, with some appropriately chosen norm; alternatively, we can solve the empirical risk minimization problem to obtain $\hat{F}=\operatorname{arg\,min}_{G\in\mathcal{H}}\frac{1}{N}\sum_{i=1}^{N}\|G(x^i)-y^i\|$, where $\{x^i\}_{i=1}^{N}$ are samples from the input space with corresponding labels $y^i=F(x^i)$. Just like classical linear basis models, support vector machines and so on, deep learning is yet another choice of hypothesis space, which we now describe.
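Let us take $\mathcal{X}=\mathbb{R}^n$ as the input space and $\mathcal{Y}=\mathbb{R}^m$ as the space of outputs. A deep, residual, fully-connected neural network applies a sequence of transformations to the inputs via the recursion
$$z_{s+1}=z_s+V_s\,\sigma(W_s z_s+b_s),\qquad z_0=x,\qquad s=0,\dots,S-1. \tag{2}$$
Here, $x\in\mathbb{R}^n$ is the input, $z_s\in\mathbb{R}^n$ is the hidden state at the $s$-th layer and
$$W_s\in\mathbb{R}^{q\times n},\qquad V_s\in\mathbb{R}^{n\times q},\qquad b_s\in\mathbb{R}^{q}. \tag{3}$$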
The variables $(W_s,V_s)$ and $b_s$ are the weights and biases of the $s$-th transformation layer respectively, with $q$ the hidden node size (typically, $q\ge n$). Together, they constitute the trainable parameters in the $s$-th layer. The activation function $\sigma:\mathbb{R}\to\mathbb{R}$ is applied element-wise; a popular choice is the rectified linear unit (ReLU): $\sigma(z)=\max(0,z)$. Other choices include the sigmoid: $\sigma(z)=1/(1+e^{-z})$, the tanh: $\sigma(z)=\tanh(z)$, and so on. The number $S$ is known as the depth of the neural network and it can be quite large for modern architectures, e.g. on the order of hundreds. The output of the entire network is $z_S$, which can then be compared, after a further transformation $g:\mathbb{R}^n\to\mathbb{R}^m$, to the label $y$ corresponding to $x$. The transformation function $g$ is typically kept simple, say an affine function.
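To make the layer recursion concrete, the following is a minimal sketch of a residual forward pass. It assumes the recursion has the form $z_{s+1}=z_s+V_s\sigma(W_s z_s+b_s)$ with a ReLU activation; the helper names (`residual_layer`, `resnet_forward`) and the toy weights are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(x: float) -> float:
    """Rectified linear unit, applied to a single scalar."""
    return max(0.0, x)


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def residual_layer(z: Vector, W: Matrix, V: Matrix, b: Vector,
                   sigma: Callable[[float], float] = relu) -> Vector:
    """One residual step: z + V * sigma(W z + b), sigma applied element-wise."""
    hidden = [sigma(h + bias) for h, bias in zip(matvec(W, z), b)]
    update = matvec(V, hidden)
    return [zi + ui for zi, ui in zip(z, update)]


def resnet_forward(x: Vector,
                   params: Sequence[Tuple[Matrix, Matrix, Vector]]) -> Vector:
    """Apply the residual recursion once per (W, V, b) triple, starting from z_0 = x."""
    z = list(x)
    for W, V, b in params:
        z = residual_layer(z, W, V, b)
    return z


# Toy example (assumed values): input dimension n = 2, hidden size q = 3, depth S = 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
V = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
b = [0.0, 0.0, 0.0]
z_S = resnet_forward([1.0, -2.0], [(W, V, b), (W, V, b)])
```

Note that each layer adds an update to the running state rather than replacing it; this additive structure is what later allows the recursion to be read as a time-stepping scheme.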
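We can write (2) compactly by defining $\theta_s=(W_s,V_s,b_s)$ and $f_\theta(z)=V\sigma(Wz+b)$ for $\theta=(W,V,b)$ to get
$$z_{s+1}=z_s+f_{\theta_s}(z_s), \tag{4}$$
which gives the deep residual network hypothesis space
$$\mathcal{H}_{\mathrm{resnet}}(S)=\{x\mapsto g(z_S)\;:\;z_0=x,\ z_{s+1}=z_s+f_{\theta_s}(z_s),\ \theta_s\in\Theta,\ s=0,\dots,S-1\}. \tag{5}$$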
In contrast with traditional approaches, the distinguishing feature of the deep residual network hypothesis space is the presence of iterated transformations governed by (4). This property makes direct analysis difficult due to the lack of mathematical tools to handle function compositions.
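The dynamical systems approach addresses this difficulty by idealizing (4) as the continuous-time system
$$\dot z(t)=f_{\theta(t)}(z(t)),\qquad z(0)=x,\qquad t\in[0,T]. \tag{6}$$
That is, we replace the discrete layer numbers $s$ by a continuous variable $t$, which results in new continuous-time dynamics described by an ordinary differential equation (ODE). Note that for this approximation to be precise, one would need a slight modification of the right-hand side of (4) into $z_{s+1}=z_s+\delta f_{\theta_s}(z_s)$ for some small $\delta>0$. The limit $\delta\to0$, $S\to\infty$ with $S\delta=T$ held constant gives (6) with the identification $t\approx s\delta$. Empirical work shows that this modification is justified since, for trained deep residual networks, $\|z_{s+1}-z_s\|$ tends to be small [Veit2016, Jastrzebski2017]. Consequently, the trainable variables $\theta$ are now indexed by a continuous variable $t$. We will assume that each $f_\theta$ is a Lipschitz continuous function on $\mathbb{R}^n$, so that (6) admits unique solutions (see Proposition 3.1). For a terminal time $T>0$, $z(T)$ can be seen as a function of its initial condition $x$. Thus, we obtain a function $\varphi_T:\mathbb{R}^n\to\mathbb{R}^n$, mapping each $x$ to the solution $z(T)$ at time horizon $T$. The map $\varphi_T$ is known as the Poincaré map or the flow map of the dynamical system (6). As a result, we can replace the hypothesis space (5) by
$$\mathcal{H}_{\mathrm{ode}}(T)=\{g\circ\varphi_T\;:\;\varphi_T\text{ is the flow map of (6) for some }\theta:[0,T]\to\Theta\} \tag{7}$$
with the terminal time $T$ playing the role of depth. In words, this hypothesis space contains functions which are $g$ composed with flow maps of a dynamical system in the form of an ODE. It is also convenient to consider the hypothesis space of arbitrarily deep continuous-time networks as the union
$$\mathcal{H}_{\mathrm{ode}}=\bigcup_{T>0}\mathcal{H}_{\mathrm{ode}}(T). \tag{8}$$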
The key advantage of this viewpoint is that a variety of tools from continuous-time analysis can be used to analyze various issues in deep learning. This was pursued, for example, in [li2017maximum, pmlr-v80-li18b] for learning algorithms and in [Ruthotto2018, Chang2017, Chang2018] for network stability. In this paper, we are concerned with the problem of approximation, which is one of the most basic mathematical questions we can ask given a hypothesis space. Let us outline the problem below.
The Problem of Approximation.
The problem of approximation essentially asks how big $\mathcal{H}_{\mathrm{ode}}$ is. In other words, what kind of functions can we approximate using functions in $\mathcal{H}_{\mathrm{ode}}$? Before we present our results, let us first distinguish the concept of approximation from that of representation.
We say that a function $F$ can be represented by $\mathcal{H}_{\mathrm{ode}}$ if $F\in\mathcal{H}_{\mathrm{ode}}$.
In contrast, we say that $F$ can be approximated by $\mathcal{H}_{\mathrm{ode}}$ if for any $\varepsilon>0$, there exists an $\hat{F}\in\mathcal{H}_{\mathrm{ode}}$ such that $\|F-\hat{F}\|\le\varepsilon$. Here, $\|\cdot\|$ is some appropriately chosen norm.
Therefore, representation and approximation are mathematically distinct notions. The fact that some class of mappings cannot be represented by $\mathcal{H}_{\mathrm{ode}}$ does not prevent it from being approximated by $\mathcal{H}_{\mathrm{ode}}$ to arbitrary accuracy. For example, it is well-known that flow maps must be orientation-preserving (OP) homeomorphisms, which are a very small set of functions in the Baire category sense [palis1974vector], but it is also known that OP homeomorphisms are dense in $L^p$ in dimensions larger than one [brenier2003p].
In this paper we will work mostly in continuous time. Nevertheless, it makes sense to ask what the results in continuous time imply for discrete dynamics. After all, the latter is what we can actually implement in practice as machine learning models. Observe that in the reverse direction, the residual recursion $z_{s+1}=z_s+\delta f_{\theta_s}(z_s)$ can be seen as a forward Euler discretization of (6) with step size $\delta$. It is well-known that for a finite time horizon and a fixed compact domain, Euler discretization has a global error in supremum norm of $\mathcal{O}(\delta)$ [Leveque2007Finite, Atkinson2013Theoretical]. In other words, any function in $\mathcal{H}_{\mathrm{ode}}$ can be uniformly approximated by a discrete residual network provided the number of layers is large enough. Consequently, if a function can be approximated by $\mathcal{H}_{\mathrm{ode}}$, then it can be approximated by a sufficiently deep residual neural network corresponding to an Euler discretization. In this sense, approximation results in continuous time have immediate consequences for their discrete counterparts.
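The link between depth and discretization error can be checked numerically. The sketch below is an illustration under assumed choices, not part of the paper's analysis: it takes the hypothetical scalar vector field $f(z)=-z$, whose exact flow map is $x\mapsto xe^{-T}$, and compares it with a forward Euler "residual network" of $S$ layers and step size $T/S$.

```python
import math
from typing import Callable


def euler_flow(f: Callable[[float], float], x: float, T: float, S: int) -> float:
    """Forward Euler with S steps of size T/S: z <- z + (T/S) * f(z)."""
    dt = T / S
    z = x
    for _ in range(S):
        z = z + dt * f(z)
    return z


def f(z: float) -> float:
    """Hypothetical Lipschitz vector field used only for this illustration."""
    return -z


T = 1.0
x = 1.0
exact = x * math.exp(-T)  # exact flow map of dz/dt = -z at time T

# Error of the S-layer discretization for increasing depth.
errors = {S: abs(euler_flow(f, x, T, S) - exact) for S in (10, 100, 1000)}
```

Increasing the number of layers tenfold reduces the error by roughly a factor of ten in this example, consistent with the first-order accuracy of the Euler scheme.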
2 Main Results
In this section, we summarize our main results on the approximation properties of $\mathcal{H}_{\mathrm{ode}}$ and discuss their significance with respect to related results in the literature. Throughout this paper, we will adopt the following notation:
Let $K$ be a measurable subset of $\mathbb{R}^n$. We denote by $C(K)$ the space of real-valued continuous functions on $K$, with norm $\|f\|_{C(K)}=\sup_{x\in K}|f(x)|$. Similarly, for $p\in[1,\infty)$, $L^p(K)$ denotes the space of $p$-integrable measurable functions on $K$, with norm $\|f\|_{L^p(K)}=\left(\int_K|f(x)|^p\,dx\right)^{1/p}$. Vector-valued functions are denoted similarly.
A function $f$ on $K$ is called Lipschitz if $|f(x)-f(y)|\le L|x-y|$ holds for all $x,y\in K$. The smallest constant $L$ for which this is true is denoted as $\mathrm{Lip}(f)$.
We denote by $B(x,r)$ the closed ball of radius $r$ centered at $x$. If $S$ is a point set, then we define $B(S,r)=\bigcup_{x\in S}B(x,r)$.
Given a uniformly continuous function $f$, we denote by $\omega_f$ its modulus of continuity, i.e. $\omega_f(r)=\sup_{|x-y|\le r}|f(x)-f(y)|$.
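Let us begin with some definitions. Denote by $\mathcal{F}$ the set of functions that constitute the right-hand side of Equation (6):
$$\mathcal{F}=\{f_\theta\;:\;\theta\in\Theta\}. \tag{9}$$
This allows us to write $\mathcal{H}_{\mathrm{ode}}$ compactly without explicit reference to the parameterization as
$$\mathcal{H}_{\mathrm{ode}}=\bigcup_{T>0}\{g\circ\varphi_T\;:\;\varphi_T\text{ is the flow map of }\dot z(t)=f_t(z(t)),\ f_t\in\mathcal{F},\ t\in[0,T]\}. \tag{10}$$
We will hereafter call $\mathcal{F}$ a control family, since its elements control the dynamics induced by the differential equation (6). Unless specified otherwise, we assume $\mathcal{F}$ contains only Lipschitz functions, which ensures existence and uniqueness of solutions to the corresponding ODEs (see Proposition 3.1).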
Next, we introduce the concept of approximation closure which is used throughout this paper.
Definition 2.1 (Approximation Closure).
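Let $\mathcal{F}$ be a collection of continuous functions from $\mathbb{R}^n$ to $\mathbb{R}^n$. We denote by $\overline{\mathcal{F}}$ the approximation closure of $\mathcal{F}$, meaning that $f\in\overline{\mathcal{F}}$ if for any compact set $K\subset\mathbb{R}^n$ and $\varepsilon>0$, there exists $\hat{f}\in\mathcal{F}$, depending on $K$ and $\varepsilon$, such that $\|f-\hat{f}\|_{C(K)}\le\varepsilon$.
We also define the following shorthand for the approximation closure of the convex hull
$$\overline{\mathrm{CH}}(\mathcal{F}):=\overline{\mathrm{CH}(\mathcal{F})},$$
where $\mathrm{CH}(\mathcal{F})$ denotes the usual convex hull.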
In constructing approximation dynamics, a fundamental role is played by a type of functions called well functions, which we now define.
Definition 2.2 (Well Function).
We say a Lipschitz function $h:\mathbb{R}^n\to\mathbb{R}$ is a well function if there exists a bounded open convex set $\Omega\subset\mathbb{R}^n$ such that
$$\Omega\subset\{x\in\mathbb{R}^n\;:\;h(x)=0\}\subset\overline{\Omega}.$$
Here $\overline{\Omega}$ is the closure of $\Omega$ in the usual topology on $\mathbb{R}^n$.
Moreover, we say that a vector-valued function is a well function if each of its components is a well function in the sense above.
The name “well function” highlights the rough shape of this type of function: the zero set of a well function is like the bottom of a well. Of course, the “walls” of this well need not always point upwards; we only require that they are never zero outside of $\overline{\Omega}$.
We also define the notion of restricted affine invariance, which is weaker than the usual form of affine invariance.
Definition 2.3 (Restricted Affine Invariance).
Let $\mathcal{F}$ be a set of functions from $\mathbb{R}^n$ to $\mathbb{R}^n$. We say that $\mathcal{F}$ is restricted affine invariant if $f\in\mathcal{F}$ implies $Df(A\cdot+b)\in\mathcal{F}$, where $b\in\mathbb{R}^n$ is any vector, and $D$, $A$ are any $n\times n$ diagonal matrices such that the entries of $D$ are $\pm1$ or 0, and the entries of $A$ are smaller than or equal to 1.
Now, let us state our main result on universal approximation of functions by flow maps of dynamical systems in dimension $n\ge2$.
Theorem 2.4 (Sufficient Condition for Universal Approximation).
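Let $n\ge2$, and let $\mathcal{F}$ be some control family. Let the target function $F:\mathbb{R}^n\to\mathbb{R}^m$ be continuous and $K\subset\mathbb{R}^n$ be any compact set. Suppose that $g:\mathbb{R}^n\to\mathbb{R}^m$ is Lipschitz and $F(K)\subset g(\mathbb{R}^n)$. Consider the hypothesis space $\mathcal{H}_{\mathrm{ode}}$ defined in (10).
Assume $\mathcal{F}$ satisfies the following conditions:
1. $\mathcal{F}$ is restricted affine invariant (Definition 2.3).
2. $\overline{\mathrm{CH}}(\mathcal{F})$ contains a well function (Definition 2.2).
Then, for any $\varepsilon>0$ and $p\in[1,\infty)$, there exists an $\hat{F}\in\mathcal{H}_{\mathrm{ode}}$ such that
$$\|F-\hat{F}\|_{L^p(K)}\le\varepsilon.$$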
Theorem 2.4 establishes a sufficient condition on $\mathcal{F}$ under which the induced flow maps form a universal approximating class. The covering assumption $F(K)\subset g(\mathbb{R}^n)$ is in some sense necessary, for if the range of $g$ does not cover $F(K)$, say it misses an open subset $U\subset F(K)$, then no flow maps composed with it can approximate $F$. Fortunately, this condition is very easy to satisfy. For example, any non-degenerate linear function is Lipschitz and onto.
The requirement $n\ge2$ is also necessary. In one dimension, the result is actually false, due to the topological constraint induced by flow maps of dynamical systems. More precisely, for $n=1$ one can show that each flow map must be continuous and increasing, and furthermore that the approximation closure of the set of flow maps also contains only increasing functions. Hence, there is no hope of approximating any function that is strictly decreasing on an open interval. However, we can prove the next best thing in one dimension: any continuous and increasing function can be approximated by a dynamical system driven by the control family $\mathcal{F}$.
Theorem 2.5 (Sufficient Condition for Universal Approximation in 1D).
Let $n=1$. Then, the conclusion of Theorem 2.4 holds under the additional assumption that $F$ is increasing.
In the proofs of these results, we rely on using the flow of the dynamical system (6) to “rearrange” the domain of the function $g$ so that it resembles $F$. Here, the concept of a well function plays a central role. It serves to induce some universally controllable dynamics: the portion on which the well function equals 0 leaves points invariant, whereas the portion on which it is nonzero can drive points, via the restricted affine invariance assumption, to the desired locations. The combination of these effects is enough to rearrange the domain in an essentially arbitrary manner to achieve universal approximation. This gives a sketch of the proof of the main results in this paper.
Most existing theoretical work on the continuous-time dynamical systems approach to deep learning focuses on optimization aspects in the form of mean-field optimal control [weinan2019mean, liu2019selection], or on the connections between the continuous-time idealization and discrete time [thorpe2018deep, Sonoda2017, Sonoda2019]. The present paper focuses on the approximation aspects of continuous-time deep learning, which are less studied. One exception is the recent work of [zhang2019approximation], who derived some results in the direction of approximation. However, an important assumption there was that the driving force on the right-hand side of the ODEs (i.e. the control family $\mathcal{F}$) is itself a universal approximator. Consequently, the results do not elucidate the power of composition and flows, since each “layer” is already complex enough to approximate an arbitrary function, and there is no need for the flow to perform any additional approximation.
In contrast, the approximation results here do not require $\mathcal{F}$, or even $\overline{\mathrm{CH}}(\mathcal{F})$, to be universal approximators. In fact, $\mathcal{F}$ can be a very small set of functions, and the approximation power of these dynamical systems is by construction attributed to the dynamics of the flow. For example, the assumption that $\overline{\mathrm{CH}}(\mathcal{F})$ contains a well function does not imply that the family $\mathcal{F}$ driving the dynamical system is complex, since the former can be much larger than the latter. In the 1D ReLU control family, one can easily construct a well function with respect to the interval $\Omega=(q_1,q_2)$ by averaging two ReLU functions: $\frac{1}{2}[\mathrm{ReLU}(q_1-x)+\mathrm{ReLU}(x-q_2)]$, but the control family is not complex enough to approximate arbitrary functions without further linear combinations. We will demonstrate in Section 4.1 that many other architectures induce control families that satisfy the conditions in Theorem 2.4 and Theorem 2.5, but the general statements derived above reveal some fundamental mechanics that may be at work in such deep models.
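As a quick illustration of the averaged-ReLU construction, the sketch below evaluates a candidate 1D well function. The formula $\frac{1}{2}[\mathrm{ReLU}(q_1-x)+\mathrm{ReLU}(x-q_2)]$ and the endpoints $q_1=0$, $q_2=1$ are assumptions made for this example; the function name `well` is hypothetical.

```python
def relu(x: float) -> float:
    """Rectified linear unit."""
    return max(0.0, x)


def well(x: float, q1: float, q2: float) -> float:
    """Average of two ReLU ramps: zero on [q1, q2], strictly positive outside."""
    return 0.5 * (relu(q1 - x) + relu(x - q2))


# Assumed endpoints of the zero interval, for illustration only.
q1, q2 = 0.0, 1.0

# Points in the closed interval [q1, q2]: the function vanishes there.
inside = [well(x, q1, q2) for x in (0.0, 0.25, 0.5, 1.0)]

# Points outside the interval: the "walls" of the well are nonzero.
outside = [well(x, q1, q2) for x in (-1.0, 3.0)]
```

Each of the two ramps belongs to a ReLU control family on its own, so the average lies in the convex hull even though neither ramp is a well function by itself.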
We also note that unlike the results in [zhang2019approximation], the results here for $n\ge2$ do not require embedding the dynamical system in higher dimensions to achieve universal approximation. The negative results given in [zhang2019approximation] (and also [dupont2019augmented]), which motivated embedding in higher dimensions, are essentially limitations of representation: flow maps of ODEs are OP homeomorphisms and thus can only represent such mappings. However, these are not counter-examples for approximation, since an OP homeomorphism can approximate a mapping that is not OP to arbitrary accuracy in dimensions greater than or equal to two [brenier2003p].
In relation to classical approximation theory, one can observe from subsequent proofs and constructions that the function approximation process here is dynamical in nature, in that it relies on a sequence of transformations of the domain of the function. This makes it very different from truncations of a basis expansion that are typically encountered in traditional approximation theory [devore1998nonlinear]. For instance, suppose we take $m=1$ and $g$ to be a linear function in Theorem 2.4. Then, we may interpret $\hat{F}$ as a linear combination of dictionary functions selected from the dictionary built from flow maps
In this sense, Theorem 2.4 is a statement about a type of nonlinear $n$-term approximation. In classical nonlinear approximation [devore1998nonlinear], one usually has an approximation error that decays to 0 as $n$ increases but is non-zero for any finite $n$. However, in the case of the flow map dictionary, Theorem 2.4 shows that as long as $n\ge2$, the infimum of the error is actually 0. Of course, this relies on the fact that we are considering arbitrarily large times in the evolution, so a natural question is how the approximation rate depends on $T$. In Section 4.1, we derive some results in this direction in the 1D case, which further highlight the distinguishing mechanics of this approximation process.
Although the present paper focuses on the continuous-time idealization, we should also discuss the results here in relation to the relevant work on the approximation theory of discrete deep neural networks. In this case, one line of work to establish universal approximation is to show that deep networks can approximate some other family of functions known to be universal approximators themselves, such as wavelets [mallat2016understanding] and shearlets [guhring2019error]. Another approach is to focus on certain specific architectures, such as in [lu2017expressive, lin2018resnet, zhou2018deep, bao2019approx, daubechies2019nonlinear, e_priori_2019], which sometimes allows for explicit asymptotic approximation rates to be derived for appropriate target function classes. Furthermore, non-asymptotic approximation rates for deep ReLU networks are obtained in [shen2019nonlinear, Shen2019Deep]. They are based on explicit constructions using composition, and hence are similar in flavor to the results here if we take an explicit control family and discretize in time.
With respect to these works, the main difference of the results presented here is that we study general properties of function composition via explicit constructions of approximating flows and formulate sufficient conditions for approximation. In particular, none of the approximation results we present here depend on reproducing some other function class (polynomials, wavelets, etc.) that is known to have the universal approximation property. Instead, we explicitly construct flow maps of dynamical systems to verify the approximation property. We also provide preliminary investigations into what kind of functions can be efficiently learned by a narrow and deep neural network in continuous time. In this sense, this approach is similar in flavor to the recently proposed Barron function framework for wide and deep networks [ma2019barron], inspired by the original approximation results of Barron [barron1994approximation] for shallow networks.
Lastly, the results here are also of relevance to mathematical control theory and the theory of dynamical systems. In fact, the problem of approximating functions by flow maps is closely related to the problem of controllability in control theory [sussmann2017nonlinear]. However, there is one key difference: in the usual controllability problem on Euclidean spaces, the task is to steer one particular input $x$ to a desired output value $\varphi(x)$, whereas here we want to steer the entire set of input values in $K$ to $\varphi(K)$ by the same control. This can be thought of as an infinite-dimensional function space version of controllability, which is a much less explored area; present controllability results in infinite dimensions mostly focus on the control of partial differential equations [Chukwu1991, Balachandran2002].
In the theory of dynamical systems, it is well known that functions represented by flow maps are subject to restrictions. For example, [palis1974vector] gives a negative result that the diffeomorphisms generated by vector fields are few in the Baire category sense. Some works also give explicit criteria for mappings that can be represented by flows, such as [fort1955embedding] and [utz1981embedding] in Euclidean settings, and more recently, [zhang2009embedding] generalizes some results to the Banach space setting. However, these results concern exact representation, not approximation, and hence do not contradict the positive results presented in this paper. Results on approximation properties are fewer. A relevant one is [brenier2003p], who showed that every $L^p$ mapping can be approximated by orientation-preserving diffeomorphisms constructed using polar factorization and measure-preserving flows. The results of the current paper give an alternative construction of a dynamical system whose flow also has such an approximation property. Moreover, Theorem 2.4 gives some weak sufficient conditions for any controlled dynamical system to have this property. In this sense, the results here further contribute to the understanding of the density of flow maps in $L^p$.
The rest of the paper is organized as follows. In Section 3.1 we introduce some basic results in the theory of ordinary differential equations that we use throughout this paper. Section 3.2 establishes some preliminary results, which lead to the proof of Theorem 2.5 in Section 4.1, first in 1D. This motivates the concept of well functions and their role in constructing rearrangement dynamics. Furthermore, we establish some simple results on the rates of approximation in specific cases. In Section 4.3, we prove Theorem 2.4, which generalizes the approximation result to higher dimensions.
3 Preliminary Results
In this section, we state and prove some preliminary results that are used to prove our main results in the next section.
3.1 Results on Ordinary Differential Equations
Throughout this paper, we use some elementary properties and techniques in classical analysis of ODEs. For completeness, we compile these results in this section. The proofs of well-known results are omitted and unfamiliar readers are referred to [Arnold1973Ordinary] for a comprehensive introduction.
Consider an ODE of the following form
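$$\dot z(t)=f(z(t)),\qquad z(0)=x, \tag{16}$$
where $x\in\mathbb{R}^n$ and $f:\mathbb{R}^n\to\mathbb{R}^n$ is a Lipschitz function. An equivalent form of the ODE is the following integral form
$$z(t)=x+\int_0^t f(z(s))\,ds.$$
Proposition 3.1 (Existence and Uniqueness).
The solution to (16) exists and is unique. Moreover, for each $t$, $z(t)$ is a continuous function of $x$.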
In the rest of this subsection, we only state and prove results in the one dimensional case, which is what we need in this paper. Some of these results can be generalized to higher dimensions, and the readers can refer to [Arnold1973Ordinary].
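First, the following result demonstrates an important limitation of flow maps when it comes to representation: in one dimension, an ODE flow map must preserve order.
Proposition 3.2.
Suppose $z_1$ and $z_2$ satisfy the same equation (16), but with different initial values $x_1<x_2$. Then $z_1(t)<z_2(t)$ for all $t$.
Suppose not, and assume $z_1(T)=z_2(T)$ for some $T>0$. Consider the following time-reversed ODE:
$$\dot w(t)=-f(w(t)),\qquad w(0)=z_1(T).$$
Then both $w_1(t)=z_1(T-t)$ and $w_2(t)=z_2(T-t)$ are solutions to the above. By uniqueness we have $x_1=w_1(T)=w_2(T)=x_2$, a contradiction. Hence the two solutions never meet, and since both $z_1$ and $z_2$ are continuous in $t$, we have $z_1(t)<z_2(t)$ for all $t$. ∎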
More generally, in higher dimensions, any flow map must be an orientation-preserving (OP) homeomorphism. The general definition of OP and the proof of the previous statement can be found in [Arnold1973Ordinary]. For the results in this paper, we only need the one-dimensional case proved in Proposition 3.2, where OP means continuous and increasing, with a continuous inverse. (Throughout this paper, an increasing function means a non-decreasing function, unless "strictly" is emphasized.) In higher dimensions, the OP property means that if we attach a local coordinate chart to some point, then under the action of an OP mapping the coordinate chart does not change its local orientation. In particular, if $\varphi$ is continuously differentiable, then $\varphi$ being OP is roughly equivalent to $\det J_\varphi(x)>0$ at all points $x$, where $J_\varphi$ is the Jacobian of $\varphi$.
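The order-preserving property in one dimension is easy to observe numerically. The following sketch is illustrative only: it uses the hypothetical Lipschitz vector field $f(z)=\sin z$ and a forward Euler integrator with a small step, and checks that sorted initial conditions remain sorted after the flow.

```python
import math
from typing import Callable, List


def flow_1d(f: Callable[[float], float], x: float, T: float,
            steps: int = 1000) -> float:
    """Approximate the time-T flow map of dz/dt = f(z) by forward Euler."""
    dt = T / steps
    z = x
    for _ in range(steps):
        z += dt * f(z)
    return z


# Strictly increasing initial conditions (assumed values).
xs: List[float] = [-2.0, -0.5, 0.0, 0.3, 1.7]

# Images under the approximate flow map at time T = 3.
ys = [flow_1d(math.sin, x, T=3.0) for x in xs]

# The flow may squeeze points together, but it never swaps their order.
order_preserved = all(a < b for a, b in zip(ys, ys[1:]))
```

Each Euler step $z\mapsto z+\Delta t\,\sin z$ has derivative $1+\Delta t\cos z>0$ for small $\Delta t$, so the discrete map inherits the monotonicity of the continuous flow here.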
Next we introduce the well-known Grönwall’s Inequality.
Proposition 3.3 (Grönwall’s Inequality).
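If $f:[0,T]\to[0,\infty)$ is continuous and satisfies $f(t)\le A+B\int_0^t f(\tau)\,d\tau$ for constants $A,B\ge0$, then $f(t)\le Ae^{Bt}$.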
Finally, we prove some practical results, which follow easily from classical results but are used in some proofs of the main body.
Let be the ODE of the type (16), with initial value . When is in some compact set , then the continuous modulus of finite time
converges to 0 as uniformly on .
We denote , By Proposition 3.2, we know that , thus is compact, so is . Suppose , we have
implying the result.
The following proposition shows that in one dimension, if we have a well function, we can transport one point to another if they are located on the same side of the well function's zero interval.
Suppose in . Then for . Consider the ODE:
Then ultimately the ODE system will reach , i.e., for some , .
Choose and define . We have
Set . If then we are done by continuity. Otherwise, we have
which again implies our result by continuity. ∎
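The transport mechanism can be sketched numerically. The example below is an illustration under stated assumptions rather than the construction used in the proof: it drives a point with the negated averaged-ReLU well function (zero interval $[0,1]$, sign flip corresponding to $D=-1$ in Definition 2.3), which is strictly negative between the start point $x_0=3$ and the target $x_1=2$, and records when the trajectory first reaches the target. All names and values are hypothetical.

```python
import math
from typing import Callable


def relu(x: float) -> float:
    return max(0.0, x)


def well(x: float, q1: float, q2: float) -> float:
    """Averaged-ReLU well function: zero on [q1, q2], positive outside."""
    return 0.5 * (relu(q1 - x) + relu(x - q2))


def time_to_reach(f: Callable[[float], float], x0: float, x1: float,
                  dt: float = 1e-3, max_steps: int = 1_000_000) -> float:
    """Euler-integrate dz/dt = f(z) from x0 until z <= x1; return elapsed time.

    Assumes x1 < x0 and f < 0 on [x1, x0], so the trajectory moves left.
    """
    z = x0
    for step in range(max_steps):
        if z <= x1:
            return step * dt
        z += dt * f(z)
    raise RuntimeError("target not reached within max_steps")


def drive(z: float) -> float:
    """Negated well function: pushes points right of the zero interval leftwards."""
    return -well(z, 0.0, 1.0)


t_hit = time_to_reach(drive, x0=3.0, x1=2.0)

# For this particular choice, dz/dt = -(z - 1)/2 right of the interval,
# so the exact hitting time is 2 * ln((3 - 1) / (2 - 1)) = 2 * ln 2.
t_exact = 2.0 * math.log(2.0)
```

The target is reached in finite time because the driving function stays bounded away from zero on the segment between the two points; a target on the boundary of the zero interval would only be approached asymptotically.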
With these results on ODEs in mind, we now present the proofs of our main results.
3.2 From Approximation of Functions to Approximations of Domain Transformations
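Now, we show that under mild conditions, as long as we can approximate any continuous domain transformation $\varphi:\mathbb{R}^n\to\mathbb{R}^n$ using flow maps, $\mathcal{H}_{\mathrm{ode}}$ is a universal approximator. Consequently, we can pass to the problem of approximating an arbitrary $\varphi$ by flow maps in establishing our main results.
Proposition 3.6.
Let $F:\mathbb{R}^n\to\mathbb{R}^m$ be continuous and $g:\mathbb{R}^n\to\mathbb{R}^m$ be Lipschitz. Let $K\subset\mathbb{R}^n$ be compact and suppose $g(\mathbb{R}^n)\supset F(K)$. Then, for any $\varepsilon>0$ and $p\in[1,\infty)$, there exists a continuous function $\varphi:\mathbb{R}^n\to\mathbb{R}^n$ such that
$$\|F-g\circ\varphi\|_{L^p(K)}\le\varepsilon.$$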
This follows from a general result on function composition proved in [li2019deepapprox]. We prove this in the special case here for completeness.
The set is compact, so for any we can form a partition with . By assumption, is non-empty for each , so let us pick . For each we define , which is bounded. By inner regularity of the Lebesgue measure, for any and for each we can find a compact with ( is the Lebesgue measure) and that ’s are disjoint. By Urysohn’s lemma, for each there exists a continuous function such that for all , on and on .
Now, we form the continuous function
We define the set , which is clearly compact and for all . Then, we have
We take small enough so that the last term is bounded by . Then, we have
Taking yields the result. ∎
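We shall hereafter assume that $g(\mathbb{R}^n)\supset F(K)$, which, as discussed earlier, is easily satisfied by taking $g$ to be any onto function. Hence we have the following immediate corollary.
Corollary 3.7.
Assume the conditions in Proposition 3.6. Let $\mathcal{A}$ be some collection of continuous functions from $\mathbb{R}^n$ to $\mathbb{R}^n$ such that for any $\delta>0$ and any continuous function $\varphi:\mathbb{R}^n\to\mathbb{R}^n$, there exists $\hat\varphi\in\mathcal{A}$ with $\|\varphi-\hat\varphi\|_{L^p(K)}\le\delta$. Then, for any $\varepsilon>0$ there exists $\hat\varphi\in\mathcal{A}$ such that $\|F-g\circ\hat\varphi\|_{L^p(K)}\le\varepsilon$.
Using Proposition 3.6, there is a continuous $\varphi$ such that $\|F-g\circ\varphi\|_{L^p(K)}\le\varepsilon/2$. Now take $\hat\varphi\in\mathcal{A}$ such that $\|\varphi-\hat\varphi\|_{L^p(K)}\le\varepsilon/(2\,\mathrm{Lip}(g))$. Then,
$$\|F-g\circ\hat\varphi\|_{L^p(K)}\le\|F-g\circ\varphi\|_{L^p(K)}+\mathrm{Lip}(g)\,\|\varphi-\hat\varphi\|_{L^p(K)}\le\varepsilon.$$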
3.3 Properties of Attainable Sets and Approximation Closures
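Owing to Corollary 3.7, for the rest of the paper we will focus on proving universal approximation of continuous transformation functions $\varphi$ from $\mathbb{R}^n$ to $\mathbb{R}^n$ by flow maps of the dynamical system
$$\dot z(t)=f_t(z(t)),\qquad f_t\in\mathcal{F},\qquad z(0)=x,$$
after which we can deduce universal approximation properties of $\mathcal{H}_{\mathrm{ode}}$ via Corollary 3.7.
We now establish some basic properties of flow maps as well as approximation closures. In principle, in our hypothesis space (10) we allow $t\mapsto f_t$ to be any measurable mapping for any $T>0$. However, it turns out that to establish approximation results, it is enough to consider the smaller family of piecewise constant in time mappings, i.e. $f_t=f_j\in\mathcal{F}$ for $t\in[t_{j-1},t_j)$. For a fixed $f\in\mathcal{F}$, let $\varphi_\tau^f$ denote the flow map of the following dynamics at time horizon $\tau$:
$$\dot z(t)=f(z(t)),\qquad z(0)=x.$$
The attainable set of a finite time horizon $T$ due to piecewise constant in time controls, denoted as $\mathcal{A}_{\mathcal{F}}(T)$, is defined as
$$\mathcal{A}_{\mathcal{F}}(T)=\{\varphi_{\tau_k}^{f_k}\circ\cdots\circ\varphi_{\tau_1}^{f_1}\;:\;\tau_1+\cdots+\tau_k=T,\ \tau_j\ge0,\ f_j\in\mathcal{F},\ k\ge1\}.$$
In other words, $\mathcal{A}_{\mathcal{F}}(T)$ contains the flow maps of ODEs whose driving force is $f_j$ for $t\in[t_{j-1},t_j)$, $j=1,\dots,k$, where $t_j=\tau_1+\cdots+\tau_j$. It contains all the domain transformations that can be attained by an ODE by selecting a piecewise constant in time driving force from $\mathcal{F}$ up to a terminal time $T$. The union of flow maps over all possible terminal times, $\mathcal{A}_{\mathcal{F}}=\bigcup_{T>0}\mathcal{A}_{\mathcal{F}}(T)$, is the overall attainable set. In view of Corollary 3.7, to establish the approximation property of $\mathcal{H}_{\mathrm{ode}}$ it is sufficient to prove that any continuous transformation $\varphi$ can be approximated by mappings in $\mathcal{A}_{\mathcal{F}}$.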
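An element of the attainable set can be simulated by chaining flows, one per constant piece of the control. The sketch below is a minimal illustration with assumed ingredients: two hypothetical scalar vector fields, $f_1(z)=1$ and $f_2(z)=-z$, each integrated by forward Euler for its own duration, applied in sequence.

```python
import math
from typing import Callable, List, Tuple

VectorField = Callable[[float], float]


def euler_flow(f: VectorField, z: float, tau: float, steps: int = 1000) -> float:
    """Approximate the time-tau flow map of dz/dt = f(z) by forward Euler."""
    dt = tau / steps
    for _ in range(steps):
        z += dt * f(z)
    return z


def attainable_map(controls: List[Tuple[VectorField, float]], x: float) -> float:
    """Compose flows piece by piece; the first (f, tau) pair acts first."""
    z = x
    for f, tau in controls:
        z = euler_flow(f, z, tau)
    return z


def f1(z: float) -> float:
    """Hypothetical constant field: translates points to the right."""
    return 1.0


def f2(z: float) -> float:
    """Hypothetical linear field: contracts points towards the origin."""
    return -z


# Piecewise constant control: f1 for time 0.5, then f2 for time 1.0 (total T = 1.5).
y = attainable_map([(f1, 0.5), (f2, 1.0)], x=1.0)

# For these two fields the exact composition is (x + 0.5) * exp(-1).
y_exact = (1.0 + 0.5) * math.exp(-1.0)
```

The order of the pieces matters in general: swapping the two entries of the control list produces a different map, which is one reason why composition can generate a much richer set than the control family itself.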
Now, let us state some basic properties of the approximation closure defined in Definition 2.1, which are useful in the later sections. The proofs are immediate and hence omitted.
If $\mathcal{A}$ is a family of continuous and increasing functions from $\mathbb{R}$ to $\mathbb{R}$, then $\overline{\mathcal{A}}$ contains only increasing functions.
We have . Moreover, if , then .
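Next, we state and prove an important property about approximation closures of control families: $\overline{\mathrm{CH}}(\mathcal{F})$ shares the same approximation ability as $\mathcal{F}$ when used to drive dynamical systems. However, the convex hull of a family of Lipschitz functions, after taking the approximation closure, might not consist of Lipschitz functions in general. Hence we adopt a slightly different description.
Proposition 3.10.
Let $\mathcal{F}$ be a Lipschitz control family. Then, for any Lipschitz control family $\tilde{\mathcal{F}}$ such that $\mathcal{F}\subset\tilde{\mathcal{F}}\subset\overline{\mathrm{CH}}(\mathcal{F})$, we have
$$\overline{\mathcal{A}_{\mathcal{F}}}=\overline{\mathcal{A}_{\tilde{\mathcal{F}}}}.$$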
Proposition 3.10 is an important result concerning the effect of continuous evolution, which can be regarded as a continuous family of compositions: any function family driving a dynamical system is as good as its convex hull in driving the system, which can be an immensely larger family of functions. Similar properties of flows have been observed in the context of variational problems, see [Warga1962Relaxed]. This is a first hint at the power of composition on function approximation.
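The convex-hull effect can be observed in a small experiment. The sketch below is an illustration under assumed choices, not the argument used in the proofs: it rapidly alternates between the exact flows of two hypothetical fields, $f_1(z)=1$ and $f_2(z)=-z$, and compares the result with the exact flow of their average $\frac{1}{2}(f_1+f_2)(z)=\frac{1}{2}(1-z)$ over the same total time.

```python
import math


def flow_f1(z: float, s: float) -> float:
    """Exact time-s flow of dz/dt = 1."""
    return z + s


def flow_f2(z: float, s: float) -> float:
    """Exact time-s flow of dz/dt = -z."""
    return z * math.exp(-s)


def alternating_flow(x: float, T: float, n: int) -> float:
    """Split [0, T] into n cycles; in each cycle run f1 then f2 for half the cycle."""
    h = T / n
    z = x
    for _ in range(n):
        z = flow_f1(z, h / 2.0)
        z = flow_f2(z, h / 2.0)
    return z


def averaged_flow(x: float, T: float) -> float:
    """Exact time-T flow of dz/dt = (1 - z) / 2, the average of f1 and f2."""
    return 1.0 + (x - 1.0) * math.exp(-T / 2.0)


T = 2.0
x = 0.0
target = averaged_flow(x, T)

# Gap between the alternating flow and the averaged flow as switching gets faster.
gaps = {n: abs(alternating_flow(x, T, n) - target) for n in (10, 100, 1000)}
```

Only the two original fields are ever used to drive the system, yet the composed map approaches the flow of a field that belongs to the convex hull and to neither field alone.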
To prove Proposition 3.10 we need the following lemmas.
If and are attainable sets of and . Then we have
It suffices to show that , since , which implies the lemma.
where each is in . Fix a compact set and , we construct a function such that .
We prove by induction on . First, the case when is obvious since it is just the identity mapping. Suppose , where is composition of flow maps. For some (to be determined later) and we have some , such that . If there is a function such that . Consider two ODEs:
By subtracting, we have
By Grönwall’s inequality, we have
Then if we choose such that in , then (39) remains valid. Choosing appropriate and , we obtain , which concludes the proof. ∎
Suppose and , then we have .
We will show that can approximate arbitrarily well. The mapping is the solution of
Thus if satisfying:
Then we have
Recall that is the modulus of continuity defined in Proposition 3.4. Again, by Grönwall’s inequality we have
For any selected compact set , by Proposition 3.4, thus we obtain . ∎
, then .
Now, we are ready to prove Proposition 3.10.
4 Proof of Main Results
In this section, we prove the main results (Theorems 2.4 and 2.5). We start with the one-dimensional case to gain some insights on how a result can be established in general, and in particular, to elucidate the role of well functions (Definition 2.2) in constructing rearrangement dynamics. This serves to motivate the extension of the results to higher dimensions.
4.1 Approximation Results in One Dimension and the Proof of Theorem 2.5
Proposition 3.2, together with the fact that compositions of continuous and increasing functions are again continuous and increasing, implies that any function from the attainable set $\mathcal{A}_{\mathcal{F}}$ must be continuous and increasing. We will adopt the short form “CI” for such functions. In 1D, this poses a restriction on the approximation power of $\mathcal{A}_{\mathcal{F}}$, as the following result shows:
Let and be a Lipschitz control family, whose attainable set is . Then