Universal Approximation Power of Deep Neural Networks via Nonlinear Control Theory

07/12/2020
by Paulo Tabuada, et al.
Queen's University

In this paper, we explain the universal approximation capabilities of deep neural networks through geometric nonlinear control. Inspired by recent work establishing links between residual networks and control systems, we provide a general sufficient condition for a residual network to have the power of universal approximation by asking the activation function, or one of its derivatives, to satisfy a quadratic differential equation. Many activation functions used in practice satisfy this assumption, exactly or approximately, and we show this property to be sufficient for an adequately deep neural network with n states to approximate arbitrarily well any continuous function defined on a compact subset of R^n. We further show this result to hold for very simple architectures, where the weights only need to assume two values. The key technical contribution consists of relating the universal approximation problem to the controllability of an ensemble of control systems corresponding to a residual network, and of leveraging classical Lie-algebraic techniques to characterize controllability.

1. Introduction

In the past few years, we have witnessed a resurgence in the use of techniques from dynamical and control systems for the analysis of neural networks. This recent development was sparked by the papers [Weinan, 2017, Haber and Ruthotto, 2017, Lu et al., 2018] establishing a connection between certain classes of neural networks, such as residual networks [He et al., 2016], and control systems. However, the use of dynamical and control systems to describe and analyze neural networks goes back at least to the 70’s. For example, Wilson-Cowan’s equations [Wilson and Cowan, 1972] are differential equations and so is the model proposed by Hopfield in [Hopfield, 1984]. These techniques have been used to study several problems such as weight identifiability from data [Albertini and Sontag, 1993, Albertini et al., 1993], controllability [Sontag and Qiao, 1999, Sontag and Sussmann, 1997], and stability [Michel et al., 1989, Hirsch, 1989].

The objective of this paper is to shed new light on the approximation power of deep neural networks. It has been empirically observed that deep networks have better approximation capabilities than their shallow counterparts and are easier to train [Ba and Caruana, 2014, Urban et al., 2017]. An intuitive explanation for this fact is based on the different ways in which these two types of networks perform function approximation. While shallow networks prioritize parallel compositions of simple functions (the number of neurons per layer is a measure of parallelism), deep networks prioritize sequential compositions of simple functions (the number of layers is a measure of sequentiality). It is therefore natural to seek insights from control theory, where the problem of producing interesting behavior by manipulating a few inputs over time, i.e., by sequentially composing them, has been extensively studied. Even though control-theoretic techniques have been utilized in the literature to showcase the controllability properties of neural networks, to the best of our knowledge this paper is the first to use tools from geometric control theory to analyze the function approximation properties of deep neural networks given an ensemble of data points. As we illustrate, the latter makes the problem challenging: we are required to design a single input, for the control system modeling a neural network, that drives an ensemble of initial data points to the target data points dictated by the function to be approximated, while guaranteeing function approximation in an appropriate norm.

1.1. Contributions

In this paper we focus on residual networks. This being said, as explained in [Lu et al., 2018], similar techniques can be exploited to analyze other classes of networks. It is known that deep residual networks have the power of universal approximation. What is less understood is where this power comes from. We show in this paper that it stems from the activation functions in the sense that when using a sufficiently rich activation function, even networks with very simple architectures and weights taking only two values suffice for universal approximation. It is the power of sequential composition, analyzed in this paper via geometric control theory, that unpacks the richness of the activation function into universal approximability. Surprisingly, the level of richness required from an activation function also has a very simple characterization; it suffices for activation functions (or a suitable derivative) to satisfy a quadratic differential equation. Most activation functions in the literature either satisfy this condition or can be suitably approximated by functions satisfying it.

More specifically, given an ensemble of data points, we cast the problem of designing the weights of a deep residual network as the problem of driving, with a single open-loop control input, an ensemble of initial points to the ensemble of target points obtained by evaluating the function to be learned at the initial points. In spite of the fact that we only have access to a single open-loop control input, we prove that the corresponding ensemble of control systems is generically controllable. We then utilize this property to obtain universal approximability results for continuous functions in an L^p sense.

1.2. Related work

Several papers have studied and established that residual networks have the power of universal approximation. This was done in [Lin and Jegelka, 2018] by focusing on the particular case of residual networks with the ReLU activation function. It was shown that any such network with n states and one neuron per layer can approximate an arbitrary Lebesgue integrable function. The paper [Zhang et al., 2019] shows that the functions described by deep networks with n states per layer, when these networks are modeled as control systems, are restricted to be homeomorphisms. The authors then show that suitably increasing the number of states per layer suffices to approximate arbitrary homeomorphisms. Note that the results in [Lin and Jegelka, 2018] do not model deep networks as control systems and, for this reason, bypass the homeomorphism restriction. There is also an important distinction to be made between requiring a network to exactly implement a function and requiring it to approximate one. The homeomorphism restriction does not prevent a network from approximating arbitrary functions; it only restricts the functions that can be exactly implemented by a network. Closer to this paper are the results in [Li et al., 2019] establishing universal approximation based on a general sufficient condition satisfied by several examples of activation functions. Deep networks are modeled as control systems and the sufficient conditions are placed on the right-hand side of the control system, regarded as a family of functions parametrized by the control input. It is proved that if this family is sufficiently rich, e.g., if a certain closure of its convex hull contains a well function, then universal approximability follows. These results are a major step forward in identifying what is needed for universal approximability, as they are not tied to specific architectures or activation functions. In this paper, we go further by deriving sufficient conditions directly on the activation function. The key difference is that all the work required to unpack the richness of the activation function into universal approximability is done by suitably sequentially composing the simple functions defined by the individual layers, i.e., by training the network. In contrast, the sufficient conditions proposed in [Li et al., 2019] require some of this work to be done when determining whether such conditions are satisfied. In other words, determining whether a certain closure of the convex hull of the family contains a well function can be seen as testing whether well functions can be written as a combination of the simpler functions in the family.

Other papers, e.g., [Lu et al., 2017, Daubechies et al., 2019] have used different metrics to compare the approximation power of deep networks with shallow networks or other classes of universal approximation schemes and are less related to our results.

At the technical level, our results build upon the controllability properties of deep networks, studied in this paper for the first time. Earlier work on controllability of differential equation models for neural networks, e.g., [Sontag and Qiao, 1999], assumed the weights to be constant and that an exogenous control signal was fed into the neurons. In contrast, we regard the weights as control inputs and assume that no additional control inputs are present. These two different interpretations of the model lead to two very different technical problems. More recent work in the control community includes [Agrachev and Caponigro, 2009], where it is shown that any orientation-preserving diffeomorphism of a compact manifold can be obtained as the flow of a control system when using a time-varying feedback controller. In the context of this paper, those results can be understood as: residual networks can represent any orientation-preserving diffeomorphism provided that we can make the weights depend on the state. Although quite insightful, such results are not applicable to standard neural network models, where the weights are not allowed to depend on the state. Another relevant topic is ensemble control. Most of the work on the control of ensembles, see for instance [Li and Khaneja, 2006, Helmke and Schönlein, 2014, Brockett, 2007], considers parametrized ensembles of vector fields. In other words, the individual systems that drive the state of the whole ensemble are different, whereas in our setting the ensemble consists of exact copies of the same system, albeit initialized differently. In this sense, our work is most closely related to the setting of [Agrachev and Sarychev, 2020], where controllability results for ensembles of infinitely many control systems are provided. In this paper, in contrast, we are concerned with only finitely many systems, and the specific structure of the problem at hand allows us to provide sharp controllability conditions.

2. Control-theoretic view of residual networks

2.1. From residual networks to control systems and back

We start by providing a control-system perspective on residual neural networks. We mostly follow the treatment proposed in [Weinan, 2017, Haber and Ruthotto, 2017, Lu et al., 2018], where it was suggested that residual neural networks with an update equation of the form:

(2.1)

where the integer k indexes the layers, can be interpreted as a control system when k is viewed as indexing time. In (2.1), the weight functions assign weights to each time instant k, and the nonlinearity Σ is obtained by applying an activation function σ to each component of its argument. By drawing an analogy between (2.1) and Euler's forward method for discretizing differential equations, one can interpret (2.1) as the time discretization of the continuous-time control system:

(2.2)

where the weights are now functions of continuous time; in what follows, and in order to make the presentation simpler, we sometimes drop the dependency on time. To make the connection between the discretization and (2.2) precise, let x be a solution of the control system (2.2) for a given control input. Then, given any desired accuracy ε > 0 and any norm on R^n, there exists a sufficiently small time step h > 0 so that the sequence produced by the update (2.1), with weights obtained by sampling the continuous-time weights every h units of time, approximates the sampled solution of (2.2) with error at most ε. Intuitively, any statement about the solutions of (2.2) therefore holds for the solutions of (2.1) with arbitrarily small error ε, provided that we can choose the depth to be arbitrarily large, since making h small increases the depth, which is given by the length of the time horizon divided by h.
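
As a concrete illustration of this correspondence, the sketch below advances a state through a few residual layers read as forward-Euler steps. It assumes a layer of the common residual form x_{k+1} = x_k + h·Σ(W_k x_k + b_k), with Σ applying tanh componentwise; this specific form and all names are illustrative rather than taken verbatim from (2.1).

```python
import numpy as np

def Sigma(z):
    # componentwise activation; tanh is one of the activations listed in Table 1
    return np.tanh(z)

def residual_step(x, W, b, h):
    # one residual layer, read as a forward-Euler step of x' = Sigma(W x + b)
    return x + h * Sigma(W @ x + b)

def simulate(x0, weights, biases, h):
    # the depth plays the role of the time horizon divided by the step size h
    x = np.array(x0, dtype=float)
    for W, b in zip(weights, biases):
        x = residual_step(x, W, b, h)
    return x

# toy usage: three layers acting on a two-dimensional state
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 2)) for _ in range(3)]
biases = [rng.standard_normal(2) for _ in range(3)]
print(simulate([1.0, -0.5], weights, biases, h=0.1))
```

Halving h while doubling the number of layers leaves the simulated time horizon unchanged, which is the sense in which depth buys accuracy of the discretization.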

2.2. Network training and controllability

Given a function f and a finite set of sample points, the problem of training a residual network so that it maps each sample point to the value of f at that point can be phrased as the problem of constructing an open-loop control input so that the resulting solution of (2.2) takes the states given by the sample points to the states given by the corresponding values of f. It should then come as no surprise that the ability to approximate a function is tightly connected with the control-theoretic problem of controllability: given one initial state and one final state, when does there exist a finite time T and a control input so that the solution of (2.2) starting at the initial state at time 0 ends at the final state at time T?

To make the connection between controllability and the problem of mapping every sample point to its image under f clear, it is convenient to consider the ensemble of copies of (2.2) given by the matrix differential equation:

(2.3)

where, for each time t, the j-th column of the matrix-valued solution is the solution of the j-th copy of (2.2) in the ensemble. If we now index the sample points as x_1, ..., x_d, where d is the number of sample points, and consider the matrix X_0 with columns x_1, ..., x_d and the matrix X_f with columns f(x_1), ..., f(x_d), we see that the existence of a control input resulting in a solution of (2.3) starting at X_0 and ending at X_f, i.e., controllability of (2.3), is equivalent to the existence of an input for (2.2) so that the resulting solution starting at x_j ends at f(x_j), for every j = 1, ..., d.

Note that achieving controllability of (2.3) is especially difficult, since all the copies of (2.2) in (2.3) are identical and they all use the same input. Therefore, to achieve controllability, we must have sufficient diversity in the initial conditions to overcome the symmetries present in (2.3); see [Aguilar and Gharesifard, 2014]. Our controllability result, Theorem 4.2, describes precisely such diversity. As mentioned in the introduction, this observation also distinguishes the problem under study here from the classical setting of ensemble control [Li and Khaneja, 2006, Helmke and Schönlein, 2014], with the exception of the recent work [Agrachev and Sarychev, 2020], where a collection of systems with different dynamics is driven by the same control input.
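
The following sketch makes the ensemble picture concrete: the columns of the matrix X are the copies of the system, initialized at the data points, and every column is advanced with the same weights at every step. The residual-layer form reused from the sketch above is an assumption for illustration, not the exact right-hand side of (2.3).

```python
import numpy as np

def Sigma(z):
    return np.tanh(z)

def ensemble_step(X, W, b, h):
    # every column of X is advanced by the SAME weights (W, b), mirroring (2.3)
    return X + h * Sigma(W @ X + b[:, None])

# columns of X0 are the initial data points; a target matrix would hold f evaluated there
X0 = np.array([[0.0, 1.0, -1.0],
               [0.5, -0.5, 2.0]])            # n = 2 states, d = 3 copies
rng = np.random.default_rng(1)
X = X0.copy()
for _ in range(50):                          # fifty Euler steps with time-varying weights
    W, b = rng.standard_normal((2, 2)), rng.standard_normal(2)
    X = ensemble_step(X, W, b, h=0.02)
print(X)
```

Training then amounts to choosing the sequence of weights so that the final X matches the matrix of targets column by column; controllability of (2.3) is precisely the statement that such a choice exists.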

3. Problem formulation

Our starting point is the control system:

(3.1)

a slightly simplified version of (2.2) in which one of the matrix-valued inputs of (2.2) is replaced by a scalar-valued function of time; as we will prove in what follows, this model is enough for universal approximation. In fact, we will later see (see the discussion after the proof of Theorem 4.2) that it suffices to let this scalar input assume only two arbitrary values (one positive and one negative). Moreover, for certain activation functions, one of the inputs can be dispensed with altogether.

We make the following assumptions regarding the model (3.1):

  • The function Σ is defined componentwise as Σ(x) = (σ(x_1), ..., σ(x_n)), where the activation function σ, or a suitable derivative of it, satisfies a quadratic differential equation, i.e., (σ^(k))' = a(σ^(k))^2 + b σ^(k) + c with a, b, c ∈ R, a ≠ 0, and for some k ∈ N ∪ {0}. Here, σ^(k) denotes the derivative of σ of order k and σ^(0) = σ.

  • The activation function σ is Lipschitz continuous and the function σ^(k) defined above is injective.

Several activation functions used in the literature are solutions of quadratic differential equations, as can be seen in Table 1. Moreover, activation functions that are not differentiable can also be handled via approximation. For example, the ReLU function, defined by max{0, x}, can be approximated by (1/τ) log(1 + e^(τx)) as τ → ∞, which satisfies the quadratic differential equation given in Table 1.

Function name        Definition                               Satisfied differential equation
Logistic function    σ(x) = 1/(1 + e^(-x))                    σ' = σ(1 − σ)
Hyperbolic tangent   σ(x) = (e^x − e^(-x))/(e^x + e^(-x))     σ' = 1 − σ^2
Soft plus            σ(x) = log(1 + e^x)                      σ'' = σ'(1 − σ')
Table 1. Activation functions and the differential equations they satisfy.
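
A quick numerical sanity check of the first two rows of Table 1, using finite differences; it is only an illustration of the "exactly or approximately" remark above.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 2001)
h = x[1] - x[0]

# finite-difference derivatives on the grid
d_logistic = np.gradient(logistic(x), h)
d_tanh = np.gradient(np.tanh(x), h)

# quadratic differential equations from Table 1
err_logistic = np.max(np.abs(d_logistic - logistic(x) * (1.0 - logistic(x))))
err_tanh = np.max(np.abs(d_tanh - (1.0 - np.tanh(x) ** 2)))
print(err_logistic, err_tanh)   # both residuals are small, limited only by the grid spacing
```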

The Lipschitz continuity assumption is made to simplify the presentation and can be replaced with local Lipschitz continuity, which then does not need to be assumed, since σ is analytic by virtue of being the solution of an analytic (quadratic) differential equation. Moreover, all the activation functions in Table 1 are Lipschitz continuous and injective.

To formally state the problem under study in this paper, we need to discuss a different point of view on the solutions of the control system (3.1), given by flows. A continuously differentiable curve is said to be a solution of (3.1) under a piecewise continuous input if it satisfies (3.1). Under the stated assumptions on σ, given a piecewise continuous input and an initial state, there is one and at most one solution of (3.1) starting at that state at time 0. Moreover, solutions are defined for all time. We can thus define the flow of (3.1) under the input as the map that assigns to each initial state the point reached at time T by the unique solution starting at that state at time 0. When the time T and the input are clear from the context, we denote this flow simply by Φ.

We will use flows to approximate arbitrary continuous functions f: R^n → R^m. Since flows have the same domain and co-domain, and f may not, we first lift f to a map with equal domain and co-domain. When m < n, we lift f to the composition of f with the injection of R^m into R^n given by padding with zeros; in this case, the lifted map goes from R^n to R^n. When m > n, we lift f to the composition of the projection of R^m onto its first n coordinates with f; in this case, the lifted map goes from R^m to R^m. Although we could instead consider factoring f through another map, i.e., constructing a map g so that f is obtained by composing g with a flow, as done in, e.g., [Li et al., 2019], the construction of g requires a deep understanding of f, since a necessary condition for this factorization is that the image of f be contained in the image of g. Constructing g so that its image contains the image of f requires knowing what the image of f is, and this information is not available in learning problems. Given this discussion, in the remainder of this paper we directly assume that we seek to approximate a map f: R^n → R^n.
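
The two lifts described above are easy to write down; the helper names below are ours and purely illustrative.

```python
import numpy as np

def lift_pad(f, n, m):
    # m < n: follow f with the injection of R^m into R^n that pads with zeros
    def F(x):
        return np.concatenate([np.atleast_1d(f(x)), np.zeros(n - m)])
    return F

def lift_project(f, n, m):
    # m > n: precede f with the projection of R^m onto its first n coordinates
    def F(z):
        return np.atleast_1d(f(z[:n]))
    return F

# example: f maps R^2 to R, lifted to a map from R^2 to R^2
f = lambda x: np.array([x[0] ** 2 + x[1]])
F = lift_pad(f, n=2, m=1)
print(F(np.array([1.0, 2.0])))   # [3., 0.]
```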

The final ingredient we need before stating the problem solved in this paper is the precise notion of approximation. Throughout the paper, we will use approximation in the sense of L^p norms, where 1 ≤ p < ∞, i.e., the norm obtained by integrating the p-th power of a pointwise norm of the function over the compact set K on which the approximation is going to be conducted, and then taking the p-th root. (The choice of the pointwise norm on R^n is made for simplicity of presentation and the results still hold for any other norm on the finite-dimensional space R^n.) For some results we will also use the infinity norm, given by the supremum over K of the pointwise norm of the function.
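
Written out, and taking the Euclidean norm |·| on R^n as the pointwise norm (any other choice works, as noted above), the two norms of a function g defined on K read:

```latex
\|g\|_{p} = \left( \int_{K} |g(x)|^{p}\, dx \right)^{1/p}, \qquad 1 \le p < \infty,
\qquad\qquad
\|g\|_{\infty} = \sup_{x \in K} |g(x)|.
```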

We are now ready to state the problem that we study in this paper.

Problem 3.1.

Let f: R^n → R^n be a continuous function, K ⊂ R^n be a compact set, and ε > 0 be the desired approximation accuracy. Under what conditions on the activation function σ of control system (3.1) does there exist a time T and an input so that the flow Φ defined by the solution of (3.1) under the said input satisfies:

‖f − Φ‖_p ≤ ε

for 1 ≤ p < ∞?

In the next section, we will show that the answer is remarkably simple: it suffices for the activation function to satisfy a quadratic differential equation. As argued above, several activation functions satisfy this assumption exactly or approximately.

4. Function approximation through controllability

4.1. Outline of the technical arguments

We approximate the function f on the compact set K by a flow generated by the control system (3.1) in several steps. We first build a flow by constructing a discrete set of points that approximates K and by crafting this flow so that it approximately maps each of these points to its image under f. In order to control the mismatch between this flow and f at points of K that do not belong to the discrete set, we build another flow of (3.1) that takes most such points to points sufficiently close to the discrete set. The points not handled in this way form a set that can be made arbitrarily small and thus contribute arbitrarily little to the approximation error in the L^p sense for 1 ≤ p < ∞. We then show that the composition of the two flows approximates f. This strategy was already employed in [Li et al., 2019]; however, our construction of the flows differs significantly from the one proposed there. The technique employed in [Li et al., 2019] consists of directly constructing the flows using the so-called well functions, whereas we only use the fact that the activation function satisfies a quadratic differential equation and leverage tools from geometric control theory to establish controllability of the ensemble (2.3).

Let the points of the discrete set be x_1, ..., x_d, where d is the number of copies of the control system (3.1) in (2.3). If the ensemble control system (2.3) is controllable on a manifold M, then given the initial state X_0 with columns x_1, ..., x_d and the final state X_f with columns f(x_1), ..., f(x_d), we can find a time T and an input so that the solution of (2.3) starts at X_0 and ends at X_f. This, in turn, implies that the flow defined by the solution of (3.1) under this input takes every x_j to f(x_j). It is simple to see that controllability of (2.3) cannot hold on the whole ensemble state space, since if two columns of the initial state coincide, i.e., two copies of (3.1) are initialized at the same point, then these two columns coincide for all time; as we will see shortly, for the previously described purpose of function approximation, such scenarios will not be an issue. Remarkably, we show in Theorem 4.2 that controllability only fails in cases similar to the one we just described.
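
The obstruction described above is easy to see numerically: two copies initialized at the same point and driven by the same weights produce identical trajectories, so no shared input can ever separate them. The layer form is the same illustrative one used earlier.

```python
import numpy as np

def Sigma(z):
    return np.tanh(z)

# two copies initialized at the SAME point and driven by the SAME weights
x1 = np.array([0.7, -0.2])
x2 = x1.copy()
rng = np.random.default_rng(3)
for _ in range(100):
    W, b = rng.standard_normal((2, 2)), rng.standard_normal(2)
    x1 = x1 + 0.02 * Sigma(W @ x1 + b)
    x2 = x2 + 0.02 * Sigma(W @ x2 + b)
print(np.array_equal(x1, x2))   # True: identical copies can never be separated
```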

4.2. Main results

Our first result establishes that the controllability property holds for the ensemble control system (2.3) on a dense and connected sub-manifold of the ensemble state space, independently of the (finite) number of copies, as long as the activation function satisfies a quadratic differential equation. Before stating this result, we recall the formal definition of controllability.

Definition 4.1.

A point Y is said to be reachable from a point X for the control system (2.3) if there exist a finite time T and a control input so that the solution of (2.3) under said input starts at X at time 0 and ends at Y at time T. Control system (2.3) is said to be controllable on a submanifold M of the ensemble state space if any point in M is reachable from any point in M.

Theorem 4.2.

Consider the control system given by (2.3) and let be the set defined by:

Suppose that is injective and satisfies the quadratic differential equation with . If , then the ensemble control system (2.3) is controllable on .

We postpone the proof of this result to the Appendix. The following corollary of Theorem 4.2 on reachability is useful for proving one of our later results; its proof can also be found in the Appendix.

Corollary 4.3.

Consider the control system given by (2.3) and let M be the manifold defined in Theorem 4.2. Under the assumptions of Theorem 4.2, any point in M is reachable from a point for which:

holds for all , where .

Some remarks are in order. The assumptions above on the activation function can be relaxed; in particular, it is enough for σ^(k) to be injective and to satisfy the mentioned quadratic differential equation for some k ∈ N ∪ {0}. Moreover, Theorem 4.2 and Corollary 4.3 do not directly apply to the ReLU activation function, defined by max{0, x}, since this function is not differentiable. However, the ReLU is approximated by the activation function:

σ_τ(x) = (1/τ) log(1 + e^(τx))

as τ → ∞. In particular, as τ → ∞ the ensemble control system (2.3) with σ_τ converges to the ensemble control system (2.3) with the ReLU, and thus the solutions of the latter are arbitrarily close to the solutions of the former whenever τ is large enough. Moreover, σ_τ satisfies σ_τ'(x) = 1/(1 + e^(−τx)) > 0 for every x ∈ R, thus showing that σ_τ is an increasing function and, consequently, injective.
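
Using the scaled soft plus σ_τ written above, the convergence to the ReLU and the positivity of the derivative can be checked numerically; the snippet below is illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigma_tau(x, tau):
    # scaled soft plus, written with logaddexp for numerical stability
    return np.logaddexp(0.0, tau * x) / tau

x = np.linspace(-3.0, 3.0, 1001)
for tau in (1.0, 10.0, 100.0):
    print(tau, np.max(np.abs(sigma_tau(x, tau) - relu(x))))   # gap shrinks like log(2)/tau

# derivative check: d/dx sigma_tau(x) = 1 / (1 + exp(-tau x)) > 0, so sigma_tau is increasing
tau, h = 10.0, x[1] - x[0]
numeric = np.gradient(sigma_tau(x, tau), h)
analytic = 1.0 / (1.0 + np.exp(-tau * x))
print(np.max(np.abs(numeric - analytic)))
```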

The conclusions of Theorem 4.2 and Corollary 4.3 also hold if we weaken the assumptions on the inputs of (3.1): it suffices for the entries of the weight functions to take values in a set with two elements; see the discussion after the proof of Theorem 4.2 for details. Moreover, when the activation function is an odd function, i.e., σ(−x) = −σ(x), as is the case for the hyperbolic tangent, the conclusions of Theorem 4.2 hold for an even simpler version of (3.1), in which one of the inputs is fixed to a constant value.

It is worthwhile comparing the assumption in Theorem 4.2 with the assumptions in Theorem 2.3, page 8, of [Li et al., 2019]; the reader not familiar with the latter result can safely skip this technical point. To make the comparison transparent, we consider the family of vector fields defined by:

The proof of Theorem 4.2 shows this family to be controllable (see the discussion before the proof of Theorem 4.2 for a formal definition of controllability for a family of vector fields). In particular, the conclusions of Theorem 4.2 hold for any network for which there exists a choice of constant weights that makes the right-hand side of (3.1) equal to one of these vector fields. If we apply Theorem 2.3 of [Li et al., 2019] to this family, we would first compute the corresponding convex hull of the family, given by:

and would then require this convex hull, or a suitable closure of it, to either contain a well function or contain a sequence of functions converging to a well function. For our purposes it suffices to recall that if a function is a well function (see [Li et al., 2019] for the exact definition) then its zero set is bounded and contains an open set. When the activation function is the ReLU, this assumption is not satisfied. The reasoning is as follows. Consider an element of the convex hull, which is of the form:

for some as prescribed above. There are four cases to be considered: 1) and ; in this case, the first component of which we denote by is given by . The zero set of this function is non-empty when and, in such case, it is not bounded. Hence, is not a well-function; 2) and ; in this case . If then the zero set of does not contain an open set since it consists of a point. When we are back to case 1). Hence, we conclude that is not a well function; 3) and ; same as case 2); 4) and ; in this case and similar arguments show the zero set of is either empty, a single point, or unbounded.

We now state our main results. The first asserts that any continuous function that is the Cartesian product of strictly monotone scalar functions, i.e., any f of the form f(x_1, ..., x_n) = (f_1(x_1), ..., f_n(x_n)) with each f_i: R → R strictly monotone, can be approximated with any desired accuracy on any compact set with respect to the infinity norm. The second result does not rely on the monotonicity assumption but weakens the infinity norm to the L^p norm.
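
As a concrete instance of the class covered by the first result, the map below is a Cartesian product of strictly monotone scalar functions (one strictly increasing, one strictly decreasing); the specific functions are just an illustrative example.

```python
import numpy as np

# Cartesian product of strictly monotone scalar functions on R^2:
# each output coordinate depends only on the matching input coordinate
def f(x):
    return np.array([x[0] ** 3 + x[0],     # strictly increasing in x_1
                     -np.arctan(x[1])])    # strictly decreasing in x_2

print(f(np.array([1.0, 2.0])))
```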

Proposition 4.4.

Assume there exists k ∈ N ∪ {0} so that σ^(k) is injective and satisfies a quadratic differential equation with a ≠ 0. Then, for every continuous function f: R^n → R^n that is the Cartesian product of strictly monotone functions, for every compact set K ⊂ R^n, and for every ε > 0 there exist a time T and an input so that the flow Φ defined by the solution of (3.1) under the said input satisfies:

‖f − Φ‖_∞ ≤ ε.

The proofs of this result and of the next one are provided in the appendix.

Theorem 4.5.

Let 1 ≤ p < ∞ and assume there exists k ∈ N ∪ {0} so that σ^(k) is injective and satisfies a quadratic differential equation with a ≠ 0. Then, for every continuous function f: R^n → R^n, for every compact set K ⊂ R^n, and for every ε > 0 there exist a time T and an input so that the flow Φ defined by the solution of (3.1) under the said input satisfies:

‖f − Φ‖_p ≤ ε.

This result generalizes that of [Li et al., 2019] in the sense that it does not require the well-function assumption and the sufficient condition for approximability is stated directly in terms of the activation function. Moreover, the conclusions of Theorem 4.5 hold for a class of neural networks larger than (3.1), as discussed after the proof of Theorem 4.2.

References

  • [Agrachev and Caponigro, 2009] Agrachev, A. and Caponigro, M. (2009). Controllability on the group of diffeomorphisms. Annales de l’Institut Henri Poincare (C) Non Linear Analysis, 26(6):2503 – 2509.
  • [Agrachev and Sarychev, 2020] Agrachev, A. and Sarychev, A. (2020). Control in the spaces of ensembles of points. SIAM Journal on Control and Optimization, 58(3):1579–1596.
  • [Aguilar and Gharesifard, 2014] Aguilar, C. and Gharesifard, B. (2014). Necessary conditions for controllability of nonlinear networked control systems. In American Control Conference, pages 5379–5383, Portland, OR.
  • [Albertini et al., 1993] Albertini, F., Sontag, E. D., and Maillot, V. (1993). Uniqueness of weights for neural networks. Artificial Neural Networks for Speech and Vision, pages 115–125.
  • [Albertini and Sontag, 1993] Albertini, F. and Sontag, E. D. (1993). For neural networks, function determines form. Neural Networks, 6(7):975 – 990.
  • [Ba and Caruana, 2014] Ba, L. J. and Caruana, R. (2014). Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2654–2662, Cambridge, MA, USA. MIT Press.
  • [Brockett, 2007] Brockett, R. W. (2007). Optimal control of the Liouville equation. AMS IP Studies in Advanced Mathematics, 39:23.
  • [Daubechies et al., 2019] Daubechies, I., DeVore, R., Foucart, S., Hanin, B., and Petrova, G. (2019). Nonlinear approximation and (deep) ReLU networks.
  • [Haber and Ruthotto, 2017] Haber, E. and Ruthotto, L. (2017). Stable architectures for deep neural networks. Inverse Problems, 34(1).
  • [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
  • [Helmke and Schönlein, 2014] Helmke, U. and Schönlein, M. (2014). Uniform ensemble controllability for one-parameter families of time-invariant linear systems. Systems & Control Letters, 71:69–77.
  • [Hirsch, 1989] Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2(5):331 – 349.
  • [Hopfield, 1984] Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092.
  • [Jurdjevic, 1996] Jurdjevic, V. (1996). Geometric Control Theory. Cambridge Studies in Advanced Mathematics. Cambridge University Press.
  • [Krattenthaler, 2001] Krattenthaler, C. (2001). Advanced determinant calculus. In The Andrews Festschrift, pages 349–426. Springer.
  • [Li and Khaneja, 2006] Li, J.-S. and Khaneja, N. (2006). Control of inhomogeneous quantum ensembles. Physical Review A, 73(3):030302.
  • [Li et al., 2019] Li, Q., Lin, T., and Shen, Z. (2019). Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382.
  • [Lin and Jegelka, 2018] Lin, H. and Jegelka, S. (2018). ResNet with one-neuron hidden layers is a universal approximator. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 6172–6181, Red Hook, NY, USA. Curran Associates Inc.
  • [Lu et al., 2018] Lu, Y., Zhong, A., Li, Q., and Dong, B. (2018). Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pages 3276–3285.
  • [Lu et al., 2017] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural networks: A view from the width. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6232–6240, Red Hook, NY, USA. Curran Associates Inc.
  • [Michel et al., 1989] Michel, A. N., Farrell, J. A., and Porod, W. (1989). Qualitative analysis of neural networks. IEEE Transactions on Circuits and Systems, 36(2):229–243.
  • [Smith, 2008] Smith, H. (2008). Monotone Dynamical Systems: An Introduction to the Theory of Competitive and Cooperative Systems. Mathematical Surveys and Monographs. American Mathematical Society.
  • [Sontag and Qiao, 1999] Sontag, E. and Qiao, Y. (1999). Further results on controllability of recurrent neural networks. Systems & Control Letters, 36(2):121–129.
  • [Sontag and Sussmann, 1997] Sontag, E. D. and Sussmann, H. (1997). Complete controllability of continuous-time recurrent neural networks. Systems & Control Letters, 30(4):177–183.
  • [Urban et al., 2017] Urban, G., Geras, K. J., Kahou, S. E., Aslan, Ö., Wang, S., Mohamed, A., Philipose, M., Richardson, M., and Caruana, R. (2017). Do deep convolutional nets really need to be deep and convolutional? In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
  • [Weinan, 2017] Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5.
  • [Wilson and Cowan, 1972] Wilson, H. and Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12(1):1–24.
  • [Zhang et al., 2019] Zhang, H., Gao, X., Unterman, J., and Arodz, T. (2019). Approximation capabilities of neural ODEs and invertible residual networks.

5. Proofs

The proof of Theorem 4.2 is based on two technical results. The first characterizes the rank of a certain matrix that will be required for our controllability result. In essence, the proof of this result follows from [Krattenthaler, 2001, Proposition 1]; however, we provide a proof for completeness.

Lemma 5.1.

Let σ be a function that satisfies the quadratic differential equation:

where the quadratic coefficient is nonzero. Suppose that the derivatives of σ up to the required order exist at the given points. Then, the determinant of the matrix:

(5.1)

is given by:

(5.2)
Proof.

We assume that the elements of the set are distinct, as otherwise, the determinant is clearly zero. We also assume that to exclude the trivial case. First, by the Vandermonde determinant formula, we have that:

(5.3)

Our proof technique is to use elementary row operations to construct the determinant of from (5.3). To illustrate the idea, let us use (5.3) to show that:

For later use, we denote by the determinant of the matrix constructed by substituting rows to in by derivatives of order to , respectively. First, note that multiplying the third row of by leads to:

Moreover, by the fact that the determinant is unchanged by adding a constant multiple of a row to another row, using rows one and two for this purpose, we have that:

which yields that:

proving the claim. The idea of the proof is to use this same procedure, row by row, to construct in the entry of the matrix. In order to proceed, however, we need to find a formula for , where . Note that, for , we have that:

and , as a polynomial in , is of degree . We now make an observation that finishes the proof. In particular, in the computation of and in order to construct in the third row, we only needed to know the coefficient of the highest degree monomial, in terms of , that constitutes . In other words, the lower degree terms do not contribute to the determinant, as they can be constructed, without changing the determinant, from previous rows. Using this observation, the term in the expansion of does not contribute to , as it can be added from the previously constructed rows. Using this reasoning for all , we conclude that the determinant of is independent of , and . Substituting and , since , we have that:

as claimed. ∎
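
Since the argument starts from the Vandermonde determinant formula (5.3), here is a small symbolic check of that classical identity; it is only a sanity check of the identity itself, not of (5.2).

```python
import sympy as sp

# Vandermonde determinant for a 3x3 instance with rows (1, 1, 1), (z_j), (z_j^2)
z1, z2, z3 = sp.symbols('z1 z2 z3')
V = sp.Matrix([[1, 1, 1],
               [z1, z2, z3],
               [z1**2, z2**2, z3**2]])
expected = (z2 - z1) * (z3 - z1) * (z3 - z2)   # product of pairwise differences
print(sp.simplify(V.det() - expected))          # prints 0
```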

Our second technical result is stated next.

Proposition 5.2.

Let be the set defined by:

The set is an open and dense sub-manifold of which is connected when .

Proof.

Note that is a finite union of vector subspaces of , hence topologically closed. Therefore, is an open and dense subset of and thus a sub-manifold of dimension . It remains to show that is connected.

Let , and assume that . We prove that there exists a continuous curve connecting to , i.e., and . Since there exists so that . Similarly, since there exists so that . We first consider the case where (which is possible since ). Without loss of generality assume that and and let be defined as:

where denotes the th row of . We now define the curve by:

and note that for all . This is because, by definition, there exists at least one index such that . When , we can choose to be because . When , we can choose to be because . Since is the composition of continuous functions, it is continuous. Moreover, by construction, and .

We now consider the case where . Since , we can choose so that with and . By the previous argument, there is a continuous curve connecting to without leaving and there is also a continuous curve connecting to without leaving . Therefore, their concatenation produces the desired continuous curve connecting to and the proof is finished. ∎

The proof of Theorem 4.2 uses several key ideas from geometric control that we now review. A collection of vector fields on a manifold is said to be controllable if, given any two points of the manifold, there exists a finite sequence of times so that the composition of the flows of vector fields in the collection, each applied for the corresponding time, takes the first point to the second. When the vector fields are smooth, the manifold is smooth and connected, and the collection satisfies: