1 Introduction & Related Work
The most general Neural ODEs are nonlinear dynamical systems of the form,
$$\frac{dx(t)}{dt} = f(x(t), t; \theta), \tag{1}$$
parameterized by $\theta$ and evolving over an input space $x(t) \in \mathbb{R}^n$. The observation that Euler integration of this ODE,
$$x(t+h) = x(t) + h\, f(x(t), t; \theta),$$
resembles residual blocks in ResNets establishes a simple but profound connection between the worlds of deep learning and differential equations (Chen et al., 2018; Haber and Ruthotto, 2017). The evolution of an initial condition $x(0)$ from $t = 0$ to $t = T$ is given by the integral expression,
$$x(T) = x(0) + \int_0^T f(x(t), t; \theta)\, dt.$$
The corresponding flow operator $\phi_T$ defined by,
$$\phi_T(x(0); \theta) = x(0) + \int_0^T f(x(t), t; \theta)\, dt,$$
is a parametric map from $\mathbb{R}^n$ to $\mathbb{R}^n$. As such, it provides a hypothesis space for function estimation in machine learning, and may be viewed as the continuous limit of ResNet-like architectures (He et al., 2016).
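The correspondence between Euler integration and residual blocks can be made concrete with a short sketch. The dynamics function `f` below (a single tanh layer) and the names `euler_flow`, `theta`, `T` and `steps` are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def f(x, t, theta):
    """Hypothetical dynamics: one tanh layer; theta = (W, b)."""
    W, b = theta
    return np.tanh(W @ x + b)

def euler_flow(x0, theta, T=1.0, steps=10):
    """Approximate the flow x(0) -> x(T) with fixed-step Euler integration."""
    h = T / steps
    x = x0.copy()
    for k in range(steps):
        # Each update has the residual form x <- x + h * f(x, t; theta).
        x = x + h * f(x, k * h, theta)
    return x

rng = np.random.default_rng(0)
n = 4
theta = (rng.normal(size=(n, n)), rng.normal(size=n))
x0 = rng.normal(size=n)
xT = euler_flow(x0, theta)
```

With `steps=1` and `T=1.0` the loop reduces to a single residual block, `x0 + f(x0, 0, theta)`.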
Reversible deep architectures enable a layer’s activations to be re-derived from the next layer’s activations, eliminating the need to store them in memory (Gomez et al., 2017). For a Neural ODE, which is by construction a reversible map, loss function gradients can be computed via the adjoint sensitivity method with constant memory cost independent of depth. This decoupling of depth and memory has major implications for applications involving large video and 3D datasets.
When time dependence is dropped from Eqn. 1, the system becomes autonomous (Khalil and Grizzle, 2002). Irrespective of the number of parameters, an autonomous Neural ODE cannot be a universal approximator since two trajectories cannot intersect, a consequence of each state $x$ being uniquely associated with a single derivative $f(x; \theta)$, with no time dependence. As a result, simple continuous, differentiable and invertible maps such as $g(x) = -x$ cannot be represented by the flow operators of autonomous systems (Dupont et al., 2019). Note that this is a price of continuity: residual blocks, which are discrete dynamical systems, can generate discrete points at unit-time intervals, side-stepping trajectory crossing.
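To see why the map $x \mapsto -x$ is out of reach for an autonomous flow, a one-dimensional argument may help. This is a sketch in the spirit of Dupont et al. (2019); the symbols $x_a$, $x_b$ and $d$ are introduced here only for illustration, and $f$ is assumed Lipschitz so that solutions are unique.

$$\begin{aligned}
&\dot{x} = f(x), \qquad x_a(0) = -1, \quad x_b(0) = 1, \\
&\text{target: } x_a(T) = 1, \quad x_b(T) = -1, \\
&d(t) = x_b(t) - x_a(t) \;\Rightarrow\; d(0) = 2 > 0, \quad d(T) = -2 < 0, \\
&\Rightarrow\; \exists\, t^\ast \in (0, T):\ d(t^\ast) = 0, \ \text{i.e. } x_a(t^\ast) = x_b(t^\ast).
\end{aligned}$$

Two distinct solutions would then pass through the same state at time $t^\ast$. In an autonomous system, uniqueness of solutions forces them to coincide from then on, which contradicts the targets at time $T$.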
For continuous systems, it is easy to see that allowing flows to be time-varying is sufficient to resolve this issue (Zhang et al., 2019). Such non-autonomous systems turn out to be universal and can equivalently be expressed as autonomous systems evolving on an extended input space with dimensionality increased by one. This idea of augmenting the dimensionality of the input space of an autonomous system was explored in (Dupont et al., 2019), which further highlighted the representational capacity limitations of purely autonomous systems. Despite the crucial role of time in Neural ODE approximation capabilities, the dominant approach in the literature is simply to append time to other inputs, giving it no special status. Instead, in this work, we:
- Introduce new, explicit constructions of non-autonomous Neural ODEs (NANODEs) of the form
$$\frac{dx(t)}{dt} = f(x(t); W(t; \phi)),$$
where hidden units are rich functions of time with their own parameters, $\phi$. This non-autonomous treatment frees the weights for each hidden layer to vary in complex ways with time $t$, allowing trajectories to cross (Sec. 2.3).
- Explore a flexible mechanism for varying expressiveness while adhering to a given memory limit (Sec. 4.1).
- Use the above framework to outperform previous Neural ODE variants and standard ResNet and CNN baselines on CIFAR classification and video prediction tasks.
2 Methods: Neural ODEs & Time
2.1 ResNets & Autonomous Neural ODEs
Consider the case of a standard Neural Network with hidden states given by $x_{k+1} = \sigma(W_k x_k)$. The linear transformation of each layer, $W_k x_k$, is a matrix multiplication between a weight matrix $W_k \in \mathbb{R}^{n \times n}$ and a vector $x_k \in \mathbb{R}^n$.
In a Deep Neural Network, the weight matrices are composed of scalar weights, $w_{ij}$. The hidden dynamics can be rewritten as $x_{k+1} = f(x_k; \theta_k)$, where $\theta_k$ are learned parameters encoding the weight matrices.
Unconstrained ResNet (Uncon-ResNet): Residual Neural Networks (ResNets) iteratively apply additive non-linear residuals to a hidden state:
$$x_{k+1} = x_k + f(x_k; \theta_k),$$
where $k = 0, \dots, K-1$ and $\theta_k$ are the parameters of each Residual Block. This can be viewed as a discretization of the Initial Value Problem (IVP):
$$\frac{dx(t)}{dt} = f(x(t), t; \theta), \qquad x(0) = x_0.$$
Constrained ResNet (Con-ResNet): In the Neural ODE used in the classification experiments of Chen et al. (2018), a function $f$ specifies the derivative and is approximated by a given Neural Network block. This block is defined independently of time, so weights are shared across steps. Through a dynamical systems lens, this Neural ODE approximates an autonomous nonlinear dynamical system. It is analogous to a Constrained ResNet with shared weights between blocks.
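The difference between the two baselines comes down to whether blocks share parameters. The sketch below illustrates this under simplifying assumptions: the residual function `block` is a single tanh layer, and the names `uncon_resnet` and `con_resnet` are hypothetical.

```python
import numpy as np

def block(x, W, b):
    """Hypothetical residual function: a single tanh layer."""
    return np.tanh(W @ x + b)

def uncon_resnet(x, params):
    """Unconstrained ResNet: a distinct (W, b) for every block."""
    for W, b in params:
        x = x + block(x, W, b)
    return x

def con_resnet(x, W, b, depth):
    """Constrained ResNet: one (W, b) shared by all blocks."""
    for _ in range(depth):
        x = x + block(x, W, b)
    return x

rng = np.random.default_rng(0)
n, depth = 3, 5
params = [(0.1 * rng.normal(size=(n, n)), np.zeros(n)) for _ in range(depth)]
x0 = rng.normal(size=n)

y_uncon = uncon_resnet(x0, params)
y_con = con_resnet(x0, *params[0], depth)
# Tying every block to the same weights recovers the constrained network.
y_tied = uncon_resnet(x0, [params[0]] * depth)
```

The unconstrained variant stores `depth` times as many parameters as the constrained one, which is the trade-off the time-varying constructions aim to interpolate.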
2.2 Non-Autonomous Neural ODEs - Time Appended State
By contrast to an autonomous system, consider the general non-autonomous system of the form $\frac{dx(t)}{dt} = f(x(t), t; \theta)$, where $\theta$ are the parameters. For simplicity, let us discuss the case where $f$ is specified by a single linear neural net layer with an activation function.
Recall that in an autonomous system there is no time dependence, so at each time $t$, $W(t) = W$, and $W(t_1) = W(t_2)$ for any $t_1 \neq t_2$.
Time Appended (AppNODE): In works by Chen et al. (2018) and Dupont et al. (2019), we see a limited variant of a non-autonomous system that one might call semi-autonomous, as time is simply added in: $\frac{dx(t)}{dt} = f([x(t), t]; \theta)$. In this case, the weights $\theta$ themselves do not vary with $t$. Each layer can take the node corresponding to $t$ and decide to use it to adjust other weights, but there is no explicit requirement or regularization to force or encourage the network to do this.
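A minimal sketch of the time-appended treatment follows, assuming a single dense tanh layer. The function name `app_node_dynamics` and the shapes are illustrative choices. The final line shows that, for such a layer, appending $t$ is the same as adding a bias that is linear in $t$.

```python
import numpy as np

def app_node_dynamics(x, t, W, b):
    """Time-appended dynamics: t is concatenated to the state as one more input."""
    xt = np.concatenate([x, [t]])      # shape (n + 1,)
    return np.tanh(W @ xt + b)         # W has shape (n, n + 1)

rng = np.random.default_rng(0)
n = 3
W = rng.normal(size=(n, n + 1))
b = np.zeros(n)
x = rng.normal(size=n)

d0 = app_node_dynamics(x, 0.0, W, b)
d1 = app_node_dynamics(x, 1.0, W, b)
# Equivalent view: the last column of W acts as a time-dependent bias t * W[:, -1].
d1_bias = np.tanh(W[:, :n] @ x + (b + 1.0 * W[:, -1]))
```

In this view the weights acting on the state, `W[:, :n]`, never change with time; only an additive offset does.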
Let us consider an alternative: making the weights themselves explicit functions of time. For clarity, everything hereon is our novel contribution unless otherwise noted.
2.3 Non-Autonomous ODEs - Weights as a Function of Time
As illustrated in Figure 1, in a Neural ODE, the discretization of the ODE solver is roughly analogous to depth in a standard Neural Network. This connection is most intuitive in the discrete, as opposed to adaptive, ODE solver case, where the integral in Equation 4 is approximated by a discretization:
$$x(T) \approx x(0) + \sum_{k=0}^{K-1} h\, f(x(t_k), t_k; \theta), \qquad t_k = kh, \quad h = T/K.$$
For a Non-Autonomous Neural ODE (NANODE), where weights are themselves functions of time, $W(t; \phi)$, with parameters $\phi$ (Figure 1), the question arises of what kinds of functions to use to specify $W(t; \phi)$. We consider the following framings to be natural to explore.
2.3.1 Bases for Time Varying Dynamics
Framed in terms of a dense network block $\sigma(W x)$, we can make the weight matrix a function of time, $W(t)$, by associating each element $W_{ij}(t)$ in the time-dependent weight matrix with a scalar function of time. This function can be defined by numerous bases.
Bucketed Time (B-NANODE): Here, we consider piecewise constant weights. We initialize a vector $\alpha_{ij} \in \mathbb{R}^d$ to represent each $W_{ij}(t)$ over $t \in [0, T]$ (i.e., fixing $d$ buckets). In the simplest case, if our discretization (depth) and $d$ are the same, then we can map each time $t$ to an index in $\alpha_{ij}$ to select distinct parameters over time. For $d$ smaller than the discretization, we can group parameters between successive times to have partial weight-sharing.
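The bucketing rule can be sketched as an index lookup. The helper names `bucket_index` and `bucketed_weight`, and the choice of equal-width buckets over $[0, T]$, are assumptions made for illustration.

```python
import numpy as np

def bucket_index(t, T, num_buckets):
    """Map a time t in [0, T] to one of num_buckets piecewise-constant segments."""
    idx = int(np.floor(t / T * num_buckets))
    return min(idx, num_buckets - 1)   # t == T falls in the last bucket

def bucketed_weight(t, T, weights):
    """Select the weight matrix for time t; weights has shape (num_buckets, n, n)."""
    return weights[bucket_index(t, T, len(weights))]

T, steps, num_buckets = 1.0, 8, 4
rng = np.random.default_rng(0)
weights = rng.normal(size=(num_buckets, 2, 2))
times = [k * T / steps for k in range(steps)]
indices = [bucket_index(t, T, num_buckets) for t in times]
```

With 8 solver steps and 4 buckets, consecutive pairs of steps share a weight matrix, which is the partial weight-sharing case described above.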
Polynomial (Poly-NANODE): We define $W_{ij}(t)$ as the output of a degree-$d$ polynomial with learned coefficients $\alpha_{ijk}$, i.e.
$$W_{ij}(t) = \sum_{k=0}^{d} \alpha_{ijk} P_k(t),$$
where $P_k$ is the monomial basis or a better conditioned basis like Chebyshev or Legendre polynomials. For the monomial basis, we have:
$$W_{ij}(t) = \sum_{k=0}^{d} \alpha_{ijk} t^k.$$
As we increase $d$, each function’s expressiveness increases, allowing $W_{ij}(t)$ to vary in complex ways over time.
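For the monomial basis, evaluating a time-varying weight matrix is a contraction of the coefficient tensor with the powers of $t$. The sketch below assumes a coefficient array `coeffs` of shape `(d + 1, n, n)`; the function name `poly_weight` is hypothetical.

```python
import numpy as np

def poly_weight(t, coeffs):
    """W(t) = sum_k coeffs[k] * t**k, with coeffs of shape (d + 1, n, n)."""
    powers = t ** np.arange(coeffs.shape[0])          # monomial basis 1, t, ..., t^d
    return np.tensordot(powers, coeffs, axes=1)       # shape (n, n)

rng = np.random.default_rng(0)
d, n = 3, 2
coeffs = rng.normal(size=(d + 1, n, n))
W0 = poly_weight(0.0, coeffs)
W_half = poly_weight(0.5, coeffs)
```

At $t = 0$ only the constant coefficient survives, so `W0` equals `coeffs[0]`.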
Note that using degree-1 polynomials is analogous to augmenting the state with the scalar value $t$. Augmenting the state in this way is therefore strictly less general than the above non-autonomous construction. If we reframe our non-autonomous system as an autonomous one by working with the extended state $z = [x, t]$ instead of $x$, simply letting $\dot{x} = f(x, t; \theta)$ and $\dot{t} = 1$, we arrive at the augmented case (Dupont et al., 2019). We note that trajectories can now cross over each other.
While this standard polynomial construction is intuitive, we find it difficult to train, for reasons given in Sec. 2.4. This motivates the trigonometric construction below.
Trigonometric Polynomials (T-NANODE): We define $W_{ij}(t)$ as a finite linear combination of basis functions $\cos(kt)$ and $\sin(kt)$, with $k \in \{0, \dots, d\}$ where $d$ is the degree:
$$W_{ij}(t) = \sum_{k=0}^{d} \left[\alpha_{ijk} \cos(kt) + \beta_{ijk} \sin(kt)\right],$$
with per-weight learnable coefficients $\alpha_{ijk}$ and $\beta_{ijk}$.
The proposed trigonometric polynomial scheme is also mathematically equivalent to learning random features of the familiar form:
$$W_{ij}(t) = \theta_{ij}^\top \varphi(t),$$
where $\varphi$ is a different feature map:
$$\varphi(t) = \cos(\omega t + b),$$
and $\omega$ and $b$ are $d$-dimensional random vectors drawn from appropriate Gaussian and uniform distributions respectively. When we add a regularizer by penalizing the L2-norm of $\alpha$ or $\beta$, we are dampening higher frequencies in a spectral decomposition of $W_{ij}(t)$. As $d$ increases and the regularization is lowered, we effectively have arbitrary weights at each time point, mimicking Unconstrained ResNets.
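A sketch of a trigonometric-polynomial weight matrix is given below. The frequency scaling $2\pi k / T$, the separate constant term `a0`, and the name `trig_weight` are assumptions for illustration and may differ from the exact parameterization in the paper. The quantity `bound` illustrates one practical property: the entries stay bounded for all $t$ by the sum of absolute coefficients.

```python
import numpy as np

def trig_weight(t, a0, a, b, T=1.0):
    """W(t) = a0 + sum_k a[k] cos(2 pi k t / T) + b[k] sin(2 pi k t / T)."""
    k = np.arange(1, a.shape[0] + 1)
    ang = 2.0 * np.pi * k * t / T
    return a0 + np.tensordot(np.cos(ang), a, axes=1) + np.tensordot(np.sin(ang), b, axes=1)

rng = np.random.default_rng(0)
d, n = 3, 2
a0 = rng.normal(size=(n, n))
a = rng.normal(size=(d, n, n))
b = rng.normal(size=(d, n, n))

W0 = trig_weight(0.0, a0, a, b)
W1 = trig_weight(1.0, a0, a, b)
# Entrywise bound that holds for every t, since |cos| and |sin| never exceed 1.
bound = np.abs(a0) + np.abs(a).sum(axis=0) + np.abs(b).sum(axis=0)
```

Under this scaling the weights are periodic with period `T`, so `W0` and `W1` coincide.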
Hypernetworks (Hyper-NANODE): As leveraged by Grathwohl et al. (2018) to construct continuous normalizing flows, we take inspiration from Hypernetworks (Ha et al., 2016) and define each $W_{ij}(t)$ itself to be the output of a neural net:
$$W_{ij}(t) = g([e_{ij}, t]; \phi),$$
where $e_{ij}$ (which represents a learned scalar or a vector state embedding for a given weight $W_{ij}$) is augmented with $t$ and passed to Neural Network $g$. Unsatisfyingly, perhaps, this method offers us no clear way to vary the expressiveness of the time-dependent dynamics.
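The hypernetwork idea can be sketched as a small shared network that maps a per-weight embedding and the time $t$ to a scalar weight. The two-layer tanh network, the embedding size, and the name `hyper_weight` are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def hyper_weight(t, emb, params):
    """Weight matrix W(t): each entry is g([e_ij, t]) for a shared two-layer net g."""
    W1, b1, w2, b2 = params
    n_out, n_in, _ = emb.shape
    # Append t to every per-weight embedding e_ij.
    inp = np.concatenate([emb, np.full((n_out, n_in, 1), t)], axis=-1)
    hidden = np.tanh(inp @ W1 + b1)        # shape (n_out, n_in, hidden_dim)
    return hidden @ w2 + b2                # shape (n_out, n_in)

rng = np.random.default_rng(0)
n, e, hdim = 3, 2, 8
emb = rng.normal(size=(n, n, e))
params = (rng.normal(size=(e + 1, hdim)), np.zeros(hdim), rng.normal(size=hdim), 0.0)
W_a = hyper_weight(0.0, emb, params)
W_b = hyper_weight(1.0, emb, params)
```

Note that nothing in this construction exposes a single knob, comparable to a polynomial degree, for controlling how strongly the weights may vary with time.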
In contrast, we could learn a gating mechanism that combines different potential hidden dynamics in a proportion defined by a sigmoid. This results in
with . This is inspired by the gating mechanism briefly discussed in (Chen et al., 2018) for Continuous Normalizing Flows, but is at the level of combining hidden kernels , rather than serving as a learned weighting on each fixed hidden unit of a single kernel.
2.4 Optimization over compact manifolds, and the design of bases for time-varying dynamics
In theory, any arbitrarily expressive basis can be used to model time-varying dynamics and parameterize the weights $W(t)$. However, when viewed in the context of parameterizing time/depth varying weights for a Neural Network, additional matters must be considered. Here we explore how the choice of basis interacts with the larger system and affects the optimization of $W(t)$.
Much work in deep learning has examined conditions under which neural network training is stable, and designed mechanisms to improve stability. This work spans the design of activation functions, initialization and regularization schemes, and other constrained optimization techniques (Helfrich et al., 2017; Glorot and Bengio, 2010; Miyato et al., 2018; Nair and Hinton, 2010). These constructions largely share the motivation of preventing vanishing and exploding gradients issues, introduced by Hochreiter (1991).
Gradient explosion arises when computation pushes the norm of the hidden state to increase exponentially fast with depth, and gradient vanishing when it decays exponentially fast with depth, depending on the spectrum of the weight matrices $W$. These effects hamper learning by impeding optimization methods’ ability to traverse down a cost surface and by disrupting credit assignment during backpropagation, respectively (Bengio et al., 1994). Orthogonal $W \in O(n)$ can alleviate these gradient norm challenges. Here $O(n) = \{W \in \mathbb{R}^{n \times n} : W^\top W = I_n\}$ is called the orthogonal group. Since orthogonal linear operators are $\ell_2$-norm preserving, the norm of the intermediate state gradient can be made approximately constant. Lezcano-Casado and Martínez-Rubio (2019) propose methods for preserving orthogonality while performing unconstrained optimization over Euclidean space by leveraging Lie group theory (Lee, 2012) and maps such as the exponential or Cayley transform (Helfrich et al., 2017). These studies demonstrate that neural networks without such constraints can suffer from vanishing/exploding gradients.
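The norm-preservation argument can be checked numerically on a purely linear chain. The setup below is an illustrative assumption: the same matrix is applied at every layer, and the scaled matrices `W_small` and `W_large` stand in for weights whose singular values sit below or above one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 16, 50
x = rng.normal(size=n)

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # a random orthogonal matrix
W_small = 0.5 * Q                              # all singular values 0.5
W_large = 1.5 * Q                              # all singular values 1.5

def norm_after(W, x, depth):
    """Norm of W^depth x, i.e. of a linear hidden state after depth layers."""
    for _ in range(depth):
        x = W @ x
    return np.linalg.norm(x)

r_orth = norm_after(Q, x, depth) / np.linalg.norm(x)
r_small = norm_after(W_small, x, depth) / np.linalg.norm(x)
r_large = norm_after(W_large, x, depth) / np.linalg.norm(x)
```

The orthogonal chain keeps the ratio at one, while the scaled chains shrink or grow by a factor of $0.5^{50}$ and $1.5^{50}$ respectively.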
The issue, therefore, with an arbitrary basis function is that we have no guarantees on the magnitude of the resulting weights $W(t)$. Also, higher-order variants of these functions, e.g. polynomials, can be very sensitive to small changes in $t$, expanding or contracting dramatically, as shown in Fig. 2.
With specifically crafted bases or coefficient scaling, we could get more complex and controlled behavior, but that would require careful engineering. More generally, we have guarantees that a matrix computed by a given basis will be well-conditioned after projecting it onto the orthogonal manifold.
2.5 Orthogonal Projection via Householder Reflections
We adopt the following scheme for orthogonal manifold projection. Given a set of unconstrained parameters and , we map them onto the orthogonal group, , with the following.
Lemma 1 (Mhammedi et al., 2017).
For define a function
is a Householder reflection matrix parameterized by . Then is a surjective function into .
We use as learnable parameters and take
for simplicity. The presented Householder reflection approach has been evaluated in the context of recurrent neural networks (Mhammedi et al., 2017) and normalizing flows (van den Berg et al., 2018).
We test this reparameterization scheme on a degree-4 Poly-NANODE and compare to standard batch-norm in Figure 3, finding that it significantly improves stability.
Although multiplying a vector by a single Householder reflection takes only $O(n)$ time, we find that sequential application of Householder reflections is slower in practice than an efficient matrix-vector product with unconstrained weights. Therefore, while the orthogonal reparameterization approach of Householder reflections resolves scale and conditioning issues (ensuring stability), trigonometric polynomials (which can be interpreted as a special case of direct optimization along the orthogonal manifold) are preferable.
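A sketch of the Householder parameterization is shown below. For simplicity it uses full-length reflection vectors rather than the exact construction of Lemma 1, and the names `householder`, `orthogonal_from_vectors` and `apply_reflections` are hypothetical. The last function applies the reflections one at a time without forming any matrix, which is the sequential computation discussed above.

```python
import numpy as np

def householder(u):
    """Householder reflection H(u) = I - 2 u u^T / ||u||^2."""
    u = u.reshape(-1, 1)
    return np.eye(u.shape[0]) - 2.0 * (u @ u.T) / (u.T @ u)

def orthogonal_from_vectors(U):
    """Product H(u_1) ... H(u_k) of reflections; U has shape (k, n)."""
    W = np.eye(U.shape[1])
    for u in U:
        W = W @ householder(u)
    return W

def apply_reflections(U, x):
    """Compute (H(u_1) ... H(u_k)) x in O(k n) time without forming any matrix."""
    for u in U[::-1]:                      # the rightmost reflection acts first
        x = x - 2.0 * u * (u @ x) / (u @ u)
    return x

rng = np.random.default_rng(0)
n = 5
U = rng.normal(size=(n, n))
W = orthogonal_from_vectors(U)
x = rng.normal(size=n)
```

The unconstrained vectors in `U` can be updated by any gradient method, and the resulting `W` stays orthogonal by construction.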
3 Theoretical Results
3.1 Trigonometric Polynomials and Evolution on Compact Manifolds
Equipped with different methods for constructing time-varying weights, we will now establish an interesting connection between some of these techniques and flows on compact manifolds.
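We need a few definitions. Recall that the orthogonal group is defined as $O(n) = \{M \in \mathbb{R}^{n \times n} : M^\top M = I_n\}$. The manifold $O(n)$ was already a key actor in the mechanism involving Householder reflections (see Sec. 2.5). Interestingly, it also appears in the context of the proposed parameterizations with trigonometric polynomials, as we show below.
Denote by $T_W(O(n))$ the linear space tangent to $O(n)$ at $W$ (see Lee (2012)). It can be proven that:
$$T_W(O(n)) = \{\Omega W : \Omega \in \mathrm{Sk}(n)\},$$
where $\mathrm{Sk}(n)$ denotes the space of $n \times n$ skew-symmetric matrices.
The space $T_W(O(n))$ can be interpreted as a local linearization of $O(n)$ in the neighborhood of $W$.
The geodesics on $O(n)$ passing through a fixed $W$ and tangent to $\Omega W$ are of the form $\gamma(t) = \exp(t\Omega) W$. Let $H_{i,j}$ for $i \neq j$ be defined as: $H_{i,j}[i,j] = -H_{i,j}[j,i] = 1$, and $H_{i,j}[k,l] = 0$ for other $(k,l)$. Geodesics corresponding to the canonical basis of $T_W(O(n))$ have a very special form, namely:
$$\exp(t H_{i,j}) W = G^{t}_{i,j} W,$$
where $G^{t}_{i,j}$ is the so-called Givens rotation: a matrix of the two-dimensional rotation in the subspace spanned by two canonical vectors $e_i$ and $e_j$ with angle $t$.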
Thus Givens rotations can be thought of as encoding curvilinear coordinates corresponding to the canonical basis of tangent spaces to .
That observation is a gateway to establishing the tight connection between time-varying weights encoded by trigonometric polynomials in our architectures and walks on compact manifolds. Notice that coordinate-aligned walks on can be encoded via products of Givens rotations as:
Now notice that for or , for and
equals the identity matrix on other entries.
Thus we conclude (using standard trigonometric formulae) that if the walk starts at , and then the entry of is of the form:
for some coefficients: .
Note that if we take and , then we get the formula for weights that we obtained through trigonometric polynomials. We see that our proposed mechanism leveraging trigonometric polynomials can be utilized to parameterize weight matrices evolving on the orthogonal group, where polynomial degree encodes number of steps of the walk on and time corresponds to step sizes along curvilinear axes (geodesics’ lengths). These observations extend to rectangular weight matrices by conducting analogous analysis on the Stiefel Manifold (Lee, 2012).
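The link between skew-symmetric generators and Givens rotations can be verified numerically. In the sketch below, the sign convention for `skew_basis` and `givens`, the truncated-series exponential `expm_series`, and the particular three-step `walk` are all assumptions chosen for illustration.

```python
import numpy as np

def skew_basis(n, i, j):
    """Canonical skew-symmetric basis element: +1 at (i, j), -1 at (j, i)."""
    Omega = np.zeros((n, n))
    Omega[i, j], Omega[j, i] = 1.0, -1.0
    return Omega

def expm_series(A, terms=30):
    """Matrix exponential via a truncated Taylor series (adequate for small ||A||)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def givens(n, i, j, theta):
    """Rotation by angle theta in the plane spanned by e_i and e_j."""
    G = np.eye(n)
    G[i, i] = G[j, j] = np.cos(theta)
    G[i, j], G[j, i] = np.sin(theta), -np.sin(theta)
    return G

n, t = 4, 0.7
G_exp = expm_series(t * skew_basis(n, 0, 2))
G_ref = givens(n, 0, 2, t)
# A short coordinate-aligned walk: a product of Givens rotations stays orthogonal.
walk = givens(n, 0, 1, t) @ givens(n, 1, 2, 2 * t) @ givens(n, 2, 3, 3 * t)
```

Every entry of `walk` is a sum of products of sines and cosines of multiples of `t`, which is why entries of such walks can be expressed as trigonometric polynomials in $t$.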
3.2 Stability of Neural ODEs with Time-Varying Weights
Analysis of gradient stability in the context of Neural ODEs with time-varying weights sheds light on the intriguing connection between stable NANODEs and flows on compact matrix manifolds, as we now discuss.
Consider a Neural ODE of the form introduced in Eqn. 1.
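Learning this Neural ODE entails optimizing a loss function summed over a collection of initial conditions,
$$L(\theta) = \sum_{i} \ell\big(x_i(T)\big),$$
where $i$ indexes the training data and $x_i(T)$ is the state reached from the $i$-th initial condition. At any final time $T$, by the chain rule, the per-example gradient is given by
$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial x(T)} \frac{\partial x(T)}{\partial \theta}.$$
As a function of time, the spectrum of the Jacobian of $x(t)$ with respect to $\theta$ dictates how much the gradient is amplified or diminished. Denoting
$$J(t) = \frac{\partial x(t)}{\partial \theta},$$
this Jacobian satisfies sensitivity equations (Khalil and Grizzle, 2002) given by the matrix differential equation,
$$\frac{dJ(t)}{dt} = \frac{\partial f}{\partial x}\big(x(t), t; \theta\big)\, J(t) + \frac{\partial f}{\partial \theta}\big(x(t), t; \theta\big). \tag{17}$$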
3.2.1 Linear Time Varying Analysis
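Let $z(t) = \partial x(t) / \partial \theta$. Then, we can write Eqn. 17 as,
$$\frac{dz(t)}{dt} = A(t) z(t) + B(t),$$
where $A(t) = \partial f / \partial x$ and $B(t) = \partial f / \partial \theta$, both evaluated along the trajectory (with some notation abuse). The solution to such a Linear Time Varying (LTV) system is given by (Kailath, 1980),
$$z(t) = \Phi(t, t_0) z(t_0) + \int_{t_0}^{t} \Phi(t, s) B(s)\, ds,$$
where $\Phi(t, t_0)$ is the associated state transition matrix (STM, Kailath, 1980). For sensitivity equations, $z(t_0) = 0$. Hence, we are interested in the spectrum of,
$$\int_{t_0}^{T} \Phi(T, s) B(s)\, ds. \tag{18}$$
As $T \to \infty$, if $\|\Phi\| \to 0$ we experience vanishing gradients and if $\|\Phi\| \to \infty$, exploding gradients.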
3.2.2 The State Transition Matrix
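The STM, $\Phi(t, t_0)$, is the solution to the matrix differential equation,
$$\frac{d\Phi(t, t_0)}{dt} = A(t) \Phi(t, t_0), \qquad \Phi(t_0, t_0) = I.$$
In general, there is no analytical expression for the STM. Under some conditions, though, we have some simplifications. Suppose $A(t)$ commutes with $\int_{t_0}^{t} A(s)\, ds$. Then we have
$$\Phi(t, t_0) = \exp\left(\int_{t_0}^{t} A(s)\, ds\right).$$
This is true when $A(t)$ is diagonal or when $A(t_1)$ and $A(t_2)$ commute for all $t_1, t_2$. See Kailath (1980) for details.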
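The commuting case can be checked numerically on a small example. The sketch below assumes a hypothetical coefficient matrix `A(t) = cos(t) * A0` with a fixed generator `A0`, so that all `A(t)` commute, and integrates the STM equation with classical RK4 before comparing against the closed-form exponential.

```python
import numpy as np

A0 = np.array([[0.0, 1.0], [-1.0, 0.0]])       # fixed skew-symmetric generator

def A(t):
    """A(t) = cos(t) * A0; all A(t) commute because they share the generator A0."""
    return np.cos(t) * A0

def stm_rk4(A, T, steps=1000):
    """Integrate dPhi/dt = A(t) Phi, Phi(0) = I, with classical RK4."""
    h, Phi = T / steps, np.eye(2)
    for k in range(steps):
        t = k * h
        k1 = A(t) @ Phi
        k2 = A(t + h / 2) @ (Phi + h / 2 * k1)
        k3 = A(t + h / 2) @ (Phi + h / 2 * k2)
        k4 = A(t + h) @ (Phi + h * k3)
        Phi = Phi + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return Phi

T = 2.0
Phi_num = stm_rk4(A, T)
s = np.sin(T)                                   # integral of cos(t) over [0, T]
# Closed form exp(s * A0): a planar rotation by angle s.
Phi_closed = np.array([[np.cos(s), np.sin(s)], [-np.sin(s), np.cos(s)]])
```

Because `A0` is skew-symmetric in this example, the resulting STM is also a rotation, so its spectral norm stays at one for every `T`.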
3.2.3 Time-varying Neural ODEs
With the machinery developed in previous sections, we are finally ready to turn our attention back to NANODEs. Let us consider Neural ODEs of the form,
Even though analysis of the spectrum of matrix from Eq. 18 is a challenging problem, we make several observations. We conjecture that constructing time-dependent weights in such a way that corresponding matrices belong to spaces tangent in to certain compact matrix manifolds (those tangent spaces are also called Lie algebras if the corresponding manifolds are Lie groups (Lee, 2012)) helps to stabilize.
belongs to the linear space of skew-symmetric matrices (Hairer, 1999). This implies in particular that from Eq. 18 is bounded. Therefore gradients do not explode if is bounded. We leave further analysis of the connections between time-varying parameterizations and stability to future work.
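The boundedness claim can be checked with a standard calculation, sketched here under the assumption that the coefficient matrix $A(t)$ of the linear system is skew-symmetric for every $t$ and that $\Phi(t, t_0)$ denotes its state transition matrix. The symbols are chosen for this sketch and are not taken verbatim from the source.

$$\begin{aligned}
&\frac{d\Phi}{dt} = A(t)\Phi, \qquad \Phi(t_0, t_0) = I, \qquad A(t)^\top = -A(t), \\
&\frac{d}{dt}\big(\Phi^\top \Phi\big) = \Phi^\top A(t)^\top \Phi + \Phi^\top A(t) \Phi = \Phi^\top \big(A(t)^\top + A(t)\big) \Phi = 0, \\
&\Rightarrow\; \Phi(t, t_0)^\top \Phi(t, t_0) = I \ \text{ for all } t, \quad \text{so } \|\Phi(t, t_0)\|_2 = 1.
\end{aligned}$$

The STM is therefore orthogonal, and an integral of the STM against a bounded forcing term over a finite horizon is itself bounded.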
4 Experiments
We conduct a suite of experiments to demonstrate the broad applicability of our NANODE approach.
4.1 Image Classification
We first consider the task of image classification using residual flow architectures defined by successive bottleneck residual blocks. As baselines, we trained two ResNet variants: 1) Uncon-ResNet and 2) Con-ResNet (described in Section 2.1). Uncon-ResNet is a standard ResNet architecture where the weights of each ResNet block are not tied to the weights of other ResNet blocks. Con-ResNet is a ResNet architecture where all ResNet blocks are constrained to use the same set of parameters, resembling an autonomous Neural ODE, where the weights at each step are fixed (see Figure 1).
In addition to these baselines, we train several NANODE variants, shown in Figure 5. Each NANODE has bases of varying orders parameterizing its hidden unit dynamics. Our experiments demonstrate that by making the hidden unit dynamics non-autonomous, we can retain much of the memory benefit of an autonomous ODE (Auto) while achieving performance comparable to that of an Unconstrained ResNet. Furthermore, the memory efficiency benefits granted via the adjoint method allow us to train models significantly "deeper" than the Unconstrained ResNets and outperform them, as shown in Table 1.
| Model | CIFAR10 Acc (%) | CIFAR100 Acc (%) | Act. Mem (GB) | Param Mem (GB) |
In Figure 5, we show how we can leverage the order of the basis to vary the NANODE’s representational capacity. This allows us to elegantly trade off between expressiveness and parameter-efficiency. It is worth noting that parameters typically require far less memory than activations in the CNN or ResNet context. For a reversible architecture such as our NANODEs, with activation memory complexity $O(1)$ in depth, we only need to store our parameters and the activations of a single block. However, a standard ResNet, with activation memory complexity $O(L)$ in the number of layers $L$, must store activations for all layers. As shown in Table 1, this means that, for a given memory budget, we can train much wider and deeper neural networks.
In Figure 5, we also compare Bucket and Trigonometric time treatments, the two best performing variants. We find that the trigonometric treatment outperforms the piecewise Bucket treatment for orders less than the discretization, suggesting the benefits of smoothness. Note that we also experimented with Hypernetwork variants, specified in Eqn. 11 and 12, but observed no performance gains over autonomous Neural ODEs, as discussed further in the supplementary materials.
4.2 Video Prediction
Leveraging this memory scaling advantage, we consider the problem of video prediction, whereby a model is tasked with generating future frames conditioned upon some initial observation frames. In deterministic settings, e.g. an object sliding with a fixed velocity, a model has to infer the speed and direction from the prior frames to accurately extrapolate. Standard video prediction models can be memory intensive to train. Video tensor activations must contain an additional time dimension that scales linearly as the number of conditioning or generation frames increases. This makes it difficult to simultaneously parameterize powerful models with many filters, and learn long-horizon dynamics.
| Model | MNIST (-ELBO) | BRP (-ELBO) | Act. Mem (GB) | Param Mem (GB) |
Experiments are conducted using the Moving MNIST (Srivastava et al., 2015) and BAIR Robot Pushing Small (Finn et al., 2016) video datasets. We train an Encoder-Decoder Stochastic Video Generation (SVG) model with a learned prior, based on Denton and Fergus (2018). The baseline architecture contains VGG blocks (Simonyan and Zisserman, 2014) of several dimension-preserving CNN blocks with filters and stride, followed by max-pooling operations with stride . For our NANODE alternative, we replace 2 out of every 3 CNN layers with a single NANODE block built on a trigonometric polynomial basis of discretization and order 3. We train this reversible model on a single GPU, and are able to achieve faster convergence and lower final loss on both tasks compared to the more memory-intensive baseline. Figure 6 and Table 2 illustrate the minimum loss achieved and the memory usage of the respective architectures. There is much potential to further improve these video prediction architectures by pairing ideas regarding optimal width vs. depth scaling (Tan and Le, 2019) with the arbitrary depth scaling ability and expressiveness that NANODEs provide.
5 Conclusion
This work explores various constructions of non-autonomous Neural ODEs (NANODEs). Treating the weights of these Neural ODEs as well-conditioned functions of time enables the weights to vary in complex ways over the course of integration. The class of non-autonomous Neural ODEs presented in this work is strictly more expressive than autonomous ODEs with fixed weights at every time point, while enjoying the same small memory footprint. We have also shown that with specific constructions of time-dependent weight matrices, the performance of non-autonomous ODEs can match that of equivalent unconstrained ResNets, and even exceed it in memory-constrained environments.
Furthermore, we discover an intriguing connection between the stability of NANODEs and flows on compact manifolds. We suggest in future work it would be interesting to further explore designing stable NANODE architectures by leveraging techniques from matrix manifold theory. We also note the potential to apply NANODEs to larger scale video prediction and 3D tasks as a promising direction.
References
- Bengio et al. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (2), pp. 157–166.
- Chen et al. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
- Denton and Fergus (2018). Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687.
- Dupont et al. (2019). Augmented neural ODEs. In Advances in Neural Information Processing Systems, pp. 3134–3144.
- Finn et al. (2016). Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pp. 64–72.
- Glorot and Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- Gomez et al. (2017). The reversible residual network: backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224.
- Grathwohl et al. (2018). FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.
- Ha et al. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106.
- Haber and Ruthotto (2017). Stable architectures for deep neural networks. Inverse Problems 34 (1), pp. 014004.
- Hairer (1999). Numerical geometric integration.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Helfrich et al. (2017). Orthogonal recurrent neural networks with scaled Cayley transform. arXiv preprint arXiv:1707.09520.
- Hochreiter (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München 91 (1).
- Kailath (1980). Linear systems. Vol. 156, Prentice-Hall, Englewood Cliffs, NJ.
- Khalil and Grizzle (2002). Nonlinear systems. Vol. 3, Prentice Hall, Upper Saddle River, NJ.
- Lee (2012). Introduction to smooth manifolds. 2nd revised ed. Vol. 218.
- Lezcano-Casado and Martínez-Rubio (2019). Cheap orthogonal constraints in neural networks: a simple parametrization of the orthogonal and unitary group. arXiv preprint arXiv:1901.08428.
- Mhammedi et al. (2017). Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2401–2409.
- Miyato et al. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
- Nair and Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Srivastava et al. (2015). Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- van den Berg et al. (2018). Sylvester normalizing flows for variational inference.
- Zhang et al. (2019). Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998.