Spline parameterization of neural network controls for deep learning

Based on the continuous interpretation of deep learning as an optimal control problem, this paper investigates the benefits of employing B-spline basis functions to parameterize neural network controls across the layers. Rather than equipping each layer of a discretized ODE-network with a set of trainable weights, we choose a fixed number of B-spline basis functions whose coefficients are the trainable parameters of the neural network. Decoupling the trainable parameters from the layers of the neural network enables us to investigate and adapt the accuracy of the network propagation separately from the learning problem. We show numerically that the spline-based neural network increases robustness of the learning problem with respect to hyperparameters, due to increased stability and accuracy of the network propagation. Further, training on B-spline coefficients rather than layer weights directly enables a reduction in the number of trainable parameters.



1. Introduction

Advances in algorithms and computational resources have recently fueled many practical applications of deep learning, including classical machine learning areas such as image recognition, classification or segmentation [he2016deep, ronneberger2015u] or natural language processing [goldberg2017neural], as well as applications in the new and rapidly evolving field of scientific machine learning [rackauckas2020universal, HornungEtAl2020, HeyEtAl2020, spears2018deep, roscher2020explainable]. While successful applications are reported frequently, advancements on the theoretical side that provide guidance to design and steer new deep learning applications are only at an early stage. Reaching a desired accuracy for new applications is often a delicate and cumbersome task, requiring domain experts to choose between various network architectures and learning models, a process that often involves hundreds of training runs to scan over various hyperparameters, such as network width and depth, type of network layers, learning rate and adaptation, parameter initialization, etc. The need for a theoretical foundation that provides guidance for designing new deep learning applications is of particular importance in scientific machine learning, which often demands greater accuracy and robustness [raissi2019physics].

A promising pathway to build upon a theoretical baseline is provided by the recently made interpretation of feed-forward residual networks as discretizations of nonlinear dynamical systems [weinan2017proposal, benning2019deep]. Instead of considering a neural network as a concatenation of discrete network layers whose parameters are learned during training, the continuous interpretation takes the infinitesimal limit of residual layers to yield a network that is described by a system of nonlinear ordinary differential equations (ODEs), and is driven by a continuous control function:

 (1) ∂x(t)/∂t = f(x(t), θ(t)), ∀t ∈ (0,T).

Here, x(t) denotes the propagated state of the input data, with x(0) = x_data, at time t ∈ [0,T], and the final time T is associated with the depth of the network. θ(t) denotes a control function that represents the network parameters and is learned during training. A standard choice for the right-hand side is f(x(t), θ(t)) = σ(W(t)x(t) + b(t)), where σ is an activation function applied elementwise (tanh, ReLU, etc.), W(t) is a linear transformation matrix representing e.g. a dense parameterized matrix or a convolution operation, and b(t) is a bias vector. We write the control function as θ(t) = (W(t), b(t)). The continuous ODE network models the derivative of the network state in the right-hand side of an ODE, casting learning into an optimal control problem that optimizes for the control function to achieve the desired network output:

 (2) minimize_θ(t)  ℓ(x(T), y_data) + γ ∫_0^T R(θ(t)) dt
 (3) subject to  ∂x(t)/∂t = σ(W(t)x(t) + b(t)), ∀t ∈ (0,T),
 (4)             x(0) = x_data,

where ℓ denotes a loss function that measures the mismatch between the predicted network output x(T) and the truth y_data for given input-output data (x_data, y_data), and R denotes a regularization term with parameter γ.

To solve the optimal control learning problem, the temporal domain is discretized into N distinct time steps (layers), typically equally spaced with t_i = ih for a step size h > 0, where the network states and parameters are approximated numerically as x_i ≈ x(t_i) and θ_i ≈ θ(t_i), using a numerical time integration scheme. For example, a forward Euler time integration scheme to discretize (1) on a uniformly spaced grid yields

 (5) x_{i+1} = x_i + h σ(W_i x_i + b_i), ∀i = 0,…,N−1,

with step size parameter h = T/N. In this case, choosing h = 1 yields the standard ResNet architecture [he2016deep]. Each time step corresponds to one layer of a network with depth N and layer spacing h. Each layer is associated with one set of network parameters θ_i = (W_i, b_i) that has to be learned during training. Hence, increasing the accuracy of the time stepping scheme, for example by reducing the step size h and increasing N accordingly, results in a linear increase in the number of trainable parameters, increasing the dimensionality and complexity of the optimization problem. On the other hand, larger time step sizes may lead to unstable forward propagation, with eigenvalues outside of the stability region of the chosen discretization scheme. When each layer (at time step t_i) is coupled to a set of trainable parameters θ_i, finding an appropriate step size that ensures stable forward (and backward) propagation is therefore challenging.
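As an illustration, the layer-coupled discretization (5) can be sketched in a few lines of NumPy; the depth, width, and random weights below are placeholders, not values from the paper.

```python
import numpy as np

def forward_euler_resnet(x0, Ws, bs, h, act=np.tanh):
    """Propagate x_{i+1} = x_i + h * act(W_i x_i + b_i) through the layers."""
    x = x0
    for W, b in zip(Ws, bs):
        x = x + h * act(W @ x + b)
    return x

rng = np.random.default_rng(0)
N, n = 10, 4                 # depth N and state dimension n (placeholders)
h = 1.0 / N                  # uniform step size h = T/N with T = 1
Ws = [0.1 * rng.standard_normal((n, n)) for _ in range(N)]  # one W_i per layer
bs = [0.1 * rng.standard_normal(n) for _ in range(N)]       # one b_i per layer
x_out = forward_euler_resnet(rng.standard_normal(n), Ws, bs, h)
```

Note that the parameter count grows linearly with N here; it is exactly this coupling between discretization and trainable parameters that the spline parameterization removes.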

The main intent of this paper is to explore the benefits of decoupling the discretization of the ODE network states from the parameterization of the network control function θ(t). We propose a spline-based network that parameterizes the control by a set of fixed basis functions:

 (6) θ(t) = ∑_{l=−d}^{L−1} α_l B_l^d(t)

whose coefficients α_l are the trainable parameters of the network. We choose B_l^d to be a B-spline basis function of degree d as defined in Section 2. Instead of equipping each layer at t_i with a set of parameters θ_i that are learned during training, the spline-based network is trained for the set of coefficients {α_l}, where L is defined as the number of “time knots” used to construct the B-spline bases. The number of coefficients can be significantly smaller than the number of layers (i.e., the number of time steps) and, most importantly, can be chosen independently of the time integration scheme used to discretize (1). In fact, a given set of network parameters α_l, for example those from a previous training run, can be readily evaluated on other network architectures. The ability to evaluate the network control at any point t for a fixed set of network parameters allows one to investigate, and potentially increase, the accuracy of the discretized network propagation, e.g. by re-discretizing (1) with an integration scheme of higher order, or by choosing smaller, or adaptive, time step sizes. While the latter entails an increase in the number of layers N, favoring very deep networks for accuracy, it does not increase the number of trainable parameters, which instead is independent of N and can be tuned to account for a desired approximation power of the network. Further, parameterizing the network control with spline basis functions achieves smoothness of the network parameters across layers by design, rendering additional regularization terms on smoothness of the parameters, such as minimizing the derivative of θ across layers as in [haber2017stable], unnecessary. Instead, the degree d of the chosen basis functions controls the regularity of θ(t).
Avoiding discontinuities in the controls contributes to more accurate and stable forward (and backward) propagation through the discretized network, such that small perturbations of the network input yield bounded perturbations of the network output - a requisite for successful training, where controls are updated based on the network output sensitivities. We expect, and numerically verify, that the spline-based network therefore exhibits greater robustness with respect to hyperparameters, allowing a greater range of hyperparameters that yield successful training.
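To make the decoupling concrete, the following sketch evaluates a scalar control θ(t) from one fixed set of B-spline coefficients on two different layer grids using SciPy; the clamped knot vector and all sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.interpolate import BSpline

T, L, d = 1.0, 5, 2                  # final time, knot intervals, degree (assumed)
knots = np.linspace(0.0, T, L + 1)   # time knots tau_0, ..., tau_L
# clamped knot vector: repeat the end knots d extra times
tvec = np.concatenate([np.zeros(d), knots, np.full(d, T)])
alpha = np.random.default_rng(1).standard_normal(L + d)  # L + d trainable coefficients
theta = BSpline(tvec, alpha, d)      # theta(t) = sum_l alpha_l B_l^d(t)

# the same coefficients serve a 10-layer and an 80-layer discretization
theta_10 = theta(np.linspace(0.0, T, 10, endpoint=False))
theta_80 = theta(np.linspace(0.0, T, 80, endpoint=False))
```

Refining the layer grid from 10 to 80 steps changes only where the control is sampled; the number of trainable coefficients stays fixed.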

B-spline basis functions, which have seen success when applied to the isogeometric analysis (IGA) of numerical partial differential equations [Cottrell2009], provide a natural parameterization for the control function for a number of reasons. Firstly, the spline basis has local support, which is essential for computational efficiency. Additionally, the lowest-order (piecewise linear) B-splines can be used to recover a standard residual network if the spline knots are chosen to be coincident with the time step points, and increasing the degree of the B-spline functions increases the regularity of the control function, allowing the smoothness of the parameterization to be considered as a hyperparameter. Finally, the spline basis is hierarchical with respect to the polynomial degree, and thus is well-suited for adaptive methods and local refinement, a subject of future investigation.

1.1. Related work

In [queiruga2020continuous], a family of continuous-in-depth generalizations of residual networks was considered. These networks, termed ContinuousNets, can be integrated using any standard time integration method, e.g. Runge–Kutta methods, and can replace ResNet blocks in problem-specific neural architectures.

In [massaroli2020dissecting], Massaroli and colleagues considered two discretization approaches for the optimization problem posed in infinite-dimensional functional space. One method, referred to in that work as a spectral or Galerkin discretization, is based on parameterizing the control function with an orthogonal basis, and then truncating the resulting series to obtain a finite-dimensional problem. An alternative method considered in the same work is the piecewise-constant approximation of the control function, which is referred to as stacked neural ODEs.

2. SpliNet: Weight parameterization using spline basis functions

In order to decouple the trainable network parameters from the time-domain discretization of the ODE-based network (1), we introduce a class of networks, termed “SpliNet”, that parameterizes the network weights and biases using B-spline basis functions:

 (7) W(t) = ∑_{l=−d}^{L−1} ω_l B_l^d(t),
 (8) b(t) = ∑_{l=−d}^{L−1} β_l B_l^d(t),

where the coefficients ω_l and β_l are the trainable parameters of the neural network. ω_l and β_l have the same dimensions as W(t) and b(t), respectively, and the B_l^d's are fixed one-dimensional B-spline basis functions of degree d as defined below.

2.1. Definition of the B-spline basis functions

To define the one-dimensional B-spline basis functions, consider an equally spaced time grid of knots {τ_l}. Each B-spline basis function B_l^d is defined as a polynomial of degree d with local support in [τ_l, τ_{l+d+1}) whose first d−1 derivatives are continuous across the knots. The basis functions can be constructed recursively (cf. the Cox–de Boor recursion formula [kincaid2009numerical]), starting with the degree-zero B-spline B_l^0, defined as the indicator function on [τ_l, τ_{l+1}):

 (9) B_l^0(t) = 1 if t ∈ [τ_l, τ_{l+1}), and 0 otherwise.

The higher-degree B-splines are defined by the recursion

 (10) B_l^d(t) = ((t − τ_l)/(τ_{l+d} − τ_l)) B_l^{d−1}(t) + ((τ_{l+d+1} − t)/(τ_{l+d+1} − τ_{l+1})) B_{l+1}^{d−1}(t)

for d ≥ 1. Figure 1 shows examples of B-spline basis functions of degree d defined on [0,T]. The resulting entries of the weight and bias functions are piecewise polynomials of degree d whose derivatives at the knots are continuous up to the (d−1)-th derivative.
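The recursion (9)-(10) translates directly into code. The naive sketch below (uniform placeholder knots) is for exposition only; a library implementation would evaluate the basis far more efficiently.

```python
import numpy as np

def bspline_basis(l, d, t, tau):
    """Evaluate B_l^d(t) via the Cox-de Boor recursion on knots tau[j] = tau_j."""
    if d == 0:
        return 1.0 if tau[l] <= t < tau[l + 1] else 0.0
    left = (t - tau[l]) / (tau[l + d] - tau[l])
    right = (tau[l + d + 1] - t) / (tau[l + d + 1] - tau[l + 1])
    return (left * bspline_basis(l, d - 1, t, tau)
            + right * bspline_basis(l + 1, d - 1, t, tau))

# sanity check: away from the boundary, the d+1 active degree-d B-splines
# form a partition of unity
tau = np.arange(-3.0, 9.0)   # equally spaced knots (placeholder grid)
d, t = 2, 2.5
s = sum(bspline_basis(l, d, t, tau) for l in range(len(tau) - d - 1))
```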

2.2. Network propagation using spline-based parameters

Choosing a finite number of basis functions to parameterize the network control reduces the infinite dimensional learning problem of finding functions W(t), b(t) as in (2) to the finite dimensional problem of learning the coefficients ω_l, β_l by solving

 (11) minimize_{ω_l, β_l, l=−d,…,L−1}  ℓ(x(T), y_data) + γ ∫_0^T R(θ(t)) dt
 (12) subject to  ∂x(t)/∂t = σ(∑_{l=−d}^{L−1} (ω_l x(t) + β_l) B_l^d(t)), ∀t ∈ (0,T),
 (13)             x(0) = x_data.

To train and evaluate a SpliNet, a temporal integration scheme for the network ODE needs to be chosen, yielding, for example, (5) in the case of a forward Euler time integration scheme. Each discrete time step t_i then corresponds to one layer of the discretized neural network, whose weights W_i = W(t_i) and biases b_i = b(t_i) are computed by summing over the coefficients multiplied by the spline basis functions evaluated at t_i. Since each basis function B_l^d has local support in [τ_l, τ_{l+d+1}), only d+1 basis functions are non-zero at any time step t_i. In particular, given a time point t_i which lies in the interval [τ_{k0}, τ_{k0+1}) for one specific k0, only B_{k0−d}^d, …, B_{k0}^d are non-zero and contribute to the evaluation of the weights and biases at t_i:

 (14) W(t_i) = ∑_{l=k0−d}^{k0} ω_l B_l^d(t_i), for t_i ∈ [τ_{k0}, τ_{k0+1}),
 (15) b(t_i) = ∑_{l=k0−d}^{k0} β_l B_l^d(t_i), for t_i ∈ [τ_{k0}, τ_{k0+1}),

for the current network control parameters ω_l, β_l. After assembling the weight matrix W_i and bias b_i at t_i, the corresponding i-th layer can be readily applied to the network states x_i. Figure 2 depicts an example of a discretized neural network of dense layers of fixed width, where each entry of the weight matrix at each layer results from evaluating a spline function that represents that weight over time.
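Putting (5) and (14)-(15) together, a SpliNet forward pass can be sketched as follows. The clamped uniform knots, the sizes, and the random coefficients are illustrative assumptions, and for clarity the full basis is evaluated at each layer rather than only the d+1 locally supported terms.

```python
import numpy as np
from scipy.interpolate import BSpline

T, L, d, N, n = 1.0, 4, 2, 16, 3     # assumed: knot intervals, degree, layers, width
tvec = np.concatenate([np.zeros(d), np.linspace(0.0, T, L + 1), np.full(d, T)])
n_coef = L + d                       # number of coefficient pairs (omega_l, beta_l)
rng = np.random.default_rng(2)
omega = 0.1 * rng.standard_normal((n_coef, n, n))  # spline coefficients of W(t)
beta = 0.1 * rng.standard_normal((n_coef, n))      # spline coefficients of b(t)

h = T / N
x = rng.standard_normal(n)
for i in range(N):
    ti = i * h
    # B_l^d(ti) for all l; only d+1 entries are non-zero by local support
    Bvals = np.nan_to_num([
        BSpline.basis_element(tvec[l:l + d + 2], extrapolate=False)(ti)
        for l in range(n_coef)
    ])
    Wi = np.tensordot(Bvals, omega, axes=1)  # W(ti) = sum_l B_l^d(ti) omega_l
    bi = Bvals @ beta                        # b(ti) = sum_l B_l^d(ti) beta_l
    x = x + h * np.tanh(Wi @ x + bi)         # forward Euler layer, cf. (5)
```

Changing N only changes how often the fixed spline control is sampled; omega and beta keep the same shape.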

A spline-based neural network can be integrated with modern deep learning software such as PyTorch [NEURIPS2019_9015] or TensorFlow [tensorflow2015-whitepaper], with the only additional step being the summation over coefficients for the weight and bias elements before applying the layer. Software libraries for B-spline basis functions are readily available, e.g. through the Python-based SciPy package [2020SciPy-NMeth], among others. Similar to standard ODE networks that parameterize each time step with a set of network parameters, a SpliNet can serve as a block within a larger network composed of sub-networks of different architectures.

Remark 1.

When the B-spline knots align with the time-discretization points (layers), we note that each first-order basis function B_l^1 equals one at exactly one time point and vanishes at all others, so that each layer's weights and biases coincide with a single coefficient pair (ω_l, β_l). Therefore, a degree-one SpliNet with as many coefficients as layers is equivalent to a standard ODE network. Choosing h = 1 further yields a ResNet architecture.

The SpliNet allows one to investigate and adapt the accuracy and stability of the network propagation for a fixed set of network parameters. Accuracy can be increased by reducing the time step size h and increasing the number of time steps N. Concerning stability, it is natural to require that the network propagation steps lie in the stability region of the chosen time integration scheme. For a forward Euler discretization, the stability condition is given by

 (16) Λ(h diag[σ′(W(t_i)x(t_i) + b(t_i))] W(t_i)) ⊂ {z ∈ ℂ : |1 + z| ≤ 1}, ∀i = 0,…,N,

where Λ(·) denotes the spectrum of the operator. In [haber2017stable], Haber and Ruthotto proposed several variations of ODE-based neural networks which enforce (16) on the continuous level through constraints on W(t). For example, replacing W(t) by its anti-symmetric part (W(t) − W(t)^⊺)/2 yields Jacobians whose eigenvalues are purely imaginary, and thus the continuous dynamics are stable. Furthermore, a negative multiple of the identity can be added to the weights to ensure that the eigenvalues have negative real parts. For the discretized problem, a SpliNet allows one to adapt the step size h to scale the eigenvalues into the stability region, for fixed network controls. As an example, in Figure 3 we parameterize an anti-symmetric weights network using SpliNet and show the distribution of eigenvalues at all layers of a trained network for the test case of Section LABEL:sec:sine_example. Furthermore, in Section LABEL:sec:numerical, a numerical experiment (see Figure LABEL:fig:sine_dttest) is conducted to verify the stability and accuracy of SpliNet as the step size tends to zero.
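The step-size rescaling can be sketched numerically. The tanh activation, the shifted anti-symmetric weights, and all sizes below are assumptions for illustration; halving h must terminate here because the Jacobian eigenvalues have strictly negative real parts for this construction.

```python
import numpy as np

def euler_stable(W, x, b, h):
    """Check condition (16): spectrum of h*diag(sigma'(Wx+b))*W inside |1+z| <= 1."""
    dsigma = 1.0 - np.tanh(W @ x + b) ** 2          # derivative of tanh
    J = h * np.diag(dsigma) @ W
    return bool(np.all(np.abs(1.0 + np.linalg.eigvals(J)) <= 1.0))

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
W = 0.5 * (A - A.T) - 0.5 * np.eye(n)  # anti-symmetric part minus a multiple of I
x, b = rng.standard_normal(n), rng.standard_normal(n)

h = 1.0
while not euler_stable(W, x, b, h):    # shrink the step size into the stability region
    h /= 2.0
```

With a fixed spline control, shrinking h this way only refines the layer grid; it does not change the trainable coefficients.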

Remark 2.

We note that while [haber2017stable] has discussed the benefits of stable forward propagation, numerical results suggest that for some problems, successful training with ODE networks can be obtained even for operators whose spectrum is not entirely contained in the stability region of the time integration scheme. In the test cases considered in Section LABEL:sec:numerical, we do not enforce this property using the antisymmetric weight matrices, and instead allow for general weight matrices.

2.3. Backpropagation using spline-based parameters

To solve the training optimization problem, backpropagation is typically employed to compute the gradient of the loss ℓ with respect to the trainable parameters ω_l, β_l and perform a gradient step to update the parameters. Backpropagation is equivalent to the adjoint method commonly used in ODE/PDE-constrained optimization, where an adjoint differential equation is solved backwards in time, propagating derivatives through the time domain. In the discretized setting, the adjoint method accumulates derivatives backwards from the network output and loss evaluation to the network input using the chain rule. Here, we briefly discuss how the adjoint method (backpropagation) is used to compute the gradient with respect to the B-spline coefficients, i.e. the actual trainable parameters in a SpliNet. We derive the gradient for network inputs that are vectors; however, a similar derivation can be obtained for tensor inputs.

Let Φ denote a general layer-to-layer transformation, i.e. the right-hand side of the time-discretized ODE, such as, for a forward Euler discretization, Φ(x_i; W_i, b_i) = x_i + hσ(W_i x_i + b_i). The adjoint method equips each state x_i with an adjoint state z_i that is given by partial derivatives of the loss with respect to the network states:

 (17) z_N = (∂ℓ/∂x_N)(x_N, y_data)^⊺,
 (18) z_i = (∂Φ/∂x_i)(x_i; W_i, b_i)^⊺ z_{i+1}, ∀i = N−1,…,0.

Using the adjoint variables, the gradient of the loss function with respect to the B-spline coefficients can be computed by exploiting the linearity between W(t_i) and ω_l in (14):