The training of a neural network can be cast as an optimization problem that consists of making steps towards extrema of a loss function. Variants of stochastic gradient descent (SGD) are generally used to solve this problem. They give surprisingly good results, even though the objective function is not convex in most cases. Adaptive gradient methods are a state-of-the-art variation of SGD. In particular, AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), and ADAM (Kingma and Ba, 2014) are widely used methods to train neural networks (Melis et al., 2017; Xu et al., 2015).
Despite the lack of theoretical justification for RMSProp and ADAM (Bernstein et al., 2018) and known cases of non-convergence in the setting of convex regret minimization (Reddi et al., 2018), these adaptive gradient methods quickly became popular. Soon after, provable versions of these adaptive gradient methods emerged, for example AMSGrad (Reddi et al., 2018) and AdaFom (Chen et al., 2019). Among others, a proof of convergence of AdaGrad to a stationary point is given in (Li and Orabona, 2019).
In most SGD methods, the rate of convergence depends on the Lipschitz constant of the gradient of the loss function with respect to the parameters (Reddi et al., 2018; Li and Orabona, 2019). Therefore, it is essential to have an upper bound estimate on the Lipschitz constant in order to get a better understanding of the convergence and to be able to set an appropriate step size. This is also the case for the training of deep neural networks (DNN), for which SGD variants are widely used.
In this paper, we provide a general and efficient estimate for the upper bound on the Lipschitz constant of the gradient of any loss function applied to a feed-forward fully connected DNN with respect to the parameters. Naturally, this estimate depends on the architecture of the DNN (i.e. the activation function, the depth of the NN, the size of the layers) as well as on the norm of the input and on the loss function.
As a concrete application, we show how our estimate can be used to set the (hyper-parameters of the) step size of the AdaGrad (Li and Orabona, 2019) SGD method, such that convergence of this optimization scheme is guaranteed (in expectation). In particular, the convergence rate of AdaGrad with respect to the Lipschitz estimate of the gradient of the loss function can be calculated.
In addition, we provide Lipschitz estimates for any neural network that can be represented as solution of a controlled ordinary differential equation (CODE) (Cuchiero et al., 2019). This includes classical DNN as well as continuously deep neural networks, like neural ODE (Chen et al., 2018), ODE-RNN (Rubanova et al., 2019) and neural SDE (Liu et al., 2019; Tzen and Raginsky, 2019; Jia and Benson, 2019). Therefore, having such a general Lipschitz estimate allows us to cover a wide range of architectures and to study their convergence behaviour.
Moreover, CODE can provide us with neural-network-based parametrized families of invertible functions (cf. (Cuchiero et al., 2019)), including in particular feed-forward neural networks. Not only is it important to have precise formulas for their derivatives; these formulas also appear prominently in financial applications.
In Deep Pricing algorithms the prices of financial derivatives are encoded in feed forward neural networks with market factors and/or market model parameters as inputs. It is well known that sensitivities of prices with respect to those parameters or underlying factors are the crucial hedging ratios for building hedging portfolios. It is therefore important to control some of those sensitivities in the training process, which can be precisely done by our formulas.
CODE also provides a new view on models in mathematical finance, which are typically given by stochastic differential equations to capture the complicated dynamics of financial markets. If the control is a driving stochastic process, e.g. a semi-martingale, and the state describes some prices, then the output of CODE is, with respect to continuous depth, a price process. Derivatives and Lipschitz constants describe global properties of such models and can, again, be used to facilitate and shape the training process by providing natural bounds to it. This can for example be applied to the CODE that appears in (Cuchiero et al., 2020).
2 Related work
Very recently, several estimates of the Lipschitz constants of neural networks were proposed (Scaman and Virmaux, 2018; Combettes and Pesquet, 2019; Fazlyab et al., 2019; Jin and Lavaei, 2018; Raghunathan et al., 2018; Latorre et al., 2020). In contrast to our work, those estimates are upper bounds on the Lipschitz constants of neural networks with respect to the inputs and not with respect to the parameters, as we provide here. In addition, we give upper bounds on the Lipschitz constants of the gradient of a loss function applied to a DNN and not only on the DNN itself. Those other works are mainly concerned with the sensitivity of neural networks to their inputs, while our main goal is to study the convergence properties of neural networks.
To the best of our knowledge, neither for the classical setting of deep feed-forward fully connected neural networks, nor for the CODE framework, general estimates of the Lipschitz constants with respect to the parameters are available.
3 Ordinary deep neural network setting
3.1 Problem setup
The norm we shall use in the sequel is a natural extension of the standard Frobenius norm to finite lists of matrices of diverse sizes. Specifically, for any , , and , we let
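To make this norm concrete, the following sketch implements one natural choice, the square root of the sum of squared Frobenius norms over the list; the precise display is given above, and this aggregation rule is our assumption about it.

```python
import math

def frobenius(mat):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in mat for x in row))

def list_norm(mats):
    """Extension of the Frobenius norm to a finite list of matrices of
    diverse sizes: the square root of the sum of the squared Frobenius
    norms of the entries (assumed aggregation rule)."""
    return math.sqrt(sum(frobenius(m) ** 2 for m in mats))

A = [[3.0, 4.0]]           # ||A||_F = 5
B = [[1.0], [2.0], [2.0]]  # ||B||_F = 3
print(list_norm([A, B]))   # sqrt(25 + 9) = sqrt(34)
```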
Consider positive integers for . We construct a deep neural network (DNN) with layers of , neurons, each with an (activation) function , such that there exist , so that for all and all we have , and . This assumption is met by the classical sigmoid and tanh functions, but excludes the popular ReLU activation function. However, our main results of this section can easily be extended to allow for ReLU as well, as outlined in Remark 3.6. For each , let be the weights and be the bias. Let and define for every
We denote for every the parameters , and, by a slight abuse of notation, considering and as flattened vectors, we write and . Moreover, we define as the set of possible neural network parameters. Then we define the -layered feed-forward neural network as the function
By we denote the number of trainable parameters of .
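As an illustration of the network just defined, the following hedged sketch builds a small fully connected network and counts its trainable parameters; the concrete widths, the tanh activation on hidden layers and the linear output layer are illustrative assumptions, not choices made by the theorem.

```python
import math, random

random.seed(0)

def make_params(widths):
    """Random weights W_l (n_l x n_{l-1}) and biases b_l (n_l) for a
    fully connected network; widths = [n_0, ..., n_L]."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        params.append((W, b))
    return params

def num_params(widths):
    """Number of trainable parameters: sum over layers of n_l * (n_{l-1} + 1)."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(widths[:-1], widths[1:]))

def forward(params, x):
    """Layered feed-forward network: affine map, then tanh on hidden
    layers; the linear output layer is an assumption."""
    for i, (W, b) in enumerate(params):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(params) - 1:  # activation on hidden layers only
            x = [math.tanh(v) for v in x]
    return x

widths = [3, 5, 5, 2]
print(num_params(widths))  # 5*(3+1) + 5*(5+1) + 2*(5+1) = 62
```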
We now assume that there exists a (possibly infinite) set of possible training samples , for , equipped with a sigma algebra and a probability measure, the distribution of the training samples. Let be a random variable following this distribution. We use the notation to emphasize the two components of a training sample . In a standard supervised learning setup we have , where is the input and is the target. However, we also allow any other setup, including , corresponding to training samples consisting only of the input, i.e. an unsupervised setting. Let
be a function which is twice differentiable in the first component. We assume there exist such that for all we have and . We use to define the cost function, given one training sample , as
Then we define the cost function (interchangeably called objective or loss function) as
where we denote by the expectation with respect to .
For a finite set of training samples , with equal probabilities (Laplace probability model) we obtain the standard neural network objective function .
3.2 Main results
The following theorems show that under standard assumptions for neural network training, the neural network , as well as the cost function , are Lipschitz continuous with Lipschitz continuous gradients with respect to the parameters . We explicitly calculate upper bounds on the Lipschitz constants. Furthermore, we apply these results to infer a bound on the convergence rate to a stationary point of the cost function. The proofs are given in Appendix A.
To simplify the notation, we define the following functions for a given training sample and :
First we derive upper bounds on the Lipschitz constants of the neural network and its gradient.
We assume that the space of network parameters is non-empty, open and bounded, that is, there exists some such that for every we have . For any fixed training sample we set . Then, for , each and its gradient are Lipschitz continuous with constants and and uniformly bounded with constants , , which can be upper bounded as follows:
Furthermore, the function and its gradient are Lipschitz continuous with constants and . This also implies that is uniformly bounded by and these constants can be estimated by
In the corollary below, we solve the recursive formulas to get simpler (but less tight) expressions of the upper bounds of the constants.
Let . The iteratively defined constants of Theorem 3.2 can be upper bounded for by
With Corollary 3.3 we see that our estimates of the Lipschitz constants are linear (respectively quadratic) in the norm of the neural network input, but grow exponentially with the number of layers. Since this gives us only an upper bound for the Lipschitz constants, a natural question is whether these constants are optimal. The answer is no. In particular, the factor (respectively ) is too pessimistic, as we discuss in Remark 3.4. On the other hand, the factor is needed for an approximation of , as shown in Example 3.5. Similar examples can be constructed for
Revisiting the proof of Theorem 3.2, we see that in each step applying Lemma A.1, we set . However, in the -th layer, only needs to be a bound for , while the norm of the entire parameter vector has to satisfy . Therefore, we can replace by in the constants for the -th layer in Theorem 3.2. Doing this for all layers, including the last one, and become functions of , where the constraint has to be satisfied. Therefore, computing upper bounds of the Lipschitz constants now amounts to solving the optimization problems
Due to the iterative definition of and , both objective functions are complicated polynomials on a high-dimensional constraint set, and the maximum is achieved at some boundary point, i.e. where . In particular, numerical methods have to be used to solve these optimization problems.
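Independently of how these constrained maximizations are solved, any candidate upper bound can be sanity-checked numerically: the largest sampled difference quotient over parameter pairs is always a lower bound on the true Lipschitz constant with respect to the parameters. The toy network and sampling scheme below are illustrative assumptions, not the paper's optimization problems.

```python
import math, random

random.seed(1)

def net(theta, x):
    """Tiny two-layer, one-unit-per-layer network
    x -> tanh(w2 * tanh(w1 * x + b1) + b2), theta = (w1, b1, w2, b2);
    purely illustrative."""
    w1, b1, w2, b2 = theta
    return math.tanh(w2 * math.tanh(w1 * x + b1) + b2)

def lipschitz_lower_bound(x, radius=1.0, trials=20000):
    """Largest sampled ratio |f_theta(x) - f_theta'(x)| / ||theta - theta'||
    over parameter pairs in a box of the given radius: a lower bound on
    the Lipschitz constant of theta -> f_theta(x)."""
    best = 0.0
    for _ in range(trials):
        t1 = [random.uniform(-radius, radius) for _ in range(4)]
        t2 = [random.uniform(-radius, radius) for _ in range(4)]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))
        if dist > 1e-9:
            best = max(best, abs(net(t1, x) - net(t2, x)) / dist)
    return best

print(lipschitz_lower_bound(x=2.0))
```

Any analytic upper bound for this toy network must dominate the printed value; a crude gradient bound (each partial derivative bounded via |tanh'| <= 1 and the box constraints) gives sqrt(7) here, so the sampled quotient always stays below 3.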
We assume to have a -dimensional input and use layers each with only hidden unit. Furthermore we use as activation function a smoothed version of
for some and . Then . We define the -dimensional weights and . Then we have for . We choose with , and with , , such that . Then we have
which is in line with Remark 3.4.
Let us finally explain in the remark below, how Theorem 3.2 can be extended to ReLU activation functions.
We made the assumption that the activation functions are twice differentiable and bounded. This includes the classical sigmoid and tanh functions, but excludes the often used ReLU function . However, Theorem 3.2 can be extended to slight modifications of ReLU, which are made twice differentiable by smoothing the kink at . Indeed, if exists, the only part of the proof that has to be adjusted is the computation of . Since ReLU is either the identity or , the norm of its output is bounded by the norm of its input, which yields and . To account for the smoothing of the kink, a small constant can be added, which equals the maximum difference between the smoothed and the original version of ReLU.
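One common smooth surrogate of this kind (our choice for illustration, not necessarily the one intended by the remark) is the scaled softplus function softplus_beta(x) = log(1 + exp(beta * x)) / beta; its maximal deviation from ReLU is log(2)/beta, attained at x = 0, which plays exactly the role of the small additive constant mentioned above.

```python
import math

def softplus(x, beta=10.0):
    """Twice differentiable surrogate for ReLU:
    softplus_beta(x) = log(1 + exp(beta * x)) / beta, written in the
    numerically stable form max(x, 0) + log1p(exp(-beta * |x|)) / beta."""
    return max(x, 0.0) + math.log1p(math.exp(-beta * abs(x))) / beta

def relu(x):
    return max(x, 0.0)

beta = 10.0
# The largest gap to ReLU occurs at x = 0 and equals log(2) / beta:
# this is the small constant to add to the Lipschitz estimates.
gap = max(abs(softplus(x / 100.0, beta) - relu(x / 100.0))
          for x in range(-500, 501))
print(gap, math.log(2) / beta)
```

Increasing beta tightens the approximation (the added constant shrinks like 1/beta) at the cost of larger second derivatives near the kink.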
Now we derive upper bounds on the Lipschitz constants of the objective function and its gradient.
We assume that the space of parameters is non-empty, open and bounded by . Let be a random variable following the distribution of the training samples and assume that is a random variable in , i.e. . Here denotes the norm (1), and of Theorem 3.2 is replaced by . Then, the objective function and its gradient are Lipschitz continuous with constants and . This also implies that is uniformly bounded by . Using the constants of Theorem 3.2 we define
and get the following estimates for the above defined constants:
Note that since is now a random variable rather than a constant, , all depend on and therefore are random variables as well.
The assumption that is bounded is not very restrictive. Whenever a neural network numerically converges to some stationary point, which is the case for the majority of problems where neural networks can be applied successfully, the parameters can be chosen to only take values in a bounded region.
Furthermore, regularisation techniques, such as -regularisation outside a certain domain, can be used to essentially guarantee that the weights do not leave this domain, hence implying our assumption on . A similar approach was used for example in (Ge et al., 2016).
Assume that the (stochastic) gradient scheme in Algorithm 1 is applied to minimize the objective function .
In the following examples we use Theorem 3.7 to set the step sizes of gradient descent (GD) (Example 3.10) in the case of a finite training set (in particular using ) and of stochastic gradient descent (SGD) (Example 3.11) with adaptive step sizes (a state-of-the-art neural network training method first introduced in (Duchi et al., 2011)) in the case of a general training set . In particular, the step sizes (respectively, the hyper-parameters for the step sizes) can be chosen depending on the computed estimates for , such that the GD (respectively, SGD) method is guaranteed to converge (in expectation). At the same time, these examples give bounds on the convergence rates.
Example 3.10 (Gradient descent).
Assume that with equal probabilities and that in each step of the gradient method the true gradient of is computed, i.e. gradient descent and not a stochastic version of it is applied. Furthermore, assume that there exists such that . Choosing the step sizes , the following inequality
is always satisfied as shown in Section 1.2.3 of (Nesterov, 2013). Furthermore, it follows that for every we have
where . In particular, for every tolerance level we have
Example 3.11 (Stochastic gradient descent).
Assume that the random variable lies in , i.e. . Furthermore, assume that there exists such that . Choose the adaptive step-sizes of the stochastic gradient method in Algorithm 1 as
for constants , that satisfy . One possibility is to use for some . Then there exists a constant depending on , where , such that for every ,
In particular, for every tolerance level we have
The exact value of the constant can be looked up in Theorem 4 and its proof in (Li and Orabona, 2019).
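A minimal sketch of such adaptive step sizes, in the spirit of the rule eta_k = alpha / (beta + sum of past squared gradient norms)**(1/2 + eps) from (Li and Orabona, 2019); the one-dimensional objective, the noise model and the constants below are illustrative assumptions.

```python
import random

random.seed(2)

def adagrad_norm(trials=2000):
    """SGD with AdaGrad-style scalar step sizes
    eta_k = alpha / (beta + sum_{i<k} |g_i|**2) ** (0.5 + eps),
    where the sum runs over PAST gradients only, so eta_k does not
    depend on the current gradient. Minimizes E[0.5 * (t - z)**2] for
    z ~ Uniform(-1, 1), whose minimizer is t = 0."""
    alpha, beta, eps = 1.0, 1.0, 0.05
    t, sq_sum = 5.0, 0.0
    for _ in range(trials):
        z = random.uniform(-1.0, 1.0)
        g = t - z                  # unbiased stochastic gradient at t
        eta = alpha / (beta + sq_sum) ** (0.5 + eps)
        t -= eta * g
        sq_sum += g * g            # accumulate past squared norms
    return t

print(adagrad_norm())
```

The exponent 1/2 + eps makes the step sizes summable enough for the convergence guarantee; in the theorem, alpha and beta would be tuned using the Lipschitz estimate of the gradient.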
4 Deep neural networks as controlled ODEs
4.1 Framework & definitions
We introduce a slightly different notation than in the previous section. Let and . Let be the fixed dimension of the problem. In particular, if we want to define a neural network mapping some input of dimension to an output of dimension with the maximal dimension of some hidden layer , then we set . We use “zero-embeddings” to write (by abuse of notation) , i.e. we identify with . This is important, since we want to describe the evolution of an input through a neural network to an output by an ODE, which means that the dimension has to be fixed and cannot change. To do so, we fix and define for vector fields
which are càglàd in the second variable and Lipschitz continuous in . Furthermore, we define scalar càdlàg functions for , which we refer to as controls
which are assumed to have finite variation (also called bounded variation) and start at 0, i.e. . With these ingredients, we can define the following controlled ordinary differential equation (CODE)
where is the starting point, respectively input to the “neural network”. We fix some . is called a solution of (3), if it satisfies for all ,
Then (4) describes the evolution of the input through a “neural network” to the output . Here, the “neural network” is defined by and for .
The assumption that has finite variation is needed for the integral in (4) to be well defined. Indeed, a deterministic càdlàg function of finite variation is a special case of a semimartingale, whence we could also take to be semimartingales.
Before we discuss this framework, we define the loss functions similarly to Section 3.1. Let be the set of (-embedded) training samples, again equipped with a sigma algebra and a probability measure . Let be a random variable. For a fixed function , we define the loss (or objective or cost) function by
The framework (4) is much more general and powerful than the standard neural network definition. In particular, we give the following example showing that the neural network defined in (2) is a special case of the CODE solution (4).
We define as a step function and as a stepwise (with respect to its second parameter) vector field
Here, is the concatenation of all the weights needed to define the affine neural network layers, and and are defined as in Section 3.1. However, by abuse of notation, we assume that each , using “-embeddings” wherever needed, and similarly for . Evaluating (4), which amounts to computing the (stochastic) integral with respect to a step function, we get
where we use . Solving the sum iteratively, we get for ,
in particular, is the output of the layer of the neural network defined in (2).
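The mechanism of this example can be sketched in code. We take the control to jump by 1 at each layer index and assume the residual form V(x, k) = sigma(W_k x + b_k) - x for the vector field (the exact construction in the displays above may differ), so that each jump of the CODE performs exactly one network layer.

```python
import math

def sigma(v):
    return [math.tanh(x) for x in v]

def layer(W, b, x):
    """One affine layer followed by the activation."""
    return sigma([sum(w * xi for w, xi in zip(row, x)) + bi
                  for row, bi in zip(W, b)])

def code_solution(weights, x0):
    """Solution of dX = V(X, t) du(t) when u(t) = floor(t): the integral
    collapses to a sum over the jump times, X_k = X_{k-1} + V(X_{k-1}, k).
    With the assumed residual vector field V(x, k) = sigma(W_k x + b_k) - x,
    each jump reproduces one layer."""
    x = x0
    for W, b in weights:
        jump = [li - xi for li, xi in zip(layer(W, b, x), x)]  # V(x, k)
        x = [xi + j for xi, j in zip(x, jump)]                 # add the jump
    return x

def feedforward(weights, x0):
    """Plain layer-by-layer evaluation, for comparison."""
    x = x0
    for W, b in weights:
        x = layer(W, b, x)
    return x

weights = [([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),
           ([[1.0, 0.0], [0.0, 1.0]], [-0.1, 0.2])]
x0 = [1.0, -1.0]
print(code_solution(weights, x0) == feedforward(weights, x0))  # True
```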
This example clarifies why we speak of a solution of (4) as a “neural network”, respectively the evolution of the input through a neural network. If (respectively ) are not pure step functions, (4) defines a neural network of “infinite depth”, which we refer to as a continuously deep neural network. Its output can be approximated using a stepwise scheme to solve ODEs. In doing so, the continuously deep neural network is approximated by a deep neural network of finite depth. Using modern ODE solvers with adaptive step sizes, as proposed in (Chen et al., 2018), the depth of the approximation and the step sizes change depending on the desired accuracy and the input.
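A minimal sketch of such a stepwise scheme: assuming the control u(t) = t (the neural ODE case) and a linear vector field whose flow is known in closed form, the explicit Euler approximation, whose steps play the role of layers, can be compared against the exact solution.

```python
import math

def euler_neural_ode(vector_field, x0, t_end=1.0, steps=100):
    """Explicit Euler scheme for dX_t = V(X_t) dt, i.e. the CODE (4) with
    control u(t) = t: each Euler step acts as one layer of a finite-depth
    approximation of the continuously deep network."""
    h = t_end / steps
    x = x0
    for _ in range(steps):
        x = x + h * vector_field(x)
    return x

# Illustrative linear vector field V(x) = -x, whose exact flow is
# x(1) = x0 * exp(-1); refining the grid (i.e. deepening the network)
# drives the Euler output towards it.
approx = euler_neural_ode(lambda x: -x, x0=2.0, steps=10000)
print(approx, 2.0 * math.exp(-1.0))
```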
Continuously deep neural networks are already used in practice. The neural ODE introduced in (Chen et al., 2018) is an example of such a continuously deep neural network that can be described in our framework by (4), when choosing and , i.e. . However, our framework allows us to describe more general architectures, which combine jumps (as occurring in Example 4.2) with continuous evolutions as in (Chen et al., 2018). One example of such an architecture is the ODE-RNN introduced in (Rubanova et al., 2019). Furthermore, allowing to be semimartingales instead of deterministic processes of finite variation, neural SDE models as described e.g. in (Liu et al., 2019; Jia and Benson, 2019; Peluchetti and Favaro, 2019; Tzen and Raginsky, 2019) are covered by our framework (4).
4.2 Gradient & existence of solutions
Although we are in a deterministic setting, it is reasonable to make use of Itô calculus (also called stochastic calculus) in the above framework, since integrands are predictable. See for instance (Protter, 1992) for an extensive introduction. We make use of the typical differential notation that is common in stochastic calculus (as for example in (3)) and we treat our ODEs with methods for stochastic differential equations (SDEs). Again we emphasize that all could be general semimartingales.
First we note that by Theorem 7 of Chapter V in (Protter, 1992), a solution of (4) exists and is unique, given that all are Lipschitz continuous in . Starting from (4), we derive the ODE describing the first derivative of with respect to . For this, let us define , and for we use the standard notation and . Assuming that all required derivatives of exist, we have
Therefore, we obtain the following CODE (with differential notation)
4.3 Lipschitz regularity in the setting of CODE
Let us denote the total variation process (cf. Chapter I.7 (Protter, 1992)) of as . We then define , which is an increasing function of finite variation with . Furthermore, we define and note that .
With this we are ready to state our main results of this section. We start with the equivalent result to Theorem 3.2, giving bounds on the Lipschitz constants of the neural network and its gradient.
Let be non-empty and open. We assume that there exist constants and such that for all , , and we have
and similarly for , , and . We also assume that for any , and the map
is Lipschitz continuous with constants independent of and . Then, for any fixed training sample , the neural network output is uniformly bounded in by a constant and the map and its gradient
are Lipschitz continuous on with constants