Residual Networks: Lyapunov Stability and Convex Decomposition

March 22, 2018 · Kamil Nar et al.

While the training error of most deep neural networks degrades as the depth of the network increases, residual networks appear to be an exception. We show that the main reason for this is the Lyapunov stability of the gradient descent algorithm: for an arbitrarily chosen step size, the equilibria of the gradient descent algorithm are most likely to remain stable under the parametrization of residual networks. We then present an architecture with a pair of residual networks to approximate a large class of functions by decomposing them into a convex and a concave part. Some parameters of this model are shown to change little during training, and this imperfect optimization prevents overfitting the data and leads to solutions with small Lipschitz constants, while providing clues about the generalization of other deep networks.


1 Introduction

For most neural network architectures, the expressiveness of the network improves as the depth increases. However, the training and test errors of most networks have been shown to deteriorate in practice if the depth exceeds a few layers (He et al., 2016). This discrepancy indicates a problem with the method used to train these networks. Given that gradient-based iterative algorithms are the almost exclusive choice for training, the most likely cause is the poor stability of the dynamical systems created by these algorithms in the sense of Lyapunov. (Lyapunov stability is a property of the equilibria, but for ease of reading, we will refer to the dynamical systems and the algorithms as stable if their equilibria are stable.) This is further substantiated by the sharp falls observed in the training error if a time-varying step size is used (He et al., 2016).

It is known that the sensitivity of the loss function with respect to different parameters can become disproportionate while training a neural network. For example, some of the gradients might vanish while others explode, which prevents effective training. As a remedy, changing the geometry of the optimization was suggested and a regularized descent algorithm was introduced in (Neyshabur et al., 2017). This algorithm was shown to converge in fewer iterations over the data, but it required the computation of a scaling constant for each node of the network at each time step, which depended on almost all other nodes when the network is fully connected.

Residual networks appear to be an exception: their training error improves with depth even when a standard gradient method is used (He et al., 2016). To explain this different behavior, linear versions of these networks have been shown to possess some crucial properties for optimization. In particular, it was shown that all local optima of linear residual networks are also global optima, and the gradient of their cost function does not vanish away from the local optima (Hardt & Ma, 2016). Later, equivalent results were derived under some conditions for nonlinear residual networks as well (Bartlett et al., 2017a).

Lyapunov stability was used in the past to understand and improve the training of neural networks (Michel et al., 1988; Matsuoka, 1992; Man et al., 2006), but the success and the problems of the state-of-the-art networks have not been analyzed from this perspective. In this paper, we fill this gap and show that residual networks are indeed the right architecture to be trained by gradient-based algorithms in terms of Lyapunov stability. More precisely, we show that given an arbitrary step size, the equilibria of the gradient descent algorithm are most likely to remain stable in the sense of Lyapunov for residual networks. We also reveal that the equilibria of the gradient descent algorithm could be unstable for most deep neural networks, in which case the algorithm might approach an optimum but not converge to it even if the algorithm is not stochastic, thereby providing some level of regularization. Our result is fundamentally different from the previous works (Saxe et al., 2013; Gunasekar et al., 2017) in that we address whether the local optima can actually be attained by gradient-based methods rather than how good the local optima are.

We then introduce an architecture with a pair of residual networks that can be used to approximate a large class of functions by decomposing them into a convex and a concave part. This decomposition elucidates how each layer improves the approximation and provides an interpretable model. We analyze the properties of its local optima and show that the bias parameters are likely to change little during training. Though this seems like a problem, it in fact prevents overfitting and leads to solutions with low Lipschitz constants, which is also associated with generalization. These claims are verified by testing the suggested model on the MNIST data set with no explicit regularization.

The organization of the rest of the paper is as follows. The stability analysis is given in Section 2 and the model for decomposition is introduced in Section 3. The properties of the model are derived in Section 4 and the results of the test on the MNIST data set are given in Section 5. Lastly, the results are discussed and some future directions are provided in Section 6.

2 Stability in the Sense of Lyapunov

Given a discrete-time system with state $x_t$ at time $t$ and the update rule $x_{t+1} = f(x_t)$, a point $x^\ast$ is called an equilibrium of the system if it satisfies $x^\ast = f(x^\ast)$. The equilibrium point $x^\ast$ is said to be stable in the sense of Lyapunov if for every $\epsilon > 0$, there exists some $\delta > 0$ such that $\|x_0 - x^\ast\| \le \delta$ implies $\|x_t - x^\ast\| \le \epsilon$ for all $t \ge 0$. That is, if the state of the system is close to the equilibrium initially, it always stays close to the equilibrium, as shown in Figure 1. If, in addition, the state converges to the equilibrium, $x^\ast$ is said to be asymptotically stable in the sense of Lyapunov.

Figure 1: An equilibrium stable in the sense of Lyapunov

As an example, consider the problem of finding an optimal solution to the ordinary least squares problem

$\min_{w} \ \tfrac{1}{2} \| Y - X w \|^2$

via gradient descent, where $X$ and $Y$ represent the input and the labels, respectively. Given the step size $\delta$, the update rule for $w$ creates the dynamical system

$w_{t+1} = w_t + \delta X^\top (Y - X w_t) = (I - \delta X^\top X)\, w_t + \delta X^\top Y.$

Every equilibrium of this system is asymptotically stable and the gradient descent converges to any of them only if all eigenvalues of the matrix $I - \delta X^\top X$ have magnitude less than 1.
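
To make this concrete, the following sketch (our illustration, not part of the original text) runs gradient descent on a small least-squares problem with one step size below and one above the critical value $2/\lambda_{\max}(X^\top X)$, and reports the distance of the final iterate to the optimum.

```python
import numpy as np

# Illustrative sketch (not from the paper): gradient descent on the least
# squares loss 0.5 * ||Y - X w||^2 with two step sizes, one below and one
# above the critical value 2 / lambda_max(X^T X).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true

critical = 2.0 / np.linalg.eigvalsh(X.T @ X).max()

def run_gd(step, iters=200):
    w = np.zeros(3)
    for _ in range(iters):
        # w_{t+1} = (I - step * X^T X) w_t + step * X^T Y
        w = w - step * X.T @ (X @ w - Y)
    return w

for step in (0.9 * critical, 1.1 * critical):
    w = run_gd(step)
    print(f"step = {step:.4f}, distance to optimum = {np.linalg.norm(w - w_true):.3e}")
# Below the critical step size the distance shrinks toward 0; above it, the
# iterates diverge.
```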

As seen from this example, there exists a critical value for the step size above which the gradient descent algorithm becomes unstable and the iterations cannot converge to a local optimum. While this critical value is the same for all equilibria of a linear system, it varies if the dynamical system is nonlinear. Since the gradient descent for deep networks leads to nonlinear dynamics as well, some of its equilibria might become unstable while others are still stable for the same step size, as the following proposition shows.

Proposition 1.

Let $h$ be a nonzero linear function, i.e., $h(x) = rx$ for all $x$ with some $r \neq 0$. Given a set of points $\{x_i\}_{i=1}^N$, assume that $h$ is estimated as a multiplication of the scalar parameters $w_1, w_2, \dots, w_L$ by minimizing

$\frac{1}{2} \sum_{i=1}^{N} \left( r x_i - w_L w_{L-1} \cdots w_1 x_i \right)^2$

via gradient descent. Then an equilibrium with $w_L w_{L-1} \cdots w_1 = r$ is asymptotically stable only if the step size $\delta$ satisfies

$\delta \le \frac{2}{\bar{x} \sum_{k=1}^{L} \prod_{j \neq k} w_j^2},$ (1)

where $\bar{x} = \sum_{i=1}^{N} x_i^2$.

Proof.

The update rule for $w_k$ is given as

$w_k \leftarrow w_k + \delta\, \bar{x}\, (r - w_L w_{L-1} \cdots w_1) \prod_{j \neq k} w_j.$ (2)

By multiplying these update equations for all $k$ and denoting $w_L w_{L-1} \cdots w_1$ by $\hat{r}$, we obtain

$\hat{r} \leftarrow \hat{r} + \delta\, \bar{x}\, (r - \hat{r}) \sum_{k=1}^{L} \prod_{j \neq k} w_j^2 + o(r - \hat{r}),$

where $o(\cdot)$ denotes the higher order terms in its argument. By Lyapunov’s indirect method of stability (Khalil, 2002), an equilibrium of a nonlinear system is asymptotically stable only if the linear approximation around that equilibrium is not unstable. Therefore, $(r - \hat{r})$ converges to zero only if

$\left| 1 - \delta\, \bar{x} \sum_{k=1}^{L} \prod_{j \neq k} w_j^2 \right| \le 1$

for the equilibrium $(w_1, \dots, w_L)$. ∎

Note that the bound in (1) can become very small for an equilibrium with disproportionate parameters, such as $w_1 = r/\varepsilon$, $w_2 = \varepsilon$ and $w_j = 1$ for $j > 2$ with $\varepsilon \ll 1$; and it attains its largest value

$\delta_{\max} = \frac{2}{L\, \bar{x}\, |r|^{2(L-1)/L}}$

for the equilibria which satisfy $|w_j| = |r|^{1/L}$ for all $j$. If the step size is even larger than $\delta_{\max}$, then the gradient descent algorithm cannot converge to any of the local optima.

Proposition 1 shows that for an arbitrarily chosen $\delta$, the equilibria with $|w_j| = |r|^{1/L}$ for all $j$ are most likely to be stable. This suggests, for example, that if $r$ is known to be positive and $L$ is very large, setting $w_j = 1$ for all $j$ is a good choice of initialization, since $r^{1/L}$ is close to 1. In fact, the gradient descent converges exponentially fast with this initialization if $\delta$ is chosen appropriately, which is shown next.
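
The following sketch (ours, not from the paper) illustrates Proposition 1 numerically for $L = 3$ and $r = 2$, assuming a single data point $x = 1$ so that the data-dependent constant drops out: with the same step size, gradient descent started near a balanced equilibrium stays near it, while it is driven away from a disproportionate equilibrium with the same product.

```python
import numpy as np

# Illustrative sketch (not from the paper): gradient descent on the loss
# 0.5 * (r - w1*w2*w3)^2, i.e. a single data point x = 1 so the data term
# drops out, started from small perturbations of two equilibria that both
# satisfy w1*w2*w3 = r. With the same step size, the iterates stay near the
# balanced equilibrium w_j = r**(1/L) but leave the neighborhood of the
# disproportionate one, as Proposition 1 predicts.
rng = np.random.default_rng(1)
r, L, step, iters = 2.0, 3, 0.1, 25

def stays_close(w_star, tol=0.1):
    w = np.array(w_star) + 1e-3 * rng.normal(size=L)     # small perturbation
    for _ in range(iters):
        p = np.prod(w)
        grad = (p - r) * np.array([p / wk for wk in w])  # d/dw_k of the loss
        w = w - step * grad
        if np.linalg.norm(w - w_star) > tol:
            return False      # left the neighborhood: unstable in practice
    return True

balanced = np.full(L, r ** (1.0 / L))   # equal parameters: largest stable step size
lopsided = np.array([8.0, 0.5, 0.5])    # same product, disproportionate parameters

print("balanced equilibrium stays close:", stays_close(balanced))   # True
print("lopsided equilibrium stays close:", stays_close(lopsided))   # False
```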

Proposition 2.

Assume that and for all . If the step size is chosen to be less than or equal to

then for all , where

Proof.

Due to symmetry, for all for all . Denoting any of them by , we have

To show that converges to , we can write

where

If there exists some such that

(3)

then is always larger or always smaller than , and its distance to decreases by a factor of at each step. Since is a monotonic function in , the condition (3) holds for all if it holds only for and , which gives us and

The identical result also holds for negative values of $r$ if some of the parameters are initialized at $-1$ and the cardinality of the set of such parameters is odd. However, without knowing the sign of $r$, we cannot decide whether to include a $-1$ in the initialization. We introduce a decomposition for linear functions to handle this problem in Section 3.2.

We can extend the results in Propositions 1 and 2 to higher dimensions as well. Assume that for all , and let be a nonzero matrix. If gradient descent is used with step size to solve

(4)

where for all , then the update rule for is

(5)

where for , , , for , , and .

Even though the evolution of each element of seems to depend on all the entries of the other matrices, the system described by (5) decomposes into independent systems under the assumptions given in Theorem 1.

Theorem 1.

Let denote the spectral radius of the matrix in (4). Assume that is diagonalizable with real eigenvalues, the initial matrices and

have the identical eigenspaces, and the matrix

. If the step size satisfies

then none of the equilibria that satisfy is stable, and the gradient descent cannot converge.

Proof.

There exists a common invertible matrix

that can diagonalize all matrices in (5): , for all . Then the update rule (5) turns into independent update rules for the diagonal elements of and . By Proposition 1, all of these systems can converge only if

for each eigenvalue of

Theorem 2.

Assume that is diagonalizable and all of its eigenvalues are real and positive. If   and for all and the step size satisfies

then each converges to exponentially fast.

Proof.

After bringing the update equation (5) into diagonal form, Proposition 2 can be applied to each of the systems involving the diagonal elements. Since in Proposition 2 is monotonically decreasing in , the bound for the maximum eigenvalue of guarantees linear convergence. ∎

If the matrix

is not the identity and the eigenvectors of

do not lie in the eigenspaces of , then the update rules for gradient descent remain coupled and the dynamics of the parameters become more complex. Furthermore, if a stochastic gradient method is used, then the update rules may still not decouple even if the input points are orthonormal, unless each lies in an eigenspace of . In this case, taking a batch of points which satisfy for some at each step of the gradient descent simplifies the dynamics.

It was shown in (Hardt & Ma, 2016) that when the weight matrices are close to the identity, the gradient cannot vanish unless the product of the weight matrices is close to the linear mapping being estimated, which can also be seen from (5). However, the stability of the gradient descent algorithm was not addressed. From the proofs of Theorem 1 and Theorem 2, we observe that by keeping all eigenvalues of the weight matrices close to each other, it is possible to find a step size that maintains the stability of the gradient descent while providing an effective convergence rate.

Note that if we used distinct unitary matrices instead of the identity, the gradients would still vanish only when the product of the weight matrices matches the mapping being estimated, and therefore, every local optimum would still be a global optimum. In addition, gradients with respect to different parameters would likely not become disproportionate, since unitary matrices do not amplify or attenuate the eigenvectors of the other matrices either. Therefore, using unitary matrices instead of the identity could possibly yield results comparable to residual networks, although the dynamics of the parameters would be harder to analyze.

3 Decomposition of Functions

In the previous section, we have seen that the parametrization of linear residual networks is well suited for optimization via gradient descent. In this section, we show that a large class of functions can actually be decomposed into two parts, each of which can be approximated by a residual network.

3.1 Convex Decomposition for Nonlinear Functions

The following theorem from (Bartlett et al., 2017a) provides a set of sufficient conditions under which a function can be written as a composition of functions all of which are close to the identity.

Theorem 3.

(Bartlett et al., 2017a) Consider a function on a bounded domain. Suppose that it is differentiable, invertible, and -smooth:

for all for some , where is the derivative and is the operator norm. Further assume that the inverse is Lipschitz, and for some . Then for all , there are  functions satisfying

and for all , where is the identity function and denotes the Lipschitz seminorm.

Theorem 3 provides an existence result in the function space without assuming any fixed structure for the estimator. If a neural network is used, for example, the width of the network might need to be very large or a certain nonlinearity might be needed at each layer depending on the function to be estimated. In the sequel, we show that a residual network that contains rectified linear units (ReLU) as nonlinearities could be used to approximate strictly convex functions.

Note that if $g$ is a twice differentiable, strictly convex function over a bounded domain, then its gradient satisfies all the conditions of Theorem 3. Therefore, we can represent the gradient with a residual network. First consider a univariate function $f$ which is continuously differentiable and strictly convex on its domain and has a strictly positive derivative at 0, i.e., $f'(0) > 0$. A first order approximation to $f$ around 0 is

$\hat{f}_0(x) = f(0) + f'(0)\, x.$ (6)

However, given that $f$ is strictly convex, the approximation in (6) underestimates the function, particularly at larger values of $x$. To increase both the estimate and the derivative of the estimate for larger values, we can instead add rectified linear (ReLU) correction terms with nonnegative weights, each activated only beyond a certain threshold. The resulting estimate is strictly increasing, and as $x$ gets large, its derivative gradually increases above $f'(0)$, provided that the added weights and thresholds are not too large. As a result, it provides a better estimate for the original function.
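
As a numerical illustration (ours; the paper's exact layered construction is not reproduced here), the sketch below improves the first-order approximation of a strictly convex function by adding nonnegative ReLU corrections $a_k \max\{0, x - b_k\}$ at a few hypothetical breakpoints, so that the slope of the estimate increases with $x$.

```python
import numpy as np

# Illustrative sketch (ours, not the paper's exact construction): approximate
# the strictly convex f(x) = exp(x) on [0, 2] by the first-order term
# f(0) + f'(0) * x plus nonnegative ReLU corrections a_k * max(0, x - b_k),
# which can only increase the estimate and its slope for larger x.
f = np.exp
x = np.linspace(0.0, 2.0, 200)

breakpoints = np.array([0.5, 1.0, 1.5])   # hypothetical thresholds b_k
# Slope increments a_k chosen so that on each segment the estimate's slope
# equals f'(b_k) at the segment's left endpoint.
increments = np.diff(np.exp(np.concatenate(([0.0], breakpoints))))

first_order = f(0) + 1.0 * x              # f(0) + f'(0) x, with f'(0) = 1 for exp
corrections = sum(a * np.maximum(0.0, x - b) for a, b in zip(increments, breakpoints))
estimate = first_order + corrections

print("max error, first-order :", np.max(np.abs(f(x) - first_order)))
print("max error, with ReLUs  :", np.max(np.abs(f(x) - estimate)))
```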

Similarly, given a strictly convex, twice differentiable function with a suitable domain and a gradient with positive coordinates at the origin, we can approximate it with

(7a)
(7b)
(7c)

where the weight parameters in (7a)–(7c) are constrained to be nonnegative, with some of them strictly positive. The function described by (7a)–(7c) is convex because, at each layer, every coordinate of the output is obtained by taking nonnegative combinations and pointwise maxima of the coordinates of the previous layer, both of which are operations that preserve convexity (Boyd & Vandenberghe, 2004).
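
To make the convexity argument concrete, here is a minimal sketch (ours), assuming residual layers of the form $h \leftarrow h + W_k\,\mathrm{relu}(V_k h - b_k)$ with entrywise nonnegative $W_k$ and $V_k$ and a nonnegative readout; the exact form of (7a)–(7c) may differ. Since each layer only takes nonnegative combinations and pointwise maxima of convex coordinates, the resulting scalar function of the input is convex, which the sketch checks numerically along random segments.

```python
import numpy as np

# Minimal sketch (ours) of a convex residual block in the spirit of (7a)-(7c),
# assuming layers h <- h + W_k @ relu(V_k @ h - b_k) with entrywise
# nonnegative W_k, V_k and a nonnegative readout c. Each layer only takes
# nonnegative combinations and pointwise maxima of convex coordinates, so the
# scalar output is a convex function of the input.
rng = np.random.default_rng(2)
dim, depth = 4, 3
layers = [(rng.uniform(0, 0.3, (dim, dim)),   # W_k >= 0
           rng.uniform(0, 1.0, (dim, dim)),   # V_k >= 0
           rng.uniform(0, 1.0, dim))          # b_k
          for _ in range(depth)]
c = rng.uniform(0, 1.0, dim)                  # nonnegative readout weights

def net(x):
    h = x
    for W, V, b in layers:
        h = h + W @ np.maximum(0.0, V @ h - b)   # residual convex layer
    return c @ h

# Numerical convexity check along random segments.
for _ in range(5):
    x, y = rng.normal(size=dim), rng.normal(size=dim)
    lam = rng.uniform()
    lhs = net(lam * x + (1 - lam) * y)
    rhs = lam * net(x) + (1 - lam) * net(y)
    print(bool(lhs <= rhs + 1e-9))
```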

To illustrate how a function described by (7a)–(7c) approximates a convex function, consider the two-dimensional example defined by

(8a)
(8b)

The gradient of this function near the origin is constant, and whenever one of its ReLU units becomes active, the gradient gets a corresponding increment. Figure 2 shows the level curves of the function obtained.

Figure 2: Level curves of the function defined in (8a)–(8b), which resemble those of a bowl-shaped function.

Note that the level curves shown in Figure 2 resemble the level curves of a bowl-shaped function. If we used a different matrix instead of the identity matrix, the columns of that matrix would determine the normals of the lines beyond which the gradient is incremented, while the bias elements would determine the distances of these lines to the origin.

Even though the sequence of functions given in (7a)–(7c) has been shown to represent only convex functions, it can be used as a building block to represent a much broader class of functions. To show this, let $f$ be a twice-differentiable function, and let $f_j$ denote the $j$-th coordinate of $f$. Then $f_j$ can be written as

$f_j = g_1 - g_2,$

where both $g_1$ and $g_2$ are strictly convex and their gradients at the origin have strictly positive coordinates, e.g.,

$g_1(x) = f_j(x) + \tfrac{c}{2} \|x\|^2 + a^\top x, \qquad g_2(x) = \tfrac{c}{2} \|x\|^2 + a^\top x,$

where $a$ is a vector with sufficiently large positive elements and $c$ is a positive constant large enough to make the Hessian of $g_1$ positive definite everywhere in the domain of $f$. Then, a pair of residual networks of the form (7a)–(7c) can be used to approximate each coordinate of $f$, and consequently, as many pairs of residual networks as the number of coordinates of $f$ can be used to approximate $f$. This architecture is tested on the MNIST data set in Section 5.
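
A small numerical sketch (ours) of such a decomposition, using a hypothetical nonconvex function and constants chosen only for illustration: adding a quadratic term with a sufficiently large coefficient and a positive linear term produces two strictly convex parts whose difference recovers the original function.

```python
import numpy as np

# Illustrative sketch (ours): write a nonconvex f as a difference of two
# strictly convex functions, f = g1 - g2, with
#   g1(x) = f(x) + 0.5 * c * ||x||^2 + a @ x,   g2(x) = 0.5 * c * ||x||^2 + a @ x,
# where c exceeds the largest curvature of -f and a has positive entries so
# that both gradients at the origin have positive coordinates. The example
# function and constants are hypothetical, chosen only for illustration.
def f(x):
    return np.sin(x[0]) * np.cos(x[1])   # smooth, nonconvex on R^2

c = 3.0                                  # Hessian of f has eigenvalues in [-1, 1], so c = 3 suffices
a = np.array([2.0, 2.0])                 # positive vector

g1 = lambda x: f(x) + 0.5 * c * x @ x + a @ x
g2 = lambda x: 0.5 * c * x @ x + a @ x

# Check the decomposition and (numerically) the convexity of g1 along segments.
rng = np.random.default_rng(3)
for _ in range(5):
    x, y, lam = rng.normal(size=2), rng.normal(size=2), rng.uniform()
    assert np.isclose(f(x), g1(x) - g2(x))
    z = lam * x + (1 - lam) * y
    print(bool(g1(z) <= lam * g1(x) + (1 - lam) * g1(y) + 1e-9))
```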

Given that a single residual network has been shown to perform well in practice, one could question the necessity of decomposing the functions into two parts. A similar decomposition for linear mappings is shown in Section 3.2 to be necessary and to improve the convergence of the parameters.

3.2 Positive Eigenvalue Decomposition for Linear Functions

In Section 2, Theorem 2 was stated only for the linear mappings with positive eigenvalues. Even though we could argue that we might as well initialize some of the diagonal elements of the weight matrices with -1 to estimate negative eigenvalues, this is not possible without knowing the signs of the eigenvalues a priori. On the other hand, if all the weight matrices are initialized as the identity, then the diagonal elements corresponding to the negative eigenvalues converge to 0, not to a negative value, which is shown next.

Proposition 3.

Assume that and is used for all to initialize the gradient descent algorithm to solve

Then, each converges to 0 unless , where .

Proof.

Similar to the proof of Proposition 2, we can write the update rule for any weight as

which has one equilibrium at and another at . If and , it can be shown by induction that

for all . As a result, converges to 0. ∎

To resolve this issue of convergence for negative parameters, we can decompose the linear mappings. First consider

where minimization is over and for all . Denoting by , we can write the gradient update rule for and as

If for all , we obtain

(9a)
(9b)

where . Note that if , decreases and increases initially, bringing closer to zero. Even though the origin is still an equilibrium, it is unstable and cannot converge to it.
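
The following sketch (ours) illustrates the scalar case numerically, assuming the double network estimates $r$ as a difference of two products, $\prod_j w_j - \prod_j v_j$, both initialized at 1; the exact parametrization in (9a)–(9b) may differ. A single product initialized at 1 stalls near 0 when $r$ is negative, as Proposition 3 states, while the double parametrization reaches the negative target.

```python
import numpy as np

# Illustrative sketch (ours): estimating a negative scalar r with (i) a single
# product w1*...*wL initialized at 1, which stalls near 0 as in Proposition 3,
# and (ii) a double parametrization prod(w) - prod(v), both initialized at 1,
# which can move toward a negative target. The loss is 0.5 * (r - estimate)^2
# (data term normalized to 1); the exact parametrization of Section 3.2 may differ.
r, L, step, iters = -2.0, 3, 0.01, 4000

# Single product of scalars.
w = np.ones(L)
for _ in range(iters):
    p = np.prod(w)
    w -= step * (p - r) * p / w            # d/dw_k of 0.5 * (r - prod(w))^2
print("single network estimate :", np.prod(w))          # close to 0, not r

# Difference of two products.
w, v = np.ones(L), np.ones(L)
for _ in range(iters):
    est = np.prod(w) - np.prod(v)
    err = est - r
    w -= step * err * np.prod(w) / w       # gradient with respect to w_k
    v -= step * (-err) * np.prod(v) / v    # gradient with respect to v_k
print("double network estimate :", np.prod(w) - np.prod(v))   # close to r
```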

Based on the scalar case, we can build a double linear network to estimate a linear mapping in higher dimensions as well, where the minimization is over two sets of weight matrices whose products are subtracted from each other. If the gradient descent algorithm is initialized with the identity matrices, we expect the convergence of the training error for the double network to be faster than that for the single network. To confirm this, we generated a set of random diagonalizable matrices with eigenvectors drawn from the normal distribution and eigenvalues drawn from a uniform distribution, and we compared the training errors of the two networks. As expected, the convergence rate of the double network was consistently better than that of the single network for all the matrices generated. It was also observed that the gradient descent for the single network became unstable for some of the matrices, while the algorithm remained stable for the double network. Figure 3 shows a typical comparison of the training error for the two networks.

Figure 3: Comparison of training a single residual network and a double residual network.

4 Properties of the Local Optima

In the previous section, we showed that a pair of residual networks with ReLU nonlinearities could be used to approximate a large class of functions. In this section, we show that the solutions obtained by training the network described in (7a)–(7c) via gradient descent have low Lipschitz constants and are unlikely to overfit the data.

Consider a simplified version of the network given in (7a)–(7c) with and for all :

(10a)
(10b)
(10c)

where has strictly positive entries, and , have nonnegative entries for all .

Given a set of points and their labels, assume that we train a network described by (10a)–(10c) by minimizing the mean squared error:

(11)

Let and denote the parameters obtained as the local optimum of the problem (11). For any at which the function is differentiable, we can write

where

and satisfies

The matrix corresponds to the change in the gradient caused by the activated ReLU functions at layer for the point , and consequently, denotes the gradient of the estimate with respect to at point . Given that all elements of are strictly positive, the vector also has strictly positive elements for all and .

The first order local optimality condition for and dictates that

(12a)
(12b)

Note that the conditions (12a)–(12b) impose that the error vector is orthogonal to a collection of vectors. Though these vectors are not necessarily linearly independent, the same indices of these vectors are zero for the points on the same affine piece of the estimate, since the ReLU functions are activated and deactivated simultaneously for these points. As a result, each affine piece of the estimate is likely to be a solution to a weighted least-squares problem for the points corresponding to that piece.

If the bias parameters are distributed such that the whole space is spanned by the vectors to which the error vector is orthogonal, the estimator fits all the data points perfectly. However, this is unlikely to happen due to the optimization of the bias parameters. The gradient of the cost function (11) with respect to a bias parameter around a local optimum involves an indicator of whether the corresponding ReLU unit is active, scaled by the residual weight parameters. If the number of layers of the network is large, these weights are expected to be very close to 0 for the residual network. In addition, the points for which the ReLU function is inactive do not contribute to the gradient. As a result, the gradient of the cost function with respect to the bias parameters is likely to vanish quickly, and consequently, the bias parameters change very little during training. This suggests that the final values of the bias parameters heavily depend on their initialization.
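
As a quick check consistent with this observation (our toy example, not the paper's experiment), the sketch below builds a residual network with small residual weights and compares the average gradient magnitudes of the weight and bias parameters; the gradients with respect to the biases come out noticeably smaller.

```python
import torch

# Toy check (ours): in a residual network with small residual weights, the
# gradient of the loss with respect to the bias parameters is considerably
# smaller in magnitude than the gradient with respect to the weights.
torch.manual_seed(0)
depth, dim, n = 10, 1, 64
x = torch.rand(n, dim)
y = 2.0 * x                                  # arbitrary regression targets

weights = [torch.full((dim, dim), 0.05, requires_grad=True) for _ in range(depth)]
biases = [torch.rand(dim, requires_grad=True) for _ in range(depth)]

h = x
for W, b in zip(weights, biases):
    h = h + torch.relu(h - b) @ W.t()        # residual layer with ReLU units
loss = ((h - y) ** 2).mean()
loss.backward()

grad_w = torch.stack([W.grad.abs().mean() for W in weights]).mean()
grad_b = torch.stack([b.grad.abs().mean() for b in biases]).mean()
print("mean |grad| of weights:", grad_w.item())
print("mean |grad| of biases :", grad_b.item())
```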

To verify these claims, we trained two estimators with identical architectures to approximate a piecewise affine function. Each estimator was built as the difference of two networks of the form (10a)–(10c) with 10 layers and scalar weight parameters. The initial values of the bias parameters were drawn from uniform distributions with different ranges for the two estimators. Figure 4 shows the function and the two estimates obtained by Nesterov’s accelerated gradient descent algorithm (Nesterov, 1983).

Figure 4: The two estimates have the same network architecture but different initializations for the bias parameters. Depending only on the initialization of the bias parameters, one of the estimates fails to fit the data perfectly, in which case it provides an estimate with a low Lipschitz constant.

Initialized with a larger range for the bias parameters, the first estimate fits all the data points. The second estimate, on the other hand, fails to fit the data perfectly even though its network had 10 layers and could therefore have had 10 segments. Nevertheless, over the region where it fails to fit the data, it is close to a linear fit of the points belonging to that region. Consequently, the Lipschitz constant of this estimate at any point is either the same as or less than that of the original function.

A small Lipschitz constant, and correspondingly a small spectral norm, of an estimator is an indicator of low excess risk (Bartlett et al., 2017b). Therefore, we expect an estimator with the decomposed structure to generalize well, which is confirmed on the MNIST data set in the next section.

5 Experiment on MNIST

We tested the decomposed model introduced in Section 3.1 on the MNIST data set, which contains images of handwritten digits. Since there are 10 classes, we constructed 10 pairs of residual networks in total. Instead of feeding the raw images into these networks, we first used one convolutional layer with 64 filters to extract the edges as features, and then reduced the dimension of the output of this layer by taking the maximum over non-overlapping windows. The output of this layer was then given as the input to all of the residual networks, each of which had 3 layers.
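
The sketch below (ours) shows one way such a model could be assembled; the filter size, pooling window, and weight scales are hypothetical, since those details are not given here. It follows the structure described above: one convolutional layer, non-overlapping max pooling, and one pair of three-layer residual networks per class, with each class score formed as the difference of the two parts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the decomposed classifier (ours, for illustration only). The
# 5x5 filter size, the 7x7 pooling window, and the weight scales are
# hypothetical choices.
class ResidualPart(nn.Module):
    def __init__(self, dim, depth=3):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(0.01 * torch.rand(dim, dim)) for _ in range(depth)])
        self.biases = nn.ParameterList(
            [nn.Parameter(torch.rand(dim)) for _ in range(depth)])
        self.readout = nn.Parameter(torch.rand(dim))

    def forward(self, h):
        for W, b in zip(self.weights, self.biases):
            h = h + F.relu(h - b) @ W.t()    # residual layer with ReLU units
        return h @ self.readout

class DecomposedClassifier(nn.Module):
    def __init__(self, num_classes=10, channels=64, pool=7):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=5, padding=2)
        self.pool = pool
        dim = channels * (28 // pool) * (28 // pool)
        # One pair of residual networks per class; the class score is the
        # difference of the two parts.
        self.part_a = nn.ModuleList([ResidualPart(dim) for _ in range(num_classes)])
        self.part_b = nn.ModuleList([ResidualPart(dim) for _ in range(num_classes)])

    def forward(self, x):
        h = F.max_pool2d(F.relu(self.conv(x)), self.pool)   # non-overlapping pooling
        h = h.flatten(1)
        scores = [a(h) - b(h) for a, b in zip(self.part_a, self.part_b)]
        return torch.stack(scores, dim=1)

model = DecomposedClassifier()
print(model(torch.randn(2, 1, 28, 28)).shape)   # torch.Size([2, 10])
```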

We trained this network for 12 epochs and recorded its accuracy on the training and the test data after each epoch, which is plotted in Figure 5. The number of parameters was much larger than that of standard networks since we used 20 residual networks in total. Furthermore, no explicit regularization or methods such as dropout or batch normalization were used. Nevertheless, the training and the test errors were remarkably close to each other throughout the training. After the 12th epoch, the training and the test accuracy were 97.91% and 97.58%, respectively.

Figure 5: Accuracy of the decomposed network on the training and the test data with no explicit regularization, batch normalization, or dropout.

6 Discussion

Analyzing the dynamics of the optimization algorithms is crucial for a complete understanding of deep neural networks. The step size of the gradient descent algorithm, for example, is a key factor for the Lyapunov stability of its equilibria and should not be disregarded: it determines whether the algorithm can converge to the local optima. Taking the effect of the step size into account, we showed that most of the local optima of deep linear networks actually cannot be discovered via gradient-based methods, even though all of them are known to be global optima (Kawaguchi, 2016).

Similarly, the equilibria of other networks could also be unstable for a given step size, and gradient-based algorithms might only get close to a local minimum but not converge to it even if the algorithm is not stochastic. Consequently, only an approximate solution is obtained for the optimization problem, and this naturally contributes to the lack of overfitting in deep neural networks despite their large number of parameters (Zhang et al., 2017).

We also observed that the cost function used for training a residual network can easily become insensitive to the bias parameters, and the final values of these parameters might heavily depend on their initialization. Though this seems like a problem, it is another factor contributing to the generalization. This happens only for the bias parameters in the residual networks, but it could happen to the weight parameters as well in other types of networks, and that is what we already know as the vanishing gradient problem. In this respect, the hardness of the optimization provides some level of regularization for deep neural networks.

We showed that the parametrization of residual networks allows the equilibria of the gradient descent algorithm to remain stable, thereby facilitating the optimization. If unitary matrices are used instead of the identity, similar results could possibly be obtained. In addition, using an orthonormal set of data points at each step of the gradient descent was seen to help decouple the dynamics of the parameters. This might be a partial explanation for the improvements provided by using batch-normalization in practice (Ioffe & Szegedy, 2015).

We proposed a network architecture which provides an understanding of how each layer improves the approximation in a deep neural network. We chose convexity as a property to decompose the functions into two parts. The fact that the function in each layer remained invertible with this decomposition was critical. Other decompositions could alternatively be generated and tested.

We showed that the architecture introduced generalizes very well on the MNIST data set. The training, however, was very slow due to the large number of parameters and the small gradients of the bias parameters. The convergence could possibly be improved by using alternatives to gradient-based algorithms to update the bias parameters, although this might risk overfitting the data.

It was known that deep linear networks produce solutions with small Lipschitz constants under some conditions (Gunasekar et al., 2017), and the Lipschitz constant of the estimators could be used to explain their generalization (Bartlett et al., 2017b). We demonstrated that the solutions obtained with residual networks also have the same property, and hence, generalize well.

Lastly, enlarging the region of attraction of the equilibria by choosing a specific control is a standard problem in nonlinear control theory. Finding a state dependent step size to improve the convergence of the gradient descent for neural networks is an ongoing work.

References