Ill-Posedness and Optimization Geometry for Nonlinear Neural Network Training

02/07/2020, by Thomas O'Leary-Roseberry, et al.

In this work we analyze the role nonlinear activation functions play at stationary points of dense neural network training problems. We consider a generic least squares loss function training formulation. We show that the nonlinear activation functions used in the network construction play a critical role in classifying stationary points of the loss landscape. We show that for shallow dense networks, the nonlinear activation function determines the Hessian nullspace in the vicinity of global minima (if they exist), and therefore determines the ill-posedness of the training problem. Furthermore, for shallow nonlinear networks we show that the zeros of the activation function and its derivatives can lead to spurious local minima, and discuss conditions for strict saddle points. We extend these results to deep dense neural networks, showing that the last activation function plays an important role in classifying stationary points, due to how it shows up in the gradient from the chain rule.


1 Introduction

Here, we characterize the optimization geometry of nonlinear least-squares regression problems for generic dense neural networks and analyze the ill-posedness of the training problem. Neural networks are a popular nonlinear functional approximation technique that is successful in data-driven approximation regimes. A one-layer neural network can approximate any continuous function on a compact set to a desired accuracy, given enough neurons (Cybenko1989; Hornik1989). Dense neural networks have been shown to be able to approximate polynomials arbitrarily well given enough hidden layers (SchwabZech2019). While no general functional analytic approximation theory exists for neural networks, they are widely believed to have great approximation power for complicated patterns in data (PoggioLiao2018).

Training a neural network, i.e., determining optimal values of network parameters to fit given data, can be accomplished by solving the nonconvex optimization problem of minimizing a loss function (known as empirical risk minimization). Finding a global minimum is NP-hard, so in practice one usually settles for local minimizers (Bertsekas1997; MurtyKabadi1987). Here we seek to characterize how nonlinear activation functions affect the least-squares optimization geometry at stationary points. In particular, we wish to characterize the conditions for strict saddle points and spurious local minima. Strict saddle points are stationary points where the Hessian has at least one direction of strictly negative curvature. They do not pose a significant problem for neural network training, since they can be escaped efficiently with first and second order methods (DauphinPescanuGulcehre2014; JinChiGeEtAl2017; JinNetrapalliJordan2017; NesterovPolyak2006; OLearyRoseberryAlgerGhattas2019). On the other hand, spurious local minima (where the gradient vanishes but the data misfit is nonzero) are more problematic; escaping from them in a systematic way may require third order information (AnandkumarGe2016).

We also seek to analyze the rank deficiency of the Hessian of the loss function at global minima (if they exist), in order to characterize the ill-posedness of the nonlinear neural network training problem. Training a neural network is, mathematically, an inverse problem; rank deficiency of the Hessian often makes solution of the inverse problem unstable to perturbations in the data and leads to severe numerical difficulties when using finite precision arithmetic (Hansen98). While early termination of optimization iterations often has a regularizing effect (Hanke95; EnglHankeNeubauer96), and general-purpose regularization operators can be invoked, when to terminate the iterations and how to choose the regularization to limit bias in the solution are omnipresent challenges. On the other hand, characterizing the nullspace of the Hessian can provide a basis for developing a principled regularization operator that parsimoniously annihilates this nullspace, as has been recently done for shallow linear neural networks (ZhuSoudry2018).
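As an illustration of the kind of regularization this enables, the following sketch (our own, not an operator proposed in this paper or in ZhuSoudry2018) builds a Tikhonov-type term from an orthonormal basis of the numerical nullspace of a Hessian, so that only nullspace directions are penalized and the range space is left untouched:

```python
import numpy as np

def nullspace_regularized_hessian(H, alpha=1.0, tol=1e-8):
    """Add a Tikhonov-type term acting only on the (numerical) nullspace of H.

    H is assumed symmetric positive semidefinite (e.g., a Gauss-Newton Hessian).
    Returns H + alpha * N N^T, where the columns of N span the nullspace of H.
    """
    eigvals, eigvecs = np.linalg.eigh(H)
    N = eigvecs[:, eigvals < tol * eigvals.max()]   # (near-)null directions
    return H + alpha * N @ N.T

# Toy usage: a rank-deficient PSD matrix becomes invertible without
# perturbing curvature in its range space.
A = np.random.randn(5, 3)
H = A @ A.T                        # rank 3, so a 2-dimensional nullspace
H_reg = nullspace_regularized_hessian(H, alpha=1.0)
print(np.linalg.matrix_rank(H), np.linalg.matrix_rank(H_reg))   # 3, 5
```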

We consider both shallow and deep dense neural network parametrizations. The dense parametrization is sufficiently general since convolution operations can be represented as cyclic matrices with repeating block structure. For the sake of brevity, we do not consider affine transformations, but this work can easily be extended to this setting. We begin by analyzing shallow dense nonlinear networks, for which we show that the nonlinear activation function plays a critical role in classifying stationary points. In particular, if the neural network can exactly fit the data, and zero misfit global minima exist, we show how the Hessian nullspace depends on the activation function and its first derivative at these points.

For linear networks, results about local minima, global minima, strict saddle points, and optimal regularization operators have been shown (BaldiHornik1989; ZhuSoudry2018). The linear network case is a nonlinear matrix factorization problem: given data matrices $X \in \mathbb{R}^{d_x \times N}$ and $Y \in \mathbb{R}^{d_y \times N}$, one seeks weight matrices $W_1 \in \mathbb{R}^{r \times d_x}$ and $W_2 \in \mathbb{R}^{d_y \times r}$ that minimize

$\frac{1}{2} \left\| W_2 W_1 X - Y \right\|_F^2.$ (1)

When the data matrix $X$ has full row rank, the Eckart-Young theorem characterizes the solution through a rank-$r$ truncated SVD, which we denote with a subscript $r$: the optimal product satisfies

$W_2 W_1 X = \left( Y X^\dagger X \right)_r, \qquad X^\dagger = X^T \left( X X^T \right)^{-1}.$ (2)

The solution is non-unique, since for any invertible matrix $C \in \mathbb{R}^{r \times r}$,

$\widetilde{W}_1 = C W_1, \qquad \widetilde{W}_2 = W_2 C^{-1}$ (3)

is also a solution. We show that, in addition to inheriting issues related to the ill-posedness of matrix factorization, the nonlinear activation functions in the nonlinear training problem create additional ill-posedness and non-uniqueness.
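The non-uniqueness in (3) is easy to verify numerically. The following sketch is our own illustration; the dimensions $d_x$, $d_y$, $r$, $N$ and variable names are assumptions for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, r, N = 6, 4, 3, 20            # assumed dimensions
X = rng.standard_normal((d_x, N))
Y = rng.standard_normal((d_y, N))
W1 = rng.standard_normal((r, d_x))       # encoder
W2 = rng.standard_normal((d_y, r))       # decoder
C = rng.standard_normal((r, r))          # generically invertible

def loss(W1, W2):
    return 0.5 * np.linalg.norm(W2 @ W1 @ X - Y, "fro") ** 2

# (C W1, W2 C^{-1}) realizes the same product and the same loss as (W1, W2).
W1_tilde, W2_tilde = C @ W1, W2 @ np.linalg.inv(C)
print(np.allclose(W2 @ W1, W2_tilde @ W1_tilde))             # True
print(np.isclose(loss(W1, W2), loss(W1_tilde, W2_tilde)))    # True
```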

We show that stationary points not corresponding to zero misfit global minima are determined by the activation function and its first derivative through an orthogonality condition. In contrast to linear networks, for which the existence of spurious local minima depends only on the rank of the training data and the weights, we show that for nonlinear networks, both spurious local minima and strict saddle points exist, and depend on the activation functions, the training data, and the weights.

We extend these results to deep dense neural networks where stationary points can arise from exact reconstruction of the training data by the network, or an orthogonality condition that involves the activation functions of each layer of the network and their first derivatives.

For nonlinear neural networks, some work exists on analyzing networks with ReLU activation functions; in particular, Safran et al. establish conditions for the existence of spurious local minima for two-layer ReLU networks (SafranShamir2017).

1.1 Notation and Definitions

For a given matrix $A \in \mathbb{R}^{m \times n}$, its vectorization $\operatorname{vec}(A)$ is the $mn$ vector formed by stacking the columns of $A$ sequentially. Given a vector $v \in \mathbb{R}^{n}$, its diagonalization $\operatorname{diag}(v)$ is the $n \times n$ diagonal matrix whose $i$th diagonal entry is the $i$th component of $v$. The diagvec operation is the composition $\operatorname{diagvec}(A) = \operatorname{diag}(\operatorname{vec}(A))$; this is sometimes shortened to dvec. The identity matrix in $\mathbb{R}^{n \times n}$ is denoted $I_n$. We use the notation $\partial_A f$ to mean derivatives of a function $f$ with respect to a matrix $A$, and $\partial_{\operatorname{vec}(A)} f$ when expressing derivatives with respect to a vectorized matrix. For matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$, the Kronecker product $A \otimes B$ is the $mp \times nq$ block matrix

$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}.$ (4)

For matrices $A, B \in \mathbb{R}^{m \times n}$, $A \odot B$ is the Hadamard (element-wise) product. For an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$, the expression $A \perp B$ means that the rows of $A$ are orthogonal to the columns of $B$, and thus $A B = 0$.
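As a quick numerical sanity check of this notation (our own sketch; the matrix sizes are arbitrary): vectorization stacks columns, the Kronecker product satisfies the standard identity $\operatorname{vec}(AXB) = (B^T \otimes A)\operatorname{vec}(X)$, and the Hadamard product is elementwise.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
Xm = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, order="F")     # column-stacking vectorization
dvec = lambda M: np.diag(vec(M))             # the "diagvec" (dvec) operation
print(dvec(A).shape)                         # (12, 12)

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ Xm @ B)
rhs = np.kron(B.T, A) @ vec(Xm)
print(np.allclose(lhs, rhs))                 # True

# Hadamard (elementwise) product of two same-sized matrices
C, D = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
print((C * D).shape)                         # (3, 4)
```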

For a differentiable function $F(w)$ and a parameter $w^*$, we say that $w^*$ is a first order stationary point if $\nabla F(w^*) = 0$. We say that $w^*$ is a strict saddle point if the Hessian $\nabla^2 F(w^*)$ has at least one negative eigenvalue. We say that $w^*$ is a local minimum if the eigenvalues of the Hessian are all nonnegative. We say that $w^*$ is a global minimum if $F(w^*) \le F(w)$ for all $w$.

2 Stationary Points of Shallow Dense Networks

We start by considering a one-layer dense neural network training problem. Given training data matrices $X \in \mathbb{R}^{d_x \times N}$ and $Y \in \mathbb{R}^{d_y \times N}$, the shallow neural network architecture consists of an encoder weight matrix $W_1 \in \mathbb{R}^{r \times d_x}$, a nonlinear activation function $\sigma$ (which is applied element-wise), and a decoder weight matrix $W_2 \in \mathbb{R}^{d_y \times r}$. The training problem (empirical risk minimization) may then be stated as

$\min_{W_1, W_2} F(W_1, W_2) = \frac{1}{2} \left\| W_2 \sigma(W_1 X) - Y \right\|_F^2.$ (5)

We will begin by analyzing first order stationary points of the objective function $F$.
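In code, the objective (5) is only a few lines. A minimal sketch, assuming column-wise data matrices and a smooth elementwise activation such as $\tanh$ (these choices are ours, not prescribed by the paper):

```python
import numpy as np

def shallow_objective(W1, W2, X, Y, act=np.tanh):
    """Least squares loss F(W1, W2) = 0.5 * || W2 act(W1 X) - Y ||_F^2."""
    misfit = W2 @ act(W1 @ X) - Y
    return 0.5 * np.linalg.norm(misfit, "fro") ** 2
```

The misfit matrix $W_2 \sigma(W_1 X) - Y$ appearing here is the quantity whose vanishing, or orthogonality properties, classifies the stationary points analyzed below.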

Theorem 1.

The gradient of the objective function $F$ is given by

$\nabla_{W_2} F = \left( W_2 \sigma(W_1 X) - Y \right) \sigma(W_1 X)^T$ (6)
$\nabla_{W_1} F = \left[ \left( W_2^T \left( W_2 \sigma(W_1 X) - Y \right) \right) \odot \sigma'(W_1 X) \right] X^T$ (7)

First order stationary points are characterized by two main conditions:

  1. A global minimum where the misfit is exactly zero: $W_2 \sigma(W_1 X) - Y = 0$. Whether such points exist depends on the representation capability of the network and on the data.

  2. A stationary point not corresponding to zero misfit: $W_2 \sigma(W_1 X) - Y \ne 0$, with $\left( W_2 \sigma(W_1 X) - Y \right) \sigma(W_1 X)^T = 0$ and $\left[ \left( W_2^T \left( W_2 \sigma(W_1 X) - Y \right) \right) \odot \sigma'(W_1 X) \right] X^T = 0$.

Proof.

The partial derivatives of the objective function are derived in Lemma 2. At a first order stationary point of the objective function both partial derivatives must be zero:

$\left( W_2 \sigma(W_1 X) - Y \right) \sigma(W_1 X)^T = 0$ (8)
$\left[ \left( W_2^T \left( W_2 \sigma(W_1 X) - Y \right) \right) \odot \sigma'(W_1 X) \right] X^T = 0$ (9)

In the case that $W_2 \sigma(W_1 X) - Y = 0$, both terms are zero, and the corresponding choices of $(W_1, W_2)$ define a global minimum. This can be seen since $F$ is a nonnegative function, and in this case it is exactly zero.

Stationary points where $W_2 \sigma(W_1 X) - Y \ne 0$ are characterized by orthogonality conditions. If (8) holds, then $\left( W_2 \sigma(W_1 X) - Y \right) \perp \sigma(W_1 X)^T$; that is, the rows of the misfit are orthogonal to the columns of $\sigma(W_1 X)^T$. If (9) holds, then similarly $\left[ \left( W_2^T \left( W_2 \sigma(W_1 X) - Y \right) \right) \odot \sigma'(W_1 X) \right] \perp X^T$. ∎
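The gradient formulas (6) and (7) can be checked against finite differences. A minimal sketch of our own, with an assumed $\tanh$ activation (whose derivative is $1 - \tanh^2$), random data, and arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_y, r, N = 5, 3, 4, 10
X, Y = rng.standard_normal((d_x, N)), rng.standard_normal((d_y, N))
W1, W2 = rng.standard_normal((r, d_x)), rng.standard_normal((d_y, r))

act, dact = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

def F(W1, W2):
    return 0.5 * np.linalg.norm(W2 @ act(W1 @ X) - Y, "fro") ** 2

def grads(W1, W2):
    Z = W1 @ X
    M = W2 @ act(Z) - Y                                  # misfit
    gW2 = M @ act(Z).T                                   # equation (6)
    gW1 = ((W2.T @ M) * dact(Z)) @ X.T                   # equation (7)
    return gW1, gW2

# Finite-difference check of the W1 gradient in a random direction.
gW1, gW2 = grads(W1, W2)
V = rng.standard_normal(W1.shape)
eps = 1e-6
fd = (F(W1 + eps * V, W2) - F(W1 - eps * V, W2)) / (2 * eps)
print(np.isclose(fd, np.sum(gW1 * V), rtol=1e-5))        # True
```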

Corollary 1.1.

Any $(W_1, W_2)$ such that $\sigma(W_1 X) = 0$ and $\sigma'(W_1 X) = 0$ corresponds to a first order stationary point of the objective function $F$. In particular, any $W_1$ for which $\sigma(W_1 X) = \sigma'(W_1 X) = 0$ corresponds to a first order stationary point for all $W_2$.

This result implies that points in parameter space where the activation function and its derivatives are zero can lead to sub-optimal stationary points. Note that if a zero misfit minimum is not possible, there may or may not be an actual global minimum (there will always be a global infimum), but since the misfit is not zero, any such point will still fall into the second category. In what follows we characterize the optimization geometry of the objective function at global minima, and at degenerate points of the activation function, i.e., points for which $\sigma(W_1 X) = \sigma'(W_1 X) = 0$.
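A quick numerical illustration of Corollary 1.1 (our own sketch, with assumed dimensions and a ReLU activation, whose value and derivative both vanish on the negative half-line): if $W_1 X$ has strictly negative entries, the gradient formulas (6)-(7) evaluate to zero for every $W_2$, even though the misfit does not vanish.

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_y, r, N = 5, 3, 4, 10
X = np.abs(rng.standard_normal((d_x, N)))       # nonnegative data
Y = rng.standard_normal((d_y, N))
W1 = -np.abs(rng.standard_normal((r, d_x)))     # forces W1 @ X < 0 elementwise
W2 = rng.standard_normal((d_y, r))

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

Z = W1 @ X
M = W2 @ relu(Z) - Y                            # misfit = -Y here, far from zero
gW2 = M @ relu(Z).T                             # equation (6)
gW1 = ((W2.T @ M) * drelu(Z)) @ X.T             # equation (7)
print(np.linalg.norm(gW1), np.linalg.norm(gW2))    # both exactly 0.0
print(np.linalg.norm(M))                           # nonzero: a spurious stationary point
```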

2.1 Zero misfit minima

Suppose that for given data $X, Y$, there exist weights $(W_1, W_2)$ such that $W_2 \sigma(W_1 X) = Y$. As was discussed in Theorem 1, such points correspond to global minima. In what follows we characterize the Hessian nullspace at these points, and the corresponding ill-posedness of the training problem.

Theorem 2.

Characterization of the Hessian nullspace at a global minimum. Given data $X, Y$, suppose there exist weight matrices $(W_1, W_2)$ such that $W_2 \sigma(W_1 X) = Y$. Suppose further that $X$ and $\sigma(W_1 X)$ are full rank. Then the Hessian nullspace is characterized by directions $(\widehat{W}_1, \widehat{W}_2)$ such that

(10)
(11)

In particular, for any direction $\widehat{W}_1$ such that the directional derivative $\sigma'(W_1 X) \odot (\widehat{W}_1 X)$ is zero, the weight matrices

(12)

are in the nullspace of the Hessian matrix.

Proof.

Since the misfit is zero, the Hessian is exactly the Gauss-Newton Hessian, which is derived in Lemma 3. The matrices $(\widehat{W}_1, \widehat{W}_2)$ are in the nullspace of the Hessian if

(13)
(14)

For this to be the case, two orthogonality constraints must hold; the Hessian nullspace is fully characterized by points that satisfy them. One way in which these constraints are satisfied is if

(15)
(16)

Substituting (15) into (16) we have

(17)

The first term is nonzero, since $X$ and $\sigma(W_1 X)$ are assumed to be full rank. For the Hadamard product to be zero, the second term must be zero:

(18)

This is accomplished when the directional derivative $\sigma'(W_1 X) \odot (\widehat{W}_1 X)$ vanishes; since $X$ is full rank, the condition reduces to this directional derivative constraint. Suppose that $\widehat{W}_1$ satisfies this directional derivative constraint; then we can find a corresponding $\widehat{W}_2$ such that $(\widehat{W}_1, \widehat{W}_2)$ are in the Hessian nullspace from (15):

(19)

Note that $\sigma(W_1 X) \sigma(W_1 X)^T$ is invertible since $\sigma(W_1 X)$ is assumed to be full rank. ∎

This result shows that the Hessian may have a nontrivial nullspace at zero misfit global minima; in particular, if there are any local directions $\widehat{W}_1$ satisfying the directional derivative constraint $\sigma'(W_1 X) \odot (\widehat{W}_1 X) = 0$, then the Hessian is guaranteed to have at least one zero eigenvalue. If the Hessian has at least one zero eigenvalue, then the candidate global minimum is not unique, and instead lies on a manifold of global minima. Global minima are in this case weak minima.

This result is similar to the non-uniqueness of the linear network training problem, Equation (3). However, in this case the linear rank constraints are obfuscated by the nonlinear activation function, and, additionally, the zeros of the activation function lead to more possibilities for Hessian rank-deficiency and associated ill-posedness.

For weak global minima, regularization schemes that annihilate the Hessian nullspace while leaving the range space unscathed can be used to make the training problem well-posed without biasing the solution. Furthermore, such regularization schemes will accelerate the asymptotic convergence rates of second order methods (Newton convergence deteriorates from quadratic to linear in the presence of singular Hessians), thereby making them even more attractive relative to first order methods.
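The rank deficiency described above is easy to observe numerically. The sketch below is our own construction (a $\tanh$ activation and an over-parameterized width): it manufactures an exact zero-misfit minimum by generating $Y$ from the network itself, forms a finite-difference Hessian of the loss there, and inspects its smallest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_y, r, N = 3, 2, 8, 5                      # width r chosen larger than needed
X = rng.standard_normal((d_x, N))
W1 = rng.standard_normal((r, d_x))
W2 = rng.standard_normal((d_y, r))
Y = W2 @ np.tanh(W1 @ X)                         # by construction, zero misfit at (W1, W2)

def F(w):
    W1_, W2_ = w[: r * d_x].reshape(r, d_x), w[r * d_x :].reshape(d_y, r)
    return 0.5 * np.linalg.norm(W2_ @ np.tanh(W1_ @ X) - Y, "fro") ** 2

w0 = np.concatenate([W1.ravel(), W2.ravel()])
n, eps = w0.size, 1e-4
I = np.eye(n)

# Central finite-difference Hessian of F at the zero-misfit minimum.
H = np.zeros((n, n))
for i in range(n):
    gp = np.array([(F(w0 + eps * I[i] + eps * I[j]) - F(w0 + eps * I[i] - eps * I[j])) / (2 * eps) for j in range(n)])
    gm = np.array([(F(w0 - eps * I[i] + eps * I[j]) - F(w0 - eps * I[i] - eps * I[j])) / (2 * eps) for j in range(n)])
    H[i] = (gp - gm) / (2 * eps)
H = 0.5 * (H + H.T)

eigs = np.sort(np.linalg.eigvalsh(H))
print(eigs[:5])    # several near-zero eigenvalues: the Hessian nullspace
```

Since at a zero-misfit point the Hessian equals the Gauss-Newton term $J^T J$, and $J$ has only $d_y N$ rows, at least $r d_x + d_y r - d_y N$ of these eigenvalues must vanish (30 of 40 in this toy setup), before even counting the activation-dependent directions characterized in Theorem 2.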

2.2 Strict Saddle Points and Spurious Local Minima.

As was shown in Theorem 1 and Corollary 1.1, there are stationary points where the misfits are not zero. In this section we show that these points can be both strict saddle points as well as spurious local minima.

Suppose the gradient is zero, but the misfit is nonzero. As was discussed in condition 2 of Theorem 1, such stationary points require orthogonality conditions for matrices that show up in the gradient. Corollary 1.1 establishes that this is achieved if $\sigma(W_1 X) = 0$ and $\sigma'(W_1 X) = 0$. Many activation functions such as ReLU, sigmoid, softmax, softplus, and tanh have many points satisfying these conditions (or at least approximately satisfying them, i.e., $|\sigma| < \epsilon$ and $|\sigma'| < \epsilon$ for small $\epsilon$). Such stationary points are degenerate due to the activation functions. In what follows we show that while these points are likely to be strict saddles, it is possible that some of them have no directions of negative curvature and are thus spurious local minima.

Theorem 3.

Negative Curvature Directions at Degenerate Activation Stationary Points. Let $W_2$ be arbitrary and suppose that $W_1$ is such that $\sigma(W_1 X) = 0$ and $\sigma'(W_1 X) = 0$. Negative curvature directions of the Hessian at such points are characterized by directions $\widehat{W}_1$ such that

$\sum_{i=1}^{r} \sum_{j=1}^{N} \left( W_2^T Y \right)_{ij} \, \sigma''\!\left( (W_1 X)_{ij} \right) \left( \widehat{W}_1 X \right)_{ij}^2 > 0.$ (20)
Proof.

Since $\sigma(W_1 X) = 0$ and $\sigma'(W_1 X) = 0$, all of the terms in the Gauss-Newton Hessian are zero (see Lemma 3). Further, all of the off-diagonal non Gauss-Newton portions are also zero. In this case the only block of the Hessian that is nonzero is the $W_1 W_1$ non Gauss-Newton block (see Lemma 4). We proceed by analyzing an un-normalized Rayleigh quotient for this block in an arbitrary direction $\widehat{W}_1$. From Equation 53 we can compute the quadratic form:

(21)

Expanding this term in a sum we have:

(22)

The result follows noting that $W_2 \sigma(W_1 X) - Y = -Y$ at such points. ∎

Directions that satisfy the negative curvature condition (20) are difficult to understand in their generality, since they depend on the data and the weights. We discuss some example sufficient conditions.

Corollary 3.1.

Saddle point with respect to one data pair. Given data $X, Y$ and a strictly convex activation function $\sigma$, suppose that $\sigma(W_1 X) = 0$ and $\sigma'(W_1 X) = 0$. Suppose that there is a data pair $(x_j, y_j)$ with $x_j \ne 0$ such that $W_2^T \left( W_2 \sigma(W_1 x_j) - y_j \right)$ has at least one negative component. Then $(W_1, W_2)$ is a strict saddle point.

Proof.

If $\sigma'' > 0$ and the $k$th component of $W_2^T \left( W_2 \sigma(W_1 x_j) - y_j \right)$ is negative, then the direction $\widehat{W}_1$ supported only on its $k$th row (chosen to align with $x_j$), with all other components zero, defines a direction of negative curvature.

(23)

Corollary 3.2.

Given the setting of Theorem 3 and a strictly convex activation function $\sigma$, if all elements of one row of $W_2^T \left( W_2 \sigma(W_1 X) - Y \right)$ are negative, then $(W_1, W_2)$ is a strict saddle point.

Proof.

Let the $k$th row of $W_2^T \left( W_2 \sigma(W_1 X) - Y \right)$ satisfy this condition; then any choice of $\widehat{W}_1$ such that all rows other than the $k$th are zero will define a direction of negative curvature. ∎

These conditions are rather restrictive, but they demonstrate the nature of the existence of negative curvature directions. As was stated before, the most general condition for a strict saddle is the existence of $\widehat{W}_1$ satisfying Equation (20). We conjecture that such an inequality should not be hard to satisfy, but as it is a nonlinear inequality, finding general conditions for the existence of such directions is difficult. We have the following result about how the zeros of the activation function and its derivatives can lead to spurious local minima.
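To make the flavor of these conditions concrete, here is a toy numerical check, entirely our own construction rather than an example from the paper: take the strictly convex activation $\sigma(x) = x^2$, which satisfies $\sigma(0) = \sigma'(0) = 0$, set $W_1 = 0$ so that the degeneracy conditions of Corollary 1.1 hold, and probe the curvature of $F$ along a hand-picked direction in $W_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
d_x, d_y, r, N = 4, 3, 3, 8
X = rng.standard_normal((d_x, N))
Y = rng.standard_normal((d_y, N))
W2 = rng.standard_normal((d_y, r))
W1 = np.zeros((r, d_x))                 # sigma(W1 X) = 0 and sigma'(W1 X) = 0 for sigma(x) = x^2

sigma = lambda z: z ** 2

def F(W1_, W2_):
    return 0.5 * np.linalg.norm(W2_ @ sigma(W1_ @ X) - Y, "fro") ** 2

# Second derivative of t -> F(W1 + t*V, W2) at t = 0, by central differences.
def curvature(V, eps=1e-4):
    return (F(W1 + eps * V, W2) - 2 * F(W1, W2) + F(W1 - eps * V, W2)) / eps ** 2

# At this degenerate point the misfit is -Y, and the curvature along V equals
# -2 * sum_kl (W2^T Y)_kl (V X)_kl^2, so it tends to be negative when the
# direction concentrates on a large positive entry of W2^T Y.
B = W2.T @ Y                            # r x N
i, j = np.unravel_index(np.argmax(B), B.shape)
V = np.zeros_like(W1)
V[i, :] = X[:, j] / np.linalg.norm(X[:, j]) ** 2    # makes (V X)_{ij} = 1
print(curvature(V))                     # typically negative: a strict saddle direction
```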

Corollary 3.3.

For a given $(W_1, W_2)$, if $\sigma(W_1 X) = 0$, $\sigma'(W_1 X) = 0$, and $\sigma''(W_1 X) = 0$, then the Hessian at this point is exactly zero and this point defines a spurious local minimum.

Such points exist for functions like ReLU, sigmoid, softmax, softplus, etc. Any activation function that has large regions where it is zero (or near zero) will have such points. The question is then, how common are they? For the aforementioned functions, the function and its derivatives are zero or near zero when the argument of the function is sufficiently negative. For these functions and a given tolerance $\epsilon > 0$, there exists a constant $c$ such that for all $x < c$, $|\sigma(x)| < \epsilon$ and $|\sigma'(x)| < \epsilon$. For ReLU (which is not differentiable at zero) $c = 0$. In one dimension this condition is true for roughly half of the real number line for each of these functions. For the condition to be true for a vector it must be true elementwise. So for the condition

(24)

to hold for a given input datum $x_i$, the encoder $W_1$ must map $x_i$ into the strictly negative orthant of $\mathbb{R}^r$. The probability of drawing a mean zero Gaussian random vector in $\mathbb{R}^r$ that lies in the strictly negative orthant is $2^{-r}$. Furthermore, for this condition to hold for all of $X$, it must be true for each column of the matrix $W_1 X$. The probability of drawing a mean zero Gaussian random matrix in $\mathbb{R}^{r \times N}$ such that each column resides in the strictly negative orthant is $2^{-rN}$. In practice the linearly encoded input data matrix $W_1 X$ is unlikely to have the statistical properties of a mean zero Gaussian, but this heuristic demonstrates that these degenerate points may be improbable to encounter. If the Hessian is exactly zero, one needs third order information to move in a descent direction (AnandkumarGe2016).
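The orthant-probability heuristic is simple to confirm by simulation (a sketch with an arbitrary choice of $r$ and sample count):

```python
import numpy as np

rng = np.random.default_rng(6)
r, num_samples = 6, 200_000

# Fraction of standard Gaussian vectors in R^r landing in the strictly negative orthant.
samples = rng.standard_normal((num_samples, r))
frac = np.mean(np.all(samples < 0, axis=1))
print(frac, 2.0 ** (-r))     # both approximately 0.0156 for r = 6
```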

3 Extension to Deep Networks

In this section we briefly discuss the general conditions for stationary points of a dense neural network. We consider the following parameterization. The weights for an $L$ layer network are $W_1, \dots, W_L$, where $W_1 \in \mathbb{R}^{r_1 \times d_x}$, $W_L \in \mathbb{R}^{d_y \times r_{L-1}}$, and all other $W_i \in \mathbb{R}^{r_i \times r_{i-1}}$. The activation functions $\sigma_i$ are arbitrary. The network parameterization is

(25)
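For concreteness, a deep dense forward map of this kind can be sketched as follows (a minimal implementation of our own; the decoder-last placement of the weights, the shared activation, and the absence of biases are assumptions made for illustration, analogous to the shallow case $W_2 \sigma(W_1 X)$):

```python
import numpy as np

def deep_forward(weights, X, act=np.tanh):
    """Phi(X) = W_L act( W_{L-1} act( ... act(W_1 X) ... ) ).

    `weights` is a list [W_1, ..., W_L]; an elementwise activation is applied
    after every layer except the last (an assumed convention for this sketch).
    """
    Z = X
    for W in weights[:-1]:
        Z = act(W @ Z)
    return weights[-1] @ Z

def deep_loss(weights, X, Y, act=np.tanh):
    return 0.5 * np.linalg.norm(deep_forward(weights, X, act) - Y, "fro") ** 2
```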

We have the following general result about first order stationary points of deep neural networks.

Theorem 4.

Stationary points of deep dense neural networks. The blocks of the gradient of the least squares loss function for the deep neural network (Equation (25)) are as follows:

(26)

Stationary points of the loss function are characterized by two main cases:

  1. The misfit is exactly zero. If such points are possible, then these points correspond to local minima.

  2. For each block the following orthogonality condition holds:

    (27)

This result follows from Lemma 5. There are many different conditions on the weights and activation functions that will satisfy the orthogonality requirement in Equation (27). One specific example is analogous to the condition in Corollary 1.1.

Corollary 4.1.

Any weights such that

(28)
(29)

correspond to a first order stationary point for any $W_L$.

This is the case since the term that is zero in Equation (28) shows up in the $W_L$ block of the gradient, and the term that is zero in Equation (29) shows up in every other block of the gradient via a Hadamard product due to the chain rule.
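A numerical illustration of this chain-rule effect, in the same spirit as the shallow ReLU example above (again our own construction, with assumed widths and a decoder-last parameterization): if the last hidden pre-activation is elementwise negative under ReLU, finite-difference directional derivatives of the loss vanish in every block, for any choice of the final weight matrix, while the misfit remains large.

```python
import numpy as np

rng = np.random.default_rng(7)
d_x, d_y, N = 4, 2, 12
relu = lambda z: np.maximum(z, 0.0)

X = rng.standard_normal((d_x, N))
Y = rng.standard_normal((d_y, N))
W1 = rng.standard_normal((6, d_x))
W2 = -np.abs(rng.standard_normal((5, 6)))   # negative weights drive the last hidden pre-activation below zero
W3 = rng.standard_normal((d_y, 5))

def loss(Ws):
    W1_, W2_, W3_ = Ws
    Z = W2_ @ relu(W1_ @ X)                 # last hidden pre-activation, elementwise negative
    return 0.5 * np.linalg.norm(W3_ @ relu(Z) - Y, "fro") ** 2

# Finite-difference directional derivatives in random directions, block by block.
Ws, eps = [W1, W2, W3], 1e-6
for k in range(3):
    V = rng.standard_normal(Ws[k].shape)
    Wp = [W + (eps * V if i == k else 0.0) for i, W in enumerate(Ws)]
    Wm = [W - (eps * V if i == k else 0.0) for i, W in enumerate(Ws)]
    print(k, (loss(Wp) - loss(Wm)) / (2 * eps))    # 0.0 for every block
print(np.linalg.norm(W3 @ relu(W2 @ relu(W1 @ X)) - Y))   # but the misfit is far from zero
```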

Analysis similar to that in Section 2 can be carried out to establish conditions for Hessian rank deficiency at zero misfit minima and the corresponding ill-posedness of the training problem in a neighborhood, as well as analysis that may establish conditions for saddle points and spurious local minima. Due to limited space we do not pursue such analyses, but we expect similar results. Specifically, the last activation function and its derivatives seem to be critical in understanding the characteristics of stationary points, both their existence and Hessian rank deficiency. If the successive layer mappings prior to the last layer map into the zero set of the last activation function and its derivatives, then we believe spurious local minima are possible.

4 Conclusion

For dense nonlinear neural networks, we have derived expressions characterizing the nullspace of the Hessian in the vicinity of global minima. These can be used to design regularization operators that target the specific nature of the ill-posedness of the training problem. When a candidate stationary point is a strict saddle, appropriately-designed optimization algorithms will escape it eventually (how fast they escape depends on how negative the most negative eigenvalue of the Hessian is). The analysis in this paper shows that when the gradient is small, it can be due to an accurate approximation of the mapping from $X$ to $Y$, or it can be due to the orthogonality condition, Equation (27). Spurious local minima can be identified easily, since the misfit will be far from zero. Whether such points are strict saddles or local minima is harder to know specifically, since this can depend on many different factors, such as the zeros of the activation function and its derivatives. Such points can be escaped quickly using Gaussian random noise (JinChiGeEtAl2017). When in the vicinity of a strict saddle point with a negative curvature direction that is large relative to other eigenvalues of the Hessian, randomized methods can be used to identify negative curvature directions and escape the saddle point at a cost of a small number of neural network evaluations (OLearyRoseberryAlgerGhattas2019).

Appendix A Shallow Dense Neural Network Derivations

A.1 Derivation of gradient

Derivatives are taken in vectorized form. In order to simplify notation we use the following:

$\mathrm{misfit} := W_2 \sigma(W_1 X) - Y$ (30)

In numerator layout partial differentials with respect to a vectorized matrix are as follows:

(31)

First we have a Lemma about the derivative of the activation function with respect to the encoder weight matrix.

Lemma 1.

Suppose $\sigma : \mathbb{R} \to \mathbb{R}$ is applied elementwise to the matrix $W_1 X$. Then

$\frac{\partial \operatorname{vec}\left( \sigma(W_1 X) \right)}{\partial \operatorname{vec}(W_1)} = \operatorname{dvec}\left( \sigma'(W_1 X) \right) \left( X^T \otimes I_r \right).$ (32)
Proof.

We use the limit definition of the derivative to derive this result. Let the perturbation of $W_1$ be arbitrary. In the limit as the perturbation tends to zero, we have the following:

(33)

Expanding this term, we have:

(34)

The result follows. ∎

Now we can derive the gradients of the objective function .

Lemma 2.

The gradients of the objective function $F$ are given by

$\nabla_{W_2} F = \left( W_2 \sigma(W_1 X) - Y \right) \sigma(W_1 X)^T$ (35)
$\nabla_{W_1} F = \left[ \left( W_2^T \left( W_2 \sigma(W_1 X) - Y \right) \right) \odot \sigma'(W_1 X) \right] X^T$ (36)
Proof.

We derive in vectorized differential form, from which the matrix form derivatives can be extracted. First, for the derivative with respect to $W_2$, we can derive via the matrix partial differential only with respect to $W_2$:

(37)

Thus it follows that

(38)

The partial derivative is then:

(39)

We have then that the matrix form partial derivative with respect to $W_2$ is:

(40)

For the partial derivative with respect to $W_1$, again we start with the vectorized differential form.

(41)

Applying Lemma 1 we have:

(42)

The partial derivative is then:

(43)

We have then that the matrix form partial derivative with respect to $W_1$ is:

(44)

A.2 Derivation of Hessian

We now derive the four blocks of the Hessian matrix. We proceed again by deriving partial differentials in vectorized form. In numerator layout we have

(45)

The terms involving only first partial derivatives of the misfit constitute the Gauss-Newton portion, which was already derived in Section A.1.

Lemma 3.

Gauss-Newton portions

(46)
(47)
(48)
(49)
Proof.

This result follows from equations (38) and (42). ∎

We proceed by deriving the terms involving second partial derivatives of the misfit, by deriving their action on an arbitrary vector. The matrix $K$ is the commutation (perfect shuffle) matrix satisfying the equality $K \operatorname{vec}(A) = \operatorname{vec}\left( A^T \right)$ for a matrix $A$ of compatible dimensions.

Lemma 4.

Non Gauss-Newton portions

(50)
(51)
(52)