A neural network consists of a sequence of operations (a.k.a. layers), each of which performs a linear transformation of its input, followed by a point-wise activation function, such as a sigmoid function or the rectified linear unit (ReLU) schalkoff1997artificial ; hornik1989multilayer ; lecun2015deep ; mousavi2015deep ; kamilov2016learning ; li2016unsupervised ; borgerding2017amp ; li2018learning .
One crucial property of neural networks is their ability to approximate nonlinear functions. It has been shown that even a shallow neural network (i.e., a network with only one hidden layer) with a point-wise activation function has the universal approximation ability hornik1989multilayer ; cybenko1989approximation . In particular, a shallow network with a sufficient number of activations (a.k.a. neurons) can approximate continuous functions on compact subsets of Euclidean space with any desired accuracy, where the dimension of the space is the dimension of the input data.
However, the universal approximation theory does not guarantee the algorithmic learnability of those parameters which correspond to the linear transformation of the layers. Neural networks may be trained (or learned) in an unsupervised manner, a semi-supervised manner, or a supervised manner, which is by far the most common scenario. With supervised learning, the neural networks are trained by minimizing a loss function in terms of the parameters to be optimized and the training examples that consist of both input objects and the corresponding outputs. A popular approach for optimizing or tuning the parameters is gradient descent, with the backpropagation method efficiently computing the gradient werbos1974beyond .
Although gradient descent and its variants work surprisingly well for training neural networks in practice, it remains an active research area to fully understand the theoretical underpinnings of this phenomenon. In general, the training optimization problems are nonconvex and it has been shown that even training a simple neural network is NP-complete in general blum1989training . There is a large and rapidly increasing literature on the optimization theory of neural networks, surveying all of which is well outside our scope. Thus, we only briefly survey the works most relevant to ours.
In seeking to better understand the optimization problems in training neural networks, one line of research attempts to analyze their geometric landscape. The geometric landscape of an objective function relates to questions concerning the existence of spurious local minima and the existence of negative eigenvalues of the Hessian at saddle points. If the corresponding problem has no spurious local minima and all the saddle points are strict (i.e., the Hessian at any saddle point has a negative eigenvalue), then a number of local search algorithms lee2016gradient ; ge2015escaping ; jin2017escape ; lee2017first are guaranteed to find globally minimal solutions. Baldi and Hornik baldi1989neural showed that there are no spurious local minima in training shallow linear neural networks but did not address the geometric landscape around saddle points. Kawaguchi kawaguchi2016deep further extended the analysis in baldi1989neural and showed that the loss function for training a general linear neural network has no spurious local minima and satisfies the strict saddle property (see Definition 3 in Section 2) for shallow neural networks under certain conditions. Kawaguchi also proved that for general deeper networks, there exist saddle points at which the Hessian is positive semi-definite (PSD), i.e., does not have any negative eigenvalues.
With respect to nonlinear neural networks, it was shown that there are no spurious local minima for a network with one ReLU node tian2017analytical ; du2017convolutional . However, it has also been proved that there do exist spurious local minima in the population loss of shallow neural networks with even a small number (greater than one) of ReLU activation functions safran2017spurious . Fortunately, the number of spurious local minima can be significantly reduced with an over-parameterization scheme safran2017spurious . Soudry and Hoffer soudry2017exponentially proved that the number of sub-optimal local minima is negligible compared to the volume of global minima for multilayer neural networks when the number of training samples goes to infinity and the number of parameters is close to . Haeffele and Vidal haeffele2015global provided sufficient conditions to guarantee that certain local minima (having an all-zero slice) are also global minima. The training loss of multilayer neural networks at differentiable local minima was examined in soudry2016no . Yun et al. yun2017global very recently provided sufficient and necessary conditions to guarantee that certain critical points are also global minima.
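Table 1: Results on the geometric landscape of the loss function in training shallow linear neural networks.

| Result | Regularizer | Condition | No spurious local minima | Strict saddle property |
|---|---|---|---|---|
| (baldi1989neural, Fact 4) | no | | ✓ | ? |
| (kawaguchi2016deep, Theorem 2.3) | no | input and output data are of full row rank, together with the distinct-eigenvalue and size conditions of Theorem 4 | ✓ | ✓ |
| (lu2017depth, Theorem 2.1) | no | input and output data are of full row rank | ✓ | ? |
| (laurent2017deep, Theorem 1) | no | hidden layer narrower than the input and output layers | ✓ | ? |
| (nouiehed2018learning, Theorem 8) | no | no | ✓ | partial (degenerate critical points only) |
| (zhu2017GlobalOptimality, Theorem 3) | yes, see (6) | hidden layer narrower than the input and output layers, and conditions (4) and (5) | ✓ | ✓ |
| Theorem 2 | yes, see (2) | input data is of full row rank | ✓ | ✓ |
| Theorem 3 | no | input data is of full row rank | ✓ | ✓ |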
A second line of research attempts to understand the reason that local search algorithms efficiently find a local minimum. Aside from standard Newton-like methods such as cubic regularization nesterov2006cubic and the trust region algorithm byrd2000trust , recent work lee2016gradient ; ge2015escaping ; jin2017escape ; lee2017first has shown that first-order methods also efficiently avoid strict saddles. It has been shown in lee2016gradient ; lee2017first that a set of first-order local search techniques (such as gradient descent) with random initialization almost surely avoid strict saddles. Noisy gradient descent ge2015escaping and a variant called perturbed gradient descent jin2017escape have been proven to efficiently avoid strict saddles from any initialization. Other types of algorithms utilizing second-order (Hessian) information agarwal2016finding ; carmon2016accelerated ; curtis2017exploiting can also efficiently find approximate local minima.
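To make the role of perturbations concrete, the following is a minimal Python sketch on a toy two-variable function chosen purely for illustration; it does not appear in the works cited above. The function x^2 + (y^2 - 1)^2 / 4 has a strict saddle at the origin and global minima at (0, 1) and (0, -1). Plain gradient descent started on the x-axis never leaves that axis and converges to the saddle, whereas adding a small amount of Gaussian noise at every step lets the iterates escape. The noise injection here is a simplification, not the exact procedure of ge2015escaping or jin2017escape, and the step size, noise level, and function names are assumptions.

```python
import numpy as np

def grad(p):
    # Gradient of the toy function f(x, y) = x**2 + (y**2 - 1)**2 / 4.
    x, y = p
    return np.array([2.0 * x, y * (y**2 - 1.0)])

def descend(p0, step=0.1, iters=500, noise=0.0, seed=0):
    # Gradient descent; if noise > 0, add isotropic Gaussian noise after
    # every step (a simplified stand-in for noisy/perturbed gradient descent).
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        p = p - step * grad(p)
        if noise > 0.0:
            p = p + noise * rng.standard_normal(2)
    return p

# Start exactly on the x-axis, which is the stable manifold of the saddle.
plain = descend([1.0, 0.0])              # no noise
noisy = descend([1.0, 0.0], noise=1e-3)  # small noise at every step

print("plain:", plain)
print("noisy:", noisy)
```

In this sketch `plain` ends at the saddle (0, 0), while `noisy` ends close to one of the two minima; which of the two it reaches depends on the random seed.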
To guarantee that gradient descent type algorithms (which are widely adopted in training neural networks) converge to the global solution, the behavior of the saddle points of the objective functions in training neural networks is as important as the behavior of local minima. (From an optimization perspective, non-strict saddle points and local minima have similar first-/second-order information, and it is hard for first-/second-order methods such as gradient descent to distinguish between them.) However, the former has rarely been investigated compared to the latter, even for shallow linear networks. It has been shown in baldi1989neural ; kawaguchi2016deep ; lu2017depth ; laurent2017deep that the objective function in training shallow linear networks has no spurious local minima under certain conditions. The behavior of saddle points is considered in kawaguchi2016deep , where the strict saddle property is proved for the case where both the input objects and the corresponding outputs of the training samples have full row rank, has distinct eigenvalues, and . While the assumption on can be easily satisfied, the assumption involving implicitly adds constraints on the true weights. Consider a simple case where , with and the underlying weights to be learned. Then the full-rank assumption on at least requires and . Recently, the strict saddle property was also shown to hold without the above conditions on , , , and , but only for degenerate critical points, specifically those points where (nouiehed2018learning, Theorem 8).
In this paper we analyze the optimization geometry of the loss function in training shallow linear neural networks. In doing so, we first characterize the behavior of all critical points of the corresponding optimization problems with an additional regularizer (see (2)), but without requiring the conditions used in kawaguchi2016deep except the one on the input data . In particular, we examine the loss function for training a shallow linear neural network with an additional regularizer and show that it has no spurious local minima and obeys the strict saddle property if the input has full row rank. This benign geometry ensures that a number of local search algorithms, including gradient descent, converge to a global minimum when training a shallow linear neural network with the proposed regularizer. We note that the additional regularizer (in (2)) is utilized to shrink the set of critical points and has no effect on the global minimum of the original problem. We also observe from experiments that this additional regularizer speeds up the convergence of iterative algorithms in certain cases. Building on our study of the regularized problem and on (nouiehed2018learning, Theorem 8), we then show that these benign geometric properties are preserved even without the additional regularizer under the same assumption on the input data. Table 1 summarizes our main result and those of related works on characterizing the geometric landscape of the loss function in training shallow linear neural networks.
Outside of the context of neural networks, such geometric analysis (characterizing the behavior of all critical points) has been recognized as a powerful tool for understanding nonconvex optimization problems in applications such as phase retrieval sun2016geometric ; qu2017convolutional , dictionary learning sun2015complete , tensor factorization ge2015escaping , phase synchronization liu2016estimation , and low-rank matrix optimization bhojanapalli2016lowrankrecoveryl ; park2016non ; li2017geometry ; ge2016matrix ; li2016symmetry ; zhu2017global ; zhu2017GlobalOptimality ; ge2017no . A similar regularizer (see (6)) to the one used in (2) is also utilized in park2016non ; li2017geometry ; zhu2017global ; zhu2017GlobalOptimality ; ge2017no for analyzing the optimization geometry.
We use the symbols andorthonormal matrices by . If a function has two arguments, and , then we occasionally use the notation where we stack these two matrices into a larger one via . For a scalar function with a matrix variable , its gradient is a matrix whose -th entry is for all . Here for any and is the -th entry of the matrix . Throughout the paper, the Hessian of is represented by a bilinear form defined via for any . Finally, we use to denote the smallest eigenvalue of a matrix.
2.2 Strict saddle property
Suppose is a twice continuously differentiable objective function. The notions of critical points, strict saddles, and the strict saddle property are formally defined as follows.
Definition 1 (Critical points)
is called a critical point of if the gradient at vanishes, i.e., .
Definition 2 (Strict saddles ge2015escaping )
We say a critical point is a strict saddle if the Hessian evaluated at this point has at least one strictly negative eigenvalue, i.e., .
In words, for a strict saddle, also called a ridable saddle sun2015complete , its Hessian has at least one negative eigenvalue which implies that there is a directional negative curvature that algorithms can utilize to further decrease the objective value. This property ensures that many local search algorithms can escape strict saddles by either directly exploiting the negative curvature curtis2017exploiting or adding noise which serves as a surrogate of the negative curvature ge2015escaping ; jin2017escape . On the other hand, when a saddle point has a Hessian that is positive semidefinite (PSD), it is difficult for first- and second-order methods to avoid converging to such a point. In other words, local search algorithms require exploiting higher-order (at least third-order) information in order to escape from a critical point that is neither a local minimum nor a strict saddle. We note that any local maxima are, by definition, strict saddles.
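As a small illustration of the distinction drawn above, the following Python sketch classifies a critical point from the smallest eigenvalue of its Hessian. The four toy functions in the comments and the name `classify_critical_point` are illustrative assumptions, not taken from the paper; each Hessian is written out by hand at the origin, which is a critical point of every example.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    # Second-order test at a critical point, based on the smallest eigenvalue.
    lam_min = np.linalg.eigvalsh(hessian)[0]  # eigenvalues come back in ascending order
    if lam_min < -tol:
        return "strict saddle"         # a direction of negative curvature exists
    if lam_min > tol:
        return "strict local minimum"  # Hessian is positive definite
    return "inconclusive"              # PSD with a zero eigenvalue

# Hessians at the origin of four toy functions.
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # f = x**2 - y**2
H_max = np.array([[-2.0, 0.0], [0.0, -2.0]])     # f = -(x**2 + y**2)
H_min = np.array([[2.0, 0.0], [0.0, 2.0]])       # f = x**2 + y**2
H_flat = np.array([[2.0, 0.0], [0.0, 0.0]])      # f = x**2 + y**3

labels = [classify_critical_point(H) for H in (H_saddle, H_max, H_min, H_flat)]
print(labels)
```

The local maximum is reported as a strict saddle, consistent with the remark above. For x^2 + y^3 the origin is a saddle whose Hessian is PSD, so second-order information alone cannot tell it apart from a minimum such as the origin of x^2 + y^4, which has the same Hessian.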
The following strict saddle property defines a set of nonconvex functions that can be efficiently minimized by a number of iterative algorithms with guaranteed convergence.
Definition 3 (Strict saddle property ge2015escaping )
A twice differentiable function satisfies the strict saddle property if each critical point either corresponds to a local minimum or is a strict saddle.
Intuitively, the strict saddle property requires a function to have a negative curvature direction—which can be exploited by a number of iterative algorithms such as noisy gradient descent ge2015escaping and the trust region method conn2000trust to further decrease the function value—at all critical points except for local minima.
Theorem 1 (nesterov2006cubic ; sun2015complete ; ge2015escaping ; lee2016gradient )
For a twice continuously differentiable objective function satisfying the strict saddle property, a number of iterative optimization algorithms can find a local minimum. In particular, for such functions,
Theorem 1 ensures that many local search algorithms can be utilized to find a local minimum for strict saddle functions (i.e., ones obeying the strict saddle property). This is the main reason that significant effort has been devoted to establishing the strict saddle property for different problems kawaguchi2016deep ; sun2016geometric ; qu2017convolutional ; park2016non ; li2017geometry ; zhu2017global .
In our analysis, we further characterize local minima as follows.
Definition 4 (Spurious local minima)
We say a critical point is a spurious local minimum if it is a local minimum but not a global minimum.
In other words, we separate the set of local minima into two categories: the global minima and the spurious local minima which are not global minima. Note that most local search algorithms are only guaranteed to find a local minimum, which is not necessarily a global one. Thus, to ensure the local search algorithms listed in Theorem 1 find a global minimum, in addition to the strict saddle property, the objective function is also required to have no spurious local minima.
In summary, the geometric landscape of an objective function relates to questions concerning the existence of spurious local minima and the strict saddle property. In particular, if the function has no spurious local minima and obeys the strict saddle property, then a number of iterative algorithms such as the ones listed in Theorem 1 converge to a global minimum. Our goal in the next section is to show that the objective function in training a shallow linear network with a regularizer satisfies these conditions.
3 Global Optimality in Shallow Linear Networks
In this paper, we consider the following optimization problem concerning the training of a shallow linear network:
where and are the input and output training examples, and and are the model parameters (or weights) corresponding to the first and second layers, respectively. Throughout, we call , , and the sizes of the input layer, hidden layer, and output layer, respectively. The goal of training a neural network is to optimize the parameters and such that the output matches the desired output .
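The setup can be sketched numerically as follows. This is a minimal example under stated assumptions rather than the paper's own experiment: the displayed form of (1) is not reproduced in the text, so the squared Frobenius-norm loss used here is an assumption, as are the names `W1`, `W2`, `X`, `Y`, the layer sizes, the step size, and the initialization scale. The input matrix is drawn at random with more samples than input dimensions, so it has full row rank with probability one.

```python
import numpy as np

def loss(W1, W2, X, Y):
    # Assumed loss: half the squared Frobenius norm of the output mismatch.
    R = W2 @ W1 @ X - Y
    return 0.5 * np.sum(R**2)

def gradients(W1, W2, X, Y):
    # Gradients of the assumed loss with respect to W1 and W2.
    R = W2 @ W1 @ X - Y
    G1 = W2.T @ R @ X.T     # same shape as W1
    G2 = R @ (W1 @ X).T     # same shape as W2
    return G1, G2

rng = np.random.default_rng(0)
d0, d1, d2, N = 4, 3, 2, 50  # input, hidden, output sizes; number of samples

X = rng.standard_normal((d0, N)) / np.sqrt(N)
# Noiseless targets produced by hypothetical ground-truth weights.
Y = rng.standard_normal((d2, d1)) @ rng.standard_normal((d1, d0)) @ X

# Small random initialization followed by plain gradient descent.
W1 = 0.1 * rng.standard_normal((d1, d0))
W2 = 0.1 * rng.standard_normal((d2, d1))
initial_loss = loss(W1, W2, X, Y)
for _ in range(5000):
    G1, G2 = gradients(W1, W2, X, Y)
    W1, W2 = W1 - 0.02 * G1, W2 - 0.02 * G2
final_loss = loss(W1, W2, X, Y)

print(initial_loss, final_loss)
```

On this noiseless instance the loss should end well below its initial value. The sketch only illustrates the objective and its gradients; it says nothing about the landscape results themselves.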
Instead of proposing new algorithms to minimize the objective function in (1), we are interested in characterizing its geometric landscape by understanding the behavior of all of its critical points.
3.1 Main results
We present our main theorems concerning the behavior of all of the critical points of problem (1). First, the following result shows that the objective function of (1) with an additional regularizer (see (2)) has no spurious local minima and obeys the strict saddle property without requiring any of the following conditions that appear in certain works discussed in Section 3.2: that is of full row rank, that , that has distinct eigenvalues, that , that (4) holds, or that (5) holds.
Theorem 2
Assume that is of full row rank. Then for any , the following objective function
obeys the following properties:
has the same global minimum value as in (1);
any critical point of is also a critical point of ;
has no spurious local minima and the Hessian at any saddle point has a strictly negative eigenvalue.
The first property in Theorem 2 states that the regularizer in (3) has no effect on the global minimum of the original problem, i.e., the one without this regularizer. Moreover, as established in the second property of Theorem 2, any critical point of in (2) is also a critical point of in (1), but the converse is not true. With the regularizer , which mostly plays the role of shrinking the set of critical points, we prove that has no spurious local minima and obeys the strict saddle property.
As our results hold for any and when , one may conjecture that these properties also hold for the original objective function under the same assumptions, i.e., assuming only that has full row rank. This is indeed true and is formally established in the following result.
Theorem 3
Assume that is of full row rank. Then, the objective function appearing in (1) has no spurious local minima and obeys the strict saddle property.
The proof of Theorem 3 is given in Section 4.2. Theorem 3 builds heavily on Theorem 2 and on (nouiehed2018learning, Theorem 8), which is also presented in Theorem 5. Specifically, as we have noted, (nouiehed2018learning, Theorem 8) characterizes the behavior of degenerate critical points. Using Theorem 2, we further prove that any non-degenerate critical point of is either a global minimum or a strict saddle.
3.2 Connection to previous work on shallow linear neural networks
As summarized in Table 1, the results in baldi1989neural ; lu2017depth ; laurent2017deep on characterizing the geometric landscape of the loss function in training shallow linear neural networks only consider the behavior of local minima, but not saddle points. The strict saddle property is proved only in kawaguchi2016deep and partly in nouiehed2018learning . We first review the result in kawaguchi2016deep concerning the optimization geometry of problem (1).
Theorem 4 implies that the objective function in (1) has benign geometric properties if and the training samples are such that and are of full row rank and has distinct eigenvalues. The recent work lu2017depth generalizes the first point of Theorem 4 (i.e., no spurious local minima) by getting rid of the assumption that has distinct eigenvalues. However, the geometry of the saddle points is not characterized in lu2017depth . In laurent2017deep , the authors also show that the condition on is not necessary. In particular, when applied to (1), the result in laurent2017deep implies that the objective function in (1) has no spurious local minima when . This condition requires that the hidden layer is narrower than the input and output layers. Again, the optimization geometry around saddle points is not discussed in laurent2017deep .
We now review the more recent result in (nouiehed2018learning, Theorem 8).
In cases where the global minimum of is non-degenerate (for example, when for some and such that is non-degenerate), Theorem 5 implies that all degenerate critical points are strict saddles. However, we note that the behavior of non-degenerate critical points in these cases is more important from the algorithmic point of view, since one can always check the rank of a convergent point and perturb it if it is degenerate, but this is not possible at non-degenerate convergent points. Our Theorem 3 generalizes Theorem 5 to ensure that every critical point that is not a global minimum is a strict saddle, regardless of its rank.
Next, as a direct consequence of (zhu2017GlobalOptimality, Theorem 3), the following result also establishes certain conditions under which the objective function in (1) with an additional regularizer (see (6)) has no spurious local minima and obeys the strict saddle property.
Corollary 1 (zhu2017GlobalOptimality, Theorem 3)
Suppose . Furthermore, for any matrix with , suppose the following holds
for some positive and such that . Furthermore, suppose admits a solution which satisfies
Then for any , the following objective function
has no spurious local minima and the Hessian at any saddle point has a strictly negative eigenvalue with
where is the rank of , represents the smallest eigenvalue, and denotes the -th largest singular value.
Corollary 1, following from (zhu2017GlobalOptimality, Theorem 3), utilizes a regularizer which balances the energy between and and has the effect of shrinking the set of critical points. This allows one to show that each critical point is either a global minimum or a strict saddle. Similar to the first property in Theorem 2, this regularizer also has no effect on the global minimum of the original problem (1).
As we explained before, Theorem 4 implicitly requires that and . On the other hand, Corollary 1 requires and (4). When , the hidden layer is narrower than the input and output layers. Note that (4) has nothing to do with the underlying network parameters and , but requires the training data matrix to act as an isometry operator for rank- matrices. To see this, we rewrite
which is a sum of the rank-one measurements of .
Unlike Theorem 4, which requires that is of full rank and , and unlike Corollary 1, which requires (4) and , Theorem 2 and Theorem 3 only necessitate that is full rank and have no condition on the size of , , and . As we explained before, suppose is generated as , where and are the underlying weights to be recovered. Then the full-rank assumption of at least requires and . In other words, Theorem 4 necessitates that the hidden layer is wider than the output, while Corollary 1 works for networks where the hidden layer is narrower than the input and output layers. On the other hand, Theorem 2 and Theorem 3 allow for the hidden layer of the network to be either narrower or wider than the input and the output layers.
Finally, consider a three-layer network with . In this case, (1) reduces to a matrix factorization problem where and the regularizer in (2) is the same as the one in (6). Theorem 4 requires that is of full row rank and has distinct singular values. For the matrix factorization problem, we know from Corollary 1 that for any , has benign geometry (i.e., no spurious local minima and the strict saddle property) as long as . As a direct consequence of Theorem 2 and Theorem 3, this benign geometry is also preserved even when or for matrix factorization via minimizing
where (note that one can get rid of the regularizer by setting ).
4 Proof of Main Results
4.1 Proof of Theorem 2
Suppose the row rank of is . Let
be a reduced SVD of , where is a diagonal matrix with positive diagonals. Then,
where and . Denote by a global minimum of :
Let be a reduced SVD of , where is a diagonal matrix with positive diagonals. Let and . It follows that
which implies that and
since . This further indicates that and have the same global optimum (since for any ).
In the rest of the proof of Theorem 2 we characterize the behavior of all the critical points of the objective function in (2). In particular, we show that any critical point of is also a critical point of , and if it is not a global minimum of (2), then it is a strict saddle, i.e., its Hessian has at least one negative eigenvalue.
To that end, we first establish the following result that characterizes all the critical points of .
Plugging (9) into the above equations gives
for any critical point of . This further implies that if is a critical point of , then it must also be a critical point of of since and both , so that
This proves the second property in Theorem 2.
To further classify the critical points into categories such as local minima and saddle points, for any, we compute the objective value at this point as
where is a reduced SVD of as defined in (8), and and are defined in (10). Noting that is a constant in terms of the variables and , we conclude that is a global minimum of if and only if is a global minimum of
Any is a strict saddle of satisfying:
if , then
if , then
where is the largest singular value of that is strictly smaller than .
4.2 Proof of Theorem 3
Our goal is to characterize the behavior of all critical points that are not global minima. In particular, we want to show that every critical point of is either a global minimum or a strict saddle.
Let be any critical point in . According to Theorem 5, when is degenerate (i.e., ), then must be either a global minimum or a strict saddle. We now assume the other case that is non-degenerate.
Let be a reduced SVD of , where is a diagonal and square matrix with positive singular values, and and are orthonormal matrices of proper dimension. We now construct
The above constructed pair satisfies
Note that here (resp. ) have different numbers of columns (resp. rows) than (resp. ). We denote by
Since , we have and . It follows that
Similarly, we have
which together with the above equation implies that is also a critical point of . Due to (17) which states that also satisfies (9), it follows from the same arguments used in Lemma 1 and Lemma 2 that is either a global minimum or a strict saddle of . Moreover, since has the same rank as which is assumed to be non-degenerate, we have that is a global minimum of if and only if
where the minimum of the right hand side is also achieved by the global minimum of according to (13). Therefore, if is a global minimum of , then is also a global minimum of .
Now we consider the other case when is not a global minimum of , i.e., it is a strict saddle. In this case, there exists such that
We have considered the optimization landscape of the objective function in training shallow linear networks. In particular, we proved that the corresponding optimization problems under a very mild condition have a simple landscape: there are no spurious local minima and any critical point is either a local (and thus also global) minimum or a strict saddle such that the Hessian evaluated at this point has a strictly negative eigenvalue. These properties guarantee that a number of iterative optimization algorithms (especially gradient descent, which is widely used in training neural networks) converge to a global minimum from either a random initialization or an arbitrary initialization depending on the specific algorithm used. It would be of interest to prove similar geometric properties for the training problem without the mild condition on the row rank of .
Appendix A Proof of Lemma 1
We first prove the direction . Any critical point of satisfies , i.e.,