Optimizing non-convex functions has become the standard algorithmic technique in modern machine learning and artificial intelligence. It is increasingly important to understand the working of the existing heuristics for optimizing non-convex functions, so that we can design more efficient optimizers with guarantees. The worst-case intractability result says that finding a global minimizer of a non-convex optimization problem — or even just a degree-4 polynomial — is NP-hard. Therefore, theoretical analysis with global guarantees has to depend on the special properties of the target functions that we optimize. To characterize the properties of the real-world objective functions, researchers have hypothesized that many objective functions for machine learning problems have the property that
all or most local minima are approximately global minima. (1.1)
Optimizers based on local derivatives can solve this family of functions in polynomial time (under some additional technical assumptions that will be discussed below). Empirical evidence also suggests that practical objective functions from machine learning and deep learning may have such a property. In this chapter, we formally state the algorithmic result that local methods can solve objectives with property (1.1) in Section 2, and then rigorously prove that this property holds for a few objectives arising from several key machine learning problems: generalized linear models (Section 3), principal component analysis (Section 4.1), matrix completion (Section 4.2), and tensor decompositions (Section 5). We will also briefly touch on recent works on neural networks (Section 6).
2 Analysis Technique: Characterization of the Landscape
In this section, we will show that a technical and stronger version of the property (1.1) implies that many optimizers can converge to a global minimum of the objective function.
2.1 Convergence to a local minimum
We consider an objective function $f(\cdot)$, which is assumed to be twice-differentiable from $\mathbb{R}^d$ to $\mathbb{R}$. Recall that $x$ is a local minimum of $f(\cdot)$ if there exists an open neighborhood $N$ of $x$ in which the function value is at least $f(x)$: $\forall z \in N$, $f(z) \ge f(x)$. A point $x$ is a stationary point if it satisfies $\nabla f(x) = 0$. A saddle point is a stationary point that is not a local minimum or maximum. We use $\nabla f(x)$ to denote the gradient of the function, and $\nabla^2 f(x)$ to denote the Hessian of the function ($\nabla^2 f(x)$ is a $d \times d$ matrix where $[\nabla^2 f(x)]_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$). A local minimum $x$ must satisfy the first order necessary condition for optimality, that is, $\nabla f(x) = 0$, and the second order necessary condition for optimality, that is, $\nabla^2 f(x) \succeq 0$. (Here $A \succeq 0$ denotes that $A$ is a positive semi-definite matrix.) Thus, a local minimum is a stationary point, and so is a global minimum.
However, $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$ together are not a sufficient condition for $x$ being a local minimum. For example, the origin is not a local minimum of the function $f(x_1, x_2) = x_1^2 + x_2^3$ even though $\nabla f(0) = 0$ and $\nabla^2 f(0) \succeq 0$. Generally speaking, along those directions $v$ where the Hessian vanishes (that is, $v^\top \nabla^2 f(x) v = 0$), the higher-order derivatives start to matter to the local optimality. In fact, finding a local minimum of a function is NP-hard (Hillar and Lim, 2013).
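To make the gap between the necessary conditions and local optimality concrete, here is a minimal numerical sketch. It uses the function $f(x_1, x_2) = x_1^2 + x_2^3$ as an illustrative choice; the helper names `f`, `grad_f`, and `hess_f` are hypothetical and introduced only for this example.

```python
import numpy as np

def f(x):
    # Illustrative function (an assumption for this sketch): f(x1, x2) = x1^2 + x2^3
    return x[0] ** 2 + x[1] ** 3

def grad_f(x):
    return np.array([2.0 * x[0], 3.0 * x[1] ** 2])

def hess_f(x):
    return np.array([[2.0, 0.0], [0.0, 6.0 * x[1]]])

origin = np.zeros(2)
grad_at_origin = grad_f(origin)                  # first-order condition: gradient is zero
hess_eigs = np.linalg.eigvalsh(hess_f(origin))   # second-order condition: eigenvalues >= 0

# Step a small distance along the direction in which the Hessian vanishes.
t = 1e-2
nearby = np.array([0.0, -t])
decrease = f(origin) - f(nearby)                 # positive: the origin is not a local minimum

print(grad_at_origin, hess_eigs, decrease)
```

Both necessary conditions hold at the origin, yet the function value strictly decreases along the negative $x_2$ direction, where only the third-order term is active.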
Fortunately, with the following strict-saddle assumption, we can efficiently find a local minimum of the function $f$. A strict-saddle function satisfies that every saddle point must have a strictly negative curvature in some direction. It assumes away the difficult situation in the example above where higher-order derivatives are needed to decide if a point is a local minimum.
For $\alpha, \beta, \gamma \ge 0$, we say $f$ is $(\alpha, \beta, \gamma)$-strict saddle if every $x \in \mathbb{R}^d$ satisfies at least one of the following three conditions:
1. $\|\nabla f(x)\|_2 \ge \alpha$.
2. $\lambda_{\min}(\nabla^2 f(x)) \le -\beta$.
3. There exists a local minimum $x^\star$ that is $\gamma$-close to $x$ in Euclidean distance.
This condition is conjectured to hold for many real-world functions, and will be proved to hold for various problems concretely. However, in general, verifying it mathematically or empirically may be difficult. Under this condition, many algorithms can converge to a local minimum of $f$ in polynomial time as stated below. (Note that in this chapter, we only require a polynomial time algorithm to be polynomial in $1/\varepsilon$ when $\varepsilon$ is the error. This makes sense for the downstream machine learning applications because very high accuracy solutions are not necessary due to intrinsic statistical errors.)
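Suppose $f$ is a twice differentiable $(\alpha, \beta, \gamma)$-strict saddle function from $\mathbb{R}^d$ to $\mathbb{R}$. Then, various optimization algorithms (such as stochastic gradient descent) can converge to a local minimum with $\varepsilon$ error in Euclidean distance in time $\mathrm{poly}(d, 1/\alpha, 1/\beta, 1/\gamma, 1/\varepsilon)$.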
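The role of negative curvature can be illustrated with a small sketch. It is not one of the algorithms covered by the statement above, only a toy perturbed gradient descent on the hypothetical function $(x^2 - 1)^2 + y^2$, whose only saddle point (the origin) has a strictly negative curvature direction. The step size, perturbation radius, and function names are assumptions made for this example.

```python
import numpy as np

def grad(p):
    # Gradient of the toy objective (x^2 - 1)^2 + y^2.
    x, y = p
    return np.array([4.0 * x * (x ** 2 - 1.0), 2.0 * y])

def hess(p):
    x, _ = p
    return np.array([[12.0 * x ** 2 - 4.0, 0.0], [0.0, 2.0]])

def descend(p0, step=0.05, iters=500, perturb=False, radius=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        g = grad(p)
        # A stationary point with a negative Hessian eigenvalue is a strict saddle.
        at_saddle = np.linalg.norm(g) < 1e-8 and np.linalg.eigvalsh(hess(p)).min() < 0
        if perturb and at_saddle:
            p = p + radius * rng.standard_normal(2)   # small random kick
            g = grad(p)
        p = p - step * g
    return p

plain = descend([0.0, 0.0])                  # stays at the saddle point
escaped = descend([0.0, 0.0], perturb=True)  # reaches one of the minima (+1, 0) or (-1, 0)
print(plain, escaped)
```

Plain gradient descent initialized exactly at the saddle never moves, while a single small perturbation lets the negative curvature direction carry the iterate to a local (here also global) minimum.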
2.2 Local optimality vs global optimality
If a function satisfies the property that “all local minima are global” and the strict saddle property, we can provably find one of its global minima. (See Figure 1 for an example of functions with this property. )
Suppose satisfies “all local minima are global” and the strict saddle property in a sense that all points satisfying approximately the necessary first order and second order optimality condition should be close to a global minimum:
there exist and a universal constant such that if a point satisfies and , then is -close to a global minimum of .
Then, many optimization algorithms (including stochastic gradient descent and cubic regularization) can find a global minimum of up to error in norm in domain in time .
The technical condition of the theorem is often succinctly referred to as “all local minima are global”, but its precise form, which is a combination of “all local minima are global” and the strict saddle condition, is crucial. There are functions that satisfy “all local minima are global” but cannot be optimized efficiently. Ignoring the strict saddle condition may lead to misleadingly strong statements.
The condition of Theorem 2.3 can be replaced by stronger ones which may occasionally be easier to verify, if they are indeed true for the functions of interest. One such condition is that “any stationary point is a global minimum.” Under a condition of this type, gradient descent is known to converge to a global minimum linearly, as stated below. However, because this condition effectively rules out the existence of multiple disconnected local minima, it can’t hold for many objective functions related to neural networks, which are guaranteed to have multiple local minima and stationary points due to a certain symmetry.
Suppose a function has -Lipschitz continuous gradients and satisfies the Polyak-Lojasiewicz condition: and such that for every ,
Then, the errors of the gradient descent with step size less than decays geometrically.
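As a rough numerical illustration of geometric decay, the sketch below runs gradient descent on the non-convex one-dimensional function $x^2 + 3\sin^2(x)$. The assumptions here are mine: the Polyak-Lojasiewicz condition is taken in the common normalization $\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu (f(x) - f(x^\star))$, which may differ by constants from the statement above; the value $\mu = 1/32$ is a constant often quoted for this function and is not verified here; and the step size $1/L$ is chosen for convenience.

```python
import numpy as np

def f(x):
    # Non-convex example: f'' = 2 + 6 cos(2x) takes negative values.
    return x ** 2 + 3.0 * np.sin(x) ** 2

def grad_f(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

L = 8.0           # Lipschitz constant of the gradient, since |f''| <= 8
mu = 1.0 / 32.0   # assumed PL constant (1/2 normalization)

x = 3.0
errors = [f(x)]   # the global minimum value is f(0) = 0
for _ in range(200):
    x = x - grad_f(x) / L
    errors.append(f(x))
errors = np.array(errors)

# Geometric envelope implied by smoothness plus the PL inequality.
bound = errors[0] * (1.0 - mu / L) ** np.arange(len(errors))
print(errors[:5], bound[:5])
```

The observed errors sit below the geometric envelope; in this example the actual convergence is much faster than the worst-case rate.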
It can be challenging to verify the Polyak-Lojasiewicz condition because the quantity $\|\nabla f(x)\|_2^2$ is often a complex function of $x$. An easier-to-verify but stronger condition is quasi-convexity. Intuitively speaking, quasi-convexity says that at any point $x$ the gradient should be negatively correlated with the direction $x^\star - x$ pointing towards the optimum.
Definition 2.5 (Weak quasi-convexity).
We say an objective function $f$ is $\tau$-weakly-quasi-convex over a domain $\mathcal{B}$ with respect to the global minimum $x^\star$ if there is a positive constant $\tau > 0$ such that for all $x \in \mathcal{B}$,
$\nabla f(x)^\top (x - x^\star) \ge \tau\,(f(x) - f(x^\star))$. (2.2)
The following one is another related condition, which is sometimes referred to as the restricted secant inequality (RSI):
$\nabla f(x)^\top (x - x^\star) \ge \tau\,\|x - x^\star\|_2^2$. (2.3)
We note that convex functions satisfy (2.2) with $\tau = 1$. Condition (2.3) is stronger than (2.2) because for smooth functions, we have $\|x - x^\star\|_2^2 \ge L\,(f(x) - f(x^\star))$ for some constant $L$. (Readers who are familiar with convex optimization may realize that condition (2.3) is an extension of strong convexity.) Conditions (2.1), (2.2), and (2.3) all imply that all stationary points are global minima because $\nabla f(x) = 0$ implies that $f(x) = f(x^\star)$ or $x = x^\star$.
2.3 Landscape for manifold-constrained optimization
We can extend many of the results in the previous section to the setting of constrained optimization over a smooth manifold. This section is only useful for problems in Section 5 and casual readers can feel free to skip it.
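Let $\mathcal{M}$ be a Riemannian manifold. Let $T_x\mathcal{M}$ be the tangent space to $\mathcal{M}$ at $x$, and let $P_x$ be the projection operator to the tangent space $T_x\mathcal{M}$. Let $\operatorname{grad} f(x)$ be the gradient of $f$ at $x$ on $\mathcal{M}$ and $\operatorname{hess} f(x)$ be the Riemannian Hessian. Note that $\operatorname{hess} f(x)$ is a linear mapping from $T_x\mathcal{M}$ onto itself.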
Theorem 2.6 (Informally stated).
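Background on manifold gradient and Hessian. Later in Section 5, the unit sphere in $d$-dimensional space will be our constraint set, that is, $\mathcal{M} = S^{d-1}$. We provide some further background on how to compute the manifold gradient and Hessian here. We view $f$ as the restriction of a smooth function $\bar f$ to the manifold $\mathcal{M}$. In this case, we have $T_x\mathcal{M} = \{z \in \mathbb{R}^d : z^\top x = 0\}$, and $P_x = I - xx^\top$. We derive the manifold gradient of $f$ on $\mathcal{M}$: $\operatorname{grad} f(x) = P_x \nabla \bar f(x)$, where $\nabla \bar f(x)$ is the usual gradient in the ambient space $\mathbb{R}^d$. Moreover, we derive the Riemannian Hessian as $\operatorname{hess} f(x) = P_x \nabla^2 \bar f(x) P_x - (x^\top \nabla \bar f(x)) P_x$.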
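The sphere case can be sketched in a few lines. The code below assumes the standard textbook formulas for the unit sphere (project the ambient gradient with $I - xx^\top$; project the ambient Hessian and subtract a curvature correction); the function names and the quadratic test function $\tfrac{1}{2} x^\top A x$ are hypothetical choices for illustration.

```python
import numpy as np

def riemannian_grad(x, egrad):
    # Project the ambient gradient onto the tangent space {z : z^T x = 0}.
    return egrad - x * (x @ egrad)

def riemannian_hess(x, egrad, ehess):
    # P * ehess * P minus the curvature correction (x^T egrad) * P.
    d = x.shape[0]
    P = np.eye(d) - np.outer(x, x)
    return P @ ehess @ P - (x @ egrad) * P

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2.0                      # ambient function: 0.5 * x^T A x
eigvals, eigvecs = np.linalg.eigh(A)
v_min = eigvecs[:, 0]                    # unit eigenvector with the smallest eigenvalue

g = riemannian_grad(v_min, A @ v_min)    # vanishes at every unit eigenvector
H = riemannian_hess(v_min, A @ v_min, A) # positive semi-definite at the minimizer

x = rng.standard_normal(5)
x /= np.linalg.norm(x)
tangency = x @ riemannian_grad(x, A @ x) # the manifold gradient is orthogonal to x
print(np.linalg.norm(g), np.linalg.eigvalsh(H).min(), tangency)
```

On this example the manifold gradient vanishes at the bottom eigenvector and the manifold Hessian has no negative eigenvalue there, which matches the fact that this point minimizes the quadratic over the sphere.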
3 Generalized Linear Models
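We consider the problem of learning a generalized linear model, and we will show that the loss function for it is non-convex, but all of its local minima are global. Suppose we observe $n$ data points $\{(x_i, y_i)\}_{i=1}^n$, where the $x_i$’s are sampled i.i.d. from some distribution $\mathcal{D}_x$ over $\mathbb{R}^d$. In the generalized linear model, we assume the label $y_i \in \mathbb{R}$ is generated from
$y_i = \sigma(w_\star^\top x_i) + \varepsilon_i$,
where $\sigma: \mathbb{R} \to \mathbb{R}$ is a known monotone activation function, the $\varepsilon_i \in \mathbb{R}$ are i.i.d. mean-zero noise (independent of $x_i$), and $w_\star \in \mathbb{R}^d$ is a fixed unknown ground-truth coefficient vector. We denote the joint distribution of $(x_i, y_i)$ by $\mathcal{D}$.
Our goal is to recover $w_\star$ approximately from the data. We minimize the empirical squared risk $\widehat{L}(w) = \frac{1}{2n}\sum_{i=1}^n (y_i - \sigma(w^\top x_i))^2$. Let $L(w)$ be the corresponding population risk: $L(w) = \frac{1}{2}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(y - \sigma(w^\top x))^2\big]$.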
We will analyze the optimization of via characterizing the property of its landscape. Our road map consists of two parts: a) all the local minima of the population risk are global minima; b) the empirical risk has the same property.
When $\sigma$ is the identity function, that is, $\sigma(t) = t$, we have the linear regression problem and the loss function is convex. In practice, people have taken $\sigma$ to be, e.g., the sigmoid function, and then the empirical objective is no longer convex.
Throughout the rest of the section, we make the following regularity assumptions on the problem. These assumptions are stronger than what’s necessary, for the ease of exposition. However, we note that some assumptions on the data are necessary because in the worst-case, the problem is intractable. (E.g., the generative assumption (3) on ’s is a key one.)
We assume the distribution and activation satisfy that
The vectors are bounded and non-degenerate: is supported in , and for some , where is the identity.
The ground truth coefficient vector satisfies , and .
The activation function is strictly increasing and twice differentiable. Furthermore, it satisfies the bounds
’s are mean zero and bounded: with probability 1, we have.
3.1 Analysis of the population risk
In this section, we show that all the local minima of the population risk are global minima. In fact, has a unique local minimum which is also global. (But still, may likely be not convex for many choices of .)
The objective has a unique local minimum, which is equal to and is also a global minimum. In particular, is weakly-quasi-convex.
The proof follows from directly checking the definition of quasi-convexity. The intuition is that generalized linear models behave very similarly to linear models through the lens of quasi-convexity: many of the inequalities in the proof effectively replace $\sigma$ by the identity function (or replace $\sigma'$ by 1).
Using the property that
, we have the following bias-variance decomposition (which can be derived by elementary manipulation)
The first term is independent of , and the second term is non-negative and equals zero at . Therefore, we see that is a global minimum of .
Towards proving that is quasi-convex, we first compute :
where the last equality used the fact that . It follows that
Now, by the mean value theorem, and bullet 3 of Assumption 3.1, we have that
Using and for every , and the monotonicity of ,
where the last step uses the decomposition (3.1) of the risk . ∎
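The correlation argument in the proof can be checked numerically. The sketch below makes several assumptions that are mine rather than the text's: the activation is the sigmoid, the labels are noiseless (so only the signal part of the risk is present), the risk is normalized as $\frac{1}{2n}\sum_i (y_i - \sigma(w^\top x_i))^2$, and the inputs are random unit vectors. With noiseless labels, every sample contributes a non-negative term to $\langle \nabla \widehat{L}(w), w - w_\star \rangle$ because the activation is increasing.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def risk_grad(w, X, y):
    # Gradient of (1/2n) * sum_i (y_i - sigmoid(w^T x_i))^2.
    s = sigmoid(X @ w)
    return X.T @ ((s - y) * s * (1.0 - s)) / X.shape[0]

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # bounded inputs
w_star = rng.standard_normal(d)
y = sigmoid(X @ w_star)                          # noiseless labels (assumption)

# Correlation between the gradient and the direction away from w_star.
correlations = []
for _ in range(100):
    w = w_star + 3.0 * rng.standard_normal(d)
    correlations.append(risk_grad(w, X, y) @ (w - w_star))
correlations = np.array(correlations)
print(correlations.min())
```

Every sampled point has a gradient that is positively correlated with $w - w_\star$, so a step against the gradient always makes progress towards $w_\star$ even though the risk itself need not be convex.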
3.2 Concentration of the empirical risk
We next analyze the empirical risk . We will show that with sufficiently many examples, the empirical risk is close enough to the population risk so that also satisfies that all local minima are global.
Theorem 3.3 (The empirical risk has no bad local minimum).
Under the problem assumptions, with probability at least , for all with , the empirical risk has no local minima outside a small neighborhood of : for any such that , if , then
where are universal constants that do not depend on .
Theorem 3.3 shows that all stationary points of have to be within a small neighborhood of . Stronger landscape property can also be proved though: there is a unique local minimum in the neighborhood of .
The main intuition is that to verify quasi-convexity or restricted secant inequality for , it suffices to show that with high probability over the randomness of the data,
4 Matrix Factorization Problems
In this section, we will discuss the optimization landscape of two problems based on matrix factorization: principal component analysis (PCA) and matrix completion. The fundamental difference between them and the generalized linear models is that their objective functions have saddle points that are not local minima or global minima. It means that the quasi-convexity condition or Polyak-Lojasiewicz condition does not hold for these objectives. Thus, we need more sophisticated techniques that can distinguish saddle points from local minima.
4.1 Principal Component Analysis
One interpretation of PCA is approximating a matrix by its best low-rank approximation. Given a matrix $M \in \mathbb{R}^{d_1 \times d_2}$, we aim to find its best rank-$r$ approximation (in either Frobenius norm or spectral norm). For ease of exposition, we take $r = 1$ and assume $M$ to be symmetric positive semi-definite with dimension $d$ by $d$. In this case, the best rank-1 approximation has the form $xx^\top$ where $x \in \mathbb{R}^d$.
There are many well-known algorithms for finding the low-rank factor . We are particularly interested in the following non-convex program that directly minimizes the approximation error in Frobenius norm.
We will prove that even though the objective (4.1) is not convex, all of its local minima are global. It also satisfies the strict saddle property (which we will not prove formally here). Therefore, local search algorithms can solve (4.1) in polynomial time. (In fact, local methods can solve it very fast. See, e.g., Li et al. (2017, Theorem 1.2).)
Our analysis consists of two main steps: a) to characterize all the stationary points of the objective, which turn out to be the eigenvectors of $M$; b) to examine each of the stationary points and show that only the top eigenvector(s) of $M$ can be a local minimum. Step b) implies the theorem because the top eigenvectors are also global minima of the objective. We start with step a) with the following lemma.
By elementary calculus, we have that
Therefore, if is a stationary point of , then , which implies that is an eigenvector of with eigenvalue equal to . ∎
Now we are ready to prove b) and the theorem. The key intuition is the following. Suppose we are at a point that is an eigenvector but not the top eigenvector, moving in either the top eigenvector direction or the direction of will result in a second-order local improvement of the objective function. Therefore, cannot be a local minimum unless is a top eigenvector.
Proof of Theorem 4.1.
By Lemma 4.2, we know that a local minimum is an eigenvector of . If is a top eigenvector of with the largest eigenvalue, then is a global minimum. For the sake of contradiction, we assume that is an eigenvector with eigenvalue that is strictly less than . By Lemma 4.2 we have . By elementary calculation, we have that
Let be the top eigenvector of with eigenvalue and with norm 1. Then, because , we have that
It’s a basic property of eigenvectors of positive semidefinite matrix that any pairs of eigenvectors with different eigenvalues are orthogonal to each other. Thus we have . It follows equation (4.4) and (4.3) that
|(by that has eigenvalue and that )|
|(by the assumption)|
which is a contradiction. ∎
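The two steps of the argument can be sanity-checked on a small random instance. The sketch below assumes the normalization $\tfrac{1}{2}\|M - xx^\top\|_F^2$ for the objective (the constants in the text may differ) and uses hypothetical helper names. It evaluates the gradient and the Hessian at each scaled eigenvector $\sqrt{\lambda_i}\,v_i$.

```python
import numpy as np

def g_grad(x, M):
    # Gradient of 0.5 * ||M - x x^T||_F^2 for symmetric M.
    return 2.0 * ((x @ x) * x - M @ x)

def g_hess(x, M):
    d = x.shape[0]
    return 2.0 * ((x @ x) * np.eye(d) + 2.0 * np.outer(x, x) - M)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthonormal eigenvectors
lams = np.array([4.0, 2.0, 1.0, 0.5])              # distinct eigenvalues, largest first
M = Q @ np.diag(lams) @ Q.T

report = []
for i in range(4):
    x = np.sqrt(lams[i]) * Q[:, i]                 # eigenvector scaled so ||x||^2 = lambda_i
    grad_norm = np.linalg.norm(g_grad(x, M))
    min_curv = np.linalg.eigvalsh(g_hess(x, M)).min()
    report.append((lams[i], grad_norm, min_curv))

for lam, grad_norm, min_curv in report:
    print(lam, grad_norm, min_curv)
```

All four scaled eigenvectors are stationary, but only the top one has a positive semi-definite Hessian; at the others, the top eigenvector direction has strictly negative curvature.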
4.2 Matrix completion
Matrix completion is the problem of recovering a low-rank matrix from partially observed entries, which has been widely used in collaborative filtering and recommender systems, dimension reduction, and multi-class learning. Despite the existence of elegant convex relaxation solutions, stochastic gradient descent on non-convex objectives is widely adopted in practice for scalability. We will focus on rank-1 symmetric matrix completion in this chapter, which demonstrates the essence of the analysis.
4.2.1 Rank-1 case of matrix completion
Let $M = zz^\top$ be a rank-1 symmetric matrix with factor $z \in \mathbb{R}^d$ that we aim to recover. We assume that we observe each entry of $M$ with probability $p$ independently. (Technically, because $M$ is symmetric, the entries at $(i, j)$ and $(j, i)$ are the same. Thus, we assume that, with probability $p$ we observe both entries and otherwise we observe neither.) Let $\Omega \subset [d] \times [d]$ be the set of entries observed.
Our goal is to recover from the observed entries of $M$ the vector $z$ up to sign flip (which is equivalent to recovering $M$).
A known issue with matrix completion is that if $M$ is “aligned” with the standard basis, then it’s impossible to recover it. E.g., when $M = e_j e_j^\top$ where $e_j$ is the $j$-th standard basis vector, we will very likely observe only entries with value zero, because $M$ is sparse. Such scenarios do not happen in practice very often though. The following standard assumption will rule out these difficult and pathological cases:
Assumption 4.3 (Incoherence).
W.L.O.G, we assume that . In addition, we assume that satisfies We will think of as a small constant or logarithmic in , and the sample complexity will depend polynomially on it.
In this setting, the vector can be recovered exactly up to a sign flip provided samples. However, for simplicity, in this subsection we only aim to recover with an norm error . We assume that which means that the expected number of observations is on the order of . We analyze the following objective that minimizes the total squared errors on the observed entries:
Here denotes the matrix obtained by zeroing out all the entries of that are not in . For simplicity, we only focus on characterizing the landscape of the objective in the following domain of incoherent vectors that contain the ground-truth vector (with a buffer of factor of 2)
We note that analyzing the landscape inside this domain does not suffice, because the iterates of the algorithms may leave it. We refer the readers to the original paper (Ge et al., 2016) for an analysis of the landscape over the entire space, or to the recent work (Ma et al., 2018) for an analysis that shows that the iterates won’t leave the set of incoherent vectors if the initialization is random and incoherent.
The global minima of are and with function value 0. In the rest of the section, we prove that all the local minima of are -close to .
In the setting above, all the local minima of inside the set are -close to either or . (It’s also true that the only local minima are exactly , and that has strict saddle property. However, their proofs are involved and beyond the scope of this chapter.)
It’s insightful to compare with the full observation case when . The corresponding objective is exactly the PCA objective defined in equation (4.1). Observe that is a sampled version of the , and therefore we expect that they share the same geometric properties. In particular, recall that does not have spurious local minima and thus we expect neither does .
However, it’s non-trivial to extend the proof of Theorem 4.1 to the case of partial observation, because it uses the properties of eigenvectors heavily. Indeed, if we imitate the proof of Theorem 4.1, we will first compute the gradient of the objective:
Then, we run into an immediate difficulty — how shall we solve the equation for stationary points . Moreover, even if we could have a reasonable approximation for the stationary points, it would be difficult to examine their Hessians without using the exact orthogonality of the eigenvectors.
The lesson from the trial above is that we may need an alternative proof for the PCA objective (full observation) that relies less on solving for the stationary points exactly. Such a proof is more likely to extend to the matrix completion (partial observation) case. In the sequel, we follow this plan by first providing an alternative proof for Theorem 4.1, which does not require solving the stationary-point equation, and then extending it via concentration inequalities to a proof of Theorem 4.4. The key intuition is the following:
Proofs that consist of inequalities that are linear in are often easily generalizable to partial observation case.
Here statements that are linear in mean the statements of the form . We will call these kinds of proofs “simple” proofs in this section. Indeed, by the law of large numbers, when the sampling probability is sufficiently large, we have that
Then, the mathematical implications of are expected to be similar to the implications of , up to some small error introduced by the approximation.
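The law-of-large-numbers heuristic can be illustrated with a quick simulation. Everything in this sketch is an assumption made for illustration: the dimension, the sampling probability, the random-sign ground truth (which is incoherent by construction), and the variable names. It compares an inner product computed on all entries with its rescaled counterpart computed on a random symmetric subset of entries.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 400, 0.5

z = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)     # incoherent unit vector
x = z + 0.3 * rng.standard_normal(d) / np.sqrt(d)    # a nearby incoherent point
x /= np.linalg.norm(x)

# Symmetric observation pattern: (i, j) and (j, i) are observed together.
U = np.triu(rng.random((d, d)) < p)
mask = U | U.T

M = np.outer(z, z)
X = np.outer(x, x)
full = np.sum(M * X)                  # <M, x x^T> over all entries, equals (z^T x)^2
sampled = np.sum((M * X)[mask]) / p   # same sum over observed entries, rescaled by 1/p
print(full, sampled)
```

For incoherent vectors no single entry dominates the sum, which is why the rescaled partial sum stays close to the full one; a sparse vector such as a standard basis vector would break this.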
What natural quantities about are of the form ? First, quantities of the form can be written as . Moreover, both the projection of and are of the form :
The concentration of these quantities can all be captured by the following theorem:
Let and . Then, with high probability of the randomness of , we have that for all , where and .
We will provide two claims below, combination of which proves Theorem 4.1. In the proofs of these two claims, all the inequalities are of the form of LHS of equation (4.8). Following each claim, we will immediately provide its extension to the partial observation case.
Suppose satisfies , then .
By elementary calculation
Intuitively, a stationary point ’s norm is governed by its correlation with . ∎
The following claim is the counterpart of Claim 1f in the partial observation case.
Suppose satisfies , then .
If has positive Hessian , then .
By the assumption on , we have that . Calculating the quadratic form of the Hessian (which can be done by elementary calculus and is skipped for simplicity), we have
This implies that
If has positive Hessian , then .
Imitating the proof of Claim 2f, calculating the quadratic form over the Hessian at , we have
Then by Claim 1f again we obtain , and therefore .
Because , we have that
Then by Claim 1p again, we have which implies that . Now suppose , then we have
Therefore is close to . On the other hand, if , we can similarly conclude that is -close to . ∎
5 Landscape of Tensor Decomposition
In this section, we analyze the optimization landscape for another machine learning problem, tensor decomposition. The fundamental difference between tensor decomposition and matrix factorization problems or generalized linear models is that the non-convex objective function here has multiple isolated local minima, and therefore the set of local minima does not have rotational invariance (whereas in matrix completion or PCA, the set of local minima is rotationally invariant). This essentially prevents us from using only linear algebraic techniques, because they are intrinsically rotationally invariant.
5.1 Non-convex optimization for orthogonal tensor decomposition and global optimality
We focus on one of the simplest tensor decomposition problems, orthogonal 4-th order tensor decomposition. Suppose we are given the entries of a symmetric 4-th order tensor $T \in \mathbb{R}^{d \times d \times d \times d}$ which has a low rank structure in the sense that:
$T = \sum_{i=1}^{n} a_i \otimes a_i \otimes a_i \otimes a_i$, (5.1)
where $a_i \in \mathbb{R}^d$. Our goal is to recover the underlying components $a_1, \dots, a_n$. We assume in this subsection that $a_1, \dots, a_n$ are orthogonal vectors in $\mathbb{R}^d$ with unit norm (and thus implicitly we assume $n \le d$). Consider the objective function
$\max_{\|x\|_2 = 1} f(x) := \langle T, x^{\otimes 4} \rangle = \sum_{i=1}^{n} \langle a_i, x \rangle^4$. (5.2)
The optimal value of this objective is the (symmetric) injective norm of the tensor $T$. In our case, the global maximizers of the objective above are exactly the set of components that we are looking for.
5.2 All local optima are global
We next show that all the local maxima of the objective (5.2) are also global maxima. In other words, we will show that are the only local maxima. We note that all the geometry properties here are defined with respect to the manifold of the unit sphere . (Please see Section 2.3 for a brief introduction of the notions of manifold gradient, manifold local maxima, etc.)
Towards proving the Theorem, we first note that the landscape property of a function is invariant to the coordinate system that we use to represent it. It’s natural for us to use the directions of together with an arbitrary basis in the complement subspace of as the coordinate system. A more convenient viewpoint is that this choice of coordinate system is equivalent to assuming are the natural standard basis . Moreover, one can verify that the remaining directions are irrelevant for the objective because it’s not economical to put any mass in those directions. Therefore, for simplicity of the proof, we make the assumption below without loss of generality:
Then we have that . We compute the manifold gradient and manifold Hessian using the formulae of and in Section 2.3,
where for a vector denotes the diagonal matrix with on the diagonal. Now we are ready to prove Theorem 5.2. In the proof, we will first compute all the stationary points of the objective, and then examine each of them and show that only can be local maxima.
Proof of Theorem 5.2.
We work under the assumptions and simplifications above. We first compute all the stationary points of the objective (5.2) by solving . Using equation (5.4), we have that the stationary points satisfy that
It follows that or . Assume that of the ’s are non-zero and thus take the second choice, we have that
This implies that , and or . In other words, all the stationary points of are of the form (where there are non-zeros) for some and all their permutations (over indices).
Next, we examine which of these stationary points are local maxima. Let for simplicity. This implies that . Consider a stationary point where . Let be a local maximum. Thus . We will prove that this implies . For the sake of contradiction, we assume . We will show that the Hessian cannot be negative semi-definite by finding a particular direction in which the Hessian has positive quadratic form.
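The case analysis over stationary points can be checked numerically in the simplified coordinate system, where the objective becomes $\sum_i x_i^4$ on the unit sphere. The sketch below assumes the standard sphere formulas for the manifold gradient and Hessian (projection with $I - xx^\top$ plus a curvature correction); the function names and the choice $d = 4$ are hypothetical.

```python
import numpy as np

def sphere_grad(x):
    # Ambient gradient of sum_i x_i^4 is 4 x^3; project it onto the tangent space.
    egrad = 4.0 * x ** 3
    return egrad - x * (x @ egrad)

def sphere_hess(x):
    d = x.shape[0]
    P = np.eye(d) - np.outer(x, x)
    egrad = 4.0 * x ** 3
    ehess = 12.0 * np.diag(x ** 2)
    return P @ ehess @ P - (x @ egrad) * P

d = 4
e1 = np.eye(d)[0]                                        # a single component
mixed = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)    # equal mass on two components

grad_e1, grad_mixed = sphere_grad(e1), sphere_grad(mixed)
max_curv_e1 = np.linalg.eigvalsh(sphere_hess(e1)).max()
max_curv_mixed = np.linalg.eigvalsh(sphere_hess(mixed)).max()
print(np.linalg.norm(grad_e1), np.linalg.norm(grad_mixed), max_curv_e1, max_curv_mixed)
```

Both points are stationary, but only the single-component point has a negative semi-definite manifold Hessian. At the two-component point, the direction that moves mass from one coordinate to the other has positive curvature, so it cannot be a local maximum.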
6 Survey and Outlook: Optimization of Neural Networks
Theoretical analysis of algorithms for learning neural networks is highly challenging. We still lack handy mathematical tools. We will articulate a few technical challenges and summarize the attempts and progress so far.
We follow the standard setup in supervised learning. Let be a neural network parameterized by parameters . (E.g., a two layer neural network would be where and are some activation functions.) Let be the loss function, and be a set of i.i.d. examples drawn from distribution . The empirical risk is and the population risk is
The major challenge of analyzing the landscape property of or stems from the non-linearity of neural networks: is neither linear in , nor in . As a consequence, and are not convex in . Linear algebra is at odds with neural networks: neural networks do not have good invariance properties with respect to rotations of parameters or data points.
Linearized neural networks. Early works for optimization in deep learning simplify the problem by considering linearized neural networks: is assumed to be a neural network without any activation functions. E.g., with would be a three-layer feedforward linearized neural network. Now, the model is still not linear in , but it is linear in . This simplification maintains the property that or are still nonconvex functions in , but allows the use of linear algebraic tools to analyze the optimization landscapes of or .
is a linearized feed-forward neural network (but it does have degenerate saddle points, so it does not satisfy the strict saddle property). Hardt et al. (2016); Hardt and Ma (2017) analyze the landscape of learning linearized residual and recurrent neural networks and show that all the stationary points (in a region) are global minima. We refer the readers to Arora et al. (2018) and references therein for some recent works along this line.
There are various results on another simplification: two-layer neural networks with quadratic activations. In this case, the model is linear in and quadratic in the parameters, and linear algebraic techniques allow us to obtain relatively strong theory. See Li et al. (2017); Soltanolkotabi et al. (2018); Du and Lee (2018) and references therein.
We remark that the line of results above typically applies to the landscape of the population losses as well as the empirical losses when there is a sufficient number of examples. (Note that the former implies the latter when there is a sufficient number of data points compared to the number of parameters, because in this case, the empirical loss has a similar landscape to that of the population loss due to concentration properties (Mei et al., 2017).)
Changing the landscape, by, e.g., over-parameterization or residual connection. Somewhat in contrast to the clean case covered in earlier sections of this chapter, people have empirically found that the landscape properties of neural networks depend on various factors including the loss function, the model parameterization, and the data distribution. In particular, changing the model parameterization and the loss functions properly could ease the optimization.
An effective approach to changing the landscape is to over-parameterize the neural networks, that is, to use a large number of parameters by enlarging the width, often more than is necessary for expressivity and often more than the total number of training samples. It has been empirically found that wider neural networks may alleviate the problem of bad local minima that may occur in training narrower nets (Livni et al., 2014). This motivates a lot of studies on the optimization landscape of over-parameterized neural networks. Please see Safran and Shamir (2016); Venturi et al. (2018); Soudry and Carmon (2016); Haeffele and Vidal (2015) and the references therein.
We note that there is an important distinction between two types of over-parameterization: (a) more parameters than what’s needed for sufficient expressivity but still fewer parameters than the number of training examples, and (b) more parameters than the number of training examples. Under the latter setting, analyzing the landscape of the empirical loss no longer suffices because even if the optimization works, the generalization gap might be too large, or in other words the model overfits (an issue that is manifested clearly in the NTK discussion below). In the former setting, though generalization is less of a concern, analyzing the landscape is more difficult because it has to involve the complexity of the ground-truth function.
Two extremely empirically successful approaches in deep learning, residual neural networks (He et al., 2015) and batch normalization (Ioffe and Szegedy, 2015), are both conjectured to be able to change the landscape of the training objectives and lead to easier optimization. This is an interesting and promising direction with the potential of circumventing certain mathematical difficulties, but existing works often suffer from strong assumptions such as the linearized assumption in Hardt and Ma (2017) and the Gaussian data distribution assumption in Ge et al. (2018).
Connection between over-parametrized model and Kernel method: the Neural Tangent Kernel (NTK) view. Another recent line of work studies the optimization dynamics of learning over-parameterized neural networks at a special type of initialization with a particular learning rate scheme (Li and Liang, 2018; Du et al., 2018; Jacot et al., 2018; Allen-Zhu et al., 2018), instead of characterizing the full landscape of the objective function. The main conclusion is of the following form: when using over-parameterization (with more parameters than training examples), under a special type of initialization, optimizing with gradient descent can converge to a zero training error solution.
The results can also be interpreted as a combination of landscape results and a convergence result: (i) the landscape in a small neighborhood around the initialization is sufficiently close to convex, (ii) a zero-error global minimum exists in the neighborhood, and (iii) gradient descent from the initialization will not leave the neighborhood and will converge to the zero-error solution. Consider a non-linear model and an initialization. We can approximate the model by a linear model via a first-order Taylor expansion at the initialization.
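The linearization step can be sketched for a tiny two-layer network. All specifics here are assumptions for illustration: the $\tanh$ activation, the $1/\sqrt{m}$ output scaling, training only the first-layer weights `W` while the output signs `a` stay fixed, and the function names. The sketch compares the network with its first-order Taylor expansion around the initialization `W0` for perturbations of decreasing size.

```python
import numpy as np

def model(W, a, x):
    # Two-layer network a^T tanh(W x) / sqrt(m) with fixed output weights a.
    return a @ np.tanh(W @ x) / np.sqrt(len(a))

def model_grad_W(W, a, x):
    # Gradient with respect to W: row j is a_j * (1 - tanh^2(w_j^T x)) * x / sqrt(m).
    act = np.tanh(W @ x)
    return np.outer(a * (1.0 - act ** 2), x) / np.sqrt(len(a))

def linearized(W, W0, a, x):
    # First-order Taylor expansion of the model around W0.
    return model(W0, a, x) + np.sum(model_grad_W(W0, a, x) * (W - W0))

rng = np.random.default_rng(0)
m, d = 50, 10
W0 = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
direction = rng.standard_normal((m, d))

gaps = {}
for delta in (1e-1, 1e-2, 1e-3):
    W = W0 + delta * direction
    gaps[delta] = abs(model(W, a, x) - linearized(W, W0, a, x))
print(gaps)
```

The gap between the network and its linearization shrinks roughly quadratically with the distance from the initialization. The linearized model is linear in the parameters, with features given by the gradient at initialization, which is the sense in which training near initialization resembles a kernel method.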