1 Learning through Optimization
A key desirable feature of automated learning algorithms is the ability to learn models directly from data with minimal need for direct intervention by the designer. This is generally achieved by parameterizing a family of models of sufficient explanatory power through a set of parameters and subsequently searching for the choice of parameters that fits the data “well”, in the sense that:
In this formulation, the loss function denotes a measure of fit of the model for the random data . Hence, the desired model is defined as the one that results in the smallest expected risk, where the expectation is taken with respect to the distribution of the data . As we illustrate in a number of examples in the sequel, a vast majority of inference problems fit into the general framework (1
can be decomposed into a feature vector and a label . When the target variable is continuous, this is typically an estimation problem with the objective being to construct an estimator such that the error is small in some sense with high probability. One popular choice for the loss function in this case is the squared error loss:
On the other hand, when is scalar and discrete, such as in the binary case , the problem becomes a (binary) classification problem, with the objective being to find a classifier such that with high probability . An example of a popular choice for the loss function in this case is the logistic loss:
∎ We note that while the choice of the loss function is generally informed by the distribution of the target variable , such as whether it is continuous or discrete, we still need to specify the dependence of on . Since in both examples (2) and (3), the loss depends on through , we can describe this dependence by parameterizing through .
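The two losses named in (2) and (3) can be sketched as follows. This is a minimal illustration, assuming a real-valued prediction and, in the binary case, labels encoded as ±1; the label encoding and the function names are assumptions made for the sketch and are not fixed by the text.

```python
import math


def squared_error_loss(target, prediction):
    """Squared error for a continuous target and a real-valued prediction."""
    return (target - prediction) ** 2


def logistic_loss(label, score):
    """Logistic loss log(1 + exp(-label * score)) for a label in {-1, +1}.

    The +/-1 label encoding is an assumption of this sketch.
    """
    margin = label * score
    # Branch on the sign of the margin so that exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Both functions depend on the data only through the prediction, which is what allows the model parametrization to be chosen separately from the loss.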
[Modeling for supervised learning] The most immediate parametrization of corresponds to the set of linear mappings:
) leads to the logistic regression solution, both of which are convex optimization problems with efficient solution methods Sayed14. While convexity of the resulting problem (1) is an appealing property to have, the evident drawback of the linear parametrization (4) is its limited expressive power. Only mappings that correspond to linear combinations of the elements of the feature vector are captured by (4), while non-linear interactions are beyond the scope of this model. For this reason, recent years have seen an increased interest in the utilization of neural networks, which are nested models of the form lecun15:
where the denote matrices of appropriate dimensions and denotes an element-wise activation function (usually nonlinear in form). We can collect in all parameters , i.e., and again recover an instance of (1) for both the quadratic (2) and logistic (3) losses. Models of the form (5), particularly for a suitable size and dimensions of hidden layers, are able to model non-linear classification functions well. However, note that any choice will generally result in a nonconvex loss surface (1). This necessitates the development of performance guarantees for algorithms solving (1) in nonconvex environments. ∎
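The contrast between the linear parametrization (4) and the nested form (5) can be made concrete with a short sketch. The function names, the use of `tanh` as the activation, and the choice to leave the final layer linear are assumptions made for illustration; the text does not fix these details.

```python
import numpy as np


def linear_model(w, h):
    """Linear mapping of the feature vector h with parameter vector w."""
    return float(np.dot(h, w))


def nested_model(weights, h, activation=np.tanh):
    """Nested model: alternate matrix products and element-wise activations.

    `weights` is a list of matrices with compatible dimensions. The final
    layer is left linear here, which is an assumption of this sketch.
    """
    z = np.asarray(h, dtype=float)
    for W in weights[:-1]:
        z = activation(W @ z)
    return weights[-1] @ z


def collect_parameters(weights):
    """Stack all layer matrices into a single parameter vector."""
    return np.concatenate([W.ravel() for W in weights])
```

Collecting the layer matrices into one vector, as `collect_parameters` does, is what lets the nested model be treated as another instance of the same optimization problem over a single parameter vector.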
[Unsupervised learning] Not all learning problems present themselves as supervised problems where the objective is to learn a mapping from feature to label. One such example is in the design of recommender systems where users are implicitly clustered and receive recommendations based on the preferences of “similar” other users. A popular approach in this setting revolves around matrix factorization Koren09. One such implementation results in:
where , and denotes the regularization weights. The matrices are generally chosen to be tall, so that has low rank, and (6) pursues a low-rank approximation of . ∎
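A regularized factorization objective of the kind described above can be sketched as follows. The exact form of (6) is not reproduced here: the sketch assumes a fully observed matrix `R`, a squared Frobenius-norm fit term, and two separate regularization weights `rho_u` and `rho_v`, all of which are illustrative assumptions.

```python
import numpy as np


def factorization_objective(R, U, V, rho_u, rho_v):
    """Regularized low-rank fit of R by the product U V^T.

    U and V are tall factors (many rows, few columns), so U @ V.T has rank
    at most the number of columns. The specific form is an assumption.
    """
    residual = R - U @ V.T
    fit = np.sum(residual ** 2)
    penalty = rho_u * np.sum(U ** 2) + rho_v * np.sum(V ** 2)
    return fit + penalty
```

Because the objective is a product of the two unknown factors, it is nonconvex in the pair even though it is convex in each factor separately.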
2 Centralized Stochastic Optimization
From the preceding examples, we conclude that a large number of learning problems, including linear as well as non-linear regression and classification problems, and unsupervised formulations, can be recovered by specializing the general stochastic optimization problem (1). The task of designing an effective learning method then boils down to two related decisions: (a) the choice of the learning architecture, which determines the form of the loss , and (b) the choice of the optimization strategy, which given realizations of the random variable yields a high-quality estimate for . For the remainder of this article we will consider the architecture, and hence , fixed and will focus on the latter challenge, namely providing performance guarantees for the quality of the estimate of produced by the optimization algorithm for general nonconvex problems. We let:
2.1 Notions of Optimality
Loosely speaking, the objective of any (stochastic) optimization algorithm is to produce “high-quality” estimates for the minimizer in (1). When the risk is strongly-convex there is little ambiguity in the quantification of the quality of an estimate, since for strongly-convex costs with constant we have (Vandenberghe04, Sec. 9.1.2):
If the risk additionally has -Lipschitz gradients, we similarly have (Vandenberghe04, Sec. 9.1.2):
By inspecting these two inequalities we conclude that all three measures of optimality, namely the squared deviation from the minimizer , the excess risk , and the squared gradient norm are essentially equivalent up to constants that depend on the strong-convexity and Lipschitz parameters and , respectively. This means that, as long as the problem is reasonably well-conditioned, meaning that the fraction does not grow too large, the choice of the performance measure is not particularly relevant, since high performance in one measure necessarily implies high performance in both other measures. In other words, any point with a small gradient norm , for strongly-convex problems, will essentially be globally optimal in the sense that both the excess risk and distance to the minimizer will be small.
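The equivalence of the three measures can be checked numerically on a strongly convex quadratic. The sketch below assumes the standard sandwich inequalities for a risk with strong-convexity constant `nu` and gradient-Lipschitz constant `delta`; the symbol names, the quadratic test risk, and the numerical values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, delta = 0.5, 4.0  # assumed strong-convexity and gradient-Lipschitz constants
dim = 5

# Symmetric matrix with eigenvalues spread over [nu, delta].
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
A = Q @ np.diag(np.linspace(nu, delta, dim)) @ Q.T
w_opt = rng.standard_normal(dim)  # minimizer of the quadratic risk


def optimality_measures(w):
    """Return (squared distance, excess risk, squared gradient norm) at w."""
    err = w - w_opt
    grad = A @ err
    return err @ err, 0.5 * err @ A @ err, grad @ grad


w = rng.standard_normal(dim)
dist_sq, excess, grad_sq = optimality_measures(w)

# Strong convexity: (nu/2)*dist^2 <= excess risk <= grad^2 / (2*nu).
strongly_convex_ok = 0.5 * nu * dist_sq <= excess <= grad_sq / (2 * nu)
# Lipschitz gradients: grad^2 / (2*delta) <= excess risk <= (delta/2)*dist^2.
lipschitz_ok = grad_sq / (2 * delta) <= excess <= 0.5 * delta * dist_sq
```

Each measure is bounded above and below by the others up to factors involving only `nu` and `delta`, so their ratio governs how loosely the three measures track each other.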
In the nonconvex setting considered here, and hence in the absence of (8), this is no longer the case as we illustrate in the sequel. [-first-order stationarity] A point is -first-order stationary if:
These points are technically only approximately first-order stationary, since exact first-order stationarity would require . Since we generally refer to -first-order stationarity throughout this manuscript, we will drop “approximate” for convenience whenever it is clear from context. ∎ In light of relation (9), for costs with -Lipschitz gradients, -first-order stationarity is a necessary condition to ensure and . However, unless the cost is assumed to additionally be strongly convex, Definition 2.1 is not sufficient to guarantee that the point has small excess risk or small distance to the minimizer , since establishing sufficiency requires (8), which only holds for strongly-convex costs. In fact, the set of -first-order stationary points for nonconvex risk functions includes the set of local minima and local maxima as well as saddle-points. Nevertheless, many studies of local descent algorithms in nonconvex environments establish performance guarantees by showing that the limiting points of the algorithm are approximately first-order stationary using variations of Definition 2.1 in both the single-agent and multi-agent settings Nesterov98; Bertsekas00; Reddi16; DiLorenzo16; Tatarenko17; Lian17; Tang18; Wang18. These results are reassuring, as first-order stationarity is a necessary condition for local optimality, and hence any algorithm that does not produce a first-order stationary point will necessarily not produce a point with small excess risk, or small distance to the minimizer. However, these results cannot ensure that the limiting first-order stationary point does not correspond to a saddle-point; saddle-points have been identified as a bottleneck in many nonconvex problems of interest Choromanska14. This observation, following the works Nesterov06; Ge15; Lee16, motivates us to consider a stronger notion of optimality.
To formulate it, note that our objective is to converge towards points that are local minima and hence satisfy:
for all small . In other words, we would like to avoid approaching points where there exists such that:
By introducing the second-order Taylor expansion around , we can write:
where we dropped the linear term since, at first-order stationary points, . Hence, we shall say that is second-order locally optimal according to its second-order Taylor expansion if, and only if,
This requirement is equivalent to . We emphasize that is second-order locally optimal, since expression (2.1) is only an approximation of based on derivatives up to second-order. Therefore, approaching points where is desirable. Another way to see this is to note that we also have from (2.1):
where equality holds whenever is the eigenvector of corresponding to , i.e., . It follows that whenever is negative, the larger its magnitude is, the less locally optimal is. In other words, points with significantly negative are highly undesirable limiting points of a local descent algorithm. Motivated by this discussion, we define the set of -second-order stationary points. [-second-order stationarity] A point is -second-order stationary if it is -first-order stationary following Definition 2.1 and additionally, for some ,
denotes the smallest eigenvalue of the Hessian matrix. ∎ We will be focusing on the case when is small. Intuitively, points that satisfy condition (16) are either local minima (e.g., when all eigenvalues of the Hessian matrix are positive) or they are weak saddle-points that are close to local minima (when the smallest eigenvalue is negative but only by a small amount). Returning to (2.1), we find that every -second-order stationary point satisfies:
Note that, as , the definition of -second-order stationarity corresponds to the definition of local optimality (11). The freedom to set any , rather than requiring , allows us to set an expectation of local optimality in the sense of (17). This quantity does not appear as a parameter of any of the algorithms presented in this work, but does appear in the expressions on the convergence time (Theorems 2.4 and 5) as meaning that a higher expectation of local optimality requires longer running time of the algorithms, which conforms with intuition. We conclude that, while for non-zero not all -second-order stationary points are locally optimal, any -second-order stationary is almost locally optimal for small in the sense of (17).
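The role of the smallest Hessian eigenvalue in the second-order Taylor argument can be verified numerically. The sketch below uses an illustrative diagonal Hessian with one negative eigenvalue and checks that the quadratic change in risk along any step is bounded below by half the smallest eigenvalue times the squared step length, with equality along the corresponding eigenvector. The matrix and the helper names are assumptions made for the example.

```python
import numpy as np

# Illustrative Hessian at a stationary point: one direction of negative curvature.
hessian = np.diag([2.0, 0.5, -1.0])

# eigh returns eigenvalues in ascending order, so index 0 is the smallest.
eigvals, eigvecs = np.linalg.eigh(hessian)
lam_min, v_min = eigvals[0], eigvecs[:, 0]


def quadratic_change(step):
    """Second-order Taylor estimate of the change in risk along `step`."""
    return 0.5 * step @ hessian @ step


def lower_bound(step):
    """Bound implied by the smallest eigenvalue of the Hessian."""
    return 0.5 * lam_min * (step @ step)
```

A step of fixed length decreases the approximate risk the most when it is aligned with `v_min`, which is why a strongly negative smallest eigenvalue signals a poor limiting point.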
The set of second-order stationary points in Definition 2.1 is a subset of the set of first-order stationary points in Definition 2.1. Every second-order stationary point is also first-order stationary, but the additional restriction (16) allows for the exclusion of certain, undesirable, stationary points that do not satisfy (17), such as local maxima and saddle-points. Specifically, by choosing small enough, we are able to exclude any first-order stationary point where the smallest eigenvalue of the Hessian is negative and bounded away from zero. These points, which correspond to the complement of Definition 2.1, are frequently referred to as strict saddle-points in the literature due to the requirement for the smallest eigenvalue to be strictly negative. [-strict saddle-points] A point is a -strict saddle-point if it is -first-order stationary following Definition 2.1 and additionally:
Note that the only difference to Definition 2.1 is the reversal of inequality (16) to (18). As such, the set of -strict saddle-points is precisely the complement of the set of -second-order stationary points in the set of first-order stationary points. ∎ Note that, depending on the choice of the parameter , not all saddle-points of the cost need to be -strict saddle-points. If happens to have a saddle-point with , then this particular saddle-point would not be -strict , and in fact would fall under Definition 2.1 of a -second-order stationary point. Nevertheless, so long as is small, such saddle-points can intuitively be viewed as “weak” saddle-points, in the sense that they are almost locally optimal according to (17).
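The partition of first-order stationary points into second-order stationary points and strict saddle-points can be expressed as a small classification routine. The thresholds `grad_tol` and `tau` are illustrative stand-ins for the two stationarity parameters, and the function name and return strings are assumptions made for the sketch.

```python
import numpy as np


def classify_point(grad, hessian, grad_tol, tau):
    """Classify a point from its gradient and Hessian.

    grad_tol bounds the squared gradient norm (first-order condition) and
    tau bounds how negative the smallest Hessian eigenvalue may be.
    Both thresholds are assumptions of this sketch.
    """
    grad = np.asarray(grad, dtype=float)
    if grad @ grad > grad_tol:
        return "not first-order stationary"
    lam_min = np.linalg.eigvalsh(hessian)[0]  # eigenvalues in ascending order
    if lam_min >= -tau:
        return "second-order stationary"
    return "strict saddle-point"
```

For the risk x² − y² at the origin, the Hessian is diag(2, −2): the origin is classified as a strict saddle-point for `tau = 0.1`, but would count as second-order stationary for a loose threshold such as `tau = 3`, which mirrors the remark that the classification depends on the choice of the parameter.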
Under this formal definition, the set of strict saddle-points includes local maxima as well. In fact, if all eigenvalues of at a first-order stationary point were bounded from above by , then would be a local maximum. The set of strict saddle-points, however, is larger than the set of local maxima, since only one eigenvalue of the Hessian at strict saddle-points is required to be bounded from above by , while other eigenvalues are unrestricted. Hence, the incorporation of second-order information in the definition of stationarity allows us to distinguish between -second-order stationary points and
-strict saddle-points and allows for the exclusion of points with a significant local descent direction from the set of potentially optimal points. Furthermore, for many loss functions commonly found in machine learning, such as tensor decomposition Ge15, matrix completion Ge16, low-rank recovery Ge17 and some deep learning formulations Kawaguchi16, all saddle-points and local maxima have been shown to have a significant negative eigenvalue in the Hessian, and can hence be excluded from the set of second-order stationary points for sufficiently small, but finite, . For such risk functions, all -second-order stationary points for some small, but finite, correspond to local, or even global, minima.
This observation has motivated a number of works to pursue higher-order stationarity guarantees of local descent algorithms by means of second-order information Nesterov06; Curtis17; Tripuraneni18, intermediate searches for the negative curvature direction Fang18; Allen18neon; Allen18natasha, perturbations in the initialization Lee16; Du17; Scutari18 or to the update direction Gelfand91; Ge15; Jin17; Fang19; Jin19; HadiDaneshmand18; Swenson19; Vlaski19single; Vlaski19nonconvexP1; Vlaski19nonconvexP2, both in the centralized and decentralized setting. Our focus in this manuscript will be on strategies that exploit the presence of perturbations in the update direction to escape from saddle-points. The motivation for this is two-fold. First, in large-scale and online learning problems, the evaluation of exact descent directions is generally infeasible, making the utilization of stochastic gradients, and hence the introduction of stochastic perturbations a necessity. Second, as we shall see, perturbations to the gradient direction can be shown to be sufficient to guarantee efficient escape from saddle-points, meaning that the escape-time can be bounded by quantities that scale favorably with problem dimensions and parameters, resulting in simple, yet effective solutions for escaping saddle-points and guaranteeing second-order stationarity without the need to significantly alter the operation of the algorithm.
2.2 Stochastic Gradient Descent
One popular first-order approach to pursuing a minimizer for problem (1) can be obtained by means of gradient descent, resulting in the recursion:
The limitation of this recursion lies in the fact that evaluation of the exact gradient of requires statistical information about the random variable in light of:
The most common remedy for this challenge is to instead employ a stochastic approximation of the gradient based on realizations of the random variable available at time . We denote a general stochastic gradient approximation by and iterate:
Observe that we now denote in bold font to emphasize the fact that, by utilizing a stochastic approximation based on realizations of the random variable in place of the true gradient based on the distribution of , the resulting iterates will become stochastic themselves. We will leave the actual specification of the approximation for the examples and describe performance guarantees under general approximations satisfying fairly general modeling conditions.
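The stochastic recursion can be sketched generically, with the gradient approximation passed in as a function. The example approximation below uses a single sample of a linear regression model with squared error loss; the model `w_true`, the noise level, the step-size, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def stochastic_gradient_descent(grad_approx, w_init, mu, num_iters, rng):
    """Iterate w <- w - mu * grad_approx(w), with a constant step-size mu."""
    w = np.array(w_init, dtype=float)
    for _ in range(num_iters):
        w = w - mu * grad_approx(w, rng)
    return w


w_true = np.array([1.0, -2.0, 0.5])  # hypothetical model generating the data


def least_squares_grad_approx(w, rng):
    """Gradient of the squared error loss at one freshly drawn sample."""
    h = rng.standard_normal(w_true.size)              # feature vector
    gamma = h @ w_true + 0.1 * rng.standard_normal()  # noisy label
    return -2.0 * h * (gamma - h @ w)


w_final = stochastic_gradient_descent(
    least_squares_grad_approx, np.zeros(3), mu=0.01, num_iters=5000,
    rng=np.random.default_rng(0),
)
```

Because every iterate depends on the random samples drawn so far, `w_final` is itself a random quantity, and a constant step-size leaves it fluctuating in a small neighborhood of the minimizer rather than converging exactly.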
2.3 Modeling Conditions
We begin by introducing smoothness conditions on both the gradient and Hessian of the risk . [Lipschitz gradients] The gradient is Lipschitz, namely, there exists such that for any :
∎ [Lipschitz Hessians] The risk is twice-differentiable and there exists such that:
∎ Condition (22) appears commonly in the study of first-order optimality guarantees of (stochastic) gradient algorithms Nesterov98; Bertsekas00; Sayed14. The Lipschitz condition on the Hessian matrix is not necessary to establish performance bounds in the (strongly-)convex case or first-order stationarity, but can be used to more accurately quantify deviations around the minimizer in steady-state Sayed14, or to establish the escape from saddle-points Ge15; Jin17; HadiDaneshmand18; Vlaski19nonconvexP1; Vlaski19nonconvexP2. The second set of conditions below establishes bounds on the quality of the stochastic gradient approximation . We define the stochastic gradient noise process:
[Gradient noise process] The gradient noise process (24) is unbiased with a relative bound on its fourth-moment:
for some non-negative constants . ∎ Relation (25) requires that the gradient approximation be unbiased. Condition (26) imposes a bound on the fourth moment of the gradient noise, but allows for this bound to grow with the norm of the gradient . Note that, in light of Jensen’s inequality and sub-additivity of the square root, condition (26) implies and is slightly stronger than:
Condition (27) is sufficient to establish limiting first-order stationarity Bertsekas00, while the fourth-moment condition (26) will allow us to more carefully analyze the dynamics of (21) around first-order stationary points and establish escape from saddle-points, resulting in second-order guarantees. We also impose conditions on the covariance of the gradient noise. [Lipschitz covariances] The gradient noise process has a Lipschitz covariance matrix, i.e.,
for all , some and . ∎ Note from the definition of the gradient noise covariance (28), that the distribution of the gradient noise process is a function of the iterate . This, of course, is natural since the gradient noise is defined in (24) as the difference between the true and the approximate gradient at the current iterate. The fact that the perturbations introduced into the stochastic recursion (21) are not necessarily identically distributed over time introduces challenges in the study of their cumulative effect. Thankfully, the gradient noise processes induced by most constructions for and losses of interest have a covariance with a Lipschitz-type property (29). This condition ensures that the covariance is sufficiently smooth over localized regions in space, resulting in essentially identically distributed gradient noise perturbations in the short-term and a tractable analysis. It has also been exploited to derive accurate steady-state performance expressions in the strongly-convex setting Sayed14.
In contrast to Assumption 24, which bounds the perturbations induced by employing stochastic gradient approximations from above, we will also be imposing a lower bound on the stochastic gradient noise. [Gradient noise in strict saddle-points] Suppose is an approximate strict-saddle point following Definition 2.1. Introduce the eigendecomposition of the Hessian matrix as and partition:
where and . Then, we assume that:
for some . ∎
If we construct a local Taylor approximation around the strict saddle-points , we have:
since at strict saddle-points and, hence, the linear term vanishes. For every in the range of , i.e., , we then have by definition of , and hence . We conclude that the space spanned by corresponds to the local descent directions around the strict saddle-point . Hence, condition (31) imposes a lower bound on the gradient noise component in the local descent direction (spanned by ) in the vicinity of saddle-points. It is a notable deviation from the assumptions typically imposed in the convex setting. While assumptions 2.3–27 are, for example, all leveraged in deriving steady-state performance expressions in Sayed14 under an additional strong-convexity condition, assumption 2.3 is unique to the study of the behavior of stochastic gradient-type algorithms in the vicinity of saddle-points HadiDaneshmand18; Vlaski19nonconvexP1; Vlaski19nonconvexP2 in nonconvex optimization. It may be particularly surprising since the presence of perturbations in the dynamics of gradient-type algorithms is generally understood to be a negative side-effect of the utilization of stochastic gradient approximations that results in deterioration of performance, which is generally true for (strongly) convex objectives. When generalizing to nonconvex objectives, as recent analysis has shown Ge15; Jin17; HadiDaneshmand18; Vlaski19nonconvexP1; Vlaski19nonconvexP2, the persistent presence of gradient perturbations allows the algorithm to efficiently escape from saddle-points, which are unstable to gradient perturbations, and arrive at local minima, which tend to be more stable to the same types of perturbations. In this sense, condition (31) allows the algorithm to distinguish stable local minima from unstable saddle-points, both of which are first-order stationary points.
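The stabilizing role of gradient noise near saddle-points can be illustrated on a two-dimensional toy risk. The sketch assumes the risk 0.5·x² + 0.25·y⁴ − 0.5·y², which has a strict saddle-point at the origin (Hessian diag(1, −1)) and local minima at y = ±1; the risk, step-size, noise level, and function names are all assumptions chosen for the example.

```python
import numpy as np


def risk_gradient(w):
    """Gradient of the toy risk 0.5*x**2 + 0.25*y**4 - 0.5*y**2."""
    x, y = w
    return np.array([x, y ** 3 - y])


def noisy_gradient_descent(w_init, mu, noise_std, num_iters, rng):
    """Gradient recursion with additive zero-mean perturbations."""
    w = np.array(w_init, dtype=float)
    for _ in range(num_iters):
        noise = noise_std * rng.standard_normal(2)
        w = w - mu * (risk_gradient(w) + noise)
    return w


# Exact gradient steps started at the saddle-point never leave it.
w_exact = noisy_gradient_descent([0.0, 0.0], mu=0.05, noise_std=0.0,
                                 num_iters=2000, rng=np.random.default_rng(0))
# Perturbed steps drift along the negative-curvature direction y.
w_noisy = noisy_gradient_descent([0.0, 0.0], mu=0.05, noise_std=0.1,
                                 num_iters=2000, rng=np.random.default_rng(0))
```

In this example the perturbation has a component along the y-axis, which is the descent direction at the origin, so the noisy recursion leaves the saddle-point and settles near one of the two minima, while the noiseless recursion stays at the origin.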
As we will see in the examples in the sequel, and the following Section 3, the formulation (21) under the modeling conditions 2.3–2.3 is sufficiently general to capture a plethora of first-order stochastic algorithms for the minimization of (7). [Stochastic gradient descent] Suppose we have access to a realization of the data at time . We can construct a stochastic gradient approximation as:
Then, condition (25) follows immediately by definition of (33), while (26) can be verified for a number of choices of the loss function and data distributions of . We shall denote the resulting constants:
∎ [Mini-batch stochastic gradient descent] Suppose we instead have access to a collection of independent samples at time and the computational capacity to compute gradient operations at every iteration. We can then construct the mini-batch gradient approximation:
It again follows that satisfies (25). For the fourth-order moment, we can verify by induction over that:
in terms of the constants and of the single-element stochastic gradient algorithm in example 2, as well as the constant:
We observe a -fold decrease in the mean-fourth moment, which implies a -fold reduction in the second-order moment and complies with our intuition about variance reduction by averaging. For the gradient noise covariance we have:
∎ [Perturbed stochastic gradient descent] In the absence of prior knowledge that there is a gradient noise component in the descent direction for every strict saddle-point (Assumption 2.3), one can always guarantee condition (31) to hold by adding a small perturbation term with positive-definite covariance matrix as done in Ge15; Jin19 to construct:
For the gradient noise covariance we then have:
and hence Assumption 2.3 is guaranteed to hold. More elaborate constructions, such as only adding an additional perturbation when the iterate is suspected to be near a first-order stationary point (as done in Jin17) are also possible. ∎
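The variance reduction obtained by mini-batching in the example above can be observed empirically. The sketch below measures the gradient noise of a single-sample and a mini-batch approximation at a fixed point for a linear regression model with squared error loss and standard normal features; the model `w_true`, the batch size, and the number of trials are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical data-generating model
w = np.zeros(3)                      # fixed point at which noise is measured
num_trials, batch_size = 20000, 10

# Draw num_trials mini-batches of features and noisy labels.
h = rng.standard_normal((num_trials, batch_size, 3))
gamma = h @ w_true + 0.1 * rng.standard_normal((num_trials, batch_size))
residual = gamma - h @ w
sample_grads = -2.0 * h * residual[..., None]  # per-sample loss gradients

# Exact gradient of the expected squared error, valid since E[h h^T] = I here.
true_grad = 2.0 * (w - w_true)

single_noise = true_grad - sample_grads              # shape (trials, batch, 3)
batch_noise = true_grad - sample_grads.mean(axis=1)  # shape (trials, 3)

single_second_moment = np.mean(np.sum(single_noise ** 2, axis=-1))
batch_second_moment = np.mean(np.sum(batch_noise ** 2, axis=-1))
ratio = single_second_moment / batch_second_moment
```

With independent samples, `ratio` should come out close to `batch_size`, and the empirical mean of `batch_noise` should be close to zero, consistent with the approximation being unbiased.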
2.4 Second-Order Guarantee
Due to space limitations, we will only outline the main results that lead to a second-order guarantee for the stochastic approximation algorithm (21) and refer the reader to Vlaski19single for a thorough derivation of the result. We begin by formalizing the space decomposition into first and second-order stationary points as well as strict saddle-points. [Sets] To simplify the notation in the sequel, we introduce the following sets:
where is a small positive parameter, and are constants:
and is a parameter to be chosen. Note that . For brevity, we also define the probabilities , and . Then, for all , we have . ∎ The set formalizes the set of -first-order stationary points in Definition 2.1 by setting the constant multiplying the step-size to where are problem-dependent constants. The set then corresponds to the set of second-order stationary points in Definition 2.1 while denotes the set of strict saddle-points in Definition 2.1. For a visualization, we refer the reader back to Fig. 1.
Points in both and are “undesirable” limiting points in the sense that they have local directions of descent. Our objective is to show that for iterates within both sets, algorithm (21) will continue to descend along the risk (7) by taking local gradient steps. The two sets and are distinguished by the fact that for points in , the gradient norm is large enough for a single (stochastic) gradient step to be sufficient to guarantee descent in expectation. Points in (i.e., strict saddle-points) on the other hand are more challenging since the gradient norm is so small that a single gradient step is no longer sufficient to guarantee descent.
[Descent in the large-gradient regime Vlaski19single] For sufficiently small step-sizes:
and when the gradient at is sufficiently large, i.e., , the stochastic gradient recursion (21) yields descent in expectation in one iteration, namely,
We also establish the following technical result, which bounds the negative effect of the gradient noise close to local minima :
∎ In the vicinity of strict saddle-points , a more detailed analysis is necessary. Here, it is not the gradient step that ensures descent, but rather the cumulative effect of the gradient noise perturbations to the gradient update. The definition of a strict saddle-point (43) ensures that there is a direction of negative curvature in the local risk surface, while Assumption 2.3 guarantees that with some probability the iterate is perturbed towards the descent direction. Together, these conditions allow the algorithm to escape along the descent direction with high probability in a finite number of iterations. This intuition is formalized by constructing a local short-term model based on a local quadratic approximation of the risk surface with identically distributed gradient perturbations and exploiting the smoothness conditions 22 and 27 to bound the approximation error (Vlaski19single, Lemma 3). [Descent through strict saddle-points Vlaski19single] Beginning at a strict saddle-point and iterating for iterations after with
∎ Theorem 2.4 ensures that, even when the norm of the gradient is too small to carry sufficient information about the descent direction, the gradient noise along with the negative local curvature of the risk surface around strict saddle-points is sufficient to guarantee descent in iterations, where the escape-time scales favorably with problem parameters. For example, the escape time scales logarithmically with the problem dimension , implying that we can expect fast evasion of saddle-points even in high dimensions. Having established descent both in the large-gradient regime and strict-saddle point regime, we can combine the results to conclude eventual second-order stationarity. [Second-order guarantee for stochastic gradient descent Vlaski19single] Suppose . Then, for sufficiently small step-sizes , we have with probability , that , i.e., and in at most iterations, where
the quantity denotes the sub-optimality at the initialization and denotes the escape time from Theorem 2.4.
3 Federated Learning
In many large-scale applications, data is not available at a central processor, but is instead collected and processed at distributed locations. In this section, we consider a multi-agent setting with a collection of agents and a central node for parameter aggregation. We associate with each agent its own risk (1), indexed by :
and would like to pursue the minimizer of:
where denote positive weights, normalized to add up to one without loss of generality, i.e., . The problem of distributed minimization of (54) in the presence of a centralized processor, but without the aggregation of raw data can be achieved using primal and primal-dual approaches Zinkevich10; Duchi09. More recently, federated learning has emerged as a framework for the solution of (54) under considerations of asynchrony, heterogeneity, communication and computational restrictions and privacy concerns as they are encountered in practical applications mcmahan16. We show in the sequel that a version of the federated averaging algorithm mcmahan16 can be interpreted as the construction of a particular choice for the stochastic gradient approximation , and hence second-order guarantees can be obtained directly by specializing the results from Section 2. As a baseline, consider the true gradient update to (54), which takes the form:
Just like its single-agent counter-part (19), recursion (55) has the drawback of requiring statistical information about to evaluate the expectations in (53). Additionally, (55) requires full and synchronous participation of all agents at every iteration by providing (or approximating) the local gradient . The former issue can be addressed by employing stochastic gradient approximations based on realizations of data from the distribution of , while the latter issue can be relaxed by allowing for partial participation of agents. To this end, at every iteration , we sample agent-indices without replacement from the set to form . We introduce the participation indicator function:
Then, at every iteration, the global model is broadcast to participating agents, which collect local data and perform the update:
The central processor can then aggregate the intermediate estimates from the participating agents and compute:
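The broadcast, local update, and aggregation steps described above can be sketched as a single round. This is a simplified illustration rather than the exact recursion: it assumes one local stochastic gradient step per participating agent and a uniform average at the central processor, and the quadratic local risks, the `centers` array, and all function names are hypothetical.

```python
import numpy as np


def federated_round(w, local_grad_approx, num_agents, num_participants, mu, rng):
    """One round: sample agents, take local steps, average the estimates."""
    # Sample participating agent indices without replacement.
    participants = rng.choice(num_agents, size=num_participants, replace=False)
    # Each participant starts from the broadcast model and takes one local step.
    local_estimates = [w - mu * local_grad_approx(k, w, rng) for k in participants]
    # Uniform aggregation at the central processor (an assumption of this sketch).
    return np.mean(local_estimates, axis=0)


# Hypothetical local risks 0.5*||w - centers[k]||^2 observed through noisy gradients.
centers = np.random.default_rng(1).standard_normal((10, 2))


def local_grad_approx(k, w, rng):
    return (w - centers[k]) + 0.1 * rng.standard_normal(w.size)


rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(3000):
    w = federated_round(w, local_grad_approx, num_agents=10,
                        num_participants=3, mu=0.01, rng=rng)
```

Averaging the local estimates is the same as taking one step along the average of the participants' gradient approximations, so partial participation acts as an additional source of gradient noise around the minimizer of the uniformly weighted aggregate risk, which here is the mean of `centers`.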
We argue in the sequel that the approximation:
can be viewed as an instance of the stochastic approximation introduced in Section 2 and hence the results from the single-agent analysis apply. In addition to assuming each satisfies Assumptions 2.3–2.3, we will impose the following bound on the agent heterogeneity Swenson19; Vlaski19nonconvexP1. [Bounded gradient disagreement] For each pair of agents and , the gradient disagreement is bounded, namely, for any :
∎ Relation (61) ensures that the disagreement on the local descent direction for any pair of agents is bounded, and is weaker than the more common assumption of uniformly bounded absolute gradients. From Jensen’s inequality, we similarly bound the deviation from the aggregate gradient:
[Federated averaging as a centralized stochastic gradient approximation] We define the local gradient approximation:
We then have:
where we used the fact that . For the aggregate risk we then find:
For the fourth-order moment we have:
where follows from the convexity of and Jensen’s inequality. For the local gradient noise terms we have: