1 Introduction and Related Work
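We consider a collection of $N$ agents, where each agent $k$ is equipped with a local stochastic cost function:
(1) $J_k(w) \triangleq \mathbb{E}\, Q_k(w; x_k)$
where $w$ denotes a parameter vector and $x_k$ denotes the random data at agent $k$. We construct the global cost function:
(2) $J(w) \triangleq \sum_{k=1}^{N} p_k J_k(w)$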
where the $p_k$ denote convex combination weights that add up to one, i.e., $\sum_{k=1}^{N} p_k = 1$. When data realizations $x_{k,i}$ for all agents $k$ can be aggregated at a central location, descent along the negative gradient of (2) can be approximated by means of a centralized stochastic gradient algorithm of the form [22, 26]:
(3) $w_i = w_{i-1} - \mu\, \widehat{\nabla J}(w_{i-1})$
where $\mu$ denotes the step-size and $\widehat{\nabla J}(\cdot)$ denotes a stochastic gradient approximation constructed at time $i$. One possible construction is to let:
(4) $\widehat{\nabla J}(w) = \sum_{k=1}^{N} p_k \nabla Q_k(w; x_{k,i})$
which is obtained by employing a weighted combination of instantaneous approximations using all realizations available at time $i$. This construction requires the evaluation of $N$ (stochastic) gradients per iteration. If computational constraints limit the number of gradient evaluations per iteration to one, we can instead randomly sample an agent location $k_i$ from the available data and let:
(5) $\widehat{\nabla J}(w) = \nabla Q_{k_i}(w; x_{k_i,i})$
The evident drawback of such a simplified centralized strategy is that only one sample is processed and a large number of samples is discarded at every iteration. We can hence expect the construction (4) to result in better performance relative to the simplified choice (5). When communication constraints limit the exchange of information among agents, we can instead appeal to decentralized strategies. For the purpose of this work, we shall focus on the standard diffusion strategy, which takes the form:
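(6a) $\phi_{k,i} = w_{k,i-1} - \mu\, \widehat{\nabla J}_k(w_{k,i-1})$
(6b) $w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \phi_{\ell,i}$
where the $a_{\ell k}$ denote convex combination coefficients satisfying:
(7) $a_{\ell k} \ge 0, \quad \sum_{\ell \in \mathcal{N}_k} a_{\ell k} = 1, \quad a_{\ell k} = 0 \ \text{if} \ \ell \notin \mathcal{N}_k$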
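As a minimal sketch of one iteration of (6a)–(6b), we assume the adapt-then-combine form in which each agent first takes a local stochastic-gradient step and then averages the intermediate iterates of its neighbors. The function name `diffusion_step`, the array layout (one row per agent), and the toy quadratic costs are illustrative assumptions and do not appear in the original text.

```python
import numpy as np

def diffusion_step(W, grads, A, mu):
    """One adapt-then-combine iteration (assumed form of (6a)-(6b)).

    W     : (N, M) array, row k holds the iterate of agent k
    grads : (N, M) array, row k holds agent k's stochastic gradient at W[k]
    A     : (N, N) combination matrix with A[l, k] = a_{lk}, columns sum to one
    mu    : step-size
    """
    phi = W - mu * grads      # adaptation step at every agent
    return A.T @ phi          # combination: w_k = sum_l a_{lk} * phi_l

# Toy example: 3 fully connected agents with uniform weights (hypothetical values).
A = np.full((3, 3), 1.0 / 3.0)
targets = np.array([[0.0], [3.0], [6.0]])
W = np.zeros((3, 1))
grads = W - targets           # gradient of 0.5 * (w - target_k)^2 at each agent
W_next = diffusion_step(W, grads, A, mu=0.1)
```

With uniform weights over a fully connected network, every agent ends the iteration at the same point, namely the average of the intermediate iterates.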
The symbol $\mathcal{N}_k$ denotes the set of neighbors of agent $k$. When the graph is strongly-connected, it follows from the Perron-Frobenius theorem that the combination matrix $A = [a_{\ell k}]$ has a spectral radius of one and a single eigenvalue at one with corresponding eigenvector $p$ [26]:
(8) $A p = p, \quad \mathbb{1}^{\mathsf{T}} p = 1, \quad p_k > 0$
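The eigenvector in (8) can be computed numerically for a given combination matrix. The sketch below assumes a left-stochastic matrix (columns summing to one) and normalizes the eigenvector for the eigenvalue at one so that its entries add up to one. The helper name `perron_vector` and the $2 \times 2$ example matrix are hypothetical.

```python
import numpy as np

def perron_vector(A):
    """Return the eigenvector of A at eigenvalue one, scaled to sum to one."""
    eigvals, eigvecs = np.linalg.eig(A)
    idx = np.argmin(np.abs(eigvals - 1.0))   # locate the eigenvalue at one
    p = np.real(eigvecs[:, idx])
    return p / p.sum()                       # also fixes the sign

# Hypothetical two-agent combination matrix; each column sums to one.
A = np.array([[0.5, 0.25],
              [0.5, 0.75]])
p = perron_vector(A)
spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
```

For this example the weights work out to $p = (1/3, 2/3)$, so the second agent carries twice the weight of the first in the weighted network mean.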
Comparing the diffusion strategy (6a)–(6b) to the centralized constructions (4) or (5), we observe that the adaptation step (6a) carries the same complexity per agent as the simplified construction (5). However, since these computations are performed at $N$ agents in parallel, and the information is diffused over the network through the combination step (6b), we expect the diffusion strategy (6a)–(6b) to outperform the simplified centralized strategy (5) and more closely match the full construction (4). In fact, the spectral properties (8) of the combination weights (7) allow us to establish the following relation for the weighted network mean $w_{c,i} \triangleq \sum_{k=1}^{N} p_k w_{k,i}$ [6, 26]:
(9) $w_{c,i} = w_{c,i-1} - \mu \sum_{k=1}^{N} p_k \widehat{\nabla J}_k(w_{k,i-1})$
which almost corresponds to the centralized recursion (3) with the full gradient approximation (4), the only difference being that the stochastic gradients are evaluated at the individual iterates $w_{k,i-1}$ instead of the weighted network centroid $w_{c,i-1}$. So long as the iterates cluster around the network centroid, and under appropriate smoothness conditions on the (stochastic) gradients, it is hence to be expected that the network centroid (9) will match the performance of the full gradient approximation (4). This intuition has been studied in great detail and formalized for strongly convex cost functions, establishing that all iterates in (6a)–(6b) will actually match the centralized full gradient approximation (4) both in terms of convergence rate [6] and steady-state error [7], which implies a linear improvement over the simplified construction (5) in terms of the number of agents [26] when employing a symmetric combination policy for which $p_k = 1/N$.
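The centroid relation rests on the fact that the combination step leaves the Perron-weighted average of the intermediate iterates unchanged. A short numerical check of this property, assuming a randomly generated, strictly positive left-stochastic matrix (all names and sizes below are illustrative), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3

# Strictly positive entries give a strongly connected (and primitive) graph.
A = rng.random((N, N)) + 0.1
A /= A.sum(axis=0, keepdims=True)            # make every column sum to one

# Perron eigenvector, normalized so that its entries add up to one.
eigvals, eigvecs = np.linalg.eig(A)
p = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
p /= p.sum()

phi = rng.standard_normal((N, M))            # intermediate iterates, one row per agent
W = A.T @ phi                                # combination step
centroid_before = p @ phi                    # weighted mean before combining
centroid_after = p @ W                       # weighted mean after combining
```

Since $p^{\mathsf{T}} A^{\mathsf{T}} = (A p)^{\mathsf{T}} = p^{\mathsf{T}}$, the two centroids agree up to floating-point error, so only the adaptation step moves the weighted network mean.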
More recently, these results have been extended to the pursuit of first-order stationary points in non-convex environments [19, 29] for consensus and the exact diffusion algorithm [34]. First-order stationary points can include saddle-points and even local maxima, and can generate a bottleneck for many optimization algorithms and problem formulations [8]. Hence, the purpose of this work is to establish that linear speedup can also be expected in the escape from saddle-points and the pursuit of second-order stationary points for non-convex optimization problems. To this end, we refine and exploit recent results in [23, 24].
1.1 Related Works
Strategies for decentralized optimization include incremental strategies [4] and decentralized gradient descent (or consensus) [28], as well as the diffusion algorithm [26, 6, 32]. A second class of strategies is based on primal-dual arguments [16, 9, 15, 31, 27, 34]. While most of these algorithms are applicable to non-convex optimization problems, most performance guarantees in non-convex environments are limited to establishing convergence to first-order stationary points, i.e., points where the gradient is equal to zero [20, 30, 19, 29, 33].
Landscape analysis of commonly employed loss surfaces has uncovered that in many important settings such as tensor decomposition [11], matrix completion [13], low-rank recovery [12], as well as certain deep learning architectures [17], all local minima correspond to global minima and all other first-order stationary points have a strict-saddle property, which states that the Hessian matrix has at least one negative eigenvalue. These results have two implications. First, while first-order stationarity is a useful result in the sense that it ensures stability of the algorithm, even in non-convex environments, it is not sufficient to guarantee satisfactory performance, since first-order stationary points include strict saddle-points, which need not be globally or even locally optimal. Second, establishing the escape from strict saddle-points is sufficient to establish convergence to global optimality in all of these problems.
These observations have sparked a number of works examining second-order guarantees of local descent algorithms. Strategies for the escape from saddle-points can generally be divided into one of two classes. First, since the Hessian at every strict-saddle point, by definition, contains at least one negative eigenvalue, the descent direction can be identified by directly employing the Hessian matrix [21] or through an intermediate search for the negative curvature direction [10, 2]. The second class of strategies leverages the fact that perturbations in the initialization [18] or the update direction [11, 5, 14, 25] cause iterates of first-order algorithms to not get “stuck” in strict saddle-points, which can be shown to be unstable. Recently, these results have been extended to decentralized optimization with deterministic gradients and random initialization [1], stochastic gradients with diminishing step-size and decaying additive noise [3], and constant step-sizes [23, 24]. We establish in this work that the saddle-point escape time of the diffusion strategy (6a)–(6b) decays linearly with the number of agents in the network when symmetric combination policies are employed, and show how asymmetric combination policies can result in further improvement when agents have access to estimates of varying quality.
2 Modeling Conditions
We shall be employing the following common modeling conditions [26, 11, 14, 3]. See [23, 24] for a discussion.
[Smoothness] For each agent $k$, the gradient $\nabla J_k(\cdot)$ is Lipschitz, namely, for any $x, y$:
(10) 
Furthermore, $J_k(\cdot)$ is twice-differentiable with Lipschitz Hessian:
(11) 
For each pair of agents $k$ and $\ell$, the gradient disagreement is bounded, namely, for any $x$:
(12) 
[Gradient noise process] For each agent $k$, the gradient noise process is defined as
(13) 
and satisfies
(14a)  
(14b) 
where we denote by the filtration generated by the random processes for all and and for some nonnegative constants . We also assume that the gradient noise processes are pairwise uncorrelated over the space conditioned on . [Lipschitz covariances] The gradient noise process has a Lipschitz covariance matrix, i.e.,
(15) 
satisfies
(16) 
for some and . We shall also make the simplifying assumption. [Gradient noise lower bound] The gradient noise covariance at every agent is bounded from below:
(17) 
This condition can be loosened significantly by requiring a gradient noise component to be present only in the vicinity of strict saddle-points and only in the local descent direction; see, e.g., [14, 24]. Nevertheless, the simplified condition can always be ensured, for example by adding a small amount of isotropic noise, similar to [11, 5], and will be sufficient for the purpose of this work.
3 Convergence Analysis
3.1 Noise Variance Relations
The performance guarantees established in [23, 24] depend on the statistical properties of the weighted gradient noise term:
(18) 
Under assumptions 2–16, we can refine the bounds from [23]: [Variance Bounds] Under assumptions 2–16 we have:
(19)  
(20) 
Relations (19) and (20) follow from the pairwise uncorrelatedness condition in assumption 12 after cross-multiplying. From (19) we observe that the average noise term (18) driving the network centroid experiences a variance reduction. Specifically, in the case of uniform weights $p_k = 1/N$ and a common noise bound across agents, the bound in (19) is reduced by a factor of $N$ relative to that of a single agent. This $N$-fold reduction in gradient noise variance is at the heart of the improved performance established for strongly-convex costs [26] and in the pursuit of first-order stationary points [19]. We shall establish in the sequel that this improvement also holds in the time required to escape from undesired saddle-points.
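The variance reduction can be illustrated numerically. The sketch below does not reproduce the bounds (19)–(20); it only assumes zero-mean, mutually uncorrelated scalar noise terms with variances $\sigma_k^2$, for which the weighted sum $\sum_k p_k s_k$ has variance $\sum_k p_k^2 \sigma_k^2$. The function name and the numerical values are hypothetical.

```python
import numpy as np

def weighted_noise_variance(p, sigma2):
    """Variance of sum_k p_k * s_k for uncorrelated zero-mean s_k with Var(s_k) = sigma2[k]."""
    p = np.asarray(p, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    return float(np.sum(p**2 * sigma2))

N = 10
sigma2 = np.full(N, 4.0)                 # common noise variance at every agent
p_uniform = np.full(N, 1.0 / N)          # uniform weights
var_uniform = weighted_noise_variance(p_uniform, sigma2)

# Monte Carlo check with independent Gaussian noise (an illustrative choice).
rng = np.random.default_rng(1)
samples = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, N))
var_empirical = float(np.var(samples @ p_uniform))
```

With uniform weights and a common variance of 4, the analytic value is $4/N = 0.4$, a tenfold reduction for $N = 10$, and the empirical estimate should land close to it.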
3.2 Space Decomposition
[Sets] The parameter space is decomposed into:
(21)  
(22)  
(23) 
where is a small positive parameter, is a parameter to be chosen, and . Note that
. We also define the probabilities
, and . Then for all , we have . Points in the complement of the large-gradient set have small gradient norm and hence correspond to approximately first-order stationary points. These points are further classified into strict-saddle points, where the Hessian has a significant negative eigenvalue, and second-order stationary points. Pursuit of second-order stationary points requires descent for points in the large-gradient set as well as at strict-saddle points.
3.3 Performance Guarantees
Due to space limitations, we forego a detailed discussion on the derivation of the second-order guarantees of the diffusion algorithm (6a)–(6b) and refer the reader to [23, 24]. We instead briefly list the guarantees resulting from the variance bounds (19)–(20) and will focus on the dependence on the combination policy further below. Adjusting the theorems in [23, 24] to account for the variance bounds (19)–(20), we obtain:
[Network disagreement (4th order)] Under assumptions 2 12, the network disagreement is bounded after sufficient iterations by:
(24) 
where denotes the mixing rate of the adjacency matrix, with and , and denotes a term that is higher in order than . The argument is an adjustment of [23, Theorem 1]. This result ensures that the entire network clusters around the network centroid after sufficient iterations, allowing us to leverage it as a proxy for all agents. [Descent relation] Beginning at in the large gradient regime , we can bound:
(25) 
as long as where the relevant constants are listed in definition 3.2. The argument is an adjustment of [23, Theorem 2].
[Descent through strict saddle-points] Suppose , i.e., is approximately stationary with significant negative eigenvalue. Then, iterating for iterations after with
(26) 
guarantees
(27) 
The argument is an adjustment of [24, Theorem 1].
The descent relation (25) ensures descent in one iteration as long as the gradient norm is sufficiently large, while the guarantee (26)–(27) ensures descent even at first-order stationary points, within a number of iterations that can be bounded, as long as the Hessian has a negative eigenvalue. This ensures efficient escape from strict saddle-points. We conclude: For sufficiently small step-sizes , we have with probability , that , i.e.,
(28) 
and in at most iterations, where
(29) 
The argument is an adjustment of [24, Theorem 2].
4 Comparative Analysis
4.1 Step-Size Normalization
Note that in Theorem 3.3, both the limiting accuracy (28) and convergence rate (29) depend on the combination policy and network size through . To facilitate comparison, we shall normalize the step-size in (6a):
(30) 
Under this setting, Theorem 3.3 ensures a point satisfying
(31) 
and in at most:
(32) 
iterations with
(33) 
Note that the normalization of the step-size causes (31) to become independent of , allowing for the fair evaluation of (32) and (33) as a function of the number of agents.
4.2 Linear Speedup Using Symmetric Combination Weights
When the combination matrix is symmetric, i.e., $A = A^{\mathsf{T}}$, it follows that $p_k = 1/N$ [26]. For simplicity, in this section, we shall also assume a uniform data profile for all agents, i.e., that the upper and lower bounds on the gradient noise are common to all agents. We obtain:
[Linear Speedup for Symmetric Policies] Under the step-size normalization (30), and for symmetric combination policies with a uniform data profile across agents, the escape time simplifies to:
(34) 
The result follows immediately after cancellations.
4.3 Benefit of Employing Asymmetric Combination Weights
In this subsection, we show how employing asymmetric combination weights can be beneficial in terms of the time required to escape saddle-points when the data profile across agents is no longer uniform. In particular, we will no longer require the upper and lower bounds on the gradient noise to be common for all agents, and no longer require the combination policy to be symmetric. Instead, to simplify the derivation, we assume that the gradient noise is approximately isotropic, so that (33) can be simplified to:
(35) 
Then, we can formulate the following optimization problem to minimize the escape time over the space of valid combination policies:
(36) 
This precise optimization problem has appeared before in the pursuit of asymmetric combination policies that minimize the steady-state error of the diffusion strategy (6a)–(6b) in strongly-convex environments [26]. Its solution is available in closed form and can even be pursued in a decentralized manner, requiring only exchanges among neighbors [26].
[Metropolis-Hastings Combination Policy [26]] Under the step-size normalization (30), the asymmetric Metropolis-Hastings combination policy minimizes the approximate saddle-point escape time (35). It takes the form:
(37) 
where denotes the size of the neighborhood of agent .
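Since expression (37) is not reproduced here, the following sketch assumes the Hastings-type rule as it is commonly stated in the literature on diffusion strategies: for neighbors $\ell \neq k$, $a_{\ell k} = \sigma_k^2 / \max\{n_k \sigma_k^2, n_\ell \sigma_\ell^2\}$, with $a_{kk}$ chosen so that each column adds up to one, where $n_k$ counts agent $k$ among its own neighbors. The function name `hastings_policy`, the undirected line graph, and the noise levels are hypothetical.

```python
import numpy as np

def hastings_policy(adjacency, sigma2):
    """Build a left-stochastic matrix A with A[l, k] = a_{lk} (assumed Hastings-type rule)."""
    adj = np.asarray(adjacency, dtype=bool)
    N = adj.shape[0]
    adj = adj | np.eye(N, dtype=bool)        # every agent is its own neighbor
    n = adj.sum(axis=0)                      # neighborhood sizes, self included
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            if l != k and adj[l, k]:
                A[l, k] = sigma2[k] / max(n[k] * sigma2[k], n[l] * sigma2[l])
        A[k, k] = 1.0 - A[:, k].sum()        # remaining weight stays at agent k
    return A

# Hypothetical undirected line graph 0 - 1 - 2 with increasingly noisy agents.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
sigma2 = np.array([1.0, 2.0, 4.0])
A = hastings_policy(adjacency, sigma2)

p_target = (1.0 / sigma2) / np.sum(1.0 / sigma2)   # weights proportional to 1 / sigma_k^2
cost_hastings = float(np.sum(p_target**2 * sigma2))
cost_uniform = float(np.sum(sigma2) / len(sigma2) ** 2)
```

Under this rule the Perron weights come out proportional to $1/\sigma_k^2$, so less noisy agents are trusted more, and the weighted noise term $\sum_k p_k^2 \sigma_k^2$ falls below its value for uniform weights (here $4/7$ versus $7/9$).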
5 Simulations
We construct a sample landscape to verify the linear speedup in the size of the network indicated by the analysis in this work. The loss function is constructed from a single-layer neural network with a linear hidden layer and a logistic activation function for the output layer. Penalizing this architecture with the cross-entropy loss gives:
(38) 
where the arguments are the weights of the individual layers, the feature vector, and the class variable. It can be verified that this loss has a single strict saddle-point at the origin and global minima in the positive and negative quadrant, respectively [24]. We show the evolution of the function value at the network centroid under the step-size normalization rule (30) and observe a linear speedup in $N$, consistent with (34), while noting no significant differences in steady-state performance, which is consistent with (31).
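The strict-saddle structure at the origin can be checked numerically on a simplified version of this landscape. The sketch below is an assumption-laden stand-in for (38): it uses scalar layer weights, a scalar feature, labels in $\{-1, +1\}$, the empirical loss $\frac{1}{T}\sum_t \log(1 + e^{-\gamma_t w_1 w_2 h_t})$, and synthetic data in which the feature is positively correlated with the label. None of these choices are taken from the original experiment.

```python
import numpy as np

def loss(w, gamma, h):
    """Empirical cross-entropy loss of a scalar two-layer model (assumed form)."""
    z = gamma * w[0] * w[1] * h
    return float(np.mean(np.logaddexp(0.0, -z)))

def numerical_gradient(f, x, eps=1e-3):
    """Central-difference gradient of f at x."""
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def numerical_hessian(f, x, eps=1e-3):
    """Central-difference Hessian of f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# Synthetic data (hypothetical): the feature is correlated with the class label.
rng = np.random.default_rng(0)
gamma = rng.choice([-1.0, 1.0], size=5000)
h = gamma + 0.5 * rng.standard_normal(5000)

f = lambda w: loss(w, gamma, h)
origin = np.zeros(2)
grad_at_origin = numerical_gradient(f, origin)
hessian_eigs = np.linalg.eigvalsh(numerical_hessian(f, origin))
```

In this simplified model the gradient vanishes at the origin while the Hessian has one negative and one positive eigenvalue, which is the strict-saddle property discussed in Section 1.1.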
References
[1] (Sep. 2018) Second-order guarantees of distributed gradient algorithms. available as arXiv:1809.08694.
[2] (Dec. 2018) NEON2: finding local minima via first-order oracles. In Proc. of NIPS, pp. 3716–3726.
[3] (Mar. 2019) Annealing for distributed global optimization. available as arXiv:1903.07258.
[4] (Apr. 1997) A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7 (4), pp. 913–926.
[5] (Feb. 2019) Stochastic gradient descent escapes saddle points efficiently. available as arXiv:1902.04811.
[6] (Jun. 2015) On the learning behavior of adaptive networks – Part I: transient analysis. IEEE Transactions on Information Theory 61 (6), pp. 3487–3517.
[7] (Jun. 2015) On the learning behavior of adaptive networks – Part II: performance analysis. IEEE Transactions on Information Theory 61 (6), pp. 3518–3548.
[8] (2017) Gradient descent can take exponential time to escape saddle points. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1067–1077.
[9] (Mar. 2012) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606.
[10] (2018) SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. of NIPS, pp. 689–699.
[11] (2015) Escaping from saddle points – online stochastic gradient for tensor decomposition. In Proc. of Conference on Learning Theory, Paris, France, pp. 797–842.

[12] (2017) No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pp. 1233–1242.
[13] (2016) Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pp. 2973–2981.
[14] (Mar. 2018) Escaping saddles with stochastic gradients. available as arXiv:1803.05999.
[15] (2014) Communication-efficient distributed dual coordinate ascent. In Proc. International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3068–3076.
[16] (Aug. 2011) Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication. IEEE Transactions on Signal Processing 59 (8), pp. 3889–3902.
[17] (2016) Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594.
[18] (2016) Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, New York, pp. 1246–1257.
[19] (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, pp. 5330–5340.
[20] (Jun. 2016) NEXT: in-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks 2 (2), pp. 120–136.
[21] (Aug. 2006) Cubic regularization of Newton method and its global performance. Mathematical Programming 108 (1), pp. 177–205.
[22] (1997) Introduction to Optimization. Optimization Software.
[23] (Jul. 2019) Distributed learning in non-convex environments – Part I: Agreement at a linear rate. submitted for publication, available as arXiv:1907.01848.
[24] (Jul. 2019) Distributed learning in non-convex environments – Part II: Polynomial escape from saddle-points. submitted for publication, available as arXiv:1907.01849.
[25] (Aug. 2019) Second-order guarantees of stochastic gradient descent in non-convex optimization. submitted for publication, available as arXiv:1908.07023.
[26] (Jul. 2014) Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7 (4-5), pp. 311–801.
[27] (2015) EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25 (2), pp. 944–966.
[28] (Dec. 2010) Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147 (3), pp. 516–545.
[29] (2018) $D^2$: Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 4848–4856.
[30] (Aug. 2017) Nonconvex distributed optimization. IEEE Transactions on Automatic Control 62 (8), pp. 3744–3757.
[31] (Jun. 2012) Distributed dual averaging for convex optimization under communication delays. In Proc. American Control Conference (ACC), Montreal, Canada, pp. 1067–1072.
[32] (Sep. 2019) Regularized diffusion adaptation via conjugate smoothing. available as arXiv:1909.09417.
[33] (Jan. 2019) Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing 78 (1), pp. 29–63.
[34] (Feb. 2019) Exact diffusion for distributed optimization and learning – Part I: algorithm development. IEEE Transactions on Signal Processing 67 (3), pp. 708–723.