Linear Speedup in Saddle-Point Escape for Decentralized Non-Convex Optimization

Stefan Vlaski et al., October 30, 2019

Under appropriate cooperation protocols and parameter choices, fully decentralized solutions for stochastic optimization have been shown to match the performance of centralized solutions and result in a linear speedup (in the number of agents) relative to non-cooperative approaches in the strongly-convex setting. More recently, these results have been extended to the pursuit of first-order stationary points in non-convex environments. In this work, we examine in detail the dependence of second-order convergence guarantees on the spectral properties of the combination policy for non-convex multi-agent optimization. We establish a linear speedup in saddle-point escape time in the number of agents for symmetric combination policies and study the potential for further improvement by employing asymmetric combination weights. The results imply that a linear speedup can be expected in the pursuit of second-order stationary points, which exclude local maxima as well as strict saddle-points and correspond to local or even global minima in many important learning settings.


1 Introduction and Related Work

We consider a collection of $N$ agents, where each agent $k$ is equipped with a local stochastic cost function:

$J_k(w) \triangleq \mathbb{E}_{\boldsymbol{x}_k}\, Q(w; \boldsymbol{x}_k)$ (1)

where $w \in \mathbb{R}^{M}$ denotes a parameter vector and $\boldsymbol{x}_k$ denotes the random data at agent $k$. We construct the global cost function:

$J(w) \triangleq \sum_{k=1}^{N} p_k\, J_k(w)$ (2)

where the $p_k$ denote convex combination weights that add up to one, i.e., $\sum_{k=1}^{N} p_k = 1$. When data realizations for all agents can be aggregated at a central location, descent along the negative gradient of (2) can be approximated by means of a centralized stochastic gradient algorithm of the form [22, 26]:

$\boldsymbol{w}_i = \boldsymbol{w}_{i-1} - \mu\, \widehat{\nabla J}(\boldsymbol{w}_{i-1})$ (3)

where $\widehat{\nabla J}(\cdot)$ denotes a stochastic gradient approximation constructed at time $i$. One possible construction is to let:

$\widehat{\nabla J}(\boldsymbol{w}_{i-1}) = \sum_{k=1}^{N} p_k\, \nabla Q(\boldsymbol{w}_{i-1}; \boldsymbol{x}_{k,i})$ (4)

which is obtained by employing a weighted combination of instantaneous approximations using all $N$ realizations available at time $i$. This construction requires the evaluation of $N$ (stochastic) gradients per iteration. If computational constraints limit the number of gradient evaluations per iteration to one, we can instead randomly sample an agent location $\boldsymbol{k}$ from the available data and let:

$\widehat{\nabla J}(\boldsymbol{w}_{i-1}) = \nabla Q(\boldsymbol{w}_{i-1}; \boldsymbol{x}_{\boldsymbol{k},i})$ (5)

The evident drawback of such a simplified centralized strategy is that only one sample is processed and a large number of samples is discarded at every iteration. We can hence expect the construction (4) to result in better performance relative to the simplified choice (5). When communication constraints limit the exchange of information among agents, we can instead appeal to decentralized strategies. For the purpose of this work, we shall focus on the standard diffusion strategy, which takes the form:

$\boldsymbol{\phi}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu\, \nabla Q(\boldsymbol{w}_{k,i-1}; \boldsymbol{x}_{k,i})$ (6a)
$\boldsymbol{w}_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\phi}_{\ell,i}$ (6b)

where the $a_{\ell k}$ denote convex combination coefficients satisfying:

$a_{\ell k} \ge 0, \quad \sum_{\ell \in \mathcal{N}_k} a_{\ell k} = 1, \quad a_{\ell k} = 0 \text{ if } \ell \notin \mathcal{N}_k$ (7)

The symbol $\mathcal{N}_k$ denotes the set of neighbors of agent $k$. When the graph is strongly-connected, it follows from the Perron-Frobenius theorem that the combination matrix $A \triangleq [a_{\ell k}]$ has a spectral radius of one and a single eigenvalue at one, with corresponding Perron eigenvector $p$ [26]:

$A p = p, \quad \mathds{1}^{\mathsf{T}} p = 1, \quad p_k > 0$ (8)

Comparing the diffusion strategy (6a)–(6b) to the centralized constructions (4) or (5), we observe that the adaptation step (6a) carries the same complexity per agent as the simplified construction (5). However, since these computations are performed at all $N$ agents in parallel, and the information is diffused over the network through the combination step (6b), we expect the diffusion strategy (6a)–(6b) to outperform the simplified centralized strategy (5) and more closely match the full construction (4). In fact, the spectral properties (8) of the combination weights (7) allow us to establish the following relation for the weighted network mean $\boldsymbol{w}_{c,i} \triangleq \sum_{k=1}^{N} p_k \boldsymbol{w}_{k,i}$ [6, 26]:

$\boldsymbol{w}_{c,i} = \boldsymbol{w}_{c,i-1} - \mu \sum_{k=1}^{N} p_k\, \nabla Q(\boldsymbol{w}_{k,i-1}; \boldsymbol{x}_{k,i})$ (9)

which almost corresponds to the centralized recursion (3) with the full gradient approximation (4); the only difference is that the stochastic gradients are evaluated at the individual iterates $\boldsymbol{w}_{k,i-1}$ instead of the weighted network centroid $\boldsymbol{w}_{c,i-1}$. So long as the iterates cluster around the network centroid, and under appropriate smoothness conditions on the (stochastic) gradients, it is hence to be expected that the network centroid (9) will match the performance of the full gradient approximation (4). This intuition has been studied in great detail and formalized for strongly-convex cost functions, establishing that all iterates in (6a)–(6b) will actually match the centralized full-gradient recursion (3)–(4) both in terms of convergence rate [6] and steady-state error [7], which implies a linear improvement over the simplified construction (5) in terms of the number of agents [26] when employing a symmetric combination policy for which $A = A^{\mathsf{T}}$.
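To make the interplay between (6a)–(6b), the combination weights (7), and the Perron vector (8) concrete, the following minimal Python sketch simulates the diffusion strategy; the quadratic local costs, Gaussian gradient noise, and ring topology are illustrative assumptions rather than part of the analysis above.

```python
import numpy as np

# Minimal sketch of the diffusion strategy (6a)-(6b) on illustrative
# quadratic local costs J_k(w) = 0.5 * ||w - t_k||^2 with noisy gradients.
rng = np.random.default_rng(0)
N, M, mu, sigma = 10, 2, 0.05, 0.5
targets = rng.normal(size=(N, M))        # minimizer t_k of each local cost

# Left-stochastic combination matrix satisfying (7): A[l, k] = a_{lk} is the
# weight agent k assigns to neighbor l; here, uniform averaging on a ring.
A = np.zeros((N, N))
for k in range(N):
    for l in (k - 1, k, k + 1):
        A[l % N, k] = 1.0 / 3.0

# Perron eigenvector of (8): eigenvalue one, positive entries, sums to one.
vals, vecs = np.linalg.eig(A)
p = np.real(vecs[:, np.argmax(np.real(vals))])
p = p / p.sum()

w = np.zeros((N, M))
for i in range(500):
    grads = (w - targets) + sigma * rng.normal(size=(N, M))  # stochastic grads
    phi = w - mu * grads                  # adaptation step (6a)
    w = A.T @ phi                         # combination step (6b)

print("centroid:", p @ w)                 # clusters near mean of the t_k
```

Running the sketch shows the centroid $p^{\mathsf{T}} \boldsymbol{w}$ settling near the minimizer of the aggregate cost (2), in line with the centroid recursion (9).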

More recently, these results have been extended to the pursuit of first-order stationary points in non-convex environments [19, 29] for consensus and the exact diffusion algorithm [34]. First-order stationary points can include saddle-points and even local maxima, which can create a bottleneck for many optimization algorithms and problem formulations [8]. Hence, the purpose of this work is to establish that a linear speedup can also be expected in the escape from saddle-points and the pursuit of second-order stationary points in non-convex optimization problems. To this end, we refine and exploit recent results in [23, 24].

1.1 Related Works

Strategies for decentralized optimization include incremental strategies [4], and decentralized gradient descent (or consensus) [28], as well as the diffusion algorithm [26, 6, 32]. A second class of strategies is based on primal-dual arguments [16, 9, 15, 31, 27, 34]. While most of these algorithms are applicable to non-convex optimization problems, most performance guarantees in non-convex environments are limited to establishing convergence to first-order stationary points, i.e., points where the gradient is equal to zero [20, 30, 19, 29, 33].

Landscape analysis of commonly employed loss surfaces has uncovered that in many important settings, such as tensor decomposition [11], matrix completion [13], low-rank recovery [12], as well as certain deep learning architectures [17], all local minima correspond to global minima and all other first-order stationary points have a strict-saddle property, which states that the Hessian matrix has at least one negative eigenvalue. These results have two implications. First, while first-order stationarity is a useful result in the sense that it ensures stability of the algorithm, even in non-convex environments, it is not sufficient to guarantee satisfactory performance, since first-order stationary points include strict saddle-points, which need not be globally or even locally optimal. On the other hand, establishing the escape from strict saddle-points is sufficient to establish convergence to global optimality in all of these problems.

These observations have sparked a number of works examining second-order guarantees of local descent algorithms. Strategies for the escape from saddle-points can generally be divided into one of two classes. First, since the Hessian at every strict saddle-point, by definition, contains at least one negative eigenvalue, the descent direction can be identified by directly employing the Hessian matrix [21] or through an intermediate search for the negative-curvature direction [10, 2]. The second class of strategies leverages the fact that perturbations in the initialization [18] or the update direction [11, 5, 14, 25] cause iterates of first-order algorithms to not get “stuck” in strict saddle-points, which can be shown to be unstable. Recently, these results have been extended to decentralized optimization with deterministic gradients and random initialization [1], with stochastic gradients, diminishing step-sizes, and decaying additive noise [3], and with constant step-sizes [23, 24]. We establish in this work that the saddle-point escape time of the diffusion strategy (6a)–(6b) decays linearly with the number of agents in the network when symmetric combination policies are employed, and show how asymmetric combination policies can result in further improvement when agents have access to estimates of varying quality.

2 Modeling Conditions

We shall be employing the following common modeling conditions [26, 11, 14, 3]; see [23, 24] for a discussion.

[Smoothness] For each agent $k$, the gradient $\nabla J_k(\cdot)$ is Lipschitz, namely, for any $x, y \in \mathbb{R}^M$:

$\| \nabla J_k(x) - \nabla J_k(y) \| \le \delta \| x - y \|$ (10)

Furthermore, $J_k(\cdot)$ is twice-differentiable with Lipschitz Hessian:

$\| \nabla^2 J_k(x) - \nabla^2 J_k(y) \| \le \rho \| x - y \|$ (11)

For each pair of agents $k$ and $\ell$, the gradient disagreement is bounded, namely, for any $x \in \mathbb{R}^M$:

$\| \nabla J_k(x) - \nabla J_\ell(x) \| \le G$ (12)

[Gradient noise process] For each agent $k$, the gradient noise process is defined as

$\boldsymbol{s}_{k,i}(w) \triangleq \nabla Q(w; \boldsymbol{x}_{k,i}) - \nabla J_k(w)$ (13)

and satisfies

$\mathbb{E}\left[ \boldsymbol{s}_{k,i}(w) \,\middle|\, \mathcal{F}_{i-1} \right] = 0$ (14a)
$\mathbb{E}\left[ \| \boldsymbol{s}_{k,i}(w) \|^4 \,\middle|\, \mathcal{F}_{i-1} \right] \le \sigma_k^4$ (14b)

where we denote by $\mathcal{F}_{i-1}$ the filtration generated by the random processes $\boldsymbol{w}_{\ell,j}$ for all agents $\ell$ and times $j \le i-1$, and the $\sigma_k$ are non-negative constants. We also assume that the gradient noise processes are pairwise uncorrelated over the space conditioned on $\mathcal{F}_{i-1}$. [Lipschitz covariances] The gradient noise process has a Lipschitz covariance matrix, i.e.,

$R_{s,k}(w) \triangleq \mathbb{E}\left[ \boldsymbol{s}_{k,i}(w)\, \boldsymbol{s}_{k,i}(w)^{\mathsf{T}} \,\middle|\, \mathcal{F}_{i-1} \right]$ (15)

satisfies

$\| R_{s,k}(x) - R_{s,k}(y) \| \le \beta_R \| x - y \|^{\gamma}$ (16)

for some $\beta_R \ge 0$ and $0 < \gamma \le 4$. We shall also make the following simplifying assumption. [Gradient noise lower bound] The gradient noise covariance at every agent is bounded from below:

$R_{s,k}(w) \ge \underline{\sigma}_k^2\, I_M$ (17)

This condition can be loosened significantly by requiring a gradient noise component to be present only in the vicinity of strict saddle-points and only in the local descent direction; see, e.g., [14, 24]. Nevertheless, the simplified condition can always be ensured, for example by adding a small amount of isotropic noise, similar to [11, 5], and will be sufficient for the purpose of this work.
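As a concrete illustration of this last remark, the following sketch wraps an arbitrary stochastic gradient oracle so that the lower bound (17) holds by construction; the function name and noise level are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def perturbed_gradient(grad_fn, w, sigma_low=1e-3):
    """Evaluate a stochastic gradient and add a small amount of isotropic
    Gaussian noise, so that the gradient-noise covariance is bounded from
    below by sigma_low**2 * I, as required by condition (17)."""
    g = grad_fn(w)
    return g + sigma_low * rng.normal(size=g.shape)
```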

3 Convergence Analysis

3.1 Noise Variance Relations

The performance guarantees established in [23, 24] depend on the statistical properties of the weighted gradient noise term:

$\boldsymbol{s}_i \triangleq \sum_{k=1}^{N} p_k\, \boldsymbol{s}_{k,i}(\boldsymbol{w}_{k,i-1})$ (18)

Under the assumptions of Section 2, we can refine the bounds from [23]: [Variance Bounds] Under the assumptions of Section 2, we have:

$\mathbb{E}\left[ \| \boldsymbol{s}_i \|^2 \,\middle|\, \mathcal{F}_{i-1} \right] \le \sum_{k=1}^{N} p_k^2\, \sigma_k^2$ (19)
(20)

Relations (19) and (20) follow from the pairwise uncorrelatedness condition in the gradient noise assumption after cross-multiplying. From (19) we observe that the average noise term (18) driving the network centroid experiences a variance reduction. Specifically, in the case when $p_k = 1/N$ and $\sigma_k^2 = \sigma^2$ for all $k$, we would obtain $\mathbb{E}\left[ \| \boldsymbol{s}_i \|^2 \,\middle|\, \mathcal{F}_{i-1} \right] \le \sigma^2 / N$. This $N$-fold reduction in gradient noise variance is at the heart of the improved performance established for strongly-convex costs [26] and in the pursuit of first-order stationary points [19]. We shall establish in the sequel that this improvement also holds in the time required to escape from undesired saddle-points.
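The variance reduction implied by (19) is easy to verify numerically. The sketch below draws independent (hence pairwise uncorrelated) per-agent noise terms and compares the variance of a single agent's noise with that of the uniformly weighted average (18); the Gaussian noise model is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, trials, sigma = 20, 5, 100_000, 1.0

# Independent per-agent noise realizations s_{k,i}.
s = sigma * rng.normal(size=(trials, N, M))

# Weighted average (18) with uniform Perron weights p_k = 1/N.
s_avg = s.mean(axis=1)

var_single = np.mean(np.sum(s[:, 0, :] ** 2, axis=1))  # E ||s_{1,i}||^2
var_avg = np.mean(np.sum(s_avg ** 2, axis=1))          # E ||s_i||^2
print(f"single: {var_single:.3f}, average: {var_avg:.3f}, "
      f"ratio = {var_single / var_avg:.1f} (expected N = {N})")
```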

3.2 Space Decomposition

[Sets] The parameter space $\mathbb{R}^M$ is decomposed into:

$\mathcal{G} \triangleq \left\{ w : \| \nabla J(w) \|^2 \ge \mu c_1 \right\}$ (21)
$\mathcal{H} \triangleq \left\{ w : \| \nabla J(w) \|^2 < \mu c_1,\ \lambda_{\min}\left( \nabla^2 J(w) \right) \le -\tau \right\}$ (22)
$\mathcal{M} \triangleq \left\{ w : \| \nabla J(w) \|^2 < \mu c_1,\ \lambda_{\min}\left( \nabla^2 J(w) \right) > -\tau \right\}$ (23)

where $\tau$ is a small positive parameter, $c_1$ is a parameter to be chosen, and $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue. Note that $\mathcal{G}^{c} = \mathcal{H} \cup \mathcal{M}$. We also define the probabilities $\pi_i^{\mathcal{G}} = \Pr\left( \boldsymbol{w}_{c,i} \in \mathcal{G} \right)$, $\pi_i^{\mathcal{H}} = \Pr\left( \boldsymbol{w}_{c,i} \in \mathcal{H} \right)$, and $\pi_i^{\mathcal{M}} = \Pr\left( \boldsymbol{w}_{c,i} \in \mathcal{M} \right)$. Then for all $i$, we have $\pi_i^{\mathcal{G}} + \pi_i^{\mathcal{H}} + \pi_i^{\mathcal{M}} = 1$. Points in the complement of $\mathcal{G}$ have small gradient norm and hence correspond to approximately first-order stationary points. These points are further classified into strict saddle-points in $\mathcal{H}$, where the Hessian has a significant negative eigenvalue, and second-order stationary points in $\mathcal{M}$. Pursuit of second-order stationary points requires descent for points in $\mathcal{G}$ as well as $\mathcal{H}$.
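The decomposition can be summarized by a small helper that classifies a point from its gradient and Hessian; the thresholds mirror (21)–(23), and the constants $c_1$ and $\tau$ below are placeholder values for illustration.

```python
import numpy as np

def classify_point(grad, hess, mu, c1=1.0, tau=0.1):
    """Classify a point into the sets (21)-(23):
    'G' -- large gradient norm,
    'H' -- approximately stationary with a significant negative eigenvalue,
    'M' -- approximately second-order stationary."""
    if np.sum(grad ** 2) >= mu * c1:
        return "G"
    lam_min = np.linalg.eigvalsh(hess)[0]  # smallest eigenvalue of the Hessian
    return "H" if lam_min <= -tau else "M"

# Example: the saddle of J(w) = (w1**2 - w2**2)/2 at the origin lands in H.
grad = np.zeros(2)                          # gradient vanishes at the saddle
hess = np.diag([1.0, -1.0])                 # one negative eigenvalue
print(classify_point(grad, hess, mu=1e-3))  # -> 'H'
```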

3.3 Performance Guarantees

Due to space limitations, we forego a detailed discussion of the derivation of the second-order guarantees of the diffusion algorithm (6a)–(6b) and refer the reader to [23, 24]. We instead briefly list the guarantees resulting from the variance bounds (19)–(20) and focus on the dependence on the combination policy further below. Adjusting the theorems in [23, 24] to account for the variance bounds (19)–(20), we obtain: [Network disagreement (4th order)] Under the assumptions of Section 2, the network disagreement is bounded after sufficient iterations by:

(24)

where $\lambda_2 < 1$ denotes the mixing rate of the combination matrix $A$, the remaining constants are as in [23, Theorem 1], and the final term is higher-order in the step-size $\mu$. The argument is an adjustment of [23, Theorem 1]. This result ensures that the entire network clusters around the network centroid after sufficient iterations, allowing us to use the centroid as a proxy for all agents. [Descent relation] Beginning at $\boldsymbol{w}_{c,i_0}$ in the large-gradient regime $\mathcal{G}$, we can bound:

(25)

as long as the step-size $\mu$ is sufficiently small, where the relevant constants are listed in Definition 3.2. The argument is an adjustment of [23, Theorem 2]. [Descent through strict saddle-points] Suppose $\boldsymbol{w}_{c,i_0} \in \mathcal{H}$, i.e., the centroid is approximately stationary with a significant negative eigenvalue. Then, iterating for $i^s$ iterations after $i_0$, with

(26)

guarantees

(27)

The argument is an adjustment of [24, Theorem 1]. Theorem 3.3 ensures descent in a single iteration as long as the gradient norm is sufficiently large, while Theorem 3.3 ensures descent, within a bounded number of iterations, even for first-order stationary points, as long as the Hessian has a negative eigenvalue. This guarantees efficient escape from strict saddle-points. We conclude: For sufficiently small step-sizes $\mu$, we have, with probability $1 - \pi$, that $\boldsymbol{w}_{c,i^{\star}} \in \mathcal{M}$, i.e.,

(28)

and in at most $i^{\star}$ iterations, where

(29)

The argument is an adjustment of [24, Theorem 2].

4 Comparative Analysis

4.1 Step-Size Normalization

Note that in Theorem 3.3, both the limiting accuracy (28) and the convergence rate (29) depend on the combination policy and network size through the Perron weights $p_k$ and the noise profile $\sigma_k$, $\underline{\sigma}_k$ (cf. (19)). To facilitate comparison, we shall normalize the step-size in (6a):

(30)

Under this setting, Theorem 3.3 ensures a point satisfying

(31)

and in at most:

(32)

iterations with

(33)

Note that the normalization of the step-size causes (31) to become independent of the combination policy and network size, allowing for a fair evaluation of (32) and (33) as a function of the number of agents.

4.2 Linear Speedup Using Symmetric Combination Weights

When the combination matrix is symmetric, i.e., $A = A^{\mathsf{T}}$, it follows that $p = \frac{1}{N} \mathds{1}$ [26]. For simplicity, in this section, we shall also assume a uniform data profile for all agents, i.e., that $\sigma_k = \sigma$ and $\underline{\sigma}_k = \underline{\sigma}$ for all $k$. We obtain: [Linear Speedup for Symmetric Policies] Under the step-size normalization (30), and for symmetric combination policies with the uniform data profile $\sigma_k = \sigma$ and $\underline{\sigma}_k = \underline{\sigma}$ for all $k$, the escape time simplifies to:

(34)

The result follows immediately after cancellations.
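The role of the factor $\sum_k p_k^2$ in (19) can be checked numerically: for any symmetric (hence doubly-stochastic) policy, the Perron vector is uniform and $\sum_k p_k^2 = 1/N$, which is the source of the $N$-fold gain. The ring topology below is an arbitrary example.

```python
import numpy as np

N = 8
# Symmetric ring policy: uniform averaging with both neighbors and oneself.
A = np.zeros((N, N))
for k in range(N):
    for l in (k - 1, k, k + 1):
        A[l % N, k] = 1.0 / 3.0
assert np.allclose(A, A.T)

# Perron eigenvector (8): for symmetric A it is uniform, p_k = 1/N.
vals, vecs = np.linalg.eigh(A)                # eigenvalues in ascending order
p = vecs[:, -1] / vecs[:, -1].sum()           # eigenvector at eigenvalue one
print(p)                                       # -> [1/N, ..., 1/N]
print(np.sum(p ** 2), 1.0 / N)                 # variance factor in (19) is 1/N
```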

4.3 Benefit of Employing Asymmetric Combination Weights

In this subsection, we show how employing asymmetric combination weights can be beneficial in terms of the time required to escape saddle-points when the data profile across agents is no longer uniform. In particular, we will no longer require the upper and lower bounds $\sigma_k$ and $\underline{\sigma}_k$ to be common for all agents, and will no longer require the combination policy to be symmetric. Instead, to simplify the derivation, we assume that the gradient noise is approximately isotropic, i.e., $R_{s,k}(w) \approx \sigma_k^2 I_M$, so that (33) can be simplified to:

(35)

Then, we can formulate the following optimization problem to minimize the escape time over the space of valid combination policies:

$\min_{A}\ \sum_{k=1}^{N} p_k^2\, \sigma_k^2, \quad \text{subject to } A \text{ satisfying (7) and } p \text{ satisfying (8)}$ (36)

This precise optimization problem has appeared before in the pursuit of asymmetric combination policies that minimize the steady-state error of the diffusion strategy (6a)–(6b) in strongly-convex environments [26]. Its solution is available in closed form and can even be pursued in a decentralized manner, requiring only exchanges among neighbors [26]. [Metropolis-Hastings Combination Policy [26]] Under the step-size normalization (30), the asymmetric Metropolis-Hastings combination policy minimizes the approximate saddle-point escape time (35). It takes the form:

$a_{\ell k} = \begin{cases} \sigma_k^2 \,/\, \max\left\{ n_k \sigma_k^2,\ n_\ell \sigma_\ell^2 \right\}, & \ell \in \mathcal{N}_k \setminus \{k\} \\ 1 - \sum_{m \in \mathcal{N}_k \setminus \{k\}} a_{m k}, & \ell = k \end{cases}$ (37)

where $n_k$ denotes the size of the neighborhood of agent $k$.
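A sketch of this policy for a generic graph follows; it constructs the weights (37) from per-agent noise variances and verifies the Perron properties (8), with the expected Perron entries $p_k \propto 1/\sigma_k^2$. The specific graph and variances are placeholders.

```python
import numpy as np

def hastings_weights(neighbors, sigma2):
    """Asymmetric Hastings combination policy as in (37).
    neighbors[k]: list of neighbors of agent k (including k itself);
    sigma2[k]: gradient-noise variance at agent k."""
    N = len(sigma2)
    A = np.zeros((N, N))
    for k in range(N):
        n_k = len(neighbors[k])
        for l in neighbors[k]:
            if l != k:
                A[l, k] = sigma2[k] / max(n_k * sigma2[k],
                                          len(neighbors[l]) * sigma2[l])
        A[k, k] = 1.0 - A[:, k].sum()       # weights of agent k sum to one
    return A

# Placeholder example: four agents on a line graph with heterogeneous noise.
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
sigma2 = np.array([0.5, 1.0, 2.0, 4.0])
A = hastings_weights(neighbors, sigma2)

p = (1 / sigma2) / np.sum(1 / sigma2)       # Perron vector, p_k prop. 1/sigma_k^2
print(np.allclose(A @ p, p))                # -> True, satisfying (8)
print(np.allclose(A.sum(axis=0), 1.0))      # -> True, satisfying (7)
```

Agents with noisier gradients receive smaller Perron weight, which reduces the variance factor $\sum_k p_k^2 \sigma_k^2$ in (19) relative to uniform weighting.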

5 Simulations

We construct a sample landscape to verify the linear speedup in the size of the network indicated by the analysis in this work. The loss function is constructed from a single-layer neural network with a linear hidden layer and a logistic activation function for the output layer. Penalizing this architecture with the cross-entropy loss gives:

$J(W, \theta) = \mathbb{E}\left\{ -\boldsymbol{\gamma} \log\left( \mathrm{sigm}\left( \theta^{\mathsf{T}} W \boldsymbol{h} \right) \right) - (1 - \boldsymbol{\gamma}) \log\left( 1 - \mathrm{sigm}\left( \theta^{\mathsf{T}} W \boldsymbol{h} \right) \right) \right\}$ (38)

where $W$ and $\theta$ denote the weights of the individual layers, $\boldsymbol{h}$ denotes the feature vector, $\boldsymbol{\gamma}$ is the class variable, and $\mathrm{sigm}(\cdot)$ denotes the logistic function. It can be verified that this loss has a single strict saddle-point at the origin and global minima in the positive and negative quadrants, respectively [24]. We show in Figure 1 the evolution of the function value at the network centroid under the step-size normalization rule (30) and observe a linear speedup in $N$, consistent with (34), while noting no significant differences in steady-state performance, which is consistent with (31).

Figure 1: Linear speedup in saddle-point escape time.
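A minimal reproduction of this experiment might look as follows; the scalar layer weights, data distribution, ring topology, iteration budget, and the way the step-size is scaled with $N$ are illustrative guesses standing in for the exact settings behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def stoch_grad(w, h, gamma):
    """Stochastic gradient of the cross-entropy loss (38) with scalar
    layer weights w = (w1, w2) and prediction sigm(w1 * w2 * h)."""
    w1, w2 = w
    err = sigm(w1 * w2 * h) - gamma          # prediction error
    return np.array([err * w2 * h, err * w1 * h])

def run_diffusion(N, mu, iters=3000):
    w = np.full((N, 2), 1e-3)                # initialize near the saddle at 0
    for _ in range(iters):
        h = rng.normal(size=N)
        gamma = (h > 0).astype(float)        # illustrative labeling rule
        grads = np.array([stoch_grad(w[k], h[k], gamma[k]) for k in range(N)])
        phi = w - mu * grads                                        # step (6a)
        w = (phi + np.roll(phi, 1, 0) + np.roll(phi, -1, 0)) / 3.0  # step (6b)
    return w.mean(axis=0)                    # centroid under p_k = 1/N

for N in (1, 5, 25):
    # Stand-in for the normalization (30): scale the step-size with N.
    print(N, run_diffusion(N, mu=0.02 * N))  # larger N leaves the saddle faster
```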

References

  • [1] A. Daneshmand, G. Scutari, and V. Kungurtsev (2018) Second-order guarantees of distributed gradient algorithms. Available as arXiv:1809.08694.
  • [2] Z. Allen-Zhu and Y. Li (2018) NEON2: finding local minima via first-order oracles. In Proc. of NIPS, pp. 3716–3726.
  • [3] B. Swenson, S. Kar, H. V. Poor, and J. M. F. Moura (2019) Annealing for distributed global optimization. Available as arXiv:1903.07258.
  • [4] D. P. Bertsekas (1997) A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7 (4), pp. 913–926.
  • [5] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan (2019) Stochastic gradient descent escapes saddle points efficiently. Available as arXiv:1902.04811.
  • [6] J. Chen and A. H. Sayed (2015) On the learning behavior of adaptive networks – Part I: Transient analysis. IEEE Transactions on Information Theory 61 (6), pp. 3487–3517.
  • [7] J. Chen and A. H. Sayed (2015) On the learning behavior of adaptive networks – Part II: Performance analysis. IEEE Transactions on Information Theory 61 (6), pp. 3518–3548.
  • [8] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, B. Póczos, and A. Singh (2017) Gradient descent can take exponential time to escape saddle points. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1067–1077.
  • [9] J. C. Duchi, A. Agarwal, and M. J. Wainwright (2012) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606.
  • [10] C. Fang, C. J. Li, Z. Lin, and T. Zhang (2018) SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. of NIPS, pp. 689–699.
  • [11] R. Ge, F. Huang, C. Jin, and Y. Yuan (2015) Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proc. of Conference on Learning Theory, Paris, France, pp. 797–842.
  • [12] R. Ge, C. Jin, and Y. Zheng (2017) No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pp. 1233–1242.
  • [13] R. Ge, J. D. Lee, and T. Ma (2016) Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pp. 2973–2981.
  • [14] H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann (2018) Escaping saddles with stochastic gradients. Available as arXiv:1803.05999.
  • [15] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan (2014) Communication-efficient distributed dual coordinate ascent. In Proc. International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3068–3076.
  • [16] D. Jakovetic, J. Xavier, and J. M. F. Moura (2011) Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication. IEEE Transactions on Signal Processing 59 (8), pp. 3889–3902.
  • [17] K. Kawaguchi (2016) Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594.
  • [18] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht (2016) Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, New York, pp. 1246–1257.
  • [19] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, pp. 5330–5340.
  • [20] P. D. Lorenzo and G. Scutari (2016) NEXT: in-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks 2 (2), pp. 120–136.
  • [21] Y. Nesterov and B. T. Polyak (2006) Cubic regularization of Newton method and its global performance. Mathematical Programming 108 (1), pp. 177–205.
  • [22] B. T. Polyak (1997) Introduction to Optimization. Optimization Software.
  • [23] S. Vlaski and A. H. Sayed (2019) Distributed learning in non-convex environments – Part I: Agreement at a linear rate. Submitted for publication; available as arXiv:1907.01848.
  • [24] S. Vlaski and A. H. Sayed (2019) Distributed learning in non-convex environments – Part II: Polynomial escape from saddle-points. Submitted for publication; available as arXiv:1907.01849.
  • [25] S. Vlaski and A. H. Sayed (2019) Second-order guarantees of stochastic gradient descent in non-convex optimization. Submitted for publication; available as arXiv:1908.07023.
  • [26] A. H. Sayed (2014) Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7 (4-5), pp. 311–801.
  • [27] W. Shi, Q. Ling, G. Wu, and W. Yin (2015) EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25 (2), pp. 944–966.
  • [28] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli (2010) Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147 (3), pp. 516–545.
  • [29] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu (2018) D²: Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 4848–4856.
  • [30] T. Tatarenko and B. Touri (2017) Non-convex distributed optimization. IEEE Transactions on Automatic Control 62 (8), pp. 3744–3757.
  • [31] K. I. Tsianos and M. G. Rabbat (2012) Distributed dual averaging for convex optimization under communication delays. In Proc. American Control Conference (ACC), Montreal, Canada, pp. 1067–1072.
  • [32] S. Vlaski, L. Vandenberghe, and A. H. Sayed (2019) Regularized diffusion adaptation via conjugate smoothing. Available as arXiv:1909.09417.
  • [33] Y. Wang, W. Yin, and J. Zeng (2019) Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing 78 (1), pp. 29–63.
  • [34] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed (2019) Exact diffusion for distributed optimization and learning—Part I: Algorithm development. IEEE Transactions on Signal Processing 67 (3), pp. 708–723.