Distributed Learning in Non-Convex Environments – Part I: Agreement at a Linear Rate

07/03/2019 ∙ by Stefan Vlaski, et al.

Driven by the need to solve increasingly complex optimization problems in signal processing and machine learning, there has been increasing interest in understanding the behavior of gradient-descent algorithms in non-convex environments. Most available works on distributed non-convex optimization problems focus on the deterministic setting, where exact gradients are available at each agent. In this work and its Part II, we consider stochastic cost functions, where exact gradients are replaced by stochastic approximations and the resulting gradient noise persistently seeps into the dynamics of the algorithm. We establish that the diffusion learning strategy continues to yield meaningful estimates in non-convex scenarios, in the sense that the iterates of the individual agents will cluster in a small region around the network centroid. We use this insight to motivate a short-term model for the network evolution over a finite horizon. In Part II [2] of this work, we leverage this model to establish descent of the diffusion strategy through saddle points in O(1/μ) steps and the return of approximately second-order stationary points in a polynomial number of iterations.


I Introduction

The broad objective of distributed adaptation and learning is the solution of global, stochastic optimization problems by networked agents through localized interactions and in the absence of information about the statistical properties of the data. When constant, rather than diminishing, step-sizes are employed, the resulting algorithms are adaptive in nature and are able to adapt to drifts in the data statistics. In this work, we consider a collection of $N$ agents, where each agent $k$ is equipped with a stochastic risk of the form $J_k(w) = \mathbb{E}\, Q_k(w; \boldsymbol{x}_k)$, with $Q_k(\cdot\,;\cdot)$ referring to the loss function, $w \in \mathbb{R}^M$ denoting a parameter vector, and $\boldsymbol{x}_k$ referring to the stochastic data. The expectation is over the probability distribution of the data. The objective of the network is to seek the Pareto solution:

$$w^o \triangleq \arg\min_w J(w) \triangleq \arg\min_w \sum_{k=1}^{N} p_k J_k(w) \quad (1)$$

where the $p_k$ are positive weights that are normalized to add up to one and will be specified further below; in particular, in the special case when the $p_k$ are identical, they can be removed from (1). Algorithms for the solution of (1) have been studied extensively over recent years, both with inexact [3, 4, 5, 6] and exact [7, 8, 9] gradients. Here, we focus on the following diffusion strategy, which has been shown in previous works to provide enhanced performance and stability guarantees under constant step-size learning and adaptive scenarios [10, 4]:

$$\boldsymbol{\phi}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu\, \widehat{\nabla J}_k(\boldsymbol{w}_{k,i-1}) \quad (2a)$$
$$\boldsymbol{w}_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\phi}_{\ell,i} \quad (2b)$$

where $\widehat{\nabla J}_k(\cdot)$ denotes a stochastic approximation for the true local gradient $\nabla J_k(\cdot)$. The intermediate estimate $\boldsymbol{\phi}_{k,i}$ is obtained at agent $k$ by taking a stochastic gradient update relative to the local cost $J_k(\cdot)$. The intermediate estimates are then fused across local neighborhoods, where the $a_{\ell k}$ are convex combination weights satisfying:

$$a_{\ell k} \ge 0, \quad \sum_{\ell \in \mathcal{N}_k} a_{\ell k} = 1, \quad a_{\ell k} = 0 \text{ for } \ell \notin \mathcal{N}_k \quad (3)$$

The symbol $\mathcal{N}_k$ denotes the set of neighbors of agent $k$. [Strongly-connected graph] We shall assume that the graph described by the weighted combination matrix $A = [a_{\ell k}]$ is strongly-connected [4]. This means that there exists a path with nonzero weights between any two agents in the network and, moreover, at least one agent has a nontrivial self-loop, $a_{kk} > 0$. ∎
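To make the adapt-then-combine structure of (2a)–(2b) concrete, here is a minimal simulation sketch; the network size, the quadratic local risks, and the noise level are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, mu = 5, 3, 0.01                 # agents, dimension, step-size

# Left-stochastic combination matrix A (columns sum to one), cf. (3).
A = np.full((N, N), 0.1) + 0.5 * np.eye(N)
A /= A.sum(axis=0, keepdims=True)

targets = rng.normal(size=(N, M))     # minimizers of toy risks J_k(w) = 0.5*||w - t_k||^2

def stochastic_grad(k, w):
    # True local gradient plus zero-mean gradient noise.
    return (w - targets[k]) + 0.1 * rng.normal(size=M)

W = np.zeros((N, M))                  # row k holds the iterate w_{k,i}
for i in range(2000):
    # (2a) adapt: local stochastic gradient step at every agent
    Phi = np.array([W[k] - mu * stochastic_grad(k, W[k]) for k in range(N)])
    # (2b) combine: fuse neighbors' estimates, w_{k,i} = sum_l a_{lk} phi_{l,i}
    W = A.T @ Phi
```

In this toy network every agent is a neighbor of every other agent; in general only the entries $a_{\ell k}$ with $\ell \in \mathcal{N}_k$ are nonzero.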

It then follows from the Perron–Frobenius theorem [11, 12, 4] that $A$ has a single eigenvalue at one, while all other eigenvalues are strictly inside the unit circle, so that $\rho(A) = 1$. Moreover, if we let $p$ denote the right-eigenvector of $A$ that is associated with the eigenvalue at one, and if we normalize the entries of $p$ to add up to one, then it also holds that all entries of $p$ are strictly positive, i.e.,

$$A p = p, \quad \mathbb{1}^{\mathsf{T}} p = 1, \quad p_k > 0 \quad (4)$$

where the $p_k$ denote the individual entries of the Perron vector, $p = \mathrm{col}\{p_1, \ldots, p_N\}$.
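Numerically, the Perron vector can be read off an eigendecomposition; the helper below (our own illustrative function, not from the paper) recovers $p$ and checks the properties in (4) for a left-stochastic matrix such as the toy $A$ from the sketch above:

```python
import numpy as np

def perron_vector(A):
    # Right-eigenvector of the left-stochastic A for the eigenvalue at one,
    # normalized so that its entries add up to one, cf. (4).
    vals, vecs = np.linalg.eig(A)
    idx = np.argmin(np.abs(vals - 1.0))
    p = np.real(vecs[:, idx])
    p /= p.sum()                      # normalization also fixes the sign
    assert np.all(p > 0), "strong connectivity implies p_k > 0"
    return p

# The magnitude of the second-largest eigenvalue of A governs the mixing
# rate of the network, which reappears in Section II-B.
```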

I-A Related Works

The performance of the diffusion algorithm (2a)–(2b) has been studied extensively in differentiable settings [10, 5], with extensions to multi-task [13], constrained [14], and non-differentiable [15] environments. A common assumption in these works, along with others studying the behavior of distributed optimization algorithms in general, is that of convexity (or strong convexity) of the aggregate risk $J(w)$. While many problems of interest, such as least-squares estimation [4], logistic regression [4], and support vector machines [16], are convex, there has been increased interest in the optimization of non-convex cost functions. Such problems appear frequently in the design of robust estimators [17] and the training of more complex machine learning architectures such as those involving dictionary learning [18] and artificial neural networks [19].

Motivated by these applications, recent works have pursued the study of optimization algorithms for non-convex problems, both in the centralized and distributed settings [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. While some works focus on establishing convergence to a stationary point [30, 31], there has been growing interest in examining the ability of gradient descent implementations to escape from saddle points, since such points represent bottlenecks to the underlying learning problem [19]. We defer a detailed discussion on the plethora of related works on second-order guarantees [38, 20, 21, 22, 23, 33, 24, 25, 27, 26, 28, 29, 34] to Part II [2], where we will be establishing the ability of the diffusion strategy (2a)–(2b) to escape strict-saddle points efficiently. For ease of reference, the modeling conditions and results from this and related works are summarized in Table I.

The key contributions of Parts I and II of this work are three-fold. First, to the best of our knowledge, we present the first analysis establishing efficient (i.e., polynomial) escape from strict-saddle points in the distributed setting. Second, we establish that the gradient noise process is sufficient to ensure efficient escape, without the need to alter it by adding artificial forms of perturbation, interlacing steps with small and large step-sizes, or imposing a dispersive noise assumption. Third, relative to the existing literature on centralized non-convex optimization, where the focus is mostly on deterministic or finite-sum optimization, our modeling conditions are specifically tailored to the scenario of learning from stochastic streaming data. In particular, we only impose bounds on the gradient noise variance in expectation, rather than assume a bound with probability one [24, 28] or a sub-Gaussian distribution [29]. Furthermore, we assume that any Lipschitz conditions hold only on the expected stochastic gradient approximation, rather than for every realization with probability one [25, 27, 26].

| Work | Gradient | Hessian | Initialization | Perturbations | Step-size | Stationary | Saddle |
|---|---|---|---|---|---|---|---|
| **Centralized** | | | | | | | |
| [20] | Lipschitz | — | — | SGD + annealing | diminishing | ✓ | asymptotic |
| [21] | Lipschitz & bounded | Lipschitz | — | i.i.d. and bounded w.p. 1 | constant | ✓ | polynomial |
| [22] | Lipschitz | — | Random | — | constant | ✓ | asymptotic |
| [23] | Lipschitz | Lipschitz | — | Selective & bounded w.p. 1 | constant | ✓ | polynomial |
| [24] | Lipschitz | Lipschitz | — | SGD, bounded w.p. 1 | alternating | ✓ | polynomial |
| [25] | Lipschitz | Lipschitz | — | Bounded variance, Lipschitz w.p. 1 | constant | ✓ | polynomial |
| [26] | Lipschitz | Lipschitz | — | Bounded variance, Lipschitz w.p. 1 | constant | ✓ | polynomial |
| [27] | Lipschitz | Lipschitz | — | Bounded variance, Lipschitz w.p. 1 | constant | ✓ | polynomial |
| [28] | Lipschitz | Lipschitz | — | SGD, bounded w.p. 1 | constant | ✓ | polynomial |
| [29] | Lipschitz | Lipschitz | — | SGD + Gaussian | constant | ✓ | polynomial |
| **Decentralized** | | | | | | | |
| [30] | Lipschitz & bounded | — | — | — | constant | ✓ | — |
| [31] | Lipschitz | — | — | — | constant | ✓ | — |
| [32] | Lipschitz & bounded | — | — | i.i.d. | diminishing | ✓ | — |
| [33] | Lipschitz | Exists | Random | — | constant | ✓ | asymptotic |
| [34] | Bounded disagreement | — | — | SGD + annealing | diminishing | ✓ | asymptotic |
| This work | Bounded disagreement | Lipschitz | — | Bounded moments | constant | ✓ | polynomial |

TABLE I: Comparison of modeling assumptions and results for gradient-based methods. The first five columns describe modeling conditions; the last two describe results. Some entries are not explicitly stated in the respective works but are implied by other conditions; works that establish global (asymptotic) convergence of course imply escape from saddle-points. A dash denotes a condition that is not imposed or a result that is not established.

I-B Preview of Results

We first establish that in non-convex environments, as was already shown earlier in [5] for convex environments, the evolution of the individual iterates at the agents continues to be well-described by the evolution of the weighted centroid vector $\boldsymbol{w}_{c,i}$, in the sense that the iterates from across the network will cluster around this centroid after sufficient iterations. We subsequently consider two cases separately and establish descent in both of them. The first case corresponds to the region where the gradient at the network centroid is large; there we establish that descent occurs in a single iteration. The second and more challenging case occurs when the gradient norm is small, but there is a sufficiently negative eigenvalue in the Hessian matrix. We establish in Part II [2] that the recursion will continue to descend along the aggregate cost, achieving a fixed expected decrease over every $O(1/\mu)$ iterations. Combined with the first result, this descent relation allows us to provide guarantees about the second-order optimality of the returned iterates.

The flow of the argument is summarized in Fig. 1. We decompose the parameter space into the set of approximate first-order stationary points, i.e., those where the gradient norm is small, and the complement, i.e., the large-gradient regime. For the large-gradient regime, descent is established in Theorem II-C. We proceed to further decompose the set of approximate first-order stationary points into those that are $\tau$-strict-saddle, i.e., those that have a Hessian with a significant negative eigenvalue $\lambda \le -\tau$, and the complement, which are approximately second-order stationary points. For $\tau$-strict-saddle points we establish descent in Part II [2, Theorem 1]. Finally, in Part II [2, Theorem 2], we conclude that the centroid will reach an approximately second-order stationary point in a polynomial number of iterations.

[Fig. 1 here. The figure is a decision tree for the network centroid $\boldsymbol{w}_{c,i}$ at time $i$: if it is not approximately stationary, descent in one iteration follows by Theorem II-C; if it is approximately stationary and $\tau$-strict-saddle, descent over several iterations follows in Part II [2, Theorem 1]; otherwise $\boldsymbol{w}_{c,i}$ is approximately second-order stationary.]

Fig. 1: Classification of approximately stationary points. Theorem II-C in this work establishes descent in the green branch. The red branch is treated in Part II [2, Theorem 1]. The two results are combined in [2, Theorem 2] to establish the return of a second-order stationary point with high probability.

II Evolution Analysis

We shall perform the analysis under the following common assumptions on the gradients and their approximations. [Lipschitz gradients] For each $k$, the gradient $\nabla J_k(\cdot)$ is Lipschitz, namely, for any $x, y \in \mathbb{R}^M$:

$$\|\nabla J_k(x) - \nabla J_k(y)\| \le \delta\, \|x - y\| \quad (5)$$

In light of (1) and Jensen's inequality, this implies for the aggregate cost:

$$\|\nabla J(x) - \nabla J(y)\| \le \delta\, \|x - y\| \quad (6)$$

The Lipschitz gradient conditions (5) and (6) imply bounds on both the function value and the Hessian matrix (when it exists), which will be used regularly throughout the derivations. In particular, we have for the function values:

$$J(y) \le J(x) + \nabla J(x)^{\mathsf{T}} (y - x) + \frac{\delta}{2}\, \|y - x\|^2 \quad (7)$$

For the Hessian matrix we have [4]:

$$-\delta I \le \nabla^2 J(x) \le \delta I \quad (8)$$

[Bounded gradient disagreement] For each pair of agents $k$ and $\ell$, the gradient disagreement is bounded, namely, for any $x \in \mathbb{R}^M$:

$$\|\nabla J_k(x) - \nabla J_\ell(x)\| \le G \quad (9)$$

This assumption is similar to the one used in [34] under a diminishing step-size with annealing. Note that condition (9) is weaker than the more common assumption of bounded gradients. Condition (9) is automatically satisfied in cases where the expected risks $J_k(\cdot)$ are common across agents (though agents may still see different realizations of the data), or in the case of centralized stochastic gradient descent where the number of agents is one. This condition is also satisfied whenever agent-specific risks with bounded gradients are regularized by common regularizers with potentially unbounded gradients, as is common in many machine learning applications. Observe that (9) implies a similar condition on the deviation from the centralized gradient via Jensen's inequality:

$$\|\nabla J_k(x) - \nabla J(x)\| = \Big\| \sum_{\ell=1}^{N} p_\ell \left( \nabla J_k(x) - \nabla J_\ell(x) \right) \Big\| \le G \quad (10)$$

[Filtration] We denote by $\boldsymbol{\mathcal{F}}_i$ the filtration generated by the random processes $\boldsymbol{w}_{k,j}$ for all agents $k$ and all $j \le i$:

$$\boldsymbol{\mathcal{F}}_i \triangleq \mathrm{filtration}\{ \boldsymbol{\mathcal{W}}_0, \boldsymbol{\mathcal{W}}_1, \ldots, \boldsymbol{\mathcal{W}}_i \} \quad (11)$$

where $\boldsymbol{\mathcal{W}}_j$ contains the iterates across the network at time $j$. Informally, $\boldsymbol{\mathcal{F}}_i$ captures all information that is available about the stochastic processes across the network up to time $i$. ∎

Throughout the following derivations, we will frequently rely on appropriate conditioning arguments to make the analysis tractable. A frequent theme will be the exchange of conditioning on filtrations for conditioning on events. To this end, the following lemma will be used repeatedly. [Conditioning] Suppose $\boldsymbol{x}$ is a random variable measurable by $\boldsymbol{\mathcal{F}}$. In other words, $\boldsymbol{x}$ is deterministic conditioned on $\boldsymbol{\mathcal{F}}$ and

$$\mathbb{E}\left[ \boldsymbol{x} \mid \boldsymbol{\mathcal{F}} \right] = \boldsymbol{x} \quad (12)$$

Then,

$$\mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{x} \in \mathcal{S} \right] = \mathbb{E}\left[\, \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{\mathcal{F}} \right] \mid \boldsymbol{x} \in \mathcal{S} \,\right] \quad (13)$$

for any deterministic set $\mathcal{S}$ and random $\boldsymbol{y}$. Denote by $\mathbb{1}_{\mathcal{S}}(\boldsymbol{x})$ the random indicator function:

$$\mathbb{1}_{\mathcal{S}}(\boldsymbol{x}) \triangleq \begin{cases} 1, & \boldsymbol{x} \in \mathcal{S} \\ 0, & \text{otherwise} \end{cases} \quad (14)$$

Since $\boldsymbol{x}$ is measurable by $\boldsymbol{\mathcal{F}}$, then $\mathbb{1}_{\mathcal{S}}(\boldsymbol{x})$ is measurable by $\boldsymbol{\mathcal{F}}$ as well. In other words, the event $\boldsymbol{x} \in \mathcal{S}$ is deterministic conditioned on $\boldsymbol{\mathcal{F}}$. Furthermore, for the random variable $\mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \boldsymbol{y}$, we have:

$$\mathbb{E}\left[ \mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \boldsymbol{y} \right] = \Pr\{\boldsymbol{x} \in \mathcal{S}\}\; \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{x} \in \mathcal{S} \right] \quad (15)$$

Rearranging yields:

$$\mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{x} \in \mathcal{S} \right] = \frac{\mathbb{E}\left[ \mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \boldsymbol{y} \right]}{\Pr\{\boldsymbol{x} \in \mathcal{S}\}} \quad (16)$$

Similarly, for the random variable $\mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{\mathcal{F}} \right]$, we have:

$$\mathbb{E}\left[\, \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{\mathcal{F}} \right] \mid \boldsymbol{x} \in \mathcal{S} \,\right] = \frac{\mathbb{E}\left[ \mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{\mathcal{F}} \right] \right]}{\Pr\{\boldsymbol{x} \in \mathcal{S}\}} \quad (17)$$

It then follows that:

$$\mathbb{E}\left[\, \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{\mathcal{F}} \right] \mid \boldsymbol{x} \in \mathcal{S} \,\right] \stackrel{(a)}{=} \frac{\mathbb{E}\left[ \mathbb{E}\left[ \mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \boldsymbol{y} \mid \boldsymbol{\mathcal{F}} \right] \right]}{\Pr\{\boldsymbol{x} \in \mathcal{S}\}} \stackrel{(b)}{=} \frac{\mathbb{E}\left[ \mathbb{1}_{\mathcal{S}}(\boldsymbol{x})\, \boldsymbol{y} \right]}{\Pr\{\boldsymbol{x} \in \mathcal{S}\}} = \mathbb{E}\left[ \boldsymbol{y} \mid \boldsymbol{x} \in \mathcal{S} \right] \quad (18)$$

where in step $(a)$ we pulled $\mathbb{1}_{\mathcal{S}}(\boldsymbol{x})$ into the inner expectation, since it is deterministic conditioned on $\boldsymbol{\mathcal{F}}$, and $(b)$ follows from the law of total expectation. ∎ [Gradient noise process] For each $k$, the gradient noise process is defined as

$$\boldsymbol{s}_{k,i}(\boldsymbol{w}) \triangleq \widehat{\nabla J}_k(\boldsymbol{w}) - \nabla J_k(\boldsymbol{w}) \quad (19)$$

and satisfies

$$\mathbb{E}\left[ \boldsymbol{s}_{k,i}(\boldsymbol{w}) \mid \boldsymbol{\mathcal{F}}_{i-1} \right] = 0 \quad (20a)$$
$$\mathbb{E}\left[ \|\boldsymbol{s}_{k,i}(\boldsymbol{w})\|^4 \mid \boldsymbol{\mathcal{F}}_{i-1} \right] \le \sigma^4 \quad (20b)$$

for some non-negative constant $\sigma^4$. We also assume that the gradient noise processes are pairwise uncorrelated over the space conditioned on $\boldsymbol{\mathcal{F}}_{i-1}$, i.e.:

$$\mathbb{E}\left[ \boldsymbol{s}_{k,i}(\boldsymbol{w})\, \boldsymbol{s}_{\ell,i}(\boldsymbol{w})^{\mathsf{T}} \mid \boldsymbol{\mathcal{F}}_{i-1} \right] = 0, \quad k \ne \ell \quad (21)$$

∎ Property (20a) means that the gradient noise construction is unbiased on average. Property (20b) means that the fourth moment of the gradient noise is bounded. These properties are automatically satisfied for several costs of interest [4, 10]. Note that the bound on the fourth-order moment, in light of Jensen's inequality, immediately implies:

$$\mathbb{E}\left[ \|\boldsymbol{s}_{k,i}(\boldsymbol{w})\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1} \right] \le \sigma^2 \quad (22)$$
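As an illustration of conditions (19)–(22) (a self-contained sketch with a made-up least-squares risk; none of the names below come from the paper), the following snippet empirically checks the unbiasedness property (20a) and estimates the second- and fourth-order moments in (22) and (20b) at a fixed iterate:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
w_star = rng.normal(size=M)

def true_grad(w):
    # Gradient of J(w) = 0.5*E(d - h^T w)^2 with h ~ N(0, I_M), d = h^T w_star + v:
    # since E[h h^T] = I_M, we get grad J(w) = w - w_star.
    return w - w_star

def stochastic_grad(w):
    # Single-sample approximation from one streaming pair (h, d).
    h = rng.normal(size=M)
    d = h @ w_star + 0.1 * rng.normal()
    return -(d - h @ w) * h

w = rng.normal(size=M)   # fixed iterate at which the noise is probed
noise = np.array([stochastic_grad(w) - true_grad(w) for _ in range(100_000)])
sq = np.sum(noise**2, axis=1)
print("mean (approx. 0, cf. (20a)):", noise.mean(axis=0))
print("2nd moment (cf. (22)):      ", sq.mean())
print("4th moment (cf. (20b)):     ", (sq**2).mean())
```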

While our primary interest is in the development of algorithms that allow for learning from streaming data, we remark briefly that the results obtained in this work are equally applicable to empirical risk minimization via stochastic gradient descent, by assuming that the streaming data is selected according to a particular distribution.

[Empirical Risk Minimization] Suppose the costs are empirical, based on locally collected data $\{x_{k,n}\}_{n=1}^{N_k}$, and take the form:

$$J_k(w) = \frac{1}{N_k} \sum_{n=1}^{N_k} Q(w; x_{k,n}) \quad (23)$$

In empirical risk minimization (ERM) problems, we are interested in finding a vector $w$ that minimizes the following empirical risk over the data across the entire network:

$$J(w) \triangleq \sum_{k=1}^{N} p_k J_k(w) = \sum_{k=1}^{N} \frac{p_k}{N_k} \sum_{n=1}^{N_k} Q(w; x_{k,n}) \quad (24)$$

If we introduce the uniformly-distributed random variable $\boldsymbol{x}_k$ with $\Pr\{\boldsymbol{x}_k = x_{k,n}\} = 1/N_k$ for all $n$, then minimizing the cost (24) is equivalent to solving:

$$\min_w\; \sum_{k=1}^{N} p_k\, \mathbb{E}\, Q(w; \boldsymbol{x}_k) \quad (25)$$

which is of the same form as (1) with $J_k(w) = \mathbb{E}\, Q(w; \boldsymbol{x}_k)$. The resulting gradient noise process satisfies the assumptions imposed in this work under appropriate conditions on the risk $Q(\cdot\,;\cdot)$. This observation has been leveraged to accurately quantify the performance of stochastic gradient descent, as well as mini-batch and importance sampling generalizations, for empirical minimization of convex risks in [7]. ∎
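The following sketch makes the sampling construction concrete for a squared loss (the local dataset and helper names are illustrative stand-ins): drawing an index uniformly at random turns the single-sample gradient into an unbiased estimate of the gradient of the local empirical risk (23).

```python
import numpy as np

rng = np.random.default_rng(2)
N_k, M = 200, 3
X = rng.normal(size=(N_k, M))        # local data at agent k
y = X @ rng.normal(size=M)           # noiseless targets for simplicity

def empirical_grad(w):
    # Gradient of J_k(w) = (1/N_k) * sum_n 0.5*(y_n - x_n^T w)^2, cf. (23).
    return -X.T @ (y - X @ w) / N_k

def sampled_grad(w):
    # Stochastic gradient: pick one datum uniformly, Pr{n} = 1/N_k.
    n = rng.integers(N_k)
    return -(y[n] - X[n] @ w) * X[n]

w = np.zeros(M)
avg = np.mean([sampled_grad(w) for _ in range(100_000)], axis=0)
print("max deviation from (23)-gradient:", np.abs(avg - empirical_grad(w)).max())
```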

II-A Network basis transformation

In analyzing the dynamics of the distributed algorithm (2a)–(2b), it is useful to introduce the following extended quantities by collecting variables from across the network:

$$\boldsymbol{\mathcal{W}}_i \triangleq \mathrm{col}\{ \boldsymbol{w}_{1,i}, \ldots, \boldsymbol{w}_{N,i} \} \quad (26)$$
$$\widehat{\nabla \mathcal{J}}(\boldsymbol{\mathcal{W}}_i) \triangleq \mathrm{col}\{ \widehat{\nabla J}_1(\boldsymbol{w}_{1,i}), \ldots, \widehat{\nabla J}_N(\boldsymbol{w}_{N,i}) \} \quad (27)$$
$$\mathcal{A} \triangleq A \otimes I_M \quad (28)$$

where $\otimes$ denotes the Kronecker product operation. We can then write the diffusion recursion (2a)–(2b) compactly as

$$\boldsymbol{\mathcal{W}}_i = \mathcal{A}^{\mathsf{T}} \left( \boldsymbol{\mathcal{W}}_{i-1} - \mu\, \widehat{\nabla \mathcal{J}}(\boldsymbol{\mathcal{W}}_{i-1}) \right) \quad (29)$$

By construction, the combination matrix $A$ is left-stochastic and primitive and hence admits a Jordan decomposition of the form $A = V_\epsilon J V_\epsilon^{-1}$ with [4, 5]:

$$V_\epsilon = \begin{bmatrix} p & V_R \end{bmatrix}, \quad J = \begin{bmatrix} 1 & 0 \\ 0 & J_\epsilon \end{bmatrix}, \quad V_\epsilon^{-1} = \begin{bmatrix} \mathbb{1}^{\mathsf{T}} \\ V_L^{\mathsf{T}} \end{bmatrix} \quad (30)$$

where $J_\epsilon$ is a block Jordan matrix with the eigenvalues $\lambda_2$ through $\lambda_N$ on the diagonal and $\epsilon$ on the first lower sub-diagonal. The extended matrix $\mathcal{A}$ then satisfies $\mathcal{A} = \mathcal{V}_\epsilon \mathcal{J} \mathcal{V}_\epsilon^{-1}$ with $\mathcal{V}_\epsilon = V_\epsilon \otimes I_M$, $\mathcal{J} = J \otimes I_M$, $\mathcal{V}_\epsilon^{-1} = V_\epsilon^{-1} \otimes I_M$. The spectral properties of $\mathcal{A}$ and its corresponding eigendecomposition have been exploited extensively in the study of the diffusion learning strategy in the convex setting [4, 5], and will continue to be useful in non-convex scenarios.
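As a quick numerical confirmation that (29) is just a restacking of (2a)–(2b) (an illustrative check; the random matrices below are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, mu = 4, 2, 0.05
A = rng.random((N, N))
A /= A.sum(axis=0, keepdims=True)     # left-stochastic combination matrix

W = rng.normal(size=(N, M))           # rows are the iterates w_{k,i-1}
G = rng.normal(size=(N, M))           # rows are gradient approximations

# Per-agent form (2a)-(2b):
Phi = W - mu * G
W_next = A.T @ Phi

# Extended form (29) with script-A = A (Kronecker) I_M acting on stacked vectors:
cal_A = np.kron(A, np.eye(M))
w_stacked = cal_A.T @ (W.reshape(-1) - mu * G.reshape(-1))
print(np.allclose(W_next.reshape(-1), w_stacked))   # True
```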

Multiplying both sides of (29) by $\left( p^{\mathsf{T}} \otimes I_M \right)$ from the left, we obtain in light of (4):

$$\left( p^{\mathsf{T}} \otimes I_M \right) \boldsymbol{\mathcal{W}}_i = \left( p^{\mathsf{T}} \otimes I_M \right) \left( \boldsymbol{\mathcal{W}}_{i-1} - \mu\, \widehat{\nabla \mathcal{J}}(\boldsymbol{\mathcal{W}}_{i-1}) \right) \quad (31)$$

Letting $\boldsymbol{w}_{c,i} \triangleq \left( p^{\mathsf{T}} \otimes I_M \right) \boldsymbol{\mathcal{W}}_i = \sum_{k=1}^{N} p_k\, \boldsymbol{w}_{k,i}$ and exploiting the block structure of the gradient term, we find:

$$\boldsymbol{w}_{c,i} = \boldsymbol{w}_{c,i-1} - \mu \sum_{k=1}^{N} p_k\, \widehat{\nabla J}_k(\boldsymbol{w}_{k,i-1}) \quad (32)$$

Note that $\boldsymbol{w}_{c,i}$ is a convex combination of iterates across the network and can be viewed as a weighted centroid. The recursion for $\boldsymbol{w}_{c,i}$ is reminiscent of a stochastic gradient step associated with the aggregate cost $J(w)$, with the exact gradients replaced by stochastic approximations and with the stochastic gradients evaluated at $\boldsymbol{w}_{k,i-1}$, rather than $\boldsymbol{w}_{c,i-1}$. In fact, we can write:

$$\boldsymbol{w}_{c,i} = \boldsymbol{w}_{c,i-1} - \mu\, \nabla J(\boldsymbol{w}_{c,i-1}) - \mu\, \boldsymbol{d}_{i-1} - \mu\, \boldsymbol{s}_i \quad (33)$$

where we defined the perturbation terms:

$$\boldsymbol{d}_{i-1} \triangleq \sum_{k=1}^{N} p_k \left( \nabla J_k(\boldsymbol{w}_{k,i-1}) - \nabla J_k(\boldsymbol{w}_{c,i-1}) \right) \quad (34)$$
$$\boldsymbol{s}_i \triangleq \sum_{k=1}^{N} p_k\, \boldsymbol{s}_{k,i}(\boldsymbol{w}_{k,i-1}) \quad (35)$$

We use the subscript $i-1$ for $\boldsymbol{d}_{i-1}$ to emphasize that it depends on data up to time $i-1$, in contrast to $\boldsymbol{s}_i$, which also depends on the most recent data from time $i$. Observe that $\boldsymbol{d}_{i-1}$ arises from the disagreement within the network; in particular, if each $\boldsymbol{w}_{k,i-1}$ remains close to the network centroid $\boldsymbol{w}_{c,i-1}$, this perturbation will be small in light of the Lipschitz condition (5) on the gradients. The second perturbation term $\boldsymbol{s}_i$ arises from the noise introduced by the stochastic gradient approximations at each agent. We now establish that recursion (33) will continue to exhibit some of the desired properties of (centralized) gradient descent, despite the presence of persistent and coupled perturbation terms.
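For completeness, the step from (32) to (33) only adds and subtracts $\nabla J_k(\boldsymbol{w}_{c,i-1})$ and uses the gradient noise definition (19) together with $\sum_k p_k \nabla J_k = \nabla J$ from (1):

$$\sum_{k=1}^{N} p_k\, \widehat{\nabla J}_k(\boldsymbol{w}_{k,i-1}) = \nabla J(\boldsymbol{w}_{c,i-1}) + \underbrace{\sum_{k=1}^{N} p_k \left( \nabla J_k(\boldsymbol{w}_{k,i-1}) - \nabla J_k(\boldsymbol{w}_{c,i-1}) \right)}_{=\,\boldsymbol{d}_{i-1}} + \underbrace{\sum_{k=1}^{N} p_k\, \boldsymbol{s}_{k,i}(\boldsymbol{w}_{k,i-1})}_{=\,\boldsymbol{s}_i}$$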

II-B Network disagreement

To begin with, we study more closely the evolution of the individual estimates $\boldsymbol{w}_{k,i}$ relative to the network centroid $\boldsymbol{w}_{c,i}$. Multiplying (29) by $\left( V_L^{\mathsf{T}} \otimes I_M \right)$ from the left yields, in light of (30):

(36)

Then, for the deviation from the network centroid:

(37)

so that the deviation from the centroid can be easily recovered from the transformed iterates in (36). Proceeding with (36), we find:

(38)

where $(a)$ follows from the sub-multiplicative property of norms, and $(b)$ follows from Jensen's inequality with

(39)

for sufficiently small $\epsilon$ due to the strongly-connected graph assumption, where $\lambda_2$ denotes the second largest eigenvalue magnitude of $A$. We observe that the transformed deviation term contracts at an exponential rate given by $\lambda_2 + \epsilon$ for small $\epsilon$, also known as the mixing rate of the graph. Iterating this relation and applying the assumptions above, we obtain the following result; the sketch below outlines the geometric-series argument.
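As a schematic illustration of the iteration step (a sketch only; the precise constants are derived in Appendix A), suppose a non-negative sequence satisfies $x_i \le \lambda\, x_{i-1} + \mu^4 c$ with contraction factor $\lambda < 1$, as suggested by (38). Then

$$x_i \;\le\; \lambda^i x_0 + \mu^4 c \sum_{j=0}^{i-1} \lambda^j \;\le\; \lambda^i x_0 + \frac{\mu^4 c}{1 - \lambda}$$

so that the influence of the initial condition vanishes exponentially and, after sufficiently many iterations, only the $O(\mu^4)$ driving term remains. This is the sense in which the disagreement bound below holds "after sufficient iterations".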

[Network disagreement (4th order)] Under the assumptions above, the network disagreement is bounded after sufficient iterations $i \ge i_0$ by:

(40)

where

(41)

and $o(\mu^4)$ denotes a term that is higher in order than $\mu^4$.

Proof:

Appendix A.

Note again that Jensen's inequality immediately implies for the second-order moment:

(42)

where $(a)$ follows from (40) and the sub-additivity of the square root, i.e., $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$. This result establishes that, for every agent $k$, we have after sufficient iterations $i \ge i_0$:

(43)

or, by Markov's inequality [39]:

(44)

and hence $\boldsymbol{w}_{k,i}$ will be arbitrarily close to $\boldsymbol{w}_{c,i}$ with arbitrarily high probability, for all agents. This result has two implications. First, it allows us to use the network centroid $\boldsymbol{w}_{c,i}$ as a proxy for all iterates in the network, since all agents will cluster around the network centroid after sufficient iterations. Second, it allows us to bound the perturbation terms encountered in (33). [Perturbation bounds (2nd and 4th order)] Under the assumptions above and for sufficiently small step-sizes $\mu$, the perturbation terms $\boldsymbol{d}_{i-1}$ and $\boldsymbol{s}_i$ are bounded as:

(45)
(46)

after sufficient iterations $i \ge i_0$.

Proof:

Appendix B.

[Sets] To simplify the notation in the sequel, we introduce the following sets:

(47)
(48)
(49)
(50)

where $\pi$ is a small positive parameter, $c_1$ and $c_2$ are constants:

(51)
(52)

and $\tau$ is a parameter to be chosen. Note that $\mathcal{G}^c = \mathcal{H} \cup \mathcal{M}$. We also define the probabilities $\pi_i^{\mathcal{G}} \triangleq \Pr\{\boldsymbol{w}_{c,i} \in \mathcal{G}\}$, $\pi_i^{\mathcal{H}} \triangleq \Pr\{\boldsymbol{w}_{c,i} \in \mathcal{H}\}$ and $\pi_i^{\mathcal{M}} \triangleq \Pr\{\boldsymbol{w}_{c,i} \in \mathcal{M}\}$. Then for all $i$, we have $\pi_i^{\mathcal{G}} + \pi_i^{\mathcal{H}} + \pi_i^{\mathcal{M}} = 1$. ∎ The definitions (47)–(50) decompose the parameter space into two disjoint sets $\mathcal{G}$ and $\mathcal{G}^c$, and further sub-divide $\mathcal{G}^c$ into $\mathcal{H}$ and $\mathcal{M}$. The set $\mathcal{G}$ denotes the set of all points where the norm of the gradient is large, while $\mathcal{G}^c$ denotes the set of all points where the norm of the gradient is small, i.e., approximately first-order stationary points. In a manner similar to related works on the escape from strict-saddle points, we further decompose the set of approximate first-order stationary points into those points in $\mathcal{H}$ that do have a significant negative eigenvalue, and those in $\mathcal{M}$ that do not [21, 23]. Points in the parameter space that have a small gradient norm and no significant negative Hessian eigenvalue are referred to as second-order stationary points, while points in $\mathcal{H}$ are known as strict saddle-points due to the presence of a strictly negative eigenvalue in the Hessian matrix. In the sequel, we will establish descent for centroids in $\mathcal{G}$ in Theorem II-C and for centroids in $\mathcal{H}$ in Part II [2, Theorem 1], and hence the approach of a point in $\mathcal{M}$ with high probability after a polynomial number of iterations in Part II [2, Theorem 2]. Second-order stationary points are generally more likely to be "good" minimizers than first-order stationary points, which could even correspond to local maxima. Furthermore, for a certain class of cost functions, known as "strict-saddle" functions, second-order stationary points always correspond to local minima for sufficiently small $\tau$ [21]. The sketch below illustrates this classification.

II-C Evolution of the network centroid

Having established in (42) that, after sufficient iterations, all agents in the network will have contracted around the centroid in a small cluster for small step-sizes, we can now leverage $\boldsymbol{w}_{c,i}$ as a proxy for all $\boldsymbol{w}_{k,i}$. From the gradient noise assumption and (7), we have the following bound:

(53)

From (33), we then obtain:

(54)

This relation, along with (33) and the perturbation bounds (45)–(46), allows us to establish the following theorem. [Descent relation] Beginning at $\boldsymbol{w}_{c,i-1}$ in the large-gradient regime $\mathcal{G}$, we can bound:

(55)

as long as $i \ge i_0$, where the relevant constants are listed in the sets definition of Section II-B. On the other hand, beginning at $\boldsymbol{w}_{c,i-1} \in \mathcal{G}^c$, we can bound:

(56)

as long as $i \ge i_0$.

Proof:

Appendix D.

Relation (55) guarantees a lower bound on the expected improvement when the gradient norm at the current iterate is sufficiently large, i.e., when $\boldsymbol{w}_{c,i-1}$ is not an approximately first-order stationary point. On the other hand, when $\boldsymbol{w}_{c,i-1} \in \mathcal{G}^c$, inequality (56) establishes an upper bound on the expected ascent. The respective bounds can be balanced against each other by appropriately choosing the parameter $\pi$, which will be leveraged in Part II [2]. We are left to treat the third possibility, namely $\boldsymbol{w}_{c,i-1} \in \mathcal{H}$. In this case, since the norm of the gradient is small, it is no longer possible to guarantee descent in a single iteration. We shall study the dynamics in more detail in the sequel.

II-D Behavior around stationary points

In the vicinity of saddle-points, the norm of the gradient is not sufficiently large to guarantee descent at every iteration, as indicated by (56). Instead, we will study the cumulative effect of the gradient, as well as of the perturbations, over several iterations. For this purpose, we introduce the following second-order condition on the cost functions, which is common in the literature [4, 21, 23]. [Lipschitz Hessians] Each $J_k(\cdot)$ is twice-differentiable with Hessian $\nabla^2 J_k(\cdot)$ and there exists $\rho \ge 0$ such that:

$$\|\nabla^2 J_k(x) - \nabla^2 J_k(y)\| \le \rho\, \|x - y\| \quad (57)$$

By Jensen's inequality, this implies that $J(\cdot)$ also satisfies:

$$\|\nabla^2 J(x) - \nabla^2 J(y)\| \le \rho\, \|x - y\| \quad (58)$$

Let $i^\star$ denote an arbitrary point in time. We use $i^\star$ in order to emphasize approximately first-order stationary points, where the norm of the gradient is small. Such first-order stationary points could either be in the set of second-order stationary points $\mathcal{M}$ or in the set of strict-saddle points $\mathcal{H}$. Our objective is to show that when $\boldsymbol{w}_{c,i^\star} \in \mathcal{H}$, we can guarantee descent after several iterations. To this end, starting at $\boldsymbol{w}_{c,i^\star}$, we have for $i > i^\star$:

(59)

Subsequent analysis will rely on an auxiliary model, referred to as a short-term model. It will be seen that this model is more tractable and evolves "close" to the true recursion under the second-order smoothness condition on the Hessian matrix (58) and as long as the iterates remain close to a stationary point. A similar approach has been introduced and used to great advantage in the form of a "long-term model" to derive accurate mean-square deviation performance expressions for strongly-convex costs in [4, 10, 40, 41]. The approach was also used to provide a "quadratic approximation" to establish the ability of stochastic gradient based algorithms to escape from strict saddle-points in the single-agent case under i.i.d. perturbations in [21].

For the driving gradient term in (59), we have from the mean-value theorem [4]:

(60)

where

(61)

Subtracting (59) from $\boldsymbol{w}_{c,i^\star}$, we obtain:

(62)

We introduce short-hand notation for the deviation:

$$\widetilde{\boldsymbol{w}}_{c,i} \triangleq \boldsymbol{w}_{c,i^\star} - \boldsymbol{w}_{c,i} \quad (63)$$

Note that $\widetilde{\boldsymbol{w}}_{c,i}$ denotes the deviation of the network centroid at time $i$ from its value at the approximately stationary point encountered at time $i^\star$.