Geometrically Convergent Distributed Optimization with Uncoordinated Step-Sizes

09/19/2016 · by Angelia Nedić, et al. · Boston University · University of Illinois at Urbana-Champaign

A recent algorithmic family for distributed optimization, DIGing, has been shown to converge geometrically over time-varying undirected/directed graphs. Nevertheless, an identical step-size for all agents is needed. In this paper, we study the convergence rate of the Adapt-Then-Combine (ATC) variation of the DIGing algorithm under uncoordinated step-sizes. We show that the ATC variation of the DIGing algorithm converges geometrically fast even if the step-sizes differ among the agents. In addition, our analysis implies that the ATC structure can accelerate convergence compared with the distributed gradient descent (DGD) structure used in the original DIGing algorithm.

I Introduction

Recent advances in networked and distributed systems require the development of scalable algorithms that take into account the decentralized nature of the problem and communication constraints. Formation control [2, 3], distributed spectrum sensing [4], and statistical inference and learning [5, 6, 7, 8] are among the application areas of such algorithms.

The optimal operation of many such distributed systems can be modeled as an optimization problem whose objective function is the aggregation of local private information distributed throughout the system.

This paper focuses on problems of the form

$$\min_{x \in \mathbb{R}^p} \; f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where each function $f_i : \mathbb{R}^p \to \mathbb{R}$ is held privately by agent $i$ and encodes the agent's objective, e.g., private data. The agents collectively seek to solve the joint problem by exchanging information over a network. Such a network might reflect privacy settings or communication constraints.
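To make the setting concrete, the following is a minimal Python sketch of a problem instance of the form (1). The least-squares local objectives and the data dimensions are our own placeholder assumptions (the paper's experiments instead use Huber losses).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3  # number of agents, dimension of the decision variable

# Hypothetical private data: agent i holds (A_i, b_i) and the local
# objective f_i(x) = 0.5 * ||A_i x - b_i||^2 (an assumed example).
A = [rng.standard_normal((10, p)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

def grad_f_i(i, x):
    """Gradient of the local objective f_i at x."""
    return A[i].T @ (A[i] @ x - b[i])

def f(x):
    """Global objective of (1): the average of the local objectives."""
    return sum(0.5 * np.linalg.norm(A[i] @ x - b[i]) ** 2 for i in range(n)) / n
```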

Several algorithms have been proposed for the solution of problems of the form (1) since the 1980s [9, 10]. Initial approaches for general and possibly time-varying graphs were based on distributed subgradients, with extensions to handle stochasticity and asynchronous updates [11, 12, 13]. Such algorithms are flexible in the class of functions and graphs they can handle, but they are considerably slow: even for strongly convex functions, a diminishing step-size is required, which precludes linear rates [14, 15, 16]. Recent studies have achieved linear convergence rates for strongly convex functions [17, 18, 19, 1, 20, 21, 22]. Nonetheless, these methods require a careful selection of the step-sizes.

Recently in [23, 24], the authors utilized the Adapt-Then-Combine strategy (the reader is referred to [25] for more discussion of different strategies for information diffusion in networks) to develop an augmented version of the distributed gradient method for distributed optimization over time-invariant graphs. This algorithm is shown to converge for convex smooth objective functions with a sufficiently small constant step-size. Moreover, no coordination of the step-sizes is needed. Additionally, similar structures based on dynamic average consensus have been explored for more general classes of non-convex functions [26]. For non-convex problems, the work in [27, 28, 29] develops a large class of distributed algorithms by utilizing various “function-surrogate modules,” thus providing great flexibility and rendering a new class of algorithms that subsumes many existing distributed algorithms. The authors in [28, 23] simultaneously proposed methods that track the gradient averages.

In this paper we study the Adapt-Then-Combine Distributed Inexact Gradient Tracking (ATC-DIGing) algorithm for the solution of the optimization problem (1). Specifically, we show that geometric convergence rates can still be obtained with uncoordinated step-sizes. (Suppose that a sequence $\{x_k\}$ converges to $x^*$ in some norm $\|\cdot\|$. We say that the convergence is R-linear (geometric) if there exist $\lambda \in (0,1)$ and some positive constant $C$ such that $\|x_k - x^*\| \le C\lambda^k$ for all $k$. This rate is often referred to as global, to be distinguished from the case when the given relations are valid only for sufficiently large indices $k$.) Moreover, under specific conditions, the ATC-DIGing algorithm can use step-sizes as large as in the centralized case, which improves the stability region of the ATC structure over the Distributed Gradient Descent (DGD) structure used in the original DIGing algorithm.

This paper is organized as follows. Section II presents preliminary definitions, the proposed algorithm, and the main result of this paper. Section III presents the analysis and proofs. Section IV discusses the implications of the results and offers general remarks about the contributions. Section V presents a numerical test, and finally Section VI presents conclusions and future work.

Notation. Each agent $i$ holds a local copy of the variable of the problem in (1), which is denoted by $x^i \in \mathbb{R}^p$; its value at iteration/time $k$ is denoted by $x^i_k$. In general, per-agent information is represented by superscripts with letters $i$ or $j$, and time indices by subscripts with the letter $k$. We stack the row versions of all $x^i$ into a single matrix $\mathbf{x} \in \mathbb{R}^{n \times p}$, whose $i$-th row is denoted by $(x^i)^\top$. We introduce an aggregate objective function of the local variables, $F(\mathbf{x}) = \sum_{i=1}^n f_i(x^i)$, whose gradient $\nabla F(\mathbf{x})$ is the $n \times p$ matrix whose $i$-th row is $(\nabla f_i(x^i))^\top$. We say that $\mathbf{x}$ is consensual if all of its rows are identical, i.e., $x^1 = x^2 = \cdots = x^n$. Furthermore, we let $\mathbf{1}$ denote a column vector with all entries equal to one (its size is to be understood from the context). The bar denotes averages, e.g., $\bar{x} = \frac{1}{n}\mathbf{1}^\top \mathbf{x}$, and the consensus violation of $\mathbf{x}$ is denoted by $\check{\mathbf{x}} = \mathbf{x} - \mathbf{1}\bar{x}$. We use $\|\cdot\|$ to denote a weighted (semi)-norm whose weight matrix will be clear from the context; since the weight matrix has spectral norm at most one, we always have $\|\mathbf{x}\| \le \|\mathbf{x}\|_{\mathrm{F}}$, where $\|\cdot\|_{\mathrm{F}}$ stands for the Frobenius norm. For any matrix $\mathbf{x}$ and any row vector $v \in \mathbb{R}^{1 \times p}$, in view of the definition of the consensus violation, it holds that $\|\check{\mathbf{x}}\|_{\mathrm{F}} \le \|\mathbf{x} - \mathbf{1}v\|_{\mathrm{F}}$.
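As a quick illustration of the stacked-matrix notation (a sketch under the reconstruction above; the variable names are ours):

```python
import numpy as np

n, p = 5, 3
x = np.random.default_rng(1).standard_normal((n, p))  # row i is agent i's copy

x_bar = x.mean(axis=0)                     # the average (1/n) 1^T x
x_check = x - np.outer(np.ones(n), x_bar)  # consensus violation x - 1 x_bar
is_consensual = np.allclose(x_check, 0)    # True iff all rows are identical
```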

II Definitions, Algorithm and Main Result

The set of agents $\{1, \dots, n\}$ interacts over a time-invariant connected undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{E}$ corresponds to the edges in the graph. A pair $(i, j) \in \mathcal{E}$ indicates that agent $i$ can exchange information with agent $j$. The set of neighbors of agent $i$ is defined as $\mathcal{N}_i = \{ j \mid (i, j) \in \mathcal{E} \}$. Additionally, there is a nonnegative doubly stochastic weight matrix $W = [w_{ij}]$, compliant with the graph $\mathcal{G}$, such that $w_{ij} > 0$ if $(i, j) \in \mathcal{E}$ or $j = i$, and $w_{ij} = 0$ otherwise.

Next we are going to formalize the set of assumptions we will use for our results.

Assumption 1

The graph $\mathcal{G}$ is connected and the matrix $W$ is doubly stochastic.

Assumption 1 is recurrent in many distributed optimization algorithms. It guarantees a minimum exchange of information between agents and the balancedness of such exchanges. This assumption can be relaxed, without much extra work, to the case of uniformly connected time-varying directed graphs [1].
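For concreteness, a standard way to build a weight matrix satisfying Assumption 1 on a connected undirected graph is the Metropolis rule; the following sketch is our illustration, not a construction prescribed by the paper.

```python
import numpy as np

def metropolis_weights(edges, n):
    """Build a symmetric doubly stochastic W for a connected undirected graph
    via the Metropolis rule: w_ij = 1 / (1 + max(d_i, d_j)) for each edge
    {i, j}, and w_ii = 1 minus the sum of the off-diagonal row entries."""
    W = np.zeros((n, n))
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    for i in range(n):
        W[i, i] = 1.0 - W[i].sum()
    return W

# Example: a 5-agent ring graph; rows and columns both sum to one.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
W = metropolis_weights(edges, 5)
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
```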

Lemma 1

Let Assumption 1 hold. For any matrix $\mathbf{x} \in \mathbb{R}^{n \times p}$, if $\bar{x} = \frac{1}{n}\mathbf{1}^\top \mathbf{x}$, then we have $\|W\mathbf{x} - \mathbf{1}\bar{x}\| \le \sigma \|\mathbf{x} - \mathbf{1}\bar{x}\|$, where $\sigma$ is a constant less than $1$.

Lemma 1 is standard in the consensus literature. An explicit expression of $\sigma$ in terms of the network parameters can be found in [30] when more specific assumptions are made.

We also need the following two assumptions on the objective functions, which are common for deriving linear (geometric) rates of gradient-based algorithms for strongly convex smooth optimization problems.

Assumption 2 (Smoothness)

Every function $f_i$ is differentiable and has Lipschitz continuous gradients, i.e., there exists a constant $L_i > 0$ such that, for all $x, y \in \mathbb{R}^p$, $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|$.

In Section III we will use $\hat{L} = \max_i L_i$, which is a Lipschitz constant of $\nabla F$, and $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$, which is a Lipschitz constant of $\nabla f$.

Assumption 3 (Strong convexity)

Every function $f_i$ satisfies

$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), \, y - x \rangle + \frac{\mu_i}{2}\|y - x\|^2$$

for any $x, y \in \mathbb{R}^p$, where $\mu_i \ge 0$. Moreover, at least one $\mu_i$ is nonzero.

In the analysis we will use $\hat{\mu} = \max_i \mu_i$ and $\bar{\mu} = \frac{1}{n}\sum_{i=1}^n \mu_i$. Assumption 3 implies the $\bar{\mu}$-strong convexity of $f$. Under this assumption, the optimal solution $x^*$ to problem (1) is guaranteed to exist and to be unique since $\bar{\mu} > 0$. We note that all the convergence results in our analysis are achieved under Assumption 3. We will also use the condition number $\kappa = \bar{L}/\bar{\mu}$.

With the above definitions and assumptions in place, we now state the ATC-DIGing algorithm in its compact matrix form. Each agent $i$ maintains two variables, $x^i_k$ and $y^i_k$, at each time instant $k$. These variables are updated according to the following rule:

$$\mathbf{x}_{k+1} = W(\mathbf{x}_k - D\,\mathbf{y}_k), \qquad (2a)$$
$$\mathbf{y}_{k+1} = W\big(\mathbf{y}_k + \nabla F(\mathbf{x}_{k+1}) - \nabla F(\mathbf{x}_k)\big), \qquad (2b)$$

where $W = [w_{ij}]$ is the doubly stochastic matrix of weights introduced above, and $D = \mathrm{diag}\{\alpha_1, \dots, \alpha_n\}$ is a diagonal matrix where $\alpha_i > 0$ is the step-size of agent $i$. The initial value $\mathbf{x}_0$ is arbitrary and $\mathbf{y}_0 = \nabla F(\mathbf{x}_0)$.
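The following Python sketch implements update (2) as reconstructed above; the function names are ours, and the usage note assumes the least-squares objectives from the earlier snippet.

```python
import numpy as np

def atc_diging(grad_F, W, alphas, x0, iters=500):
    """Sketch of the ATC-DIGing update (2) as reconstructed above:
        x_{k+1} = W (x_k - D y_k),
        y_{k+1} = W (y_k + grad_F(x_{k+1}) - grad_F(x_k)),
    where grad_F maps the n-by-p iterate matrix to the n-by-p matrix whose
    i-th row is the gradient of f_i at the i-th row of the iterate."""
    D = np.diag(alphas)              # uncoordinated per-agent step-sizes
    x = x0.copy()
    g = grad_F(x)
    y = g.copy()                     # gradient-tracking start: y_0 = grad_F(x_0)
    for _ in range(iters):
        x_next = W @ (x - D @ y)     # adapt (local gradient step), then combine
        g_next = grad_F(x_next)
        y = W @ (y + g_next - g)     # ATC-style tracking of the average gradient
        x, g = x_next, g_next
    return x

# Usage with the earlier least-squares sketch (all rows agree at convergence):
# grad_F = lambda x: np.stack([grad_f_i(i, x[i]) for i in range(n)])
```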

Algorithms of the form (2) have been recently proposed under the name Aug-DGM in [23, 24], where the convergence of the algorithm under uncoordinated step-sizes is proven. Our objective is to study the convergence rate of the algorithm. We will show that it converges geometrically fast, and we will provide an explicit rate estimate. These contributions are stated in the next theorem, which is the main result of this paper.

Theorem 2 (Explicit geometric rate)

Let Assumptions 1, 2 and 3 hold, and let the step-size matrix $D$ be such that its largest positive entry $\hat{\alpha} = \max_i \alpha_i$ satisfies the following relation:

where $\kappa_D = \max_i \alpha_i / \min_i \alpha_i$ is the condition number of the step-size matrix $D$. Then, assuming that the step-size heterogeneity $\kappa_D$ is small enough, the sequence $\{\mathbf{x}_k\}$ generated by the ATC-DIGing algorithm with uncoordinated step-sizes converges to the optimal solution at a global R-linear (geometric) rate $O(\lambda^k)$, where $\lambda$ is given by

(3)

Theorem 2 provides an explicit convergence rate estimate for the ATC-DIGing algorithm. This rate might not be tight, and better choices in the analysis would result in better bounds.
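Since the estimate in (3) need not be tight, a practical sanity check is to measure the empirical rate from a run. The following is a small sketch (our illustration) that fits $\lambda$ from residuals $r_k \approx C\lambda^k$.

```python
import numpy as np

def estimate_rate(residuals):
    """Estimate lambda from residuals r_k ~ C * lambda^k via a least-squares
    fit of log r_k against k (zero residuals are dropped to keep logs finite)."""
    r = np.asarray(residuals, dtype=float)
    k = np.arange(len(r))
    mask = r > 0
    slope, _ = np.polyfit(k[mask], np.log(r[mask]), 1)
    return float(np.exp(slope))

print(estimate_rate([0.95 ** k for k in range(100)]))  # prints ~0.95
```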

III The Small Gain Theorem for Linear Rates

To establish the R-linear rate of the algorithm, one of our technical tools is a somewhat unusual version of the small gain theorem under a well-chosen metric; the original version of this theorem has been extensively studied and widely applied in control theory [31]. We choose to analyze the ATC-DIGing algorithm using the small gain theorem due to its effectiveness in showing geometric rates for other algorithms, e.g., [1]. We will give an intuition for the whole analytical approach shortly, after first stating the small gain theorem.

Let us adopt the notation $\mathbf{u}$ for the infinite sequence $\mathbf{u} = (\mathbf{u}_0, \mathbf{u}_1, \mathbf{u}_2, \dots)$, where $\mathbf{u}_k \in \mathbb{R}^{n \times p}$. Furthermore, let us define

$$\|\mathbf{u}\|^{\lambda, K} = \max_{k = 0, \dots, K} \frac{1}{\lambda^k}\|\mathbf{u}_k\|, \qquad (4a)$$
$$\|\mathbf{u}\|^{\lambda} = \sup_{k \ge 0} \frac{1}{\lambda^k}\|\mathbf{u}_k\|, \qquad (4b)$$

where the parameter $\lambda \in (0, 1)$ will serve as the linear rate parameter later in our analysis. While $\|\mathbf{u}\|^{\lambda, K}$ is always finite, $\|\mathbf{u}\|^{\lambda}$ may be infinite. If $n = p = 1$, i.e., each $\mathbf{u}_k$ is a scalar, we will just write $|\mathbf{u}|^{\lambda, K}$ and $|\mathbf{u}|^{\lambda}$ for these quantities. Intuitively, $\|\mathbf{u}\|^{\lambda, K}$ is a weighted “ergodic norm” of $\mathbf{u}$. Noticing that the weight $\lambda^{-k}$ grows exponentially with $k$, if we can show that $\|\mathbf{u}\|^{\lambda}$ is bounded, then it would imply that $\mathbf{u}_k \to 0$ geometrically fast. This ergodic definition enables us to analyze algorithms that do not converge Q-linearly. Next we state the small gain theorem, which gives a sufficient condition for the boundedness of $\|\mathbf{u}\|^{\lambda}$. The theorem is a basic result in control systems, and a detailed discussion of it can be found in [31].
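A small sketch of the ergodic norm (4a) and the boundedness argument (our illustration, assuming the reconstructed definitions):

```python
import numpy as np

def ergodic_norm(seq, lam):
    """||u||^{lambda,K} from (4a): the max over k = 0..K of ||u_k|| / lambda^k,
    where K = len(seq) - 1 and ||.|| is the Frobenius norm."""
    return max(np.linalg.norm(u) / lam ** k for k, u in enumerate(seq))

# If B = ||u||^{lambda} is finite, then ||u_k|| <= B * lambda^k for all k,
# i.e., the sequence decays geometrically (the content of Lemma 4 below).
lam = 0.9
seq = [0.5 ** k * np.ones((3, 2)) for k in range(50)]   # decays like 0.5^k
B = ergodic_norm(seq, lam)
assert all(np.linalg.norm(u) <= B * lam ** k + 1e-12 for k, u in enumerate(seq))
```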

Theorem 3 (The small gain theorem)

Suppose that $\mathbf{u}^1, \dots, \mathbf{u}^m$ is a set of sequences such that for all positive integers $K$ and for each $i = 1, \dots, m$ (with indices taken modulo $m$),

$$\|\mathbf{u}^{i+1}\|^{\lambda, K} \le \gamma_i \|\mathbf{u}^{i}\|^{\lambda, K} + \omega_i, \qquad (5)$$

where the constants (gains) $\gamma_1, \dots, \gamma_m$ are nonnegative and satisfy $\gamma_1 \gamma_2 \cdots \gamma_m < 1$. Then

$$\|\mathbf{u}^1\|^{\lambda} \le \frac{\omega_m + \omega_{m-1}\gamma_m + \omega_{m-2}\gamma_m\gamma_{m-1} + \cdots + \omega_1 \gamma_m \gamma_{m-1} \cdots \gamma_2}{1 - \gamma_1 \gamma_2 \cdots \gamma_m}. \qquad (6)$$

For simplicity of exposition, we will denote the bound relation in (5) by an arrow $\mathbf{u}^i \to \mathbf{u}^{i+1}$. Clearly, the small gain theorem involves a cycle $\mathbf{u}^1 \to \mathbf{u}^2 \to \cdots \to \mathbf{u}^m \to \mathbf{u}^1$. Due to this cyclic structure, bounds similar to (6) hold for $\|\mathbf{u}^2\|^{\lambda}, \dots, \|\mathbf{u}^m\|^{\lambda}$.
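To make the conclusion concrete, the following sketch evaluates the right-hand side of (6) for given gains and offsets (our illustration, using the form of (6) as reconstructed above):

```python
import numpy as np

def small_gain_bound(gammas, omegas):
    """Evaluate the bound in (6): (omega_m + omega_{m-1} gamma_m + ... +
    omega_1 gamma_2 ... gamma_m) / (1 - gamma_1 ... gamma_m)."""
    m = len(gammas)
    prod_all = float(np.prod(gammas))
    assert prod_all < 1, "small gain condition gamma_1 ... gamma_m < 1 violated"
    total = sum(omegas[i] * float(np.prod(gammas[i + 1:])) for i in range(m))
    return total / (1 - prod_all)

# Example: three gains whose product 0.36 is less than one.
print(small_gain_bound([0.5, 0.8, 0.9], [1.0, 2.0, 0.5]))
```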

Lemma 4 (Bounded norm R-linear rate)

For any matrix sequence $\mathbf{u}$, if $\|\mathbf{u}\|^{\lambda}$ is bounded for some $\lambda \in (0, 1)$, then $\|\mathbf{u}_k\|$ converges to $0$ at a global R-linear (geometric) rate $O(\lambda^k)$.

Before summarizing our main proof idea, let us define some quantities which we will use frequently in our analysis. We define $\mathbf{q}_k = \mathbf{x}_k - \mathbf{1}(x^*)^\top$, where $x^*$ is the optimal solution of (1). Also, define

(7)

which is the optimality residual of the iterates (at the $k$-th iteration). Moreover, let us adopt the shorthand notation

(8)

together with its associated convention.

We will apply the small gain theorem with the metric $\|\cdot\|^{\lambda, K}$ and a suitable choice of $\lambda$ around the cycle of arrows shown in Fig. 1.

Fig. 1: Bound relations between variables in the ATC-DIGing algorithm.

After establishing each arrow/relation, we will apply the small gain theorem. Specifically, we will show that the sequences in the cycle are bounded in the $\|\cdot\|^{\lambda}$ norm and hence conclude that all quantities in the “circle of arrows” decay at an R-linear rate $O(\lambda^k)$.

Note that to apply the small gain theorem, we need gains $\gamma_1, \dots, \gamma_m$ whose product is less than one. This is achieved by choosing an appropriate step-size matrix $D$.

The next lemma establishes each of the arrows/relations in the sketch of Fig. 1.

Lemma 5

Let Assumptions 1, 2 and 3 hold, and let $\sigma$ be as given in Lemma 1. Also, let $\lambda$ be such that $\sigma < \lambda < 1$. Then, for all positive integers $K$, the arrow relations (i)–(iv) of Fig. 1 hold,

where

and

Lemma 5 provides a subset of the relations required for the application of the small gain theorem. The remaining relation, the last arrow, is addressed next. For this, we need an interlude on gradient descent with errors in the gradient. Since this part is relatively independent of the preceding development, we provide it in the next subsection.

III-A The Inexact Gradient Descent on a Sum of Strongly Convex Functions

In this subsection, we consider the basic (centralized) first-order method for problem (1) under an inexact first-order oracle. To distinguish from the notation used for our distributed optimization problem/algorithm/analysis, let us make some definitions that are only used in this subsection. Problem (1) is restated as follows with different notation:

$$\min_{z \in \mathbb{R}^p} \; g(z) = \frac{1}{n}\sum_{i=1}^n g_i(z),$$

where all $g_i$'s satisfy Assumptions 2 and 3 with $f_i$ replaced by $g_i$. Let us consider the inexact gradient descent (IGD) on the function $g$:

$$z_{k+1} = z_k - \beta\big(\nabla g(z_k) + \varepsilon_k\big), \qquad (9)$$

where $\beta$ is the step-size and $\varepsilon_k$ is an additive noise. Let $z^*$ be the global minimum of $g$, and define $r_k = \|z_k - z^*\|$.
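A minimal sketch of the IGD iteration (9) as reconstructed above; the quadratic test function and the noise model are our own assumptions.

```python
import numpy as np

def inexact_gd(grad, z0, beta, noise, iters=100):
    """Inexact gradient descent (9): z_{k+1} = z_k - beta * (grad(z_k) + e_k),
    where e_k = noise(k) is the additive gradient error."""
    z = np.array(z0, dtype=float)
    for k in range(iters):
        z = z - beta * (grad(z) + noise(k))
    return z

# Example: g(z) = 0.5 ||z||^2 (so grad g(z) = z and z* = 0) with geometrically
# decaying noise; the error then also decays geometrically.
rng = np.random.default_rng(2)
z = inexact_gd(grad=lambda z: z, z0=np.ones(3), beta=0.5,
               noise=lambda k: 0.1 * 0.9 ** k * rng.standard_normal(3))
print(np.linalg.norm(z))  # small: close to the minimizer z* = 0
```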

The main lemma of this subsection is stated next; it is basically obtained by following the ideas in [32, 1].

Lemma 6 (The error bound on the IGD)

Suppose that

(10)

where the constants involved are determined by the parameters in Assumptions 2 and 3. Let Assumptions 2 and 3 hold for all $g_i$'s. Then the sequence $\{r_k\}$ generated by the inexact gradient method (9) obeys

(11)

Now we prove the last arrow of our proof sketch [cf. Fig. 1] in the following lemma. Its proof uses the error bound on the IGD of Lemma 6 as a key ingredient.

Lemma 7 (The last arrow)

Let Assumptions 1, 2, and 3 hold. In addition, suppose that the parameter $\lambda$ and the step-sizes are such that

for some tunable parameters. Then, we have

(12)

where the constant takes one of two explicit values, depending on which of the two stated conditions holds.

III-B Proof of Main Result

We are now ready to present the proof of our main result, Theorem 2.

Proof of Theorem 2.

We will use the small gain theorem (Theorem 3), together with Lemma 5 and Lemma 7, to show that $\|\mathbf{q}\|^{\lambda}$ is bounded. Therefore, we need the product of the gains around the cycle to be less than one, that is,

(13)

along with the other restrictions on the parameters that appear in Lemmas 5 and 7.

To obtain a concise, though probably loose, bound on the convergence rate, we next use Lemma 5 with some specific values for the tunable parameters, which yields the desired result. With these choices, it follows from (13) that

(14)

where we require/assume that the step-size heterogeneity is small enough to ensure the non-emptiness of (13) (this way, the right-hand side of (14) is always positive). Relation (14) further implies that

(15)

Solving this inequality for $\hat{\alpha}$ gives

(16)

Meanwhile, we also have that

(17)

Aggregating the multiple conditions for $\lambda$ and $\hat{\alpha}$ provides the desired result. ∎

IV Discussion

Other possible choices of the tunable parameters exist and may give tighter bounds, but here we only aim to provide an explicit estimate of the convergence rate.

If all agents use an identical step-size, we can use a small-gain-theorem sketch similar to the one in reference [1] to obtain a geometric rate that is tighter than the result in this paper. Specifically, in this case, the number of iterations needed by DIGing to reach $\epsilon$-accuracy grows faster with the problem parameters than the number needed by ATC-DIGing; see Appendix D for a more detailed explanation. This comparison shows that ATC-DIGing has a faster convergence rate and is less sensitive to the condition number, especially when $\sigma$ is small (the network is well-connected). This suggests that, within the DIGing family, the ATC structure should be used whenever possible.

Here, one of our goals is to demonstrate that a geometric rate can still be obtained even with uncoordinated step-sizes. Compared to the case of a coordinated (identical) step-size in [1], to allow uncoordinated step-sizes we have to show that additional quantities decay geometrically fast. Thus, more steps in the small-gain-theorem sketch [cf. Fig. 1] are needed, and a worse bound on the rate is derived.

Considering the bounds in Theorem 2, there is a trade-off between the tolerance of step-size heterogeneity and the largest achievable step-size. In addition, Theorem 2 says that when the graph is well-connected ($\sigma$ is small enough) and the heterogeneity is small enough, a step-size as large as in the centralized case can be utilized, and the corresponding convergence rate can be comparably fast.

To keep the paper concise, we analyze ATC-DIGing under a rather simple network setting, i.e., a time-invariant undirected graph. However, we expect that the ideas of the analysis in this paper can be extended to Push-DIGing and other possible variants of DIGing, even under the setting of time-varying directed graphs. Thus, in the following numerical test, we conduct the experiments under a tougher situation for the DIGing family.

V Numerical Test

In this section, we use numerical experiments to demonstrate the performance of the DIGing family under uncoordinated step-sizes. The problem we solve is decentralized Huber minimization over time-varying undirected graphs. The experiment settings, including data/graph generation, are the same as those in Section 6 of reference [1], except that at each iteration and each agent, we perturb the base step-size by a random variable drawn from a uniform distribution over an interval. In other words, at iteration $k$, agent $i$ uses the base step-size scaled by the random variable generated at agent $i$ at time $k$. Monte Carlo simulation shows that the step-size sequences in the current experiment result in a modest heterogeneity on average.
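A sketch of this step-size perturbation scheme follows; the base step-size and the half-width of the uniform interval are placeholders, since the exact values used in the paper are not recoverable here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, iters = 10, 1000
alpha_base = 0.1   # placeholder base step-size (the paper reuses values from [1])
spread = 0.5       # assumed half-width of the uniform perturbation interval

# At iteration k, agent i uses alpha_base * (1 + u_{i,k}), with
# u_{i,k} ~ Uniform(-spread, spread), independent across agents and time.
alphas = alpha_base * (1.0 + rng.uniform(-spread, spread, size=(iters, n)))
```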

Numerical results are illustrated in Fig. 2. They show that, under uncoordinated step-sizes, the DIGing family still converges geometrically fast.

Fig. 2: Plot of residuals for a time-varying undirected graph sequence. The base step-sizes are set to be the same as the constant step-sizes used in the left sub-figure of Fig. 2 of reference [1].

VI Conclusions and Future Work

We have shown that the ATC-DIGing algorithm for the distributed optimization problem (1) converges geometrically to the optimal solution even if the agents use constant but uncoordinated step-sizes. We have also provided an explicit estimate of its convergence rate. Convergence and rates are derived using the small gain theorem; nevertheless, no claims about the tightness of these estimates are made. Under specific conditions, the obtained rate shows that ATC-DIGing is less sensitive to the problem parameters than the DGD-structured DIGing. Future work should consider extensions to time-varying directed graphs and explore the tightness of the rates.

References

  • [1] A. Nedić, A. Olshevsky, and W. Shi, “Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs,” arXiv preprint arXiv:1607.03218, 2016.
  • [2] W. Ren, “Consensus Based Formation Control Strategies for Multi-Vehicle Systems,” in Proceedings of the American Control Conference, 2006, pp. 4237–4242.
  • [3] W. Ren, R. Beard, and E. Atkins, “Information Consensus in Multivehicle Cooperative Control: Collective Group Behavior through Local Interaction,” IEEE Control Systems Magazine, vol. 27, pp. 71–82, 2007.
  • [4] J. Bazerque and G. Giannakis, “Distributed Spectrum Sensing for Cognitive Radio Networks by Exploiting Sparsity,” IEEE Transactions on Signal Processing, vol. 58, pp. 1847–1862, 2010.
  • [5] S. Lee and A. Nedić, “DRSVM: Distributed Random Projection Algorithms for SVMs,” in Proceedings of the 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 5286–5291.
  • [6] M. Rabbat and R. Nowak, “Distributed Optimization in Sensor Networks,” in Proceedings of the 3rd international symposium on Information processing in sensor networks.   ACM, 2004, pp. 20–27.
  • [7] A. Nedić, A. Olshevsky, and C. Uribe, “Fast Convergence Rates for Distributed Non-Bayesian Learning,” arXiv preprint arXiv:1508.05161, 2015.
  • [8] A. Nedić, A. Olshevsky, and C. A. Uribe, “Nonasymptotic convergence rates for cooperative learning over time-varying directed graphs,” in Proceedings of the American Control Conference, 2015, pp. 5884–5889.
  • [9] J. N. Tsitsiklis and M. Athans, “Convergence and asymptotic agreement in distributed decision problems,” IEEE Transactions on Automatic Control, vol. 29, no. 1, pp. 42–50, 1984.
  • [10] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods.   Prentice-Hall, Inc., 1989.
  • [11] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [12] A. Nedić, “Asynchronous Broadcast-Based Convex Optimization over a Network,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1337–1351, 2011.
  • [13] S. S. Ram, A. Nedić, and V. Veeravalli, “Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
  • [14] J. Duchi, A. Agarwal, and M. Wainwright, “Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
  • [15] M. Zhu and S. Martinez, “On Distributed Convex Optimization under Inequality and Equality Constraints,” IEEE Transactions on Automatic Control, vol. 57, no. 1, pp. 151–164, 2012.
  • [16] B. He and X. Yuan, “On the Convergence Rate of the Douglas-Rachford Alternating Direction Method,” SIAM Journal on Numerical Analysis, vol. 50, no. 2, pp. 700–709, 2012.
  • [17] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the Linear Convergence of the ADMM in Decentralized Consensus Optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
  • [18] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [19] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “DQM: Decentralized Quadratically Approximated Alternating Direction Method of Multipliers,” arXiv preprint arXiv:1508.02073, 2015.
  • [20] C. Xi and U. Khan, “On the Linear Convergence of Distributed Optimization over Directed Graphs,” arXiv preprint arXiv:1510.02149, 2015.
  • [21] J. Zeng and W. Yin, “ExtraPush for Convex Smooth Decentralized Optimization over Directed Networks,” arXiv preprint arXiv:1511.02942, 2015.
  • [22] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” arXiv preprint arXiv:1605.07112, 2016.
  • [23] J. Xu, S. Zhu, Y. Soh, and L. Xie, “Augmented Distributed Gradient Methods for Multi-Agent Optimization Under Uncoordinated Constant Stepsizes,” in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), 2015, pp. 2055–2060.
  • [24] J. Xu, “Augmented Distributed Optimization for Networked Systems,” Ph.D. dissertation, Nanyang Technological University, 2016.
  • [25] A. Sayed, “Diffusion Adaptation over Networks,” Academic Press Library in Signal Processing, vol. 3, pp. 323–454, 2013.
  • [26] M. Zhu and S. Martinez, “Discrete-Time Dynamic Average Consensus,” Automatica, vol. 46, no. 2, pp. 322–329, 2010.
  • [27] P. Di Lorenzo and G. Scutari, “NEXT: In-Network Nonconvex Optimization,” IEEE Transactions on Signal and Information Processing over Networks, 2016.
  • [28] ——, “Distributed nonconvex optimization over networks,” in IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015, pp. 229–232.
  • [29] ——, “Distributed nonconvex optimization over time-varying networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4124–4128.
  • [30] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. Tsitsiklis, “On Distributed Averaging Algorithms and Quantization Effects,” IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.
  • [31] C. Desoer and M. Vidyasagar, Feedback Systems: Input-Output Properties. SIAM, 2009, vol. 55.
  • [32] O. Devolder, F. Glineur, and Y. Nesterov, “First-Order Methods with Inexact Oracle: The strongly convex case,” UCL, Tech. Rep., 2013.

Appendix A Proof of Lemma 5

A-A (i)

Proof.

By the Lipschitz continuity of the gradients (Assumption 2), it is easy to show that for all $k$ and any $\mathbf{x}$, $\mathbf{y}$,

(18)

From (2), using Lemma 1, it follows that

and therefore, for all $K$,

Taking the maximum over $k = 0, \dots, K$ on both sides of the above relation, we obtain

Hence,

(19)

Combining (18) and (19) completes the proof. ∎

A-B (ii)

Proof.

Combining the relevant definitions, we have

The desired result follows immediately. ∎

A-C (iii)

This follows directly from the definitions.

A-D (iv)

Proof.

From (2), using Lemma 1, it follows that, for all $k$,

and therefore, for all $K$,

(20)

Taking the maximum over $k = 0, \dots, K$ on both sides of (20), we obtain

This completes the proof. ∎

Appendix B Proof of Lemma 6

Proof.

By the assumptions, for each $i$ and $k$, we have

(21)

Using the basic inequality $\|a + b\|^2 \le (1 + \eta)\|a\|^2 + (1 + 1/\eta)\|b\|^2$, where $\eta > 0$ is a tunable parameter, it follows from (21) that

and therefore

(22)

Averaging (22) over $i = 1$ through $n$ gives

(23)

On the other hand, we also have that, for any vector $v$,

where $\eta' > 0$ is some tunable parameter, and therefore

(24)

Averaging (24) over $i = 1$ through $n$ gives

(25)

Having (23) and (25) at hand, we are ready to show how $r_{k+1}$ is related to $r_k$. First, plugging these bounds into the basic equality yields