# On linear convergence of two decentralized algorithms

Decentralized algorithms solve multi-agent problems over a connected network, where information can only be exchanged with accessible neighbors. Though there exist several decentralized optimization algorithms, there are still gaps in convergence conditions and rates between decentralized algorithms and centralized ones. In this paper, we fill some of these gaps by considering two decentralized consensus algorithms: EXTRA and NIDS. Both algorithms converge linearly with strongly convex functions. We answer two questions regarding both algorithms. What are the optimal upper bounds for their stepsizes? Do decentralized algorithms require more properties on the functions for linear convergence than centralized ones? More specifically, we relax the required conditions for linear convergence for both algorithms. For EXTRA, we show that the stepsize is of order $O(1/L)$ ($L$ is the Lipschitz constant of the gradient of the functions), which is comparable to that of centralized algorithms, though the upper bound is still smaller than that of centralized ones. For NIDS, we show that the upper bound of the stepsize is the same as that of centralized ones, and it does not depend on the network. In addition, we relax the requirement for the functions and the mixing matrix, which reflects the topology of the network. As far as we know, we provide the linear convergence results for both algorithms under the weakest conditions.


## I Introduction

This paper considers the optimization problem

$$\operatorname*{minimize}_{x\in\mathbb{R}^p}\quad \bar f(x)\coloneqq\frac{1}{n}\sum_{i=1}^{n}f_i(x) \tag{1}$$

over an $n$-agent network. Each function $f_i$ is known only by the corresponding agent $i$ and is assumed to be convex and differentiable. These agents form a connected network to solve the problem (1) cooperatively without knowing other agents' functions. The whole system is decentralized such that each agent has an estimation of the global variable $x$ and can only exchange the estimation with its accessible neighbors at every iteration. We introduce

$$f(\mathbf{x})\coloneqq\sum_{i=1}^{n}f_i(x_i), \tag{2}$$

where each $x_i\in\mathbb{R}^p$ is a local estimation of the global variable $x$ and its $k$th iterated value is $x_i^k$. There is a symmetric mixing matrix $W\in\mathbb{R}^{n\times n}$ encoding the communication between the agents. The minimum condition for $W$ is that it has one eigenvalue $1$ and all other eigenvalues are smaller than $1$. In addition, the all-one vector $\mathbf{1}$ is an eigenvector of $W$ corresponding to the eigenvalue $1$ (this is satisfied when the sum of each row is $1$).
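To make the conditions on the mixing matrix concrete, here is a minimal sketch for a ring of five agents. The ring topology, the uniform weight 1/3, and the helper name `ring_mixing_matrix` are illustrative assumptions, not choices made in the paper. Because this particular matrix is symmetric and circulant, its eigenvalues are available in closed form, so no linear algebra library is needed.

```python
import math

def ring_mixing_matrix(n):
    """Symmetric mixing matrix for a ring of n >= 3 agents (hypothetical helper).

    Each agent averages itself and its two ring neighbours with weight 1/3,
    so w_ij = 0 whenever agents i and j are not connected.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W

n = 5
W = ring_mixing_matrix(n)

# Every row sums to 1, so the all-one vector is an eigenvector with eigenvalue 1.
row_sums = [sum(row) for row in W]
is_symmetric = all(W[i][j] == W[j][i] for i in range(n) for j in range(n))

# W is a symmetric circulant matrix, so its eigenvalues are
# 1/3 + (2/3) * cos(2*pi*j/n) for j = 0, ..., n-1.
eigs = sorted(1 / 3 + (2 / 3) * math.cos(2 * math.pi * j / n) for j in range(n))
print(row_sums, is_symmetric, eigs)
```

The largest eigenvalue is 1 (for the all-one vector) and all others are strictly smaller, which is the minimum condition described above.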

Early decentralized methods based on decentralized gradient descent [1, 2, 3, 4, 5] have sublinear convergence for strongly convex objective functions, because of the diminishing stepsize that is needed to obtain a consensual and optimal solution. This sublinear convergence rate is much slower than that of centralized methods. The first decentralized algorithm with linear convergence [6] is based on the Alternating Direction Method of Multipliers (ADMM) [7, 8]. Note that this type of algorithm has an $O(1/k)$ rate for general convex functions [9, 10, 11]. After that, many linearly convergent algorithms were proposed. Some examples are EXTRA [12], NIDS [13], DIGing [14, 15], ESOM [16], gradient tracking methods [17, 18, 19, 15, 14, 20, 21], exact diffusion [22, 23], and dual optimal methods [24, 25]. There are also works on composite functions, where each private function is the sum of a smooth and a nonsmooth function [26, 13, 27, 28]. Another topic of interest is decentralized optimization over directed and dynamic graphs [29, 30, 31, 32, 14, 33, 34]. Interested readers can refer to [35] and the references therein for more algorithms.

This paper focuses on two linearly convergent algorithms, EXTRA and NIDS, and provides better theoretical convergence results for them. EXact firsT-ordeR Algorithm (EXTRA) was proposed in [12], and its iteration is described in (5). There are conditions on the stepsize $\alpha$ for its convergence. For the general convex case, where each $f_i$ is convex and $L$-smooth (i.e., has an $L$-Lipschitz continuous gradient), the condition in [12] is $\alpha<\frac{1+\lambda_{\min}(W)}{L}$. Therefore, there is an implicit condition for $W$ that the smallest eigenvalue of $W$ is larger than $-1$. Later the condition was relaxed to $\alpha<\frac{5+3\lambda_{\min}(W)}{4L}$ in [36], and the corresponding requirement for $W$ is that the smallest eigenvalue of $W$ is larger than $-5/3$. In addition, this condition for the stepsize is shown to be optimal, i.e., EXTRA may diverge if the condition is not satisfied. Though we can always manipulate $W$ to change the smallest eigenvalue, the convergence speed of EXTRA depends on the matrix $W$. In the numerical experiment, we will see that it is beneficial to choose small eigenvalues for EXTRA in certain scenarios.

The linear convergence of EXTRA requires additional conditions on the functions. There are mainly three types of conditions used in the literature: the strong convexity of $\bar f$ (and some weaker variants) [12], the strong convexity of each $f_i$ (and some weaker variants) [36], and the strong convexity of one function $f_i$ [23]. Note that the condition on $\bar f$ is much weaker than the other two; there are cases where $\bar f$ is strongly convex but none of the $f_i$'s is. For example, this happens when $n=p$ and each $f_i$ is a strongly convex function of the single component $e_i^\top x$, where $e_i$ is the vector whose $i$th component is $1$ and all other components are $0$. If $\bar f$ is (restricted) strongly convex with parameter $\mu_{\bar f}$, the linear convergence of EXTRA is shown in [12] under a stepsize bound given there. That upper bound for the stepsize is very conservative, and the better performance with a larger stepsize was shown numerically in [12] without proof. If each $f_i$ is strongly convex with parameter $\mu$, the linear convergence is shown under the stepsize conditions of [27] and [36], respectively. One contribution of this paper is to show the linear convergence of EXTRA under the (restricted) strong convexity of $\bar f$ and the relaxed stepsize condition $\alpha<\frac{5+3\lambda_{\min}(W)}{4L}$.
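As a worked illustration of why strong convexity of the average is weaker than strong convexity of each function, take the following concrete instance. The specific choice of $f_i$ is an assumption made here for illustration, with $n=p\ge 2$:

$$f_i(x)=\tfrac12\,(e_i^\top x)^2,\qquad \nabla^2 f_i(x)=e_ie_i^\top,\qquad i=1,\dots,n.$$

Each Hessian $e_ie_i^\top$ has rank one, so no $f_i$ is strongly convex on $\mathbb{R}^p$. The average, however, satisfies

$$\bar f(x)=\frac1n\sum_{i=1}^{n}\tfrac12\,(e_i^\top x)^2=\frac{1}{2n}\|x\|^2,\qquad \nabla^2\bar f(x)=\frac1n I,$$

which is strongly convex with parameter $1/n$.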

The algorithm NIDS (Network InDependent Step-size) was proposed in [13]. Though there is only a small difference from EXTRA, NIDS can choose a stepsize that does not depend on the mixing matrices. The convergence of NIDS is shown when $\alpha<2/L$. The result for linear convergence requires the strong convexity of $f$. Another contribution of this paper is the linear convergence of NIDS under the (restricted) strong convexity of $\bar f$ and relaxed mixing matrices with $\lambda_{\min}(W)>-5/3$.

In sum, we provide new and stronger linear convergence results for both EXTRA and NIDS. More specifically,

• We show the linear convergence of EXTRA with the strong convexity of $\bar f$ and the relaxed condition $\lambda_{\min}(W)>-5/3$. The upper bound of the stepsize can be as large as $\frac{5+3\lambda_{\min}(W)}{4L}$, which is shown to be optimal in [36] for general convex problems;

• We show the linear convergence of NIDS under the same conditions on $\bar f$ and $W$ as for EXTRA, while keeping the large network-independent stepsize.

### I-A Notation

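Since agent $i$ has its own estimation $x_i$ of the global variable $x$, we put them together and define

$$\mathbf{x}=\begin{bmatrix} - & x_1^\top & - \\ - & x_2^\top & - \\ & \vdots & \\ - & x_n^\top & - \end{bmatrix}\in\mathbb{R}^{n\times p}. \tag{3}$$

The gradient of $f$ is defined as

$$\nabla f(\mathbf{x})=\begin{bmatrix} - & \nabla f_1(x_1)^\top & - \\ - & \nabla f_2(x_2)^\top & - \\ & \vdots & \\ - & \nabla f_n(x_n)^\top & - \end{bmatrix}\in\mathbb{R}^{n\times p}. \tag{4}$$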

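We say that $\mathbf{x}$ is consensual if $x_1=x_2=\cdots=x_n$, i.e., $\mathbf{x}=\mathbf{1}x^\top$, where $x\in\mathbb{R}^p$ and $\mathbf{1}=[1,1,\dots,1]^\top\in\mathbb{R}^n$.

In this paper, we use $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ to denote the Frobenius norm and the corresponding inner product, respectively. For a given matrix $\mathbf{x}\in\mathbb{R}^{n\times p}$ and any positive (semi)definite matrix $Q\in\mathbb{R}^{n\times n}$, which is denoted as $Q\succ 0$ ($Q\succcurlyeq 0$ for positive semidefinite), we define $\|\mathbf{x}\|_Q\coloneqq\sqrt{\langle\mathbf{x},Q\mathbf{x}\rangle}$. The largest and the smallest eigenvalues of a matrix $A$ are denoted by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$. For a symmetric positive semidefinite matrix $A$, we let $\lambda^+_{\min}(A)$ be the smallest nonzero eigenvalue. $A^\dagger$ is the pseudo inverse of $A$. For a matrix $A\in\mathbb{R}^{n\times n}$, we say a matrix $\mathbf{x}\in\mathbb{R}^{n\times p}$ is in $\mathrm{Null}\{A\}$ if $A\mathbf{x}=\mathbf{0}$ and is in $\mathrm{Range}\{A\}$ if there exists $\mathbf{y}\in\mathbb{R}^{n\times p}$ such that $\mathbf{x}=A\mathbf{y}$. For simplicity, we may use $\mathbf{x}^+$ and $\mathbf{x}$ to replace $\mathbf{x}^{k+1}$ and $\mathbf{x}^k$, respectively, in the proofs.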

## II Algorithms and prerequisites

One iteration of EXTRA can be expressed as

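$$\mathbf{x}^{k+2}=(I+W)\mathbf{x}^{k+1}-\tilde W\mathbf{x}^{k}-\alpha\left[\nabla f(\mathbf{x}^{k+1})-\nabla f(\mathbf{x}^{k})\right]. \tag{5}$$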

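The stepsize $\alpha>0$, and the symmetric matrices $W$ and $\tilde W$ satisfy Assumption 1 below. The initial value $\mathbf{x}^0$ is chosen arbitrarily, and $\mathbf{x}^1=W\mathbf{x}^0-\alpha\nabla f(\mathbf{x}^0)$. In practice, we usually let $\tilde W=\frac{I+W}{2}$.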
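The EXTRA iteration (5) can be sketched in a few lines. This is a toy illustration rather than the authors' implementation: it assumes one scalar unknown per agent ($p=1$), quadratic local functions $f_i(x)=\frac12 a_i(x-b_i)^2$, a five-agent ring with uniform weights 1/3, the choice $\tilde W=(I+W)/2$, and a conservative stepsize $\alpha=0.1$ (here $L=\max_i a_i=5$). The names `matvec` and `extra` are hypothetical.

```python
def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def extra(W, grad, x0, alpha, iters):
    """Run EXTRA (5) with W_tilde = (I + W) / 2; one scalar per agent."""
    n = len(x0)
    W_tilde = [[(W[i][j] + (1.0 if i == j else 0.0)) / 2 for j in range(n)]
               for i in range(n)]
    x_prev = list(x0)
    g_prev = grad(x_prev)
    # First step: x^1 = W x^0 - alpha * grad f(x^0)
    x = [wx - alpha * g for wx, g in zip(matvec(W, x_prev), g_prev)]
    for _ in range(iters):
        g = grad(x)
        Wx = matvec(W, x)
        Wt_xprev = matvec(W_tilde, x_prev)
        # x^{k+2} = (I + W) x^{k+1} - W_tilde x^k - alpha [grad f(x^{k+1}) - grad f(x^k)]
        x_next = [x[i] + Wx[i] - Wt_xprev[i] - alpha * (g[i] - g_prev[i])
                  for i in range(n)]
        x_prev, g_prev, x = x, g, x_next
    return x

# Assumed toy problem: f_i(x) = 0.5 * a_i * (x - b_i)^2 on a 5-agent ring.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [3.0, -1.0, 4.0, 1.0, -5.0]
n = len(a)
W = [[1.0 / 3.0 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
grad = lambda x: [a[i] * (x[i] - b[i]) for i in range(n)]

# Minimizer of the average objective: the a-weighted mean of b.
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
x_final = extra(W, grad, [0.0] * n, alpha=0.1, iters=1000)
max_err = max(abs(xi - x_star) for xi in x_final)
print(x_star, max_err)
```

Every agent's local copy should end up close to the same minimizer of the average objective, even though each agent only evaluates its own gradient and only mixes values with its two ring neighbours.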

One iteration of NIDS for solving (1) is

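$$\mathbf{x}^{k+2}=\frac{I+W}{2}\left[2\mathbf{x}^{k+1}-\mathbf{x}^{k}-\alpha\left(\nabla f(\mathbf{x}^{k+1})-\nabla f(\mathbf{x}^{k})\right)\right], \tag{6}$$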

where $\alpha>0$ is the stepsize. The initial value $\mathbf{x}^0$ is chosen arbitrarily, and $\mathbf{x}^1$ is computed from $\mathbf{x}^0$ as specified in [13].

If we choose $\tilde W=\frac{I+W}{2}$ in (5), the difference between EXTRA and NIDS lies only in the communicated data, i.e., whether the gradient information is exchanged or not. However, this small difference brings big changes in the convergence [13]. In order for both algorithms to converge, we make the following assumptions on $W$ and $\tilde W$.
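The NIDS iteration (6) can be sketched in the same toy setting. This is an illustration under stated assumptions, not the authors' code: one scalar per agent, quadratic local functions $f_i(x)=\frac12 a_i(x-b_i)^2$ with $L=5$, a five-agent ring with uniform weights 1/3, and the first step taken as a plain local gradient step $\mathbf{x}^1=\mathbf{x}^0-\alpha\nabla f(\mathbf{x}^0)$. The stepsize $\alpha=0.3$ is chosen only with respect to $2/L=0.4$, without looking at the spectrum of $W$. The names `matvec` and `nids` are hypothetical.

```python
def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def nids(W, grad, x0, alpha, iters):
    """Run NIDS (6); one scalar per agent."""
    n = len(x0)
    W_bar = [[(W[i][j] + (1.0 if i == j else 0.0)) / 2 for j in range(n)]
             for i in range(n)]
    x_prev = list(x0)
    g_prev = grad(x_prev)
    # Assumed initialization: x^1 = x^0 - alpha * grad f(x^0)
    x = [x_prev[i] - alpha * g_prev[i] for i in range(n)]
    for _ in range(iters):
        g = grad(x)
        # Quantity each agent shares with its neighbours. Unlike EXTRA,
        # the gradient difference is mixed together with the iterates.
        z = [2 * x[i] - x_prev[i] - alpha * (g[i] - g_prev[i]) for i in range(n)]
        x_prev, g_prev, x = x, g, matvec(W_bar, z)
    return x

# Assumed toy problem: f_i(x) = 0.5 * a_i * (x - b_i)^2 on a 5-agent ring.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [3.0, -1.0, 4.0, 1.0, -5.0]
n = len(a)
W = [[1.0 / 3.0 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
grad = lambda x: [a[i] * (x[i] - b[i]) for i in range(n)]

x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
x_final = nids(W, grad, [0.0] * n, alpha=0.3, iters=1000)
max_err = max(abs(xi - x_star) for xi in x_final)
print(x_star, max_err)
```

The only structural change from EXTRA with $\tilde W=(I+W)/2$ is where the mixing matrix is applied: here it multiplies the whole bracket, including the gradient difference.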

###### Assumption 1 (Mixing matrix)

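The connected network $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ consists of a set of nodes $\mathcal{V}=\{1,2,\dots,n\}$ and a set of undirected edges $\mathcal{E}$. An undirected edge $(i,j)\in\mathcal{E}$ means that there is a connection between agents $i$ and $j$ and both agents can exchange data. The mixing matrices $W=[w_{ij}]$ and $\tilde W=[\tilde w_{ij}]$ satisfy:

1. (Decentralized property) If $i\neq j$ and $(i,j)\notin\mathcal{E}$, then $w_{ij}=\tilde w_{ij}=0$.

2. (Symmetry) $W=W^\top$, $\tilde W=\tilde W^\top$.

3. (Null space property)

$$\mathrm{Null}\{W-\tilde W\}=\mathrm{span}\{\mathbf{1}\},\qquad \mathrm{Null}\{I-\tilde W\}\supseteq\mathrm{span}\{\mathbf{1}\}.$$

4. (Spectral property)

$$\frac{I+W}{2}\succcurlyeq\tilde W\succ-\frac13 I,\qquad \tilde W\succcurlyeq W.$$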
###### Remark 1

Parts 2-4 imply that the spectrum of $W$ is enlarged to $(-5/3,1]$, while the original assumption is $(-1,1]$ for doubly stochastic matrices. Therefore, in our assumption, $\tilde W$ does not have to be positive definite. This assumption on the mixing matrices is strictly weaker than those in [12] and [13].
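The range of admissible eigenvalues of $W$ can be read off from the spectral property in part 4 of Assumption 1. A short derivation, using only the inequalities stated there:

$$\frac{I+W}{2}\succcurlyeq\tilde W\succ-\frac13 I\ \Longrightarrow\ I+W\succ-\frac23 I\ \Longrightarrow\ W\succ-\frac53 I,$$

$$\frac{I+W}{2}\succcurlyeq\tilde W\succcurlyeq W\ \Longrightarrow\ \frac{I+W}{2}\succcurlyeq W\ \Longrightarrow\ I\succcurlyeq W.$$

Hence every eigenvalue of $W$ lies in $(-5/3,1]$. The value $1$ is attained: part 3 gives $\tilde W\mathbf{1}=\mathbf{1}$ and $(W-\tilde W)\mathbf{1}=\mathbf{0}$, so $W\mathbf{1}=\mathbf{1}$.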

###### Remark 2

From [12, Proposition 2.2], $\mathrm{Null}\{I-W\}=\mathrm{span}\{\mathbf{1}\}$, which is a critical assumption for both algorithms.

Before showing the theoretical results of EXTRA and NIDS, we reformulate both algorithms.

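Reformulation of EXTRA: We reformulate EXTRA by introducing a variable $\mathbf{y}$ as

$$\mathbf{x}^{k+1}=\tilde W\mathbf{x}^{k}+\mathbf{y}^{k}-\alpha\nabla f(\mathbf{x}^{k}), \tag{7a}$$

$$\mathbf{y}^{k+1}=\mathbf{y}^{k}-(\tilde W-W)\mathbf{x}^{k+1}, \tag{7b}$$

with $\mathbf{y}^0=-(\tilde W-W)\mathbf{x}^0$. Then (7) is equivalent to EXTRA (5).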

###### Proposition 1

Let the $\mathbf{x}$-sequence generated by (7) with $\mathbf{y}^0=-(\tilde W-W)\mathbf{x}^0$ be $\{\mathbf{x}^k\}$. Then it is identical to the sequence generated by EXTRA (5) with the same initial point $\mathbf{x}^0$.

###### Proof:

From (7a), we have

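$$\begin{aligned}
\mathbf{x}^{1}&=\tilde W\mathbf{x}^{0}+\mathbf{y}^{0}-\alpha\nabla f(\mathbf{x}^{0})\\
&=\tilde W\mathbf{x}^{0}-(\tilde W-W)\mathbf{x}^{0}-\alpha\nabla f(\mathbf{x}^{0})\\
&=W\mathbf{x}^{0}-\alpha\nabla f(\mathbf{x}^{0}).
\end{aligned}$$

For $k\ge 0$, we have

$$\begin{aligned}
\mathbf{x}^{k+2}&=\tilde W\mathbf{x}^{k+1}+\mathbf{y}^{k+1}-\alpha\nabla f(\mathbf{x}^{k+1})\\
&=W\mathbf{x}^{k+1}+\mathbf{y}^{k}-\alpha\nabla f(\mathbf{x}^{k+1})\\
&=(I+W)\mathbf{x}^{k+1}-\tilde W\mathbf{x}^{k}-\alpha\left[\nabla f(\mathbf{x}^{k+1})-\nabla f(\mathbf{x}^{k})\right],
\end{aligned}$$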

where the second and the last equalities are from (7b) and (7a), respectively.
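The equivalence between (5) and (7) can also be checked numerically. The sketch below runs both recursions side by side and records the largest gap between the two $\mathbf{x}$-sequences. The toy problem (scalar quadratics on a five-agent ring), the choice $\tilde W=(I+W)/2$, the stepsize, and the starting point are all assumptions made for illustration.

```python
def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

n = 5
W = [[1.0 / 3.0 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
W_tilde = [[(W[i][j] + (1.0 if i == j else 0.0)) / 2 for j in range(n)]
           for i in range(n)]
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [3.0, -1.0, 4.0, 1.0, -5.0]
grad = lambda x: [a[i] * (x[i] - b[i]) for i in range(n)]
alpha = 0.1
x0 = [1.0, -2.0, 0.5, 3.0, 0.0]

# EXTRA in its original two-step form (5).
xs_extra = [list(x0)]
xs_extra.append([wx - alpha * g for wx, g in zip(matvec(W, x0), grad(x0))])
for _ in range(20):
    xk, xk1 = xs_extra[-2], xs_extra[-1]
    gk, gk1 = grad(xk), grad(xk1)
    Wx, Wtx = matvec(W, xk1), matvec(W_tilde, xk)
    xs_extra.append([xk1[i] + Wx[i] - Wtx[i] - alpha * (gk1[i] - gk[i])
                     for i in range(n)])

# Reformulation (7) with y^0 = -(W_tilde - W) x^0.
x = list(x0)
y = [w - wt for w, wt in zip(matvec(W, x), matvec(W_tilde, x))]
xs_pd = [x]
for _ in range(21):
    g = grad(x)
    Wtx = matvec(W_tilde, x)
    x_new = [Wtx[i] + y[i] - alpha * g[i] for i in range(n)]          # (7a)
    Wt_new, W_new = matvec(W_tilde, x_new), matvec(W, x_new)
    y = [y[i] - (Wt_new[i] - W_new[i]) for i in range(n)]             # (7b)
    x = x_new
    xs_pd.append(x)

max_gap = max(abs(u - v) for xe, xp in zip(xs_extra, xs_pd)
              for u, v in zip(xe, xp))
print(max_gap)
```

Up to floating-point rounding, the two sequences coincide, as Proposition 1 states.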

###### Remark 3

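By (7b) and the assumption on $\mathbf{y}^0$, each $\mathbf{y}^k$ is in $\mathrm{Range}\{\tilde W-W\}$. In addition, $\mathbf{x}^{k+1}=M(\mathbf{y}^{k}-\mathbf{y}^{k+1})+\mathbf{z}^{k+1}$ for some $\mathbf{z}^{k+1}\in\mathrm{Null}\{\tilde W-W\}$, where $M=(\tilde W-W)^\dagger$.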

Reformulation of NIDS: We adopt the following reformulation of NIDS from [13]:

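$$\mathbf{d}^{k+1}=\mathbf{d}^{k}+\frac{I-W}{2\alpha}\left[\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{k})-\alpha\mathbf{d}^{k}\right], \tag{8a}$$

$$\mathbf{x}^{k+1}=\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{k})-\alpha\mathbf{d}^{k+1}, \tag{8b}$$

together with an initialization of $\mathbf{d}^0$. The equivalence is shown in [13].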
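As a sanity check on the reformulation (8), the sketch below runs (8a)-(8b) and then replays the two-step recursion (6) from the first two iterates that (8) produces. The initialization $\mathbf{d}^0=\mathbf{0}$, the toy problem (scalar quadratics on a five-agent ring), the stepsize, and the starting point are assumptions made here for illustration.

```python
def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

n = 5
W = [[1.0 / 3.0 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [3.0, -1.0, 4.0, 1.0, -5.0]
grad = lambda x: [a[i] * (x[i] - b[i]) for i in range(n)]
alpha = 0.3
x0 = [1.0, -2.0, 0.5, 3.0, 0.0]

# Reformulation (8), with the assumed initialization d^0 = 0.
x, d = list(x0), [0.0] * n
xs_ref = [x]
for _ in range(21):
    g = grad(x)
    t = [x[i] - alpha * g[i] - alpha * d[i] for i in range(n)]
    Wt = matvec(W, t)
    d = [d[i] + (t[i] - Wt[i]) / (2 * alpha) for i in range(n)]   # (8a)
    x = [x[i] - alpha * g[i] - alpha * d[i] for i in range(n)]    # (8b)
    xs_ref.append(x)

# Two-step recursion (6), started from the first two iterates of (8).
xs = [xs_ref[0], xs_ref[1]]
for _ in range(20):
    xp, xc = xs[-2], xs[-1]
    gp, gc = grad(xp), grad(xc)
    z = [2 * xc[i] - xp[i] - alpha * (gc[i] - gp[i]) for i in range(n)]
    Wz = matvec(W, z)
    xs.append([(z[i] + Wz[i]) / 2 for i in range(n)])             # (I + W)/2 applied to z

max_gap = max(abs(u - v) for xr, xt in zip(xs_ref, xs) for u, v in zip(xr, xt))
print(max_gap)
```

The check relies on the identity $\alpha\mathbf{d}^{k}=\mathbf{x}^{k-1}-\alpha\nabla f(\mathbf{x}^{k-1})-\mathbf{x}^{k}$ from (8b); substituting it into (8a)-(8b) gives exactly the bracket in (6).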

To establish the linear convergence of EXTRA and NIDS, we need the following two assumptions.

###### Assumption 2 (Solution existence)

There is a unique solution $x^*$ for the consensus problem (1).

###### Assumption 3 (Lipschitz differentiability and (restricted) strong convexity)

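Each component $f_i$ is a proper, closed and convex function with a Lipschitz continuous gradient:

$$\|\nabla f_i(x)-\nabla f_i(\tilde x)\|\le L\|x-\tilde x\|,\quad\forall x,\tilde x\in\mathbb{R}^p, \tag{9}$$

where $L>0$ is the Lipschitz constant. Furthermore, $\bar f$ is (restricted) strongly convex with respect to $x^*$:

$$\langle x-x^*,\nabla\bar f(x)-\nabla\bar f(x^*)\rangle\ge\mu_{\bar f}\|x-x^*\|^2,\quad\forall x\in\mathbb{R}^p. \tag{10}$$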
###### Proposition 2 ([12, Appendix A])

The following two statements are equivalent:

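1. $\bar f$ is (restricted) strongly convex with respect to $x^*$.

2. For any $\eta>0$, $g(\mathbf{x})\coloneqq f(\mathbf{x})+\frac{\eta}{2}\|\mathbf{x}\|^2_{I-W}$ is (restricted) strongly convex with respect to $\mathbf{x}^*=\mathbf{1}(x^*)^\top$. Specifically, we can let

$$\mu_g=\min\left\{\frac{\mu_{\bar f}}{2},\ \frac{\mu_{\bar f}^2\,\lambda^+_{\min}(I-W)}{\mu_{\bar f}^2+16L^2}\,\eta\right\}.$$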

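This proposition gives

$$\langle\mathbf{x}-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle+\eta\|\mathbf{x}-\mathbf{x}^*\|^2_{I-W}\ge\mu_g\|\mathbf{x}-\mathbf{x}^*\|^2 \tag{11}$$

for any $\mathbf{x}\in\mathbb{R}^{n\times p}$. From [37, Theorem 2.1.5], the inequality (9) is equivalent to, for any $\mathbf{x},\tilde{\mathbf{x}}\in\mathbb{R}^{n\times p}$,

$$\langle\mathbf{x}-\tilde{\mathbf{x}},\nabla f(\mathbf{x})-\nabla f(\tilde{\mathbf{x}})\rangle\ge\frac1L\|\nabla f(\mathbf{x})-\nabla f(\tilde{\mathbf{x}})\|^2. \tag{12}$$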

## III New Linear Convergence Results for EXTRA and NIDS

Throughout this section, we assume that Assumptions 1-3 hold.

### III-A Linear Convergence of EXTRA

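For simplicity, we introduce some notation. Because of part 4 of Assumption 1, given mixing matrices $W$ and $\tilde W$, there is a constant

$$\theta\in\left(\frac34,\ \min\left\{\frac{1}{1-\lambda_{\min}(\tilde W)},\,1\right\}\right]$$

such that

$$\bar W\coloneqq\theta\tilde W+(1-\theta)I\succcurlyeq 0, \tag{13}$$

$$H\coloneqq\frac{I+\tilde W}{2}\succ 0, \tag{14}$$

$$M\coloneqq(\tilde W-W)^\dagger\succcurlyeq 0, \tag{15}$$

$$G\coloneqq W+I-2\tilde W\succcurlyeq 0. \tag{16}$$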

Based on (13), we have

$$\tilde W=\bar W-(1-\theta)(I-\tilde W). \tag{17}$$

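Let $(\mathbf{x}^*,\mathbf{y}^*)$ be a fixed point of (7). It is straightforward to show that $\mathbf{x}^*$ satisfies

$$(\tilde W-W)\mathbf{x}^*=\mathbf{0}. \tag{18}$$

Part 3 of Assumption 1 shows that $\mathbf{x}^*$ is consensual, i.e., $\mathbf{x}^*=\mathbf{1}(x^*)^\top$ for a certain $x^*\in\mathbb{R}^p$. The $\mathbf{y}$-iteration in (7b) and the initialization of $\mathbf{y}^0$ show $\mathbf{1}^\top\mathbf{y}^*=\mathbf{0}$. Then we have $\mathbf{1}^\top\nabla f(\mathbf{x}^*)=\mathbf{0}$, i.e., $\sum_{i=1}^n\nabla f_i(x^*)=0$. Thus, $x^*$ is the optimal solution to the problem (1).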

###### Lemma 1 (Norm over range space [13, Lemma 3])

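For any symmetric positive (semi)definite matrix $A\in\mathbb{R}^{n\times n}$ with rank $r$ ($r\le n$), let $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_r>0$ be its nonzero eigenvalues. Then $\mathrm{Range}\{A\}$ is an $rp$-dimensional subspace in $\mathbb{R}^{n\times p}$ and has a norm defined by $\|\mathbf{x}\|^2_{A^\dagger}\coloneqq\langle\mathbf{x},A^\dagger\mathbf{x}\rangle$, where $A^\dagger$ is the pseudo inverse of $A$. In addition, $\lambda_1^{-1}\|\mathbf{x}\|^2\le\|\mathbf{x}\|^2_{A^\dagger}\le\lambda_r^{-1}\|\mathbf{x}\|^2$ for all $\mathbf{x}\in\mathrm{Range}\{A\}$.

For simplicity, we let $\mathbf{x}^+$ and $\mathbf{x}$ stand for $\mathbf{x}^{k+1}$ and $\mathbf{x}^k$, respectively, in the proofs. The same simplification applies to $\mathbf{y}$.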

###### Lemma 2 (Norm equality)

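Let $\{(\mathbf{x}^k,\mathbf{y}^k)\}$ be the sequence generated by (7). Then it satisfies

$$\|\mathbf{x}^{k+1}-\mathbf{x}^*\|^2_{\tilde W-W}=\|\mathbf{y}^{k}-\mathbf{y}^{k+1}\|^2_{M}. \tag{19}$$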
###### Proof:

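From Remark 3, we have

$$\mathbf{x}^+=M(\mathbf{y}-\mathbf{y}^+)+\mathbf{z}^+ \tag{20}$$

for some $\mathbf{z}^+\in\mathrm{Null}\{\tilde W-W\}$. This equality and (18) give

$$\begin{aligned}
\|\mathbf{x}^+-\mathbf{x}^*\|^2_{\tilde W-W}&=\langle\mathbf{x}^+-\mathbf{x}^*,(\tilde W-W)(\mathbf{x}^+-\mathbf{x}^*)\rangle\\
&=\langle\mathbf{x}^+,(\tilde W-W)\mathbf{x}^+\rangle\\
&=\langle M(\mathbf{y}-\mathbf{y}^+),\mathbf{y}-\mathbf{y}^+\rangle\\
&=\|\mathbf{y}-\mathbf{y}^+\|^2_M,
\end{aligned}$$

where the third equality holds because of (15), (20), and $(\tilde W-W)\mathbf{z}^+=\mathbf{0}$.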

###### Lemma 3 (A key inequality for EXTRA)

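Let $\{(\mathbf{x}^k,\mathbf{y}^k)\}$ be the sequence generated by (7). Then we have

$$\begin{aligned}
&\|\mathbf{x}^{k+1}-\mathbf{x}^*\|^2_H+\|\mathbf{y}^{k+1}-\mathbf{y}^*\|^2_M\\
\le{}&\|\mathbf{x}^{k}-\mathbf{x}^*\|^2_H+\|\mathbf{y}^{k}-\mathbf{y}^*\|^2_M-\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|^2_{\bar W}\\
&-\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|^2_{(\theta-\frac34)(I-\tilde W)}-\|\mathbf{x}^{k+1}-\mathbf{x}^*\|^2_G\\
&-2\alpha\langle\mathbf{x}^{k+1}-\mathbf{x}^*,\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^*)\rangle.
\end{aligned} \tag{21}$$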
###### Proof:

The iteration (7) and equation (17) show

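$$\begin{aligned}
&2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle\\
={}&2\langle\mathbf{x}^+-\mathbf{x}^*,\tilde W(\mathbf{x}-\mathbf{x}^+)+\tilde W(\mathbf{x}^+-\mathbf{x}^*)-(\mathbf{x}^+-\mathbf{x}^*)+(\mathbf{y}-\mathbf{y}^*)\rangle\\
={}&2\langle\mathbf{x}^+-\mathbf{x}^*,\tilde W(\mathbf{x}-\mathbf{x}^+)+(\tilde W-I)(\mathbf{x}^+-\mathbf{x}^*)+(\tilde W-W)(\mathbf{x}^+-\mathbf{x}^*)+\mathbf{y}^+-\mathbf{y}^*\rangle\\
={}&2\langle\mathbf{x}^+-\mathbf{x}^*,\tilde W(\mathbf{x}-\mathbf{x}^+)\rangle+2\langle\mathbf{x}^+-\mathbf{x}^*,\mathbf{y}^+-\mathbf{y}^*\rangle-2\|\mathbf{x}^+-\mathbf{x}^*\|^2_G\\
={}&2\langle\mathbf{x}^+-\mathbf{x}^*,\bar W(\mathbf{x}-\mathbf{x}^+)\rangle-2\langle\mathbf{x}^+-\mathbf{x}^*,(1-\theta)(I-\tilde W)(\mathbf{x}-\mathbf{x}^+)\rangle\\
&+2\langle\mathbf{x}^+-\mathbf{x}^*,\mathbf{y}^+-\mathbf{y}^*\rangle-2\|\mathbf{x}^+-\mathbf{x}^*\|^2_G,
\end{aligned} \tag{22}$$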

where the first equality comes from (7a), the second one follows (7b), and the last one is from (17).

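From Remark 3, $\mathbf{x}^+=M(\mathbf{y}-\mathbf{y}^+)+\mathbf{z}^+$ for some $\mathbf{z}^+\in\mathrm{Null}\{\tilde W-W\}$. Thus

$$\langle\mathbf{z}^+-\mathbf{x}^*,\mathbf{y}^+-\mathbf{y}^*\rangle=0,$$

and the equality (22) can be rewritten as

$$\begin{aligned}
&2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle\\
={}&2\langle\mathbf{x}^+-\mathbf{x}^*,\bar W(\mathbf{x}-\mathbf{x}^+)\rangle-2\langle\mathbf{x}^+-\mathbf{x}^*,(1-\theta)(I-\tilde W)(\mathbf{x}-\mathbf{x}^+)\rangle\\
&+2\langle M(\mathbf{y}-\mathbf{y}^+),\mathbf{y}^+-\mathbf{y}^*\rangle-2\|\mathbf{x}^+-\mathbf{x}^*\|^2_G.
\end{aligned}$$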

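Using the basic equality

$$2\langle a-b,b-c\rangle=\|a-c\|^2-\|a-b\|^2-\|b-c\|^2$$

and Lemma 2, we have

$$\begin{aligned}
&\|\mathbf{x}^+-\mathbf{x}^*\|^2_{\bar W}-\|\mathbf{x}^+-\mathbf{x}^*\|^2_{(1-\theta)(I-\tilde W)}+\|\mathbf{y}^+-\mathbf{y}^*\|^2_M\\
={}&\|\mathbf{x}-\mathbf{x}^*\|^2_{\bar W}-\|\mathbf{x}-\mathbf{x}^*\|^2_{(1-\theta)(I-\tilde W)}+\|\mathbf{y}-\mathbf{y}^*\|^2_M-\|\mathbf{x}-\mathbf{x}^+\|^2_{\bar W}\\
&+\|\mathbf{x}-\mathbf{x}^+\|^2_{(1-\theta)(I-\tilde W)}-\|\mathbf{x}^+-\mathbf{x}^*\|^2_{\tilde W-W}-2\|\mathbf{x}^+-\mathbf{x}^*\|^2_G\\
&-2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle.
\end{aligned} \tag{23}$$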

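Note that the following inequality holds:

$$\frac12\|\mathbf{x}^+-\mathbf{x}^*\|^2_{\tilde W-W}\le\|\mathbf{x}^+-\mathbf{x}^*\|^2_{\tilde W-W}+\frac12\|\mathbf{x}-\mathbf{x}^*\|^2_{\tilde W-W}-\frac14\|\mathbf{x}-\mathbf{x}^+\|^2_{\tilde W-W}.$$

Adding it to both sides of (23), we have

$$\begin{aligned}
&\|\mathbf{x}^+-\mathbf{x}^*\|^2_H-\frac12\|\mathbf{x}^+-\mathbf{x}^*\|^2_G+\|\mathbf{y}^+-\mathbf{y}^*\|^2_M\\
\le{}&\|\mathbf{x}-\mathbf{x}^*\|^2_H-\frac12\|\mathbf{x}-\mathbf{x}^*\|^2_G+\|\mathbf{y}-\mathbf{y}^*\|^2_M\\
&-\|\mathbf{x}-\mathbf{x}^+\|^2_{\bar W}-\|\mathbf{x}-\mathbf{x}^+\|^2_{(\theta-\frac34)(I-\tilde W)}+\frac14\|\mathbf{x}-\mathbf{x}^+\|^2_G-2\|\mathbf{x}^+-\mathbf{x}^*\|^2_G\\
&-2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle.
\end{aligned} \tag{24}$$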

Applying the inequality

$$\frac14\|\mathbf{x}-\mathbf{x}^+\|^2_G\le\frac12\|\mathbf{x}-\mathbf{x}^*\|^2_G+\frac12\|\mathbf{x}^+-\mathbf{x}^*\|^2_G,$$

we obtain the key inequality (21).

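In the following theorem, we assume $\bar W\succ 0$ (i.e., $\theta<\frac{1}{1-\lambda_{\min}(\tilde W)}$). It is easy to amend the proof to show the result for the remaining special case in which $\bar W$ is singular.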

###### Theorem 1 (Q-linear convergence of EXTRA)

Under Assumptions 1-3, we define

 r1= (25) r2= 12λmax(G¯¯¯¯¯W−1)>0, (26) r3= r1r2r1+r2+r1r2∈(0,1), (27)

and choose two small parameters and such that

 ξ∈ (0,min{r34λmax(¯¯¯¯¯WM),1}), (28) η∈ (29)

 P\coloneqq H+ξ2(I−W)≻0, Q\coloneqq M+(r3−2ξλmax(¯¯¯¯¯¯WM))¯¯¯¯¯¯W−1≻0.

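Then for any stepsize $\alpha\in\left(0,\frac{2\lambda_{\min}(\bar W)}{L}\right)$ we have

$$\|\mathbf{x}^{k+1}-\mathbf{x}^*\|^2_P+\|\mathbf{y}^{k+1}-\mathbf{y}^*\|^2_Q\le\rho\left(\|\mathbf{x}^{k}-\mathbf{x}^*\|^2_P+\|\mathbf{y}^{k}-\mathbf{y}^*\|^2_Q\right), \tag{30}$$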

where

 ρ\coloneqqmax (31) (4α−2α2Lλmin(¯¯¯¯¯W))ηξ, 1−r3−4ξλmax(¯¯¯¯¯WM)r3+(1−2ξ)λmax(¯¯¯¯¯WM)}.
###### Proof:

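From (21) in Lemma 3, we have

$$\begin{aligned}
&\|\mathbf{x}^+-\mathbf{x}^*\|^2_H+\|\mathbf{y}^+-\mathbf{y}^*\|^2_M\\
\le{}&\|\mathbf{x}-\mathbf{x}^*\|^2_H+\|\mathbf{y}-\mathbf{y}^*\|^2_M-\|\mathbf{x}-\mathbf{x}^+\|^2_{\bar W}\\
&-\|\mathbf{x}-\mathbf{x}^+\|^2_{(\theta-\frac34)(I-\tilde W)}-\|\mathbf{x}^+-\mathbf{x}^*\|^2_G\\
&-2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle.
\end{aligned} \tag{32}$$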

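Then we find an upper bound of $-\|\mathbf{x}-\mathbf{x}^+\|^2_{\bar W}-2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle$ as

$$\begin{aligned}
&-\|\mathbf{x}-\mathbf{x}^+\|^2_{\bar W}-2\alpha\langle\mathbf{x}^+-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle\\
={}&-2\alpha\langle\mathbf{x}-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle+\alpha^2\|\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\|^2_{\bar W^{-1}}\\
&-\|\bar W(\mathbf{x}-\mathbf{x}^+)-\alpha(\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*))\|^2_{\bar W^{-1}}\\
\le{}&-\left(2\alpha-\frac{\alpha^2L}{\lambda_{\min}(\bar W)}\right)\langle\mathbf{x}-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle\\
&-\|\bar W(\mathbf{x}-\mathbf{x}^+)-\alpha(\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*))\|^2_{\bar W^{-1}},
\end{aligned}$$

where the inequality comes from (12) together with $\|\cdot\|^2_{\bar W^{-1}}\le\frac{1}{\lambda_{\min}(\bar W)}\|\cdot\|^2$. Combining it with (32), we have

$$\begin{aligned}
&\|\mathbf{x}^+-\mathbf{x}^*\|^2_H+\|\mathbf{y}^+-\mathbf{y}^*\|^2_M-\|\mathbf{x}-\mathbf{x}^*\|^2_H-\|\mathbf{y}-\mathbf{y}^*\|^2_M\\
\le{}&-\left(2\alpha-\frac{\alpha^2L}{\lambda_{\min}(\bar W)}\right)\langle\mathbf{x}-\mathbf{x}^*,\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*)\rangle\\
&-\|\bar W(\mathbf{x}-\mathbf{x}^+)-\alpha(\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*))\|^2_{\bar W^{-1}}\\
&-\|\mathbf{x}-\mathbf{x}^+\|^2_{(\theta-\frac34)(I-\tilde W)}-\|\mathbf{x}^+-\mathbf{x}^*\|^2_G.
\end{aligned} \tag{33}$$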

The inequality (33) shows that $\{(\mathbf{x}^k,\mathbf{y}^k)\}$ is a Cauchy sequence converging to the fixed point of (7). From (11), we can bound the first term on the right-hand side of (33) as

 −(2α−α2Lλmin(¯¯¯¯¯W))⟨x−x∗,∇f(x)−∇f(x∗)⟩ ≤ −(2