# Distributed Conjugate Gradient Tracking for Resource Allocation in Unbalanced Networks

This paper proposes a distributed conjugate gradient tracking algorithm (DCGT) to solve resource allocation problems in a possibly unbalanced network, where each node of the network computes its optimal resource via interacting only with its neighboring nodes. Our key idea is the novel use of the celebrated AB algorithm to the dual of the resource allocation problem. To study the convergence of DCGT, we first establish the sublinear convergence of AB for non-convex objective functions, which advances the existing results on AB as they require the strong-convexity of objective functions. Then we show that DCGT converges linearly for strongly convex and Lipschitz smooth objective functions, and sublinearly without the Lipschitz smoothness. Finally, simulation results validate that DCGT outperforms state-of-the-art algorithms in distributed resource allocation problems.

## I Introduction

The distributed resource allocation problem (DRAP) is concerned with optimally allocating resources among multiple nodes of a directed network. Specifically, each node is associated with a local privacy-preserved objective function that measures the cost of its allocated resource, and the global goal is to jointly minimize the total cost. The key feature of the DRAP is that each node computes its optimal resource via interacting only with its neighboring nodes in the network. A typical application is economic dispatch, where the local cost function is often quadratic [1]. See [2, 3, 4, 5] for other applications.

### I-A Literature review

Research on DRAPs can be categorized by whether the underlying network is balanced or not. A network is balanced if the “amount” of information flowing into any node equals the amount flowing out of it, a property that is critical to algorithm design. Most early works on DRAPs focus on balanced networks, while recent interest has shifted to the unbalanced case.

The central-free algorithm (CFA) in [2] is the first documented result on DRAPs in balanced networks where at each iteration every node updates its decision variable using the weighted error between gradients of its local objective function and those of its neighbors. The CFA can be accelerated by designing an optimal weighting matrix [3]. It is shown that the CFA achieves a linear convergence rate for strongly convex and Lipschitz smooth cost functions. For time-varying networks, the CFA is shown to converge sublinearly in the absence of strong convexity [4]. This rate is further improved in [6] by optimizing its dependence on the number of nodes. In addition, there are also several ADMM based methods that only work on balanced networks [7, 8, 9]. By exploiting the mirror relationship between the distributed optimization and distributed resource allocation, several accelerated distributed resource allocation algorithms are given in [10]. Moreover, the works [11] and [12] study continuous-time algorithms for DRAPs by using control theory tools.

For unbalanced networks, the algorithm design for DRAPs is much more complicated, which has been widely acknowledged in the distributed optimization literature [13]. In this case, a consensus-based algorithm that adopts the celebrated surplus idea [14] is proposed in [1] and [15]. However, their convergence results hold only for quadratic cost functions, where linear system theory applies directly. The extension to general convex functions is performed in [16] by adopting the nonnegative surplus method, at the expense of a slower convergence rate. ADMM-based algorithms are developed in [17, 18], and algorithms that handle communication delays in time-varying networks and perform event-triggered updates are studied in [19] and [20], respectively. We note that none of the above-mentioned works [1, 15, 16, 19, 20, 17, 18] provides an explicit convergence rate. In contrast, the DCGT of this work is proved to achieve a linear convergence rate for strongly convex and Lipschitz smooth cost functions, and a sublinear convergence rate without the Lipschitz smoothness.

Several recent works establish convergence rates for their algorithms over unbalanced networks. Most of them leverage the dual relationship between DRAPs and distributed optimization problems. For example, the algorithms in [21] and [22] use stochastic gradients and diminishing stepsizes to solve the dual problem of DRAPs, and thus their convergence rates are limited to an order of for Lipschitz smooth cost functions. [22] also shows a rate of even if the cost function is strongly convex. An algorithm with linear convergence is recently proposed in [23] for strongly convex and Lipschitz smooth cost functions. However, its convergence rate is unclear if either the strong convexity or the Lipschitz smoothness is removed. In [9], a push-sum-based algorithm is combined with the ADMM. Although it can handle time-varying networks, the convergence rate is even for strongly convex and Lipschitz smooth functions.

### I-B Our contributions

In this work, we propose a distributed conjugate gradient tracking algorithm (DCGT) to solve DRAPs over unbalanced networks. The DCGT exploits the duality between DRAPs and distributed optimization problems via the convex conjugate function of DRAPs, and takes advantage of the state-of-the-art distributed AB algorithm [24, 25]. When the cost function is strongly convex and Lipschitz smooth, we show that the DCGT converges at a linear rate. If the Lipschitz smoothness assumption is removed, we show that the decision variable in each node of the DCGT converges to its optimal value at a sublinear rate of $\mathcal{O}(1/k)$. To the best of our knowledge, these convergence results have only been established for undirected balanced networks in [10]. Although a distributed algorithm for directed networks is also proposed in [10], there is no convergence result. We finally illustrate the advantages of DCGT over existing algorithms via simulation.

To establish the sublinear convergence of the DCGT, we first show that the distributed AB algorithm converges sublinearly to a stationary point even for non-convex objective functions. Clearly, this advances existing works [24, 26, 27], as their convergence results hold only for strongly convex objective functions. In fact, the convergence proofs for AB in [24, 26, 27] depend on a complicated 3-dimensional matrix and derive a linear convergence rate determined by the spectral radius of this matrix. This approach is no longer applicable, since a linear convergence rate is impossible for the general non-convex case and hence the spectral radius of such a matrix cannot be strictly less than 1. Moreover, we interpret the DCGT with the celebrated surplus-based average consensus (SBAC) algorithm [14], which provides insights into the DCGT based on the optimality condition.

The rest of this paper is organized as follows. In Section II, we formulate the constrained DRAP in detail. Section III provides the DCGT algorithm for solving DRAPs over unbalanced directed networks, and interprets it as a surplus-based gradient consensus algorithm or a distributed gradient tracking algorithm (AB). In Section IV, we conduct convergence analysis of the DCGT. In particular, the convergence result of AB for non-convex objective functions is provided. Section V performs numerical experiments to validate the effectiveness of the DCGT. Finally, we draw concluding remarks in Section VI.

Notation: We use a lowercase letter $x$, a bold letter $\mathbf{x}$, and an uppercase letter $X$ to denote a scalar, vector, and matrix, respectively. $\mathbf{x}^T$ denotes the transpose of $\mathbf{x}$. $[X]_{ij}$ denotes the element in the $i$-th row and $j$-th column of $X$. For vectors we use $\|\cdot\|$ to denote the $\ell_2$-norm, and for matrices we use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the spectral norm and the Frobenius norm, respectively. $|\mathcal{X}|$ denotes the cardinality of the set $\mathcal{X}$. $\mathbb{R}^n$ denotes the set of all $n$-dimensional real vectors. $\mathbf{1}$ denotes the vector with all ones, the dimension of which depends on the context. We use $\nabla f(\mathbf{x})$ to denote the gradient of a differentiable function $f$ at $\mathbf{x}$. We say a nonnegative matrix $X$ is row-stochastic if $X\mathbf{1}=\mathbf{1}$, and column-stochastic if $X^T$ is row-stochastic. We use $\mathcal{O}(\cdot)$ to denote the big-O notation.

## II Problem formulation

Consider the distributed resource allocation problem (DRAP) with $n$ nodes, where each node $i$ has a local privacy-preserved cost function $F_i:\mathbb{R}^m\to\mathbb{R}$. The goal is to solve the following optimization problem in a distributed manner:

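$$\begin{aligned}\underset{w_1,\cdots,w_n\in\mathbb{R}^m}{\text{minimize}}\quad&\sum_{i=1}^n F_i(w_i)\\ \text{subject to}\quad& w_i\in\mathcal{W}_i,\quad\sum_{i=1}^n w_i=\sum_{i=1}^n d_i\end{aligned}\tag{1}$$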

where $w_i\in\mathbb{R}^m$ is the local decision vector of node $i$, representing the resources allocated to $i$. $\mathcal{W}_i$ is a local convex and closed constraint set. $d_i$ denotes the resource demand of node $i$. Both $\mathcal{W}_i$ and $d_i$ are only known to node $i$. Let $d\triangleq\sum_{i=1}^n d_i$; then $d$ denotes the total available resources, showing the coupling among nodes.

###### Remark 1

Problem (1) covers many forms of DRAPs considered in the literature. For example, the standard local constraint $\mathcal{W}_i=[\underline{w}_i,\overline{w}_i]$ for some constants $\underline{w}_i$ and $\overline{w}_i$ is a one-dimensional special case of (1), see e.g. [16, 15, 1, 19, 23]. Moreover, the coupling constraint can be weighted, in which case it is transformed into (1) by defining a new variable. In addition, many works only consider quadratic cost functions [15, 1].

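Solving (1) distributedly means that each node can only communicate and exchange information with a subset of nodes via a communication network, which is modeled by a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Here $\mathcal{V}=\{1,\cdots,n\}$ denotes the set of nodes, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ denotes the set of edges, and $(i,j)\in\mathcal{E}$ if node $i$ can send information to node $j$. Note that $(i,j)\in\mathcal{E}$ does not necessarily imply that $(j,i)\in\mathcal{E}$. Define $\mathcal{N}_i^{\text{in}}=\{j\mid(j,i)\in\mathcal{E}\}$ and $\mathcal{N}_i^{\text{out}}=\{j\mid(i,j)\in\mathcal{E}\}$ to be the set of in-neighbors and out-neighbors of node $i$, respectively. That is, node $i$ can only receive messages from its in-neighbors and send messages to its out-neighbors. Let $a_{ij}>0$ if $(j,i)\in\mathcal{E}$, and $a_{ij}=0$ otherwise. $\mathcal{G}$ is balanced if $\sum_{j\in\mathcal{N}_i^{\text{in}}}a_{ij}=\sum_{j\in\mathcal{N}_i^{\text{out}}}a_{ji}$ for all $i\in\mathcal{V}$.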

The following assumptions are made throughout the paper.

###### Assumption 1 (Strong convexity and Slater’s condition)
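1. The local cost function $F_i$ is $\mu$-strongly convex for all $i\in\mathcal{V}$, i.e., for any $w_1,w_2\in\mathbb{R}^m$ and $\theta\in[0,1]$,

 $$F_i(\theta w_1+(1-\theta)w_2)\le\theta F_i(w_1)+(1-\theta)F_i(w_2)-\frac{\mu}{2}\theta(1-\theta)\|w_1-w_2\|^2.\tag{2}$$
2. The constraint $\sum_{i=1}^n w_i=\sum_{i=1}^n d_i$ is satisfied for some point in the relative interior of the Cartesian product $\mathcal{W}_1\times\cdots\times\mathcal{W}_n$.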

###### Assumption 2 (Strongly connected network)

$\mathcal{G}$ is strongly connected, i.e., there exists a directed path from any node $i$ to any node $j$.
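Assumption 2 can be checked offline for a given topology. The following is a minimal sketch, not part of the paper: it assumes nodes are labeled `0..n-1` and edges are given as pairs `(i, j)` meaning node `i` can send to node `j`. The function names and the two example graphs are hypothetical.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of nodes reachable from `start` via BFS over adjacency lists."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_strongly_connected(n, edges):
    """edges: iterable of (i, j), meaning node i can send information to node j."""
    fwd = {i: [] for i in range(n)}
    rev = {i: [] for i in range(n)}
    for i, j in edges:
        fwd[i].append(j)
        rev[j].append(i)
    # Node 0 must reach every node, and every node must reach node 0.
    return len(reachable(fwd, 0)) == n and len(reachable(rev, 0)) == n

ring_with_chord = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # directed ring plus one extra edge
chain = [(0, 1), (1, 2), (2, 3)]                            # no path back to node 0
```

The ring with a chord is strongly connected but not balanced under uniform weights, which is the kind of network the DCGT targets; the chain violates Assumption 2.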

Assumption 1 is common in the literature. Note that we do not assume the differentiability of $F_i$. Under Assumption 1, the optimal point of (1) is unique. We denote its optimal value and optimal point by $F^\star$ and $w_1^\star,\cdots,w_n^\star$, i.e., $F^\star=\sum_{i=1}^n F_i(w_i^\star)$. Assumption 2 is also common and necessary for the information mixing over a network.

## III The Distributed Conjugate Gradient Tracking Algorithm

This section presents the distributed conjugate gradient tracking algorithm (DCGT) to solve (1) over a directed network, and gives two interpretations of the DCGT that shed light on its design. In particular, the DCGT can be explained as a surplus-based gradient consensus algorithm, or a distributed gradient tracking method (AB).

### III-A The DCGT

The DCGT is summarized in Algorithm 1, where each node computes the following update

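$$\begin{aligned}\bar{w}_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}a_{ij}\big(\bar{w}_k^{(j)}+\alpha s_k^{(j)}\big),&&\text{(3a)}\\ w_{k+1}^{(i)}&=\underset{w\in\mathcal{W}_i}{\operatorname{argmin}}\big\{F_i(w)-w^T\bar{w}_{k+1}^{(i)}\big\},&&\text{(3b)}\\ s_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}b_{ij}s_k^{(j)}-\big(w_{k+1}^{(i)}-w_k^{(i)}\big).&&\text{(3c)}\end{aligned}$$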

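Each node $i$ keeps updating three vectors $\bar{w}_k^{(i)}$, $w_k^{(i)}$ and $s_k^{(i)}$ in (3) iteratively. In particular, at each iteration node $i$ receives $\bar{w}_k^{(j)}+\alpha s_k^{(j)}$ and $b_{ij}s_k^{(j)}$ from each of its in-neighbors $j$, and updates $\bar{w}_{k+1}^{(i)}$ according to (3a), where it is satisfied that $a_{ij}>0$ for any $j\in\mathcal{N}_i^{\text{in}}$ and $\sum_{j}a_{ij}=1$. $\alpha$ is a positive stepsize. The update of $s_{k+1}^{(i)}$ in (3c) is similar, where $b_{ij}>0$ for any $j\in\mathcal{N}_i^{\text{in}}$ and $\sum_{i}b_{ij}=1$. Let $a_{ij}=b_{ij}=0$ for any $(j,i)\notin\mathcal{E}$ for convenience, and define two matrices $A=[a_{ij}]$ and $B=[b_{ij}]$; then $A$ is a row-stochastic matrix and $B$ is a column-stochastic matrix. Clearly, the directed network associated with $A$ and $B$ can be unbalanced.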

###### Remark 2

Using both a row-stochastic and a column-stochastic matrix handles the unbalancedness of directed networks, as in [13, 28]. In implementation, one can simply set $a_{ij}=1/|\mathcal{N}_i^{\text{in}}|$ and $b_{ij}=1/|\mathcal{N}_j^{\text{out}}|$, and then both conditions are satisfied. Note that this method requires each node to know the number of its out-neighbors, which is a common requirement in the literature of distributed optimization over directed networks [29, 28, 30].
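The uniform weighting rule of Remark 2 can be sketched as follows. This is an illustrative construction rather than the authors' code: it assumes every node also counts itself as an in- and out-neighbor (a self-loop), and the helper name `build_weights` and the example edge list are hypothetical.

```python
def build_weights(n, edges):
    """Return (A, B) as nested lists for a digraph, with self-loops added.

    edges: iterable of (i, j), meaning node i can send information to node j.
    A[i][j] = 1/|N_i^in| (row-stochastic) and B[i][j] = 1/|N_j^out| (column-stochastic)
    whenever j is an in-neighbor of i; all other entries are zero.
    """
    links = set(edges) | {(i, i) for i in range(n)}   # assumption: self-loops
    in_deg = [sum(1 for (_, v) in links if v == i) for i in range(n)]
    out_deg = [sum(1 for (u, _) in links if u == j) for j in range(n)]
    A = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    for (j, i) in links:              # j sends to i, so j is an in-neighbor of i
        A[i][j] = 1.0 / in_deg[i]
        B[i][j] = 1.0 / out_deg[j]
    return A, B

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
A, B = build_weights(4, EDGES)
row_sums_A = [sum(row) for row in A]
col_sums_B = [sum(B[i][j] for i in range(4)) for j in range(4)]
col_sum_A0 = sum(A[i][0] for i in range(4))   # differs from 1: A is not doubly stochastic
```

Each row of `A` needs only the local in-degree, while each column of `B` needs the sender's out-degree, which matches the information requirement stated in the remark.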

The update of $w_{k+1}^{(i)}$ in (3b) requires finding an optimal point of a simple local optimization problem, which is similar to many duality-based optimization algorithms such as the dual ascent method [31], and can be readily solved by standard algorithms, e.g., the projected (sub)gradient method or Newton’s method. If the local constraint set $\mathcal{W}_i$ is the whole space and $F_i$ is differentiable, the solution can be expressed as

$$w_{k+1}^{(i)}=\nabla^{-1}F_i\big(\bar{w}_{k+1}^{(i)}\big)\tag{2b′}$$

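where $\nabla^{-1}F_i$ is the inverse function of $\nabla F_i$, i.e., $\nabla^{-1}F_i(\nabla F_i(w))=w$ for any $w\in\mathbb{R}^m$. Moreover, if the decision variable is a scalar and the local constraint set is the interval $[\underline{w}_i,\overline{w}_i]$ as in [16, 19, 1], then (3b) becomes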

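$$w_{k+1}^{(i)}=\begin{cases}\overline{w}_i,&\text{if }\nabla^{-1}F_i\big(\bar{w}_{k+1}^{(i)}\big)>\overline{w}_i\\ \underline{w}_i,&\text{if }\nabla^{-1}F_i\big(\bar{w}_{k+1}^{(i)}\big)<\underline{w}_i\\ \nabla^{-1}F_i\big(\bar{w}_{k+1}^{(i)}\big),&\text{otherwise.}\end{cases}$$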

Since the update rule (2b′) is adopted in [15, 1, 16, 19], their algorithms are special cases of (3b).
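For a scalar decision variable with an interval constraint, the local step (3b) reduces to inverting the gradient and clipping. The sketch below assumes a quadratic cost $F(w)=\tfrac{1}{2}aw^2+bw$ with $a>0$, which is an illustrative choice and not a restriction made by the paper; the function name and parameters are hypothetical.

```python
def local_update_scalar(w_bar, a, b, lo, hi):
    """Solve argmin over lo <= w <= hi of F(w) - w * w_bar for F(w) = 0.5*a*w**2 + b*w, a > 0."""
    unconstrained = (w_bar - b) / a          # inverse gradient: solves a*w + b = w_bar
    return min(max(unconstrained, lo), hi)   # clip onto the interval [lo, hi]

inside = local_update_scalar(3.0, 2.0, 1.0, 0.0, 5.0)    # gradient inverse lies in the interval
above = local_update_scalar(20.0, 2.0, 1.0, 0.0, 5.0)    # clipped at the upper bound
below = local_update_scalar(-5.0, 2.0, 1.0, 0.0, 5.0)    # clipped at the lower bound
```

Clipping is valid here because the objective in (3b) is a strictly convex function of a single variable, so its constrained minimizer is the projection of the unconstrained one onto the interval.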

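An interesting feature of the DCGT lies in the way it handles the coupling constraint $\sum_{i=1}^n w_i=\sum_{i=1}^n d_i$. Notice that the algorithm is initialized such that $\sum_{i=1}^n\big(s_0^{(i)}+w_0^{(i)}\big)=\sum_{i=1}^n d_i$, e.g., $s_0^{(i)}=d_i-w_0^{(i)}$. (If only the total resource demand is known to all nodes, a suitable initialization can still be obtained in a distributed manner [16].) By pre-multiplying (3c) with $\mathbf{1}^T$ and using the column-stochasticity of $B$, we obtain that $\sum_{i=1}^n s_k^{(i)}=\sum_{i=1}^n d_i-\sum_{i=1}^n w_k^{(i)}$ for all $k$. Thus, if $s_k^{(i)}$ converges to 0, then the constraint is satisfied asymptotically, which is essential to the convergence proof of the DCGT.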
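To make the bookkeeping in (3) concrete, here is a minimal sketch of the DCGT for scalar quadratic costs $F_i(w)=\tfrac{1}{2}a_iw^2+b_iw$ with $\mathcal{W}_i=\mathbb{R}$. Everything specific in it is an assumption for illustration: the 3-node digraph, the cost coefficients, the demands, the stepsize, and the initialization $\bar{w}_0^{(i)}=0$, $s_0^{(i)}=d_i-w_0^{(i)}$. It only demonstrates that summing (3c) over the nodes preserves $\sum_i(s_k^{(i)}+w_k^{(i)})$ when $B$ is column-stochastic.

```python
def dcgt_quadratic(A, B, a, b, d, alpha, iters):
    """Run (3) for scalar costs F_i(w) = 0.5*a_i*w**2 + b_i*w with W_i = R."""
    n = len(a)
    w_bar = [0.0] * n                                    # assumed initial value
    w = [(w_bar[i] - b[i]) / a[i] for i in range(n)]     # (3b) in closed form
    s = [d[i] - w[i] for i in range(n)]                  # assumed: s_0 + w_0 = d_i
    for _ in range(iters):
        w_bar = [sum(A[i][j] * (w_bar[j] + alpha * s[j]) for j in range(n))
                 for i in range(n)]                      # (3a)
        w_new = [(w_bar[i] - b[i]) / a[i] for i in range(n)]   # (3b)
        s = [sum(B[i][j] * s[j] for j in range(n)) - (w_new[i] - w[i])
             for i in range(n)]                          # (3c)
        w = w_new
    return w, s

# Hypothetical digraph with edges 0->1, 1->2, 2->0, 0->2 plus self-loops.
A = [[1/2, 0.0, 1/2],
     [1/2, 1/2, 0.0],
     [1/3, 1/3, 1/3]]        # each row sums to 1
B = [[1/3, 0.0, 1/2],
     [1/3, 1/2, 0.0],
     [1/3, 1/2, 1/2]]        # each column sums to 1
a, b, d = [1.0, 2.0, 4.0], [1.0, 0.0, -2.0], [3.0, 1.0, 3.0]
w, s = dcgt_quadratic(A, B, a, b, d, alpha=0.05, iters=200)
residual = sum(w) + sum(s) - sum(d)   # conserved quantity, zero up to rounding
```

The conservation check is independent of whether the chosen stepsize is small enough for convergence; the admissible stepsize range is the subject of Section IV.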

Next, we interpret the DCGT from two different perspectives.

### III-B Interpretation: Surplus-based gradient consensus

We first show that the DCGT can be regarded as the surplus-based consensus algorithm (SBCA) [14] applied to the local gradient $\nabla F_i\big(w_k^{(i)}\big)$. The SBCA is a celebrated average consensus algorithm, aiming to ensure that all nodes’ states achieve average consensus, i.e., every state converges to the average of the initial states. It does not involve any optimization problem, and has the following update rule (Eq. (4a) is slightly different from the original form in [14], though they are essentially equivalent):

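$$\begin{aligned}w_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}a_{ij}\big(w_k^{(j)}+\alpha s_k^{(j)}\big)&&\text{(4a)}\\ s_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}b_{ij}s_k^{(j)}-\big(w_{k+1}^{(i)}-w_k^{(i)}\big)&&\text{(4b)}\end{aligned}$$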

where $a_{ij}$ and $b_{ij}$ should satisfy the same condition as in the DCGT. The algorithm achieves average consensus over any strongly connected network, provided that $\alpha$ is sufficiently small. Roughly speaking, (4a) pushes all $w_k^{(i)}$ to consensus, (4b) keeps the sum $\sum_{i=1}^n\big(w_k^{(i)}+s_k^{(i)}\big)$ unchanged over $k$, and it is shown that $s_k^{(i)}\to0$. For more details the reader is referred to [14].

We now show its connection to the DCGT. To simplify notation, we assume $\mathcal{W}_i=\mathbb{R}^m$ and that $F_i$ is differentiable. By introducing the Lagrange multiplier $x\in\mathbb{R}^m$ to (1), the Lagrange function of problem (1) is given as follows

$$L(W,x)=\sum_{i=1}^n F_i(w_i)+x^T\Big(\sum_{i=1}^n w_i-d\Big)\tag{5}$$

where $W=[w_1,\cdots,w_n]^T$. Let $W^\star=[w_1^\star,\cdots,w_n^\star]^T$ be the optimal point of (1); the Karush-Kuhn-Tucker (KKT) conditions [32] imply that

$$\begin{aligned}&\nabla F_1(w_1^\star)=\cdots=\nabla F_n(w_n^\star)&&\text{(6a)}\\ &\sum_{i=1}^n w_i^\star=d.&&\text{(6b)}\end{aligned}$$

Note that this is a necessary and sufficient condition since Slater’s condition holds [32]. Therefore, the problem reduces to finding $W^\star$ satisfying (6).
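For intuition, the conditions (6) can be solved in closed form when the costs are scalar quadratics. The sketch below assumes $F_i(w)=\tfrac{1}{2}a_iw^2+b_iw$ with $a_i>0$ and no local constraints; the function name and the numbers are hypothetical and serve only as a centralized reference solution against which a distributed run could be compared.

```python
def solve_quadratic_drap(a, b, d):
    """Solve (6) for F_i(w) = 0.5*a_i*w**2 + b_i*w; returns (w_star, common_gradient)."""
    # (6a): a_i*w_i + b_i = lam for all i, so w_i = (lam - b_i)/a_i.
    # (6b): sum_i w_i = d determines lam.
    lam = (d + sum(bi / ai for ai, bi in zip(a, b))) / sum(1.0 / ai for ai in a)
    w_star = [(lam - bi) / ai for ai, bi in zip(a, b)]
    return w_star, lam

a, b, d = [1.0, 2.0, 4.0], [1.0, 0.0, -2.0], 7.0
w_star, lam = solve_quadratic_drap(a, b, d)
gradients = [ai * wi + bi for ai, wi, bi in zip(a, w_star, b)]
```

With the sign convention of (5), the common gradient value equals the negative of the optimal multiplier.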

The consensus condition of gradients (6a) motivates us to use the SBCA. The goal is to achieve the consensus of local gradients while keeping the sum of local states fixed. Thus, it is natural to replace $w_k^{(i)}$ in (4a) with $\nabla F_i\big(w_k^{(i)}\big)$ and leave (4b) unchanged. Then, (4) becomes

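$$\begin{aligned}\nabla F_i\big(w_{k+1}^{(i)}\big)&=\sum_{j\in\mathcal{N}_i^{\text{in}}}a_{ij}\big(\nabla F_j\big(w_k^{(j)}\big)+\alpha s_k^{(j)}\big)&&\text{(7)}\\ s_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}b_{ij}s_k^{(j)}-\big(w_{k+1}^{(i)}-w_k^{(i)}\big)&&\text{(8)}\end{aligned}$$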

which is exactly (3) with (2b′) by introducing a variable $\bar{w}_k^{(i)}=\nabla F_i\big(w_k^{(i)}\big)$.

Although this interpretation helps to understand the DCGT, the convergence analysis of the SBCA is based on the linear system theory, which is no longer applicable to the DCGT since the gradient terms in (3) generally introduce nonlinearity. To prove the convergence of the DCGT, we interpret it as a distributed optimization algorithm with gradient tracking over an unbalanced network in the next subsection, and leverage the interpretation to derive the convergence rate in Section IV.

### III-C Interpretation: Distributed optimization with gradient tracking

We now interpret the DCGT in the context of distributed optimization over directed networks. This observation is very helpful to prove its convergence, and importantly, show its convergence rate.

Consider the dual problem of (1), which is given by

$$\underset{x\in\mathbb{R}^m}{\text{maximize}}\quad\inf_{W\in\mathcal{W}}L(W,x)\tag{9}$$

where $L(W,x)$ is the Lagrange function defined in (5) and $\mathcal{W}=\mathcal{W}_1\times\cdots\times\mathcal{W}_n$. Strong duality holds since Slater’s condition is satisfied [32], and hence problem (9) is equivalent to (1). Moreover, the objective function in (9) can be written as

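$$\begin{aligned}\inf_{W\in\mathcal{W}}L(W,x)&=\inf_{W\in\mathcal{W}}\sum_{i=1}^n\big(F_i(w_i)+x^Tw_i\big)-x^Td\\ &=\sum_{i=1}^n\inf_{w_i\in\mathcal{W}_i}\big\{F_i(w_i)+x^Tw_i\big\}-x^Td\\ &=\sum_{i=1}^n-F_i^*(-x)-x^Td\end{aligned}\tag{10}$$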

where $F_i^*(x)\triangleq\sup_{w\in\mathcal{W}_i}\{w^Tx-F_i(w)\}$ is the convex conjugate function of $F_i$. Thus, the dual problem (9) can be rewritten as a convex optimization problem

$$\underset{x\in\mathbb{R}^m}{\text{minimize}}\quad f(x)\triangleq\sum_{i=1}^n f_i(x),\qquad f_i(x)\triangleq F_i^*(-x)+\frac{x^Td}{n}\tag{11}$$

or equivalently,

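$$\begin{aligned}\underset{x_1,\cdots,x_n\in\mathbb{R}^m}{\text{minimize}}\quad&\sum_{i=1}^n f_i(x_i)\\ \text{subject to}\quad& x_1=\cdots=x_n.\end{aligned}\tag{12}$$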

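Problem (12) is equivalent to problem (1) in the sense that the optimal value of (12) is $-F^\star$ and the optimal point $x_1^\star=\cdots=x_n^\star=x^\star$ of (12) satisfies $w_i^\star=\operatorname{argmin}_{w\in\mathcal{W}_i}\{F_i(w)+w^Tx^\star\}$, which implies $\nabla F_i(w_i^\star)=-x^\star$ if $F_i$ is differentiable and $\mathcal{W}_i=\mathbb{R}^m$. Hence, we can simply focus on solving the dual problem (12), which is widely studied in the context of distributed optimization.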

Since $F_i$ is strongly convex, we have that $F_i^*$ is differentiable and has Lipschitz continuous gradients [32], and the supremum in the definition of $F_i^*$ is attainable. From Danskin’s theorem [31], the gradient of $F_i^*$ is given by

$$\nabla F_i^*(x)=\underset{w\in\mathcal{W}_i}{\operatorname{argmax}}\big\{x^Tw-F_i(w)\big\}.\tag{13}$$

Thus, it follows from (11) that

$$\nabla f_i(x)=-\nabla F_i^*(-x)+\frac{d}{n}=-\underset{w\in\mathcal{W}_i}{\operatorname{argmin}}\big\{x^Tw+F_i(w)\big\}+\frac{1}{n}d.\tag{14}$$
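As a sanity check of (13) and (14), the conjugate gradient has a closed form for a scalar quadratic. The sketch below assumes $F(w)=\tfrac{1}{2}aw^2+bw$ on $\mathcal{W}=\mathbb{R}$, for which $F$ is $a$-strongly convex; the function names and test values are hypothetical.

```python
def grad_conjugate(x, a, b):
    """Gradient of F*(x) for F(w) = 0.5*a*w**2 + b*w on W = R, i.e. argmax_w {x*w - F(w)}."""
    return (x - b) / a        # stationarity: x - a*w - b = 0

def grad_dual_local(x, a, b, d, n):
    """Gradient of f_i(x) = F_i*(-x) + x*d/n, following (14)."""
    return -grad_conjugate(-x, a, b) + d / n

a, b, d, n = 2.0, 1.0, 6.0, 3
x, y = 0.7, -1.3
# The conjugate gradient inverts the primal gradient: grad F(grad F*(x)) = x.
roundtrip = a * grad_conjugate(x, a, b) + b
# The dual gradient changes by at most |x - y| / a, matching a 1/mu bound with mu = a.
dual_gap = abs(grad_dual_local(x, a, b, d, n) - grad_dual_local(y, a, b, d, n))
```

In this quadratic case the dual gradient is affine in `x`, so the Lipschitz bound holds with equality.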

There are several distributed algorithms to solve (12) over unbalanced and directed networks, such as [29, gradient-push], [33, Push-DIGing], [28, D-DGD], [34, DEXTRA], [35], and their asynchronous counterparts [36, AsySPA], [37, APPG] and [38, 39]. By using the idea of gradient tracking, [24] and [25] propose the distributed AB algorithm (called push-pull gradient in [25]), which achieves a linear convergence rate if the objective function is strongly convex and Lipschitz smooth. Moreover, the linear update rule of AB is easier to implement than its competitors (e.g. [33]). Therefore, we adopt AB to solve (12), which has the following update rule,

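$$\begin{aligned}x_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}a_{ij}\big(x_k^{(j)}-\alpha y_k^{(j)}\big)\\ y_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}b_{ij}y_k^{(j)}+\nabla f_i\big(x_{k+1}^{(i)}\big)-\nabla f_i\big(x_k^{(i)}\big)\end{aligned}\tag{15}$$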

where $a_{ij}$ and $b_{ij}$ are positive weights satisfying the same condition as those of the DCGT, $\alpha$ is a sufficiently small stepsize, and $x_0^{(i)}$ and $y_0^{(i)}$ are initialized such that $y_0^{(i)}=\nabla f_i\big(x_0^{(i)}\big)$.

Plugging the gradient (14), which results from the conjugate function $F_i^*$, into (15) and letting $\hat{x}_k^{(i)}$ denote the minimizer appearing in (14), we have

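$$\begin{aligned}x_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}a_{ij}\big(x_k^{(j)}-\alpha y_k^{(j)}\big)\\ \hat{x}_{k+1}^{(i)}&=\underset{w\in\mathcal{W}_i}{\operatorname{argmin}}\big\{w^Tx_k^{(i)}+F_i(w)\big\}\\ y_{k+1}^{(i)}&=\sum_{j\in\mathcal{N}_i^{\text{in}}}b_{ij}y_k^{(j)}+\hat{x}_k^{(i)}-\hat{x}_{k+1}^{(i)}\end{aligned}\tag{16}$$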

which is exactly (3) by letting $\bar{w}_k^{(i)}=-x_k^{(i)}$, $w_k^{(i)}=\hat{x}_k^{(i)}$, and $s_k^{(i)}=y_k^{(i)}$.

Then, the convergence of (3) follows from the convergence of AB. However, existing works [24, 40, 25, 26, 27] only show the convergence of AB for strongly convex and Lipschitz smooth objective functions. Note that $f_i$ in (12) is often not strongly convex due to the introduction of the convex conjugate function $F_i^*$, even though $F_i$ is strongly convex [32]. This is indeed the case for applications that include exponential terms [41] or logarithmic terms [42] in cost functions. In fact, we can only obtain that $f_i$ is differentiable and $\frac{1}{\mu}$-Lipschitz smooth [43, Theorem 4.2.1], i.e.,

$$\|\nabla f_i(x)-\nabla f_i(y)\|\le\frac{1}{\mu}\|x-y\|,\quad\forall i\in\mathcal{V},\ x,y\in\mathbb{R}^m.\tag{17}$$

Thus, we still need to examine the convergence of AB for non-strongly convex objective functions $f_i$.

Without strong convexity, the existing convergence proofs of AB may not hold. In fact, a key technique in the proofs of [24, 25] relies on constructing a complicated 3-dimensional matrix and showing that its spectral radius is less than 1 for a sufficiently small stepsize. This method does not apply here because the spectral radius of such a matrix is not strictly less than 1, and we cannot expect a linear convergence rate of AB. In the next section, we prove that AB converges to a stationary point at a rate of $\mathcal{O}(1/k)$ even for non-convex objective functions, with the help of which we then show the convergence and convergence rate of the DCGT.

## IV Convergence Analysis

In this section, we first establish the convergence result of AB in (15) to solve (12) for non-convex $f_i$, which is of independent interest as the existing results on AB only apply to the strongly convex case. Then, we show the convergence result of the DCGT by combining it with the results in Section III-C.

### IV-A Convergence analysis of AB without convexity

The AB algorithm to solve (12) is given in (15). To facilitate the presentation, let

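$$\begin{aligned}X_k&=\big[x_k^{(1)},\cdots,x_k^{(n)}\big]^T\in\mathbb{R}^{n\times m}\\ Y_k&=\big[y_k^{(1)},\cdots,y_k^{(n)}\big]^T\in\mathbb{R}^{n\times m}\\ \nabla\mathbf{f}_k&=\big[\nabla f_1\big(x_k^{(1)}\big),\cdots,\nabla f_n\big(x_k^{(n)}\big)\big]^T\in\mathbb{R}^{n\times m}\end{aligned}\tag{18}$$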

and

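$$[A]_{ij}=\begin{cases}a_{ij},&\text{if }(j,i)\in\mathcal{E}\\0,&\text{otherwise,}\end{cases}\qquad[B]_{ij}=\begin{cases}b_{ij},&\text{if }(j,i)\in\mathcal{E}\\0,&\text{otherwise.}\end{cases}\tag{19}$$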

Note that $A$ is row-stochastic and $B$ is column-stochastic.

Then, (15) can be written in the following compact form

$$\begin{aligned}X_{k+1}&=A(X_k-\alpha Y_k)&&\text{(20a)}\\ Y_{k+1}&=BY_k+\nabla\mathbf{f}_{k+1}-\nabla\mathbf{f}_k&&\text{(20b)}\end{aligned}$$
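The compact form (20) can be sketched for scalar states ($m=1$) as below. The local objectives $f_i(x)=\tfrac{1}{2}q_i(x-c_i)^2$, the weight matrices, the stepsize, and the initialization $y_0^{(i)}=\nabla f_i(x_0^{(i)})$ are assumptions made for illustration only. The run checks the gradient tracking property $\mathbf{1}^TY_k=\mathbf{1}^T\nabla\mathbf{f}_k$, which follows from (20b) and the column-stochasticity of $B$ under that initialization.

```python
def ab_iterate(A, B, grads, x0, alpha, iters):
    """Run (20) with scalar states; grads[i] is the gradient of the local objective f_i."""
    n = len(x0)
    x = list(x0)
    g = [grads[i](x[i]) for i in range(n)]
    y = list(g)                                   # assumed initialization: y_0 = grad f(x_0)
    for _ in range(iters):
        z = [x[j] - alpha * y[j] for j in range(n)]
        x = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]        # (20a)
        g_new = [grads[i](x[i]) for i in range(n)]
        y = [sum(B[i][j] * y[j] for j in range(n)) + g_new[i] - g[i]
             for i in range(n)]                                              # (20b)
        g = g_new
    return x, y, g

# Hypothetical 3-node digraph (edges 0->1, 1->2, 2->0, 0->2, plus self-loops).
A = [[1/2, 0.0, 1/2], [1/2, 1/2, 0.0], [1/3, 1/3, 1/3]]   # row-stochastic
B = [[1/3, 0.0, 1/2], [1/3, 1/2, 0.0], [1/3, 1/2, 1/2]]   # column-stochastic
q, c = [1.0, 2.0, 3.0], [0.0, 1.0, -1.0]
grads = [lambda x, qi=qi, ci=ci: qi * (x - ci) for qi, ci in zip(q, c)]
x, y, g = ab_iterate(A, B, grads, x0=[0.0, 0.0, 0.0], alpha=0.02, iters=300)
tracking_error = abs(sum(y) - sum(g))   # zero up to rounding
```

Because the tracking identity is algebraic, it holds for any stepsize; convergence of `x` itself depends on the stepsize condition discussed in Theorem 1.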

The convergence result of AB for non-strongly convex or even non-convex functions is stated in the following theorem.

###### Theorem 1 (Convergence of AB without convexity)

Suppose Assumption 2 holds and all $f_i$ in (12) are differentiable and $L$-Lipschitz smooth (c.f. (17)). If the stepsize $\alpha$ is sufficiently small, i.e., $\alpha$ satisfies (70), then $\{x_k^{(i)}\}$ generated by (15) satisfies that

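$$\begin{aligned}\frac{1}{k}\sum_{t=1}^k\|\nabla f(\bar{x}_t)\|^2\le{}&\frac{f(x_0)-f^\star}{\gamma k}+\frac{3L\alpha^2(L^2c_0^2+c_2^2)}{\gamma(1-\theta)k}\\ &+\frac{\alpha(\sqrt{n}Lc_0+c_2)\big(1+\sum_{t=1}^{k_0}\|\nabla f(\bar{x}_t)\|^2\big)}{\gamma(1-\theta)k}\end{aligned}\tag{21}$$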

where $\bar{x}_t\triangleq X_t^T\pi_A$, $\pi_A$ is the normalized left Perron vector of $A$, and the remaining quantities are positive constants given in (46), (51), (71), (72) of the Appendix, respectively.

Moreover, it holds that

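$$\frac{1}{k}\sum_{t=1}^k\|X_t-\mathbf{1}\bar{x}_t^T\|_F^2\le\frac{2c_0^2}{(1-\theta)k}+\frac{c_1^2\alpha^2}{k}\sum_{t=1}^k\|\nabla f(\bar{x}_t)\|^2.\tag{22}$$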

The proof of Theorem 1 is deferred to the Appendix. Theorem 1 shows that AB converges to a stationary point of $f$ at a rate of $\mathcal{O}(1/k)$ for non-convex functions, which is consistent with the centralized gradient algorithm [31].

### IV-B Convergence analysis of the DCGT

We now establish the convergence and quantify the convergence rate of the DCGT.

###### Theorem 2 (Convergence of the DCGT)

Suppose Assumptions 1 and 2 hold. If the stepsize $\alpha$ is sufficiently small, then $\{w_k^{(i)}\}$ in the DCGT (3) converges to the optimal point of (1), i.e., $\lim_{k\to\infty}w_k^{(i)}=w_i^\star$ for all $i\in\mathcal{V}$.

###### Proof:

Under Assumption 1, strong duality holds between the original problem (1) and its dual problem (12), and $f_i$ is $\frac{1}{\mu}$-Lipschitz smooth. The analysis in Section III-C reveals that the DCGT (16) can be written in the form of (15). Invoking Theorem 1, we know that $\{x_k^{(i)}\}$ in (15) converges to a point $x^\star$ with $\nabla f(x^\star)=0$. Since $f$ is convex, we obtain that $x^\star$ is an optimal point of problem (12), and hence $x^\star$ is an optimal Lagrange multiplier of (1). In view of the relation between (16) and (15), we have that $w_k^{(i)}$ converges to $\nabla F_i^*(-x^\star)=w_i^\star$, which is the optimal point from the KKT conditions.

###### Remark 3

We note that it is possible to extend the DCGT to time-varying networks [16], since the convergence of the DCGT essentially depends on the convergence of AB, which is recently extended to time-varying networks [27] under the strong convexity assumption.

Next, we quantify the convergence rate of the DCGT. Since the resource allocation problem (1) is a constrained optimization problem, there are several ways to analyze the convergence rate. For example, [10] and [23] show convergence rates from a dual perspective, using metrics based on the optimality condition (6). In contrast, [6] provides convergence rates of approaching the optimal value $F^\star$. However, this does not consider any constraint violation. In this work, we establish the convergence rate of $\|w_k^{(i)}-w_i^\star\|$, which not only is more intuitive, but also implicitly includes the vanishing rate of constraint violations. Nonetheless, it is more challenging to quantify the convergence rate in this way, since Theorem 1 presents the convergence rate of AB w.r.t. the norm of gradients, which is not directly related to $\|w_k^{(i)}-w_i^\star\|$. To this end, we introduce a weaker version of the Lipschitz smoothness condition.

###### Assumption 3

There exists a constant $\beta>0$ such that

$$F_i(w)\le F_i(w_i^\star)+\nabla F_i(w_i^\star)^T(w-w_i^\star)+\frac{\beta}{2}\|w-w_i^\star\|^2,\quad\forall i,w\tag{23}$$

where $w_i^\star$ is an optimal point of (1).

Roughly speaking, Assumption 3 bounds the growth rate of $F_i$ around the optimal point $w_i^\star$. It is weaker than the Lipschitz smoothness assumption. Note that the standard gradient methods and many optimization algorithms make the Lipschitz smoothness assumption to derive their convergence rates.

Moreover, we further assume the local constraint set is the whole space, i.e., $\mathcal{W}_i=\mathbb{R}^m$. This is also assumed in [10, 6] and can be relaxed if a dual-based convergence rate evaluation is used. It can also be removed if $w_i$ is a scalar, since the gradient of the conjugate function here can be explicitly given [23, 6].

###### Theorem 3 (Convergence rate of the DCGT)

Suppose that the conditions in Theorem 2 are satisfied, Assumption 3 holds, and $\mathcal{W}_i=\mathbb{R}^m$. If the stepsize $\alpha$ is sufficiently small, then $\{w_k^{(i)}\}$ generated by the DCGT in (3) satisfies that

$$\frac{1}{k}\sum_{t=1}^k\sum_{i=1}^n\|w_t^{(i)}-w_i^\star\|^2\le\frac{c}{k}\tag{24}$$

where $c$ is a constant that depends on the problem parameters and the network topology.

Moreover, if all $F_i$ have Lipschitz continuous gradients, then $\|w_k^{(i)}-w_i^\star\|$ converges linearly, i.e., $\|w_k^{(i)}-w_i^\star\|=\mathcal{O}(\lambda^k)$ for some $\lambda\in(0,1)$.

###### Remark 4

The constant $c$ in Theorem 3 and the upper bound of the stepsize can be given explicitly, but the expressions are complicated and tedious. Therefore, we prefer to present the asymptotic result, which shows that the DCGT converges linearly for strongly convex and Lipschitz smooth cost functions, and sublinearly if the Lipschitz smoothness assumption is removed.

###### Proof:

Recall the dual problem in (9) and (11), which combined with Assumption 3 implies that

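$$\begin{aligned}f(x)&=\sup_{w_i}\sum_{i=1}^n\big(-F_i(w_i)-x^Tw_i\big)+x^Td\\ &\ge\max_{w_i}\sum_{i=1}^n\Big(-\frac{\beta}{2}\|w_i-w_i^\star\|^2-x^Tw_i-\nabla F_i(w_i^\star)^T(w_i-w_i^\star)-F_i(w_i^\star)\Big)+x^Td\\ &=\sum_{i=1}^n-F_i(w_i^\star)+\frac{n}{2\beta}\|x-x^\star\|^2-x^T\Big(\sum_{i=1}^n w_i^\star-d\Big)\\ &=f^\star+\frac{n}{2\beta}\|x-x^\star\|^2\end{aligned}\tag{25}$$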

where we have used $\nabla F_i(w_i^\star)=-x^\star$, $\sum_{i=1}^n w_i^\star=d$, and $f^\star=-\sum_{i=1}^n F_i(w_i^\star)$.
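The passage from the second to the third line of (25) hides a per-node maximization. As a sketch, assuming the stationarity condition $\nabla F_i(w_i^\star)=-x^\star$ and writing $u=w_i-w_i^\star$ (a symbol introduced here only for illustration), each summand becomes a concave quadratic in $u$:

$$\max_{u}\Big\{-\frac{\beta}{2}\|u\|^2-(x-x^\star)^Tu\Big\}=\frac{1}{2\beta}\|x-x^\star\|^2,\qquad\text{attained at } u=-\frac{1}{\beta}(x-x^\star).$$

Summing this value over the $n$ nodes and collecting the remaining terms $-F_i(w_i^\star)-x^Tw_i^\star$ gives the third line of (25).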

The convexity of $f$ implies that

$$f^\star\ge f(x)+\nabla f(x)^T(x^\star-x)\tag{26}$$

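where $x^\star$ is a minimum point of $f$, i.e., $f^\star=f(x^\star)$. Adding (25) and (26) together yields that $\frac{n}{2\beta}\|x-x^\star\|^2\le\nabla f(x)^T(x-x^\star)\le\|\nabla f(x)\|\|x-x^\star\|$, which further implies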

$$\|x-x^\star\|\le\frac{2\beta}{n}\|\nabla f(x)\|.\tag{27}$$

Inequality (27) establishes a relation between the norm of gradient and the distance to the optimal point, which is followed by

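$$\begin{aligned}\sum_{i=1}^n\|x_k^{(i)}-x^\star\|^2&\le\frac{4\beta^2}{n^2}\sum_{i=1}^n\|\nabla f(x_k^{(i)})\|^2\\ &\le\frac{4\beta^2}{n^2}\sum_{i=1}^n\big(\|\nabla f(\bar{x}_k)\|+\|\nabla f(x_k^{(i)})-\nabla f(\bar{x}_k)\|\big)^2\\ &\le\frac{8\beta^2}{n}\|\nabla f(\bar{x}_k)\|^2+\frac{8\beta^2}{n^2\mu^2}\sum_{i=1}^n\|x_k^{(i)}-\bar{x}_k\|^2\\ &=\frac{8\beta^2}{n}\|\nabla f(\bar{x}_k)\|^2+\frac{8\beta^2}{n^2\mu^2}\|X_k-\mathbf{1}\bar{x}_k^T\|_F^2\end{aligned}\tag{28}$$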

where the last inequality used (17).

Recall (16) and . It follows from (28) that

 n∑i=1∥w(i)k+1−w⋆i∥2 (29) ≤1μ2n∑i=1∥∇Fi(w(i)k+1)−∇F