I Introduction
The distributed resource allocation problem (DRAP) is concerned with optimally allocating resources among multiple nodes of a directed network. Specifically, each node is associated with a local privacy-preserving objective function that measures the cost of its allocated resource, and the global goal is to jointly minimize the total cost. The key feature of the DRAP is that each node computes its optimal resource via interacting only with its neighboring nodes in the network. A typical application is economic dispatch, where the local cost function is often quadratic [1]. See [2, 3, 4, 5] for other applications.
I-A Literature review
Research on DRAPs can be categorized based on whether the underlying network is balanced or not. A balanced network means that the "amount" of information flowing into any node equals that flowing out of it, which is critical to the algorithm design. Most early works on DRAPs focus on balanced networks, and recent interest has shifted to the unbalanced case.
The center-free algorithm (CFA) in [2] is the first documented result on DRAPs in balanced networks, where at each iteration every node updates its decision variable using the weighted error between the gradient of its local objective function and those of its neighbors. The CFA can be accelerated by designing an optimal weighting matrix [3]. It is shown that the CFA achieves a linear convergence rate for strongly convex and Lipschitz smooth cost functions. For time-varying networks, the CFA is shown to converge sublinearly in the absence of strong convexity [4]. This rate is further improved in [6] by optimizing its dependence on the number of nodes. In addition, there are also several ADMM-based methods that only work on balanced networks [7, 8, 9]. By exploiting the mirror relationship between distributed optimization and distributed resource allocation, several accelerated distributed resource allocation algorithms are given in [10]. Moreover, the works [11] and [12] study continuous-time algorithms for DRAPs using control theory tools.
For unbalanced networks, the algorithm design for DRAPs is much more complicated, which has been widely acknowledged in the distributed optimization literature [13]. In this case, a consensus-based algorithm that adopts the celebrated surplus idea [14] is proposed in [1] and [15]. However, their convergence results are only for quadratic cost functions, where linear system theory is easily accessible. The extension to general convex functions is performed in [16] by adopting the nonnegative surplus method, at the expense of a slower convergence rate. ADMM-based algorithms are developed in [17, 18], and algorithms that handle communication delays in time-varying networks and perform event-triggered updates are studied in [19] and [20], respectively. We note that none of the above-mentioned works [1, 15, 16, 19, 20, 17, 18] provides an explicit convergence rate for its algorithm. In contrast, the DCGT of this work is proved to achieve a linear convergence rate for strongly convex and Lipschitz smooth cost functions, and a sublinear convergence rate without Lipschitz smoothness.
There are several recent works that establish convergence rates for their algorithms over unbalanced networks. Most of them leverage the dual relationship between DRAPs and distributed optimization problems. For example, the algorithms in [21] and [22] use stochastic gradients and diminishing stepsizes to solve the dual problem of DRAPs, and thus their convergence rates are limited to sublinear orders for Lipschitz smooth cost functions; [22] also shows only a sublinear rate even if the cost function is strongly convex. An algorithm with linear convergence is recently proposed in [23] for strongly convex and Lipschitz smooth cost functions. However, its convergence rate is unclear if either the strong convexity or the Lipschitz smoothness is removed. In [9], a push-sum based algorithm is combined with the ADMM. Although it can handle time-varying networks, its convergence rate remains sublinear even for strongly convex and Lipschitz smooth functions.
I-B Our contributions
In this work, we propose a distributed conjugate gradient tracking algorithm (DCGT) to solve DRAPs over unbalanced networks. The DCGT exploits the duality between DRAPs and distributed optimization problems via the convex conjugate functions of the local costs, and takes advantage of the state-of-the-art distributed gradient tracking algorithm [24, 25]. When the cost function is strongly convex and Lipschitz smooth, we show that the DCGT converges at a linear rate. If the Lipschitz smoothness assumption is removed, we show that the decision variable of each node in the DCGT converges to its optimal value at a sublinear rate. To the best of our knowledge, such convergence results were previously established only for undirected balanced networks in [10]. Although a distributed algorithm for directed networks is also proposed in [10], it comes with no convergence result. We finally illustrate the advantages of the DCGT over existing algorithms via simulation.
To establish the sublinear convergence of the DCGT, we first show that the distributed gradient tracking algorithm converges sublinearly to a stationary point even for nonconvex objective functions. Clearly, this advances the existing works [24, 26, 27], whose convergence results hold only for strongly convex objective functions. In fact, the convergence proofs in [24, 26, 27] rely on constructing a complicated $3\times 3$ matrix and deriving a linear convergence rate determined by the spectral radius of this matrix. This approach is no longer applicable since a linear convergence rate is impossible in the general nonconvex case, and hence the spectral radius of such a matrix cannot be strictly less than $1$. Moreover, we interpret the DCGT via the celebrated surplus-based consensus algorithm [14], which provides insight into the DCGT based on the optimality condition.
The rest of this paper is organized as follows. In Section II, we formulate the constrained DRAP in detail. Section III provides the DCGT algorithm for solving DRAPs over unbalanced directed networks, and interprets it as a surplus-based gradient consensus algorithm or a distributed gradient tracking algorithm. In Section IV, we conduct the convergence analysis of the DCGT. In particular, the convergence result of the gradient tracking algorithm for nonconvex objective functions is provided. Section V performs numerical experiments to validate the effectiveness of the DCGT. Finally, we draw concluding remarks in Section VI.
Notation: We use a lowercase letter $a$, a bold letter $\boldsymbol{a}$, and an uppercase letter $A$ to denote a scalar, a vector, and a matrix, respectively. $A^\mathsf{T}$ denotes the transpose of $A$, and $[A]_{ij}$ denotes the element in the $i$th row and $j$th column of $A$. For vectors we use $\|\cdot\|$ to denote the Euclidean norm, and for matrices we use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the spectral norm and the Frobenius norm, respectively. $|\mathcal{S}|$ denotes the cardinality of a set $\mathcal{S}$. $\mathbb{R}^n$ denotes the set of $n$-dimensional real vectors. $\mathbf{1}$ denotes the vector with all ones, whose dimension depends on the context. We use $\nabla f(x)$ to denote the gradient of a differentiable function $f$ at $x$. We say a nonnegative matrix $W$ is row-stochastic if $W\mathbf{1}=\mathbf{1}$, and column-stochastic if $W^\mathsf{T}$ is row-stochastic. We use $O(\cdot)$ to denote the big-O notation.
II Problem formulation
Consider the distributed resource allocation problem (DRAP) with $n$ nodes, where each node $i$ has a local privacy-preserving cost function $f_i:\mathbb{R}^m\to\mathbb{R}$. The goal is to solve the following optimization problem in a distributed manner:
(1) $\min_{x_1,\dots,x_n}\ \sum_{i=1}^n f_i(x_i), \quad \text{subject to } \sum_{i=1}^n x_i = \sum_{i=1}^n d_i,\ \ x_i\in\mathcal{X}_i,\ \forall i,$
where $x_i\in\mathbb{R}^m$ is the local decision vector of node $i$, representing the resources allocated to node $i$, and $\mathcal{X}_i\subseteq\mathbb{R}^m$ is a local convex and closed constraint set. $d_i\in\mathbb{R}^m$ denotes the resource demand of node $i$. Both $f_i$ and $d_i$ are only known to node $i$. Let $d=\sum_{i=1}^n d_i$; then $d$ denotes the total available resources, and the equality constraint in (1) reflects the coupling among nodes.
Remark 1
Problem (1) covers many forms of DRAPs considered in the literature. For example, the standard box constraint $\mathcal{X}_i=\{x_i\mid \underline{x}_i\le x_i\le \bar{x}_i\}$ for some constants $\underline{x}_i$ and $\bar{x}_i$ is a one-dimensional special case of (1), see e.g. [16, 15, 1, 19, 23]. Moreover, the coupling constraint can be weighted, e.g., $\sum_{i=1}^n w_i x_i=\sum_{i=1}^n d_i$ with nonzero scalars $w_i$, which is transformed into (1) by defining a new variable $\tilde{x}_i=w_i x_i$; a worked instance is given below. In addition, many works only consider quadratic cost functions [15, 1].
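To make the weighted-constraint transformation concrete, here is a short derivation under the assumed setting of nonzero scalar weights $w_i$:
$$\sum_{i=1}^n w_i x_i = \sum_{i=1}^n d_i \;\Longleftrightarrow\; \sum_{i=1}^n \tilde{x}_i = \sum_{i=1}^n d_i, \qquad \tilde{x}_i := w_i x_i,$$
and the local cost becomes $\tilde{f}_i(\tilde{x}_i):=f_i(\tilde{x}_i/w_i)$ with constraint set $\tilde{\mathcal{X}}_i:=\{w_i x\mid x\in\mathcal{X}_i\}$. Since $\tilde{f}_i$ is the composition of $f_i$ with an invertible linear map, it remains convex (and strongly convex if $f_i$ is), so the transformed problem is again of the form (1).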
Solving (1) distributedly means that each node can only communicate and exchange information with a subset of nodes via a communication network, which is modeled by a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Here $\mathcal{V}=\{1,\dots,n\}$ denotes the set of nodes, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ denotes the set of edges, and $(i,j)\in\mathcal{E}$ if node $i$ can send information to node $j$. Note that $(i,j)\in\mathcal{E}$ does not necessarily imply that $(j,i)\in\mathcal{E}$. Define $\mathcal{N}_i^{\text{in}}=\{j\mid (j,i)\in\mathcal{E}\}$ and $\mathcal{N}_i^{\text{out}}=\{j\mid (i,j)\in\mathcal{E}\}$ to be the sets of in-neighbors and out-neighbors of node $i$, respectively. That is, node $i$ can only receive messages from its in-neighbors and send messages to its out-neighbors. Let $a_{ij}>0$ if $(j,i)\in\mathcal{E}$, and $a_{ij}=0$ otherwise. $\mathcal{G}$ is balanced if $\sum_{j=1}^n a_{ij}=\sum_{j=1}^n a_{ji}$ for all $i\in\mathcal{V}$.
The following assumptions are made throughout the paper.
Assumption 1 (Strong convexity and Slater’s condition)

The local cost function $f_i$ is $\sigma$-strongly convex for all $i\in\mathcal{V}$, i.e., for any $x,y\in\mathcal{X}_i$ and $\theta\in[0,1]$,
(2) $f_i(\theta x+(1-\theta)y)\le \theta f_i(x)+(1-\theta)f_i(y)-\frac{\sigma}{2}\theta(1-\theta)\|x-y\|^2.$
The coupling constraint $\sum_{i=1}^n x_i=\sum_{i=1}^n d_i$ is satisfied at some point in the relative interior of the Cartesian product $\mathcal{X}_1\times\cdots\times\mathcal{X}_n$.
Assumption 2 (Strongly connected network)
$\mathcal{G}$ is strongly connected, i.e., there exists a directed path from any node $i$ to any node $j$ in $\mathcal{G}$.
III The Distributed Conjugate Gradient Tracking Algorithm
This section provides the distributed conjugate gradient tracking algorithm (DCGT) to solve (1) over a directed network, and gives two interpretations of the DCGT to reveal the insight behind its design. In particular, the DCGT can be explained as a surplus-based gradient consensus algorithm, or as a distributed gradient tracking method applied to the dual problem of (1).
III-A The DCGT
The DCGT is summarized in Algorithm 1, where each node $i$ computes the following update:
(3a) $z_i^{k+1}=\sum_{j=1}^n a_{ij}\left(z_j^k+\alpha y_j^k\right),$
(3b) $x_i^{k+1}=\arg\max_{x\in\mathcal{X}_i}\left\{(z_i^{k+1})^\mathsf{T}x-f_i(x)\right\},$
(3c) $y_i^{k+1}=\sum_{j=1}^n b_{ij}y_j^k-\left(x_i^{k+1}-x_i^k\right).$
Each node $i$ keeps updating three vectors $z_i^k$, $x_i^k$ and $y_i^k$ in (3) iteratively. In particular, at each iteration node $i$ receives $z_j^k$ and $y_j^k$ from each of its in-neighbors $j\in\mathcal{N}_i^{\text{in}}$, and updates $z_i^{k+1}$ according to (3a), where the weights satisfy $a_{ij}>0$ for any $j\in\mathcal{N}_i^{\text{in}}\cup\{i\}$ and $\sum_{j=1}^n a_{ij}=1$. $\alpha$ is a positive stepsize. The update of $y_i^{k+1}$ in (3c) is similar, where $b_{ij}>0$ for any $j\in\mathcal{N}_i^{\text{in}}\cup\{i\}$ and $\sum_{i=1}^n b_{ij}=1$. For convenience, let $a_{ij}=b_{ij}=0$ for any $j\notin\mathcal{N}_i^{\text{in}}\cup\{i\}$, and define the two matrices $A=[a_{ij}]$ and $B=[b_{ij}]$. Then $A$ is a row-stochastic matrix and $B$ is a column-stochastic matrix. Clearly, the directed network associated with $A$ and $B$ can be unbalanced.
Remark 2
Using a row-stochastic and a column-stochastic matrix is a standard way to handle the unbalancedness of directed networks, as in [13, 28]. In implementation, one can simply set $a_{ij}=1/(|\mathcal{N}_i^{\text{in}}|+1)$ for $j\in\mathcal{N}_i^{\text{in}}\cup\{i\}$ and $b_{ij}=1/(|\mathcal{N}_j^{\text{out}}|+1)$ for $i\in\mathcal{N}_j^{\text{out}}\cup\{j\}$, and then both conditions are satisfied. Note that this choice requires each node to know the number of its out-neighbors, which is commonly assumed in the literature on distributed optimization over directed networks [29, 28, 30].
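As a concrete illustration, the following Python sketch builds such a weight pair from a directed adjacency structure using the simple choice above; the graph and variable names are hypothetical, not taken from the paper.

import numpy as np

# Hypothetical strongly connected digraph on 4 nodes:
# edges[j] lists the out-neighbors of node j, i.e., (j, i) is an edge for i in edges[j].
edges = {0: [1], 1: [2, 3], 2: [0, 3], 3: [0]}
n = 4

in_nbrs = {i: [j for j in range(n) if i in edges[j]] for i in range(n)}

A = np.zeros((n, n))  # row-stochastic, built from in-neighbors
B = np.zeros((n, n))  # column-stochastic, built from out-neighbors

for i in range(n):
    for j in in_nbrs[i] + [i]:
        A[i, j] = 1.0 / (len(in_nbrs[i]) + 1)   # a_ij = 1/(|N_i^in| + 1)

for j in range(n):
    for i in edges[j] + [j]:
        B[i, j] = 1.0 / (len(edges[j]) + 1)     # b_ij = 1/(|N_j^out| + 1)

assert np.allclose(A.sum(axis=1), 1.0)  # rows of A sum to one
assert np.allclose(B.sum(axis=0), 1.0)  # columns of B sum to one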
The update of $x_i^{k+1}$ in (3b) requires finding an optimal point of a simple local optimization problem, which is similar to many duality-based optimization algorithms such as the dual ascent method [31], and can be readily solved by standard algorithms, e.g., the projected (sub)gradient method or Newton's method. If the local constraint set $\mathcal{X}_i$ is the whole space and $f_i$ is differentiable, the solution can be expressed as
$x_i^{k+1}=\nabla f_i^{-1}(z_i^{k+1}),$
where $\nabla f_i^{-1}$ is the inverse function of $\nabla f_i$, i.e., $\nabla f_i^{-1}(\nabla f_i(x))=x$ for any $x$. Moreover, if the decision variable is a scalar and the local constraint set is the interval $[\underline{x}_i,\bar{x}_i]$ as in [16, 19, 1], then (3b) becomes the projection of the unconstrained maximizer onto this interval, i.e.,
$x_i^{k+1}=\min\left\{\max\left\{\nabla f_i^{-1}(z_i^{k+1}),\,\underline{x}_i\right\},\,\bar{x}_i\right\}.$
Since this projected update rule is adopted in [15, 1, 16, 19], the algorithms therein are special cases of (3b).
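For instance, under the common quadratic cost $f_i(x)=\tfrac{q_i}{2}x^2+p_i x$ with $q_i>0$ (a standard example in economic dispatch, not a model taken from this paper), we have $\nabla f_i(x)=q_i x+p_i$, so $\nabla f_i^{-1}(z)=(z-p_i)/q_i$ and the projected update above reads
$$x_i^{k+1}=\min\left\{\max\left\{\frac{z_i^{k+1}-p_i}{q_i},\ \underline{x}_i\right\},\ \bar{x}_i\right\},$$
which requires no iterative inner solver.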
An interesting feature of the DCGT lies in the way it handles the coupling constraint $\sum_{i=1}^n x_i=\sum_{i=1}^n d_i$. Notice that the algorithm is initialized such that $y_i^0=d_i-x_i^0$, where $x_i^0$ is obtained from (3b) with an arbitrary $z_i^0$. (If only the total resource demand $d$ is known to all nodes, one can simply set $d_i=d/n$, which can be done in a distributed manner [16].) By premultiplying (3c) with $\mathbf{1}^\mathsf{T}$ and using the column-stochasticity of $B$, we obtain that $\sum_{i=1}^n y_i^k=\sum_{i=1}^n(d_i-x_i^k)$ for all $k$. Thus, if $y_i^k$ converges to $0$ for every $i$, then the coupling constraint is satisfied asymptotically, which is essential to the convergence proof of the DCGT.
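To make the update (3) concrete, below is a minimal Python sketch of the DCGT as reconstructed above, on a toy problem with scalar quadratic costs $f_i(x)=\tfrac{q_i}{2}x^2+p_i x$ and $\mathcal{X}_i=\mathbb{R}$; the digraph, weights, and data are illustrative only.

import numpy as np

# Hypothetical 4-node strongly connected digraph and the simple weights from Remark 2.
n = 4
edges = {0: [1], 1: [2, 3], 2: [0, 3], 3: [0]}
in_nbrs = {i: [j for j in range(n) if i in edges[j]] for i in range(n)}
A = np.zeros((n, n)); B = np.zeros((n, n))
for i in range(n):
    for j in in_nbrs[i] + [i]:
        A[i, j] = 1.0 / (len(in_nbrs[i]) + 1)      # row-stochastic
for j in range(n):
    for i in edges[j] + [j]:
        B[i, j] = 1.0 / (len(edges[j]) + 1)        # column-stochastic

rng = np.random.default_rng(0)
q = rng.uniform(1.0, 3.0, n)     # f_i(x) = q_i/2 * x^2 + p_i * x  (strongly convex)
p = rng.uniform(-1.0, 1.0, n)
d = rng.uniform(0.5, 2.0, n)     # local demands
alpha = 0.02                     # stepsize

def local_argmax(z_i, i):
    # (3b): argmax_x { z_i * x - f_i(x) } has the closed form (z_i - p_i) / q_i here.
    return (z_i - p[i]) / q[i]

z = np.zeros(n)                                        # z_i^0 (arbitrary)
x = np.array([local_argmax(z[i], i) for i in range(n)])
y = d - x                                              # y_i^0 = d_i - x_i^0

for k in range(50000):
    z = A @ (z + alpha * y)                            # (3a)
    x_new = np.array([local_argmax(z[i], i) for i in range(n)])
    y = B @ y - (x_new - x)                            # (3c)
    x = x_new

# Closed-form optimum for comparison: all marginal costs equal lambda*.
lam_star = (d.sum() + (p / q).sum()) / (1.0 / q).sum()
x_star = (lam_star - p) / q
print(abs(x.sum() - d.sum()))          # coupling-constraint violation: small
print(np.abs(x - x_star).max())        # distance to the optimal allocation: small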
Next, we interpret the DCGT from two different perspectives.
III-B Interpretation: Surplus-based gradient consensus
We first show that the DCGT can be regarded as the surplus-based consensus algorithm (SBCA) [14] applied to the local gradients $\nabla f_i(x_i^k)$. The SBCA is a celebrated average consensus algorithm, which aims to drive all nodes' states to the average of their initial values, i.e., $\lim_{k\to\infty}x_i^k=\frac{1}{n}\sum_{j=1}^n x_j^0$ for all $i$. It does not involve any optimization problem, and has the following update rule (Eq. (4a) is written slightly differently from the original form in [14], though they are essentially equivalent):
(4a) $x_i^{k+1}=\sum_{j=1}^n a_{ij}\left(x_j^k+\epsilon y_j^k\right),$
(4b) $y_i^{k+1}=\sum_{j=1}^n b_{ij}y_j^k-\left(x_i^{k+1}-x_i^k\right),$
where $y_i^0=0$, and the weights $a_{ij}$ and $b_{ij}$ satisfy the same conditions as in the DCGT. The algorithm achieves average consensus over any strongly connected network, provided that the parameter $\epsilon>0$ is sufficiently small. Roughly speaking, (4a) pushes all $x_i^k$ to consensus, (4b) keeps the sum $\sum_{i=1}^n(x_i^k+y_i^k)$ unchanged over $k$, and it can be shown that $\lim_{k\to\infty}y_i^k=0$. For more details the reader is referred to [14].
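A minimal numerical check of this behavior, using the same hypothetical 4-node digraph as in the earlier snippets (the weights and data are illustrative only):

import numpy as np

n = 4
edges = {0: [1], 1: [2, 3], 2: [0, 3], 3: [0]}           # out-neighbor lists
in_nbrs = {i: [j for j in range(n) if i in edges[j]] for i in range(n)}
A = np.zeros((n, n)); B = np.zeros((n, n))
for i in range(n):
    for j in in_nbrs[i] + [i]:
        A[i, j] = 1.0 / (len(in_nbrs[i]) + 1)             # row-stochastic
for j in range(n):
    for i in edges[j] + [j]:
        B[i, j] = 1.0 / (len(edges[j]) + 1)               # column-stochastic

x = np.array([4.0, -1.0, 2.5, 0.5])                       # initial states
y = np.zeros(n)                                           # surpluses, y_i^0 = 0
avg = x.mean()
eps = 0.01                                                # small surplus parameter

for k in range(20000):
    x_new = A @ (x + eps * y)                             # (4a)
    y = B @ y - (x_new - x)                               # (4b)
    x = x_new
    assert abs((x + y).sum() - n * avg) < 1e-8            # invariant: sum_i (x_i + y_i)

print(np.abs(x - avg).max())   # consensus error: small
print(np.abs(y).max())         # surpluses: small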
We now show its connection to the DCGT. To simplify notation, we assume $\mathcal{X}_i=\mathbb{R}^m$ in this subsection. By introducing the Lagrange multiplier $\lambda\in\mathbb{R}^m$ for the coupling constraint in (1), the Lagrange function of problem (1) is given as
(5) $L(x,\lambda)=\sum_{i=1}^n f_i(x_i)+\lambda^\mathsf{T}\Big(\sum_{i=1}^n d_i-\sum_{i=1}^n x_i\Big),$
where $x=[x_1^\mathsf{T},\dots,x_n^\mathsf{T}]^\mathsf{T}$. Let $x^\star=[(x_1^\star)^\mathsf{T},\dots,(x_n^\star)^\mathsf{T}]^\mathsf{T}$ be an optimal point of (1). The Karush-Kuhn-Tucker (KKT) conditions [32] imply that
(6a) $\nabla f_1(x_1^\star)=\nabla f_2(x_2^\star)=\cdots=\nabla f_n(x_n^\star),$
(6b) $\sum_{i=1}^n x_i^\star=\sum_{i=1}^n d_i.$
Note that (6) is a necessary and sufficient optimality condition since Slater's condition holds [32]. Therefore, the problem reduces to finding an $x^\star$ satisfying (6).
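As a sanity check, consider again the illustrative quadratic costs $f_i(x_i)=\tfrac{q_i}{2}x_i^2+p_i x_i$ with $\mathcal{X}_i=\mathbb{R}$ (not a setting imposed by the paper). Condition (6a) gives $q_i x_i^\star+p_i=\lambda^\star$ for all $i$, i.e., $x_i^\star=(\lambda^\star-p_i)/q_i$, and substituting into (6b) yields
$$\lambda^\star=\frac{\sum_{i=1}^n d_i+\sum_{i=1}^n p_i/q_i}{\sum_{i=1}^n 1/q_i},$$
which recovers the classical closed-form solution of quadratic economic dispatch: all nodes equalize their marginal costs at $\lambda^\star$.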
The gradient-consensus condition (6a) motivates us to use the SBCA. The goal is to achieve consensus of the local gradients while keeping the sum of the local decision variables fixed. Thus, it is natural to replace $x_i^k$ in (4a) with $\nabla f_i(x_i^k)$ and leave (4b) unchanged. Then, (4) becomes
(7) $\nabla f_i(x_i^{k+1})=\sum_{j=1}^n a_{ij}\left(\nabla f_j(x_j^k)+\epsilon y_j^k\right),$
(8) $y_i^{k+1}=\sum_{j=1}^n b_{ij}y_j^k-\left(x_i^{k+1}-x_i^k\right),$
which coincides with (3) when $\mathcal{X}_i=\mathbb{R}^m$, since in that case (3b) implies $z_i^{k+1}=\nabla f_i(x_i^{k+1})$.
Although this interpretation helps to understand the DCGT, the convergence analysis of the SBCA is based on the linear system theory, which is no longer applicable to the DCGT since the gradient terms in (3) generally introduce nonlinearity. To prove the convergence of the DCGT, we interpret it as a distributed optimization algorithm with gradient tracking over an unbalanced network in the next subsection, and leverage the interpretation to derive the convergence rate in Section IV.
III-C Interpretation: Distributed optimization with gradient tracking
We now interpret the DCGT in the context of distributed optimization over directed networks. This observation is very helpful to prove its convergence and, importantly, to establish its convergence rate.
Consider the dual problem of (1), which is given by
(9) $\max_{\lambda\in\mathbb{R}^m}\ \Phi(\lambda):=\min_{x\in\mathcal{X}_1\times\cdots\times\mathcal{X}_n} L(x,\lambda),$
where $L(x,\lambda)$ is the Lagrange function defined in (5). Strong duality holds since Slater's condition is satisfied [32], and hence problem (9) is equivalent to (1). Moreover, the objective function in (9) can be written as
(10) $\Phi(\lambda)=\sum_{i=1}^n\left(\lambda^\mathsf{T}d_i-f_i^*(\lambda)\right),$
where $f_i^*(\lambda)=\sup_{x\in\mathcal{X}_i}\{\lambda^\mathsf{T}x-f_i(x)\}$ is the convex conjugate function of $f_i$ (restricted to $\mathcal{X}_i$). Thus, the dual problem (9) can be rewritten as a convex optimization problem
(11) $\min_{\lambda\in\mathbb{R}^m}\ \sum_{i=1}^n\left(f_i^*(\lambda)-\lambda^\mathsf{T}d_i\right),$
or equivalently,
(12) $\min_{\lambda\in\mathbb{R}^m}\ F(\lambda):=\sum_{i=1}^n F_i(\lambda), \quad F_i(\lambda):=f_i^*(\lambda)-\lambda^\mathsf{T}d_i,$
where $F_i$ is only known to node $i$.
Problem (12) is equivalent to problem (1) in the sense that the optimal value of (12) is the negative of the optimal value of (1), and any optimal point $\lambda^\star$ of (12) is an optimal Lagrange multiplier of (1), which implies $x_i^\star=\nabla f_i^*(\lambda^\star)$ if $f_i^*$ is differentiable. Hence, we can simply focus on solving the dual problem (12), which is widely studied in the context of distributed optimization.
Since $f_i$ is strongly convex, $f_i^*$ is differentiable and has Lipschitz continuous gradients [32], and the supremum in the definition of $f_i^*$ is attained. From Danskin's theorem [31], the gradient of $f_i^*$ is given by
(13) $\nabla f_i^*(\lambda)=\arg\max_{x\in\mathcal{X}_i}\left\{\lambda^\mathsf{T}x-f_i(x)\right\}.$
Thus, it follows from (11) that
(14) $\nabla F_i(\lambda)=\arg\max_{x\in\mathcal{X}_i}\left\{\lambda^\mathsf{T}x-f_i(x)\right\}-d_i.$
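For the illustrative quadratic cost $f_i(x)=\tfrac{q_i}{2}\|x\|^2+p_i^\mathsf{T}x$ with $\mathcal{X}_i=\mathbb{R}^m$ (again a standard example rather than an assumption of the paper), the conjugate and the dual gradient have closed forms:
$$f_i^*(\lambda)=\frac{1}{2q_i}\|\lambda-p_i\|^2, \qquad \nabla F_i(\lambda)=\frac{\lambda-p_i}{q_i}-d_i,$$
so (14) simply says that the dual gradient equals the locally optimal allocation at price $\lambda$ minus the local demand $d_i$.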
There are several distributed algorithms to solve (12) over unbalanced directed networks, such as the gradient-push method [29], Push-DIGing [33], D-DGD [28], DEXTRA [34], and [35], as well as their asynchronous counterparts AsySPA [36], APPG [37], and [38, 39]. By using the idea of gradient tracking, [24] and [25] propose a distributed gradient tracking algorithm (called the push-pull gradient method in [25]), which achieves a linear convergence rate if the objective function is strongly convex and Lipschitz smooth. Moreover, its update rule involves only linear combinations of local variables and is thus easier to implement than its competitors (e.g. [33]). Therefore, we adopt this gradient tracking algorithm to solve (12), which has the following update rule:
(15) $\lambda_i^{k+1}=\sum_{j=1}^n a_{ij}\left(\lambda_j^k-\alpha\tilde{y}_j^k\right), \qquad \tilde{y}_i^{k+1}=\sum_{j=1}^n b_{ij}\tilde{y}_j^k+\nabla F_i(\lambda_i^{k+1})-\nabla F_i(\lambda_i^k),$
where $a_{ij}$ and $b_{ij}$ are positive weights satisfying the same conditions as those of the DCGT, $\alpha$ is a sufficiently small stepsize, and $\lambda_i^0$ and $\tilde{y}_i^0$ are initialized such that $\tilde{y}_i^0=\nabla F_i(\lambda_i^0)$.
Plugging the gradient (14) into (15) and letting $x_i^k=\arg\max_{x\in\mathcal{X}_i}\{(\lambda_i^k)^\mathsf{T}x-f_i(x)\}$, we have
(16) $\lambda_i^{k+1}=\sum_{j=1}^n a_{ij}\left(\lambda_j^k-\alpha\tilde{y}_j^k\right), \quad x_i^{k+1}=\arg\max_{x\in\mathcal{X}_i}\left\{(\lambda_i^{k+1})^\mathsf{T}x-f_i(x)\right\}, \quad \tilde{y}_i^{k+1}=\sum_{j=1}^n b_{ij}\tilde{y}_j^k+x_i^{k+1}-x_i^k,$
which is exactly (3) by letting $z_i^k=\lambda_i^k$, $y_i^k=-\tilde{y}_i^k$, and keeping the stepsize $\alpha$ unchanged.
Then, the convergence of (3) follows from the convergence of the gradient tracking algorithm (15). However, existing works [24, 40, 25, 26, 27] only show the convergence of this algorithm for strongly convex and Lipschitz smooth objective functions. Note that $F_i$ in (12) is often not strongly convex due to the introduction of the convex conjugate function $f_i^*$, even though $f_i$ is strongly convex [32]. This is indeed the case for applications whose cost functions include exponential terms [41] or logarithmic terms [42]. In fact, we can only guarantee that $F_i$ is differentiable and Lipschitz smooth [43, Theorem 4.2.1], i.e.,
(17) $\|\nabla F_i(\lambda)-\nabla F_i(\lambda')\|\le\frac{1}{\sigma}\|\lambda-\lambda'\|, \quad \forall \lambda,\lambda'\in\mathbb{R}^m.$
Thus, we still need to examine the convergence of the gradient tracking algorithm (15) for objective functions that are not strongly convex.
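To see concretely why $F_i$ can lose strong convexity, consider the simple scalar example $f(x)=\tfrac{\sigma}{2}x^2+|x|$, chosen here only for illustration; it is strongly convex but not Lipschitz smooth. A direct computation gives
$$f^*(\lambda)=\sup_{x}\left\{\lambda x-\tfrac{\sigma}{2}x^2-|x|\right\}=\frac{1}{2\sigma}\big(\max\{|\lambda|-1,\,0\}\big)^2,$$
which is differentiable with a $\tfrac{1}{\sigma}$-Lipschitz gradient, yet it is identically zero on $[-1,1]$ and hence not strongly convex.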
Without strong convexity, the existing convergence proofs of the gradient tracking algorithm may not hold. In fact, a key technique in the proofs of [24, 25] relies on constructing a complicated $3\times 3$ matrix and showing that its spectral radius is less than $1$ for a sufficiently small stepsize. This method does not apply here because the spectral radius of such a matrix cannot be strictly less than $1$, and we cannot expect a linear convergence rate in the nonconvex case. In the next section, we prove that the gradient tracking algorithm converges to a stationary point at a sublinear rate even for nonconvex objective functions, with the help of which we then establish the convergence and convergence rate of the DCGT.
IV Convergence Analysis
In this section, we first establish the convergence of the gradient tracking algorithm (15) for solving (12) with nonconvex $F_i$, which is clearly of independent interest since the existing results on this algorithm only apply to the strongly convex case. Then, we show the convergence of the DCGT by combining this result with those in Section III-C.
IV-A Convergence analysis of the gradient tracking algorithm without convexity
The gradient tracking algorithm to solve (12) is given in (15). To facilitate the presentation, let
(18) $\boldsymbol{\lambda}^k=[\lambda_1^k,\dots,\lambda_n^k]^\mathsf{T}\in\mathbb{R}^{n\times m}, \quad \tilde{\mathbf{y}}^k=[\tilde{y}_1^k,\dots,\tilde{y}_n^k]^\mathsf{T}\in\mathbb{R}^{n\times m},$
and
(19) $\nabla\mathbf{F}(\boldsymbol{\lambda}^k)=[\nabla F_1(\lambda_1^k),\dots,\nabla F_n(\lambda_n^k)]^\mathsf{T}\in\mathbb{R}^{n\times m}.$
Recall that $A=[a_{ij}]$ is row-stochastic and $B=[b_{ij}]$ is column-stochastic.
Then, (15) can be written in the following compact form:
(20a) $\boldsymbol{\lambda}^{k+1}=A\left(\boldsymbol{\lambda}^k-\alpha\tilde{\mathbf{y}}^k\right),$
(20b) $\tilde{\mathbf{y}}^{k+1}=B\tilde{\mathbf{y}}^k+\nabla\mathbf{F}(\boldsymbol{\lambda}^{k+1})-\nabla\mathbf{F}(\boldsymbol{\lambda}^k).$
The convergence result of the gradient tracking algorithm (15) for non-strongly convex, or even nonconvex, objective functions is stated in the following theorem.
Theorem 1 (Convergence of (15) without convexity)
Suppose that Assumption 2 holds and all $F_i$ in (12) are differentiable and Lipschitz smooth (cf. (17)). If the stepsize $\alpha$ is sufficiently small, i.e., it satisfies (70), then the sequence generated by (15) satisfies
(21)
where $\bar{\lambda}^k$ denotes the weighted average of the local iterates $\lambda_i^k$, $\pi$ is the normalized left Perron vector of the row-stochastic matrix $A$, and the remaining quantities are positive constants given in (46), (51), (71), and (72) of the Appendix, respectively.
Moreover, it holds that
(22)
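As a rough numerical illustration of Theorem 1 (not a substitute for its formal statement), the following Python sketch runs the compact recursion (20) on made-up smooth but nonconvex local objectives $F_i(\lambda)=\ln\!\big(1+(\lambda-c_i)^2\big)$ and monitors the gradient of the global objective at the average iterate; with a small stepsize it decays toward zero.

import numpy as np

n = 4
edges = {0: [1], 1: [2, 3], 2: [0, 3], 3: [0]}
in_nbrs = {i: [j for j in range(n) if i in edges[j]] for i in range(n)}
A = np.zeros((n, n)); B = np.zeros((n, n))
for i in range(n):
    for j in in_nbrs[i] + [i]:
        A[i, j] = 1.0 / (len(in_nbrs[i]) + 1)
for j in range(n):
    for i in edges[j] + [j]:
        B[i, j] = 1.0 / (len(edges[j]) + 1)

c = np.array([-2.0, 0.5, 1.0, 3.0])    # hypothetical local data

def grad_F(lam):
    # gradient of F_i(lam_i) = log(1 + (lam_i - c_i)^2): smooth but nonconvex
    return 2.0 * (lam - c) / (1.0 + (lam - c) ** 2)

alpha = 0.02
lam = np.zeros(n)                      # scalar local copies lambda_i^0
g = grad_F(lam)
y = g.copy()                           # tracker initialized at the local gradients

for k in range(20000):
    lam = A @ (lam - alpha * y)        # (20a)
    g_new = grad_F(lam)
    y = B @ y + g_new - g              # (20b)
    g = g_new

lam_bar = lam.mean()
print(abs(grad_F(np.full(n, lam_bar)).sum()))  # |sum_i dF_i(lam_bar)|: small
print(np.abs(lam - lam_bar).max())             # consensus error: small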
IV-B Convergence analysis of the DCGT
We now establish the convergence and quantify the convergence rate of the DCGT.
Theorem 2 (Convergence of the DCGT)
Suppose that Assumptions 1 and 2 hold and the stepsize $\alpha$ is sufficiently small. Then, the decision variable $x_i^k$ generated by the DCGT in (3) converges to the optimal point $x_i^\star$ of problem (1) for every $i\in\mathcal{V}$.
Proof:
Under Assumption 1, strong duality holds between the original problem (1) and its dual problem (12), and each $F_i$ is Lipschitz smooth. The analysis in Section III-C reveals that the DCGT (16) can be written in the form of (15). Invoking Theorem 1, we know that the iterates of (15) converge to a point $\lambda^\infty$ with $\nabla F(\lambda^\infty)=0$. Since $F$ is convex, we obtain that $\lambda^\infty$ is an optimal point of problem (12), and hence an optimal Lagrange multiplier of (1). In view of the relation between (16) and (15), $x_i^k$ converges to $x_i^\star$, which is the optimal point by the KKT conditions.
Remark 3
Next, we quantify the convergence rate of the DCGT. Since the resource allocation problem (1) is a constrained optimization problem, there are several ways to measure the convergence rate. For example, [10] and [23] show convergence rates from a dual perspective based on the optimality condition (6), using dual-based quantities as the metric. In contrast, [6] provides convergence rates of the objective value approaching the optimal value, but this does not account for any constraint violation. In this work, we establish the convergence rate of the distance between $x_i^k$ and its optimal value $x_i^\star$, which is not only more intuitive but also implicitly captures the vanishing rate of the constraint violation. Nonetheless, it is more challenging to quantify the convergence rate in this way, since Theorem 1 gives the convergence rate of (15) with respect to the norm of the gradient, which is not directly related to the distance to $x^\star$. To this end, we introduce a weaker version of the Lipschitz smoothness condition.
Assumption 3
There exists a constant $\eta>0$ such that
(23)
where $x^\star$ is an optimal point of (1).
Roughly speaking, Assumption 3 bounds the growth rate of the cost function around the optimal point $x^\star$, and it is weaker than the Lipschitz smoothness assumption. Note that standard gradient methods and many other optimization algorithms require the Lipschitz smoothness assumption to derive their convergence rates.
Moreover, we further assume that the local constraint set is the whole space, i.e., $\mathcal{X}_i=\mathbb{R}^m$. This is also assumed in [10, 6], and it can be relaxed if the dual-based convergence rate evaluation mentioned above is adopted. It can also be removed if $x_i$ is a scalar, since the gradient of the conjugate function can then be given explicitly [23, 6].
Theorem 3 (Convergence rate of the DCGT)
Suppose that the conditions in Theorem 2 are satisfied, Assumption 3 holds, and $\mathcal{X}_i=\mathbb{R}^m$ for all $i$. If the stepsize is sufficiently small, then $x_i^k$ generated by the DCGT in (3) satisfies
(24)
where $c$ is a constant that depends on the cost functions, the stepsize, and the network topology.
Moreover, if all $f_i$ have Lipschitz continuous gradients, then $x_i^k$ converges linearly, i.e., $\|x_i^k-x_i^\star\|=O(\rho^k)$ for some $\rho\in(0,1)$.
Remark 4
The constant $c$ in Theorem 3 and the upper bound on the stepsize can be given explicitly, but their expressions are complicated and tedious. Therefore, we prefer to present the asymptotic result, which shows that the DCGT converges linearly for strongly convex and Lipschitz smooth cost functions, and sublinearly if the Lipschitz smoothness assumption is removed.
Proof:
Recall the dual problem in (9) and (11). Together with Assumption 3, it implies that
(25)
where we have used the relations between the primal and dual problems established in Section III-C.
The convexity of $F$ implies that
(26)
where $\lambda^\star$ is a minimum point of $F$, i.e., $\nabla F(\lambda^\star)=0$. Adding (25) and (26) together, we further obtain
(27)