1 Introduction
Over the last decade, decentralized learning has received considerable attention in the machine learning community due to the rise of distributed high-dimensional datasets. This paper focuses on finding a global solution to learning problems in the setting where each node has access to only a subset of the data and is allowed to exchange information with its neighboring nodes only. Specifically, consider a connected network with m nodes, where each node i has access to a local function f_i that is the average of n component functions f_i^j, i.e., f_i = (1/n) Σ_{j=1}^{n} f_i^j. Considering x_i as the local variable of node i, the problem of interest is
(1)   min_{x_1, …, x_m}  Σ_{i=1}^{m} f_i(x_i)   subject to   x_i = x_j for all neighboring nodes i and j.
The formulation (1) captures problems in sensor networks, mobile computation, and multi-agent control, where either efficiently centralizing the data or globally aggregating intermediate results is infeasible (Johansson, 2008; Bullo et al., 2009; Forero et al., 2010; Ribeiro, 2010).
Developing efficient methods for such problems has been one of the major efforts in the machine learning community. While early work dates back to the 1980s (Tsitsiklis et al., 1986; Bertsekas & Tsitsiklis, 1989), consensus-based gradient descent and dual averaging methods with sublinear convergence made their debut more recently (Nedic & Ozdaglar, 2009; Duchi et al., 2012). These methods consist of two steps: all nodes (i) gather the (usually dense) iterates from their neighbors via communication to compute a weighted average, and (ii) update the average by the full gradient of the local function to obtain new iterates. Following this protocol, successors with linear convergence have been proposed recently (Shi et al., 2015a; Mokhtari et al., 2016; Scaman et al., 2017).
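The two-step protocol above can be sketched in a few lines. Below is a minimal, self-contained simulation of consensus-based decentralized gradient descent on a 4-node ring minimizing the sum of local quadratics; the mixing matrix W, the data b, and the diminishing step size are illustrative choices, not taken from the paper.

```python
import numpy as np

# A minimal sketch of consensus-based decentralized gradient descent (DGD):
# 4 nodes on a ring minimize the sum of local quadratics
# f_i(x) = 0.5*(x - b_i)^2, whose consensus optimum is mean(b).

m = 4
b = np.array([1.0, 2.0, 3.0, 4.0])            # local data, one target per node
W = np.array([[0.5, 0.25, 0.0, 0.25],         # symmetric, doubly stochastic
              [0.25, 0.5, 0.25, 0.0],         # ring: self weight 1/2,
              [0.0, 0.25, 0.5, 0.25],         # each neighbor 1/4
              [0.25, 0.0, 0.25, 0.5]])

x = np.zeros(m)                               # one scalar iterate per node
for k in range(1, 2001):
    # (i) gather neighbors' iterates and average, (ii) local gradient step
    x = W @ x - (1.0 / k) * (x - b)           # diminishing step => consensus

print(np.allclose(x, b.mean(), atol=1e-2))    # every node is close to 2.5
```

The diminishing step is what makes plain DGD exact but only sublinearly convergent, which is the behavior the linearly convergent successors below improve upon.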
Despite this progress, two entangled challenges, arising from the above interlacing steps, still remain. The first challenge is the computational complexity of existing methods. Real-world tasks commonly suffer from ill-conditioning of the underlying problem, which deteriorates the performance of existing methods due to their heavy dependence on the problem condition number (Shi et al., 2015a; Mokhtari et al., 2016). Besides, even a single node can contain a plethora of data points, which impedes the full local gradient evaluation required by most existing methods. The second challenge is the high communication overhead. Existing linearly convergent methods overlook this practical issue and simply adopt a dense communication strategy, which restrains their applications.
Furthermore, important problems like AUC maximization involve pairwise component functions that take inputs from outside the local node. Multiple rounds of communication are necessary to estimate the gradient, which precludes the direct application of existing linearly convergent algorithms; only sublinearly convergent algorithms exist (Colin et al., 2016; Ying et al., 2016).
To bridge these gaps, we rephrase problem (1) under the monotone operator framework and propose an efficient algorithm named Decentralized Stochastic Backward Aggregation (DSBA). In the computation step of DSBA, each node computes the resolvent of a stochastically approximated monotone operator to reduce the dependence on the problem condition number. This resolvent admits a closed-form solution in problems like ridge regression. In the communication step of DSBA, each node receives the nonzero components of the difference between consecutive iterates to reconstruct its neighbors' iterates. Since the relaxed AUC maximization problem is equivalent to the minimax problem of a convex-concave function, whose differential is a monotone operator, fitting it into our formulation is seamless. More specifically, our contributions are as follows:
DSBA accesses a single data point in each iteration and converges linearly at a fast rate. The number of steps required to reach an accurate solution is , where is the condition number of the problem and is the condition number of the graph. This rate significantly improves over existing stochastic decentralized solvers and most deterministic ones, and it also holds for the relaxed AUC maximization.

In contrast to the dense vector transmission in existing methods, the inter-node communication in DSBA is sparse. Specifically, the per-iteration communication complexity is for DSBA and for all other linearly convergent methods, where is the sparsity of the dataset and is the problem dimension. When communication is a critical factor, our sparse communication scheme is more favorable.
Empirical studies on convex minimization and AUC maximization problems are conducted to validate the efficiency of our algorithm. Improvements are observed in both computation and communication.
Notations
We use bold uppercase letters to denote matrices and bold lowercase letters to denote vectors. We refer to the row of a matrix by and to the element in the row and column by . denotes the power of . is the projection operator onto the range of .
2 Related Work
Method  Convergence Rate  Per-iteration Cost  Communication Cost 

EXTRA (Shi et al., 2015a)  
DLM (Ling et al., 2015)  
SSDA (Scaman et al., 2017)  
DSA (Mokhtari & Ribeiro, 2016)  
DSBA (this paper)  
DSBA-s (this paper) 
Deterministic Methods: Directly solving the primal objective, the consensus-based Decentralized Gradient Descent (DGD) method (Nedic & Ozdaglar, 2009; Yuan et al., 2016) yields a sublinear convergence rate. EXTRA (Shi et al., 2015a) improves over DGD by incorporating information from the last two iterates and is shown to converge linearly. Alternatively, D-ADMM (Shi et al., 2014) directly applies the ADMM method to problem (1) and achieves linear convergence. However, D-ADMM computes the proximal operator of in each iteration. To avoid this expensive proximal computation, Ling et al. propose a linearized variant of D-ADMM named DLM (Ling et al., 2015). There have also been efforts to exploit second-order information to accelerate convergence on ill-conditioned problems (Mokhtari et al., 2017; Eisen et al., 2017). From the dual perspective, (Duchi et al., 2012) uses the dual averaging method and obtains a sublinearly convergent algorithm. The work in (Necoara et al., 2017) applies random block coordinate gradient descent to the dual objective to obtain linear convergence with a rate that depends on , the number of blocks selected per iteration. When , multiple rounds of communication are needed to implement the method. Recently, (Scaman et al., 2017) applied accelerated gradient descent to the dual problem of (1) to give a method named SSDA and its multi-step communication variant MSDA, and showed that the proposed methods are optimal. However, both SSDA and MSDA require computing the gradient of the conjugate function . All of the above methods access the whole dataset in each iteration without exploiting the finite-sum structure.
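For concreteness, a minimal sketch of the EXTRA update on a toy problem (local quadratics on a ring) is given below; the mixing matrix, step size, and data are illustrative choices, and the point is only that the two-iterate correction yields convergence to the exact consensus optimum with a constant step size.

```python
import numpy as np

# A minimal sketch of the EXTRA update (Shi et al., 2015a): 4 nodes on a
# ring with local objectives f_i(x) = 0.5*(x - b_i)^2, so the consensus
# optimum is mean(b). W, W_tilde = (I + W)/2, alpha, and b are illustrative.

m = 4
b = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])
W_tilde = 0.5 * (np.eye(m) + W)
alpha = 0.5                                   # constant step size

grad = lambda x: x - b                        # stacked local gradients
x_prev = np.zeros(m)
x_curr = W @ x_prev - alpha * grad(x_prev)    # special first EXTRA step
for _ in range(500):
    # incorporate the last TWO iterates plus a gradient correction
    x_next = ((np.eye(m) + W) @ x_curr - W_tilde @ x_prev
              - alpha * (grad(x_curr) - grad(x_prev)))
    x_prev, x_curr = x_curr, x_next

print(np.allclose(x_curr, b.mean(), atol=1e-6))
```

With a constant step, plain DGD would stall in a neighborhood of the optimum; the correction term involving the previous iterate removes that bias, which is the source of EXTRA's linear rate.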
Stochastic Methods: By incorporating the SAGA approximation technique, Mokhtari & Ribeiro recently proposed a method named DSA to handle problem (1) in a stochastic manner. In each iteration, it only computes the gradient of a single component function, which is significantly cheaper than the full gradient evaluation (Mokhtari & Ribeiro, 2016). DSA converges linearly, but the overall required complexity depends heavily on the function and graph condition numbers.
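The SAGA-style gradient approximation that DSA employs locally can be sketched as follows on a scalar least-squares problem; the problem instance, step size, and variable names are illustrative.

```python
import numpy as np

# A minimal sketch of the SAGA gradient estimator (Defazio et al., 2014):
# a table of historical component gradients corrects the variance of the
# freshly sampled gradient. Problem: min_x (1/n) sum_j 0.5*(a_j*x - y_j)^2.

rng = np.random.default_rng(0)
n = 50
a = rng.normal(size=n)
y = rng.normal(size=n)
x_star = (a @ y) / (a @ a)               # closed-form least-squares solution

comp_grad = lambda j, x: a[j] * (a[j] * x - y[j])   # gradient of component j

x = 0.0
table = np.array([comp_grad(j, x) for j in range(n)])  # stored gradients
table_mean = table.mean()
step = 0.05
for _ in range(5000):
    j = rng.integers(n)                  # sample one component
    g_new = comp_grad(j, x)
    v = g_new - table[j] + table_mean    # SAGA estimator: unbiased, and its
    x -= step * v                        # variance vanishes at the optimum
    table_mean += (g_new - table[j]) / n
    table[j] = g_new

print(abs(x - x_star) < 1e-3)            # converged near the minimizer
```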
We summarize the convergence rate, computation and communication cost of the aforementioned methods in Table 1.
3 Preliminary
3.1 Monotone Operator
Monotone operators are a tool for modeling optimization problems, including convex minimization (Rockafellar et al., 1970) and minimax problems of convex-concave functions (Rockafellar, 1970). A relation A ⊆ ℝᵈ × ℝᵈ is a monotone operator if
(2)   ⟨u − v, x − y⟩ ≥ 0   for all (x, u), (y, v) ∈ A.
A is maximal monotone if there is no monotone operator that properly contains it. We say an operator A is μ-strongly monotone (with μ > 0) if
(3)   ⟨u − v, x − y⟩ ≥ μ‖x − y‖²   for all (x, u), (y, v) ∈ A,
and A is β-cocoercive (with β > 0) if
(4)   ⟨u − v, x − y⟩ ≥ β‖u − v‖²   for all (x, u), (y, v) ∈ A.
The cocoercive property implies the maximality and the Lipschitz continuity of A,
(5)   ‖u − v‖ ≤ (1/β)‖x − y‖   for all (x, u), (y, v) ∈ A,
but not vice versa (Bauschke et al., 2017); indeed, the Cauchy-Schwarz inequality gives ‖u − v‖‖x − y‖ ≥ ⟨u − v, x − y⟩ ≥ β‖u − v‖², which yields (5). However, if A is both Lipschitz continuous and strongly monotone, it is cocoercive. We denote the identity operator by I and define the resolvent of a maximal monotone operator A, for γ > 0, as
(6)   J_{γA} = (I + γA)⁻¹.
Finding the root of a maximal monotone operator is equivalent to finding a fixed point of its resolvent:
(7)   0 ∈ A(x⋆)   ⟺   x⋆ = J_{γA}(x⋆),
and when A = ∂f for a convex function f, J_{γA} is equivalent to the proximal operator prox_{γf}.
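As a quick numerical illustration (not from the paper), the sketch below instantiates the resolvent for the gradient operator of a strongly convex quadratic and checks the root/fixed-point equivalence in (7); Q, b, and γ are arbitrary choices.

```python
import numpy as np

# Resolvent J = (I + gamma*A)^{-1} for the gradient operator A(x) = Qx - b
# of a strongly convex quadratic, with a check that the root of A is
# exactly a fixed point of J.

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
Q = M @ M.T + np.eye(3)          # positive definite => A is strongly monotone
b = rng.normal(size=3)
gamma = 0.7

A = lambda x: Q @ x - b
# (I + gamma*A)(x) = z  <=>  (I + gamma*Q) x = z + gamma*b
J = lambda z: np.linalg.solve(np.eye(3) + gamma * Q, z + gamma * b)

x_root = np.linalg.solve(Q, b)          # the unique root: A(x_root) = 0
print(np.allclose(J(x_root), x_root))   # the root is a fixed point of J
```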
3.2 Convexconcave Formulation of AUC Maximization
Area Under the ROC Curve (AUC) (Hanley & McNeil, 1982) is a widely used metric for measuring classification performance. For a sample set S = S⁺ ∪ S⁻ with positive part S⁺, negative part S⁻, and a scoring function h, it is defined as
(8)   AUC(h) = (1/(n⁺ n⁻)) Σ_{x_i ∈ S⁺} Σ_{x_j ∈ S⁻} 1[h(x_i) > h(x_j)],
where n⁺ and n⁻ are the numbers of positive and negative instances. However, directly maximizing AUC is NP-hard, as it is equivalent to a combinatorial optimization problem (Gao et al., 2013). Practical implementations take a linear scoring function h(x) = θᵀx and replace the discontinuous indicator function with a convex surrogate, e.g., the squared loss
(9)   f(θ) = (1/(n⁺ n⁻)) Σ_{x_i ∈ S⁺} Σ_{x_j ∈ S⁻} (1 − θᵀ(x_i − x_j))².
However, f comprises pairwise losses
(10)   f_{ij}(θ) = (1 − θᵀ(x_i − x_j))²,
each of which depends on two data points.
each of which depends on two data points. As discussed in (Colin et al., 2016), minimizing (9) in a decentralized manner remains a challenging task.
4 Problem Formulation
Consider a set of nodes that form a connected graph with node set and edge set . We assume that the edges are reciprocal, i.e., iff , and denote by the neighborhood of node , i.e., .
For the decision variable , consider the problem of finding the root of the operator , where the operator is only available at node and is defined as the sum of Lipschitz continuous strongly monotone operators .
To handle this problem in a decentralized fashion, we define as the local copy of at node and solve the program
(13) 
The finite-sum minimization problem (1) is a special case of (13), obtained by setting , and the relaxed AUC maximization (11) is captured by choosing with . Since is strongly monotone and Lipschitz continuous, it is cocoercive (Bauschke et al., 2017).
To gain a more concrete understanding of the problem, we first introduce an equivalent formulation of Problem (13). Define the matrix as the concatenation of the local iterates and the operator as . Consider a mixing matrix satisfying the following conditions:
(i) (Graph sparsity) If , then ;
(ii) (Symmetry) ;
(iii) (Null space property) ;
(iv) (Spectral property) .
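A matrix with these four properties can be built, for instance, with Metropolis-Hastings weights; the following sketch does so for a small illustrative graph and checks the conditions numerically (the graph and the weight rule are our own choices, not the paper's).

```python
import numpy as np

# A minimal mixing-matrix construction on a connected graph with
# reciprocal edges, using Metropolis-Hastings weights.

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
m = 4
deg = np.zeros(m, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((m, m))
for i, j in edges:
    w = 1.0 / (1 + max(deg[i], deg[j]))            # Metropolis weight
    W[i, j] = W[j, i] = w                          # (i) sparsity, (ii) symmetry
W += np.diag(1.0 - W.sum(axis=1))                  # self weights: rows sum to 1

eigvals = np.sort(np.linalg.eigvalsh(W))
print(np.allclose(W, W.T),                         # (ii) symmetry
      np.allclose(W @ np.ones(m), np.ones(m)),     # (iii) consensus vectors fixed
      eigvals[0] > -1 and eigvals[-2] < 1)         # (iv) spectrum in (-1, 1]
```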
It can be shown that Problem (13) is equivalent to
(14)  
subject to 
This is true since and therefore the condition implies that a matrix is feasible iff .
If we define , the optimality conditions of Problem (14) imply that there exists some , such that for and
(15) 
where is a solution of Problem (14). Note that . The first equation of (15) depicts the optimality of : if is a solution, every column of lies in and hence there exists such that . We can simply take , which gives . The second equation of (15) describes the consensus property of and is equivalent to the constraint of Problem (14).
Using (15), we formulate Problem (13) as finding the root of the following operator
(16) 
where the augmented variable matrix is obtained by concatenating with . Using the result in (Davis, 2015), it can be shown that is a maximally monotone operator, and hence its resolvent is well defined. Unfortunately, directly implementing the fixed-point iteration requires access to global information, which is infeasible in decentralized settings. Inspired by (Wu et al., 2016), we introduce the positive definite matrix
(17) 
and use the fixed point iteration of the resolvent of to find the root of (16) according to the recursion
(18) 
Note that since is positive definite, shares the same roots as , and therefore the solutions of the fixed-point updates of and are identical.
The main advantage of the recursion in (18) is that it can be implemented with only a single round of local communication. However, (18) is usually computationally expensive to evaluate. For instance, when , (18) degenerates to the update of P-EXTRA (Shi et al., 2015b), which computes the proximal operator of in each iteration. Evaluating this proximal operator is considered computationally costly in general, especially for large-scale optimization.
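The scaling argument above, that pre-multiplying by a positive definite matrix preserves the roots while the resolvent fixed-point iteration still converges, can be checked numerically; in the sketch below the operator is affine and strongly monotone, and Q, P, and b are arbitrary illustrative choices.

```python
import numpy as np

# For positive definite P, the operator P^{-1}A shares its roots with
# A(x) = Qx - b, and the fixed-point iteration on the resolvent of P^{-1}A
# converges to that common root.

rng = np.random.default_rng(3)
def rand_pd(d):
    M = rng.normal(size=(d, d))
    return M @ M.T + np.eye(d)   # positive definite

Q, P = rand_pd(3), rand_pd(3)
b = rng.normal(size=3)
root = np.linalg.solve(Q, b)                       # A(root) = 0

Pinv_Q = np.linalg.solve(P, Q)                     # P^{-1} Q
Pinv_b = np.linalg.solve(P, b)                     # P^{-1} b
# resolvent of P^{-1}A: solve (I + P^{-1}Q) x = z + P^{-1}b
J = lambda z: np.linalg.solve(np.eye(3) + Pinv_Q, z + Pinv_b)

x = np.zeros(3)
for _ in range(500):
    x = J(x)                                       # fixed-point iteration

print(np.allclose(x, root, atol=1e-8))             # converged to the root of A
```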
In the following section, we introduce an alternative approach that improves the update in (18) in terms of both computation and communication cost by stochastically approximating .
5 Decentralized Stochastic Backward Aggregation
In this section, we propose the Decentralized Stochastic Backward Aggregation (DSBA) algorithm for Problem (13). By exploiting the finite-sum structure of each and the sparsity pattern of the component operators , DSBA yields lower per-iteration computation and communication costs.
Let be a random sample and approximate by
(19) 
where is the history of operator outputs, maintained in the same manner as in SAGA (Defazio et al., 2014). We further denote . Using this definition, we replace the operators and in (16) by their approximate versions, defined as
(20) 
Hence, the fixed point update (18) is changed to
(21) 
which, by plugging in the definitions of , , , and , can be written as
(22)  
(23) 
Computing the difference between two consecutive iterations of (22) and using (23) leads to the update of the proposed DSBA algorithm, for ,
(24) 
where . By setting , the update for step is given by
(25) 
Implementation on Node
We now focus on the detailed implementation on a single node .
The local version of the update (24) reads
(26) 
This update can be further simplified. Using the definition
(27) 
we have , and therefore the update in (26) can be simplified to
(28) 
Note that shares the same nonzero pattern as the dataset and is usually sparse. For the initial step , since , we have . However, we cannot directly carry out (28), since it involves the unknown . To resolve this issue, we define for
(29) 
Using (29) and (28), it can be easily verified that , therefore can be computed as
(30) 
Indeed, the outcome of the updates in (29)(30) is equivalent to the update in (28), and they can be computed in a decentralized manner. Also, for the initial step , the variable can be computed according to (30) with
(31) 
The resolvent (30) can be obtained by solving a one-dimensional equation for learning problems like logistic regression, and it admits a closed-form solution for problems like least squares and relaxed AUC maximization. DSBA is summarized in Algorithm 1.
Remark 5.1.
DSBA is related to DSA in the case where is the gradient of a function, i.e., . In each iteration, if we compute with
(32) 
i.e., we evaluate at instead of , we recover the DSA method (Mokhtari & Ribeiro, 2016). In this gradient-operator setting, when the network consists of only a single node, DSBA degenerates to the Point-SAGA method (Defazio, 2016).
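As noted above, for a least-squares component the resolvent has a closed form; the sketch below derives it for a single-sample loss via the Sherman-Morrison identity and verifies it numerically, with a, y, γ, and z chosen arbitrarily.

```python
import numpy as np

# Closed-form resolvent for one least-squares component
# f(x) = 0.5*(a'x - y)^2: the resolvent solves x + gamma*a*(a'x - y) = z,
# and Sherman-Morrison gives an O(d) closed form.

rng = np.random.default_rng(4)
d = 5
a = rng.normal(size=d)
y, gamma = 0.3, 0.8
z = rng.normal(size=d)

# closed form: x = z - gamma * a * (a'z - y) / (1 + gamma * ||a||^2)
x = z - gamma * a * (a @ z - y) / (1.0 + gamma * (a @ a))

residual = x + gamma * a * (a @ x - y) - z      # resolvent equation residual
print(np.allclose(residual, 0.0))               # the closed form is exact
```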
5.1 Implementation with Sparse Communication
In existing decentralized methods, nodes need to compute weighted averages of their neighbors' iterates, which are dense in general. Therefore, a full -dimensional vector must be transmitted over every edge in each iteration.
In this section, we assume the output of every component operator is sparse, i.e., for all , and show that DSBA can be implemented by transmitting only the (usually sparse) vector (27).
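The saving from transmitting only nonzero components can be illustrated with a simple (index, value) encoding; the scheme below is our own sketch, not the paper's wire format.

```python
import numpy as np

# Instead of a dense d-dimensional vector, a node sends only (index, value)
# pairs for the nonzero entries of its update, so the message size scales
# with the sparsity s instead of the dimension d.

d = 1000
update = np.zeros(d)
update[[3, 41, 977]] = [0.5, -1.2, 0.7]            # s = 3 nonzero entries

def encode(v):
    idx = np.flatnonzero(v)
    return idx, v[idx]                             # send indices + values

def decode(idx, vals, d):
    v = np.zeros(d)
    v[idx] = vals                                  # neighbor reconstructs
    return v

idx, vals = encode(update)
recovered = decode(idx, vals, d)
cost_sparse = 2 * len(idx)                         # O(s) numbers per message
cost_dense = d                                     # O(d) numbers per message
print(np.allclose(recovered, update), cost_sparse < cost_dense)
```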
WLOG, we take the perspective of node to describe the communication and computation strategies. First, we define the topological distance from node to node by
(33) 
and we have for all nodes with distance . Let the diameter of the network be .
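The topological distance and the diameter are standard shortest-path quantities; a minimal breadth-first-search sketch (on an illustrative 4-node ring) is:

```python
from collections import deque

# Topological distance d(i, j): length of the shortest path between nodes,
# computed by breadth-first search over an adjacency-list graph.

def distances_from(src, adj):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # 4-node ring
dist = distances_from(0, adj)
diameter = max(max(distances_from(s, adj).values()) for s in adj)
print(dist[2], diameter)   # node 2 is 2 hops from node 0; ring diameter is 2
```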
All communication in the network happens when computing . For and , we unfold the iteration (29) using the definition of in (24),
(34) 
where . Suppose that we have a communication strategy satisfying the following assumption: before evaluating (34), node has the set . Term 3⃝ can be computed because each of its terms, , only needs {}. Further, if we can inductively ensure that 1⃝ and every term of 2⃝ are in the memory of node before computing (34), then can be computed, since 4⃝ is local information.
In the following, we introduce a communication strategy that satisfies this assumption and show that the inductions on 1⃝ and 2⃝ hold.
Communication: We group the nodes based on the distance: .
Define the set .
Letting , we recursively define .
Our communication strategy is: in the iteration, sends the set to .
Under this strategy, in iteration , node receives from the set .
Note that if appears in multiple neighbors of node , only the one with the minimum node index sends it to node .
Since , the desired set is obtained.
Computation: We now inductively show that 1⃝ and 2⃝ can be computed. At the beginning of iteration , assume that , , and are maintained in memory. By the preceding discussion, can be computed, and hence and can be obtained. To maintain the induction, we compute via its definition (28), where (by the communication strategy), , and (by induction) are already available to node . To obtain , we first compute and then compute recursively
(35) 
for , where the first term comes from the induction, the second term is in memory, and the last term is computed in 3⃝. We summarize our strategy in Algorithm 2.
As the choice of node is arbitrary, we use the aforementioned communication and computation strategies for all nodes. By induction, if each node generates correctly at iteration , we can show that , and hence , can also be computed correctly in the same manner. The computation complexity at each node is , dominated by step 1 of Algorithm 2.
The average communication complexity is . WLOG, use node as a proxy for all nodes. The computation part requires the set to be received by node at time . Removing duplicates, we have . Hence, the number of doubles received by node is . Further, since the amount of data sent by all nodes equals the amount of data received by all nodes, the result follows.
The local storage requirement of DSBA is . Aside from the storage for the dataset, a node stores a delayed copy of the other nodes' iterates, which costs a memory of , and due to the use of a linear predictor, the cost of storing gradient information at each node is (Schmidt et al., 2017). Hence, the overall required storage is . As is the number of nonzero elements in the vector , it follows that and hence . Further, if we assume every sample has more than N nonzero entries, then dominates as well, and we need a memory of .
6 Convergence Analysis
In this section, we study the convergence properties of the proposed DSBA method. To this end, we define a proper Lyapunov function for DSBA and prove its linear convergence to zero, which implies linear convergence of the iterates to the optimal solution . To do so, we first define and the sequences of matrices and as
(36) 
Recall the definition of in (15) and define as the concatenation of and , i.e., .
Lemma 6.1.
Proof.
See Section 9.1 in the supplementary material. ∎
The result in Lemma 6.1 shows the relation between the norm and its previous iterate . Therefore, to analyze the speed of convergence of , we first need to derive bounds on the remaining terms in (38). To do so, we need to define a few more terms. Selecting component operator on node in the iteration, we define for all
Computing requires evaluating the resolvent of , but we define it here only for the analysis. In the actual procedure, we only select , compute , and set . With these definitions in place, we define two nonnegative sequences that are crucial to our analysis:
(39) 
(40) 
where the nonnegativity of the sequences is due to the monotonicity of each component operator . Define as the component-wise discrepancy between the historically evaluated stochastic gradients and the gradients at the optimum
(41) 
where is maintained by the SAGA strategy. In the following lemma, we derive an upper bound on the expected inner product .
Lemma 6.2.
Proof.
See Section 9.2 in the supplementary material. ∎
The next lemma bounds the discrepancy between the average of the historically evaluated stochastic gradients and the gradients at the optimal point.