Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication

05/25/2018 ∙ by Zebang Shen, et al. ∙ 0

Decentralized optimization has recently attracted growing attention. Most existing methods are deterministic, have a high per-iteration cost, and have a convergence rate that depends quadratically on the problem condition number. Moreover, they require dense communication to ensure convergence even if the dataset is sparse. In this paper, we generalize the decentralized optimization problem to a monotone operator root-finding problem, and propose a stochastic algorithm named DSBA that (i) converges geometrically with a rate that depends linearly on the problem condition number, and (ii) can be implemented using sparse communication only. Additionally, DSBA handles learning problems like AUC maximization, which previous decentralized methods cannot tackle efficiently. Experiments on convex minimization and AUC maximization validate the efficiency of our method.


1 Introduction

Over the last decade, decentralized learning has received a lot of attention in the machine learning community due to the rise of distributed high-dimensional datasets. This paper focuses on finding a global solution to learning problems in the setting where each node has access only to a subset of the data and is allowed to exchange information only with its neighboring nodes. Specifically, consider a connected network with $N$ nodes where each node $n$ has access to a local function $f_n$ which is the average of $q$ component functions $f_{n,i}$, i.e. $f_n(\mathbf{x}) = \frac{1}{q}\sum_{i=1}^{q} f_{n,i}(\mathbf{x})$. Considering $\mathbf{x}_n$ as the local variable of node $n$, the problem of interest is

$$\min_{\mathbf{x}_1,\dots,\mathbf{x}_N} \; \sum_{n=1}^{N} f_n(\mathbf{x}_n) \quad \text{s.t.} \quad \mathbf{x}_1 = \mathbf{x}_2 = \cdots = \mathbf{x}_N. \tag{1}$$

The formulation (1) captures problems in sensor networks, mobile computation, and multi-agent control, where either efficiently centralizing data or globally aggregating intermediate results is infeasible (Johansson, 2008; Bullo et al., 2009; Forero et al., 2010; Ribeiro, 2010).

Developing efficient methods for such problems has been one of the major efforts in the machine learning community. Early work dates back to the 1980s (Tsitsiklis et al., 1986; Bertsekas & Tsitsiklis, 1989). More recently, consensus-based gradient descent and dual averaging methods with sublinear convergence were introduced (Nedic & Ozdaglar, 2009; Duchi et al., 2012). These methods consist of two steps: all nodes (i) gather the (usually dense) iterates from their neighbors via communication to compute a weighted average, and (ii) update the average using the full gradient of the local function to obtain new iterates. Following this protocol, successors with linear convergence have been proposed recently (Shi et al., 2015a; Mokhtari et al., 2016; Scaman et al., 2017).
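The two-step protocol described above can be sketched in a few lines. This is only an illustration under assumed inputs: the 3-node path graph, the mixing weights `W`, the quadratic local functions, and the function name `consensus_gradient_descent` are all hypothetical and do not come from any of the cited papers.

```python
def consensus_gradient_descent(W, grads, x0, step, iters):
    """Run the two-step consensus protocol on scalar local variables."""
    n = len(x0)
    x = list(x0)
    for _ in range(iters):
        # Step (i): each node averages the iterates of its neighbors.
        # W[i][j] is zero whenever j is not a neighbor of i.
        mixed = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Step (ii): each node corrects the average with its full local gradient.
        x = [mixed[i] - step * grads[i](x[i]) for i in range(n)]
    return x


# Hypothetical 3-node path graph; node i holds f_i(x) = 0.5 * (x - c_i)^2.
centers = [0.0, 3.0, 6.0]
grads = [lambda x, c=c: x - c for c in centers]
W = [[2 / 3, 1 / 3, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 3, 2 / 3]]

x = consensus_gradient_descent(W, grads, [0.0, 0.0, 0.0], step=0.01, iters=5000)
print(x)  # every entry ends up near the global minimizer 3.0
```

With a constant step size the nodes settle in a small neighborhood of the minimizer rather than exactly on it, which is consistent with the inexact, sublinear behavior attributed to this family of methods.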

Despite this progress, two entangled challenges, arising from the two interlaced steps above, still remain. The first challenge is the computational complexity of existing methods. Real-world tasks commonly suffer from ill-conditioning of the underlying problem, which degrades the performance of existing methods due to their heavy dependence on the problem condition number (Shi et al., 2015a; Mokhtari et al., 2016). Besides, even a single node may hold a large number of data points, which makes the full local gradient evaluation required by most existing methods expensive. The second challenge is the high communication overhead. The existing linearly convergent methods overlook this practical issue and simply adopt a dense communication strategy, which limits their applicability.

Furthermore, important problems like AUC maximization involve pairwise component functions which take inputs from outside the local node. Multiple rounds of communication are necessary to estimate the gradient, which precludes the direct application of existing linearly convergent algorithms. Only sublinearly convergent algorithms exist (Colin et al., 2016; Ying et al., 2016).

To bridge these gaps, we rephrase problem (1) under the monotone operator framework and propose an efficient algorithm named Decentralized Stochastic Backward Aggregation (DSBA). In the computation step of DSBA, each node computes the resolvent of a stochastically approximated monotone operator to reduce the dependence on the problem condition number. Such a resolvent admits a closed-form solution in problems like Ridge Regression. In the communication step of DSBA, each node receives the nonzero components of the difference between consecutive iterates to reconstruct the neighbors' iterates. Since the $\ell_2$-relaxed AUC maximization problem is equivalent to the minimax problem of a convex-concave function, whose differential is a monotone operator, fitting it into our formulation is seamless. More specifically, our contributions are as follows:

  1. DSBA accesses a single data point in each iteration and converges linearly at a fast rate. The number of steps required to reach an $\epsilon$-accurate solution is $O((\kappa + \kappa_g + q)\log(1/\epsilon))$, where $\kappa$ is the condition number of the problem and $\kappa_g$ is the condition number of the graph. This rate significantly improves over the existing stochastic decentralized solvers and most deterministic ones, and it also holds for the $\ell_2$-relaxed AUC maximization.

  2. In contrast to the dense vector transmission in existing methods, the inter-node communication is sparse in DSBA. Specifically, the per-iteration communication complexity is $O(\rho N d)$ for DSBA and $O(\Delta d)$ for all the other linearly convergent methods, where $\rho$ is the sparsity of the dataset, $d$ is the problem dimension, and $\Delta$ is the maximum degree of the graph. When communication is a critical factor, our sparse communication scheme is more favorable.

Empirical studies on convex minimization and AUC maximization problems are conducted to validate the efficiency of our algorithm. Improvements are observed in both computation and communication.

Notations

We use the bold uppercase letters to denote matrices and bold lowercase letters to denote vectors. We refer the row of matrix by and refer the element in the row and column by . denotes the power of . is the projection operator to the range of .

2 Related Work

Method Convergence Rate Per-iteration Cost Communication Cost
EXTRA (Shi et al., 2015a)
DLM (Ling et al., 2015)
SSDA (Scaman et al., 2017)
DSA (Mokhtari & Ribeiro, 2016)
DSBA (this paper)
DSBA-s (this paper)
Table 1: $\kappa$ is the condition number of the problem and $\kappa_g$ is the condition number of the network graph, defined in Section 6. $\Delta$ is the max degree of the graph. $\rho$ is the sparsity of the dataset, i.e. the ratio of nonzero elements. $\tau$ is the complexity of solving a one-dimensional equation, and is $O(1)$ in problems like Ridge Regression. All complexities are derived for problems with a linear predictor.

Deterministic Methods: Directly solving the primal objective, the consensus-based Decentralized Gradient Descent (DGD) method (Nedic & Ozdaglar, 2009; Yuan et al., 2016) yields a sublinear convergence rate. EXTRA (Shi et al., 2015a) improves over DGD by incorporating information from the last two iterates and is shown to converge linearly. Alternatively, D-ADMM (Shi et al., 2014) directly applies the ADMM method to problem (1) and achieves linear convergence. However, D-ADMM computes the proximal operator of the local function in each iteration. To avoid such expensive proximal operator computation, Ling et al. propose a linearized variant of D-ADMM named DLM (Ling et al., 2015). There have also been some efforts to exploit second-order information for accelerating convergence in ill-conditioned problems (Mokhtari et al., 2017; Eisen et al., 2017). From the dual perspective, (Duchi et al., 2012) uses the dual averaging method and obtains a sublinearly convergent algorithm. The work in (Necoara et al., 2017) applies random block coordinate gradient descent on the dual objective to obtain linear convergence with a rate that depends on , the number of blocks being selected per iteration. When , multiple rounds of communication are needed to implement the method. Recently, (Scaman et al., 2017) applies accelerated gradient descent methods to the dual problem of (1) to give a method named SSDA and its multi-step communication variant MSDA, and shows that the proposed methods are optimal. However, both SSDA and MSDA require computing the gradient of the conjugate of the local function. All the above methods access the whole dataset in each iteration without exploiting the finite sum structure.

Stochastic Methods: By incorporating the SAGA approximation technique, Mokhtari & Ribeiro recently proposed a method named DSA to handle Problem (1) in a stochastic manner. In each iteration, it computes only the gradient of a single component function, which is significantly cheaper than a full gradient evaluation (Mokhtari & Ribeiro, 2016). DSA converges linearly, but its overall complexity depends heavily on the function and graph condition numbers.

We summarize the convergence rate, computation and communication cost of the aforementioned methods in Table 1.

3 Preliminary

3.1 Monotone Operator

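The monotone operator is a tool for modeling optimization problems including convex minimization (Rockafellar et al., 1970) and minimax problems of convex-concave functions (Rockafellar, 1970). A relation $\mathcal{B}$ is a monotone operator if

$$(\mathbf{u}-\mathbf{v})^\top(\mathbf{x}-\mathbf{y}) \ge 0, \quad \forall\, (\mathbf{x},\mathbf{u}), (\mathbf{y},\mathbf{v}) \in \mathcal{B}. \tag{2}$$

$\mathcal{B}$ is maximal monotone if there is no monotone operator that properly contains it. We say an operator $\mathcal{B}$ is $\mu$-strongly monotone if

$$(\mathbf{u}-\mathbf{v})^\top(\mathbf{x}-\mathbf{y}) \ge \mu \|\mathbf{x}-\mathbf{y}\|^2, \tag{3}$$

and is $1/L$-cocoercive if

$$(\mathbf{u}-\mathbf{v})^\top(\mathbf{x}-\mathbf{y}) \ge \frac{1}{L} \|\mathbf{u}-\mathbf{v}\|^2. \tag{4}$$

The cocoercive property implies the maximality and the Lipschitz continuity of $\mathcal{B}$,

$$\|\mathcal{B}(\mathbf{x})-\mathcal{B}(\mathbf{y})\| \le L \|\mathbf{x}-\mathbf{y}\|, \tag{5}$$

but not vice versa (Bauschke et al., 2017). However, if $\mathcal{B}$ is both Lipschitz continuous and strongly monotone, it is cocoercive. We denote the identity operator by $\mathcal{I}$ and define the resolvent of a maximal monotone operator $\mathcal{B}$ as

$$\mathcal{J}_{\alpha\mathcal{B}} = (\mathcal{I} + \alpha\mathcal{B})^{-1}, \quad \alpha > 0. \tag{6}$$

Finding the root of a maximal monotone operator is equivalent to finding the fixed point of its resolvent:

$$\mathbf{0} \in \mathcal{B}(\mathbf{x}^*) \iff \mathbf{x}^* = \mathcal{J}_{\alpha\mathcal{B}}(\mathbf{x}^*), \tag{7}$$

and when $\mathcal{B} = \partial f$ for a convex function $f$, $\mathcal{J}_{\alpha\mathcal{B}}$ is equivalent to the proximal operator of $f$ with step size $\alpha$.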
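To make the link between roots and resolvent fixed points concrete, the following sketch iterates the resolvent of a small affine operator. The operator $B(x) = Mx - b$, the step size, and the helper name `resolvent_affine` are illustrative assumptions; the matrix is chosen so that the operator is strongly monotone but is not the gradient of any function.

```python
def resolvent_affine(M, b, alpha, z):
    """Solve x + alpha * (M x - b) = z for x, with M a 2x2 matrix."""
    a11, a12 = 1.0 + alpha * M[0][0], alpha * M[0][1]
    a21, a22 = alpha * M[1][0], 1.0 + alpha * M[1][1]
    r1, r2 = z[0] + alpha * b[0], z[1] + alpha * b[1]
    det = a11 * a22 - a12 * a21
    # Cramer's rule for the 2x2 system (I + alpha * M) x = z + alpha * b.
    return [(a22 * r1 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det]


# B(x) = M x - b is monotone (the symmetric part of M is 2I) but not a gradient.
M = [[2.0, 1.0], [-1.0, 2.0]]
b = [3.0, 1.0]
root = [1.0, 1.0]  # satisfies M root = b, i.e. B(root) = 0

z = [10.0, -4.0]
for _ in range(60):
    z = resolvent_affine(M, b, 1.0, z)  # fixed-point iteration of the resolvent
print(z)  # approaches the root [1.0, 1.0]
```

Applying the resolvent to the root returns the root itself, while any other starting point is contracted toward it.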

3.2 Convex-concave Formulation of AUC Maximization

Area Under the ROC Curve (AUC) (Hanley & McNeil, 1982) is a widely used metric for measuring performance of classification, defined as

(8)

where is the set of samples, and is some scoring function. However, directly maximizing AUC is NP-hard as it is equivalent to a combinatorial optimization problem (Gao et al., 2013). Practical implementations take and replace the discontinuous indicator function with its convex surrogates, e.g. the $\ell_2$-loss

(9)

where and are the numbers of positive and negative instances. However, comprises of pairwise losses

(10)

each of which depends on two data points. As discussed in (Colin et al., 2016), minimizing (9) in a decentralized manner remains a challenging task.
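The pairwise structure is easy to see in code. The sketch below computes the empirical AUC and a squared pairwise surrogate of the form $(1 - (s^+ - s^-))^2$ averaged over all positive-negative pairs; the function names and the toy scores are assumptions made for illustration, and the surrogate is written for raw scores rather than for a specific model.

```python
def split_scores(scores, labels):
    """Separate scores of positive (+1) and negative (-1) samples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    return pos, neg


def auc(scores, labels):
    """Fraction of positive-negative pairs that are ranked correctly."""
    pos, neg = split_scores(scores, labels)
    wins = sum(1 for sp in pos for sn in neg if sp > sn)
    return wins / (len(pos) * len(neg))


def pairwise_l2_surrogate(scores, labels):
    """Average squared pairwise loss; every term couples two data points."""
    pos, neg = split_scores(scores, labels)
    total = sum((1.0 - (sp - sn)) ** 2 for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


scores = [0.9, 0.2, 0.3, 0.4]
labels = [1, -1, 1, -1]
print(auc(scores, labels))                    # 0.75
print(pairwise_l2_surrogate(scores, labels))  # about 0.59
```

Both functions loop over pairs drawn from the whole dataset, which is exactly what becomes expensive when the positive and negative samples live on different nodes.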

For , define . Ying et al. (2016) reformulate the minimization of the surrogate loss (9) as

(11)

where, for the function is given by

(12)

This singleton formulation is amenable to the decentralized framework because depends only on a single data point.

4 Problem Formulation

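Consider a set of $N$ nodes which form a connected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with the node set $\mathcal{V} = \{1, \dots, N\}$ and the edge set $\mathcal{E}$. We assume that the edges are reciprocal, i.e., $(m,n) \in \mathcal{E}$ iff $(n,m) \in \mathcal{E}$, and denote by $\mathcal{N}_n$ the neighborhood of node $n$, i.e. $\mathcal{N}_n = \{m : (m,n) \in \mathcal{E}\}$.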

For the decision variable , consider the problem of finding the root of the operator , where the operator is only available at node and is defined as the sum of Lipschitz continuous strongly monotone operators .

To handle this problem in a decentralized fashion we define as the local copy of at node and solve the program

(13)

The finite sum minimization problem (1) is a special case of (13) by setting , and the $\ell_2$-relaxed AUC maximization (11) is captured by choosing with . Since is strongly monotone and Lipschitz continuous, it is cocoercive (Bauschke et al., 2017).
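The claim that the differential of a convex-concave function is a monotone operator, and that strong monotonicity plus Lipschitz continuity gives cocoercivity, can be checked numerically on a toy saddle function. The function $F(x,y) = \tfrac12 x^2 + xy - \tfrac12 y^2$ and the constants below are assumptions chosen for illustration; for this $F$ the operator is 1-strongly monotone and $\sqrt{2}$-Lipschitz, and the standard bound gives cocoercivity with parameter $\mu/L^2 = 1/2$.

```python
import random


def saddle_operator(p):
    """B(x, y) = (dF/dx, -dF/dy) for F(x, y) = 0.5*x**2 + x*y - 0.5*y**2."""
    x, y = p
    return (x + y, y - x)


def check_pair(u, v, mu=1.0, lipschitz=2 ** 0.5):
    """Check strong monotonicity and (mu / L^2)-cocoercivity on one pair."""
    du = (u[0] - v[0], u[1] - v[1])
    bu, bv = saddle_operator(u), saddle_operator(v)
    db = (bu[0] - bv[0], bu[1] - bv[1])
    inner = db[0] * du[0] + db[1] * du[1]
    dist2 = du[0] ** 2 + du[1] ** 2
    out2 = db[0] ** 2 + db[1] ** 2
    tol = 1e-9 * (1.0 + dist2)  # slack for floating-point rounding
    strongly_monotone = inner >= mu * dist2 - tol
    cocoercive = inner >= (mu / lipschitz ** 2) * out2 - tol
    return strongly_monotone, cocoercive


rng = random.Random(0)
results = []
for _ in range(1000):
    u = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    v = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    results.append(check_pair(u, v))
print(all(s and c for s, c in results))  # True
```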

To have a more concrete understanding of the problem, we first introduce an equivalent formulation of Problem (13). Define the matrix as the concatenation of the local iterates and the operator as . Consider a mixing matrix satisfying the following conditions:

(i) (Graph sparsity) If , then ;

(ii) (Symmetry) ;

(iii) (Null space property) ;

(iv) (Spectral property) .

It can be shown that Problem (13) is equivalent to

(14)
subject to

This is true since and therefore the condition implies that a matrix is feasible iff .
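One common way to obtain a matrix with the four listed properties is the lazy Metropolis rule; the source does not say which construction it uses, so the recipe, the 4-node path graph, and the function names below are assumptions for illustration. The checks verify symmetry, the graph sparsity pattern, unit row sums, and diagonal dominance (which places the eigenvalues in $[0, 1]$), and then show that repeated mixing drives all nodes to the average, as expected when the only fixed vectors are the consensus ones.

```python
def lazy_metropolis(n, edges):
    """Build a symmetric mixing matrix supported on the graph edges."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = 0.5 / (1 + max(deg[i], deg[j]))  # halved Metropolis weight
        W[i][j] = W[j][i] = w
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i])  # make every row sum to one
    return W


def mix(W, x, rounds):
    """Apply the mixing matrix repeatedly to a vector of node values."""
    n = len(x)
    for _ in range(rounds):
        x = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical path graph on 4 nodes
W = lazy_metropolis(4, edges)
consensus = mix(W, [4.0, 0.0, 0.0, 0.0], rounds=500)
print(consensus)  # every node approaches the average 1.0
```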

If we define , the optimality conditions of Problem (14) imply that there exists some , such that for and

(15)

where is a solution of Problem (14). Note that . The first equation of (15) depicts the optimality of : if is a solution, every column of is in and hence there exists such that . We can simply take which gives . The second equation of (15) describes the consensus property of and is equivalent to the constraint of Problem (14).

Using (15), we formulate Problem (13) as finding the root of the following operator

(16)

where the augmented variable matrix is obtained by concatenating with . Using the result in (Davis, 2015), it can be shown that is a maximally monotone operator, and hence its resolvent is well defined. Unfortunately, directly implementing the fixed point iteration requires access to global information, which is infeasible in decentralized settings. Inspired by (Wu et al., 2016), we introduce the positive definite matrix

(17)

and use the fixed point iteration of the resolvent of to find the root of (16) according to the recursion

(18)

Note that since is positive definite, shares the same roots with , therefore the solutions of the fixed point updates of and are identical.

The main advantage of the recursion in (18) is that it can be implemented with a single round of local communication only. However, (18) is usually computationally expensive to evaluate. For instance, when , (18) degenerates to the update of P-EXTRA (Shi et al., 2015b), which computes the proximal operator of in each iteration. Evaluating such a proximal operator is computationally costly in general, especially for large-scale optimization.

In the following section, we introduce an alternative approach that improves the update in (18) in terms of both computation and communication cost by stochastically approximating .

5 Decentralized Stochastic Backward Aggregation

In this section, we propose the Decentralized Stochastic Backward Aggregation (DSBA) algorithm for Problem (13). By exploiting the finite sum structure of each and the sparsity pattern in component operator , DSBA yields lower per-iteration computation and communication cost.

Let be a random sample, approximate by

(19)

where is the history operator output maintained in the same manner as in SAGA (Defazio et al., 2014). We further denote . Using this definition, we replace the operators and in (16) by their approximate versions, defined as
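The SAGA-style bookkeeping can be illustrated with a generic estimator: the fresh output of the sampled component, minus its stored historical output, plus the average of the whole table. The scalar components, the table initialization, and the name `saga_estimate` are hypothetical, and the sketch evaluates the sampled component at a given point rather than at the implicitly defined next iterate used by a backward step.

```python
def saga_estimate(ops, table, z, i):
    """Variance-reduced estimate of the averaged operator using sample i."""
    return ops[i](z) - table[i] + sum(table) / len(table)


# Hypothetical scalar components B_i(z) = a_i * z - b_i.
coeffs = [(1.0, 2.0), (3.0, -1.0), (0.5, 4.0)]
ops = [lambda z, a=a, b=b: a * z - b for a, b in coeffs]
table = [op(0.0) for op in ops]  # historical outputs from an older point

z = 1.7
full = sum(op(z) for op in ops) / len(ops)
estimates = [saga_estimate(ops, table, z, i) for i in range(len(ops))]
print(sum(estimates) / len(estimates), full)  # the two values agree

# After sample i is used, its table entry is overwritten with the fresh output.
i = 1
table[i] = ops[i](z)
```

Averaging the estimate over a uniformly random index recovers the full averaged operator, which is why only one component needs to be evaluated per iteration.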

(20)

Hence, the fixed point update (18) is changed to

(21)

which by plugging in the definitions of , , , and can be written as

(22)
(23)

Computing the difference between two consecutive iterations of (22) and using (23) lead to the update of the proposed DSBA algorithm, for ,

(24)

where . By setting , the update for step is given by

(25)
0:  consensus initializer , step size , , ;
1:  For all , initialize , set ;
2:  for  do
3:     Gather the iterates from neighbors ;
4:     Choose uniformly at random from the set ;
5:     Update according to (31) () or (29) ()
6:     Compute  from (30);
7:     Compute ;
8:     Set and for ;
9:  end for
Algorithm 1 DSBA for node

Implementation on Node
We now focus on the detailed implementation on a single node . The local version of the update (24) reads

(26)

This update can be further simplified. Using the definition

(27)

we have , and therefore the update in (26) can be simplified to

(28)

Note that shares the same nonzero pattern as the dataset and is usually sparse. For the initial step , since , we have . However, we cannot directly carry out (28) since involves the unknown . To resolve this issue we define for

(29)

Using (29) and (28), it can be easily verified that , therefore can be computed as

(30)

Indeed, the outcome of the updates in (29)-(30) is equivalent to the update in (28), and they can be computed in a decentralized manner. Also, for the initial step , the variable can be computed according to (30) with

(31)

The resolvent (30) can be obtained by solving a one-dimensional equation for learning problems like Logistic Regression, and admits a closed-form solution for problems like least squares and $\ell_2$-relaxed AUC maximization. DSBA is summarized in Algorithm 1.
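As a rough illustration of the one-dimensional reduction for logistic regression, consider a single component $f(x) = \log(1 + \exp(-y\, a^\top x))$. Because its gradient is a scalar multiple of $a$, the resolvent equation $x + \alpha \nabla f(x) = z$ reduces to a scalar equation in $t = a^\top x$, solved here by bisection. The function names, the bisection solver, and the sample values are assumptions, not the paper's implementation.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def logistic_grad(x, a, y):
    """Gradient of f(x) = log(1 + exp(-y * <a, x>))."""
    s = -y * sigmoid(-y * dot(a, x))
    return [s * ai for ai in a]


def logistic_resolvent(z, a, y, alpha, iters=200):
    """Solve x + alpha * grad f(x) = z via a scalar equation in t = <a, x>."""
    c, n2 = dot(a, z), dot(a, a)

    def g(t):
        # Increasing in t; its root is the value of <a, x> at the solution.
        return t - c - alpha * y * n2 * sigmoid(-y * t)

    lo, hi = c - alpha * n2, c + alpha * n2  # g(lo) <= 0 <= g(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    scale = alpha * y * sigmoid(-y * t)
    return [zi + scale * ai for zi, ai in zip(z, a)]


a, y, alpha = [1.0, -2.0, 0.5], -1, 0.8
z = [0.3, 0.1, -0.7]
x = logistic_resolvent(z, a, y, alpha)
residual = [xi + alpha * gi - zi for xi, gi, zi in zip(x, logistic_grad(x, a, y), z)]
print(max(abs(r) for r in residual))  # close to zero
```

Bisection is used only because it is simple and safe; a Newton step on the same scalar equation would be a natural alternative.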

Remark 5.1.

DSBA is related to DSA in the case that is the gradient of a function, i.e. . In each iteration, if we compute with

(32)

i.e. we evaluate at the instead of , we recover the DSA method (Mokhtari & Ribeiro, 2016). In such a gradient operator setting, when there is only a single node, DSBA degenerates to the Point-SAGA method (Defazio, 2016).

5.1 Implementation with Sparse Communication

In existing decentralized methods, nodes need to compute the weighted averages of their neighbors’ iterates, which are dense in general. Therefore a -dimensional full vector must be transmitted via every edge in each iteration.

In this section, we assume the output of every component operator is -sparse, i.e. for all and show that DSBA can be implemented by only transmitting the usually sparse vector (27).
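The saving from transmitting sparse vectors can be sketched with a generic index-value encoding. This shows only the encoding and the accumulation of received messages, not the DSBA reconstruction of neighbor iterates; the function names and the toy vectors are assumptions.

```python
def encode_sparse(vec):
    """Keep only the nonzero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(vec) if v != 0.0]


def accumulate(dense, message):
    """Add a received sparse message into a locally stored dense copy."""
    for i, v in message:
        dense[i] += v


d = 8
updates = [
    [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]

receiver_copy = [0.0] * d
sent = 0
for delta in updates:
    msg = encode_sparse(delta)
    sent += len(msg)  # number of values actually transmitted
    accumulate(receiver_copy, msg)

print(receiver_copy)      # [0.5, 0.0, 1.5, 0.0, 0.0, 0.0, -1.0, 0.0]
print(sent, len(updates) * d)  # 4 values sent instead of 16
```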

WLOG, we take the perspective of node to describe the communication and computation strategies. First, we define the topological distance from node to node by

(33)

and we have, for all node with distance . Let the diameter of the network be .
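Topological distances and the network diameter can be computed with a breadth-first search. The adjacency list below is a hypothetical 5-node graph, and the function names are assumptions; the last lines group nodes by their distance from a chosen node.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: number of hops from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Largest hop distance between any two nodes of a connected graph."""
    return max(max(hop_distances(adj, s).values()) for s in adj)


adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
dist_from_0 = hop_distances(adj, 0)

# Group nodes by their distance from node 0.
groups = {}
for node, dist in sorted(dist_from_0.items()):
    groups.setdefault(dist, []).append(node)

print(dist_from_0)    # {0: 0, 1: 1, 2: 2, 3: 2, 4: 3}
print(groups)         # {0: [0], 1: [1], 2: [2, 3], 3: [4]}
print(diameter(adj))  # 3
```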

All the communication in the network happens when computing . For and we unfold the iteration (29) by the definition of in (24),

(34)

where Suppose that we have a communication strategy that satisfies the following assumption: before evaluating (5.1), node has the set . 3⃝ can be computed because computing each term of 3⃝, , only needs {}. Further, if we can inductively ensure that 1⃝ and every term 2⃝ are in the memory of node before computing (5.1), can be computed since 4⃝ is local information.

In the following, we introduce a communication strategy that satisfies the assumption and show that the inductions on 1⃝ and 2⃝ hold.
Communication: We group the nodes based on the distance: . Define the set . Let , we recursively define . Our communication strategy is, in the iteration, sends the set to . From such strategy, in iteration , node receives from the set . Note that if appears in multiple neighbors of node , only the one with the minimum node index sends it to node . Since , the desired set is obtained.

Computation: We now inductively show that 1⃝ and 2⃝ can be computed. At the beginning of iteration , assume that , , and are maintained in memory. According to the above discussion, can be computed and hence and can be obtained. To maintain the induction, we compute by its definition (28) where (by the communication strategy), , and (by induction) are already available to node . To obtain , we compute first and then compute recursively

(35)

for , where the first term is from induction, the second term is in memory, and the last term is computed in 3⃝. We summarize our strategy in Algorithm 2.

As the choice of node is arbitrary, we use the aforementioned communication and computation strategies for all nodes. By induction, if each node generates correctly at iteration , we can show that and hence can also be correctly computed in the same manner. The computation complexity at each node is , dominated by step 1 in Algorithm 2.

The average communication complexity is of . WLOG, use node as a proxy for all nodes. The computation part requires the set to be received by node at time . Removing the duplicates, we have . Hence, the number of DOUBLEs received by node is of . Further, since the amount of data sent by all nodes equals the amount of data received by all nodes, we have the result.

The local storage requirement of DSBA is . Aside from the storage for the dataset, a node stores a delayed copy of other nodes which costs a memory of , and due to the use of linear predictor the cost of storing gradient information at each node is , (Schmidt et al., 2017). Hence, the overall required storage is . As is the number of nonzero elements in the vector it follows that and hence . Further, if we assume every sample has more than N nonzero entries, dominates as well, we need a memory of .

0:  
1:  Compute from its definition;
2:  Compute from (5.1) and from (30);
3:  , update the gradient table;
4:  Compute from (5.1);
4:  
Algorithm 2 Computation on node at iteration

6 Convergence Analysis

In this section, we study the convergence properties of the proposed DSBA method. To achieve this goal, we define a proper Lyapunov function for DSBA and prove its linear convergence to zero which leads to linear convergence of the iterates to the optimal solution . To do so, first we define and the sequence of matrices and as

(36)

Recall the definition of in (15) and define as the concatenation of and , i.e., .

Lemma 6.1.

Consider the proposed DSBA method defined in Algorithm 1. By incorporating the definitions of the matrices and in (36), it can be shown that

(37)

and

(38)
Proof.

See Section 9.1 in the supplementary material. ∎

The result in Lemma 6.1 shows the relation between the norm and its previous iterate . Therefore, to analyze the speed of convergence for we first need to derive bounds for the remaining terms in (38). To do so, we need to define a few more terms. By selecting component operator on node in the iteration, we define for all as

Computing requires evaluating the resolvent of , but here we only define it for the analysis. In the actual procedure, we only select , compute , and set . With this definition, we define two nonnegative sequences that are crucial to our analysis:

(39)
(40)

where the nonnegativity of the sequence is due to the monotonicity of each component operator . Define as the component-wise discrepancy between the historically evaluated stochastic gradients and gradients at the optimum

(41)

where is maintained by the SAGA strategy. In the following lemma, we derive an upper bound on the expected inner product .

Lemma 6.2.

Consider the proposed DSBA method defined in Algorithm 1. Further, recall the definitions of the sequences , , and in (39), (40), and (41), respectively. If each operator is -cocoercive, it holds for any and that

(42)
Proof.

See Section 9.2 in the supplementary material. ∎

The next lemma bounds the discrepancy between the average of the historically evaluated stochastic gradients and the gradients at the optimal point.

Lemma 6.3.

Consider the DSBA method outlined in Algorithm 1. From the construction of and the definitions of and in (39) and (41), respectively, we have for ,