
Pick your Neighbor: Local Gauss-Southwell Rule for Fast Asynchronous Decentralized Optimization

07/15/2022
by Marina Costantini, et al.

In decentralized optimization environments, each agent i in a network of n optimization nodes possesses a private function f_i, and nodes communicate with their neighbors to cooperatively minimize the aggregate objective ∑_i=1^n f_i. In this setting, synchronizing the nodes' updates incurs significant communication overhead and computational costs, so much of the recent literature has focused on the analysis and design of asynchronous optimization algorithms where agents activate and communicate at arbitrary times, without requiring a global synchronization enforcer. Nonetheless, in most of the work on the topic, active nodes select a neighbor to contact based on a fixed probability (e.g., uniformly at random), a choice that ignores the optimization landscape at the moment of activation. Instead, in this work we introduce an optimization-aware selection rule that chooses the neighbor with the highest dual cost improvement (a quantity related to a consensus-based dualization of the problem at hand). This scheme is related to the coordinate descent (CD) method with a Gauss-Southwell (GS) rule for coordinate updates; in our setting however, only a subset of coordinates is accessible at each iteration (because each node is constrained to communicate only with its direct neighbors), so the existing literature on GS methods does not apply. To overcome this difficulty, we develop a new analytical framework for smooth and strongly convex f_i that covers the class of set-wise CD algorithms – a class that directly applies to decentralized scenarios, but is not limited to them – and we show that the proposed set-wise GS rule achieves a speedup by a factor of up to the maximum degree in the network (which is of the order of Θ(n) in highly connected graphs). The speedup predicted by our theoretical analysis is subsequently validated in numerical experiments with synthetic data.



1 Introduction

A great number of timely applications require solving optimization problems over a network where nodes can only communicate with their direct neighbors. This may be due to the need to distribute storage and computation loads (e.g. training large machine learning models [lian2017can]), or to avoid transferring data that is naturally collected in a decentralized manner, whether because of communication costs or privacy concerns (e.g. sensor networks [wan2009event], edge computing [alrowaily2018secure]).

Specifically, we consider a setting where the nodes want to solve the decentralized optimization problem

$$ \min_{x \in \mathbb{R}^d} \; \sum_{i=1}^{n} f_i(x), \tag{1} $$

where each local function $f_i$ is known only by node $i$, and nodes can exchange optimization values (parameters, gradients) but not the local functions themselves. We represent the communication network as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $n$ nodes (agents) and $E$ edges, which are the links used by the nodes to communicate with their neighbors.

Problem (1) was formally introduced in [nedic2007rate] and widely studied ever since. A convenient reformulation often adopted in the literature assigns to each node $i$ a local variable $x_i$ and forces consensus between node pairs connected by an edge:

$$ \min_{x_1, \dots, x_n} \; \sum_{i=1}^{n} f_i(x_i) \tag{2a} $$
$$ \text{subject to} \quad x_i = x_j \quad \forall\, (i,j) \in \mathcal{E}, \tag{2b} $$

where $(i,j) \in \mathcal{E}$ indicates that an edge links nodes $i$ and $j$. Decentralized algorithms to solve (2) allow all nodes to find the minimum value of (1) by communicating only with their neighbors and updating their local variables. This is in contrast with broadcast AllReduce algorithms [rabenseifner2004optimization] or parallel distributed architectures [xiao2019dscovr], which were recently shown to be slower than decentralized schemes in some scenarios [lian2017can].

Here we use reformulation (2) to propose an asynchronous decentralized algorithm where nodes activate at any time uniformly at random and, once activated, choose one of their neighbors to make an update. Methods with such minimal coordination requirements avoid the extra costs of synchronization, which may also slow down convergence; this is the reason why many algorithms for this asynchronous setting have been proposed in the literature [iutzeler2013asynchronous, wei2013convergence, xu2017convergence, pu2020push, srivastava2011distributed, ram2009asynchronous]. However, most of these works assume that when a node activates, it simply selects the neighbor to contact randomly, based on a predefined probability distribution. This approach overlooks the possibility of letting nodes choose the neighbor to contact by taking into account the optimization landscape at the time of activation. Therefore, here we depart from the probabilistic choice and ask: can nodes pick the neighbor smartly to make the optimization process converge faster?

In this paper we give an affirmative answer and propose an algorithm that achieves this by solving the dual problem of (2). In the dual formulation there is one dual variable $\lambda_{ij}$ per constraint $x_i = x_j$, hence each dual variable can be associated with an edge of the graph. Our algorithm lets an activated node contact a neighbor so that together they update their shared variable $\lambda_{ij}$ with a gradient step. In particular, we propose to select the neighbor such that the updated $\lambda_{ij}$ is the one whose directional gradient of the dual function is the largest in magnitude, and thus the one that provides the greatest cost improvement at that iteration. Such an optimal choice for asynchronous decentralized optimization has not yet been considered in the literature.

Interestingly, the above protocol, where a node activates and selects a $\lambda_{ij}$ to update, can be seen as applying the coordinate descent (CD) method [nesterov2012efficiency] to solve the dual problem of (2), with the following key difference: unlike in standard CD methods, now only a small subset of coordinates is accessible at each step, namely the coordinates associated with the edges connected to the activated node. Moreover, our proposal of updating the $\lambda_{ij}$ with the largest gradient is similar to the Gauss-Southwell (GS) rule [nutini2015coordinate], but applied only to the parameters accessible to the activated node.

We name such protocols set-wise CD algorithms and we analyze, in particular, both uniform random sampling and the GS rule for the coordinate selection within the accessible set. To the best of our knowledge, convergence rates for set-wise CD schemes have not yet been explored; hence, it is not known what speedup the GS rule can provide compared to uniform sampling in this setting. Furthermore, three difficulties complicate the analysis and form the basis of our contributions, namely: (i) for arbitrary graphs, the dual problem of (2) has an objective function that is not strongly convex, even if the primal functions $f_i$ are strongly convex; (ii) the fact that the GS rule is applied to only a few coordinates prevents the use of standard norms to obtain the linear rate, as commonly done for CD methods [nesterov2012efficiency, nutini2015coordinate, nutini2017let]; and (iii) the fact that the coordinate sets are overlapping (i.e. non-disjoint) makes the problem even harder.

For this reason, we develop a methodology where we prove strong convexity in norms uniquely defined for each algorithm considered. In particular, for the set-wise GS rule this requires relating the norm that we originally define to an alternative norm that considers non-overlapping sets, for which the problem becomes easier and solvable analytically.

Finally, our results also apply to the parallel distributed setting where the parameter vector is stored at a single server and workers can update different subsets of its entries [tsitsiklis1986distributed, peng2016arock, xiao2019dscovr]. We show an example in our simulations.

Our contributions can be summarized as follows:

  • We introduce the class of set-wise CD algorithms and analyze two variants to pick the coordinate to update in the activated set: one that uses uniform sampling (SU-CD), and another that applies the GS rule (SGS-CD).

  • We show that this class of algorithms can be used to solve (2) asynchronously, and we provide the linear convergence rates of the two variants considered when the primal functions are smooth and strongly convex.

  • To obtain these rates for SU-CD and SGS-CD, we prove strong convexity in uniquely-defined norms that, respectively (i) take into account the graph structure to show strong convexity in the linear subspace where the coordinate updates are applied, and (ii) account for both the random uniform node activation and the application of the GS rule to just a subset of the coordinates.

  • We show that the speedup of SGS-CD with respect to SU-CD can be up to the size of the largest coordinate set, which is analogous to that of the GS rule with respect to uniform random coordinate sampling in centralized CD [nutini2015coordinate].

2 Related work

A number of algorithms have been proposed to solve (1) in the asynchronous setup that we consider here. In [ram2009asynchronous], the activated node chooses a neighbor uniformly at random and both nodes average their primal local values. In [iutzeler2013asynchronous] the authors adapted the ADMM algorithm to the decentralized setting, but the ADMM of [wei2013convergence] was the first shown to converge at the same rate as the centralized ADMM. The algorithm of [xu2017convergence] tracks the average gradients to converge to the exact optimum instead of just a neighborhood around it, as many algorithms did at the time. The algorithm of [pu2020push] can be used on top of directed graphs, which impose additional challenges. A key novelty of our scheme compared to this line of work is that we consider the possibility of letting the nodes choose the neighbor to contact in order to make convergence faster.

Work [verma2021max] is, to the best of our knowledge, the only work similarly considering smart neighbor selection. The authors propose Max-Gossip, a version of [nedic2007rate] where the activated node averages its local (primal) parameter with that of the neighbor with whom the parameter difference is the largest. They consider convex scalar functions $f_i$ and use Lyapunov analysis to prove convergence to an optimal value. In contrast, here we obtain linear convergence rates for smooth and strongly convex $f_i$ using duality theory.

Moreover, our rate results for SU-CD and SGS-CD extend the results in [nutini2015coordinate], where the GS rule was shown to be up to $n$ times faster than uniform sampling ($n$ being the number of coordinates), to the case where this choice is constrained to only a subset of the coordinates, the sets may have different sizes, each coordinate belongs to exactly two sets, and sets activate uniformly at random. This matches not only the decentralized case, but also parallel distributed settings such as [tsitsiklis1986distributed, peng2016arock, xiao2019dscovr].

3 Dual formulation

In this section, we define notation that will be used in the rest of the paper, obtain the dual problem of (2), and analyze the properties of the dual objective function. We will assume throughout that the functions $f_i$ are $L_i$-smooth and $\mu_i$-strongly convex, i.e., for all $x, y$:

$$ f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \tfrac{L_i}{2}\|y - x\|^2, \qquad f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x \rangle + \tfrac{\mu_i}{2}\|y - x\|^2. $$

We define the concatenated primal and dual variables $x = [x_1^\top, \dots, x_n^\top]^\top$ and $\lambda = [\lambda_1^\top, \dots, \lambda_E^\top]^\top$, respectively. The graph's incidence matrix $A \in \mathbb{R}^{n \times E}$ has exactly one $1$ and one $-1$ per column $e$, in the rows corresponding to the two nodes linked by edge $e$, and zeros elsewhere (the choice of sign for each node is irrelevant). We call $u_e$ the vector in $\mathbb{R}^E$ that has a $1$ in entry $e$ and $0$ elsewhere; we define its counterpart in $\mathbb{R}^n$ in the same way, with the only difference being the dimension. Vectors $\mathbf{1}$ and $\mathbf{0}$ are respectively the all-one and all-zero vectors, and $I_d$ is the $d \times d$ identity matrix. Finally, we define the block arrays $A \otimes I_d$ and $u_e \otimes I_d$, where $\otimes$ is the Kronecker product.
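As a quick sanity check of this notation (an illustration added here, not part of the paper), the sketch below builds the incidence matrix of a 4-node ring and verifies that $A^\top x = \mathbf{0}$ exactly when all node variables agree:

```python
import numpy as np

def incidence_matrix(n, edges):
    """Node-by-edge incidence matrix: one +1 and one -1 per column.
    The sign assignment per edge is arbitrary, as noted in the text."""
    A = np.zeros((n, len(edges)))
    for e, (i, j) in enumerate(edges):
        A[i, e], A[j, e] = 1.0, -1.0
    return A

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # 4-node ring (illustrative)
A = incidence_matrix(4, edges)

x_consensus = 3.7 * np.ones(4)               # all nodes agree
x_disagree = np.array([1.0, 2.0, 3.0, 4.0])  # nodes disagree
print(A.T @ x_consensus)                     # ~0: constraint (2b) satisfied
print(A.T @ x_disagree)                      # nonzero: constraint violated
```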

We can now rewrite (2b) as $A^\top x = \mathbf{0}$ (writing the case $d = 1$ for simplicity), where $x$ stacks the node variables as above. The minimum value of (2) satisfies:

$$ \min_{x:\, A^\top x = \mathbf{0}} \sum_{i=1}^{n} f_i(x_i) \;\overset{(a)}{=}\; \min_{x} \max_{\lambda} \Big[ \sum_{i=1}^{n} f_i(x_i) - \lambda^\top A^\top x \Big] \;\overset{(b)}{=}\; \max_{\lambda} \min_{x} \Big[ \sum_{i=1}^{n} f_i(x_i) - \lambda^\top A^\top x \Big] \;=\; \max_{\lambda} \; -\sum_{i=1}^{n} f_i^*\big([A\lambda]_i\big), \tag{3} $$

where (a) holds due to Lagrange duality and (b) holds by strong duality (see e.g. Sec. 5.4 in [boyd2004convex]). Functions $f_i^*$ are the Fenchel conjugates of the $f_i$, and are defined as $f_i^*(y) = \sup_x \{\langle y, x \rangle - f_i(x)\}$. In the following we denote the dual objective to be minimized by $F(\lambda) = \sum_{i=1}^n f_i^*([A\lambda]_i)$.
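As a concrete illustration (an example we add here for readability, not taken from the paper), for scalar quadratics the conjugate and its gradient have simple closed forms:

$$ f_i(x) = \tfrac{a_i}{2}(x - b_i)^2 \;\;\Longrightarrow\;\; f_i^*(y) = \tfrac{y^2}{2 a_i} + b_i\, y, \qquad \nabla f_i^*(y) = \tfrac{y}{a_i} + b_i, $$

so $\nabla f_i^*$ maps a dual value back to the primal variable of node $i$, and $f_i^*$ is $1/a_i$-smooth, mirroring the $a_i$-strong convexity of $f_i$.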

Our set-wise CD algorithms converge to the optimal solution of (2) by solving (3). In particular, they update a single dual variable $\lambda_e$ at each iteration and converge to some minimizer of $F$. Since each $\lambda_e$ is associated with an edge of the network, the set-wise CD algorithms can run asynchronously.

We now state the convexity properties of $F$. Since the objective in (2a) is $\hat L$-smooth and $\hat\mu$-strongly convex in $x$, with $\hat L = \max_i L_i$ and $\hat\mu = \min_i \mu_i$, function $F$ is $L_F$-smooth with $L_F = \lambda_{\max}/\hat\mu$, where $\lambda_{\max}$ is the largest eigenvalue of $A^\top A$ (Sec. 4 in [uribe2020dual]). We also define $\lambda_{\min}^+$ as the smallest non-zero eigenvalue of $A^\top A$ (the "+" stresses that $\lambda_{\min}^+$ is the smallest strictly positive eigenvalue).
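For intuition, both spectral quantities are easy to compute numerically; the sketch below (assuming a 4-node ring for illustration) evaluates them on $A^\top A$, whose nonzero spectrum coincides with that of the graph Laplacian $A A^\top$:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # 4-node ring (illustrative)
A = np.zeros((4, len(edges)))
for e, (i, j) in enumerate(edges):
    A[i, e], A[j, e] = 1.0, -1.0

eigs = np.linalg.eigvalsh(A.T @ A)            # same nonzero spectrum as the Laplacian A A^T
lambda_max = eigs.max()
lambda_min_plus = min(v for v in eigs if v > 1e-9)
print(lambda_max, lambda_min_plus)            # ring of 4 nodes: 4.0 and 2.0
```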

However, as shown next, function $F$ is not strongly convex in the standard L2 norm; strong convexity is a property that facilitates the performance analysis of many linear-rate optimization methods in the literature.

Proposition 1.

$F$ is not strongly convex in $\|\cdot\|_2$.

Proof.

Since $A$ does not have full column rank in the general case (i.e., unless the graph is a tree), there exist $\lambda_1 \neq \lambda_2$ such that $A\lambda_1 = A\lambda_2$, and thus $F(\lambda_1) = F(\lambda_2)$: $F$ is constant along the kernel of $A$ and cannot be strongly convex. ∎

Nevertheless, we can still show linear rates for the set-wise CD algorithms using the fact that $F$ is strongly convex in a linear subspace of $\mathbb{R}^E$, as stated next.

Proposition 2 (Appendix C of [hendrikx2019accelerated]).

$F$ is $\sigma$-strongly convex in the semi-norm $\|\lambda\|_{A^{+}A} = \sqrt{\lambda^\top A^{+} A\, \lambda}$, with $\sigma = \lambda_{\min}^{+}/\hat L$.

In the definition of the semi-norm, $A^{+}$ denotes the pseudo-inverse of $A$. A key fact for the proofs in the next section is that the matrix $A^{+}A$ is a projector onto $\mathrm{range}(A^\top)$, the column space of $A^\top$.

To keep the definitions and notation simple, in the next section we assume that $d = 1$, so that $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^E$, and the gradient of $F$ in the direction of any single $\lambda_e$ is a scalar. After our theoretical analysis and presentation of the results, in Sec. 4.3 we discuss the modifications needed to adapt them to the case $d > 1$.

4 Set-wise Coordinate Descent Algorithms

In this section we present the set-wise CD algorithms, which can solve generic convex problems such as (3) optimally and asynchronously. We propose two set-wise CD algorithms: (i) one where the coordinate to update is selected uniformly at random within the accessible set of coordinates (SU-CD), and (ii) one where the coordinate is picked by applying the GS rule to the coordinates in the accessible set (SGS-CD).

If coordinate $e$ is updated at iteration $t$, under the simplification $d = 1$ the generic CD update applied to $F$ is:

$$ \lambda^{t+1} = \lambda^{t} - \eta \, \nabla_e F(\lambda^{t}) \, u_e, \tag{4} $$

where $\eta$ is the stepsize. Since $F$ is $L_F$-smooth, choosing $\eta = \tfrac{1}{L_F}$ guarantees descent at each iteration [nutini2015coordinate]:

$$ F(\lambda^{t+1}) \;\le\; F(\lambda^{t}) - \tfrac{1}{2 L_F} \big( \nabla_e F(\lambda^{t}) \big)^2. \tag{5} $$

Eq. (5) will be the departure point to prove the linear convergence rates of SU-CD and SGS-CD.

We now formally define the set-wise CD algorithms.

Definition 1 (Set-wise CD algorithm).

In a set-wise CD algorithm, every coordinate is assigned to one or more sets $\mathcal{E}_i$, $i = 1, \dots, n$, such that all coordinates belong to at least one set. At any point in time, a set may activate, with uniform probability among the $n$ sets. When a set activates, the set-wise CD algorithm chooses a single coordinate within it to update using (4).

The next remark shows how the decentralized problem (2) can be solved asynchronously with set-wise CD algorithms.

Remark 1.

By letting (i) the coordinates in Definition 1 be the dual variables $\lambda_e$, and (ii) the sets $\mathcal{E}_i$ be the sets of dual variables corresponding to the edges connected to each node $i$, nodes can run a set-wise CD algorithm to solve (3) (and thus, also (2)) asynchronously.

In light of Remark 1, in the following we illustrate the steps that the nodes should perform to run the set-wise CD algorithms and find a minimizer of (3). We first note that the gradient of $F$ in the direction of $\lambda_{ij}$ for $(i,j) \in \mathcal{E}$ is

$$ \nabla_{ij} F(\lambda) \;=\; \nabla f_i^*\big([A\lambda]_i\big) - \nabla f_j^*\big([A\lambda]_j\big), \tag{6} $$

where we take the sign convention that the column of $A$ for edge $(i,j)$ has $+1$ in row $i$ and $-1$ in row $j$.

Nodes can use (4) and (6) to update the variables that they have access to (i.e., those corresponding to the edges they are connected to) as follows: each node $i$ keeps in memory the current value of $[A\lambda]_i$, which is needed to compute $\nabla f_i^*([A\lambda]_i)$. Then, when edge $(i,j)$ needs to be updated (either because node $i$ activated and contacted $j$, or vice versa), both $i$ and $j$ compute their respective terms on the right hand side of (6) and exchange them through their link. Finally, both nodes compute (6) and update their local copies of $\lambda_{ij}$ by applying (4).

Algorithms 1 and 2 below detail these steps for SU-CD and SGS-CD, respectively. In the algorithms we use $\mathcal{N}_i$ to indicate the set of neighbors of node $i$ (note that $|\mathcal{N}_i| = |\mathcal{E}_i| = d_i$, the degree of node $i$). Table 1 shows this and other set-related notation that will be frequently used in the sections that follow.

We now proceed to describe the SU-CD and SGS-CD algorithms in detail, and prove their linear convergence rates.

$\mathcal{E}_i$: set of edges connected to node $i$
$\mathcal{N}_i$: set of neighbors of node $i$
$d_i$: degree of node $i$, i.e. $d_i = |\mathcal{N}_i| = |\mathcal{E}_i|$
Maximum degree in the network, i.e. $\max_i d_i$
Selector matrix of set $\mathcal{E}_i$ (see Definition 2)
Subset of $\mathcal{E}_i$ in the non-overlapping assignment of Definition 3 (each edge kept by exactly one endpoint)
Selector matrix of that subset
Complement set, obtained by assigning each edge to its other endpoint
Selector matrix of the complement set

Table 1: Set-related definitions

4.1 Set-wise Uniform CD (SU-CD)

In SU-CD, the activated node chooses the neighbor uniformly at random, as shown in Alg. 1. We can compute the expected per-iteration progress of SU-CD by taking expectation in (5):

$$ \mathbb{E}\big[F(\lambda^{t+1}) \,\big|\, \lambda^t\big] \;\le\; F(\lambda^t) \;-\; \tfrac{1}{2 L_F}\, \mathbb{E}\big[\big(\nabla_{e_t} F(\lambda^t)\big)^2\big], \tag{7} $$

where the expectation is over the activated node $i_t$ (uniform over the $n$ nodes) and the edge $e_t$ chosen uniformly at random within $\mathcal{E}_{i_t}$.
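The expectation on the right hand side can be expanded explicitly under this activation model; the following is one way to write it (our reconstruction for readability; the paper may package it into a weighted semi-norm):

$$ \mathbb{E}\big[\big(\nabla_{e_t} F(\lambda^t)\big)^2\big] \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{d_i} \sum_{e \in \mathcal{E}_i} \big(\nabla_e F(\lambda^t)\big)^2 \;=\; \frac{1}{n} \sum_{(i,j) \in \mathcal{E}} \Big( \frac{1}{d_i} + \frac{1}{d_j} \Big) \big(\nabla_{ij} F(\lambda^t)\big)^2, $$

since edge $(i,j)$ can be selected whenever either of its two endpoints activates.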

The standard procedure to show the linear convergence of CD in the centralized case is to lower-bound this expected progress using the strong convexity of the function [nesterov2012efficiency, nutini2015coordinate]. However, since $F$ is not strongly convex (Prop. 1), we cannot apply this procedure directly to get the linear rate of SU-CD.

We can, however, use the strong convexity of $F$ in the semi-norm of Prop. 2 instead. The next result gives the core of the proof.

Proposition 3.

It holds that

where $\|z\|_{*}$ is the dual norm of $\|z\|$, defined as (see e.g. Sec. A.1.6 in [boyd2004convex])

$$ \|z\|_{*} = \sup_{\lambda} \big\{ z^\top \lambda \;:\; \|\lambda\| \le 1 \big\}. \tag{8} $$
Proof.

Note that such that , it holds that and thus . This means that , and therefore it holds that . Finally, since the dual norm of the L2 norm is again the L2 norm, we have that also , which gives the result. ∎

We now use Prop. 3 to prove the linear rate of SU-CD.

1:  Input: functions $f_i$, stepsize $\eta$, incidence matrix $A$, graph $\mathcal{G}$
2:  Initialize $\lambda^0 = \mathbf{0}$ and the local values $[A\lambda^0]_i$ at every node
3:  for $t = 0, 1, 2, \dots$ do
4:     Sample activated node $i$ uniformly at random
5:     Node $i$ picks neighbor $j \in \mathcal{N}_i$ uniformly at random
6:     Node $i$ computes $\nabla f_i^*([A\lambda^t]_i)$ and sends it to $j$
7:     Node $j$ computes $\nabla f_j^*([A\lambda^t]_j)$ and sends it to $i$
8:     Nodes $i$ and $j$ use (6) to update their local copies of $\lambda_{ij}$ by (4)
9:     end for
Algorithm 1 Set-wise Uniform CD (SU-CD)
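To make the procedure concrete, the following self-contained NumPy sketch (not the authors' code; the quadratic local functions, ring graph, stepsize, and all names are assumptions made for illustration) runs SU-CD on the dual of a toy instance of (2) with $d = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: f_i(x) = (a_i/2)(x - b_i)^2 on a ring of n nodes, so that the
# Fenchel conjugates have the closed form used below.
n = 10
a = rng.uniform(1.0, 3.0, size=n)       # per-node curvatures
b = rng.normal(size=n)                  # per-node minimizers

edges = [(i, (i + 1) % n) for i in range(n)]   # ring graph
E = len(edges)
A = np.zeros((n, E))                    # incidence matrix: one +1 and one -1 per column
for e, (i, j) in enumerate(edges):
    A[i, e], A[j, e] = 1.0, -1.0
incident = {i: [e for e, (u, v) in enumerate(edges) if i in (u, v)] for i in range(n)}

def grad_conj(i, y):
    # For the quadratic above, f_i^*(y) = y^2/(2 a_i) + b_i y, so grad f_i^*(y) = y/a_i + b_i.
    return y / a[i] + b[i]

def edge_grad(lam, e):
    # Directional gradient of the dual along lambda_e, cf. eq. (6).
    y = A @ lam
    i, j = edges[e]
    return A[i, e] * grad_conj(i, y[i]) + A[j, e] * grad_conj(j, y[j])

eta = a.min() / 4.0                     # conservative stepsize, <= 1/L_F for this instance
lam = np.zeros(E)
for t in range(20000):
    i = rng.integers(n)                 # a node activates uniformly at random
    e = rng.choice(incident[i])         # SU-CD: pick one of its incident edges uniformly
    lam[e] -= eta * edge_grad(lam, e)

x = np.array([grad_conj(i, (A @ lam)[i]) for i in range(n)])   # primal recovery
print(x.round(4), (a @ b / a.sum()).round(4))                  # all x_i approach the minimizer
```

At convergence all recovered primal values $x_i = \nabla f_i^*([A\lambda]_i)$ coincide with the minimizer of $\sum_i f_i$, which for these quadratics is the curvature-weighted average of the $b_i$.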
Proposition 4 (Rate of SU-CD).

SU-CD converges as

Proof.

Since $F$ is strongly convex in the semi-norm of Prop. 2 with strong convexity constant $\sigma$, it holds

Minimizing both sides of the above inequality with respect to $\lambda$, as in Sec. 4 of [nutini2015coordinate], we get

(9)

and rearranging terms we can lower-bound the expected per-iteration progress.

Finally, we can use Prop. 3 to bound the corresponding term in (7), and use the lower bound given by (9) to get the result. ∎

Note that vector has coordinates, where the inequality holds with equality for regular graphs. We make the following remark.

Remark 2.

If the graph is regular, the linear convergence rate of SU-CD matches the rate of centralized uniform CD for strongly convex functions [nesterov2012efficiency, nutini2015coordinate], with the only difference that now the strong convexity constant is defined with respect to the semi-norm of Prop. 2.

In the next section we analyze SGS-CD and show that its convergence rate can be better than that of SU-CD by a factor of up to the maximum degree in the network.

4.2 Set-wise Gauss-Southwell CD (SGS-CD)

1:  Input: functions $f_i$, stepsize $\eta$, incidence matrix $A$, graph $\mathcal{G}$
2:  Initialize $\lambda^0 = \mathbf{0}$ and the local values $[A\lambda^0]_i$ at every node
3:  for $t = 0, 1, 2, \dots$ do
4:     Sample activated node $i$ uniformly at random
5:     All $j \in \mathcal{N}_i$ compute $\nabla f_j^*([A\lambda^t]_j)$ and send it to $i$
6:     Node $i$ computes $\nabla f_i^*([A\lambda^t]_i)$
7:     Compute $\nabla_{ij} F(\lambda^t)$ for all $j \in \mathcal{N}_i$ (equivalently, with (6) using the received values)
8:     Node $i$ selects $j^\star = \arg\max_{j \in \mathcal{N}_i} |\nabla_{ij} F(\lambda^t)|$
9:     Node $i$ sends $\nabla f_i^*([A\lambda^t]_i)$ to $j^\star$
10:     Nodes $i$ and $j^\star$ use (6) to update their local copies of $\lambda_{ij^\star}$ by (4)
11:     end for
Algorithm 2 Set-wise Gauss-Southwell CD (SGS-CD)

In SGS-CD, as shown in Alg. 2, the activated node $i$ selects the neighbor to contact by applying the GS rule within the edges in $\mathcal{E}_i$:

$$ j^\star = \arg\max_{j \in \mathcal{N}_i} \big| \nabla_{ij} F(\lambda^t) \big|, $$

and the edge updated is then $(i, j^\star)$. In order to make this choice, all neighbors $j \in \mathcal{N}_i$ must send their conjugate gradients $\nabla f_j^*([A\lambda^t]_j)$ to node $i$ (line 5 in Alg. 2). We discuss this additional communication step of SGS-CD with respect to SU-CD in Sec. 6.
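Reusing the hypothetical names of the SU-CD sketch above (`edge_grad`, `incident`), the only change is the neighbor selection; a minimal sketch of the GS step:

```python
import numpy as np

def sgs_pick(lam, i, incident, edge_grad):
    """GS rule within the activated node's set: return the incident edge of node i
    whose dual directional gradient has the largest magnitude (line 8 of Alg. 2)."""
    gains = [abs(edge_grad(lam, e)) for e in incident[i]]
    return incident[i][int(np.argmax(gains))]

# Drop-in replacement inside the SU-CD loop of the previous sketch:
#     e = sgs_pick(lam, i, incident, edge_grad)   # instead of the uniform choice
```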

To obtain the convergence rate of SGS-CD we will follow the steps taken for SU-CD in the proof of Prop. 4. We start by computing the per-iteration progress of SGS-CD by taking expectation in (5):

(10)

Given this per-iteration progress, to proceed as we did for SU-CD we need to show (i) that the sum on the right hand side of (10) defines a norm, and (ii) that strong convexity holds in its dual norm. We start by defining the selector matrices, which will significantly simplify the notation.

Definition 2 (Selector matrices).

The selector matrix of set $\mathcal{E}_i$ selects the coordinates of a vector in $\mathbb{R}^E$ that belong to $\mathcal{E}_i$. Note that any vertical stack of the unit row vectors $u_e^\top$, $e \in \mathcal{E}_i$, gives a valid selector matrix.
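As a toy illustration (numbers and names are ours, not the paper's), a selector matrix simply stacks the unit row vectors of the coordinates in the set:

```python
import numpy as np

def selector(coord_set, E):
    """Stack the unit row vectors u_e^T for e in the set, giving a |set| x E matrix."""
    U = np.zeros((len(coord_set), E))
    for row, e in enumerate(coord_set):
        U[row, e] = 1.0
    return U

# 3-node clique with edges 0:(1,2), 1:(1,3), 2:(2,3); node 1 owns edges {0, 1}.
U1 = selector([0, 1], E=3)
lam = np.array([5.0, -1.0, 2.0])
print(U1 @ lam)        # -> [ 5. -1.], the coordinates of lam in node 1's set
```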

We can now show that the sum in (10) is a (squared) norm. Since the operation involves taking a maximum within each set $\mathcal{E}_i$, we will denote this norm $\|\cdot\|_{SM}$, where the subscript SM stands for "Set-Max".

Proposition 5.

The function $\|\cdot\|_{SM}$ is a norm in $\mathbb{R}^E$.

Proof.

Using the triangle inequalities of the maximum and of the sum, we can show that $\|\cdot\|_{SM}$ satisfies the triangle inequality. It is straightforward to show absolute homogeneity and that the function is zero iff its argument is the zero vector. ∎

Following the proof of Prop. 4, we would like to show that $F$ is strongly convex in the dual norm of $\|\cdot\|_{SM}$. Furthermore, we would like to compare the resulting strong convexity constant with that of SU-CD to quantify the speedup of SGS-CD with respect to SU-CD. It turns out, though, that computing this dual norm is not easy at all; the main difficulty stems from the fact that the sets are overlapping (or non-disjoint), since each coordinate $\lambda_{ij}$ belongs to both $\mathcal{E}_i$ and $\mathcal{E}_j$. The first scheme in Figure 1 illustrates this fact for the 3-node clique.

Figure 1: Example of the overlapping sets for the 3-node clique, one possible non-overlapping assignment, and its complementary assignment.

To circumvent this issue, we define a new norm ("Set-Max Non-Overlapping") that we can directly relate to $\|\cdot\|_{SM}$ (Prop. 6) and whose dual value we can compute explicitly (Prop. 7), which will later allow us to relate its strong convexity constant to that of SU-CD (Prop. 8).

Definition 3 (Norm ).

We assume that each coordinate $\lambda_{ij}$ is assigned to only one of the sets $\mathcal{E}_i$ or $\mathcal{E}_j$, such that the new sets are non-overlapping (some sets can be empty) and all coordinates belong to exactly one of the new sets. We name the selector matrices of these new sets accordingly, so that each possible choice of these matrices defines a different collection of non-overlapping sets. Then, we define

(11)

with the choice of non-overlapping sets

(12)

Note that the maximizations in (11) and (12) are coupled. We denote the value of that attains (11) as .

The definition of the non-overlapping sets corresponds to assigning each edge to one of the two nodes at its endpoints, as illustrated in the second scheme of Figure 1. Therefore, for each possible assignment we can define a complementary assignment such that if an edge was assigned to node $i$ in the former, then it is assigned to its other endpoint in the latter, as shown in the third scheme of Figure 1. With these definitions, it holds (potentially with some permutation of the rows):

We remark that the equality above holds for any selector matrices corresponding to a feasible assignment, and in particular it holds for the assignment that attains (11). This fact is used in the proof of the following proposition, which relates the norm $\|\cdot\|_{SM}$ and the new non-overlapping norm. This will allow us to complete the analysis with the latter, which we can compute explicitly (Prop. 7).

Proposition 6.

The value of the dual norm of , denoted , satisfies .

Proof.

By definition

By inspection we can tell that the that attains the supremum, denoted , will satisfy . We note now that

(13)

with

(14)

Note that if we evaluate (14) at , due to (12) we have . Also, again by inspection (now of problem (11)) we know that satisfies . Therefore, (13) says

from where we conclude that coordinate-wise it must hold , and thus . ∎

The next proposition gives the value of this dual norm explicitly, which will be needed to compare the strong convexity constants of SGS-CD and SU-CD.

Proposition 7.

It holds that .

Proof.

Since the sets are non-overlapping and in (11) norm is applied per-set, the entries of will have and the sign will match that of the entries of , i.e. . The maximization of (11) then becomes

subject to

Factoring out in the objective and noting that , we can define and so that (11) now reads

The right hand side is the definition of the dual of the L2 norm evaluated at this vector. Since the dual of the L2 norm is again the L2 norm, the result follows. ∎

We can now prove the linear convergence rate of SGS-CD.

Proposition 8 (Rate of SGS-CD).

SGS-CD converges as

with

(15)
Proof.

We start by proving (15), by showing that strong convexity in one of the two dual norms implies strong convexity in the other, which gives the inequalities in (15) as a by-product of the analysis. Below we assume that the vectors involved lie in the column space of $A^\top$; the results can then be directly applied to the proofs above because the norms and their duals are applied to $\nabla F(\lambda)$, which always lies in that subspace (Prop. 3).

For it holds that (Props. 3 and 7):

We also note that, using the Cauchy-Schwarz inequality and denoting by $[\cdot]_k$ the $k$-th entry of a vector, it holds both that

where . We can summarize these relations as

Using these inequalities in the strong convexity definitions, similarly to [nutini2015coordinate], we get both

(16)

and

(17)

Equation (16) says that is at least -strongly convex in , and eq. (17) says that is at least -strongly convex in . Together they imply (15).

To get the rate of SGS-CD, and following the procedure used for SU-CD, we need to lower-bound the per-iteration progress in (10). For this we will use strong convexity in the dual Set-Max norm, which we can obtain from the strong convexity that we just proved for its non-overlapping counterpart, as shown next.

Stating that is at least -strongly convex in and using Prop. 6 we obtain:

(18)

from where we conclude that .

Minimizing both sides of the first inequality in (18) with respect to $\lambda$ we obtain

(19)

which is analogous to (9); rearranging terms gives a lower bound on the expected per-iteration progress. Using this lower bound in (10) and substituting the corresponding constants gives the rate of SGS-CD. ∎

Proposition 8 states that SGS-CD can be faster than SU-CD by a factor of up to the size of the largest coordinate set. This result is analogous to that of [nutini2015coordinate] for the GS rule with respect to uniform sampling in centralized CD.

Although this is an upper bound and may not always be achievable, we can think of the following scenario where the gain is attained: let all sets have the same size $D$, let exactly $D - 1$ of the coordinates in each set have zero gradient, and let only one coordinate have a nonzero gradient. In this case, on average SU-CD chooses the coordinate that gives some improvement only once every $D$ iterations, while SGS-CD does so at every iteration (see the worked calculation below).
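To make the counting explicit (a short illustrative calculation, not verbatim from the paper): if in the activated set of size $D$ the only useful coordinate has gradient $g$ and the remaining $D - 1$ coordinates have zero gradient, then plugging the two selection rules into (5) gives

$$ \mathbb{E}\big[(\nabla_{e} F)^2\big]_{\text{SU-CD}} \;=\; \tfrac{1}{D}\, g^2 + \tfrac{D-1}{D}\cdot 0 \;=\; \tfrac{g^2}{D}, \qquad (\nabla_{e} F)^2\big|_{\text{SGS-CD}} \;=\; g^2, $$

so the expected per-iteration progress differs by exactly the factor $D$.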

Note that this example requires the gradients of all coordinates to be independent, which does not hold in the decentralized optimization setting: according to eq. (6), for $\nabla_{ij} F$ to be zero the two terms on the right hand side of (6) must be equal. But unless this equality holds for all edges (i.e., unless the minimum has been attained), these terms keep changing between iterations, and the edge gradients will generally differ from zero. Thus, the gains of SGS-CD in this setting may not attain the upper bound.

Nevertheless, when it comes to parallel distributed setups, the coordinates are not necessarily coupled as in the decentralized case, and thus the speedup of SGS-CD is still achievable, as shown in our simulations below.

4.3 Case d > 1

To extend the proofs above to $d > 1$, the block arrays $A \otimes I_d$ and $u_e \otimes I_d$ should be used instead of $A$ and $u_e$, and the selector matrices should be redefined in the same way. Then, all the operations that in the proofs above are applied per entry (scalar coordinate) of the vector $\lambda$ should now be applied to the magnitude (norm) of each $d$-dimensional vector coordinate of $\lambda$. Also, since each coordinate is now a vector, the GS rule selects the edge whose vector-valued directional gradient has the largest norm.
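A minimal sketch of these modifications (illustrative layout and names; we assume the $d$ entries of each vector coordinate $\lambda_e$ are stored contiguously):

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]          # 3-node clique, as in Figure 1
n, E, d = 3, len(edges), 2                # d > 1: each node holds a vector in R^d
A = np.zeros((n, E))
for e, (i, j) in enumerate(edges):
    A[i, e], A[j, e] = 1.0, -1.0
A_block = np.kron(A, np.eye(d))           # block incidence matrix, shape (n*d, E*d)

def gs_pick_block(grad_dual, incident_edges):
    # GS rule for d > 1: compare the norms of the d-dimensional edge gradients.
    norms = [np.linalg.norm(grad_dual[e * d:(e + 1) * d]) for e in incident_edges]
    return incident_edges[int(np.argmax(norms))]

grad_dual = np.arange(E * d, dtype=float)  # placeholder gradient, for illustration only
print(gs_pick_block(grad_dual, [0, 2]))    # edges incident to node 0
```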

5 Numerical Results

Figure 2: Comparison of the convergence rates of SU-CD and SGS-CD in two settings: decentralized optimization over a network (left plots), and parallel distributed computation with a parameter server (right plots). Given the linear suboptimality reduction, the thick transparent lines show the part of the curves used to estimate the rates of SU-CD and SGS-CD, respectively. The ratio of the estimated rates increases notably with the set size, in agreement with the theory.

Figure 2 shows the remarkable speedup of SGS-CD with respect to SU-CD in both the decentralized (left plots) and the parallel distributed (right plots) settings.

For the decentralized setting we created two regular graphs of nodes and degrees and 12, respectively. The local functions were with , and if modulo and otherwise, where is the index of each node. We chose these so that each node would have (approximately) one neighbor out of the with whom the coordinate gradient would have maximum disagreement, thus maximizing the chances of observing differences between SU-CD and SGS-CD.

For the parallel distributed setting, we created a problem that was separable per coordinate, and we tried to recreate the conditions described in the previous section to approximate the maximum gain. We chose a quadratic objective with a diagonal matrix whose non-zero entries were randomly sampled. We then created sets of coordinates such that each coordinate belonged to exactly two sets, similarly to the parallel distributed scenario with a parameter server where each worker has access to only a subset of the coordinates. We simulated two different distributions of the coordinates into sets, with two different set sizes. Following the reasoning in the previous section, we set the initial values of all but one coordinate in each set close to the optimal value, and the one remaining far away from it.
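Since the exact constants are not reproduced above, the following is only a hypothetical sketch of how such a per-coordinate-separable instance with overlapping coordinate sets could be generated (all sizes, distributions, and names are placeholders, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Separable quadratic: f(theta) = 0.5 * theta^T Q theta with diagonal Q, so every
# coordinate evolves independently of the others.
num_coords, num_sets = 24, 8
q = rng.uniform(0.5, 2.0, size=num_coords)        # diagonal of Q
theta_opt = np.zeros(num_coords)                   # optimum of this toy objective

# Assign each coordinate to exactly two worker sets, mimicking the structure of the
# decentralized case (each edge/coordinate shared by two nodes/workers).
sets = [[] for _ in range(num_sets)]
for k in range(num_coords):
    w1, w2 = rng.choice(num_sets, size=2, replace=False)
    sets[w1].append(k)
    sets[w2].append(k)

# "Hard" initialization: most coordinates start near the optimum, one per set far away,
# so the GS rule has a clearly better coordinate to pick inside every set.
theta = rng.normal(scale=1e-3, size=num_coords)
for s in sets:
    if s:
        theta[s[0]] = 10.0
```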

The plots in Figure 2 show the steep rate gain of SGS-CD with respect to SU-CD as