Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: A Joint Gradient Estimation and Tracking Approach

10/13/2019
by   Haoran Sun, et al.

Many modern large-scale machine learning problems benefit from decentralized and stochastic optimization. Recent works have shown that utilizing both decentralized computing and local stochastic gradient estimates can outperform state-of-the-art centralized algorithms in applications involving highly non-convex problems, such as training deep neural networks. In this work, we propose a decentralized stochastic algorithm to deal with certain smooth non-convex problems where there are $m$ nodes in the system, and each node has a large number of samples (denoted as $n$). Unlike the majority of existing decentralized learning algorithms for either stochastic or finite-sum problems, we focus on simultaneously reducing the total communication rounds among the nodes and the number of local data samples accessed. In particular, we propose an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) and gradient tracking (which tracks the global full gradient using local estimates). We show that, to achieve an $\epsilon$-stationary solution of the deterministic finite-sum problem, the proposed algorithm achieves an $\mathcal{O}(mn^{1/2}\epsilon^{-1})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity. These bounds significantly improve upon the best existing bounds of $\mathcal{O}(mn\epsilon^{-1})$ and $\mathcal{O}(\epsilon^{-1})$, respectively. Similarly, for online problems, the proposed method achieves an $\mathcal{O}(m\epsilon^{-3/2})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity, while the best existing bounds are $\mathcal{O}(m\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-2})$, respectively.


1 Introduction

For modern large-scale information processing problems, performing centralized computation at a single computing node can require a massive amount of computational and memory resources. Recent advances in high-performance computing platforms enable us to utilize distributed resources to significantly improve computational efficiency [1]. These techniques have become essential for many large-scale tasks such as training machine learning models. Modern decentralized optimization shows that partitioning a large-scale dataset across multiple computing nodes can significantly reduce the amount of gradient evaluation at each node without significant loss of optimality [2]. Compared to the typical parameter-server type distributed system with a fusion center, decentralized optimization has unique advantages in preserving data privacy, enhancing network robustness, and improving computational efficiency [2, 3, 4, 5]. Furthermore, in many emerging applications such as collaborative filtering [6], federated learning [7], distributed beamforming [8], and dictionary learning [9], the data is naturally collected in a decentralized setting, and it is not possible to transfer the distributed data to a central location. Therefore, decentralized computation has sparked considerable interest in both academia and industry.

Motivated by these facts, in this paper we consider the following optimization problem:

$$\min_{x\in\mathbb{R}^d} \; f(x) := \frac{1}{m}\sum_{i=1}^{m} f_i(x), \tag{1}$$

where $f_i : \mathbb{R}^d \to \mathbb{R}$ denotes a loss function which is smooth (possibly non-convex), and $m$ is the total number of such functions. We consider the scenario where each node $i$ can only access its local function $f_i$, and can communicate with its neighbors via an undirected and unweighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. In this work, we consider two typical representations of the local cost functions:

  1. Finite-Sum Setting: Each $f_i$ is defined as the average cost over the local samples, that is:

    $$f_i(x) := \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(x), \tag{2}$$

    where $n$ is the total number of local samples at node $i$, and $f_{i,j}$ denotes the cost of the $j$th data sample at the $i$th node.

  2. Online Setting: Each $f_i$ is defined as the following expected cost:

    $$f_i(x) := \mathbb{E}_{\xi\sim\mathcal{D}_i}\left[f_i(x;\xi)\right], \tag{3}$$

    where $\mathcal{D}_i$ denotes the data distribution at node $i$.

To explicitly model the communication pattern, it is conventional to reformulate problem (1) as the following consensus problem, by introducing local variables $x_1, \ldots, x_m \in \mathbb{R}^d$ and using the long vector $\mathbf{x} := [x_1; \ldots; x_m] \in \mathbb{R}^{md}$ to stack all the local variables:

$$\min_{\mathbf{x}\in\mathbb{R}^{md}} \; \frac{1}{m}\sum_{i=1}^{m} f_i(x_i), \quad \text{s.t. } x_i = x_j, \;\; \forall\, (i,j)\in\mathcal{E}. \tag{4}$$

This way, the loss functions $f_i$'s become separable.

For the above decentralized non-convex problem (4), one essential task is to find an $\epsilon$-stationary solution $\mathbf{x}^*$ such that

$$h(\mathbf{x}^*) := \left\|\nabla f(\bar{x}^*)\right\|^2 + \frac{1}{m}\sum_{i=1}^{m}\left\|x_i^* - \bar{x}^*\right\|^2 \le \epsilon, \quad \text{where } \bar{x}^* := \frac{1}{m}\sum_{i=1}^{m} x_i^*. \tag{5}$$

Note that the above solution quality measure encodes both the size of the local gradient (the standard measure for centralized non-convex problems) and the consensus error (the standard measure for decentralized optimization). It is easy to verify that when the left-hand side of (5) goes to zero, a stationary solution of problem (1) is obtained.

Many modern decentralized methods can be applied to obtain the above-mentioned $\epsilon$-stationary solution for problem (4). In the finite-sum setting (2), deterministic decentralized methods such as Primal-Dual, NEXT, SONATA, and xFILTER [10, 11, 12, 13], which process the local dataset in full batches, typically achieve an $\mathcal{O}(\epsilon^{-1})$ communication complexity (i.e., $\mathcal{O}(\epsilon^{-1})$ rounds of message exchanges are required to obtain an $\epsilon$-stationary solution) and an $\mathcal{O}(mn\epsilon^{-1})$ sample complexity (i.e., that many evaluations of local sample gradients are required). Meanwhile, stochastic methods such as PSGD, D$^2$, stochastic gradient push, and GNSD [2, 14, 15, 16], which randomly pick subsets of local samples, achieve $\mathcal{O}(m\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-2})$ communication complexity. These bounds indicate that, when the sample size is large (i.e., $n \ge \mathcal{O}(\epsilon^{-1})$), the stochastic methods are preferred for their lower sample complexity, but the deterministic methods still achieve lower communication complexity. On the other hand, in the online setting (3), only stochastic methods can be applied, and those methods again achieve $\mathcal{O}(m\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-2})$ communication complexity [14]. (Here, for the finite-sum problem (2), "sample complexity" refers to the total number of samples accessed by the algorithm to compute sample gradients $\nabla f_{i,j}$; if the same sample is accessed $p$ times and the evaluated gradients are different each time, then the sample complexity increases by $p$.)
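To make the trade-off concrete (with illustrative numbers, not taken from the paper): with $n = 10^6$ local samples and target accuracy $\epsilon = 10^{-3}$, the deterministic bounds above give roughly $m \times 10^9$ sample-gradient evaluations but only about $10^3$ communication rounds, while the stochastic bounds give roughly $m \times 10^6$ samples but about $10^6$ rounds. Neither family wins on both axes, which is precisely the gap this paper targets.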

1.1 Related Works

1.1.1 Decentralized Optimization

Decentralized optimization has been extensively studied for convex problems and can be traced back to the 1980s [17]. Many popular algorithms, including decentralized gradient descent (DGD) [3, 5], distributed dual averaging [18], EXTRA [19], the distributed augmented Lagrangian method [20], adaptive diffusion [4, 21], and the alternating direction method of multipliers (ADMM) [22, 1, 23, 24], have been studied in the literature. We refer the readers to the recent survey [25] and the references therein for a complete review. Recent works also include the study of optimal convergence rates with respect to the network dependency for strongly convex [26] and convex [27] problems. When the problem becomes non-convex, many algorithms such as primal-dual based methods [28, 10], gradient tracking based methods [11, 29], and non-convex extensions of DGD [30] have been proposed, and their iteration and communication complexities have been established. Recently, an algorithm that is optimal with respect to the network dependency has also been proposed in [13], with $\mathcal{O}(\epsilon^{-1})$ computation and $\mathcal{O}(\epsilon^{-1}/\sqrt{\xi})$ communication complexity, where $\xi$ denotes the spectral gap of the communication graph $\mathcal{G}$. Note that the above algorithms all require full gradient evaluations per iteration, so when directly applied to solve problems where each $f_i$ takes the form in (2), they all require $\mathcal{O}(mn\epsilon^{-1})$ local data samples.

However, because each iteration of these algorithms requires a full gradient evaluation, the above batch methods can be computationally very demanding. One natural solution is to use stochastic gradients to approximate the true gradient. Stochastic decentralized non-convex methods can be traced back to [31, 32], with recent advances including DSGD [33], PSGD [2], D$^2$ [14], GNSD [16], and stochastic gradient push [15]. However, the large variance of the stochastic gradient estimator and the use of diminishing stepsizes slow down the convergence, resulting in at least $\mathcal{O}(\epsilon^{-2})$ sample and communication costs.

Recent works also include studies on developing distributed algorithms with second-order guarantees [34, 35, 36, 37]. This is an interesting research direction that further showcases the strength of decentralized algorithms. However, to limit the scope of this paper, we focus only on the convergence of decentralized methods to first-order solutions (as defined in (5)).

1.1.2 Variance Reduction

Consider the following non-convex finite-sum problem: $\min_x g(x) := \frac{1}{N}\sum_{j=1}^{N} g_j(x)$. If we assume that $g$ has Lipschitz gradient and directly apply the vanilla gradient descent (GD) method to $g$, then $\mathcal{O}(N\epsilon^{-1})$ gradient evaluations are required to reach $\|\nabla g(x)\|^2 \le \epsilon$ [38]. When $N$ is large, it is usually preferable to process a subset of the data at a time. In this case, stochastic gradient descent (SGD) can be used to achieve an $\mathcal{O}(\epsilon^{-2})$ convergence rate [39].

To bridge the gap between GD and SGD, many variance-reduced gradient estimators have been proposed, including SAGA [40] and SVRG [41]. The idea is to reduce the variance of the stochastic gradient estimator and thereby substantially improve the convergence rate. In particular, the above approaches have been shown to achieve sample complexities of $\mathcal{O}(N^{2/3}\epsilon^{-1})$ for finite-sum problems [42, 43, 44] and $\mathcal{O}(\epsilon^{-5/3})$ for online problems [44]. Recent works further improve these gradient estimators, achieving an $\mathcal{O}(N^{1/2}\epsilon^{-1})$ sample complexity for finite-sum problems [45, 46, 47, 48] and an $\mathcal{O}(\epsilon^{-3/2})$ sample complexity for online problems [46, 47]; the $\mathcal{O}(N^{1/2}\epsilon^{-1})$ sample complexity is shown to be optimal when $N \le \mathcal{O}(\epsilon^{-2})$ [46]. However, it is important to be careful when comparing the various complexity bounds. One key assumption that enables the variance-reduced algorithms to achieve improved complexity with respect to $N$ is that each component function $g_j$ has Lipschitz gradient (therefore the components are "similar" in a certain sense), while vanilla GD only requires that the sum of the component functions has Lipschitz gradient.
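To make the construction concrete, below is a minimal Python sketch of the SVRG-style estimator referenced above; the function names and sampling scheme are illustrative assumptions, not the exact constructions of [40, 41]:

```python
import numpy as np

def svrg_estimator(grad_fn, x, x_snap, full_grad_snap, batch):
    """SVRG-style variance-reduced gradient estimator (illustrative sketch).

    grad_fn(x, j):  gradient of the j-th component function at x.
    x_snap:         snapshot point at which the full gradient was computed.
    full_grad_snap: full gradient at x_snap.
    batch:          indices of the sampled component functions.
    """
    g = np.mean([grad_fn(x, j) for j in batch], axis=0)
    g_snap = np.mean([grad_fn(x_snap, j) for j in batch], axis=0)
    # Conditionally unbiased: E[g - g_snap] = grad g(x) - grad g(x_snap),
    # so the estimator has mean grad g(x); its variance shrinks as x -> x_snap
    # when each component has Lipschitz gradient.
    return g - g_snap + full_grad_snap
```

The recursive (SARAH/SPIDER-type) estimators mentioned above replace the fixed snapshot by the previous iterate; this is exactly the form that reappears in the D-GET update (14) below.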

1.1.3 Decentralized Variance Reduction

Variance-reduced decentralized optimization has been extensively studied for convex problems. The DSA proposed in [49] combines the algorithm design ideas of EXTRA [19] and SAGA [40], and achieves the first expected linear convergence for decentralized stochastic optimization. Recent works also include DSBA [50], diffusion-AVRG [51], ADFS [52], SAL-Edge [53], GT-SAGA [54], and Network-DANE [55]. In particular, DSBA [50] introduces a monotone operator formulation to reduce the dependence on the problem condition number compared to DSA [49]. Diffusion-AVRG combines exact diffusion [56] with AVRG [57], and extends the results to scenarios in which the data are unevenly distributed across nodes. ADFS [52] further uses randomized pairwise communication to achieve optimal network scaling. The work [53] combines an augmented Lagrangian (AL) based method with SAGA [40] to allow flexible mixing weight selections. GT-SAGA [54] improves the joint dependence on the condition number and the number of samples per node. Network-DANE [55] studies a Newton-type method and establishes linear convergence for quadratic losses. However, when the problem becomes non-convex, to the best of our knowledge, no algorithms with provable guarantees of this type are available.

1.2 Our Contribution

Compared with the majority of the existing decentralized learning algorithms for either stochastic or deterministic problems, the focus of this work is on reducing both the total communication and sample complexities. Specifically, we propose a decentralized gradient estimation and tracking (D-GET) approach, which uses a subset of samples to estimate the local gradients (by utilizing modern variance reduction techniques [46, 58]), while using the differences of past local gradients to track the global gradients (by leveraging the idea of decentralized gradient tracking [11, 59]). Remarkably, the proposed approach enjoys a sample complexity of $\mathcal{O}(mn^{1/2}\epsilon^{-1})$ and a communication complexity of $\mathcal{O}(\epsilon^{-1})$ for the finite-sum problem (2), which outperforms all existing decentralized methods. (As mentioned before, deterministic batch gradient based methods such as xFILTER, Prox-PDA, NEXT, and EXTRA achieve an $\mathcal{O}(mn\epsilon^{-1})$ sample complexity; however, one cannot directly compare those bounds with what can be achieved by sample-based, variance-reduced methods, since the assumptions on the Lipschitz gradients are slightly different.) The sample complexity is only a factor of $\mathcal{O}(m^{1/2})$ worse than the known sample complexity lower bound for the centralized problem [46], and the communication complexity matches the existing communication lower bound [13] for decentralized non-convex optimization (in terms of the dependency on $\epsilon$). Furthermore, the proposed approach is also able to achieve an $\mathcal{O}(m\epsilon^{-3/2})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity for the online problem (3), reducing the best existing bounds (such as those obtained in [14, 16]) by factors of $\mathcal{O}(\epsilon^{-1/2})$ and $\mathcal{O}(\epsilon^{-1})$, respectively. We illustrate the main results of this work in Figure 1, and compare the gradient and communication costs of state-of-the-art decentralized non-convex optimization approaches in Table 1. (For deterministic batch algorithms such as DGD, NEXT, Prox-PDA, and xFILTER, the bounds are obtained directly by multiplying their respective convergence rates by $mn$, since when directly applied to solve finite-sum problems, each iteration requires a full gradient evaluation.) Note that in Table 1, by constant stepsize we mean that the stepsize depends neither on the target accuracy $\epsilon$ nor on the iteration number.

Figure 1: Comparison of the sample complexities (panel (a)) and communication complexities (panel (b)) for a number of decentralized methods. Existing deterministic methods enjoy lower sample complexity at smaller sample sizes, but their sample complexity scales linearly as the number of samples increases. Stochastic methods generally suffer from high communication complexity. The proposed D-GET bridges the gap between existing deterministic and stochastic methods, and achieves the best known sample and communication complexities. Note that online methods can also be applied to finite-sum problems, so the actual sample complexity of D-GET is the minimum of the rates for the two cases.
| Algorithm | Constant Stepsize | Finite-Sum | Online | Communication |
|---|---|---|---|---|
| DGD [30] | no | $\mathcal{O}(mn\epsilon^{-2})$ | - | $\mathcal{O}(\epsilon^{-2})$ |
| SONATA [12] | yes | $\mathcal{O}(mn\epsilon^{-1})$ | - | $\mathcal{O}(\epsilon^{-1})$ |
| Prox-PDA [10] | yes | $\mathcal{O}(mn\epsilon^{-1})$ | - | $\mathcal{O}(\epsilon^{-1})$ |
| xFILTER [13] | yes | $\mathcal{O}(mn\epsilon^{-1})$ | - | $\mathcal{O}(\epsilon^{-1})$ |
| PSGD [2] | no | $\mathcal{O}(m\epsilon^{-2})$ | $\mathcal{O}(m\epsilon^{-2})$ | $\mathcal{O}(\epsilon^{-2})$ |
| D$^2$ [14] | no | $\mathcal{O}(m\epsilon^{-2})$ | $\mathcal{O}(m\epsilon^{-2})$ | $\mathcal{O}(\epsilon^{-2})$ |
| GNSD [16] | yes | $\mathcal{O}(m\epsilon^{-2})$ | $\mathcal{O}(m\epsilon^{-2})$ | $\mathcal{O}(\epsilon^{-2})$ |
| D-GET (this work) | yes | $\mathcal{O}(mn^{1/2}\epsilon^{-1})$ | $\mathcal{O}(m\epsilon^{-3/2})$ | $\mathcal{O}(\epsilon^{-1})$ |
| Lower Bound [46, 13] | - | $\mathcal{O}(\sqrt{mn}\,\epsilon^{-1})$ | - | $\mathcal{O}(\epsilon^{-1})$ |

Table 1: Comparison of algorithms for decentralized non-convex optimization

2 The Finite Sum Setting

In this section, we consider the non-convex decentralized optimization problem (4) with a finite number of local samples as defined in (2), restated below:

$$\min_{\mathbf{x}\in\mathbb{R}^{md}} \; \frac{1}{m}\sum_{i=1}^{m} f_i(x_i), \quad f_i(x_i) := \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(x_i), \quad \text{s.t. } x_i = x_j, \;\; \forall\, (i,j)\in\mathcal{E}. \tag{P1}$$

We make the following standard assumptions on the above problem:

Assumption 1.

Each component function $f_{i,j}$ has Lipschitz continuous gradient with constant $L > 0$:

$$\|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\| \le L\|x - y\|, \quad \forall\, x, y \in \mathbb{R}^d, \;\; \forall\, i, j, \tag{6}$$

which also implies

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \tag{7a}$$
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \tag{7b}$$
$$f(x) \le f(y) + \langle\nabla f(y),\, x - y\rangle + \frac{L}{2}\|x - y\|^2. \tag{7c}$$
Assumption 2.

The mixing matrix $W \in \mathbb{R}^{m\times m}$ is symmetric and satisfies the following:

$$W\mathbf{1} = \mathbf{1}, \qquad \eta := |\lambda_2(W)| < 1, \tag{8}$$

where $\lambda_2(W)$ denotes the second largest (in magnitude) eigenvalue of $W$.

Note that many choices of mixing matrices satisfy the above condition. Here we give three commonly used mixing matrices [60, 61], where $d_i$ denotes the degree of node $i$, and $d_{\max} := \max_i d_i$; a short Python sketch after this list constructs the first of these and verifies (8).

  • Metropolis-Hastings weight:

    $$W_{ij} = \begin{cases} \dfrac{1}{1 + \max\{d_i, d_j\}}, & \text{if } (i,j)\in\mathcal{E},\\[1mm] 1 - \sum_{k\neq i} W_{ik}, & \text{if } i = j,\\[1mm] 0, & \text{otherwise}; \end{cases} \tag{9}$$

  • Maximum-degree weight:

    $$W_{ij} = \begin{cases} \dfrac{1}{1 + d_{\max}}, & \text{if } (i,j)\in\mathcal{E},\\[1mm] 1 - \dfrac{d_i}{1 + d_{\max}}, & \text{if } i = j,\\[1mm] 0, & \text{otherwise}; \end{cases} \tag{10}$$

  • Laplacian weight:

    $$W = I - \frac{L_{\mathcal{G}}}{\tau}, \tag{11}$$

    where $L_{\mathcal{G}}$ denotes the graph Laplacian matrix. If $\lambda_{\max}(L_{\mathcal{G}})$ and $\lambda_{m-1}(L_{\mathcal{G}})$ denote its largest and second smallest eigenvalues, then one common choice of the scaling parameter is $\tau = \frac{1}{2}\big(\lambda_{\max}(L_{\mathcal{G}}) + \lambda_{m-1}(L_{\mathcal{G}})\big)$.
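As promised above, here is a minimal Python sketch that constructs the Metropolis-Hastings weights (9) for a given graph and checks the mixing condition (8); the 5-node ring example is an illustrative assumption:

```python
import numpy as np

def metropolis_hastings_weights(adj):
    """Metropolis-Hastings mixing matrix (9) for an undirected graph.

    adj: (m, m) symmetric 0/1 adjacency matrix with zero diagonal.
    Returns a symmetric, doubly stochastic W supported on edges and the diagonal.
    """
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # diagonal absorbs the remaining mass
    return W

# Example: a 5-node ring. For a connected graph, eta < 1 as required by (8).
adj = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
W = metropolis_hastings_weights(adj)
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
eta = eigs[1]  # second largest eigenvalue magnitude, approx. 0.54 here
```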

Next, let us formally define our communication and sample complexity measures.

Definition 1.

(Sample Complexity) The Incremental First-order Oracle (IFO) is an operation in which a node $i$ takes a data sample indexed by $j$ and a point $x \in \mathbb{R}^d$, and returns the pair $\big(f_{i,j}(x), \nabla f_{i,j}(x)\big)$. The sample complexity is defined as the total number of IFO calls required across the entire network to achieve an $\epsilon$-stationary solution as defined in (5).

Definition 2.

(Communication Complexity) In one round of communication, each node is allowed to broadcast and receive one $d$-dimensional vector to and from its neighbors, respectively. The communication complexity is then defined as the total number of communication rounds required to achieve an $\epsilon$-stationary solution as defined in (5).

2.1 Algorithm Design

In this section, we introduce the proposed algorithm, named Decentralized Gradient Estimation and Tracking (D-GET), for solving problem (P1). To motivate the design, observe from the discussion in Section 1.1 that existing deterministic decentralized methods typically suffer from high sample complexity, while decentralized stochastic algorithms suffer from high communication cost. This phenomenon inspires us to find a solution in between, one that simultaneously reduces both the sample and the communication costs.

One natural solution is to incorporate modern variance reduction techniques into classical decentralized methods. Our idea is to use a variance-reduced gradient estimator to track the full gradient of the entire problem, and then perform a decentralized gradient descent update. The gradient tracking step gives fast convergence with a constant stepsize, while the variance reduction significantly reduces the variation of the estimated gradient.

Unfortunately, the decentralized methods and variance reduction techniques cannot be directly combined. Compared with the existing decentralized and variance reduction techniques in the literature, the key challenges in the algorithm design and analysis are given below:

  • Due to the decentralized nature of the problem, none of the nodes can access the full gradient of the original objective function. The (possibly uncontrollable) network consensus error always exists during the whole process of implementing the decentralized algorithm. Therefore, it is not clear that the existing variance reduction methods could be applied at each individual node effectively, since all of those require accurate global gradient evaluation from time to time.

  • It is then natural to integrate some procedure that can approximate the global gradient. For example, one straightforward way to perform gradient tracking is to introduce a new auxiliary variable $y_i$ as follows [11, 16], updated using only local estimated gradients and neighbors' parameters:

    $$y_i^{r+1} = \sum_{j=1}^{m} w_{ij}\, y_j^r + \nabla f_i(x_i^{r+1}; \xi_i^{r+1}) - \nabla f_i(x_i^{r}; \xi_i^{r}), \tag{12}$$

    where $\xi_i^{r+1}$ and $\xi_i^{r}$ are the samples selected at the $(r+1)$th and $r$th iterations, respectively. If the tracked $y_i$'s were used in a (local) variance reduction procedure, there would be at least two main issues with reducing the variance of the tracked gradient: i) at the early stage of the decentralized algorithm, the consensus/tracking error may dominate the variance of the tracked gradient, since the information about the full gradient has not yet been sufficiently propagated through the network; consequently, performing variance reduction on the $y_i$'s will not improve the quality of the full gradient estimate; ii) even if there were no consensus error, since only stochastic gradients $\nabla f_i(x_i^r; \xi_i^r)$ are used in the tracking, the $y_i$'s themselves would have high variance, so such (possibly low-quality) full gradient estimates may not be compatible with the variance reduction methods developed in the current literature (which often require full gradient evaluations from time to time).

The challenges discussed above suggest that it is non-trivial to design an algorithm that can be implemented in a fully decentralized manner while still matching the superior sample complexity and convergence rate of state-of-the-art variance reduction methods. In this work, we propose an algorithm that uses a novel decentralized gradient estimation and tracking strategy, together with a number of other design choices, to address the issues raised above.

To introduce the algorithm, let us first define two auxiliary local variables $v_i$ and $y_i$, where $v_i$ is designed to estimate the local full-batch gradient $\nabla f_i(x_i)$ using only sample gradients $\nabla f_{i,j}(x_i)$, while $y_i$ is designed to track the global average gradient $\frac{1}{m}\sum_{i=1}^{m}\nabla f_i(x_i)$ by utilizing the $v_i$'s. After the local and global gradient estimates are obtained, the algorithm performs a local update along the direction of $y_i$; see the main steps below.

  • Local update using the estimated gradient ($x$ update): Each node $i$ first combines its previous iterate with those of its local neighbors (using the $i$th row of the weight matrix $W$), then moves along the gradient estimate $y_i^r$ with stepsize $\alpha > 0$:

    $$x_i^{r+1} = \sum_{j=1}^{m} w_{ij}\, x_j^r - \alpha y_i^r. \tag{13}$$

  • Estimate local gradients ($v$ update): Each node $i$ either directly calculates the full local gradient $\nabla f_i(x_i^{r+1})$, or estimates it via a recursive estimator using $|S|$ random samples, depending on the iteration index:

    $$v_i^{r+1} = \begin{cases} \dfrac{1}{n}\sum_{j=1}^{n}\nabla f_{i,j}(x_i^{r+1}), & \text{if mod}(r+1, q) = 0,\\[2mm] \dfrac{1}{|S|}\sum_{j\in S}\big(\nabla f_{i,j}(x_i^{r+1}) - \nabla f_{i,j}(x_i^{r})\big) + v_i^{r}, & \text{otherwise}, \end{cases} \tag{14}$$

    where $q$ is the interval at which the local full gradient is evaluated.

  • Track global gradients ($y$ update): Each node $i$ combines its previous local estimates with those of its local neighbors, then refreshes the estimate with the new information $v_i^{r+1} - v_i^{r}$:

    $$y_i^{r+1} = \sum_{j=1}^{m} w_{ij}\, y_j^r + v_i^{r+1} - v_i^{r}. \tag{15}$$

Below we summarize the proposed algorithm in a more compact form, where $\mathbf{x}$, $\mathbf{v}$, $\mathbf{y}$, and $\nabla\mathbf{f}(\mathbf{x})$ denote the concatenations of the $x_i$, $v_i$, $y_i$, and $\nabla f_i(x_i)$ across all nodes.

Input: $\mathbf{x}^0$; $\mathbf{v}^0 = \nabla\mathbf{f}(\mathbf{x}^0)$; $\mathbf{y}^0 = \mathbf{v}^0$; stepsize $\alpha$; minibatch size $|S|$; interval $q$
for $r = 0, 1, \ldots, T-1$ do
      $\mathbf{x}^{r+1} = W\mathbf{x}^r - \alpha\mathbf{y}^r$  ▷ local communication & update
      if mod$(r+1, q) = 0$ then
          Calculate the full gradient $v_i^{r+1} = \frac{1}{n}\sum_{j=1}^{n}\nabla f_{i,j}(x_i^{r+1}), \;\forall i$  ▷ local gradient computation
      else
          Each node $i$ draws a minibatch $S$ of $|S|$ samples from $\{1, \ldots, n\}$ with replacement
          $v_i^{r+1} = \frac{1}{|S|}\sum_{j\in S}\big(\nabla f_{i,j}(x_i^{r+1}) - \nabla f_{i,j}(x_i^{r})\big) + v_i^{r}, \;\forall i$  ▷ local gradient estimation
      end if
      $\mathbf{y}^{r+1} = W\mathbf{y}^r + \mathbf{v}^{r+1} - \mathbf{v}^{r}$  ▷ global gradient tracking
end for
Algorithm 1: D-GET algorithm for the finite-sum problem (P1)
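For concreteness, the following is a minimal NumPy simulation of Algorithm 1 on a single machine. The skeleton only assumes access to per-sample gradient oracles; the function names, signatures, and defaults are illustrative assumptions (with $q = |S| = \sqrt{n}$ chosen as suggested by the analysis in Section 2.2), not the paper's reference implementation:

```python
import numpy as np

def d_get(grad_fn, full_grad_fn, W, x0, n, alpha=0.01, T=1000, q=None, S=None):
    """Minimal single-machine simulation of D-GET (Algorithm 1).

    grad_fn(i, j, x):   gradient of sample j at node i, evaluated at x (shape (d,)).
    full_grad_fn(i, x): full local gradient (1/n) * sum_j grad_fn(i, j, x).
    W:                  (m, m) mixing matrix satisfying Assumption 2.
    x0:                 (m, d) array of initial local iterates.
    """
    m, d = x0.shape
    q = q or int(np.sqrt(n))   # full-gradient interval
    S = S or int(np.sqrt(n))   # minibatch size |S|
    x = x0.copy()
    v = np.stack([full_grad_fn(i, x[i]) for i in range(m)])  # v^0 = local full gradients
    y = v.copy()                                             # y^0 = v^0
    for r in range(T):
        x_new = W @ x - alpha * y                # (13): local communication & update
        if (r + 1) % q == 0:
            # outer iteration, top branch of (14): recompute full local gradients
            v_new = np.stack([full_grad_fn(i, x_new[i]) for i in range(m)])
        else:
            # inner iteration, bottom branch of (14): SARAH-style recursion
            v_new = np.empty_like(v)
            for i in range(m):
                idx = np.random.randint(0, n, size=S)  # |S| samples, with replacement
                v_new[i] = v[i] + np.mean(
                    [grad_fn(i, j, x_new[i]) - grad_fn(i, j, x[i]) for j in idx],
                    axis=0)
        y = W @ y + v_new - v                    # (15): global gradient tracking
        x, v = x_new, v_new
    return x
```

In an actual deployment, the products `W @ x` and `W @ y` are realized by neighbor-to-neighbor exchanges of the $x_i$'s and $y_i$'s, which is why each iteration costs exactly two communication rounds (cf. Remark 2 below).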

Remark 1. The above algorithm can also be interpreted as a "double loop" algorithm, where each outer iteration (i.e., an iteration with mod$(r+1, q) = 0$) is followed by $q - 1$ inner iterations (i.e., those with mod$(r+1, q) \neq 0$). The inner loop estimates the local gradients via stochastic sampling at every iteration, while the outer loop reduces the estimation variance by recalculating the full batch gradient every $q$ iterations. The local communication, $x$ update, and tracking steps are performed at both inner and outer iterations.

Remark 2. We further remark that, in D-GET, the total number of communication rounds is of the same order as the total number of iterations, since only two rounds of communication are performed per iteration: each node $i$ broadcasts its local variables $x_i^r$ and $y_i^r$ to its neighbors, and combines the received $x_j^r$'s and $y_j^r$'s, $j\in\mathcal{N}_i$. On the other hand, the total number of samples used per iteration is either $m|S|$ (when an inner iteration is executed) or $mn$ (when an outer iteration is executed).

Remark 3. Note that our $x$ and $y$ updates are reminiscent of classical gradient tracking methods [11, 16], while the $v$ update takes a form similar to the SARAH/SPIDER estimators [58, 46]. However, it is non-trivial to directly combine gradient tracking and variance reduction, as mentioned at the beginning of Section 2.1. The proposed D-GET uses a number of design choices to address these challenges. For example, two separate vectors $v_i$ and $y_i$ are used to estimate the local and global gradients, respectively, so that the local gradient estimates do not depend on the (potentially inaccurate) globally tracked gradients; and to reduce the variance in $y_i$, we periodically use the full local gradient to perform the tracking. Nevertheless, the key challenge in the analysis is to properly bound the accumulated errors from the two estimates $v_i$ and $y_i$.

2.2 Convergence Analysis

To facilitate our analysis, we first define the average iterates $\bar{x}^r$, $\bar{y}^r$, and $\bar{v}^r$ among all nodes:

$$\bar{x}^r := \frac{1}{m}\sum_{i=1}^{m} x_i^r, \tag{16a}$$
$$\bar{y}^r := \frac{1}{m}\sum_{i=1}^{m} y_i^r, \tag{16b}$$
$$\bar{v}^r := \frac{1}{m}\sum_{i=1}^{m} v_i^r. \tag{16c}$$

Note that here we use $r$ to denote the overall iteration number. By the double-loop nature of the algorithm, we can define $n_r$ as the total number of outer iterations performed up to iteration $r$.

Before we formally conduct the analysis, we note three simple facts about Algorithm 1.

First, according to (13) and the definition (16a), since $W$ is doubly stochastic, the update rule of the average iterates can be expressed as

$$\bar{x}^{r+1} = \bar{x}^r - \alpha\bar{y}^r. \tag{17}$$

Second, if iteration $r+1$ satisfies mod$(r+1, q) = 0$ (that is, when an outer iteration is executed), from (14) and (15) it is easy to check that the following relations hold (given $\bar{y}^0 = \bar{v}^0$):

$$\bar{v}^{r+1} = \frac{1}{m}\sum_{i=1}^{m}\nabla f_i(x_i^{r+1}), \tag{18}$$
$$\bar{y}^{r+1} = \bar{v}^{r+1}. \tag{19}$$

Third, if mod$(r+1, q) \neq 0$, we have the following relations:

$$\bar{v}^{r+1} = \bar{v}^r + \frac{1}{m}\sum_{i=1}^{m}\frac{1}{|S|}\sum_{j\in S}\big(\nabla f_{i,j}(x_i^{r+1}) - \nabla f_{i,j}(x_i^r)\big), \tag{20}$$
$$\bar{y}^{r+1} = \bar{v}^{r+1}. \tag{21}$$
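To see why (19) and (21) hold (a short derivation sketch under the stated initialization $\bar{y}^0 = \bar{v}^0$): averaging the $y$ update (15) over all nodes and using the double stochasticity of $W$ (i.e., $\sum_{i=1}^{m} w_{ij} = 1$) gives

$$\bar{y}^{r+1} = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\, y_j^r + \bar{v}^{r+1} - \bar{v}^r = \bar{y}^r + \bar{v}^{r+1} - \bar{v}^r,$$

so $\bar{y}^{r+1} - \bar{v}^{r+1} = \bar{y}^r - \bar{v}^r = \cdots = \bar{y}^0 - \bar{v}^0 = 0$. That is, the average of the local trackers always equals the average of the local gradient estimates, at both inner and outer iterations. The same averaging argument applied to (13) yields (17).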

Next, we outline the proof steps of the convergence rate analysis.

Step 1. We first show that the variances of our local and global gradient estimators can be bounded via the successive differences of the iterates. The bounds given below are tighter than those in the classical analysis of decentralized stochastic methods, which assume the variance is bounded by a universal constant [14, 2, 33]. This is an important step toward lower sample/communication complexity, since we later show that the right-hand side (RHS) of our bounds shrinks as the iterations progress.

Lemma 1.

(Bounded Variance) Under Assumptions 1-2, the sequence generated by the inner loop of Algorithm 1 satisfies the following inequalities (for all $r$):

(22)
(23)
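To see where bounds of this type come from (a sketch, not the paper's exact statement): subtracting $\nabla f_i(x_i^{r+1})$ from the recursive estimator in (14) gives

$$v_i^{r+1} - \nabla f_i(x_i^{r+1}) = \big(v_i^r - \nabla f_i(x_i^r)\big) + \frac{1}{|S|}\sum_{j\in S}\Big[\big(\nabla f_{i,j}(x_i^{r+1}) - \nabla f_{i,j}(x_i^r)\big) - \big(\nabla f_i(x_i^{r+1}) - \nabla f_i(x_i^r)\big)\Big].$$

Since the samples in $S$ are drawn uniformly at random, the bracketed increment has zero mean, and by Assumption 1 its variance is at most $\frac{L^2}{|S|}\|x_i^{r+1} - x_i^r\|^2$. The estimation error is therefore a martingale whose accumulated variance since the last full-gradient reset is controlled by the successive differences of the iterates, which is the structure captured by (22)-(23).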

Step 2. We then study the descent on $\mathbb{E}[f(\bar{x}^r)]$, the expected value of the cost function evaluated at the average iterates.

Lemma 2.

(Descent Lemma) Suppose Assumptions 1-2 hold, and that for any $r$ satisfying mod$(r, q) = 0$, the following holds for some $\sigma \ge 0$:

(24)

Then by applying Algorithm 1, we have the following relation for all $r$:

(25)

A key observation from Lemma 2 is that, on the RHS of (25), besides the negative term in the gradient direction, there are several extra terms (such as the consensus error and the gradient estimation error) that cannot be made negative. Therefore, we need to find some potential function that is strictly decreasing per iteration.

Note that $\sigma$ in (24) comes from the variance of $v_i^r$ in estimating the full local gradient at each outer loop. For Algorithm 1, where we calculate a full batch gradient per outer loop in step (18), it is clear that $\sigma = 0$. However, we still include $\sigma$ in the above result because, later when we analyze the online version (where this variance is no longer zero), we can reuse the result.

Step 3. Next, we introduce a contraction property of the iterates, which combined with the descent estimate in Lemma 2 will be used to construct the potential function.

Lemma 3.

(Iterates Contraction) Under Assumption 2 on the mixing matrix $W$, the iterates of Algorithm 1 satisfy the following contraction property:

(26)
(27)

where $\eta < 1$ is the mixing-rate constant defined in (8).

If we further assume that, for all $r$ satisfying mod$(r, q) = 0$, the following holds for some $\tilde{\sigma} \ge 0$:

(28)

Then we have the following bound on the successive differences of the iterates, for all $r$:

(29)

Again, $\tilde{\sigma}$ comes from the variance of estimating the local gradient in each outer loop, and we have $\tilde{\sigma} = 0$ for Algorithm 1. Note that (26) can also be written as follows:

(30)

One key observation here is that $\eta < 1$ holds by properly choosing the mixing matrix $W$. Therefore, the RHS of the above equation can be made negative by properly selecting the stepsize $\alpha$.

Step 4. This step combines the descent estimates obtained in Steps 2-3 to construct a potential function, using a conic combination of the expected objective value $\mathbb{E}[f(\bar{x}^r)]$, the consensus error, and the gradient tracking error.

Lemma 4.

(Potential Function) Construct a potential function as a conic combination of the three quantities in Step 4. Under Assumptions 1-2, if we pick the stepsize $\alpha$ appropriately and define $\sigma$ and $\tilde{\sigma}$ as in (24) and (28), then the iterates of Algorithm 1 decrease this potential function at every iteration, up to the constants given below:

(31)
(32)
(33)
(34)

Step 5. We can then properly choose the stepsize $\alpha$ to make the constants in (31)-(34) positive. Therefore, our solution quality measure can be expressed via the difference of successive potential function values, and the proof is complete.

Theorem 1.

Consider problem (P1) under Assumptions 1-2, and pick $q = |S| = \sqrt{n}$ together with a suitable constant stepsize $\alpha$. Then Algorithm 1 drives the stationarity measure in (5) to zero at a rate of $\mathcal{O}(1/T)$, where the hidden constant depends on $f(\bar{x}^0) - \underline{f}$ (with $\underline{f}$ denoting a lower bound of $f$), on the second largest eigenvalue $\eta$ of the mixing matrix $W$ from (8), and on the constants defined in (31)-(33).

By directly applying the above result and properly choosing the total number of iterations $T$ based on $\epsilon$, we obtain the following upper bounds on the gradient and communication costs.

Corollary 1.

To achieve an $\epsilon$-stationary solution of problem (P1) (in the sense of (5)) by Algorithm 1, the total number of iterations and communication rounds required are both of order $\mathcal{O}(\epsilon^{-1})$, and the total number of samples evaluated across the network is of order $\mathcal{O}(mn^{1/2}\epsilon^{-1})$.

Proof.

If we pick $T = \mathcal{O}(\epsilon^{-1})$, then we obtain the following from Theorem 1: