1 Introduction
For modern large-scale information processing problems, performing centralized computation at a single computing node can require a massive amount of computational and memory resources. Recent advances in high-performance computing platforms enable us to utilize distributed resources to significantly improve computational efficiency [1]. These techniques have become essential for many large-scale tasks such as training machine learning models. Modern decentralized optimization shows that partitioning a large-scale dataset across multiple computing nodes can significantly reduce the amount of gradient evaluation at each computing node without significant loss of optimality [2]. Compared to the typical parameter-server type distributed system with a fusion center, decentralized optimization has unique advantages in preserving data privacy, enhancing network robustness, and improving computational efficiency [2, 3, 4, 5]. Furthermore, in many emerging applications such as collaborative filtering [6], federated learning [7], distributed beamforming [8] and dictionary learning [9], the data is naturally collected in a decentralized setting, and it is not possible to transfer the distributed data to a central location. Therefore, decentralized computation has sparked considerable interest in both academia and industry.
Motivated by these facts, in this paper we consider the following optimization problem,
(1) $\min_{x\in\mathbb{R}^d} \; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x),$

where $f_i:\mathbb{R}^d\to\mathbb{R}$ denotes the loss function at node $i$, which is smooth (possibly nonconvex), and $n$ is the total number of such functions. We consider the scenario where each node $i$ can only access its local function $f_i$, and can communicate with its neighbors via an undirected and unweighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. In this work, we consider two typical representations of the local cost functions:
Finite-Sum Setting: Each $f_i$ is defined as the average cost of the local samples, that is:
(2) $f_i(x) := \frac{1}{m_i}\sum_{j=1}^{m_i} f_{i,j}(x),$ where $m_i$ is the total number of local samples at node $i$, and $f_{i,j}$ denotes the cost for the $j$-th data sample at the $i$-th node.

Online Setting: Each $f_i$ is defined as the following expected cost:
(3) $f_i(x) := \mathbb{E}_{\xi\sim\mathcal{D}_i}\left[f_i(x;\xi)\right],$ where $\mathcal{D}_i$ denotes the data distribution at node $i$.
To explicitly model the communication pattern, it is conventional to reformulate problem (1) as the following consensus problem, by introducing local variables $x_i\in\mathbb{R}^d$ for all nodes $i$, and using the long vector $\mathbf{x} := [x_1;\dots;x_n]\in\mathbb{R}^{nd}$ to stack all the local variables:
(4) $\min_{\mathbf{x}\in\mathbb{R}^{nd}} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i), \quad \text{s.t.} \; x_i = x_j, \; \forall\, (i,j)\in\mathcal{E}.$
This way, the loss functions $f_i$'s become separable.
For the above decentralized nonconvex problem (4), one essential task is to find an $\epsilon$-stationary solution such that
(5) $\mathbb{E}\left\|\nabla f(\bar{x})\right\|^2 + \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|x_i-\bar{x}\right\|^2 \le \epsilon, \quad \text{where } \bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i.$
Note that the above solution quality measure encodes both the size of the local gradient (as for classical centralized nonconvex problems) and the consensus error (specific to decentralized optimization). It is easy to verify that as $\epsilon$ goes to zero, a stationary solution for problem (1) is obtained.
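As a concrete illustration of the measure in (5), the following sketch (our own code; the name `stationarity_gap` and the toy quadratic are illustrative choices, not part of any referenced work) evaluates the combined gradient-plus-consensus gap:

```python
import numpy as np

def stationarity_gap(X, grad_f):
    """Combined measure from (5): squared norm of the global gradient at the
    network average, plus the average consensus error across the nodes.

    X      : (n, d) array; row i is node i's local variable x_i.
    grad_f : callable returning the global gradient (1/n) * sum_i grad f_i(x).
    """
    x_bar = X.mean(axis=0)                                 # average iterate
    grad_term = np.linalg.norm(grad_f(x_bar)) ** 2         # gradient error
    consensus = np.mean(np.sum((X - x_bar) ** 2, axis=1))  # consensus error
    return grad_term + consensus

# Toy example: f_i(x) = 0.5*||x - c_i||^2, so grad f(x) = x - mean_i(c_i).
rng = np.random.default_rng(0)
C = rng.standard_normal((4, 3))
grad_f = lambda x: x - C.mean(axis=0)
# at the consensual minimizer the measure is exactly zero...
gap_opt = stationarity_gap(np.tile(C.mean(axis=0), (4, 1)), grad_f)
# ...while at non-consensual points it is strictly positive
gap_bad = stationarity_gap(C, grad_f)
```

Driving `stationarity_gap` below $\epsilon$ is exactly the termination criterion implied by (5).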
Many modern decentralized methods can be applied to obtain the above-mentioned $\epsilon$-stationary solution for problem (4). In the finite-sum setting (2), deterministic decentralized methods such as Primal-Dual, NEXT, SONATA, xFILTER [10, 11, 12, 13], which process the local dataset in full batches, typically achieve $\mathcal{O}(1/\epsilon)$ communication complexity (i.e., $\mathcal{O}(1/\epsilon)$ rounds of message exchanges are required to obtain an $\epsilon$-stationary solution), and $\mathcal{O}(\sum_i m_i/\epsilon)$ sample complexity (i.e., that many evaluations of local sample gradients are required).¹ Meanwhile, stochastic methods such as PSGD, D², stochastic gradient push, GNSD [2, 14, 15, 16], which randomly pick subsets of local samples, achieve $\mathcal{O}(1/\epsilon^2)$ sample and communication complexity. These complexity bounds indicate that, when the sample size is large (i.e., $\sum_i m_i \ge \mathcal{O}(1/\epsilon)$), the stochastic methods are preferred for their lower sample complexity, but the deterministic methods still achieve lower communication complexity. On the other hand, in the online setting (3), only stochastic methods can be applied, and those methods again achieve $\mathcal{O}(1/\epsilon^2)$ sample and communication complexity [14].

¹Note that for the finite-sum problem (2), the "sample complexity" refers to the total number of samples accessed by the algorithms to compute sample gradients $\nabla f_{i,j}$. If the same sample is accessed $k$ times and each time the evaluated gradients are different, then the sample complexity increases by $k$.
1.1 Related Works
1.1.1 Decentralized Optimization
Decentralized optimization has been extensively studied for convex problems and can be traced back to the 1980s [17]. Many popular algorithms, including decentralized gradient descent (DGD) [3, 5], distributed dual averaging [18], EXTRA [19], the distributed augmented Lagrangian method [20], adaptive diffusion [4, 21] and the alternating direction method of multipliers (ADMM) [22, 1, 23, 24], have been studied in the literature. We refer the readers to the recent survey [25] and the references therein for a complete review. Recent works also study optimal convergence rates with respect to the network dependency for strongly convex [26] and convex [27] problems. When the problem becomes nonconvex, many algorithms such as primal-dual based methods [28, 10], gradient tracking based methods [11, 29], and nonconvex extensions of DGD [30] have been proposed, for which iteration and communication complexities have been established. Recently, an algorithm that is optimal with respect to the network dependency has also been proposed in [13], with $\mathcal{O}(1/\epsilon)$ computation and $\mathcal{O}(1/(\sqrt{\xi}\,\epsilon))$ communication complexity, where $\xi$ denotes the spectral gap of the communication graph $\mathcal{G}$. Note that the above algorithms all require full gradient evaluations per iteration, so when directly applied to solve problems where each $f_i$ takes the form in (2), they all require $\mathcal{O}(\sum_i m_i/\epsilon)$ accesses of local data samples.
However, because each iteration of these algorithms needs a full gradient evaluation, the above batch methods can be computationally very demanding. One natural solution is to use a stochastic gradient to approximate the true gradient. Stochastic decentralized nonconvex methods can be traced back to [31, 32]; recent advances include DSGD [33], PSGD [2], D² [14], GNSD [16] and stochastic gradient push [15]. However, the large variance of the stochastic gradient estimator and the use of diminishing stepsizes slow down the convergence, resulting in at least $\mathcal{O}(1/\epsilon^2)$ sample and communication cost. Recent works also include studies on developing distributed algorithms with second-order guarantees [34, 35, 36, 37]. This is an interesting research direction that further showcases the strength of decentralized algorithms. However, to limit the scope of this paper, we only focus on the convergence of decentralized methods to first-order solutions (as defined in (5)).
1.1.2 Variance Reduction
Consider the following nonconvex finite-sum problem: $\min_x g(x) := \frac{1}{N}\sum_{j=1}^{N} g_j(x)$. If we assume that $g$ has Lipschitz gradient and directly apply the vanilla gradient descent (GD) method on $g$, then it requires $\mathcal{O}(N/\epsilon)$ sample gradient evaluations to reach $\|\nabla g(x)\|^2\le\epsilon$ [38]. When $N$ is large, it is usually preferable to process a subset of the data each time. In this case, stochastic gradient descent (SGD) can be used to achieve an $\mathcal{O}(1/\epsilon^2)$ sample complexity [39]. To bridge the gap between GD and SGD, many variance reduced gradient estimators have been proposed, including SAGA [40] and SVRG [41]. The idea is to reduce the variance of the stochastic gradient estimators and thereby substantially improve the convergence rate. In particular, the above approaches have been shown to achieve sample complexities of $\mathcal{O}(N^{2/3}/\epsilon)$ for finite-sum problems [42, 43, 44] and $\mathcal{O}(1/\epsilon^{5/3})$ for online problems [44]. Recent works further improve the above gradient estimators and achieve $\mathcal{O}(\sqrt{N}/\epsilon)$ sample complexity for finite-sum problems [45, 46, 47, 48] and $\mathcal{O}(1/\epsilon^{3/2})$ sample complexity for online problems [46, 47]. At the same time, the $\mathcal{O}(\sqrt{N}/\epsilon)$ sample complexity is shown to be optimal when $N\le\mathcal{O}(1/\epsilon^2)$ [46]. However, it is important to mention that one has to be careful when comparing various complexity bounds. This is because one key assumption that enables the variance reduced algorithms to achieve improved complexity with respect to $N$ is that each component function $g_j$ has Lipschitz gradient (therefore the components are "similar" in a certain sense), while vanilla GD only requires that the sum of the component functions has Lipschitz gradient.
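To make such recursive estimators concrete, the following is a minimal sketch of a SARAH/SPIDER-style loop on a toy least-squares finite sum; all names and parameter values here are our own illustrative choices, not taken from the cited works:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, q, batch = 50, 5, 7, 4    # components, dimension, epoch length, minibatch
A, b = rng.standard_normal((N, d)), rng.standard_normal(N)

def minibatch_grad(x, idx):
    """Gradient of 0.5*(a_j^T x - b_j)^2 averaged over the index set idx."""
    r = A[idx] @ x - b[idx]
    return A[idx].T @ r / len(idx)

x = rng.standard_normal(d)
x0 = x.copy()
x_prev, v, alpha = None, None, 0.05
for t in range(3 * q):
    if t % q == 0:
        # periodic full-batch refresh keeps the accumulated bias in check
        v = minibatch_grad(x, np.arange(N))
    else:
        # recursive update: correct the previous estimate by a minibatch
        # gradient difference evaluated at the two latest iterates
        idx = rng.choice(N, batch, replace=False)
        v = v + minibatch_grad(x, idx) - minibatch_grad(x_prev, idx)
    x_prev, x = x, x - alpha * v
```

Note that only `batch` samples are touched on most iterations; a full pass over the data is needed only once every `q` steps, which is the source of the improved sample complexity.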
1.1.3 Decentralized Variance Reduction
Variance reduced decentralized optimization has been extensively studied for convex problems. The DSA proposed in [49] combines algorithm design ideas from EXTRA [19] and SAGA [40], and achieves the first expected linear convergence for decentralized stochastic optimization. Recent works also include DSBA [50], diffusion-AVRG [51], ADFS [52], SAL-Edge [53], GT-SAGA [54], and Network-DANE [55]. In particular, DSBA [50] introduces the monotone operator to reduce the dependence on the problem condition number compared to DSA [49]. Diffusion-AVRG combines exact diffusion [56] with AVRG [57], and extends the results to scenarios where the data are unevenly distributed across nodes. ADFS [52] further uses randomized pairwise communication to achieve optimal network scaling. The work [53] combines the augmented Lagrangian (AL) based method with SAGA [40] to allow flexible mixing weight selections. GT-SAGA [54] improves the joint dependence on the condition number and the number of samples per node. Network-DANE [55] studies a Newton-type method and establishes linear convergence for quadratic losses. However, when the problem becomes nonconvex, to the best of our knowledge, no algorithms with provable guarantees are available.
1.2 Our Contribution
Compared with the majority of existing decentralized learning algorithms for either stochastic or deterministic problems, this work focuses on reducing both the total communication and the total sample complexity. Specifically, we propose a decentralized gradient estimation and tracking (DGET) approach, which uses a subset of samples to estimate the local gradients (by utilizing modern variance reduction techniques [46, 58]), while using the differences of past local gradients to track the global gradient (by leveraging the idea of decentralized gradient tracking [11, 59]). Remarkably, the proposed approach enjoys a sample complexity of $\mathcal{O}(mn+\sqrt{m}\,n/\epsilon)$ (with $m$ samples per node) and a communication complexity of $\mathcal{O}(1/\epsilon)$ for the finite-sum problem (2), which outperforms all existing decentralized methods.² The sample complexity is worse only by a factor of $\sqrt{n}$ than the known $\Omega(\sqrt{mn}/\epsilon)$ sample complexity lower bound for the centralized problem [46], and the communication complexity matches the existing communication lower bound [13] for decentralized nonconvex optimization (in terms of the dependency on $\epsilon$). Furthermore, the proposed approach is also able to achieve $\mathcal{O}(1/\epsilon^{3/2})$ sample complexity and $\mathcal{O}(1/\epsilon)$ communication complexity for the online problem (3), improving the best existing bounds (such as those obtained in [14, 16]) by factors of $\mathcal{O}(1/\epsilon^{1/2})$ and $\mathcal{O}(1/\epsilon)$, respectively.

²Note that, as mentioned before, deterministic batch gradient based methods such as xFILTER, Prox-PDA, NEXT, EXTRA achieve an $\mathcal{O}(mn/\epsilon)$ sample complexity. However, to be fair, one cannot directly compare those bounds with what can be achieved by sample-based, variance reduced methods, since the assumptions on the Lipschitz gradients are slightly different.
We illustrate the main results of this work in Figure 1, and compare the gradient and communication costs of state-of-the-art decentralized nonconvex optimization approaches in Table 1.³ Note that in Table 1, by constant stepsize we mean that the stepsize depends neither on the target accuracy $\epsilon$ nor on the iteration number.

³For deterministic batch algorithms such as DGD, NEXT, Prox-PDA and xFILTER, the bounds are obtained directly by multiplying their respective convergence rates by the per-iteration sample cost $\sum_i m_i$, since when directly applied to solve finite-sum problems, each iteration requires a full gradient evaluation.
2 The Finite Sum Setting
In this section, we consider the nonconvex decentralized optimization problem (4) with a finite number of local samples as defined in (2), which is restated below:
(P1) $\min_{\mathbf{x}\in\mathbb{R}^{nd}} \; \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i} f_{i,j}(x_i), \quad \text{s.t.} \; x_i = x_j, \; \forall\, (i,j)\in\mathcal{E}.$
We make the following standard assumptions on the above problem:
Assumption 1.
Each sample cost function $f_{i,j}$ has Lipschitz continuous gradient with constant $L>0$:
(6) $\|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\| \le L\|x-y\|, \quad \forall\, x,y\in\mathbb{R}^d, \; \forall\, i,j,$
which also implies
(7a) $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x-y\|, \quad \forall\, x,y\in\mathbb{R}^d, \; \forall\, i,$
(7b) $\|\nabla f(x) - \nabla f(y)\| \le L\|x-y\|, \quad \forall\, x,y\in\mathbb{R}^d,$
(7c) $f(y) \le f(x) + \langle\nabla f(x),\, y-x\rangle + \frac{L}{2}\|y-x\|^2, \quad \forall\, x,y\in\mathbb{R}^d.$
Assumption 2.
The mixing matrix $W\in\mathbb{R}^{n\times n}$ is symmetric and doubly stochastic, and satisfies the following:
(8) $\eta := |\lambda_2(W)| < 1,$
where $\lambda_2(W)$ denotes the second largest eigenvalue (in magnitude) of $W$. Note that many choices of mixing matrices satisfy the above condition. Here we give three commonly used mixing matrices [60, 61], where $d_i$ denotes the degree of node $i$, and $d_{\max} := \max_i d_i$:

Metropolis-Hastings Weight
(9) $w_{ij} = \begin{cases} 1/\left(1+\max\{d_i,d_j\}\right), & \text{if } (i,j)\in\mathcal{E}, \\ 1-\sum_{j'\in\mathcal{N}_i} w_{ij'}, & \text{if } i=j, \\ 0, & \text{otherwise.}\end{cases}$

Maximum-Degree Weight
(10) $w_{ij} = \begin{cases} 1/(d_{\max}+1), & \text{if } (i,j)\in\mathcal{E}, \\ 1-d_i/(d_{\max}+1), & \text{if } i=j, \\ 0, & \text{otherwise.}\end{cases}$

Laplacian Weight
(11) $W = I - L_{\mathcal{G}}/\tau.$ If we use $L_{\mathcal{G}}$ to denote the graph Laplacian matrix, and $\lambda_{\max}(L_{\mathcal{G}})$ and $\tilde{\lambda}_{\min}(L_{\mathcal{G}})$ as its largest and second smallest eigenvalues, then one of the common choices of $\tau$ is $\tau = \left(\lambda_{\max}(L_{\mathcal{G}})+\tilde{\lambda}_{\min}(L_{\mathcal{G}})\right)/2$.
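As an illustration of the Metropolis-Hastings rule (9), the sketch below (our own code; the function name is invented for this example) builds the mixing matrix for a small ring graph and checks the properties required by Assumption 2 numerically:

```python
import numpy as np

def metropolis_hastings_weights(adj):
    """Build the Metropolis-Hastings mixing matrix from a symmetric 0/1
    adjacency matrix (no self-loops): w_ij = 1/(1+max(d_i,d_j)) on edges,
    with w_ii absorbing the remaining mass so each row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# ring graph over 5 nodes
adj = np.zeros((5, 5), dtype=int)
for i in range(5):
    adj[i, (i + 1) % 5] = adj[(i + 1) % 5, i] = 1
W = metropolis_hastings_weights(adj)
# eigenvalue moduli in ascending order: the largest is 1, and the second
# largest modulus is strictly below 1, as Assumption 2 requires
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
```

The same checks (symmetry, rows summing to one, spectral gap) apply to the maximum-degree and Laplacian weights as well.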
Next, let us formally define our communication and sample complexity measures.
Definition 1.
(Sample Complexity) The Incremental First-order Oracle (IFO) is defined as an operation in which one node $i$ takes a data sample index $j$ and a point $x\in\mathbb{R}^d$, and returns the pair $\left(f_{i,j}(x), \nabla f_{i,j}(x)\right)$. The sample complexity is defined as the total number of IFO calls required across the entire network to achieve an $\epsilon$-stationary solution defined in (5).
Definition 2.
(Communication Complexity) In one round of communication, each node is allowed to broadcast and to receive one $d$-dimensional vector to and from its neighbors, respectively. The communication complexity is then defined as the total number of rounds of communication required to achieve an $\epsilon$-stationary solution defined in (5).
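To make Definition 1 concrete, here is a small bookkeeping sketch (entirely our own construction) that counts IFO calls, which is exactly the quantity tallied by the sample complexity:

```python
class IFOCounter:
    """Wrap a per-sample gradient oracle and count calls: one call = one
    sample gradient evaluated at one point = one unit of sample complexity."""
    def __init__(self, sample_grad):
        self.sample_grad = sample_grad
        self.calls = 0

    def grad(self, x, j):
        self.calls += 1
        return self.sample_grad(x, j)

# toy scalar oracle: the gradient of f_j(x) = (x - j)^2 is 2*(x - j)
oracle = IFOCounter(lambda x, j: 2.0 * (x - j))
for j in range(10):          # one full local batch of 10 samples
    _ = oracle.grad(0.5, j)
print(oracle.calls)  # -> 10: one full local gradient costs 10 IFO calls
```

A full-batch method pays the whole local sample count per iteration, while a minibatch method pays only the minibatch size; the analysis below trades these two costs off.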
2.1 Algorithm Design
In this section, we introduce the proposed algorithm, named Decentralized Gradient Estimation and Tracking (DGET), for solving problem (P1). To motivate our algorithm design, observe from the discussion in Section 1.1 that existing deterministic decentralized methods typically suffer from high sample complexity, while decentralized stochastic algorithms suffer from high communication cost. This phenomenon inspires us to find a solution in between, which can simultaneously reduce the sample and the communication costs.
One natural solution is to incorporate modern variance reduction techniques into classical decentralized methods. Our idea is to use a variance reduced gradient estimator to track the full gradient of the entire problem, and then perform a decentralized gradient descent update. The gradient tracking step gives fast convergence with a constant stepsize, while the variance reduction significantly reduces the variation of the estimated gradient.
Unfortunately, the decentralized methods and variance reduction techniques cannot be directly combined. Compared with the existing decentralized and variance reduction techniques in the literature, the key challenges in the algorithm design and analysis are given below:

Due to the decentralized nature of the problem, none of the nodes can access the full gradient of the original objective function, and a (possibly uncontrollable) network consensus error is always present while the decentralized algorithm runs. Therefore, it is not clear whether the existing variance reduction methods can be applied effectively at each individual node, since all of them require an accurate global gradient evaluation from time to time.

It is then natural to integrate some procedure that approximates the global gradient. For example, one straightforward way to perform gradient tracking is to introduce a new auxiliary variable $y_i$ as follows [11, 16], which is updated by only using the local estimated gradient and the neighbors' parameters:
(12) $y_i^{r+1} = \sum_{j\in\mathcal{N}_i} w_{ij}\, y_j^{r} + \nabla f_i(x_i^{r+1};\xi_i^{r+1}) - \nabla f_i(x_i^{r};\xi_i^{r}),$
where $\xi_i^{r}$ and $\xi_i^{r+1}$ are the samples selected at the $r$-th and $(r+1)$-th iterations, respectively. If the tracked $y_i$'s were used in the (local) variance reduction procedure, there would be at least two main issues with reducing the variance of the tracked gradient: i) at the early stage of the decentralized algorithm, the consensus/tracking error may dominate the variance of the tracked gradient, since the information about the full gradient has not been sufficiently propagated through the network; consequently, performing variance reduction on the $y_i$'s will not improve the quality of the full gradient estimate; ii) even if there were no consensus error, since only the stochastic gradients $\nabla f_i(x_i^{r};\xi_i^{r})$ are used in the tracking, the $y_i$'s themselves have high variance, so such (possibly low-quality) full gradient estimates may not be compatible with the variance reduction methods developed in the current literature (which often require a full gradient evaluation from time to time).
The challenges discussed above suggest that it is nontrivial to design an algorithm that can be implemented in a fully decentralized manner while still attaining the superior sample complexity and convergence rate of state-of-the-art variance reduction methods. In this work, we propose an algorithm which uses a novel decentralized gradient estimation and tracking strategy, together with a number of other design choices, to address the issues raised above.
To introduce the algorithm, let us first define two auxiliary local variables $v_i$ and $y_i$, where $v_i$ is designed to estimate the local full batch gradient $\nabla f_i(x_i)$ by only using sampled gradients $\nabla f_{i,j}(x_i)$, while $y_i$ is designed to track the global average gradient by utilizing the $v_i$'s. After the local and global gradient estimates are obtained, the algorithm performs a local update along the direction of $y_i$; see the main steps below.

Local update using the estimated gradient ($x$ update): Each local node first combines its previous iterate with those of its neighbors (using the $i$-th row of the weight matrix $W$), then makes a prediction based on the gradient estimate $y_i^{r}$, i.e.,
(13) $x_i^{r+1} = \sum_{j=1}^{n} w_{ij}\, x_j^{r} - \alpha\, y_i^{r}.$
Estimate local gradients ($v$ update): Each local node either directly calculates the full local gradient $\nabla f_i(x_i^{r+1})$, or estimates its local gradient via an estimator using random samples, depending on the iteration index $r$, i.e.,
(14) $v_i^{r+1} = \begin{cases} \nabla f_i(x_i^{r+1}), & \text{if mod}(r+1,q)=0, \\ v_i^{r} + \frac{1}{|S|}\sum_{\xi\in S}\left(\nabla f_i(x_i^{r+1};\xi) - \nabla f_i(x_i^{r};\xi)\right), & \text{otherwise,}\end{cases}$
where $q$ is the interval at which the local full gradient is evaluated once, and $S$ is a minibatch of local samples drawn uniformly at random.

Track global gradients ($y$ update): Each local node combines its previous local estimate $y_i^{r}$ with those of its neighbors, then makes a new estimate based on the fresh information $v_i^{r+1}-v_i^{r}$, i.e.,
(15) $y_i^{r+1} = \sum_{j=1}^{n} w_{ij}\, y_j^{r} + v_i^{r+1} - v_i^{r}.$
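The three updates above can be sketched end to end as follows (a toy simulation in our own notation with illustrative parameter choices, not the paper's reference implementation). Each node holds a small least-squares problem, and the loop performs the $x$, $v$, and $y$ updates with a full local refresh every `q` iterations:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d, q, alpha, batch = 4, 20, 3, 5, 0.05, 4
# node i holds samples (A[i], b[i]); f_{i,j}(x) = 0.5*(a_j^T x - b_j)^2
A = rng.standard_normal((n, m, d))
b = rng.standard_normal((n, m))

def local_grad(i, x, idx):
    """Minibatch gradient of node i's local cost at point x."""
    r = A[i, idx] @ x - b[i, idx]
    return A[i, idx].T @ r / len(idx)

# Metropolis-Hastings mixing matrix for a 4-node ring (all degrees equal 2)
W = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])

X = np.tile(rng.standard_normal(d), (n, 1))   # common initialization
V = np.stack([local_grad(i, X[i], np.arange(m)) for i in range(n)])
Y = V.copy()                                   # initialize y^0 = v^0
for r in range(3 * q):
    X_prev, X = X, W @ X - alpha * Y           # x update: mix, then move along y
    V_prev = V.copy()
    for i in range(n):
        if (r + 1) % q == 0:                   # v update, outer: full local gradient
            V[i] = local_grad(i, X[i], np.arange(m))
        else:                                  # v update, inner: recursive estimate
            idx = rng.choice(m, batch, replace=False)
            V[i] = V_prev[i] + local_grad(i, X[i], idx) - local_grad(i, X_prev[i], idx)
    Y = W @ Y + V - V_prev                     # y update: track the global gradient
```

Because $W$ is doubly stochastic and $y^0=v^0$, the average of the tracking variables $y_i$ always equals the average of the local estimates $v_i$ after every iteration, which is the invariant exploited in the convergence analysis below.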
In the following table, we summarize the proposed algorithm in a more compact form. Note that we use $\mathbf{x}^{r}$, $\mathbf{v}^{r}$, $\mathbf{y}^{r}$, and $\nabla F(\mathbf{x}^{r})$ to denote the concatenations of the $x_i^{r}$, $v_i^{r}$, $y_i^{r}$, and $\nabla f_i(x_i^{r})$ across all nodes.
Remark 1. The above algorithm can also be interpreted as a "double loop" algorithm, where each outer iteration (i.e., mod$(r,q)=0$) is followed by $q-1$ inner iterations (i.e., mod$(r,q)\neq 0$). The inner loop estimates the local gradient via stochastic sampling at every iteration, while the outer loop reduces the estimation variance by recalculating the full batch gradient once every $q$ iterations. The local communication, update, and tracking steps are performed at both inner and outer iterations.
Remark 2. We further remark that, in DGET, the total number of communication rounds is of the same order as the total number of iterations, since only two rounds of communication are performed per iteration: broadcasting the local variables $x_i$ and $y_i$ to the neighbors, and combining the local $x_j$'s and $y_j$'s, $j\in\mathcal{N}_i$. On the other hand, the total number of samples used per iteration is either $n|S|$ (when inner iterations are executed) or $\sum_i m_i$ (when outer iterations are executed).
Remark 3. Note that our $x$ and $y$ updates are reminiscent of the classical gradient tracking methods [11, 16], and the $v$ update takes a form similar to the SARAH/SPIDER estimators [58, 46]. However, it is nontrivial to directly combine gradient tracking and variance reduction, as we mentioned at the beginning of Section 2.1. The proposed DGET uses a number of design choices to address these challenges. For example, the two vectors $v_i$ and $y_i$ are used to respectively estimate the local and global gradients, in such a way that the local gradient estimates do not depend on the (potentially inaccurate) globally tracked gradients; to reduce the variance in $y_i$, we occasionally use the full local gradient to perform tracking, etc. Nevertheless, the key challenge in the analysis is to properly bound the accumulated errors from the two estimates $v_i$ and $y_i$.
2.2 Convergence Analysis
To facilitate our analysis, we first define the average iterates $\bar{x}^{r}$, $\bar{v}^{r}$ and $\bar{y}^{r}$ among all nodes:
(16a) $\bar{x}^{r} := \frac{1}{n}\sum_{i=1}^{n} x_i^{r},$
(16b) $\bar{v}^{r} := \frac{1}{n}\sum_{i=1}^{n} v_i^{r},$
(16c) $\bar{y}^{r} := \frac{1}{n}\sum_{i=1}^{n} y_i^{r}.$
Note that here we use $r$ to denote the overall iteration number. By the double loop nature of the algorithm, the total number of outer iterations until iteration $r$ is $\lfloor r/q\rfloor$.
Before we formally conduct the analysis, we note three simple facts about Algorithm 1.
First, according to (13) and the definition (16a), and since $W$ is doubly stochastic, the update rule of the average iterate can be expressed as:
(17) $\bar{x}^{r+1} = \bar{x}^{r} - \alpha\, \bar{y}^{r}.$
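Relation (17) follows because every row and column of $W$ sums to one, so averaging commutes with mixing; a quick numerical check with our own toy values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha = 5, 3, 0.1
# a symmetric doubly stochastic W: lazy random walk on a 5-node ring
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25

X = rng.standard_normal((n, d))    # stacked local iterates x_i^r
Y = rng.standard_normal((n, d))    # stacked tracking variables y_i^r
X_new = W @ X - alpha * Y          # the x-update (13) applied to all nodes
# mixing preserves the network average, so the average follows (17)
lhs = X_new.mean(axis=0)
rhs = X.mean(axis=0) - alpha * Y.mean(axis=0)
```

The same column-stochasticity argument underlies the remaining facts about the averaged $v$ and $y$ iterates.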
Second, if the iteration index $r$ satisfies mod$(r,q)=0$ (that is, when an outer iteration is executed), from (14) and (15) it is easy to check that the following relations hold (given the initialization $y^0=v^0$):
(18) $v_i^{r} = \nabla f_i(x_i^{r}), \quad \forall\, i,$
(19) $\bar{y}^{r} = \bar{v}^{r} = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x_i^{r}).$
Third, if mod$(r,q)\neq 0$, we have the following relations:
(20) $\bar{v}^{r} = \bar{v}^{r-1} + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|S|}\sum_{\xi\in S}\left(\nabla f_i(x_i^{r};\xi) - \nabla f_i(x_i^{r-1};\xi)\right),$
(21) $\bar{y}^{r} = \bar{v}^{r}.$
Next, we outline the proof steps of the convergence rate analysis.
Step 1. We first show that the variance of our local and global gradient estimators can be bounded via the $x$ and $y$ iterates. The bounds given below are tighter than those in the classical analyses of decentralized stochastic methods, which assume the variance is bounded by some universal constant [14, 2, 33]. This is an important step towards obtaining lower sample/communication complexity, since later we show that the right-hand side (RHS) of our bound shrinks as the iteration progresses.
Lemma 1.
Step 2. We then study the descent of $\mathbb{E}[f(\bar{x}^{r})]$, the expected value of the cost function evaluated at the average iterate.
Lemma 2.
A key observation from Lemma 2 is that, in the RHS of (25), besides the negative term, we also have several extra error terms that cannot be made negative. Therefore, we need to find some potential function that is strictly descending per iteration.
Note that the term $\sigma^2$ in (24) comes from the variance of $v_i$ in estimating the full local gradient at each outer iteration. For Algorithm 1, where we calculate a full batch gradient at every outer iteration in step (18), it is clear that $\sigma=0$. However, we still include this term in the above result because, later when we analyze the online version (where such a variance is no longer zero), we can reuse the result.
Step 3. Next, we introduce the contraction property, which, combined with Lemma 2, will be used to construct the potential function.
Lemma 3.
(Iterates Contraction) Under Assumption 2 on the mixing matrix $W$ and applying Algorithm 1, we have the following contraction property of the iterates:
(26)  
(27) 
where $\tilde{\eta}$ is some constant such that $\eta < \tilde{\eta} < 1$.
If we further assume that, for all $r$ satisfying mod$(r,q)=0$, the following holds for some $\sigma \ge 0$:
(28) 
Then we have the following bound on the successive differences of the iterates for all $r$:
(29) 
Again, $\sigma$ comes from the variance of estimating the local gradient at each outer iteration, and we have $\sigma=0$ for Algorithm 1. Note that (26) can also be written as follows:
(30) 
One key observation here is that the contraction coefficient can be made strictly smaller than one by properly choosing $\tilde{\eta}$. Therefore, the RHS of the above equation can be made negative by properly selecting the stepsize $\alpha$.
Step 4. This step combines the descent estimates obtained in Steps 2 and 3 to construct a potential function, using a conic combination of the quantities bounded in Lemmas 2 and 3.
Lemma 4.
Step 5. We can then properly choose the stepsize $\alpha$ and make the coefficients of the potential function positive. Therefore, our solution quality measure can be expressed via differences of the potential function values, and the proof is complete.
Theorem 1.
Consider problem (P1) under Assumptions 1 and 2. If we pick the stepsize $\alpha$ and the parameters $(q,|S|)$ properly, then we have the following results by applying Algorithm 1:
where $\underline{f}$ denotes a lower bound of $f$, and the constants are defined as follows:
in which $\eta$ denotes the second largest eigenvalue (in magnitude) of the mixing matrix $W$ from (8), $\tilde{\eta}$ denotes a constant satisfying $\eta < \tilde{\eta} < 1$, and the remaining constants are defined in (31)-(33).
By directly applying the above result, we obtain upper bounds on the gradient and communication costs by properly choosing the parameters $q$ and $|S|$ based on the local sample sizes and $\epsilon$.
Corollary 1.
Proof.
If we pick $q$ and $|S|$ on the order of the square root of the local sample size, then we can obtain the following from Theorem 1: