1 Introduction
Collaborative learning refers to the task of learning a common objective among multiple computing agents without any central node, using on-device computation and local communication among neighboring agents. Such tasks have recently gained considerable attention in the context of machine learning and optimization, as they enable scalability to larger datasets and systems as well as data locality, ownership, and privacy. As such, collaborative learning naturally arises in various applications such as distributed deep learning
(LeCun et al., 2015; Dean et al., 2012), multi-agent robotics and path planning (Choi and How, 2010; Jha et al., 2016), and distributed resource allocation in wireless networks (Ribeiro, 2010), to name a few. While collaborative learning has recently drawn significant attention due to its decentralized implementation, it faces major challenges both at the system level and in algorithm design. The decentralized implementation of collaborative learning faces two major system challenges: (i) significant slowdown due to straggling nodes, where a subset of nodes can be largely delayed in their local computation, which slows down the wall-clock convergence of the decentralized algorithm; and (ii) large communication overhead of the message-passing algorithm as the dimension of the parameter vector increases, which can further slow down the algorithm's convergence. Moreover, in the presence of these system bottlenecks, the efficacy of classical consensus optimization methods is unclear and needs to be revisited.
In this work, we consider the general data-parallel setting where the data is distributed across different computing nodes, and we develop decentralized optimization methods that do not rely on a central coordinator but instead require only local computation and communication among neighboring nodes. As the main contribution of this paper, we propose a straggler-robust and communication-efficient algorithm for collaborative learning called QuanTimed-DSGD, a quantized and deadline-based decentralized stochastic gradient descent method. We show that the proposed scheme provably improves upon the convergence time of vanilla synchronous decentralized optimization methods. The key theoretical contribution of the paper is to develop the
first quantized decentralized non-convex optimization algorithm with provable and exact convergence to a first-order optimal solution. There are two key ideas in our proposed algorithm. To provide robustness against stragglers, we impose a deadline on the computation time of each node. In a synchronous implementation of the proposed algorithm, at every iteration all the nodes simultaneously start computing stochastic gradients by randomly picking data points from their local batches and evaluating the gradient function on the picked data points. By the time the deadline expires, each node has computed a random number of stochastic gradients, which it aggregates to generate a stochastic gradient for its local objective. By doing so, each iteration takes a constant computation time, as opposed to deadline-free methods in which each node has to wait for all of its neighbors to complete their gradient computations. To tackle the communication bottleneck in collaborative learning, we only allow the decentralized nodes to share with their neighbors a quantized version of their local models. Quantizing the exchanged models reduces the communication load, which is critical for large and dense networks.
We analyze the convergence of the proposed QuanTimed-DSGD for strongly convex and non-convex loss functions under standard assumptions on the network, the quantizer, and the stochastic gradients. In the strongly convex case, we show that QuanTimed-DSGD exactly finds the global optimum for every node at a provable sublinear rate. In the non-convex setting, QuanTimed-DSGD provably finds first-order optimal solutions at a sublinear rate; moreover, the consensus error decays at the same rate, which guarantees exact consensus for a sufficiently large number of iterations. Furthermore, we numerically evaluate QuanTimed-DSGD on the benchmark datasets CIFAR-10 and MNIST, where it demonstrates significant speedups in runtime compared to state-of-the-art baselines.
Related Work: Decentralized consensus optimization has been studied extensively. The most popular first-order choices for the convex setting are distributed gradient descent-type methods (Nedic and Ozdaglar, 2009; Jakovetic et al., 2014; Yuan et al., 2016; Qu and Li, 2017), augmented Lagrangian algorithms (Shi et al., 2015a, b), distributed variants of the alternating direction method of multipliers (ADMM) (Schizas et al., 2008; Boyd et al., 2011; Shi et al., 2014; Chang et al., 2015), dual averaging (Duchi et al., 2012; Tsianos et al., 2012), and several dual-based strategies (Seaman et al., 2017; Scaman et al., 2018; Uribe et al., 2018). Recently, several works have studied non-convex decentralized consensus optimization and established convergence to a stationary point (Zeng and Yin, 2018; Hong et al., 2017, 2018; Sun and Hong, 2018; Scutari et al., 2017; Scutari and Sun, 2018; Jiang et al., 2017; Lian et al., 2017a).
The idea of improving the communication efficiency of distributed optimization via message-compression schemes goes back a few decades (Tsitsiklis and Luo, 1987); however, it has recently gained considerable attention due to the growing importance of distributed applications. In particular, efficient gradient-compression methods are provided in (Alistarh et al., 2017; Seide et al., 2014; Bernstein et al., 2018) and deployed in the distributed master-worker setting. In the decentralized setting, quantization methods were proposed in different convex optimization contexts with non-vanishing errors (Yuksel and Basar, 2003; Rabbat and Nowak, 2005; Kashyap et al., 2006; El Chamie et al., 2016; Aysal et al., 2007; Nedic et al., 2008). The first exact decentralized optimization methods with quantized messages were given in (Reisizadeh et al., 2018; Zhang et al., 2018), and more recently, new techniques have been developed in this context for convex problems (Doan et al., 2018; Koloskova et al., 2019; Berahas et al., 2019; Lee et al., 2018a, b).
The straggler problem has been widely observed in distributed computing clusters (Dean and Barroso, 2013; Ananthanarayanan et al., 2010). A common approach to mitigating stragglers is to replicate the computing tasks of slow nodes on other computing nodes (Ananthanarayanan et al., 2013; Wang et al., 2014), but this is clearly not feasible in collaborative learning. Another line of work proposes coding-theoretic ideas for speeding up distributed machine learning (Lee et al., 2018c; Tandon et al., 2016; Yu et al., 2017; Reisizadeh et al., 2019a, b), but these mostly apply to the master-worker setup and to particular computation types such as linear computations or full gradient aggregation. The closest work to ours is (Ferdinand et al., 2019), which considers deadline-based decentralized optimization for convex functions, but does not address communication bottlenecks and quantization, nor non-convex objectives. Another line of work proposes asynchronous decentralized SGD, where the workers update their models based on the last iterates received from their neighbors (Recht et al., 2011; Lian et al., 2017b; Lan and Zhou, 2018; Peng et al., 2016; Wu et al., 2017). While asynchronous methods are inherently robust to stragglers, they can suffer from slow convergence due to using stale models.
2 Problem Setup
In this paper, we focus on a stochastic learning model in which we aim to solve the problem

$$\min_{x \in \mathbb{R}^p} \; L(x) := \mathbb{E}_{\theta}\left[ f(x, \theta) \right], \tag{1}$$

where $f(\cdot, \theta)$ is a stochastic loss function, $x \in \mathbb{R}^p$ is our optimization variable, $\theta$ is a random variable with probability distribution $P$, and $L$ is the expected loss function, also called the population risk. We assume that the underlying distribution $P$ of the random variable $\theta$ is unknown and that we have access only to $N$ realizations of it. Our goal is to minimize the loss associated with the $N$ realizations $\theta_1, \dots, \theta_N$ of the random variable $\theta$, which is also known as empirical risk minimization. To be more precise, we aim to solve the empirical risk minimization (ERM) problem

$$\min_{x \in \mathbb{R}^p} \; \hat{L}(x) := \frac{1}{N} \sum_{i=1}^{N} f(x, \theta_i), \tag{2}$$

where $\hat{L}$ is the empirical loss associated with the samples $\{\theta_1, \dots, \theta_N\}$.
Collaborative Learning Perspective. Our goal is to solve the ERM problem in (2) in a decentralized manner over $n$ nodes. This setting arises in a plethora of applications where either the total number of samples $N$ is massive and the data cannot be stored or processed on a single node, or the samples are available in parts at different nodes and, due to privacy or communication constraints, exchanging raw data points among the nodes is not possible. Hence, we assume that each node $m$ has access to $N/n$ samples and that its local objective is

$$f_m(x) := \frac{n}{N} \sum_{i \in \mathcal{D}_m} f(x, \theta_i), \tag{3}$$

where $\mathcal{D}_m$ is the set of samples available at node $m$. Nodes aim to collaboratively minimize the average of all $n$ local objective functions, denoted by $f$, which is given by

$$f(x) := \frac{1}{n} \sum_{m=1}^{n} f_m(x). \tag{4}$$

Indeed, the objective functions $f$ and $\hat{L}$ are equivalent if the local sample sets $\mathcal{D}_m$ partition $\{\theta_1, \dots, \theta_N\}$. Therefore, by minimizing the global objective function $f$ we also obtain the solution of the ERM problem in (2).
We can rewrite the optimization problem in (4) as a classical decentralized optimization problem as follows. Let $x_m$ be the decision variable of node $m$. Then, (4) is equivalent to

$$\min_{x_1, \dots, x_n} \; \frac{1}{n} \sum_{m=1}^{n} f_m(x_m) \quad \text{s.t.} \quad x_1 = x_2 = \dots = x_n, \tag{5}$$

as the objective function values of (4) and (5) are the same when the iterates of all nodes are identical and we have consensus. The challenge in distributed learning is to minimize the global loss only by exchanging information with neighboring nodes, while ensuring that the nodes' variables stay close to each other. We consider a network of $n$ computing nodes characterized by an undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node set $\mathcal{V}$ and edge set $\mathcal{E}$, and each node is allowed to exchange information only with its neighboring nodes in the graph $\mathcal{G}$, which we denote by $\mathcal{N}_m$ for node $m$.
In a stochastic optimization setting, where the true objective is defined as an expectation, there is a limit to the accuracy with which we can minimize the population risk $L$ given only $N$ samples, even if we have access to the optimal solution of the empirical risk $\hat{L}$. In particular, it has been shown that when the loss function $f$ is convex, the difference between the population risk $L$ and the empirical risk $\hat{L}$ corresponding to $N$ samples is, with high probability, uniformly bounded by $\mathcal{O}(1/\sqrt{N})$; see (Bottou and Bousquet, 2008). Thus, without collaboration, each node can minimize its local cost $f_m$, based on its $N/n$ samples, to reach an estimate of the optimal solution with an error of $\mathcal{O}(\sqrt{n/N})$. By minimizing the aggregate loss collaboratively, nodes reach an approximate solution of the expected risk problem with the smaller error of $\mathcal{O}(1/\sqrt{N})$. Based on this formulation, our goal in the convex setting is to find a point $x_m$ for each node that attains the statistical accuracy of the full sample set, i.e., $\hat{L}(x_m) - \min_x \hat{L}(x) = \mathcal{O}(1/\sqrt{N})$, which further implies $L(x_m) - \min_x L(x) = \mathcal{O}(1/\sqrt{N})$.

For a non-convex loss function $f$, however, the empirical risk $\hat{L}$ is also non-convex and solving the problem in (4) is hard in general. Therefore, we only focus on finding a point that satisfies the first-order optimality condition for (4) up to some accuracy, i.e., finding a point $x$ such that $\|\nabla \hat{L}(x)\|$ is small. Under the assumption that the gradient of the loss is sub-Gaussian, it has been shown that with high probability the gap between the gradients of the expected risk and the empirical risk is bounded by $\mathcal{O}(1/\sqrt{N})$; see (Mei et al., 2018). As in the convex setting, by minimizing the aggregate loss instead of the local loss, each node finds a better approximation of a first-order stationary point of the expected risk $L$. Therefore, our goal in the non-convex setting is to find a point $x_m$ that satisfies $\|\nabla \hat{L}(x_m)\| = \mathcal{O}(1/\sqrt{N})$, which also implies $\|\nabla L(x_m)\| = \mathcal{O}(1/\sqrt{N})$.
3 Proposed QuanTimed-DSGD Method
In this section, we present our proposed QuanTimed-DSGD algorithm, which accounts for both robustness to stragglers and communication efficiency in decentralized optimization. To ensure robustness to stragglers' delays, we introduce a deadline-based protocol for updating the iterates, in which nodes compute their local gradient estimates only for a specific amount of time and then use these estimates to update their iterates. This is in contrast to the mini-batch setting, in which nodes have to wait for the slowest machine to finish its local gradient computation. To reduce the communication load, we assume that nodes exchange only a quantized version of their local iterates. However, using quantized messages induces extra noise in the decision-making process, which makes the analysis of our algorithm more challenging. A detailed description of the proposed algorithm follows.
Deadline-Based Gradient Computation. Consider the current model $x_m^t$ available at node $m$ at iteration $t$, and recall the definition of the local objective function $f_m$ at node $m$ in (3). The cost of computing the local gradient $\nabla f_m$ scales linearly with the number of samples $N/n$ assigned to the $m$-th node. A common way to reduce the computation cost at each node when $N/n$ is large is to use a mini-batch approximation of the gradient: each node picks a subset of its local samples and computes a stochastic gradient of $f_m$. A major challenge for this procedure is the presence of stragglers in the network: given a mini-batch size $b$, all nodes have to compute the average of exactly $b$ stochastic gradients. Thus, all the nodes have to wait for the slowest machine to finish its computation before exchanging their new models with their neighbors.
To resolve this issue, we propose a deadline-based approach in which we set a fixed deadline $T_d$ on the time that each node can spend computing its local stochastic gradient estimate. Once the deadline is reached, nodes form their gradient estimates using whatever computation (mini-batch size) they could perform. Thus, with this deadline-based procedure, nodes do not need to wait for the slowest machine before updating their iterates. However, their mini-batch sizes, and consequently the noise levels of their gradient approximations, will differ. To be more specific, let $\mathcal{S}_m^t$ denote the set of random samples chosen at iteration $t$ by node $m$, and define the stochastic gradient of node $m$ at iteration $t$ as

$$\tilde{\nabla} f_m(x_m^t) := \frac{1}{|\mathcal{S}_m^t|} \sum_{i \in \mathcal{S}_m^t} \nabla f(x_m^t, \theta_i) \tag{6}$$

for $|\mathcal{S}_m^t| \ge 1$. If no gradients are computed by the deadline $T_d$, i.e., $|\mathcal{S}_m^t| = 0$, we set $\tilde{\nabla} f_m(x_m^t) = 0$.
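The deadline-based gradient step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `deadline_gradient`, the timing check, and the per-sample gradient oracle `grad_fn` are hypothetical stand-ins.

```python
import time
import numpy as np

def deadline_gradient(x, samples, grad_fn, deadline):
    """Aggregate per-sample gradients computed before the deadline expires.

    grad_fn(x, sample) returns one stochastic gradient; if no gradient
    finishes in time, return the zero vector, as in the convention of (6).
    """
    start = time.monotonic()
    grads = []
    order = np.random.permutation(len(samples))  # draw samples uniformly at random
    for i in order:
        if time.monotonic() - start >= deadline:
            break
        grads.append(grad_fn(x, samples[i]))
    if not grads:
        return np.zeros_like(x)  # |S| = 0: no gradient finished before the deadline
    return np.mean(grads, axis=0)  # average of the gradients computed in time
```

For instance, with the quadratic loss $f(x, \theta) = \tfrac{1}{2}\|x - \theta\|^2$ one would pass `grad_fn = lambda x, s: x - s`; a generous deadline recovers the full local gradient, while a deadline of zero returns the zero vector.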
Computation Model. To illustrate the advantage of our deadline-based scheme over the fixed mini-batch scheme, we formally state the model we use for the processing time of nodes in the network. We remark that our algorithm is oblivious to the choice of the computation model, which is used merely for the analysis. We define the processing speed of each machine as the number of stochastic gradients it computes per second. We assume that the processing speed of machine $m$ at iteration $t$ is a random variable $V_m^t$, and that the $V_m^t$'s are i.i.d. with probability distribution $F_V$. We further assume that the domain of this random variable is bounded, with realizations in an interval $[\underline{v}, \overline{v}]$ for some $0 < \underline{v} \le \overline{v}$. Since $V_m^t$ stochastic gradients can be computed per second, the size of the mini-batch accumulated by the deadline is a random variable given by $|\mathcal{S}_m^t| = \lfloor T_d V_m^t \rfloor$.
In the fixed mini-batch scheme, at any iteration $t$ all the nodes have to wait for the machine with the slowest processing time before updating their iterates, and thus the overall computation time is $b / V_{\min}^t$, where $V_{\min}^t := \min_m V_m^t$. In our deadline-based scheme there is a fixed deadline $T_d$ that limits the computation time of the nodes; it is chosen such that the expected mini-batch size matches, i.e., $T_d\,\mathbb{E}[V] = b$, while the mini-batch scheme requires an expected time of $\mathbb{E}[b / V_{\min}^t]$. The gap between $T_d$ and $\mathbb{E}[b / V_{\min}^t]$ depends on the distribution $F_V$ and can be unbounded in general, growing with the number of nodes $n$.
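A quick Monte Carlo sketch of this gap, under an illustrative speed distribution; the uniform distribution, node count, and batch size below are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, trials = 50, 32, 2000                # nodes, mini-batch size, Monte Carlo trials
# Illustrative speed distribution: V ~ Uniform[1, 10] gradients per second.
V = rng.uniform(1.0, 10.0, size=(trials, n))

T_minibatch = np.mean(b / V.min(axis=1))   # expected wait for the slowest of n nodes
T_deadline = b / 5.5                       # deadline T_d chosen so E[V] * T_d = b

print(f"fixed mini-batch: {T_minibatch:.2f}s per iteration, deadline: {T_deadline:.2f}s")
```

As $n$ grows, $\min_m V_m$ concentrates near the lower end of the support, so the per-iteration gap between the two schemes widens.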
Quantized Message-Passing. To reduce the communication overhead of exchanging variables between nodes, we use quantization schemes that significantly reduce the required number of bits. More precisely, instead of sending $x_m^t$, the $m$-th node sends $z_m^t = Q(x_m^t)$, a quantized version of its local variable, to its neighbors $\mathcal{N}_m$. As an example, consider the low-precision quantizer specified by a scale factor $\eta$ and $r$ bits, with the representable range $\{-\eta 2^{r-1}, \dots, -\eta, 0, \eta, \dots, \eta(2^{r-1}-1)\}$. For any $x$ in the representable range, the quantizer outputs

$$Q(x) = \begin{cases} \eta \left\lfloor x/\eta \right\rfloor & \text{with probability } 1 - \dfrac{x - \eta \lfloor x/\eta \rfloor}{\eta}, \\[4pt] \eta \left( \left\lfloor x/\eta \right\rfloor + 1 \right) & \text{with probability } \dfrac{x - \eta \lfloor x/\eta \rfloor}{\eta}, \end{cases} \tag{7}$$

applied entrywise, which makes the quantization unbiased.
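A sketch of an unbiased stochastic low-precision quantizer of this kind. The scale factor and the unbounded grid are simplifying assumptions: a finite-bit quantizer would additionally clip inputs to the representable range.

```python
import numpy as np

def lp_quantize(x, eta=0.1, rng=np.random.default_rng()):
    """Stochastically round each coordinate of x to the grid {k * eta}.

    Rounds down with probability 1 - p and up with probability p, where
    p = (x - eta * floor(x / eta)) / eta. This makes E[Q(x)] = x (unbiased),
    with per-coordinate variance at most eta**2 / 4.
    """
    low = eta * np.floor(x / eta)        # nearest grid point below x
    p = (x - low) / eta                  # in [0, 1)
    up = rng.random(x.shape) < p         # round up with probability p
    return low + eta * up

# Empirical unbiasedness check: average many independent quantizations.
x = np.array([0.123, -0.456, 0.789])
q = np.mean([lp_quantize(x) for _ in range(20000)], axis=0)
```

The averaged output `q` is close to `x`, illustrating the unbiasedness that Assumption 2 below formalizes.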
Algorithm Update. Once the local variables are exchanged between neighboring nodes, each node uses its local stochastic gradient $\tilde{\nabla} f_m(x_m^t)$, its local decision variable $x_m^t$, and the information received from its neighbors to update its local decision variable. Before formally stating the update of QuanTimed-DSGD, let us define $w_{mk} \ge 0$ as the weight that node $m$ assigns to the information it receives from node $k$; if $m$ and $k$ are not neighbors, then $w_{mk} = 0$. These weights are used for averaging the local decision variable with the quantized variables received from neighbors, to enforce consensus among neighboring nodes. Specifically, at iteration $t$, node $m$ updates its decision variable according to

$$x_m^{t+1} = (1 - \varepsilon)\, x_m^t + \varepsilon \left( w_{mm}\, x_m^t + \sum_{k \in \mathcal{N}_m} w_{mk}\, z_k^t \right) - \alpha \varepsilon\, \tilde{\nabla} f_m(x_m^t), \tag{8}$$

where $\alpha$ and $\varepsilon$ are positive scalars that behave as stepsizes. Note that the update in (8) shows that the updated iterate is a linear combination of the weighted average of node $m$'s neighbors' decision variables, i.e., $\sum_{k \in \mathcal{N}_m} w_{mk} z_k^t$, and its own local variable $x_m^t$ and stochastic gradient $\tilde{\nabla} f_m(x_m^t)$. The parameter $\alpha$ behaves as the stepsize of the gradient descent step with respect to the local objective function, and the parameter $\varepsilon$ behaves as an averaging parameter between performing the distributed gradient update and keeping the previous decision variable $x_m^t$. By choosing a diminishing stepsize $\alpha$ we control the noise of the stochastic gradient evaluation, and by averaging with the parameter $\varepsilon$ we control the randomness induced by exchanging quantized variables. The description of QuanTimed-DSGD is summarized in Algorithm 1.
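Putting the pieces together, one synchronous iteration of this update can be sketched as follows. Names and array layout are illustrative; the quantized models and deadline-based gradients are assumed to be produced elsewhere and passed in.

```python
import numpy as np

def quantimed_step(X, Z, W, grads, alpha, eps):
    """One synchronous QuanTimed-DSGD iteration over all n nodes.

    X:     (n, p) current local models
    Z:     (n, p) quantized models exchanged this round, Z[k] = Q(X[k])
    W:     (n, n) symmetric doubly stochastic mixing matrix
    grads: (n, p) deadline-based stochastic gradients, one per node
    """
    X_next = np.empty_like(X)
    for m in range(X.shape[0]):
        # Node m mixes its own *unquantized* model with neighbors' quantized ones.
        mix = W[m, m] * X[m] + W[m] @ Z - W[m, m] * Z[m]
        X_next[m] = (1 - eps) * X[m] + eps * mix - alpha * eps * grads[m]
    return X_next
```

As a sanity check, with exact quantization (`Z = X`), zero gradients, uniform weights, and `eps = 1`, one step drives every node to the network average.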
4 Convergence Analysis
In this section, we provide the main theoretical results for the proposed QuanTimed-DSGD algorithm. We first consider strongly convex loss functions and characterize the rate at which QuanTimed-DSGD reaches the global optimal solution of problem (4). Then, we focus on the non-convex setting and show that the iterates generated by QuanTimed-DSGD find a stationary point of the cost in (4) while the local models stay close to each other, so that the consensus constraint is asymptotically satisfied. All proofs are provided in the appendix (Section 6). We make the following assumptions on the weight matrix, the quantizer, and the local objective functions.
Assumption 1.
The weight matrix $W \in \mathbb{R}^{n \times n}$ with entries $w_{mk}$ satisfies the following conditions: $W = W^\top$, $W \mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I - W) = \mathrm{span}(\mathbf{1})$.
Assumption 2.
The random quantizer $Q(\cdot)$ is unbiased and variance-bounded, i.e., $\mathbb{E}\left[ Q(x) \mid x \right] = x$ and $\mathbb{E}\left[ \| Q(x) - x \|^2 \mid x \right] \le \sigma^2$ for any $x$; and quantizations are carried out independently.

Assumption 1 implies that $W$ is symmetric and doubly stochastic. Moreover, all the eigenvalues of $W$ are in $(-1, 1]$, i.e., $-1 < \lambda_n(W) \le \dots \le \lambda_2(W) < \lambda_1(W) = 1$ (e.g., (Yuan et al., 2016)). We also denote by $1 - \beta$ the spectral gap associated with the stochastic matrix $W$, where $\beta := \max\{ |\lambda_2(W)|, |\lambda_n(W)| \}$.

Assumption 3.
The function $f(\cdot, \theta)$ is smooth with respect to $x$, i.e., for any $x, x' \in \mathbb{R}^p$ and any $\theta$, $\| \nabla f(x, \theta) - \nabla f(x', \theta) \| \le K \| x - x' \|$.
Assumption 4.
Stochastic gradients are unbiased and variance-bounded, i.e., $\mathbb{E}_\theta\left[ \nabla f(x, \theta) \right] = \nabla L(x)$ and $\mathbb{E}_\theta\left[ \| \nabla f(x, \theta) - \nabla L(x) \|^2 \right] \le \gamma^2$.

Note that the condition in Assumption 4 implies that the local gradients $\nabla f_m$ of each node are also unbiased estimators of the expected risk gradient $\nabla L$, and that their variance is bounded above by $n \gamma^2 / N$, since $f_m$ is defined as an average over $N/n$ realizations.

4.1 Convex Setting
This section presents the convergence guarantees of the proposed QuanTimed-DSGD method for smooth and strongly convex functions. The following assumption formally defines strong convexity.
Assumption 5.
The function $f(\cdot, \theta)$ is $\mu$-strongly convex, i.e., for any $x, x' \in \mathbb{R}^p$ and any $\theta$ we have $\left\langle \nabla f(x, \theta) - \nabla f(x', \theta),\, x - x' \right\rangle \ge \mu \| x - x' \|^2$.
Next, we characterize the convergence rate of QuanTimed-DSGD for strongly convex objectives.
Theorem 1 (Strongly Convex Losses).
Theorem 1 guarantees the exact convergence of each local model to the global optimum, even though the noises induced by random quantization and stochastic gradients do not vanish with the iterations. Moreover, the convergence rate can be made as close as desired to the optimal rate by picking the tuning parameter arbitrarily close to its limiting value. We would like to highlight, however, that the closer this parameter is chosen to its limit, the larger the lower bound on the number of required iterations becomes. More details are available in the proof of Theorem 1 provided in the appendix.
Note that the coefficient of the dominant term in (9) characterizes the dependency of our upper bound on the condition number of the objective function, the graph connectivity, and the variance of the error induced by quantizing the exchanged signals. Moreover, the coefficient of the lower-order term shows the effect of the stochastic gradient variance as well as the parameters of our deadline-based scheme.
Remark 1.
The quantity $\max\{ n/N,\; 1/(T_d \mathbb{E}[V]) \}$ represents the inverse of the effective batch size used in our QuanTimed-DSGD method, where $T_d$ is the deadline and $\mathbb{E}[V]$ is the expected processing speed. To be more specific, if the deadline is large enough that in expectation all local gradients are computed before it expires, i.e., $T_d \mathbb{E}[V] \ge N/n$, then our effective batch size is $N/n$ and the term $n/N$ is the dominant term in the maximization. Conversely, if $T_d$ is small and the expected number of computed gradients is smaller than the total number of local samples $N/n$, the effective batch size is $T_d \mathbb{E}[V]$, and $1/(T_d \mathbb{E}[V])$ is the dominant term in the maximization. This observation shows that the corresponding term in (9) is the variance of the mini-batch gradient in QuanTimed-DSGD.
Remark 2.
Using the strong convexity of the objective function, one can easily verify that the last iterates of QuanTimed-DSGD achieve a suboptimality of $\mathcal{O}(1/\sqrt{N})$ with respect to the empirical risk, i.e., $\hat{L}(x_m^T) - \hat{L}(\hat{x}^*) = \mathcal{O}(1/\sqrt{N})$, where $\hat{x}^*$ is the minimizer of the empirical risk $\hat{L}$. As the gap between the expected risk and the empirical risk is of $\mathcal{O}(1/\sqrt{N})$, the overall error of QuanTimed-DSGD with respect to the expected risk is also $\mathcal{O}(1/\sqrt{N})$.
4.2 Non-Convex Setting
In this section, we characterize the convergence rate of QuanTimed-DSGD for non-convex and smooth objectives. As discussed in Section 2, we are interested in finding a set of local models that approximately satisfy the first-order optimality condition, while the models are close to each other and satisfy the consensus condition up to a small error. To be more precise, we seek a set of local models $\{x_1, \dots, x_n\}$ whose average $\bar{x} := \frac{1}{n} \sum_{m=1}^{n} x_m$ approximately satisfies the first-order optimality condition, i.e., $\| \nabla f(\bar{x}) \|$ is small, while the iterates are close to their average, i.e., the consensus error $\frac{1}{n} \sum_{m=1}^{n} \| x_m - \bar{x} \|^2$ is small. If a set of local iterates satisfies these conditions, we call it an approximate solution. The next theorem characterizes both the first-order optimality and consensus convergence rates, as well as the overall complexity of reaching an approximate solution.
Theorem 2 (Non-Convex Losses).
The convergence rate in (10) indicates that the proposed QuanTimed-DSGD method finds first-order stationary points with a vanishing approximation error, even though the quantization and stochastic gradient noises are non-vanishing; the approximation error decays sublinearly with the number of iterations. Theorem 2 also implies from (11) that the local models reach consensus at the same rate. Moreover, it characterizes the maximum number of iterations QuanTimed-DSGD requires to find an approximate solution.
5 Experimental Results
In this section, we numerically evaluate the performance of the proposed QuanTimed-DSGD method on a class of non-convex decentralized optimization problems. In particular, we compare the total runtime of the QuanTimed-DSGD scheme with those of three benchmarks, which are briefly described below.

Decentralized SGD (DSGD) (Yuan et al., 2016): Each worker updates its decision variable as $x_m^{t+1} = \sum_{k} w_{mk}\, x_k^t - \alpha\, \tilde{\nabla} f_m(x_m^t)$. We note that the exchanged messages are not quantized and the local gradients are computed for a fixed batch size.

Asynchronous DSGD (Lian et al., 2017b): Each worker updates its model without waiting to receive the updates of its neighbors, i.e., $x_m^{t+1} = \sum_{k} w_{mk}\, \hat{x}_k - \alpha\, \tilde{\nabla} f_m(x_m^t)$, where $\hat{x}_k$ denotes the most recent model available from node $k$. In our implementation, models are exchanged without quantization.
Data and Experimental Setup. We carry out two sets of experiments, over the CIFAR-10 and MNIST datasets, where each worker is assigned a sample set of size $N/n$ for both datasets. For CIFAR-10, we implement binary classification using a fully connected neural network with one hidden layer. The three color (RGB) matrices of each image are combined into a single vector that forms the input of the neural network (see (Dutta et al., 2018)). For MNIST, we use a fully connected neural network with one hidden layer to classify the input image into one of the 10 classes. Stepsizes are tuned separately for QuanTimed-DSGD and Q-DSGD and for DSGD and Asynchronous DSGD on each dataset. We implement the unbiased low-precision quantizer in (7) with various quantization levels. In order to ensure that the expected batch size used at each node equals a target positive number $b$, we choose the deadline $T_d = b / \mathbb{E}[V]$, where $V$ is the random computation speed. The communication graph is a random Erdős–Rényi graph with $n$ nodes and a fixed edge connectivity. The weight matrix is designed as $W = I - \delta \mathcal{L}$, where $\mathcal{L}$ is the Laplacian matrix of the graph and $\delta$ is a positive constant chosen such that $W$ satisfies Assumption 1.
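The Laplacian-based mixing matrix described above can be constructed as follows. The graph parameters and the choice $\delta = 1/\lambda_{\max}(\mathcal{L})$ are illustrative; any $\delta$ below $2/\lambda_{\max}(\mathcal{L})$ keeps the eigenvalues of $W$ in the range required by Assumption 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_edge = 20, 0.35                       # nodes, Erdős–Rényi edge probability

# Sample a random undirected graph (in practice, resample until connected).
A = rng.random((n, n)) < p_edge
A = np.triu(A, 1)
A = (A | A.T).astype(float)                # symmetric 0/1 adjacency, no self-loops

L = np.diag(A.sum(axis=1)) - A             # graph Laplacian
delta = 1.0 / np.linalg.eigvalsh(L).max()  # keeps eigenvalues of W within [0, 1]
W = np.eye(n) - delta * L                  # symmetric, doubly stochastic mixing matrix
```

By construction, $W$ is symmetric with rows summing to one, and its eigenvalues are $1 - \delta \lambda_i(\mathcal{L}) \in [0, 1]$, so the spectral conditions discussed after Assumption 2 hold.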
Results. Figure 1 compares the total training runtime of the QuanTimed-DSGD and DSGD schemes. On CIFAR-10 (left), for the same (effective) batch sizes, the proposed QuanTimed-DSGD achieves significant speedups over DSGD.
In Figure 2, we further compare these two schemes with the Q-DSGD benchmark. Although Q-DSGD improves upon vanilla DSGD by employing quantization, the proposed QuanTimed-DSGD demonstrates a further speedup in training time over Q-DSGD (left).
To evaluate the straggler mitigation of QuanTimed-DSGD, we compare its runtime with the Asynchronous DSGD benchmark in Figure 3 (left). While Asynchronous DSGD outperforms DSGD in training runtime by avoiding slow nodes, the proposed QuanTimed-DSGD scheme improves further upon Asynchronous DSGD. These plots illustrate that QuanTimed-DSGD significantly reduces the training time by simultaneously reducing the communication load through quantization and mitigating stragglers through deadline-based computation. The deadline $T_d$ can indeed be optimized for the minimum training runtime, as illustrated in Figure 3 (right).
6 Appendix
Here we provide all the proofs and details which were skipped in the main document.
6.1 Bounding the Stochastic Gradient Noises
In our analysis of both the convex and non-convex scenarios, we need bounds on the noise of the various stochastic gradient estimates involved. Hence, we start this section with the following lemma, which bounds the variance of the stochastic gradient estimates under Assumption 4.
Lemma 1.
Assumption 4 implies the following bounds, for any $x \in \mathbb{R}^p$ and any node $m$:
Proof.
The first five expressions (i)–(v) in the lemma are immediate consequences of Assumption 4, together with the fact that the noise of a stochastic gradient scales down with the sample size. To prove (vi), let $\mathcal{S}_m^t$ denote the sample set for which node $m$ has computed gradients. We have
and therefore
∎
6.2 Proof of Theorem 1
To prove Theorem 1, we first establish Lemmas 2 and 3, and then conclude the theorem from these two results.
The main problem is to minimize the global objective defined in (4). We introduce the following optimization problem, which is equivalent to the main problem:
$$\min_{\mathbf{x} \in \mathbb{R}^{np}} \; F(\mathbf{x}) := \sum_{m=1}^{n} f_m(x_m) \quad \text{s.t.} \quad x_1 = x_2 = \dots = x_n, \tag{12}$$

where the vector $\mathbf{x} = [x_1; \dots; x_n] \in \mathbb{R}^{np}$ denotes the concatenation of all the local models. Clearly, the concatenation of $n$ copies of the minimizer of (4) is the solution to (12). Using Assumption 1, the constraint in the alternative problem (12) can be stated as $(I - \mathcal{W})\, \mathbf{x} = 0$, where $\mathcal{W} := W \otimes I_p$. Inspired by this fact, we define the following penalty function for every $\alpha > 0$:

$$h_\alpha(\mathbf{x}) := \frac{1}{2\alpha}\, \mathbf{x}^\top \left( I - \mathcal{W} \right) \mathbf{x} + F(\mathbf{x}), \tag{13}$$

and denote by $\mathbf{x}^*_\alpha$ the (unique) minimizer of $h_\alpha$. That is,

$$\mathbf{x}^*_\alpha := \operatorname*{arg\,min}_{\mathbf{x} \in \mathbb{R}^{np}} \; h_\alpha(\mathbf{x}). \tag{14}$$
The next lemma characterizes the deviation of the models generated by the QuanTimed-DSGD method at iteration $t$, that is $\mathbf{x}^t$, from the minimizer of the penalty function, i.e., $\mathbf{x}^*_\alpha$.
Lemma 2.
Proof of Lemma 2.
First note that the gradient of the penalty function defined in (13) is

$$\nabla h_\alpha(\mathbf{x}^t) = \frac{1}{\alpha} \left( I - \mathcal{W} \right) \mathbf{x}^t + \nabla F(\mathbf{x}^t), \tag{17}$$

where $\mathbf{x}^t = [x_1^t; \dots; x_n^t]$ denotes the concatenation of the models at iteration $t$. Now consider the following stochastic gradient of $h_\alpha$:

$$\tilde{\nabla} h_\alpha(\mathbf{x}^t) = \frac{1}{\alpha} \left( (I - \mathcal{W}_d)\, \mathbf{x}^t - \mathcal{W}_o\, \mathbf{z}^t \right) + \tilde{\nabla} F(\mathbf{x}^t), \tag{18}$$

where $\mathcal{W}_d$ and $\mathcal{W}_o$ denote the diagonal and off-diagonal parts of $\mathcal{W}$, respectively, $\mathbf{z}^t = [z_1^t; \dots; z_n^t]$ is the concatenation of the quantized models, and $\tilde{\nabla} F(\mathbf{x}^t) = [\tilde{\nabla} f_1(x_1^t); \dots; \tilde{\nabla} f_n(x_n^t)]$.

We let $\mathcal{F}^t$ denote a sigma algebra that measures the history of the system up until time $t$. According to Assumptions 2 and 4, the stochastic gradient defined above is unbiased, that is, $\mathbb{E}\left[ \tilde{\nabla} h_\alpha(\mathbf{x}^t) \mid \mathcal{F}^t \right] = \nabla h_\alpha(\mathbf{x}^t)$.
We can also write the update rule of the QuanTimed-DSGD method as

$$\mathbf{x}^{t+1} = \mathbf{x}^t - \alpha \varepsilon\, \tilde{\nabla} h_\alpha(\mathbf{x}^t), \tag{19}$$

which represents an iteration of the stochastic gradient descent (SGD) algorithm with stepsize $\alpha \varepsilon$ applied to the minimization of the penalty function $h_\alpha$ over $\mathbf{x}$. We can bound the deviation of the iterates generated by QuanTimed-DSGD from the minimizer $\mathbf{x}^*_\alpha$ as follows:
$$\mathbb{E}\left[ \left\| \mathbf{x}^{t+1} - \mathbf{x}^*_\alpha \right\|^2 \,\middle|\, \mathcal{F}^t \right] \le \left( 1 - 2 \alpha \varepsilon \mu \right) \left\| \mathbf{x}^t - \mathbf{x}^*_\alpha \right\|^2 + \alpha^2 \varepsilon^2\, \mathbb{E}\left[ \left\| \tilde{\nabla} h_\alpha(\mathbf{x}^t) \right\|^2 \,\middle|\, \mathcal{F}^t \right], \tag{20}$$
where we used the fact that the penalty function $h_\alpha$ is strongly convex with parameter $\mu$. Moreover, we can bound the second term on the RHS of (20) as follows:
(21) 
To derive (21), we used the facts that $F$ is smooth; that the quantizer is unbiased with variance bounded by $\sigma^2$ (Assumption 2); and that the stochastic gradients of the loss function are unbiased and variance-bounded (Assumption 4 and Lemma 1). Plugging (21) into (20) yields
(22) 
To ease the notation, let $e^t := \mathbb{E}\left[ \| \mathbf{x}^t - \mathbf{x}^*_\alpha \|^2 \right]$ denote the expected deviation of the models at iteration $t$ from the minimizer, where the expectation is taken over all the randomness up to iteration $t$. Therefore,
(23) 
For any $t$ and the proposed choice of parameters, we have
and therefore