Robust and Communication-Efficient Collaborative Learning

07/24/2019 ∙ by Amirhossein Reisizadeh, et al. ∙ The University of Texas at Austin ∙ University of Pennsylvania ∙ The Regents of the University of California

We consider a decentralized learning problem in which a set of computing nodes aims to solve a non-convex optimization problem collaboratively. It is well known that decentralized optimization schemes face two major system bottlenecks: stragglers' delay and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized, gradient-based optimization algorithm named QuanTimed-DSGD. Our algorithm rests on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea makes the method robust to straggling nodes, and the second reduces the communication load. The key technical contribution of our work is to prove that, with non-vanishing noises for quantization and stochastic gradients, the proposed method exactly converges to the global optimum for convex loss functions and finds a first-order stationary point in non-convex scenarios. Our numerical evaluations of QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate speedups of up to 3x in run-time compared to state-of-the-art decentralized optimization methods.




1 Introduction

Collaborative learning refers to the task of learning a common objective among multiple computing agents without any central node, using on-device computation and local communication among neighboring agents. Such tasks have recently gained considerable attention in machine learning and optimization because they underpin several desirable properties of modern computing systems, such as scalability to larger datasets and systems, data locality, ownership, and privacy. As such, collaborative learning naturally arises in various applications such as distributed deep learning (LeCun et al., 2015; Dean et al., 2012), multi-agent robotics and path planning (Choi and How, 2010; Jha et al., 2016), and distributed resource allocation in wireless networks (Ribeiro, 2010), to name a few.

While collaborative learning has recently drawn significant attention due to its decentralized implementation, it faces major challenges at the system level as well as in algorithm design. The decentralized implementation of collaborative learning faces two major systems challenges: (i) significant slow-down due to straggling nodes, where a subset of nodes can be largely delayed in their local computation, which slows down the wall-clock convergence of the decentralized algorithm; (ii) large communication overhead due to the message-passing algorithm as the dimension of the parameter vector increases, which can further slow down the algorithm's convergence time. Moreover, in the presence of these system bottlenecks, the efficacy of classical consensus optimization methods is not clear and needs to be revisited.

In this work we consider the general data-parallel setting where the data is distributed across different computing nodes, and develop decentralized optimization methods that do not rely on a central coordinator but instead only require local computation and communication among neighboring nodes. As the main contribution of this paper, we propose a straggler-robust and communication-efficient algorithm for collaborative learning called QuanTimed-DSGD, which is a quantized and deadline-based decentralized stochastic gradient descent method. We show that the proposed scheme provably improves upon the convergence time of vanilla synchronous decentralized optimization methods. The key theoretical contribution of the paper is to develop the first quantized decentralized non-convex optimization algorithm with provable and exact convergence to a first-order optimal solution.

There are two key ideas in our proposed algorithm. To provide robustness against stragglers, we impose a deadline time $T_d$ for the computation of each node. In a synchronous implementation of the proposed algorithm, at every iteration all the nodes simultaneously start computing stochastic gradients by randomly picking data points from their local batches and evaluating the gradient function on the picked data points. By time $T_d$, each node has computed a random number of stochastic gradients, which it aggregates to generate a stochastic gradient for its local objective. By doing so, each iteration takes a constant computation time, as opposed to deadline-free methods in which each node has to wait for all its neighbors to complete their gradient computation tasks. To tackle the communication bottleneck in collaborative learning, we only allow the decentralized nodes to share with neighbors a quantized version of their local models. Quantizing the exchanged models reduces the communication load, which is critical for large and dense networks.

We analyze the convergence of the proposed QuanTimed-DSGD for strongly convex and non-convex loss functions under standard assumptions on the network, quantizer and stochastic gradients. In the strongly convex case, we show that QuanTimed-DSGD exactly finds the global optimum for every node with a rate arbitrarily close to $O(1/\sqrt{T})$, where $T$ is the number of iterations. In the non-convex setting, QuanTimed-DSGD provably finds first-order optimal solutions as fast as $O(T^{-1/3})$. Moreover, the consensus error decays at the same rate, which guarantees exact convergence by choosing $T$ large enough. Furthermore, we numerically evaluate QuanTimed-DSGD on the benchmark datasets CIFAR-10 and MNIST, where it demonstrates speedups of up to 3x in run-time compared to state-of-the-art baselines.

Related Work: Decentralized consensus optimization has been studied extensively. The most popular first-order choices for the convex setting are distributed gradient descent-type methods (Nedic and Ozdaglar, 2009; Jakovetic et al., 2014; Yuan et al., 2016; Qu and Li, 2017), augmented Lagrangian algorithms (Shi et al., 2015a, b), distributed variants of the alternating direction method of multipliers (ADMM) (Schizas et al., 2008; Boyd et al., 2011; Shi et al., 2014; Chang et al., 2015), dual averaging (Duchi et al., 2012; Tsianos et al., 2012), and several dual-based strategies (Scaman et al., 2017; Scaman et al., 2018; Uribe et al., 2018). Recently, several works have studied nonconvex decentralized consensus optimization and established convergence to a stationary point (Zeng and Yin, 2018; Hong et al., 2017, 2018; Sun and Hong, 2018; Scutari et al., 2017; Scutari and Sun, 2018; Jiang et al., 2017; Lian et al., 2017a).

The idea of improving the communication efficiency of distributed optimization procedures via message-compression schemes goes back a few decades (Tsitsiklis and Luo, 1987); however, it has recently gained considerable attention due to the growing importance of distributed applications. In particular, efficient gradient-compression methods are provided in (Alistarh et al., 2017; Seide et al., 2014; Bernstein et al., 2018) and deployed in the distributed master-worker setting. In the decentralized setting, quantization methods were proposed in different convex optimization contexts with non-vanishing errors (Yuksel and Basar, 2003; Rabbat and Nowak, 2005; Kashyap et al., 2006; El Chamie et al., 2016; Aysal et al., 2007; Nedic et al., 2008). The first exact decentralized optimization method with quantized messages was given in (Reisizadeh et al., 2018; Zhang et al., 2018), and more recently, new techniques have been developed in this context for convex problems (Doan et al., 2018; Koloskova et al., 2019; Berahas et al., 2019; Lee et al., 2018a, b).

The straggler problem has been widely observed in distributed computing clusters (Dean and Barroso, 2013; Ananthanarayanan et al., 2010). A common approach to mitigate stragglers is to replicate the computing task of the slow nodes on other computing nodes (Ananthanarayanan et al., 2013; Wang et al., 2014), but this is clearly not feasible in collaborative learning. Another line of work proposed using coding-theoretic ideas for speeding up distributed machine learning (Lee et al., 2018c; Tandon et al., 2016; Yu et al., 2017; Reisizadeh et al., 2019a, b), but these work mostly for the master-worker setup and for particular computation types such as linear computations or full gradient aggregation. The closest work to ours is (Ferdinand et al., 2019), which considers decentralized optimization for convex functions with a deadline for local computations, but does not address communication bottlenecks, quantization, or nonconvex functions. Another line of work proposes asynchronous decentralized SGD, where the workers update their models based on the last iterates received from their neighbors (Recht et al., 2011; Lian et al., 2017b; Lan and Zhou, 2018; Peng et al., 2016; Wu et al., 2017). While asynchronous methods are inherently robust to stragglers, they can suffer from slow convergence due to using stale models.

2 Problem Setup

In this paper, we focus on a stochastic learning model in which we aim to solve the problem

$$\min_{x} L(x) := \min_{x} \mathbb{E}_{\theta \sim P}\left[\ell(x, \theta)\right], \qquad (1)$$

where $\ell$ is a stochastic loss function, $x \in \mathbb{R}^p$ is our optimization variable, $\theta$ is a random variable with probability distribution $P$, and $L$ is the expected loss function, also called the population risk. We assume that the underlying distribution $P$ of the random variable $\theta$ is unknown and that we have access only to $N$ realizations of it. Our goal is to minimize the loss associated with these $N$ realizations of $\theta$, which is known as empirical risk minimization. To be more precise, we aim to solve the empirical risk minimization (ERM) problem

$$\min_{x} L_N(x) := \min_{x} \frac{1}{N} \sum_{k=1}^{N} \ell(x, \theta_k), \qquad (2)$$

where $L_N$ is the empirical loss associated with the sample of random variables $\mathcal{D} = \{\theta_1, \dots, \theta_N\}$.

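Collaborative Learning Perspective. Our goal is to solve the ERM problem in (2) in a decentralized manner over $n$ nodes. This setting arises in a plethora of applications where either the total number of samples $N$ is massive and the data cannot be stored or processed on a single node, or the samples are available in parts at different nodes and, due to privacy or communication constraints, exchanging raw data points among the nodes is not possible. Hence, we assume that each node $i$ has access to $m$ samples and its local objective is

$$f_i(x) = \frac{1}{m} \sum_{\theta \in \mathcal{D}^i} \ell(x, \theta), \qquad (3)$$

where $\ell$ is the loss function and $\mathcal{D}^i$ is the set of samples available at node $i$. Nodes aim to collaboratively minimize the average of all local objective functions, denoted by $f$, which is given by

$$\min_{x} f(x) = \min_{x} \frac{1}{n} \sum_{i=1}^{n} f_i(x) = \min_{x} \frac{1}{mn} \sum_{i=1}^{n} \sum_{\theta \in \mathcal{D}^i} \ell(x, \theta). \qquad (4)$$

Indeed, the objective function $f$ and the empirical risk $L_N$ in (2) are equivalent if the full sample set is $\mathcal{D} = \mathcal{D}^1 \cup \dots \cup \mathcal{D}^n$ (so that $N = mn$). Therefore, by minimizing the global objective function $f$ we also obtain the solution of the ERM problem in (2).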
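The equivalence between the average of the local objectives and the empirical risk can be checked numerically. The following is a minimal sketch, assuming a hypothetical scalar squared-error loss and equal-sized, disjoint local sample sets; the function names and the loss are illustrative and not taken from the paper.

```python
import random

def loss(x, theta):
    """Hypothetical per-sample loss (illustration only): scalar squared error."""
    return 0.5 * (x - theta) ** 2

def local_objective(x, shard):
    """Local objective of one node: average loss over its own samples."""
    return sum(loss(x, th) for th in shard) / len(shard)

def global_objective(x, shards):
    """Global objective: average of the local objectives over all nodes."""
    return sum(local_objective(x, s) for s in shards) / len(shards)

def empirical_risk(x, samples):
    """Empirical risk over the pooled sample set."""
    return sum(loss(x, th) for th in samples) / len(samples)

rng = random.Random(0)
n, m = 4, 5                                   # nodes and samples per node
samples = [rng.gauss(0.0, 1.0) for _ in range(n * m)]
# Disjoint, equal-sized shards whose union is the full sample set.
shards = [samples[i * m:(i + 1) * m] for i in range(n)]

x = 0.3
gap = abs(global_objective(x, shards) - empirical_risk(x, samples))
print(gap)  # zero up to floating-point rounding
```

The equal shard size matters here: with unequal local sample counts, the plain average of local objectives would weight samples differently from the pooled empirical risk.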

We can rewrite the optimization problem in (4) as a classical decentralized optimization problem as follows. Let $x_i$ be the decision variable of node $i$. Then, (4) is equivalent to

$$\min_{x_1, \dots, x_n} \frac{1}{n} \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_1 = \dots = x_n, \qquad (5)$$

as the objective function values of (4) and (5) are the same when the iterates of all nodes are the same and we have consensus. The challenge in distributed learning is to minimize the global loss only by exchanging information with neighboring nodes while ensuring that the nodes' variables stay close to each other. We consider a network of computing nodes characterized by an undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V} = \{1, \dots, n\}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, and each node $i$ is allowed to exchange information only with its neighboring nodes in the graph $\mathcal{G}$, which we denote by $\mathcal{N}_i$.

In a stochastic optimization setting, where the true objective is defined as an expectation, there is a limit to the accuracy with which we can minimize the population risk $L$ given only $N = mn$ samples, even if we have access to the optimal solution of the empirical risk $L_N$. In particular, it has been shown that when the loss function $\ell$ is convex, the difference between the population risk $L$ and the empirical risk $L_N$ corresponding to $N = mn$ samples is, with high probability, uniformly bounded by $\sup_x |L(x) - L_N(x)| \le O(1/\sqrt{N}) = O(1/\sqrt{nm})$; see (Bottou and Bousquet, 2008). Thus, without collaboration, each node can minimize its local cost $f_i$ to reach an estimate for the optimal solution with an error of $O(1/\sqrt{m})$. By minimizing the aggregate loss collaboratively, nodes reach an approximate solution of the expected risk problem with a smaller error of $O(1/\sqrt{nm})$. Based on this formulation, our goal in the convex setting is to find a point $x_i$ for each node $i$ that attains the statistical accuracy, i.e., $\mathbb{E}[L_N(x_i) - L_N(\hat{x}^*)] \le O(1/\sqrt{mn})$, where $\hat{x}^*$ minimizes $L_N$, which further implies $\mathbb{E}[L(x_i) - L(x^*)] \le O(1/\sqrt{mn})$, where $x^*$ minimizes $L$.

For a nonconvex loss function $\ell$, however, $L_N$ is also nonconvex and solving the problem in (4) is hard in general. Therefore, we only focus on finding a point that satisfies the first-order optimality condition for (4) up to some accuracy $\rho$, i.e., finding a point $\tilde{x}$ such that $\|\nabla L_N(\tilde{x})\| = \|\nabla f(\tilde{x})\| \le \rho$. Under the assumption that the gradient of the loss is sub-Gaussian, it has been shown that with high probability the gap between the gradients of the expected risk and the empirical risk is bounded by $\sup_x \|\nabla L(x) - \nabla L_N(x)\| \le O(1/\sqrt{nm})$; see (Mei et al., 2018). As in the convex setting, by solving the aggregate loss instead of the local loss, each node finds a better approximation of a first-order stationary point of the expected risk $L$. Therefore, our goal in the nonconvex setting is to find a point $\tilde{x}$ that satisfies $\|\nabla f(\tilde{x})\| \le O(1/\sqrt{mn})$, which also implies $\|\nabla L(\tilde{x})\| \le O(1/\sqrt{mn})$.

3 Proposed QuanTimed-DSGD Method

In this section, we present our proposed QuanTimed-DSGD algorithm, which takes into account robustness to stragglers and communication efficiency in decentralized optimization. To ensure robustness to stragglers' delay, we introduce a deadline-based protocol for updating the iterates, in which nodes compute their local gradient estimates only for a specific amount of time and then use these estimates to update their iterates. This is in contrast to the mini-batch setting, in which nodes have to wait for the slowest machine to finish its local gradient computation. To reduce the communication load, we assume that nodes only exchange a quantized version of their local iterates. However, using quantized messages induces extra noise in the decision-making process, which makes the analysis of our algorithm more challenging. A detailed description of the proposed algorithm is as follows.

Deadline-Based Gradient Computation. Consider the current model $x_{i,t}$ available at node $i$ at iteration $t$. Recall the definition of the local objective function $f_i$ at node $i$ defined in (3). The cost of computing the local gradient $\nabla f_i(x_{i,t})$ scales linearly with the number of samples $m$ assigned to the $i$-th node. A common solution to reduce the computation cost at each node when $m$ is large is to use a mini-batch approximation of the gradient, i.e., each node $i$ picks a subset of its local samples to compute a stochastic gradient. A major challenge for this procedure is the presence of stragglers in the network: given a mini-batch size $b$, all nodes have to compute the average of exactly $b$ stochastic gradients. Thus, all the nodes have to wait for the slowest machine to finish its computation and exchange its new model with its neighbors.

To resolve this issue, we propose a deadline-based approach in which we set a fixed deadline $T_d$ for the time that each node can spend computing its local stochastic gradient estimate. Once the deadline is reached, nodes form their gradient estimate using whatever computation (mini-batch size) they could perform. Thus, with this deadline-based procedure, nodes do not need to wait for the slowest machine to update their iterates. However, their mini-batch sizes, and consequently the noise of their gradient approximations, will be different. To be more specific, let $\mathcal{S}_{i,t}$ denote the set of random samples chosen at time $t$ by node $i$. Define the stochastic gradient of node $i$ at time $t$ as

$$\widetilde{\nabla} f_i(x_{i,t}) = \frac{1}{|\mathcal{S}_{i,t}|} \sum_{\theta \in \mathcal{S}_{i,t}} \nabla \ell(x_{i,t}, \theta), \qquad (6)$$

for $1 \le |\mathcal{S}_{i,t}| \le m$. If no gradients have been computed by the deadline $T_d$, i.e., $|\mathcal{S}_{i,t}| = 0$, we set $\widetilde{\nabla} f_i(x_{i,t}) = 0$.
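The deadline-based gradient estimate described above can be sketched as follows. This is a minimal simulation, not the authors' implementation: the scalar squared-error loss, the function names, and the assumption that a node with a given speed finishes `int(speed * deadline)` gradients (capped at its local sample count, sampled without replacement) are all illustrative choices.

```python
import random

def grad_loss(x, theta):
    """Gradient w.r.t. x of the hypothetical loss 0.5 * (x - theta)^2."""
    return x - theta

def deadline_gradient(x, local_samples, speed, deadline, rng):
    """Average the stochastic gradients a node finishes before the deadline.

    speed    -- gradients per second for this node in this round (simulated)
    deadline -- time budget in seconds
    Returns (gradient estimate, number of gradients actually computed).
    """
    num_done = min(int(speed * deadline), len(local_samples))
    if num_done == 0:
        # No gradient finished in time: fall back to a zero gradient.
        return 0.0, 0
    batch = rng.sample(local_samples, num_done)
    estimate = sum(grad_loss(x, th) for th in batch) / num_done
    return estimate, num_done

rng = random.Random(1)
data = [1.0, 2.0, 3.0, 4.0]
slow = deadline_gradient(0.0, data, speed=0.5, deadline=1.0, rng=rng)
fast = deadline_gradient(0.0, data, speed=10.0, deadline=1.0, rng=rng)
print(slow, fast)
```

A slow node contributes a noisier (or zero) estimate instead of delaying the round, which is the trade-off the deadline introduces.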

Computation Model. To illustrate the advantage of our deadline-based scheme over the fixed mini-batch scheme, we formally state the model that we use for the processing time of nodes in the network. We remark that our algorithms are oblivious to the choice of the computation model, which is used merely for analysis. We define the processing speed of each machine as the number of stochastic gradients that it computes per second. We assume that the processing speed of each machine $i$ at iteration $t$ is a random variable $V_{i,t}$, and that the $V_{i,t}$'s are i.i.d. with probability distribution $F_V(v)$. We further assume that the domain of the random variable $V$ is bounded and its realizations are in $[\underline{v}, \bar{v}]$. If $V_{i,t}$ is the number of stochastic gradients that can be computed per second, then under a deadline $T_d$ the size of the mini-batch $\mathcal{S}_{i,t}$ is a random variable given by $|\mathcal{S}_{i,t}| = V_{i,t} T_d$.

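In the fixed mini-batch scheme with batch size $b$ and for any iteration $t$, all the nodes have to wait for the machine with the slowest processing time before updating their iterates, and thus the overall computation time will be $b / V_{\min}$, where $V_{\min}$ is defined as $V_{\min} = \min\{V_{1,t}, \dots, V_{n,t}\}$. In our deadline-based scheme there is a fixed deadline $T_d$ which limits the computation time of the nodes, and it is chosen such that the expected mini-batch size is $b$, i.e., $T_d = b / \mathbb{E}[V]$, while the mini-batch scheme requires an expected time of $\mathbb{E}[b / V_{\min}] = b\, \mathbb{E}[1 / V_{\min}]$. The gap between $\mathbb{E}[1 / V_{\min}]$ and $1 / \mathbb{E}[V]$ depends on the distribution of $V$, and can be unbounded in general, growing with $n$.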
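To make the comparison concrete, the sketch below simulates both schemes under an assumed uniform speed distribution. The distribution, its bounds, and the choice of deadline as target batch size divided by mean speed are assumptions made for illustration; the paper's analysis does not depend on them.

```python
import random

def compare_round_times(n, b, rounds, rng, v_lo=10.0, v_hi=90.0):
    """Return (deadline, average fixed-batch round time).

    Speeds (gradients per second) are drawn i.i.d. uniform on [v_lo, v_hi].
    The deadline is set so the expected batch size equals b; the fixed-batch
    scheme waits for the slowest of the n nodes to finish b gradients.
    """
    mean_speed = 0.5 * (v_lo + v_hi)
    deadline = b / mean_speed
    total_wait = 0.0
    for _ in range(rounds):
        speeds = [rng.uniform(v_lo, v_hi) for _ in range(n)]
        total_wait += b / min(speeds)
    return deadline, total_wait / rounds

deadline, fixed_batch_time = compare_round_times(
    n=20, b=100, rounds=1000, rng=random.Random(0)
)
print(deadline, fixed_batch_time)
```

Increasing `n` in this simulation pushes the slowest node's speed toward `v_lo`, so the fixed-batch round time grows while the deadline stays constant.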

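Quantized Message-Passing. To reduce the communication overhead of exchanging variables between nodes, we use quantization schemes that significantly reduce the required number of bits. More precisely, instead of sending $x_{i,t}$, the $i$-th node sends $z_{i,t} = Q(x_{i,t})$, which is a quantized version of its local variable $x_{i,t}$, to its neighbors $\mathcal{N}_i$. As an example, consider the low-precision quantizer specified by a scale factor $\eta$ and $s$ bits with the representable range $\{-\eta \cdot 2^{s-1}, \dots, -\eta, 0, \eta, \dots, \eta \cdot (2^{s-1} - 1)\}$. For any $k\eta \le x < (k+1)\eta$, the quantizer outputs

$$Q_{(\eta, s)}(x) = \begin{cases} k\eta & \text{w.p. } 1 - (x - k\eta)/\eta, \\ (k+1)\eta & \text{w.p. } (x - k\eta)/\eta. \end{cases} \qquad (7)$$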

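Require: Weights $\{w_{ij}\}_{j=1}^{n}$, total iterations $T$, deadline $T_d$
1: Set $x_{i,0} = 0$ and compute $z_{i,0} = Q(x_{i,0})$
2: for $t = 0, \dots, T-1$ do
3:     Send $z_{i,t} = Q(x_{i,t})$ to $j \in \mathcal{N}_i$ and receive $z_{j,t}$
4:     Pick and evaluate stochastic gradients till reaching the deadline $T_d$ and generate $\widetilde{\nabla} f_i(x_{i,t})$ according to (6)
5:     Update $x_{i,t+1}$ according to (8)
6: end for
Algorithm 1 QuanTimed-DSGD at node $i$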
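An unbiased low-precision quantizer of the kind referred to above can be sketched with stochastic rounding. The following is a minimal scalar version under stated assumptions: the grid step `eta`, the bit width `bits`, and the clipping of out-of-range inputs are illustrative choices (clipping in particular is an assumption, and it breaks unbiasedness for inputs outside the representable range).

```python
import math
import random

def low_precision_quantize(x, eta, bits, rng):
    """Stochastically round scalar x to a multiple of eta.

    Inside the representable range the output is unbiased: x is rounded up
    with probability equal to its fractional position between grid points.
    """
    lo = -eta * 2 ** (bits - 1)
    hi = eta * (2 ** (bits - 1) - 1)
    x = min(max(x, lo), hi)      # assumption: clip to the representable range
    k = math.floor(x / eta)      # index of the grid point just below x
    p_up = x / eta - k           # probability of rounding up
    return eta * (k + 1) if rng.random() < p_up else eta * k

rng = random.Random(0)
draws = [low_precision_quantize(0.3, 0.25, 4, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
print(sorted(set(draws)), mean)
```

Applying the function coordinate-wise gives a vector quantizer; the per-coordinate error is at most `eta`, which is one way to obtain a bounded quantization variance.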

Algorithm Update. Once the local variables are exchanged between neighboring nodes, each node $i$ uses its local stochastic gradient $\widetilde{\nabla} f_i(x_{i,t})$, its local decision variable $x_{i,t}$, and the information received from its neighbors $\{z_{j,t} = Q(x_{j,t}) : j \in \mathcal{N}_i\}$ to update its local decision variable. Before formally stating the update of QuanTimed-DSGD, let us define $w_{ij}$ as the weight that node $i$ assigns to the information that it receives from node $j$. If $i$ and $j$ are not neighbors, $w_{ij} = 0$. These weights are used for averaging over the local decision variable and the quantized variables received from neighbors to enforce consensus among neighboring nodes. Specifically, at time $t$, node $i$ updates its decision variable according to the update

$$x_{i,t+1} = (1 - \varepsilon + \varepsilon w_{ii})\, x_{i,t} + \varepsilon \sum_{j \in \mathcal{N}_i} w_{ij} z_{j,t} - \alpha \varepsilon \widetilde{\nabla} f_i(x_{i,t}), \qquad (8)$$

where $\alpha$ and $\varepsilon$ are positive scalars that behave as stepsizes. Note that the update in (8) shows that the updated iterate is a linear combination of the weighted average of node $i$'s neighbors' decision variables, i.e., $\sum_{j \in \mathcal{N}_i} w_{ij} z_{j,t}$, and its local variable $x_{i,t}$ and stochastic gradient $\widetilde{\nabla} f_i(x_{i,t})$. The parameter $\alpha$ behaves as the stepsize of the gradient descent step with respect to the local objective function, and the parameter $\varepsilon$ behaves as an averaging parameter between performing the distributed gradient update and using the previous decision variable $x_{i,t}$. By choosing a diminishing stepsize $\alpha$ we control the noise of stochastic gradient evaluation, and by averaging using the parameter $\varepsilon$ we control the randomness induced by exchanging quantized variables. The description of QuanTimed-DSGD is summarized in Algorithm 1.
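The local update can be sketched in a few lines. The explicit form used below, a convex-style combination of the previous iterate (weight `1 - eps`) and a mixing-plus-gradient step (weight `eps`), is an assumption that matches the verbal description above; the function and argument names are hypothetical.

```python
def quantimed_update(x_i, w_ii, neighbor_msgs, grad_i, alpha, eps):
    """One local update at a node (sketch under the assumed update form).

    x_i           -- current local model, a list of floats
    w_ii          -- weight the node assigns to its own model
    neighbor_msgs -- list of (w_ij, z_j) pairs: weight and quantized model
    grad_i        -- local stochastic gradient estimate, same length as x_i
    alpha, eps    -- gradient stepsize and averaging parameter
    """
    dim = len(x_i)
    # Weighted sum of the quantized models received from neighbors.
    mixed = [sum(w * z[k] for w, z in neighbor_msgs) for k in range(dim)]
    return [
        (1.0 - eps + eps * w_ii) * x_i[k] + eps * mixed[k] - alpha * eps * grad_i[k]
        for k in range(dim)
    ]

# At consensus with a zero gradient the model should not move.
x = [1.0, 2.0]
same = quantimed_update(x, 0.5, [(0.5, [1.0, 2.0])], [0.0, 0.0], alpha=0.1, eps=0.1)
# With no neighbors and eps = 1 this reduces to a plain gradient step.
step = quantimed_update([1.0], 1.0, [], [2.0], alpha=0.5, eps=1.0)
print(same, step)
```

Note that the node's own model enters unquantized, while the neighbors' models enter only through their quantized messages.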

4 Convergence Analysis

In this section, we provide the main theoretical results for the proposed QuanTimed-DSGD algorithm. We first consider strongly convex loss functions and characterize the convergence rate of QuanTimed-DSGD for achieving the global optimal solution to the problem (4). Then, we focus on the nonconvex setting and show that the iterates generated by QuanTimed-DSGD find a stationary point of the cost in (4) while the local models are close to each other and the consensus constraint is asymptotically satisfied. All the proofs are provided in the appendix (Section 6). We make the following assumptions on the weight matrix, the quantizer, and local objective functions.

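Assumption 1.

The weight matrix $W \in \mathbb{R}^{n \times n}$ with entries $w_{ij} \ge 0$ satisfies the following conditions: $W = W^\top$, $W \mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I - W) = \mathrm{span}(\mathbf{1})$.

Assumption 2.

The random quantizer $Q(\cdot)$ is unbiased and variance-bounded, i.e., $\mathbb{E}[Q(x) \mid x] = x$ and $\mathbb{E}[\|Q(x) - x\|^2 \mid x] \le \sigma^2$ for any $x \in \mathbb{R}^p$; and quantizations are carried out independently.

Assumption 1 implies that $W$ is symmetric and doubly stochastic. Moreover, all the eigenvalues of $W$ are in $(-1, 1]$, i.e., $1 = \lambda_1(W) \ge \lambda_2(W) \ge \dots \ge \lambda_n(W) > -1$ (e.g., (Yuan et al., 2016)). We also denote by $1 - \beta$ the spectral gap associated with the stochastic matrix $W$, where $\beta = \max\{|\lambda_2(W)|, |\lambda_n(W)|\}$.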

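Assumption 3.

The function $\ell$ is $K$-smooth with respect to $x$, i.e., for any $x, \hat{x} \in \mathbb{R}^p$ and any $\theta$, $\|\nabla \ell(x, \theta) - \nabla \ell(\hat{x}, \theta)\| \le K \|x - \hat{x}\|$.

Assumption 4.

Stochastic gradients are unbiased and variance-bounded, i.e., $\mathbb{E}_\theta[\nabla \ell(x, \theta)] = \nabla L(x)$ and $\mathbb{E}_\theta[\|\nabla \ell(x, \theta) - \nabla L(x)\|^2] \le \gamma^2$.

Note that the condition in Assumption 4 implies that the local gradients of each node, $\nabla f_i(x)$, are also unbiased estimators of the expected risk gradient $\nabla L(x)$, and their variance is bounded above by $\gamma^2 / m$, as $f_i$ is defined as an average over $m$ realizations.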

4.1 Convex Setting

This section presents the convergence guarantees of the proposed QuanTimed-DSGD method for smooth and strongly convex functions. The following assumption formally defines strong convexity.

Assumption 5.

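The function $\ell$ is $\mu$-strongly convex, i.e., for any $x, \hat{x} \in \mathbb{R}^p$ and any $\theta$ we have that $\langle \nabla \ell(x, \theta) - \nabla \ell(\hat{x}, \theta),\, x - \hat{x} \rangle \ge \mu \|x - \hat{x}\|^2$.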

Next, we characterize the convergence rate of QuanTimed-DSGD for strongly convex objectives.

Theorem 1 (Strongly Convex Losses).

If the conditions in Assumptions 1-5 are satisfied and step-sizes are picked as and for arbitrary , then for large enough number of iterations the iterates generated by the QuanTimed-DSGD algorithm satisfy


where , and .

Theorem 1 guarantees the exact convergence of each local model to the global optimum even though the noises induced by random quantizations and stochastic gradients are non-vanishing with iterations. Moreover, such convergence rate is as close as desired to by picking the tuning parameter arbitrarily close to . We would like to highlight that by choosing a parameter closer to , the lower bound on the number of required iterations becomes larger. More details are available in the proof of Theorem 1 provided in the appendix.

Note that the coefficient of in (9) characterizes the dependency of our upper bound on the objective function condition number , graph connectivity parameter , and variance of error induced by quantizing our signals. Moreover, the coefficient of shows the effect of stochastic gradients variance as well as our deadline-based scheme parameters .

Remark 1.

The expression represents the inverse of the effective batch size used in our QuanTimed-DSGD method. To be more specific, if the deadline is large enough that in expectation all local gradients are computed before the deadline, i.e., , then our effective batch size is and the term is the dominant term in the maximization. Conversely, if is small and the number of computed gradients is smaller than the total number of local samples , the effective batch size is . In this case, is the dominant term in the maximization. This observation shows that in (9) is the variance of the mini-batch gradient in QuanTimed-DSGD.
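The two regimes in this remark can be illustrated numerically. The sketch below assumes the inverse effective batch size is the larger of two terms: the mean inverse processing speed divided by the deadline, and one over the local sample count. That explicit form and all names are assumptions made for illustration.

```python
def effective_batch_size(mean_inverse_speed, deadline, m):
    """Effective batch size under the assumed max-of-two-terms form.

    mean_inverse_speed -- expected value of 1 / (gradients per second)
    deadline           -- time budget per iteration, in seconds
    m                  -- number of local samples at each node
    """
    inverse_batch = max(mean_inverse_speed / deadline, 1.0 / m)
    return 1.0 / inverse_batch

# Long deadline: every local sample is processed, so the batch is capped at m.
long_deadline = effective_batch_size(0.02, deadline=100.0, m=50)
# Short deadline: the batch is limited by how many gradients fit in time.
short_deadline = effective_batch_size(0.02, deadline=0.1, m=50)
print(long_deadline, short_deadline)
```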

Remark 2.

Using strong convexity of the objective function, one can easily verify that the last iterates of QuanTimed-DSGD satisfy the sub-optimality with respect to the empirical risk, where is the minimizer of the empirical risk . As the gap between the expected risk and the empirical risk is of , the overall error of QuanTimed-DSGD with respect to the expected risk is .

4.2 Non-Convex Setting

In this section, we characterize the convergence rate of QuanTimed-DSGD for non-convex and smooth objectives. As discussed in Section 2, we are interested in finding a set of local models which satisfy the first-order optimality condition approximately, while the models are close to each other and satisfy the consensus condition up to a small error. To be more precise, we are interested in finding a set of local models whose average (approximately) satisfies the first-order optimality condition, i.e., , while the iterates are close to their average, i.e., . If a set of local iterates satisfies these conditions, we call them -approximate solutions. The next theorem characterizes both first-order optimality and consensus convergence rates and the overall complexity for achieving an -approximate solution.

Theorem 2 (Non-Convex Losses).

Under Assumptions 1-4, and for step-sizes and , QuanTimed-DSGD guarantees the following convergence and consensus rates:




for large enough number of iterations . Here denotes the average models at iteration .

The convergence rate in (10) indicates the proposed QuanTimed-DSGD method finds first-order stationary points with vanishing approximation error, even though the quantization and stochastic gradient noises are non-vanishing. Also, the approximation error decays as fast as with iterations. Theorem 2 also implies from (11) that the local models reach consensus with a rate of . Moreover, it shows that to find an -approximate solution QuanTimed-DSGD requires at most iterations.

5 Experimental Results

In this section, we numerically evaluate the performance of the proposed QuanTimed-DSGD method in solving a class of nonconvex decentralized optimization problems. In particular, we compare the total run-time of the QuanTimed-DSGD scheme with that of three benchmarks, which are briefly described below.

  • Decentralized SGD (DSGD) (Yuan et al., 2016): Each worker updates its decision variable as . We note that the exchanged messages are not quantized and the local gradients are computed for a fixed batch size.

  • Quantized Decentralized SGD (Q-DSGD) (Reisizadeh et al., 2018): Iterates are updated according to (8). Similar to QuanTimed-DSGD scheme, Q-DSGD employs quantized message-passing, however the gradients are computed for a fixed batch size in each iteration.

  • Asynchronous DSGD (Lian et al., 2017b): Each worker updates its model without waiting to receive the updates of its neighbors, i.e. where denotes the most recent model for node . In our implementation, models are exchanged without quantization.

Data and Experimental Setup. We carry out two sets of experiments over CIFAR-10 and MNIST datasets, where each worker is assigned a sample set of size for both datasets. For CIFAR-10, we implement a binary classification using a fully connected neural network with one hidden layer with neurons. Three color (RGB) matrices for each image are combined with ratio so that the input of the neural network is a vector of length (see (Dutta et al., 2018)). For MNIST, we use a fully connected neural network with one hidden layer of size to classify the input image into classes. In experiments over CIFAR-10, step-sizes are chosen as follows: for QuanTimed-DSGD and Q-DSGD, and for DSGD and Asynchronous DSGD. In MNIST experiments, for QuanTimed-DSGD and Q-DSGD, and for DSGD.

We implement the unbiased low precision quantizer in (7) with various quantization levels , and we let denote the communication time of a -vector without quantization. In order to ensure that the expected batch size used in each node is a target positive number , we choose the deadline , where is the random computation speed. The communication graph is a random Erdős-Rényi graph with edge connectivity and nodes. The weight matrix is designed as where is the Laplacian matrix of the graph and .
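A Laplacian-based weight matrix of this kind can be built as in the sketch below. It assumes the design is the identity minus the graph Laplacian divided by a constant larger than the Laplacian's largest eigenvalue; the specific constant used here (twice the maximum degree plus one, a standard upper bound) and the function names are assumptions, and the sketch does not check that the sampled graph is connected.

```python
import random

def erdos_renyi_edges(n, p, rng):
    """Sample an undirected Erdős-Rényi graph: each edge appears w.p. p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def laplacian_weights(n, edges):
    """Build W = I - L / kappa from the graph Laplacian L.

    kappa = 2 * max_degree + 1 exceeds the largest Laplacian eigenvalue,
    which is at most 2 * max_degree (assumed choice for illustration).
    """
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    kappa = 2 * max(degree) + 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0 / kappa      # off-diagonal: -L_ij / kappa
    for i in range(n):
        W[i][i] = 1.0 - degree[i] / kappa    # diagonal: 1 - L_ii / kappa
    return W

rng = random.Random(0)
edges = erdos_renyi_edges(8, 0.5, rng)
W = laplacian_weights(8, edges)
row_sums = [sum(row) for row in W]
print(row_sums)
```

By construction the matrix is symmetric with nonnegative entries and unit row sums, so it is doubly stochastic; a connected graph is additionally needed for the consensus property.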

Figure 1: Comparison of QuanTimed-DSGD and vanilla DSGD methods for training a neural network on CIFAR-10 (left) and MNIST (right) datasets ().
Figure 2: Comparison of QuanTimed-DSGD, QDSGD, and vanilla DSGD methods for training a neural network on CIFAR-10 (left) and MNIST (right) datasets ().
Figure 3: Left: Comparison of QuanTimed-DSGD with Asynchronous DSGD and DSGD for training a neural network on CIFAR-10 (). Right: Effect of on the loss for CIFAR-10 ().

Results. Figure 1 compares the total training run-time for the QuanTimed-DSGD and DSGD schemes. On CIFAR-10 (left), for the same (effective) batch sizes, the proposed QuanTimed-DSGD achieves speedups of up to compared to DSGD.

In Figure 2, we further compare these two schemes to the Q-DSGD benchmark. Although Q-DSGD improves upon vanilla DSGD by employing quantization, the proposed QuanTimed-DSGD shows a further speedup in training time over Q-DSGD (left).

To evaluate the straggler mitigation in the QuanTimed-DSGD, we compare its run-time with Asynchronous DSGD benchmark in Figure 3 (left). While Asynchronous DSGD outperforms DSGD in training run-time by avoiding slow nodes, the proposed QuanTimed-DSGD scheme improves upon Asynchronous DSGD by up to . These plots further illustrate that QuanTimed-DSGD significantly reduces the training time by simultaneously handling the communication load by quantization and mitigating stragglers through a deadline-based computation. The deadline time indeed can be optimized for the minimum training run-time, as illustrated in Figure 3 (right).

6 Appendix

Here we provide all the proofs and details which were skipped in the main document.

6.1 Bounding the Stochastic Gradient Noises

In our analysis for both convex and non-convex scenarios, we need to bound the noise of various stochastic gradient functions. Hence, let us start this section with the following lemma, which bounds the variance of stochastic gradient functions under Assumption 4.

Lemma 1.

Assumption 4 implies the following for any and :


The first five expressions (i)-(v) in the lemma are immediate results of Assumption 4 together with the fact that the noise of the stochastic gradient scales down with the sample size. To prove (vi), let denote the sample set for which node has computed the gradients. We have

and therefore

6.2 Proof of Theorem 1

To prove Theorem 1, we first establish Lemmas 2 and 3 and then conclude the theorem from these two results.

The main problem is to minimize the global objective defined in (4). We introduce the following optimization problem, which is equivalent to the main problem:


where the vector denotes the concatenation of all the local models. Clearly, is the solution to (12). Using Assumption 1, the constraint in the alternative problem (12) can be stated as . Inspired by this fact, we define the following penalty function for every :


and denote by the (unique) minimizer of . That is,


The next lemma characterizes the deviation of the models generated by the QuanTimed-DSGD method at iteration , that is from the optimizer of the penalty function, i.e. .

Lemma 2.

Suppose Assumptions 1-5 hold. Then, the expected deviation of the output of QuanTimed-DSGD from the solution to Problem (13) is upper bounded by


for , , any and , where

Proof of Lemma 2.

First note that the gradient of the penalty function defined in (13) is as follows:


where denotes the concatenation of models at iteration . Now consider the following stochastic gradient function for :



We let denote a sigma algebra that measures the history of the system up until time . According to Assumptions 2 and 4, the stochastic gradient defined above is unbiased, that is,

We can also write the update rule of QuanTimed-DSGD method as follows:


which also represents an iteration of the Stochastic Gradient Descent (SGD) algorithm with step-size in order to minimize the penalty function over . We can bound the deviation of the iteration generated by QuanTimed-DSGD from the optimizer as follows:


where we used the fact that the penalty function is strongly convex with parameter . Moreover, we can bound the second term in RHS of (20) as follows:


To derive (21), we used the facts that is smooth with parameter ; the quantizer is unbiased with variance (Assumption 2); stochastic gradients of the loss function are unbiased and variance-bounded (Assumption 4 and Lemma 1). Plugging (21) in (20) yields


To ease the notation, let denote the expected deviation of the models at iteration i.e. from the optimizer with respect to all the randomness from iteration . Therefore,


For any and the proposed pick , we have

and therefore