# On the Convergence of Memory-Based Distributed SGD

Distributed stochastic gradient descent (DSGD) has been widely used for optimizing large-scale machine learning models, both convex and non-convex. With the rapid growth of model size, the huge communication cost has become the bottleneck of traditional DSGD. Recently, many communication compression methods have been proposed. Memory-based distributed stochastic gradient descent (M-DSGD) is one of the efficient methods, since each worker communicates a sparse vector in each iteration so that the communication cost is small. Recent works establish the convergence rate of M-DSGD when it adopts vanilla SGD. However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD. In this paper, we propose a universal convergence analysis for M-DSGD by introducing a transformation equation. The transformation equation describes the relation between traditional DSGD and M-DSGD so that we can transform M-DSGD to its corresponding DSGD. Hence we obtain the convergence rate of M-DSGD with momentum for both convex and non-convex problems. Furthermore, we combine M-DSGD with stagewise learning, in which the learning rate of M-DSGD in each stage is a constant and is decreased by stage instead of by iteration. Using the transformation equation, we establish the convergence rate of stagewise M-DSGD, which bridges the gap between theory and practice.


## 1 Introduction

Many machine learning models can be formulated as the following empirical risk minimization problem:

$$\min_{w\in\mathbb{R}^d} F(w) := \frac{1}{n}\sum_{i=1}^{n} f(w;\zeta_i), \tag{1}$$

where $w\in\mathbb{R}^d$ denotes the model parameter, $\zeta_i$ denotes the $i$-th training data, $n$ is the number of training data, and $d$ is the size of the model. SGD (Robbins and Monro, 1951) is one of the most efficient ways to solve the empirical risk minimization problem. In each iteration, $w$ is updated by $w_{t+1} = w_t - \eta_t\nabla f(w_t;\zeta_{i_t})$, where $\zeta_{i_t}$ is a randomly sampled training instance. Compared to batch methods like gradient descent, it only needs to calculate one gradient in each iteration.
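To make the update concrete, the SGD iteration above can be sketched as follows on a hypothetical least-squares instance of (1); the problem data and step size here are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares problem: f(w; (x, y)) = 0.5 * (x @ w - y)**2
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def sgd(w, eta=0.05, steps=4000):
    for t in range(steps):
        i = rng.integers(len(X))          # sample one training example zeta_i
        grad = (X[i] @ w - y[i]) * X[i]   # stochastic gradient of f(w; zeta_i)
        w = w - eta * grad                # SGD update: w <- w - eta * grad
    return w

w = sgd(np.zeros(5))
```

Since the system is consistent, the iterate approaches `w_true` despite using only one gradient per step.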

With the rapid growth of data, using SGD to solve the empirical risk minimization problem is time-consuming. Hence, distributed stochastic gradient descent (DSGD) has become the method of choice, and many machine learning platforms (e.g. TensorFlow, PyTorch) adopt it. With $p$ workers, it can be summarized as

$$w_{t+1} = w_t - \eta_t\sum_{k=1}^{p} g_{t,k}, \tag{2}$$

where $g_{t,k}$ is the update vector calculated by the $k$-th worker and usually satisfies the unbiased-estimation property $\mathbb{E}[\sum_{k=1}^{p} g_{t,k}] = \nabla F(w_t)$. Workers calculate the $g_{t,k}$ in parallel and the model parameter is updated by the summation of these vectors with learning rate $\eta_t$.
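Update (2) can be simulated in a single process; the sharding scheme and worker function below are hypothetical stand-ins for a real cluster, on an assumed least-squares objective.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, n = 4, 5, 80
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def worker_update(w, shard, b=8):
    """g_{t,k}: the k-th worker's mini-batch gradient, scaled by 1/p."""
    Xk, yk = shard
    idx = rng.integers(len(Xk), size=b)
    return (Xk[idx].T @ (Xk[idx] @ w - yk[idx])) / (b * p)

# data partitioned over the p workers
shards = [(X[k::p], y[k::p]) for k in range(p)]

w, eta = np.zeros(d), 0.05
for t in range(3000):
    w = w - eta * sum(worker_update(w, s) for s in shards)  # update (2)
```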

On convergence, DSGD is equivalent to SGD using a single worker, which has the optimal $O(1/\sqrt{T})$ rate for non-convex problems and the $O(1/T)$ rate for strongly convex problems (Dekel et al., 2012; Rakhlin et al., 2012; Li et al., 2014b). Besides, communication is another important research area in distributed optimization. Recently, more and more large models, like DenseNet (Huang et al., 2017) and BERT (Devlin et al., 2018), are used in machine learning. This leads to a huge communication cost which cannot be ignored. Hence, communication compression has attracted much attention for further reducing training time.

One branch of this research area is low-precision representation (also called quantization). On modern hardware, a float number is represented with 32 bits, so in DSGD, when one worker sends or receives a $d$-dimension vector, the communication cost is $32d$ bits. For a vector $v$, low-precision representation methods quantize $v$ into an $s$-bit representation space, denoted as $Q(v)$. It satisfies $\mathbb{E}[Q(v)] = v$ and the communication cost for $Q(v)$ is about $sd$ bits. Usually these methods need to divide the $d$ coordinates into different buckets due to the quantization variance and then quantize them individually, so the communication cost also grows with the number of buckets. It is easy to see that the compression ratio is at least $1/32$, since each coordinate needs at least one bit.
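As a sketch of the quantization idea (our illustration, in the spirit of QSGD-style schemes; the level count is arbitrary and bucket handling is simplified to a single bucket), the key property is unbiasedness via stochastic rounding:

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(v, s=4):
    """Unbiased stochastic quantization of v onto s levels per sign (one bucket)."""
    scale = np.abs(v).max()
    if scale == 0:
        return v.copy()
    levels = np.abs(v) / scale * s          # map |v_i| into [0, s]
    lower = np.floor(levels)
    # round up with probability equal to the fractional part -> E[Q(v)] = v
    up = rng.random(v.shape) < (levels - lower)
    return (lower + up) / s * scale * np.sign(v)

v = rng.normal(size=6)
avg = np.mean([quantize(v) for _ in range(20000)], axis=0)  # ~ v by unbiasedness
```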

Another branch is sparse communication. For a vector $v$, these methods make it sparse, denoted as $S(v)$, so that workers only need to send sparse vectors, which reduces the communication cost efficiently. In (Wang et al., 2018; Wangni et al., 2018), a stochastic sparsity technique is used to get $S(v)$ with an unbiased guarantee, i.e. $\mathbb{E}[S(v)] = v$. Hence, these methods are mathematically equivalent to quantization ones (Wang et al., 2018). (Aji and Heafield, 2017; Lin et al., 2018; Alistarh et al., 2018; Stich et al., 2018) propose novel sparse communication methods using memory gradient. Compared to the previous ones, $S(v)$ is not necessarily an unbiased estimation of $v$; it contains only a few coordinates of $v$. After sending a sparse vector in each iteration, each worker stores those values which are not sent in the memory. These stored values are called the memory gradient and will be used in the next iteration. Such methods are called memory-based distributed stochastic gradient descent (M-DSGD). (Aji and Heafield, 2017; Alistarh et al., 2018; Stich et al., 2018) are mainly based on vanilla SGD. (Alistarh et al., 2018) proves the convergence rate for convex problems and (Stich et al., 2018) proves the convergence for both convex and non-convex problems. Their convergence conditions are listed in Table 1. (Lin et al., 2018) adopts momentum SGD and gets better performance. Empirical results on cifar10 and imagenet show that they only need to send a highly sparse vector in each iteration without loss of generalization, which means the compression ratio is far smaller than that of quantization (Lin et al., 2018). However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD.
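The send/store split performed by one worker can be sketched as follows (our illustration; `topk_with_memory` is a hypothetical helper, with top-$k$ as the sparsifier):

```python
import numpy as np

def topk_with_memory(g, u, k):
    """One worker's step of memory-based sparsification:
    send the k largest-magnitude coordinates of g + u,
    keep the rest in the memory gradient u (error feedback)."""
    acc = g + u
    mask = np.zeros_like(acc)
    idx = np.argsort(np.abs(acc))[-k:]   # indices of the top-k entries
    mask[idx] = 1.0
    sent = mask * acc                    # sparse vector communicated
    u_next = (1 - mask) * acc            # values not sent are remembered
    return sent, u_next
```

Note that `sent + u_next` always equals `g + u`: no gradient information is discarded, only delayed.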

In this paper, we focus on the convergence rate of M-DSGD with momentum. The main results and contributions are summarized below:

• We propose the transformation equation for M-DSGD. It describes the relation between M-DSGD and traditional DSGD. According to the transformation equation, we can transform M-DSGD to its corresponding DSGD.

• When M-DSGD adopts momentum SGD with momentum scalar $\beta$, we prove the convergence rate for both convex and non-convex problems. When $\beta = 0$, M-DSGD degenerates to the vanilla-SGD variant (Aji and Heafield, 2017), and we also get its convergence rate.

• We combine M-DSGD with stagewise learning (Chen et al., 2019), in which M-DSGD uses a constant learning rate in each stage and decreases it by stage, as is usually done in practice. By the transformation equation, we prove the convergence rate of stagewise M-DSGD for a broad family of non-smooth and non-convex problems, which bridges the gap between theory and practice.

## 2 Preliminary

In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm, use $w^*$ to denote the optimal solution of (1), use $\nabla f(w;\mathcal{I})$ to denote one stochastic gradient with respect to a mini-batch $\mathcal{I}$ of samples such that $|\mathcal{I}| = b$ and $\mathbb{E}[\nabla f(w;\mathcal{I})] = \nabla F(w)$, use $\langle\cdot,\cdot\rangle$ to denote the dot product, use $\mathbf{1}$ to denote the all-ones vector $(1,\ldots,1)^T$, and use $I$ to denote the identity matrix. For a vector $v$, we use $v^{(j)}$ to denote its $j$-th coordinate value. We make the following definitions:

(bounded gradient) $\nabla f(w;\mathcal{I})$ is the $G$-bounded ($G > 0$) stochastic gradient of function $F$ if it satisfies $\mathbb{E}\|\nabla f(w;\mathcal{I})\|^2 \le G^2$, $\forall w$.

(smooth function) Function $F$ is $L$-smooth ($L > 0$) if $\|\nabla F(w) - \nabla F(w')\| \le L\|w - w'\|$, or equivalently $F(w') \le F(w) + \langle\nabla F(w), w' - w\rangle + \frac{L}{2}\|w' - w\|^2$.

(strong convex function) Function $F$ is $\mu$-strong convex ($\mu > 0$) if $F(w') \ge F(w) + \langle\nabla F(w), w' - w\rangle + \frac{\mu}{2}\|w' - w\|^2$.

(weak convex function) Function $F$ is $\mu$-weak convex ($\mu > 0$) if $F(w') \ge F(w) + \langle\partial F(w), w' - w\rangle - \frac{\mu}{2}\|w' - w\|^2$, where $\partial F(w)$ denotes a subgradient of $F$ at $w$.

The first three definitions are common in both convex and non-convex optimization. Throughout this paper, we assume that $F(w^*) > -\infty$.

Recently, the weak convex property has attracted much attention in non-convex optimization (Allen-Zhu, 2018a, b; Chen et al., 2019). An $L$-smooth function must be $L$-weak convex. For a $\mu$-weak convex function, we can add a quadratic regularization term to make it convex, so that we can use convex optimization tools for weak convex problems.

## 3 Memory-based Distributed SGD

Assuming we have $p$ workers, memory-based DSGD is presented in Algorithm 1. It can be implemented on many distributed architectures, like all-reduce and Parameter Server (Li et al., 2014a). Data are divided into $p$ partitions and stored on the workers. Each worker calculates an update vector. After aggregating the update vectors $m_{t,k}\odot(g_{t,k}+u_{t,k})$, the parameter $w_t$ is updated. Since $m_{t,k}$ is sparse, each update vector is sparse as well, so M-DSGD can reduce the communication cost. Besides, each worker stores those coordinates which are not sent, denoted as $u_{t+1,k}$. This is called the memory gradient. In some related work (Aji and Heafield, 2017; Alistarh et al., 2018), it is also called residuals or accumulated error.

### 3.1 Relation to Existing Sparse Communication Methods

Assume we have obtained the mask $m_{t,k}$. The update rule of M-DSGD can be written as

$$
\begin{aligned}
g_{t,k} &= \beta g_{t-1,k} + \frac{1}{pb}\sum_{\zeta_i\in\mathcal{I}_{t,k}}\nabla f(w_t;\zeta_i),\\
u_{t+1,k} &= (1-m_{t,k})\odot(g_{t,k}+u_{t,k}),\\
w_{t+1} &= w_t - \eta_t\sum_{k=1}^{p} m_{t,k}\odot(g_{t,k}+u_{t,k}).
\end{aligned}
$$

The method in (Aji and Heafield, 2017) is a special case of M-DSGD obtained by setting $\beta = 0$, which means (Aji and Heafield, 2017) adopts the vanilla SGD.
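The M-DSGD update with momentum can be simulated in one process as below (our sketch; the least-squares objective, the sharding, and the top-$k$ mask are hypothetical choices, and each worker uses its full shard as the mini-batch for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
p, d, k = 4, 10, 2                 # workers, dimension, coordinates sent per step
X = rng.normal(size=(200, d))
w_true = rng.normal(size=d)
y = X @ w_true
shards = [(X[j::p], y[j::p]) for j in range(p)]

def topk_mask(v, k):
    m = np.zeros_like(v)
    m[np.argsort(np.abs(v))[-k:]] = 1.0
    return m

beta, eta = 0.9, 0.02
g = [np.zeros(d) for _ in range(p)]    # momentum buffers g_{t,k}
u = [np.zeros(d) for _ in range(p)]    # memory gradients u_{t,k}
w = np.zeros(d)
for t in range(2000):
    update = np.zeros(d)
    for j, (Xj, yj) in enumerate(shards):
        grad = Xj.T @ (Xj @ w - yj) / (len(Xj) * p)   # worker j's gradient
        g[j] = beta * g[j] + grad                     # momentum accumulation
        acc = g[j] + u[j]
        m = topk_mask(acc, k)
        update += m * acc                             # sparse vector communicated
        u[j] = (1 - m) * acc                          # memory update
    w = w - eta * update
```

Although only $k$ of $d$ coordinates are communicated per worker per step, the memory mechanism lets the iterate still reach the optimum.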

(Alistarh et al., 2018) and (Stich et al., 2018) also use the memory gradient to make the communication sparse. Their update rules can be written as

$$
\begin{aligned}
g_{t,k} &= \frac{1}{pb}\sum_{\zeta_i\in\mathcal{I}_{t,k}}\nabla f(w_t;\zeta_i),\\
u_{t+1,k} &= (1-m_{t,k})\odot(\eta_t g_{t,k}+u_{t,k}),\\
w_{t+1} &= w_t - \sum_{k=1}^{p} m_{t,k}\odot(\eta_t g_{t,k}+u_{t,k}).
\end{aligned}
$$

We can see that they also use the vanilla SGD. Compared to M-DSGD with $\beta = 0$, the difference is that their memory gradient contains the learning rate $\eta_t$. By setting $v_{t,k} = u_{t,k}/\eta_t$, we re-write the update rules for $v_{t+1,k}$ and $w_{t+1}$ as:

$$
\begin{aligned}
v_{t+1,k} &= \frac{\eta_t}{\eta_{t+1}}(1-m_{t,k})\odot(g_{t,k}+v_{t,k}), & (3)\\
w_{t+1} &= w_t - \eta_t\sum_{k=1}^{p} m_{t,k}\odot(g_{t,k}+v_{t,k}). & (4)
\end{aligned}
$$

We observe that the update rule for $w_{t+1}$ is the same as that of M-DSGD. The difference is the scalar $\eta_t/\eta_{t+1}$ in (3). In most convergence analyses for vanilla SGD, the learning rate is a constant or non-increasing. If $\eta_t$ is a constant, then the method is exactly the same as M-DSGD. If $\eta_t$ is a non-increasing sequence, then on the one hand, in (3), $\eta_t/\eta_{t+1} \ge 1$. In the later convergence analysis, we will see that we should make the memory gradient norm as small as possible. Hence the scalar $\eta_t/\eta_{t+1}$ is unnecessary and can be dropped. On the other hand, from the asynchronous-updating point of view (Lin et al., 2018), M-DSGD is more reasonable in that the memory gradient should not contain the learning rate. Since the memory gradient denotes stale information, we should apply the current $\eta_t$, which is smaller than the earlier learning rates, on $u_{t,k}$ when we use it to get $w_{t+1}$.

(Lin et al., 2018) is the first work that adopts momentum SGD in M-DSGD. It uses a trick called momentum factor masking. Its update rule can be written as

$$
\begin{aligned}
\hat{g}_{t,k} &= \beta g_{t-1,k} + \frac{1}{pb}\sum_{\zeta_i\in\mathcal{I}_{t,k}}\nabla f(w_t;\zeta_i),\\
u_{t+1,k} &= (1-m_{t,k})\odot(\hat{g}_{t,k}+u_{t,k}),\\
w_{t+1} &= w_t - \eta_t\sum_{k=1}^{p} m_{t,k}\odot(\hat{g}_{t,k}+u_{t,k}),\\
g_{t,k} &= (1-m_{t,k})\odot\hat{g}_{t,k}. \qquad\text{(momentum factor masking)}
\end{aligned}
$$

After getting $m_{t,k}$, each worker applies the same mask on $\hat{g}_{t,k}$ to get $g_{t,k}$. (Lin et al., 2018) considers the algorithm as a kind of asynchronous momentum SGD in which the momentum factor masking can overcome the staleness effect. However, $m_{t,k}$ is designed mainly based on $\hat{g}_{t,k}+u_{t,k}$; it has nothing to do with the momentum buffer itself. The empirical results of (Lin et al., 2018) on cifar10 using resnet110 show that the effect of momentum factor masking on top-1 accuracy is marginal.
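One worker's iteration, including the masking line, can be sketched as follows (our illustration; `m_fn` is a hypothetical mask-selection helper, e.g. top-$k$):

```python
import numpy as np

def dgc_worker_step(w, g_prev, u, grad, m_fn, eta, beta=0.9):
    """One worker's update with momentum factor masking
    (sketch in the style of Lin et al., 2018)."""
    g_hat = beta * g_prev + grad      # momentum accumulation on the raw gradient
    acc = g_hat + u                   # add memory gradient
    m = m_fn(acc)                     # 0/1 sparsity mask
    w_next = w - eta * m * acc        # apply only the sent coordinates
    u_next = (1 - m) * acc            # store the rest as memory
    g_next = (1 - m) * g_hat          # momentum factor masking
    return w_next, g_next, u_next
```

Compared with plain M-DSGD, the only change is the last line, which zeroes the momentum buffer on the coordinates that were just sent.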

## 4 Transformation Equation

For convenience, we define a diagonal matrix $M_{t,k}$ whose diagonal is $m_{t,k}$, to replace the symbol $\odot$. Then the update rule for M-DSGD can be written as

$$
\begin{aligned}
w_{t+1} &= w_t - \eta_t\sum_{k=1}^{p} M_{t,k}(g_{t,k}+u_{t,k}), & (5)\\
u_{t+1,k} &= (I - M_{t,k})(g_{t,k}+u_{t,k}). & (6)
\end{aligned}
$$

According to (5) and (6), we can eliminate $M_{t,k}$ and obtain

$$w_{t+1} - \eta_t\sum_{k=1}^{p} u_{t+1,k} = w_t - \eta_t\sum_{k=1}^{p}(g_{t,k}+u_{t,k}). \tag{7}$$
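Identity (7) is purely algebraic, so it can be checked numerically with random data (our sketch; all quantities are randomly generated stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
p, d, eta = 3, 8, 0.1
g = rng.normal(size=(p, d))                       # g_{t,k}
u = rng.normal(size=(p, d))                       # u_{t,k}
m = (rng.random((p, d)) < 0.3).astype(float)      # diagonal of M_{t,k}
w = rng.normal(size=d)

w_next = w - eta * (m * (g + u)).sum(axis=0)      # update (5)
u_next = (1 - m) * (g + u)                        # update (6)

lhs = w_next - eta * u_next.sum(axis=0)           # left side of (7)
rhs = w - eta * (g + u).sum(axis=0)               # right side of (7)
```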

First, we consider the simplest case, $\beta = 0$, to show the relation between traditional DSGD and M-DSGD. For convenience, we denote $\nabla f(w_t;\mathcal{I}_t) = \sum_{k=1}^{p} g_{t,k}$, which satisfies $\mathbb{E}[\nabla f(w_t;\mathcal{I}_t)] = \nabla F(w_t)$.

According to the above equation, we set $z_t = w_t - \eta_t\sum_{k=1}^{p} u_{t,k}$ and obtain

$$
\begin{aligned}
z_{t+1} &= z_t - \eta_t\nabla f(w_t;\mathcal{I}_t) + (\eta_t-\eta_{t+1})\sum_{k=1}^{p} u_{t+1,k}\\
&= z_t \underbrace{- \eta_t\nabla f(z_t;\mathcal{I}_t)}_{(\mathrm{I})} + \underbrace{\eta_t\big(\nabla f(z_t;\mathcal{I}_t)-\nabla f(w_t;\mathcal{I}_t)\big)}_{(\mathrm{II})} + \underbrace{(\eta_t-\eta_{t+1})\sum_{k=1}^{p} u_{t+1,k}}_{(\mathrm{III})}
\end{aligned} \tag{8}
$$

According to equation (8), we observe that:

• the term (I) is exactly the update rule for $z_t$ in traditional DSGD;

• for the term (II), if $f$ is smooth, we have $\|\nabla f(z_t;\mathcal{I}_t)-\nabla f(w_t;\mathcal{I}_t)\| \le L\|z_t - w_t\| = L\eta_t\|\sum_{k=1}^{p} u_{t,k}\|$;

• for the term (III), if the memory gradients are bounded, then it is of the order of $\eta_t - \eta_{t+1}$.

It implies that, under certain assumptions, when we transform one traditional DSGD with initialization $z_0 = w_0$ and learning rate $\eta_t$ to M-DSGD, it is equivalent to adding one small noise, scaled roughly by $\eta_t$, in each iteration. To the best of our knowledge, for most DSGDs with a convergence guarantee, the learning rate changes slowly, i.e., $\eta_t - \eta_{t+1}$ is small; for example, $\eta_t$ is a constant or $\eta_t = O(1/\sqrt{t})$. When the noise is bounded and after being scaled by $\eta_t$, it will not affect the convergence of $z_t$. What's more, when $\eta_t$ is small, $w_t$ will get close to $z_t$, which means $w_t$ converges at the same time. Now we can conclude that both $z_t$ and $w_t$ converge to the optimal solution.

In fact, in equation (8), we transform the update rule of $w_t$ to that of $z_t$ and get the convergence of $w_t$ benefitting from the update rule of $z_t$. For general $\beta$, we have the following theorem:

(Transformation) Let $\{w_t\}$ be a sequence generated by Algorithm 1 with learning rate $\eta_t$. We set $z_t$ to be some linear combination of $\{w_t, g_{t-1,k}, u_{t,k}\}$, and define $\gamma_t = h(\eta_t)$, where $h$ is some function. Assume the following conditions hold:

$$
\begin{aligned}
&\mathbb{E}\|z_t - w_t\|^2 \le A\gamma_t^2, \quad \forall t\ge 0; & (9)\\
&z_{t+1} = z_t - \gamma_t d_t + \alpha_t e_t, & (10)
\end{aligned}
$$

where $\mathbb{E}[d_t|w_t] = \nabla F(w_t)$, $\mathbb{E}\|e_t\|^2$ is bounded, and $\alpha_t = O(\gamma_t^2)$. Then we call (10) the transformation equation. If $F$ is $L$-smooth with $G$-bounded stochastic gradient, we have

$$\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}\big[\|\nabla F(w_t)\|^2\,\big|\,w_t\big] \le F(w_0)-F(w^*) + C\sum_{t=0}^{T-1}\gamma_t^2,$$

where $C$ is a constant determined by $L$, $G$, $A$ and the bound on $\mathbb{E}\|e_t\|^2$.

In Theorem 4, we only need $z_t$ to be a linear combination of $\{w_t, g_{t-1,k}, u_{t,k}\}$, so it is easy to construct such a $z_t$. Although $d_t$ is an unbiased estimation of the full gradient at $w_t$ rather than at $z_t$, benefitting from (9), which implies that $z_t$ and $w_t$ are close enough, and from the fact that $\alpha_t e_t$ is of the same order of magnitude as the variance of $\gamma_t d_t$, (10) can be seen as updating $z_t$ by DSGD with learning rate $\gamma_t$. Hence, (10) transforms the update rule of $w_t$ to that of $z_t$, and we call it the transformation equation. It describes the relation between M-DSGD and traditional DSGD. If

$$\sum_{t=0}^{T-1}\gamma_t\to\infty,\qquad \sum_{t=0}^{T-1}\gamma_t^2\Big/\sum_{t=0}^{T-1}\gamma_t\to 0,\quad \text{as } T\to\infty, \tag{11}$$

then we can randomly choose $\hat{w}$ from $\{w_0,\ldots,w_{T-1}\}$ with probability $P(\hat{w}=w_t)=\gamma_t/\sum_{j=0}^{T-1}\gamma_j$, and get that $\mathbb{E}\|\nabla F(\hat{w})\|^2\to 0$.
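For instance, the sequence $\gamma_t = 1/\sqrt{t+1}$ satisfies (11): the partial sums diverge while the ratio in (11) decays like $\log T/\sqrt{T}$. This can be inspected numerically (our sketch):

```python
import math

def check_condition_11(gamma, T):
    """Return (sum of gamma_t, ratio of sum of gamma_t^2 to sum of gamma_t)."""
    s1 = sum(gamma(t) for t in range(T))
    s2 = sum(gamma(t) ** 2 for t in range(T))
    return s1, s2 / s1

# gamma_t = 1/sqrt(t+1): s1 grows like 2*sqrt(T), the ratio shrinks to 0
s1_small, r_small = check_condition_11(lambda t: 1 / math.sqrt(t + 1), 1000)
s1_big, r_big = check_condition_11(lambda t: 1 / math.sqrt(t + 1), 100000)
```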

## 5 Convergence

In this section, we prove the convergence of M-DSGD with $\beta$-momentum for both convex and non-convex problems. For convenience, we denote $\tilde{g}_t = \sum_{k=1}^{p} g_{t,k}$ and $\tilde{u}_t = \sum_{k=1}^{p} u_{t,k}$. Then according to (7), we have the update rule for $w_{t+1}$:

$$w_{t+1} - \eta_t\tilde{u}_{t+1} = w_t - \eta_t(\tilde{g}_t+\tilde{u}_t), \tag{12}$$

where $\tilde{g}_t = \beta\tilde{g}_{t-1} + \nabla f(w_t;\mathcal{I}_t)$.

According to Theorem 4, our main task is establishing the transformation equation.

Let $\{\rho_t\}$ be a sequence of scalars satisfying $\rho_{t-1} = -\beta(\eta_t - \rho_t)$, with $\tilde{g}_{-1} = 0$. By setting

$$z_t = w_t + \rho_{t-1}\tilde{g}_{t-1} - \eta_t\tilde{u}_t, \tag{13}$$

we have

$$z_{t+1} = z_t - (\eta_t-\rho_t)\nabla f(w_t;\mathcal{I}_t) + (\eta_t-\eta_{t+1})\tilde{u}_{t+1}. \tag{14}$$

Lemma 5 gives the transformation equation of M-DSGD with $\beta$-momentum. We can see that $z_t$ is a linear combination of $\{w_t, \tilde{g}_{t-1}, \tilde{u}_t\}$. Since $\|\tilde{g}_{t-1}\|$ and $\|\tilde{u}_t\|$ are bounded, it is easy to make condition (9) hold. We should design $\rho_t$ carefully. We propose two strategies:

• $\eta_t = \eta$ is a small constant; then $\rho_t = -\frac{\beta\eta}{1-\beta}$ and $\gamma_t = \eta_t - \rho_t = \frac{\eta}{1-\beta}$;

• $\eta_t$ is non-increasing (e.g. $\eta_t = O(1/\sqrt{t+1})$); then $\rho_t$ is determined by the recursion $\rho_{t-1} = -\beta(\eta_t - \rho_t)$, and $\gamma_t = \eta_t - \rho_t = O(\eta_t)$.

It is easy to verify that both strategies satisfy (9) and the learning rate condition (11). Specifically, we have the following lemma: Assume $F$ has the $G$-bounded stochastic gradient and $\mathbb{E}\|\tilde{u}_t\|^2 \le U^2$, and let $z_t$ be defined as in Lemma 5. Then we have

$$\mathbb{E}\|\tilde{g}_t\|^2 \le \frac{G^2}{(1-\beta)^2},$$

and if $\eta_t$ and $\rho_t$ are chosen according to one of the above two strategies, we have

$$\mathbb{E}\|z_t - w_t\|^2 \le \frac{2G^2}{(1-\beta)^2}\rho_{t-1}^2 + 2U^2\eta_t^2 \le O(\rho_{t-1}^2).$$

Then we get the following convergence rate of M-DSGD:

(strong convex case) Let $\gamma_t = \eta_t - \rho_t$ and let $z_t$ be defined as in Lemma 5. Assume $F$ is $L$-smooth and $\mu$-strong convex with $G$-bounded stochastic gradient, and $\mathbb{E}\|\tilde{u}_t\|^2 \le U^2$. By setting $\eta_t$ such that $\gamma_t = \frac{2}{\mu(t+1)}$, we have

$$\frac{1}{\lceil T/2\rceil}\sum_{t=T-\lceil T/2\rceil}^{T-1}\mathbb{E}\big(F(w_t)-F(w^*)\big) \le \frac{3C + 2G\sqrt{\frac{2G^2\beta^2}{(1-\beta)^2}+2U^2}}{\mu T},$$

where $C$ is the constant in Theorem 4. It implies the $O(1/(\mu T))$ convergence rate.

(convex case) Let $\gamma_t = \eta_t - \rho_t$ and let $z_t$ be defined as in Lemma 5. Assume $F$ is convex with $G$-bounded stochastic gradient, and $\mathbb{E}\|\tilde{u}_t\|^2 \le U^2$. By setting $\eta_t$ such that $\gamma_t = \frac{1}{\sqrt{t+1}}$, we have

$$\sum_{t=0}^{T-1}\frac{2}{\sqrt{t+1}}\,\mathbb{E}\big(F(w_t)-F(w^*)\big) \le \|w_0-w^*\|^2 + \sum_{t=0}^{T-1}\frac{C}{t+1},$$

where $C$ is the constant in Theorem 4. It implies the $O(\log T/\sqrt{T})$ convergence rate.

(non-convex case) Let $\gamma_t = \eta_t - \rho_t$ and let $z_t$ be defined as in Lemma 5. Assume $F$ is $L$-smooth with $G$-bounded stochastic gradient and $\mathbb{E}\|\tilde{u}_t\|^2 \le U^2$. By setting $\eta_t = \eta$ to be a constant, we have

$$\frac{1}{(1-\beta)T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{F(w_0)-F(w^*)}{T\eta} + C\eta,$$

where $C$ is a constant determined by $L$, $G$, $\beta$ and $U$. By taking $\eta = O(1/\sqrt{T})$, it is easy to get the $O(1/\sqrt{T})$ convergence rate.

If we design $g_{t,k}$ to be the plain stochastic batch gradient, which means $g_{t,k} = \frac{1}{pb}\sum_{\zeta_i\in\mathcal{I}_{t,k}}\nabla f(w_t;\zeta_i)$, it is a special case of M-DSGD with momentum obtained by setting $\beta = 0$. For the strong convex and smooth case, we get the $O(1/(\mu T))$ convergence rate. For the general convex case, we get the $O(\log T/\sqrt{T})$ convergence rate. For the non-convex and smooth case, by setting $\eta = O(1/\sqrt{T})$, we get the $O(1/\sqrt{T})$ convergence rate. Please note that (3)(4) can also be transformed to such a formulation, with $z_t = w_t - \eta_t\sum_{k=1}^{p} v_{t,k}$, so it has the same convergence rate.

## 6 M-DSGD meets stagewise learning

In the previous analysis of the convergence of M-DSGD, we need to set the learning rate to be a constant or to decay like $O(1/\sqrt{t})$. This is far from practice. In fact, we almost never use a single learning rate when training models. Usually, we decrease the learning rate after passing through the training data several times. For example, (He et al., 2016) divides the learning rate by 10 at 32k and 48k iterations when training resnet. What's more, many models include non-smooth operations, like ReLU and max pooling. This is also far from the convergence conditions in the previous theorems. Recently, (Chen et al., 2019) proposed stagewise learning to bridge the gap between theory and practice. In this section, we use stagewise learning for M-DSGD.

For convenience, we denote Algorithm 1 with a constant learning rate as $\mathcal{A}(\phi, w, \eta, \beta, T)$: $\phi$ is the function to optimize, $w$ is the initialization, $\eta$ is a constant learning rate, $\beta$ is the momentum scalar, and $T$ is the number of iterations. Then we have the following lemma: Let $\phi$ be convex with $G$-bounded stochastic gradient, let $\{w_t\}$ be the sequence produced by $\mathcal{A}(\phi, w, \eta, \beta, T)$, so that $w_0 = w$, and let $\tilde{w}^+ = \frac{1}{T}\sum_{t=0}^{T-1} w_t$. Let $\{z_t\}$ be the sequence transformed by Lemma 5. Assume $\mathbb{E}\|\tilde{u}_t\|^2 \le U^2$. Then we have

$$\mathbb{E}\,\phi(\tilde{w}^+) - \phi(w^*_\phi) \le \frac{1-\beta}{2\eta T}\|w - w^*_\phi\|^2 + C\eta,$$

where $w^*_\phi$ denotes the optimal solution of $\phi$ and $C$ is a constant of the same form as in Theorem 4. Here we only require $\phi$ to be convex, not necessarily strong convex.

Lemma 6 implies that M-DSGD with a constant learning rate satisfies the condition of stagewise learning (Theorem 1 in (Chen et al., 2019)). Moreover, the constant learning rate makes the transformation equation simple. Thus, we can use stagewise learning. We define

$$\tilde{w}_{s+1} = \mathcal{A}(F_{s,\gamma}(\cdot),\, \tilde{w}_s,\, \eta_s,\, \beta,\, T_s), \tag{15}$$

where

$$F_{s,\gamma}(w) = F(w) + \frac{1}{2\gamma}\|w - \tilde{w}_s\|^2. \tag{16}$$

Let $\{\tilde{w}_s\}$ be the sequence produced by (15), which means the output of stage $s$ is used as the initialization of stage $s+1$. If $F$ is $\mu$-weak convex, then $F_{s,\gamma}$ is $(\frac{1}{\gamma}-\mu)$-strong convex when $\gamma < \frac{1}{\mu}$. Hence, we can apply Lemma 6 to $F_{s,\gamma}$. Specifically, we have the following result:
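Stagewise M-DSGD per (15)-(16) can be sketched as follows (our illustration; the inner solver is a plain single-worker momentum SGD stand-in for Algorithm 1, and the objective, $\gamma$, and the stage schedule are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
X = rng.normal(size=(100, d))
w_true = rng.normal(size=d)
y = X @ w_true

def A(grad_fn, w0, eta, beta, T):
    """Constant-learning-rate inner solver standing in for Algorithm 1
    (momentum SGD; sparsification omitted for brevity)."""
    w, g = w0.copy(), np.zeros_like(w0)
    for t in range(T):
        i = rng.integers(len(X), size=8)       # mini-batch indices
        g = beta * g + grad_fn(w, i)
        w = w - eta * g
    return w

gamma, beta, eta0, S = 10.0, 0.9, 0.02, 5
w_s = np.zeros(d)
for s in range(S):
    ref = w_s.copy()
    # gradient of F_{s,gamma}(w) = F(w) + ||w - ref||^2 / (2 * gamma), cf. (16)
    grad_fn = lambda w, i, ref=ref: (
        X[i].T @ (X[i] @ w - y[i]) / len(i) + (w - ref) / gamma
    )
    w_s = A(grad_fn, w_s, eta0 / (s + 1), beta, 400)   # (15), eta cut by stage
```

Each stage solves a proximal-regularized subproblem with a constant learning rate, and only the stage index changes the learning rate, matching how schedules are used in practice.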

Assume $F$ is $\mu$-weak convex with $G$-bounded stochastic gradient and $\mathbb{E}\|\tilde{u}_t\|^2 \le U^2$. By setting $\gamma < \frac{1}{\mu}$ and choosing $\eta_s$ and $T_s$ appropriately ($\eta_s$ constant within stage $s$ and decreased by stage), we have

$$\frac{(1+\beta)\gamma}{4S(S+1)}\sum_{s=0}^{S-1}(s+1)\,\mathbb{E}\|\nabla F_\gamma(\tilde{w}_s)\|^2 \le \frac{F(\tilde{w}_0)-F(w^*)+3\hat{C}\eta_0}{S+1},$$

where $F_\gamma$ is the Moreau envelope defined as

$$F_\gamma(w) = \min_{w'} F(w') + \frac{1}{2\gamma}\|w - w'\|^2,$$

and $\hat{C}$ is a constant of the same form as $C$ in Lemma 6.

In both Lemma 6 and Theorem 6, we do not need the smoothness assumption for $F$ or $F_{s,\gamma}$. Hence, stagewise M-DSGD can solve a broad family of non-smooth and non-convex problems.

## 7 Choice of $m_{t,k}$

In the convergence theorems, we need $\mathbb{E}\|\tilde{u}_t\|^2$ to be bounded. Since $\tilde{u}_t = \sum_{k=1}^{p} u_{t,k}$, we only need each $\|u_{t,k}\|$ to be bounded. According to the update rule for $u_{t+1,k}$:

 gt,k= βgt−1,k+1pb∑