On the Convergence of Memory-Based Distributed SGD

05/30/2019
by Shen-Yi Zhao, et al.
Nanjing University

Distributed stochastic gradient descent (DSGD) has been widely used for optimizing large-scale machine learning models, including both convex and non-convex models. With the rapid growth of model size, the huge communication cost has become the bottleneck of traditional DSGD. Recently, many communication compression methods have been proposed. Memory-based distributed stochastic gradient descent (M-DSGD) is one of the most efficient among them, since each worker communicates only a sparse vector in each iteration, which keeps the communication cost small. Recent works establish the convergence rate of M-DSGD when it adopts vanilla SGD, but there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD. In this paper, we propose a universal convergence analysis for M-DSGD by introducing a transformation equation. The transformation equation describes the relation between traditional DSGD and M-DSGD, so that we can transform an M-DSGD procedure into its corresponding DSGD procedure. Hence we obtain the convergence rate of M-DSGD with momentum for both convex and non-convex problems. Furthermore, we combine M-DSGD with stagewise learning, in which the learning rate of M-DSGD in each stage is a constant and is decreased by stage instead of by iteration. Using the transformation equation, we establish the convergence rate of stagewise M-DSGD, which bridges the gap between theory and practice.



1 Introduction

Many machine learning models can be formulated as the following empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w; \xi_i), \qquad (1)$$

where $w$ denotes the model parameter, $\xi_i$ denotes the $i$-th training instance, $n$ is the number of training instances, and $d$ is the dimension of the model. SGD (Robbins and Monro, 1951) is one of the most efficient ways to solve the empirical risk minimization problem: in each iteration, $w$ is updated by $w_{t+1} = w_t - \eta_t g_t$, where $g_t$ is a stochastic gradient of $F$ at $w_t$ and $\eta_t$ is the learning rate. Compared to batch methods, like gradient descent, it only needs to calculate one stochastic gradient in each iteration.
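As a concrete illustration, here is a minimal SGD loop for a least-squares instance of (1); the data `X, y` and the step size are illustrative assumptions, not from the paper:

```python
import numpy as np

def sgd(X, y, lr=0.1, epochs=10, seed=0):
    """Minimal SGD for F(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (X[i] @ w - y[i]) * X[i]  # stochastic gradient at sample i
            w -= lr * g                   # w_{t+1} = w_t - eta * g_t
    return w
```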

With the rapid growth of data, using SGD on a single machine to solve the empirical risk minimization problem is time-consuming. Hence, distributed stochastic gradient descent (DSGD) has become the standard method, and many machine learning platforms (e.g., TensorFlow, PyTorch) adopt it. With $p$ workers, it can be summarized as

$$w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} g_{t,k}, \qquad (2)$$

where $g_{t,k}$ is the update vector calculated by the $k$-th worker and usually satisfies the unbiased estimation $\mathbb{E}[g_{t,k}] = \nabla F(w_t)$. Workers calculate the $g_{t,k}$ in parallel, and the model parameter is updated by the summation of these vectors with learning rate $\eta_t$.
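A single-process simulation of update rule (2) might look as follows; the gradient oracle `grad` and the per-worker data split are assumptions for illustration:

```python
import numpy as np

def dsgd_step(w, eta, worker_batches, grad):
    """One DSGD iteration: every worker computes a stochastic gradient
    on its own data partition, and the sum is applied to the model."""
    update = np.zeros_like(w)
    for batch in worker_batches:      # executed in parallel in a real system
        update += grad(w, batch)      # g_{t,k} from worker k
    return w - eta * update           # w_{t+1} = w_t - eta * sum_k g_{t,k}
```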

The convergence of DSGD is equivalent to that of SGD with a single worker, which attains the optimal rate $O(1/\sqrt{T})$ for non-convex problems and $O(1/T)$ for strongly convex problems (Dekel et al., 2012; Rakhlin et al., 2012; Li et al., 2014b). Besides convergence, communication is another important research topic in distributed optimization. Recently, more and more large models, like DenseNet (Huang et al., 2017) and BERT (Devlin et al., 2018), are used in machine learning. This leads to a huge communication cost which cannot be ignored. Hence, communication compression has attracted much attention as a way to further reduce training time.

One branch of this research area is low-precision representation (also called quantization). On modern hardware, a float number is represented with $32$ bits, so in DSGD, when one worker sends or receives a $d$-dimensional vector, the communication cost is $32d$ bits. For a vector $u$, low-precision representation methods quantize $u$ into a $b$-bit representation space ($b < 32$), denoted as $Q(u)$. It satisfies $\mathbb{E}[Q(u)] = u$, and the communication cost for $Q(u)$ is about $bd$ bits. Usually these methods need to divide the $d$ coordinates into different buckets, due to the quantization variance, and then quantize each bucket individually. Thus, the communication cost is $bd + 32B$ bits and the compression ratio is $(bd + 32B)/(32d)$, where $B$ is the number of buckets. It is easy to get that the compression ratio is larger than $b/32$.
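The following sketch illustrates this kind of bucketed stochastic quantization (QSGD-style); the bit width and bucket size are illustrative parameters, not values from the paper:

```python
import numpy as np

def quantize(u, bits=4, bucket_size=256, rng=None):
    """Bucketed b-bit stochastic quantization with E[Q(u)] = u."""
    rng = rng or np.random.default_rng()
    levels = 2 ** bits - 1
    q = np.empty_like(u, dtype=float)
    for start in range(0, len(u), bucket_size):
        v = u[start:start + bucket_size]
        s = np.max(np.abs(v)) + 1e-12        # per-bucket scale, sent as a 32-bit float
        p = np.abs(v) / s * levels           # map magnitudes into [0, levels]
        low = np.floor(p)
        # randomized rounding keeps the quantizer unbiased
        rounded = low + (rng.random(v.shape) < (p - low))
        q[start:start + bucket_size] = np.sign(v) * rounded * s / levels
    return q  # in a real system only signs, integer levels and scales are transmitted
```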

Another branch is sparse communication. For a vector $u$, these methods make it sparse, denoted as $S(u)$, so that workers only need to send sparse vectors, which reduces the communication cost efficiently. In (Wang et al., 2018; Wangni et al., 2018), a stochastic sparsification technique is used to obtain $S(u)$ with an unbiased guarantee, i.e., $\mathbb{E}[S(u)] = u$. Hence, these methods are mathematically equivalent to the quantization ones (Wang et al., 2018). (Aji and Heafield, 2017; Lin et al., 2018; Alistarh et al., 2018; Stich et al., 2018) propose novel sparse communication methods that use memory gradients. Compared to the previous ones, $S(u)$ is not necessarily an unbiased estimation of $u$: it contains only a few coordinates of $u$. After sending a sparse vector in each iteration, each worker stores the values which are not sent in a memory, i.e., $m \leftarrow u - S(u)$. The vector $m$ is called the memory gradient and will be used in the next iteration. These methods are called memory-based distributed stochastic gradient descent (M-DSGD). (Aji and Heafield, 2017; Alistarh et al., 2018; Stich et al., 2018) are mainly based on vanilla SGD: (Stich et al., 2018) proves the convergence rate for strongly convex problems, and (Alistarh et al., 2018) proves the convergence for both convex and non-convex problems. The convergence conditions are listed in Table 1. (Lin et al., 2018) adopts momentum SGD and gets better performance. Empirical results on cifar10 and imagenet show that it only needs to send an approximately $0.001d$-dimensional vector in each iteration without loss of generalization ability, which means the compression ratio is smaller than $1/100$ (Lin et al., 2018). This is far better than that of quantization. However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD.
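A minimal sketch of this memory mechanism with top-$k$ sparsification (one worker's view); `k` and the variable names are illustrative:

```python
import numpy as np

def sparsify_with_memory(g, memory, k):
    """Send the k largest-magnitude coordinates of (gradient + memory);
    keep the rest as the memory gradient for the next iteration."""
    u = g + memory
    idx = np.argpartition(np.abs(u), -k)[-k:]  # top-k coordinates
    sent = np.zeros_like(u)
    sent[idx] = u[idx]          # the sparse vector that is communicated
    return sent, u - sent       # (sent vector, new memory gradient)
```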

                          strongly convex   convex   non-convex   momentum
(Stich et al., 2018)            yes            -          -          no
(Alistarh et al., 2018)          -            yes        yes         no
Ours                            yes           yes        yes         yes
Table 1: Convergence conditions in related works.

In this paper, we focus on the convergence rate of M-DSGD with momentum. The main results and contributions are summarized below:

  • We propose the transformation equation for M-DSGD. It describes the relation between M-DSGD and traditional DSGD. According to the transformation equation, we can transform M-DSGD to its corresponding DSGD.

  • When M-DSGD adopts $\beta$-momentum SGD, we prove the convergence rate for both convex and non-convex problems. When the momentum scalar $\beta$ is $0$, M-DSGD degenerates to the vanilla-SGD method (Aji and Heafield, 2017), and we also get the convergence rate for this case.

  • We combine M-DSGD with stagewise learning (Chen et al., 2019), in which M-DSGD uses a constant learning rate in each stage and decreases it by stage, as is usually done in practice. By the transformation equation, we prove the convergence rate of stagewise M-DSGD for a broad family of non-smooth and non-convex problems, which bridges the gap between theory and practice.

2 Preliminary

In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm, use $w^*$ to denote the optimal solution of (1), use $g(w)$ to denote one stochastic gradient with respect to mini-batch samples such that $\mathbb{E}[g(w)] = \nabla F(w)$, use $\langle \cdot, \cdot \rangle$ to denote the dot product, and use $I$ to denote the identity matrix. For a vector $u$, we use $u^{(j)}$ to denote its $j$-th coordinate value. We make the following definitions:

(bounded gradient) $g(w)$ is the $G$-bounded ($G > 0$) stochastic gradient of function $F$ if it satisfies $\mathbb{E}\|g(w)\|^2 \leq G^2$, $\forall w$.

(smooth function) Function $F$ is $L$-smooth ($L > 0$) if $\|\nabla F(w) - \nabla F(w')\| \leq L\|w - w'\|$, or equivalently $F(w') \leq F(w) + \langle \nabla F(w), w' - w \rangle + \frac{L}{2}\|w' - w\|^2$.

(strongly convex function) Function $F$ is $\mu$-strongly convex ($\mu > 0$) if $F(w') \geq F(w) + \langle \nabla F(w), w' - w \rangle + \frac{\mu}{2}\|w' - w\|^2$.

(weakly convex function) Function $F$ is $\mu$-weakly convex ($\mu > 0$) if $F(w) + \frac{\mu}{2}\|w\|^2$ is convex.

The first three definitions are common in both convex and non-convex optimization. Throughout this paper, we assume that $F(w^*) > -\infty$.

Recently, the weak convexity property has attracted much attention in non-convex optimization (Allen-Zhu, 2018a, b; Chen et al., 2019). An $L$-smooth function must be $L$-weakly convex. For a $\mu$-weakly convex function, we can add one quadratic regularization term to make it convex, so that we can use convex optimization tools for weakly convex problems.
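As a small worked step (our illustration, not from the paper), adding the quadratic term $\frac{1}{2\alpha}\|w - w_0\|^2$ with $\alpha < 1/\mu$ in fact yields a strongly convex function:

```latex
% For a mu-weakly convex F, F(w) + (mu/2)||w||^2 is convex by definition. Then
\phi(w) = F(w) + \tfrac{1}{2\alpha}\|w - w_0\|^2
        = \underbrace{F(w) + \tfrac{\mu}{2}\|w\|^2}_{\text{convex}}
          + \tfrac{1}{2}\Big(\tfrac{1}{\alpha} - \mu\Big)\|w\|^2
          - \tfrac{1}{\alpha}\langle w_0, w\rangle + \tfrac{1}{2\alpha}\|w_0\|^2,
% so phi is (1/alpha - mu)-strongly convex whenever alpha < 1/mu,
% since the linear and constant terms do not affect convexity.
```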

3 Memory-based Distributed SGD

1:  Initialization: $p$ workers, initialization $w_0$, learning rate $\eta_t$, momentum scalar $\beta$, batch size $B$;
2:  Set $u_{-1,k} = 0$, $m_{0,k} = 0$ for each worker $k$;
3:  for $t = 0, 1, \ldots, T-1$ do
4:     for $k = 1, \ldots, p$, each worker parallel do
5:        Randomly pick one mini-batch of training data $\mathcal{B}_{t,k}$ with $|\mathcal{B}_{t,k}| = B$;
6:        Calculate the stochastic gradient $g_{t,k} = \frac{1}{B}\sum_{\xi \in \mathcal{B}_{t,k}} \nabla f(w_t; \xi)$;
7:        $u_{t,k} = \beta u_{t-1,k} + g_{t,k}$;
8:        Generate a sparse vector $v_{t,k}$ from $u_{t,k} + m_{t,k}$;
9:        Send $v_{t,k}$;
10:       $m_{t+1,k} = u_{t,k} + m_{t,k} - v_{t,k}$;
11:    end for
12:    Aggregate: $v_t = \sum_{k=1}^{p} v_{t,k}$;
13:    Update parameter: $w_{t+1} = w_t - \eta_t v_t$;
14: end for
Algorithm 1 Memory-based Distributed SGD (with momentum)

Assuming we have $p$ workers, memory-based DSGD is presented in Algorithm 1. It can be implemented on many distributed frameworks, such as all-reduce and Parameter Server (Li et al., 2014a). Data are divided into $p$ partitions and stored on the workers. Each worker calculates its update vector $v_{t,k}$. After aggregating the update vectors into $v_t$, the algorithm updates the parameter $w_t$. Since each $v_{t,k}$ is sparse, $v_t$ is sparse as well, so M-DSGD can reduce the communication cost. Besides, each worker stores the coordinates which are not sent, denoted as $m_{t,k}$. This is called the memory gradient. In some related works (Aji and Heafield, 2017; Alistarh et al., 2018), it is also called residuals or accumulated error.
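A single-process sketch of one iteration of Algorithm 1, assuming top-$k$ sparsification as the compressor; `grad` is a placeholder gradient oracle and all names are illustrative:

```python
import numpy as np

def topk(u, k):
    idx = np.argpartition(np.abs(u), -k)[-k:]
    sent = np.zeros_like(u)
    sent[idx] = u[idx]
    return sent

def mdsgd_step(w, workers, grad, eta, beta, k):
    """One iteration of Algorithm 1. Each entry of `workers` holds that
    worker's momentum u, memory m, and local data partition."""
    v_total = np.zeros_like(w)
    for s in workers:                       # in parallel in a real system
        g = grad(w, s["data"])              # line 6: stochastic gradient
        s["u"] = beta * s["u"] + g          # line 7: momentum update
        v = topk(s["u"] + s["m"], k)        # lines 8-9: sparse vector to send
        s["m"] = s["u"] + s["m"] - v        # line 10: store what was not sent
        v_total += v                        # line 12: aggregate
    return w - eta * v_total                # line 13: parameter update
```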

3.1 Relation to Existing Sparse Communication Methods

Assume we have got $w_t$. The update rule of M-DSGD can be written as

$$u_{t,k} = \beta u_{t-1,k} + g_{t,k}, \quad w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} S(u_{t,k} + m_{t,k}), \quad m_{t+1,k} = u_{t,k} + m_{t,k} - S(u_{t,k} + m_{t,k}),$$

where $S(\cdot)$ denotes the sparsification operation. The method in (Aji and Heafield, 2017) is a special case of M-DSGD obtained by setting $\beta = 0$, which means (Aji and Heafield, 2017) adopts vanilla SGD.

(Alistarh et al., 2018) and (Stich et al., 2018) also use the memory gradient to make the communication sparse, and they also use vanilla SGD. Compared to M-DSGD with $\beta = 0$, the difference is that their memory gradient contains the learning rate $\eta_t$. Their update rules for $w_t$ and $m_{t,k}$ can be written as:

$$w_{t+1} = w_t - \sum_{k=1}^{p} S(\eta_t g_{t,k} + m_{t,k}), \qquad (3)$$

$$m_{t+1,k} = (\eta_t g_{t,k} + m_{t,k}) - S(\eta_t g_{t,k} + m_{t,k}). \qquad (4)$$

We observe that the update rule for $w_t$ has the same form as that of M-DSGD. The difference lies in the memory gradient $m_{t,k}$. In most convergence analyses for vanilla SGD, the learning rate is a constant or non-increasing. If $\eta_t$ is a constant, then (3)(4) is totally the same as M-DSGD. If $\eta_t$ is a non-increasing sequence, then on the one hand, in (3), the memory gradient accumulates stale gradients scaled by earlier learning rates $\eta_s \geq \eta_t$. In the later convergence analysis, we will see that we should make the memory gradient norm as small as possible, so these larger scalars are unnecessary and can be dropped. On the other hand, from the asynchronous-updating point of view (Lin et al., 2018), M-DSGD is more reasonable in that $m_{t,k}$ should not contain the learning rate. Since the memory gradient denotes stale information, we should apply the current $\eta_t$, which is smaller than the earlier $\eta_s$, to $m_{t,k}$ when we use it to obtain $w_{t+1}$.

(Lin et al., 2018) is the first work that adopts momentum SGD in M-DSGD. It uses a trick called momentum factor masking. Its update rule can be written as

$$u_{t,k} = \beta u_{t-1,k} + g_{t,k}, \quad v_{t,k} = S(u_{t,k} + m_{t,k}), \quad m_{t+1,k} = u_{t,k} + m_{t,k} - v_{t,k}, \quad w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} v_{t,k}.$$

After getting $v_{t,k}$, each worker applies the same sparsity pattern to $u_{t,k}$, zeroing the coordinates of $u_{t,k}$ that have just been sent. (Lin et al., 2018) considers the algorithm as a kind of asynchronous momentum SGD, where momentum factor masking can overcome the staleness effect. However, the sparsity pattern is designed mainly based on $u_{t,k} + m_{t,k}$; it has nothing to do with $u_{t,k}$ alone. The empirical results in (Lin et al., 2018) on cifar10 using resnet110 show that the effect of momentum factor masking on top-1 accuracy is marginal.
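A sketch of one worker's step with momentum factor masking, in the spirit of (Lin et al., 2018); top-$k$ is assumed as the sparsifier and all names are illustrative:

```python
import numpy as np

def dgc_worker_step(g, u, m, k, beta):
    """Momentum update, memory-based top-k sparsification, then
    momentum factor masking at the coordinates that were sent."""
    u = beta * u + g
    v = u + m
    idx = np.argpartition(np.abs(v), -k)[-k:]   # coordinates to send
    sent = np.zeros_like(v)
    sent[idx] = v[idx]
    m = v - sent          # memory gradient: what was not sent
    u[idx] = 0.0          # momentum factor masking against staleness
    return sent, u, m
```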

4 Transformation Equation

For convenience, we define a diagonal matrix $C_{t,k}$ whose diagonal entries are in $\{0, 1\}$ to replace the sparsification symbol $S(\cdot)$. Then the update rule of Algorithm 1 can be written as

$$w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} C_{t,k}(u_{t,k} + m_{t,k}), \qquad (5)$$

$$m_{t+1,k} = (I - C_{t,k})(u_{t,k} + m_{t,k}). \qquad (6)$$

According to (5) and (6), we can eliminate $C_{t,k}$ and obtain

$$w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} (u_{t,k} + m_{t,k} - m_{t+1,k}). \qquad (7)$$

First, we consider the simplest case, $\beta = 0$, to show the relation between traditional DSGD and M-DSGD. For convenience, we denote $M_t = \sum_{k=1}^{p} m_{t,k}$, which satisfies $M_0 = 0$. According to equation (7), we set $z_t = w_t - \eta_t M_t$ and obtain

$$z_{t+1} = z_t - \eta_t \sum_{k=1}^{p} g_{t,k} + (\eta_t - \eta_{t+1}) M_{t+1}. \qquad (8)$$

According to equation (8), we observe that:

  • for the term $z_t - \eta_t \sum_{k} g_{t,k}$, it is the update rule for $z_t$ in traditional DSGD;

  • for the term $\eta_t M_t = w_t - z_t$, if $F$ is smooth, we have $\|\nabla F(w_t) - \nabla F(z_t)\| \leq L\|w_t - z_t\| = L \eta_t \|M_t\|$;

  • for the term $(\eta_t - \eta_{t+1}) M_{t+1}$, if $\lim_{t \to \infty} \eta_t$ exists, then $\eta_t - \eta_{t+1} \to 0$.

This implies that, under certain assumptions, when we transform one traditional DSGD procedure with initialization $z_0$ and learning rate $\eta_t$ into M-DSGD, it is equivalent to adding one small noise term scaled by $\eta_t - \eta_{t+1}$ in each iteration. To the best of our knowledge, for most DSGD variants with convergence guarantees, the learning rate satisfies the condition that $\lim_{t \to \infty} \eta_t$ exists; for example, $\eta_t$ is a constant or $\eta_t = O(1/t)$. When the noise $M_{t+1}$ is bounded, after being scaled by $\eta_t - \eta_{t+1}$ it will not affect the convergence of $z_t$. What's more, when $\eta_t$ is small, $w_t$ will get close to $z_t$, which means $w_t$ converges at the same time. Now we can conclude that both $z_t$ and $w_t$ converge to the optimal solution.
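For completeness, here is the one-line telescoping behind (8) (our restatement of the step, with $\beta = 0$ so that $u_{t,k} = g_{t,k}$):

```latex
z_{t+1} = w_{t+1} - \eta_{t+1} M_{t+1}
        \overset{(7)}{=} w_t - \eta_t \sum_{k=1}^{p} g_{t,k} - \eta_t M_t
          + \eta_t M_{t+1} - \eta_{t+1} M_{t+1}
        = z_t - \eta_t \sum_{k=1}^{p} g_{t,k} + (\eta_t - \eta_{t+1}) M_{t+1}.
```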

In fact, equation (8) transforms the update rule of $w_t$ into that of $z_t$, and we get the convergence of $w_t$ by benefitting from the update rule of $z_t$. For general $\beta$, we have the following theorem:

(Transformation) Let $\{w_t\}$ be a sequence generated by Algorithm 1 with learning rate $\eta_t$. We set $z_t$ to be some linear combination of $w_t$, $\{m_{t,k}\}$ and $\{u_{t-1,k}\}$, and define $\tilde{\eta}_t = \psi(\eta_t)$, where $\psi$ is some function. Assume the following conditions hold:

$$\mathbb{E}\|w_t - z_t\|^2 = O(\tilde{\eta}_t^2), \qquad (9)$$

$$z_{t+1} = z_t - \tilde{\eta}_t \sum_{k=1}^{p} g_{t,k}, \qquad (10)$$

where $\mathbb{E}[g_{t,k}] = \nabla F(w_t)$. Then we call (10) the transformation equation. If $F$ is $L$-smooth with $G$-bounded stochastic gradient, we have

$$\frac{1}{\sum_{t=0}^{T}\tilde{\eta}_t}\sum_{t=0}^{T} \tilde{\eta}_t \, \mathbb{E}\|\nabla F(w_t)\|^2 \leq O\!\left(\frac{1 + \sum_{t=0}^{T}\tilde{\eta}_t^2}{\sum_{t=0}^{T}\tilde{\eta}_t}\right),$$

where the hidden constants depend on $L$, $G$ and $p$.

In Theorem 4, we only need $z_t$ to be a linear combination of $w_t$, $\{m_{t,k}\}$ and $\{u_{t-1,k}\}$, so it is easy to construct such a $z_t$. Although $g_{t,k}$ is an unbiased estimation of the full gradient at $w_t$ rather than at $z_t$, benefitting from (9), which implies that $z_t$ and $w_t$ are close enough and that $\|w_t - z_t\|^2$ is of the same order of magnitude as the variance of $g_{t,k}$, (10) can be seen as updating $z_t$ by DSGD with learning rate $\tilde{\eta}_t$. Hence, (10) transforms the update rule of $w_t$ into that of $z_t$, and we call it the transformation equation. It describes the relation between M-DSGD and traditional DSGD. If

$$\lim_{T \to \infty} \frac{1 + \sum_{t=0}^{T}\tilde{\eta}_t^2}{\sum_{t=0}^{T}\tilde{\eta}_t} = 0, \qquad (11)$$

then we can randomly choose $\hat{w}$ from $\{w_0, \ldots, w_T\}$ with probability $P(\hat{w} = w_t) = \tilde{\eta}_t / \sum_{s=0}^{T} \tilde{\eta}_s$, and get that $\mathbb{E}\|\nabla F(\hat{w})\|^2 \to 0$.

5 Convergence

In this section, we prove the convergence of M-DSGD with $\beta$-momentum for both convex and non-convex problems. For convenience, we denote $M_t = \sum_{k=1}^{p} m_{t,k}$ and $U_t = \sum_{k=1}^{p} u_{t,k}$. Then according to (7), we have the update rule for $w_t$:

$$w_{t+1} = w_t - \eta_t (U_t + M_t - M_{t+1}), \qquad (12)$$

where $U_t = \beta U_{t-1} + G_t$ and $G_t = \sum_{k=1}^{p} g_{t,k}$.

According to Theorem 4, our main task is establishing the transformation equation.

Let $\tilde{\eta}_t = \eta_t / (1 - \beta)$. By setting

$$z_t = w_t - \eta_t M_t - \frac{\beta \eta_t}{1 - \beta} U_{t-1}, \qquad (13)$$

where $U_{-1} = 0$ and $M_0 = 0$, we have

$$z_{t+1} = z_t - \tilde{\eta}_t G_t + (\eta_t - \eta_{t+1})\left(M_{t+1} + \frac{\beta}{1 - \beta} U_t\right). \qquad (14)$$
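The lemma can be checked directly from (12); a compressed version of the calculation (our restatement) is:

```latex
z_{t+1} - z_t
  = (w_{t+1} - w_t) - \eta_{t+1} M_{t+1} + \eta_t M_t
    - \tfrac{\beta}{1-\beta}\big(\eta_{t+1} U_t - \eta_t U_{t-1}\big)
  = -\tfrac{\eta_t}{1-\beta}\big(U_t - \beta U_{t-1}\big)
    + (\eta_t - \eta_{t+1})\big(M_{t+1} + \tfrac{\beta}{1-\beta} U_t\big)
  = -\tilde{\eta}_t G_t
    + (\eta_t - \eta_{t+1})\big(M_{t+1} + \tfrac{\beta}{1-\beta} U_t\big).
```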

Lemma 5 gives the transformation equation of M-DSGD with $\beta$-momentum. We can see that $z_t$ is a linear combination of $w_t$, $M_t$ and $U_{t-1}$. Since $\mathbb{E}\|U_t\| \leq G/(1-\beta)$ and $M_t$ is bounded (see Section 7), it is easy to make condition (9) hold. We should design $\eta_t$ carefully. We propose two strategies:

  • $\eta_t$ is a small constant $\eta$, then $\tilde{\eta}_t = \eta/(1-\beta)$, and the noise term in (14) vanishes since $\eta_t - \eta_{t+1} = 0$;

  • $\eta_t$ is non-increasing with $\eta_t - \eta_{t+1} = O(\eta_t^2)$, then $\tilde{\eta}_t = \eta_t/(1-\beta)$, and the noise term in (14) is of order $O(\eta_t^2)$.

It is easy to verify that both of the two strategies satisfy (9) and the learning rate condition (11). Specifically, we have the following lemma: Assume $F$ has the $G$-bounded stochastic gradient, and let $z_t$ be defined as in Lemma 5. Then we have $\mathbb{E}\|U_t\| \leq G/(1-\beta)$,

and if $\eta_t$ is defined by one of the above two strategies, we have $\mathbb{E}\|w_t - z_t\|^2 = O(\tilde{\eta}_t^2)$.

Then we get the following convergence rate of M-DSGD:

(strongly convex case) Let $z_t$ be defined as in Lemma 5. Assume $F$ is $L$-smooth and $\mu$-strongly convex with $G$-bounded stochastic gradient. By setting $\eta_t = O(1/(\mu t))$, M-DSGD with momentum achieves the $O(1/T)$ convergence rate.

(convex case) Let $z_t$ be defined as in Lemma 5. Assume $F$ is convex with $G$-bounded stochastic gradient. By setting $\eta_t = O(1/\sqrt{T})$, we obtain a bound on $\mathbb{E}[F(\hat{w})] - F(w^*)$ that implies the $O(1/\sqrt{T})$ convergence rate.

(non-convex case) Let $z_t$ be defined as in Lemma 5. Assume $F$ is $L$-smooth with $G$-bounded stochastic gradient. By taking $\eta_t = O(1/\sqrt{T})$, it is easy to get the $O(1/\sqrt{T})$ convergence rate for $\mathbb{E}\|\nabla F(\hat{w})\|^2$.

If we set $u_{t,k}$ to be the stochastic mini-batch gradient, which means $u_{t,k} = g_{t,k}$, we recover the special case of M-DSGD with momentum scalar $\beta = 0$. For the strongly convex and smooth case, by setting $\eta_t = O(1/(\mu t))$, we get the $O(1/T)$ convergence rate. For the general convex case, by setting $\eta_t = O(1/\sqrt{T})$, we get the $O(1/\sqrt{T})$ convergence rate. For the non-convex and smooth case, by setting $\eta_t = O(1/\sqrt{T})$, we get the $O(1/\sqrt{T})$ convergence rate. Please note that (3)(4) can also be transformed into such a formulation: $z_{t+1} = z_t - \eta_t \sum_{k=1}^{p} g_{t,k}$, where $z_t = w_t - \sum_{k=1}^{p} m_{t,k}$. So it has the same convergence rate.
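The claim for (3)(4) follows the same telescoping pattern; our quick check:

```latex
z_{t+1} = w_{t+1} - \textstyle\sum_{k} m_{t+1,k}
        \overset{(3)(4)}{=} w_t - \textstyle\sum_{k}\big(\eta_t g_{t,k} + m_{t,k}\big)
        = z_t - \eta_t \textstyle\sum_{k} g_{t,k}.
```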

6 M-DSGD meets stagewise learning

In the previous analysis of the convergence of M-DSGD, we need to set the learning rate to be a constant or a decreasing sequence like $O(1/t)$. This is far from practice. In fact, in practice we neither keep a single constant learning rate throughout training nor decay it at every iteration. Usually, we decrease the learning rate after passing through the training data several times. For example, (He et al., 2016) divides the learning rate by $10$ at 32k and 48k iterations when training resnet. What's more, many models include non-smooth operations, like ReLU and max pooling. This is also far from the convergence conditions in the previous theorems. Recently, (Chen et al., 2019) proposed stagewise learning to bridge this gap between theory and practice. In this section, we apply stagewise learning to M-DSGD.

For convenience, we denote Algorithm 1 with a constant learning rate as M-DSGD$(F, w_0, \eta, \beta, T)$: $F$ is the function to optimize, $w_0$ is the initialization, $\eta$ is a constant learning rate, $\beta$ is the momentum scalar, and $T$ is the number of iterations. Then we have the following lemma: Let $\tilde{\eta} = \eta/(1-\beta)$. We define $\{w_t\}$ to be the sequence produced by M-DSGD$(F, w_0, \eta, \beta, T)$ and $\hat{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$. Let $\{z_t\}$ be the sequence transformed by Lemma 5. Assume $F$ is convex with $G$-bounded stochastic gradient. Then for any $w$ we have

$$\mathbb{E}[F(\hat{w})] - F(w) \leq \frac{\|w - w_0\|^2}{2 \tilde{\eta} T} + c\, \tilde{\eta} G^2,$$

where $\tilde{\eta} = \eta/(1-\beta)$ and $c > 0$ is a constant depending on $\beta$ and $p$. Here we set $F$ to be convex but not necessarily strongly convex.

Lemma 6 implies that M-DSGD with a constant learning rate satisfies the condition of stagewise learning (Theorem 1 in (Chen et al., 2019)), and the constant learning rate makes the transformation equation simple. Thus, we can use stagewise learning. We define stagewise M-DSGD as

$$w_s = \text{M-DSGD}(\phi_s, w_{s-1}, \eta_s, \beta, T_s), \qquad (15)$$

where

$$\phi_s(w) = F(w) + \frac{1}{2\alpha}\|w - w_{s-1}\|^2. \qquad (16)$$

Let $\{w_s\}$ be the sequence produced by (15), which means the output of stage $s$ is the initialization of stage $s+1$. If $F$ is $\mu$-weakly convex, then $\phi_s$ is $(1/\alpha - \mu)$-strongly convex when $\alpha < 1/\mu$. Hence, we can apply Lemma 6 to $\phi_s$.
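A sketch of the stagewise driver (15)-(16), assuming `mdsgd` implements Algorithm 1 with a constant learning rate (e.g., built from `mdsgd_step` above); the schedules `eta0 / s` and `T0 * s` are illustrative assumptions:

```python
import numpy as np

def stagewise_mdsgd(grad_F, w0, alpha, beta, k, mdsgd, stages=10, eta0=0.1, T0=100):
    """Stagewise M-DSGD: each stage minimizes the regularized objective
    phi_s(w) = F(w) + ||w - w_ref||^2 / (2 * alpha) with a constant step."""
    w = np.array(w0, dtype=float)
    for s in range(1, stages + 1):
        w_ref = w.copy()
        # stochastic gradient of phi_s at w for a mini-batch
        grad_phi = lambda w_, batch, w_ref=w_ref: grad_F(w_, batch) + (w_ - w_ref) / alpha
        w = mdsgd(grad_phi, w, eta=eta0 / s, beta=beta, T=T0 * s, k=k)
    return w
```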

Specifically, we have the following result: Assume $F$ is $\mu$-weakly convex with $G$-bounded stochastic gradient, and let $\alpha < 1/\mu$. By setting $\eta_s = O(1/s)$ and $T_s = O(s)$, stagewise M-DSGD converges at the rate $O(1/S)$ in terms of the gradient of the Moreau envelope of $F$, where $S$ is the number of stages and the output $\hat{w}$ is chosen randomly from the stage outputs $\{w_1, \ldots, w_S\}$.

In both Lemma 6 and Theorem 6, we do not need the smoothness assumption on $F$ or $\phi_s$. Hence, stagewise M-DSGD can solve a broad family of non-smooth and non-convex problems.

7 Choice of $C_{t,k}$

In the convergence theorems, we need the memory $M_t$ to be bounded. Since $M_t = \sum_{k=1}^{p} m_{t,k}$, for $M_t$ to be bounded we only need each $m_{t,k}$ to be bounded. According to the update rule for $m_{t,k}$: