1 Introduction
Many machine learning models can be formulated as the following empirical risk minimization problem:
(1) $\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w),$
where $w \in \mathbb{R}^d$ denotes the model parameter, $f_i$ denotes the loss on the $i$th training instance, $n$ is the number of training instances, and $d$ is the size of the model. SGD (Robbins and Monro, 1951) is one of the most efficient ways to solve the empirical risk minimization problem. In each iteration, $w$ is updated by $w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t)$, where $i_t$ is sampled uniformly from $\{1, \dots, n\}$ and $\eta_t$ is the learning rate. Compared to batch methods, like gradient descent, it only needs to calculate one gradient in each iteration.
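As a concrete illustration, a single-machine SGD loop on a least-squares objective might look like the following sketch (the quadratic loss, step size, and variable names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                      # noiseless labels for simplicity

def stochastic_grad(w, i):
    # Gradient of the i-th term f_i(w) = 0.5 * (x_i^T w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
eta = 0.05                          # constant learning rate
for t in range(2000):
    i = rng.integers(n)             # sample one training instance
    w = w - eta * stochastic_grad(w, i)   # SGD update with a single gradient

print(np.linalg.norm(w - w_true))   # should be small
```

Each iteration touches a single example, which is the per-iteration cost advantage over full gradient descent mentioned above.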
With the rapid growth of data, using SGD alone to solve the empirical risk minimization problem is time-consuming. Hence, distributed stochastic gradient descent (DSGD) has become the method of choice, and many machine learning platforms (e.g., TensorFlow, PyTorch) adopt it. With $p$ workers, it can be summarized as
(2) $w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} g_t^{(k)},$
where $g_t^{(k)}$ is the update vector calculated by the $k$th worker and usually satisfies the unbiased-estimation property $\mathbb{E}[\sum_{k=1}^{p} g_t^{(k)}] = \nabla f(w_t)$. Workers calculate the $g_t^{(k)}$ in parallel, and the model parameter is updated by the summation of these vectors with learning rate $\eta_t$. The convergence of DSGD is equivalent to that of SGD with a single worker, which has the optimal rate $O(1/\sqrt{T})$ for nonconvex problems and $O(1/T)$ for strongly convex problems (Dekel et al., 2012; Rakhlin et al., 2012; Li et al., 2014b). Besides, communication is another important research area in distributed optimization. Recently, larger and larger models, like DenseNet (Huang et al., 2017) and BERT (Devlin et al., 2018), have been used in machine learning, which leads to a huge communication cost that cannot be ignored. Hence, communication compression has attracted much attention as a way to further reduce training time.
One branch of this research area is low-precision representation (also called quantization). On modern hardware, a float number is represented with 32 bits, so in DSGD, when one worker sends or receives a $d$-dimensional vector, the communication cost is $32d$ bits. For a vector $x \in \mathbb{R}^d$, low-precision representation methods quantize $x$ into a $b$-bit representation space, denoted $Q(x)$. It satisfies $\mathbb{E}[Q(x)] = x$, and the communication cost for $Q(x)$ is $bd$ bits. Usually these methods need to divide the $d$ coordinates into different buckets to control the quantization variance and then quantize each bucket individually. Thus, the communication cost is $bd + 32B$ bits and the compression ratio is $(bd + 32B)/(32d)$, where $B$ is the number of buckets. It is easy to see that this ratio is at least $b/32$.

Another branch is sparse communication. For a vector $x \in \mathbb{R}^d$, these methods make it sparse, denoted $S(x)$, so that workers only need to send sparse vectors, which reduces the communication cost efficiently. In (Wang et al., 2018; Wangni et al., 2018), a stochastic sparsification technique is used to obtain $S(x)$ with an unbiasedness guarantee, i.e., $\mathbb{E}[S(x)] = x$. Hence, these methods are mathematically equivalent to quantization ones (Wang et al., 2018). (Aji and Heafield, 2017; Lin et al., 2018; Alistarh et al., 2018; Stich et al., 2018) propose novel sparse communication methods using memory gradient. In contrast to the previous ones, $S(x)$ is not necessarily an unbiased estimate of $x$; it contains only a few coordinates of $x$. After sending a sparse vector in each iteration, each worker stores the values that are not sent in the memory, i.e., $m_{t+1} = x_t - S(x_t)$. The $m_t$ are called memory gradient and will be used in the next iteration. These methods are called memory-based distributed stochastic gradient descent (MDSGD). (Aji and Heafield, 2017; Alistarh et al., 2018; Stich et al., 2018) are mainly based on vanilla SGD. (Alistarh et al., 2018) proves the convergence rate for convex problems and (Stich et al., 2018) proves convergence for both convex and nonconvex problems. Their convergence conditions are listed in Table 1. (Lin et al., 2018) adopts momentum SGD and gets better performance. Empirical results on CIFAR-10 and ImageNet show that they only need to send a vector containing a small fraction of the coordinates in each iteration without loss of generalization (Lin et al., 2018). The resulting compression ratio is far better than that of quantization. However, there is still a lack of convergence theory for MDSGD when it adopts momentum SGD.

Table 1: Convergence conditions of memory-based methods.

Method                  | strongly convex | convex | nonconvex | momentum
(Stich et al., 2018)    |                 | yes    | yes       | no
(Alistarh et al., 2018) |                 | yes    |           | no
Ours                    | yes             | yes    | yes       | yes
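The memory-gradient mechanism can be sketched for a single worker as follows (a minimal sketch; the top-$k$ selection rule and all names are our assumptions for illustration):

```python
import numpy as np

def topk_mask(v, k):
    """0/1 mask keeping the k largest-magnitude coordinates of v."""
    mask = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    mask[idx] = 1.0
    return mask

d, k = 10, 2
rng = np.random.default_rng(1)
memory = np.zeros(d)          # memory gradient m_t, initialized to zero

for t in range(5):
    g = rng.normal(size=d)    # stochastic gradient of this iteration
    v = memory + g            # add back what was not sent before
    mask = topk_mask(v, k)
    sent = mask * v           # sparse vector actually communicated
    memory = v - sent         # coordinates not sent are kept as memory
    # `sent` has only k nonzeros, so only k values need transmitting
    assert np.count_nonzero(sent) <= k
```

Note that `sent` is a biased, sparse estimate of `v`; the bias is exactly what the memory carries over to the next iteration.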
In this paper, we focus on the convergence rate of MDSGD with momentum. The main results and contributions are summarized below:

We propose the transformation equation for MDSGD. It describes the relation between MDSGD and traditional DSGD. Using the transformation equation, we can transform MDSGD into its corresponding DSGD.

When MDSGD adopts momentum SGD, we prove the convergence rate for both convex and nonconvex problems. When the momentum scalar $\beta = 0$, the method degenerates to vanilla SGD (Aji and Heafield, 2017), and we also obtain the corresponding convergence rate.

We combine MDSGD with stagewise learning (Chen et al., 2019): MDSGD uses a constant learning rate in each stage and decreases it stage by stage, which is the schedule usually adopted in practice. By the transformation equation, we prove the convergence rate of stagewise MDSGD for a broad family of nonsmooth and nonconvex problems, which bridges the gap between theory and practice.
2 Preliminary
In this paper, we use $\|\cdot\|$ to denote the $\ell_2$ norm, $w^*$ to denote the optimal solution of (1), $g(w)$ to denote a stochastic gradient with respect to mini-batch samples such that $\mathbb{E}[g(w)] = \nabla f(w)$, $\langle \cdot, \cdot \rangle$ to denote the dot product, and $I$ to denote the identity matrix. For a vector $x$, we use $x_j$ to denote its $j$th coordinate value. We make the following definitions:

(Bounded gradient) $g$ is a bounded ($G$) stochastic gradient of function $f$ if it satisfies $\mathbb{E}[g(w)] = \nabla f(w)$ and $\mathbb{E}\|g(w)\|^2 \le G^2$, $\forall w$.

(Smooth function) Function $f$ is smooth ($L$) if $\|\nabla f(w) - \nabla f(w')\| \le L\|w - w'\|$, $\forall w, w'$, or equivalently $f(w) \le f(w') + \langle \nabla f(w'), w - w' \rangle + \frac{L}{2}\|w - w'\|^2$.

(Strongly convex function) Function $f$ is strongly convex ($\mu$) if $f(w) \ge f(w') + \langle \nabla f(w'), w - w' \rangle + \frac{\mu}{2}\|w - w'\|^2$, $\forall w, w'$.

(Weakly convex function) Function $f$ is weakly convex ($\mu$) if $f(w) \ge f(w') + \langle \nabla f(w'), w - w' \rangle - \frac{\mu}{2}\|w - w'\|^2$, $\forall w, w'$.

The first three definitions are common in both convex and nonconvex optimization. Throughout this paper, we assume that $f$ is bounded below.

Recently, the weak convexity property has attracted much attention in nonconvex optimization (Allen-Zhu, 2018a, b; Chen et al., 2019). Every smooth function is weakly convex. For a weakly convex function, we can add a quadratic regularization term to make it convex, so that convex optimization tools can be applied to weakly convex problems.
3 Memorybased Distributed SGD
Assuming we have $p$ workers, memory-based DSGD is presented in Algorithm 1. It can be implemented on many distributed platforms, such as all-reduce and Parameter Server (Li et al., 2014a). The data are divided into $p$ partitions and stored on the workers. Each worker calculates its update vector. After aggregating the update vectors, the algorithm updates the parameter $w$. Since each communicated vector is sparse, the aggregated update is sparse as well, so MDSGD can reduce the communication cost. Besides, each worker stores the coordinates that are not sent, denoted $m_t^{(k)}$; this is called the memory gradient. In some related work (Aji and Heafield, 2017; Alistarh et al., 2018), it is also called residuals or accumulated error.
3.1 Relation to Existing Sparse Communication Methods
Assume we have obtained the momentum term $u_t^{(k)}$ on each worker. The update rule of MDSGD can then be written as
$u_t^{(k)} = \beta u_{t-1}^{(k)} + g_t^{(k)}, \quad w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} \mathrm{sparse}(m_t^{(k)} + u_t^{(k)}), \quad m_{t+1}^{(k)} = (m_t^{(k)} + u_t^{(k)}) - \mathrm{sparse}(m_t^{(k)} + u_t^{(k)}),$
where $\beta$ is the momentum scalar. The method in (Aji and Heafield, 2017) is a special case of MDSGD obtained by setting $\beta = 0$, which means (Aji and Heafield, 2017) adopts vanilla SGD.
(Alistarh et al., 2018) and (Stich et al., 2018) also use the memory gradient to make the communication sparse. Their update rules can be written as
$w_{t+1} = w_t - \sum_{k=1}^{p} \mathrm{sparse}(m_t^{(k)} + \eta_t g_t^{(k)}), \quad m_{t+1}^{(k)} = (m_t^{(k)} + \eta_t g_t^{(k)}) - \mathrm{sparse}(m_t^{(k)} + \eta_t g_t^{(k)}).$
We can see that they also use vanilla SGD. Compared to MDSGD with $\beta = 0$, the difference is that their memory gradient contains the learning rate $\eta_t$. By setting $\tilde{m}_t^{(k)} = m_t^{(k)} / \eta_t$, we rewrite the update rules for $\tilde{m}_t^{(k)}$ and $w_t$ as:
(3) $\tilde{m}_{t+1}^{(k)} = \frac{\eta_t}{\eta_{t+1}} \left[ (\tilde{m}_t^{(k)} + g_t^{(k)}) - \mathrm{sparse}(\tilde{m}_t^{(k)} + g_t^{(k)}) \right],$
(4) $w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} \mathrm{sparse}(\tilde{m}_t^{(k)} + g_t^{(k)}).$
We observe that the update rule (4) for $w_t$ is the same as that of MDSGD. The difference lies in (3). In most convergence analyses for vanilla SGD, the learning rate is a constant or nonincreasing. If $\eta_t$ is a constant, then the method is exactly the same as MDSGD. If $\eta_t$ is a nonincreasing sequence, then on the one hand, the factor $\eta_t/\eta_{t+1} \ge 1$ in (3) enlarges the memory gradient. In the later convergence analysis, we will see that we should make the memory gradient norm as small as possible, so this scalar is unnecessary and can be dropped. On the other hand, from the asynchronous-updating point of view (Lin et al., 2018), MDSGD is more reasonable in that the memory gradient should not contain the learning rate: since the memory gradient denotes stale information, we should apply the current learning rate $\eta_t$, which is smaller than the earlier ones, to the memory when using it to obtain $w_{t+1}$.
(Lin et al., 2018) is the first work that adopts momentum SGD in MDSGD. It uses a trick called momentum factor masking. Its update rule can be written as
$u_t^{(k)} = \beta u_{t-1}^{(k)} + g_t^{(k)}, \quad w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} D_t (m_t^{(k)} + u_t^{(k)}), \quad m_{t+1}^{(k)} = (I - D_t)(m_t^{(k)} + u_t^{(k)}), \quad u_t^{(k)} \leftarrow (I - D_t) u_t^{(k)},$
where $D_t$ is the diagonal 0/1 matrix selecting the sent coordinates. After obtaining $D_t$, each worker applies the same mask $(I - D_t)$ to the momentum buffer $u_t^{(k)}$, zeroing the coordinates that have just been sent. (Lin et al., 2018) considers the algorithm as a kind of asynchronous momentum SGD and argues that the momentum factor masking can overcome the staleness effect. However, the mask $D_t$ is designed mainly based on $m_t^{(k)} + u_t^{(k)}$; it has nothing to do with $u_t^{(k)}$ alone. The empirical results of (Lin et al., 2018) on CIFAR-10 using ResNet-110 show that the effect of momentum factor masking on top-1 accuracy is marginal.
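The masking trick can be illustrated in isolation (a sketch with arbitrary values; all names are ours): after selecting the mask from the candidate update, the worker clears the sent coordinates in both the memory and the momentum buffer.

```python
import numpy as np

d, k = 8, 2
rng = np.random.default_rng(2)
u = rng.normal(size=d)        # momentum buffer
m = rng.normal(size=d)        # memory gradient
g = rng.normal(size=d)        # fresh stochastic gradient

u = 0.9 * u + g               # momentum accumulation
v = m + u                     # candidate update
idx = np.argsort(np.abs(v))[-k:]
mask = np.zeros(d)
mask[idx] = 1.0               # mask selected from v, not from u

sent = mask * v               # sparse vector communicated
m = (1.0 - mask) * v          # memory keeps the un-sent coordinates
u = (1.0 - mask) * u          # momentum factor masking: clear sent coords

assert np.count_nonzero(sent) <= k
```

Without the last line, stale momentum on the just-sent coordinates would keep contributing to future updates; the masking zeroes it out.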
4 Transformation Equation
For convenience, we define a diagonal matrix $D_t$ with diagonal entries in $\{0, 1\}$ such that $D_t x = \mathrm{sparse}(x)$, to replace the symbol $\mathrm{sparse}(\cdot)$. Then the update rules for $w_t$ and $m_t^{(k)}$ can be written as
(5) $w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} D_t (m_t^{(k)} + u_t^{(k)}),$
(6) $m_{t+1}^{(k)} = (I - D_t)(m_t^{(k)} + u_t^{(k)}).$
According to (5) and (6), we can eliminate $D_t$ and obtain
(7) $w_{t+1} = w_t - \eta_t \sum_{k=1}^{p} \left( m_t^{(k)} + u_t^{(k)} - m_{t+1}^{(k)} \right).$
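The elimination only uses the identity $D_t (m_t^{(k)} + u_t^{(k)}) = (m_t^{(k)} + u_t^{(k)}) - (I - D_t)(m_t^{(k)} + u_t^{(k)})$; a quick numerical check of this identity, with arbitrary values and a single worker:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
m_t = rng.normal(size=d)                               # memory gradient m_t
u_t = rng.normal(size=d)                               # momentum term u_t
D = np.diag(rng.integers(0, 2, size=d).astype(float))  # diagonal 0/1 matrix
I = np.eye(d)

m_next = (I - D) @ (m_t + u_t)   # memory update (6)
sent = D @ (m_t + u_t)           # communicated part appearing in (5)

# The communicated part equals what is accumulated minus what is kept:
assert np.allclose(sent, (m_t + u_t) - m_next)
```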
First, we consider the simplest case, $\beta = 0$ (so that $u_t^{(k)} = g_t^{(k)}$), to show the relation between traditional DSGD and MDSGD. For convenience, we denote $\bar{m}_t = \sum_{k=1}^{p} m_t^{(k)}$, which satisfies $\bar{m}_0 = 0$.
According to the above equation, we set $x_t = w_t - \eta_t \bar{m}_t$ and obtain
(8) $x_{t+1} = x_t - \eta_t \sum_{k=1}^{p} g_t^{(k)} + (\eta_t - \eta_{t+1}) \bar{m}_{t+1}.$
According to equation (8), we observe that:

for the term $x_t - \eta_t \sum_{k=1}^{p} g_t^{(k)}$, it is the update rule for $x_t$ in traditional DSGD, except that the gradients are evaluated at $w_t$;

for the distance between $x_t$ and $w_t$, if $f$ is smooth, we have $\|\nabla f(w_t) - \nabla f(x_t)\| \le L \|w_t - x_t\| = L \eta_t \|\bar{m}_t\|$;

for the term $(\eta_t - \eta_{t+1}) \bar{m}_{t+1}$, if $\eta_t$ is nonincreasing and $\|\bar{m}_{t+1}\|$ is bounded, then it is a small noise that vanishes as $\eta_t - \eta_{t+1} \to 0$.

It implies that, under certain assumptions, when we transform a traditional DSGD with initialization $x_0 = w_0$ and learning rate $\eta_t$ into MDSGD, it is equivalent to adding one small noise, scaled by $\eta_t - \eta_{t+1}$, in each iteration. To the best of our knowledge, for most DSGD variants with convergence guarantees, the learning rate satisfies the condition $\eta_t - \eta_{t+1} \to 0$; for example, $\eta_t$ is a constant or $\eta_t = O(1/t)$. When the noise $\bar{m}_{t+1}$ is bounded, after being scaled by $\eta_t - \eta_{t+1}$ it will not affect the convergence of $x_t$. What is more, when $\eta_t$ is small, $w_t$ gets close to $x_t$, which means $w_t$ converges at the same time. Now we can conclude that both $x_t$ and $w_t$ converge to the optimal solution.
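In particular, for a constant learning rate the noise term vanishes and the auxiliary sequence follows the plain SGD recursion exactly. A quick numerical check with one worker and top-1 sparsification (all values arbitrary; this is a sketch of the auxiliary-sequence idea, not the paper's algorithm):

```python
import numpy as np

def top1(v):
    """Keep only the largest-magnitude coordinate of v."""
    out = np.zeros_like(v)
    j = np.argmax(np.abs(v))
    out[j] = v[j]
    return out

rng = np.random.default_rng(4)
d, eta = 5, 0.1               # constant learning rate
w = rng.normal(size=d)
m = np.zeros(d)               # memory gradient

for t in range(10):
    g = rng.normal(size=d)    # stochastic gradient at w_t
    x_old = w - eta * m       # x_t = w_t - eta * m_t
    v = m + g
    sent = top1(v)
    m = v - sent              # memory update
    w = w - eta * sent        # parameter update
    x_new = w - eta * m       # x_{t+1}
    # With constant eta, x follows the uncompressed SGD recursion exactly:
    assert np.allclose(x_new, x_old - eta * g)
```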
In fact, in equation (8) we transform the update rule of $w_t$ into that of $x_t$, and obtain the convergence of $w_t$ by benefitting from the update rule of $x_t$. For general $\beta$, we have the following theorem:
(Transformation) Let $\{w_t\}$ be a sequence generated by Algorithm 1 with learning rate $\eta_t$. We set $x_t$ to be some linear combination of $w_t, \{m_t^{(k)}\}, \{u_t^{(k)}\}$, and define $\epsilon_t = h(\{m_t^{(k)}\}, \{u_t^{(k)}\})$, where $h$ is some function. Assume the following conditions hold:
(9) $\mathbb{E}\|x_t - w_t\|^2 \le c_1 \gamma_t^2,$
(10) $x_{t+1} = x_t - \gamma_t \left( \sum_{k=1}^{p} g_t^{(k)} + \epsilon_t \right),$
where $\gamma_t$ is the effective learning rate, $\epsilon_t$ is a bounded noise term, and $c_1$ is a constant. Then we call (10) the transformation equation. If $f$ is smooth with bounded stochastic gradient, we have
$\sum_{t=1}^{T} \gamma_t \, \mathbb{E}\|\nabla f(x_t)\|^2 \le c_2 + c_3 \sum_{t=1}^{T} \gamma_t^2,$
where $c_2, c_3$ are constants depending on $L$, $G$, and $f(x_0) - f(w^*)$.
In Theorem 4, we only need $x_t$ to be a linear combination of $w_t, \{m_t^{(k)}\}, \{u_t^{(k)}\}$, so it is easy to construct such an $x_t$. Although $\sum_{k} g_t^{(k)}$ is an unbiased estimate of the full gradient at $w_t$ rather than at $x_t$, benefitting from (9), which implies that $x_t$ and $w_t$ are close enough, and from the fact that $\epsilon_t$ is of the same order of magnitude as the variance of $\sum_k g_t^{(k)}$, (10) can be seen as updating $x_t$ by DSGD with learning rate $\gamma_t$. Hence, (10) transforms the update rule of $w_t$ into that of $x_t$, and we call it the transformation equation. It describes the relation between MDSGD and traditional DSGD. If
(11) $\sum_{t=1}^{\infty} \gamma_t = \infty, \quad \sum_{t=1}^{\infty} \gamma_t^2 < \infty,$
then we can randomly choose $x_\tau$ from $\{x_1, \dots, x_T\}$ with probability $P(\tau = t) = \gamma_t / \sum_{s=1}^{T} \gamma_s$, and get that $\mathbb{E}\|\nabla f(x_\tau)\|^2 \to 0$.
5 Convergence
In this section, we prove the convergence of MDSGD with momentum for both convex and nonconvex problems. For convenience, we denote $\bar{m}_t = \sum_{k=1}^{p} m_t^{(k)}$ and $\bar{u}_t = \sum_{k=1}^{p} u_t^{(k)}$. Then, according to (7), we have the update rule for $w_t$:
(12) $w_{t+1} = w_t - \eta_t \left( \bar{m}_t + \bar{u}_t - \bar{m}_{t+1} \right),$
where $\bar{u}_t = \beta \bar{u}_{t-1} + \sum_{k=1}^{p} g_t^{(k)}$.
According to Theorem 4, our main task is to establish the transformation equation.
Let $\gamma_t = \eta_t / (1 - \beta)$. By setting
(13) $x_t = w_t - \eta_t \bar{m}_t - \frac{\beta \eta_t}{1 - \beta} \bar{u}_{t-1},$
where $\bar{u}_{-1} = 0$, we have
(14) $x_{t+1} = x_t - \gamma_t \left( \sum_{k=1}^{p} g_t^{(k)} + \epsilon_t \right),$
where $\epsilon_t$ collects the noise terms induced by the variation of $\eta_t$; for a constant learning rate, $\epsilon_t = 0$.
Lemma 5 gives the transformation equation of MDSGD with momentum. We can see that $x_t$ is a linear combination of $w_t$, $\bar{m}_t$, and $\bar{u}_{t-1}$. Since $\bar{m}_0 = 0$ and $\bar{u}_{-1} = 0$, it is easy to make $x_0 = w_0$. We should design $\eta_t$ carefully. We propose two strategies:

$\eta_t$ is a small constant $\eta$; then $\gamma_t = \eta/(1-\beta)$ is constant and $\epsilon_t = 0$;

$\eta_t = O(1/t)$; then $\gamma_t = O(1/t)$ and $\epsilon_t$ remains bounded.

It is easy to verify that both strategies satisfy (9) and the learning-rate condition (11). Specifically, we have the following lemma: Assume $f$ has the bounded ($G$) stochastic gradient, the memory gradient is bounded, and $x_t$ is defined as in Lemma 5. Then $\mathbb{E}\|x_t - w_t\|$ is bounded in proportion to $\eta_t$, and if $\eta_t$ is defined as one of the above two strategies, the conditions of Theorem 4 hold.
Then we get the following convergence rates of MDSGD:
(Strongly convex case) Let $x_0 = w_0$ and $x_t$ be defined as in Lemma 5. Assume $f$ is smooth and strongly convex with bounded stochastic gradient, and the memory gradient is bounded. By setting $\eta_t = O(1/t)$, we obtain the $O(1/T)$ convergence rate.
(Convex case) Let $x_0 = w_0$ and $x_t$ be defined as in Lemma 5. Assume $f$ is convex with bounded stochastic gradient, and the memory gradient is bounded. By setting $\eta_t = O(1/\sqrt{T})$, we obtain the $O(1/\sqrt{T})$ convergence rate.
(Nonconvex case) Let $x_0 = w_0$ and $x_t$ be defined as in Lemma 5. Assume $f$ is smooth with bounded stochastic gradient and bounded memory gradient. By taking $\eta_t = O(1/\sqrt{T})$, it is easy to get the $O(1/\sqrt{T})$ convergence rate.
If we design the update vector as the stochastic batch gradient of $f$, which means $u_t^{(k)} = g_t^{(k)}$, it is a special case of MDSGD with momentum obtained by setting $\beta = 0$. For the strongly convex and smooth case, by setting $\eta_t = O(1/t)$, we get the $O(1/T)$ convergence rate. For the general convex case, by setting $\eta_t = O(1/\sqrt{T})$, we get the $O(1/\sqrt{T})$ convergence rate. For the nonconvex and smooth case, by setting $\eta_t = O(1/\sqrt{T})$, we get the $O(1/\sqrt{T})$ convergence rate. Please note that (3)-(4) can also be transformed into such a formulation, i.e., $x_{t+1} = x_t - \gamma_t (\sum_{k=1}^{p} g_t^{(k)} + \epsilon_t)$, so those methods have the same convergence rates.
6 MDSGD meets stagewise learning
In the previous analysis for convergence of MDSGD, we need to set the learning rate be a constant or . This is far from the practice. In fact, we never set a constant learning rate when training models. Usually, we decrease the learning rate after passing through the training data several times. For example, (He et al., 2016) decreases the learning rate by
at 32k and 48k iterations when training resnet on imagenet. What’s more, many models include nonsmooth operations, like ReLu, max pooling. This is also far from the convergence condition in previous theorems. Recently,
(Chen et al., 2019) propose the stagewise learning to bridge the gap between theory and practice. In this section, we use stagewise learning for MDSGD.For convenience, we denote Algorithm 1 with constant learning rate as . is the function to optimize, is initialization, is a constant learning rate, is the momentum scalar, is the iteration numbers. Then we have Let . We define to be the sequence produced by so that , and . Let be the sequence transformed by Lemma 5. Assume , is convex with bounded stochastic gradient, we have
where , . Here we set to be convex but not necessarily strong convex.
Lemma 6 implies that MDSGD with a constant learning rate satisfies the condition of stagewise learning (Theorem 1 in (Chen et al., 2019)). And a constant learning rate makes the transformation equation simple. Thus, we can use stagewise learning. We define
(15) $w_s = \mathcal{A}(f_s, w_{s-1}, \eta_s, \beta, T_s),$
where
(16) $f_s(w) = f(w) + \frac{1}{2\alpha} \|w - w_{s-1}\|^2.$
Let $\{w_s\}$ be the sequence produced by (15), which means the output of stage $s$ is used as the initialization of stage $s+1$. If $f$ is $\mu$-weakly convex, then $f_s$ is strongly convex when $\alpha < 1/\mu$. Hence, we can apply Lemma 6 to $f_s$. Specifically, we have the following result:
Assume $f$ is weakly convex with bounded stochastic gradient and the memory gradient is bounded. By setting $\eta_s = O(1/s)$, $T_s = O(s)$, we obtain a convergence guarantee on $\mathbb{E}\|\nabla f_\alpha(w_\tau)\|^2$, where $f_\alpha$ denotes the Moreau envelope of $f$ and $w_\tau$ is chosen randomly from $\{w_1, \dots, w_S\}$ with probability proportional to the stage weights.
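The stagewise scheme (15)-(16) can be sketched as follows, simulated with a single worker and plain SGD as the inner solver for brevity (the toy objective, regularizer strength, stage lengths, and rates are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5

def stochastic_grad(f_grad, w):
    # Noisy gradient oracle for the stage objective.
    return f_grad(w) + 0.1 * rng.normal(size=d)

def sgd_stage(f_grad, w0, eta, T):
    """Inner solver A(f_s, w0, eta, T): constant-rate SGD, averaged iterate."""
    w, avg = w0.copy(), np.zeros(d)
    for _ in range(T):
        w = w - eta * stochastic_grad(f_grad, w)
        avg += w / T
    return avg

f_grad = lambda w: w - np.ones(d)      # gradient of 0.5*||w - 1||^2 (toy f)
alpha = 1.0                            # strength of the stagewise regularizer
w_s = np.zeros(d)
for s in range(1, 6):                  # stages
    ref = w_s.copy()
    # Stage objective (16): f(w) + ||w - w_{s-1}||^2 / (2*alpha)
    fs_grad = lambda w: f_grad(w) + (w - ref) / alpha
    eta_s = 0.5 / s                    # learning rate decreased by stage
    T_s = 100 * s                      # stage length increased by stage
    w_s = sgd_stage(fs_grad, w_s, eta_s, T_s)

print(np.round(w_s[:3], 1))
```

Each stage minimizes a strongly convex proximal surrogate around the previous stage's output, so the sequence of stage outputs drifts toward a stationary point of the toy objective.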
7 Choice of $D_t$
In the convergence theorems, we need the memory gradient $m_t^{(k)}$ to be bounded. Since $u_t^{(k)}$ is bounded whenever the stochastic gradient is bounded, we only need $(I - D_t)(m_t^{(k)} + u_t^{(k)})$ to be bounded. Consider the update rule for $m_t^{(k)}$: