Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

11/20/2019
by   Cong Xie, et al.

Recent years have witnessed the growth of large-scale distributed machine learning algorithms, which are specifically designed to accelerate model training by distributing the computation across multiple machines. When distributed training is scaled in this way, the communication overhead often becomes the bottleneck. In this paper, we study the local distributed Stochastic Gradient Descent (SGD) algorithm, which reduces the communication overhead by decreasing the frequency of synchronization. While SGD with adaptive learning rates is a widely adopted strategy for training neural networks, it remains unknown how to implement adaptive learning rates in local SGD. To this end, we propose a novel SGD variant with reduced communication and adaptive learning rates, with provable convergence. Empirical results show that the proposed algorithm converges fast and efficiently reduces the communication overhead.

1 Introduction

Stochastic Gradient Descent (SGD) and its variants are commonly used for training deep neural networks. To accelerate the training, it is common to distribute the computation to multiple GPUs/machines, which results in parallel versions of SGD. There are various ways to parallelize SGD in a distributed manner. A typical solution is to synchronously compute the gradients on multiple worker nodes, and take the average on the server node. Such a solution is equivalent to single-threaded SGD with large mini-batch sizes (Goyal et al., 2017; You et al., 2017a, b, 2019). By increasing the number of workers, the overall time consumed by training will be reduced.
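To make the large-batch equivalence concrete: with $n$ workers that each compute a gradient on a local mini-batch of size $B$ (generic notation, not taken from the paper), averaging the workers' gradients on the server yields exactly the update of a single-threaded step on the combined mini-batch of size $nB$:

$$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{B}\sum_{z\in\mathcal{B}^{i}}\nabla f(x;z)
\;=\;\frac{1}{nB}\sum_{z\in\mathcal{B}^{1}\cup\cdots\cup\mathcal{B}^{n}}\nabla f(x;z).$$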

However, in practice, it is difficult to achieve the ideal scalability of distributed SGD due to the communication overhead, which increases with the number of workers. When the number of workers is large enough, the communication overhead becomes the bottleneck of the distributed learning system. Thus, to achieve better scalability, it is necessary to reduce the communication overhead.

Various approaches have been proposed to reduce the communication overhead of distributed SGD, such as quantization (Seide et al., 2014; Strom, 2015; Wen et al., 2017; Alistarh et al., 2016; Bernstein et al., 2018; Karimireddy et al., 2019; Zheng et al., 2019) and sparsification (Aji and Heafield, 2017; Stich et al., 2018; Jiang and Agrawal, 2018). In this paper, we focus on local SGD, which reduces the communication overhead by skipping communication rounds, i.e., less frequent synchronization, and periodically averaging the models across the workers (Stich, 2018; Lin et al., 2018; Yu et al., 2018; Wang and Joshi, 2018; Yu et al., 2019).

Adaptive learning rate methods adapt coordinate-wise dynamic learning rates by accumulating the historical gradients. Examples include AdaGrad (McMahan and Streeter, 2010; Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), AdaDelta (Zeiler, 2012), and Adam (Kingma and Ba, 2014). Along similar lines, recent research has shown that AdaGrad can converge without explicitly decreasing the learning rate (Ward et al., 2019; Zou et al., 2019).

Nevertheless, it remains unclear how to incorporate adaptive learning rates into distributed SGD with infrequent synchronization. In this paper, we answer this question by combining local SGD with adaptive learning rates. We propose a novel variant of AdaGrad and combine it with the concept of local SGD. To the best of our knowledge, this paper is the first to theoretically and empirically study local SGD with adaptive learning rates.

The main contributions of our paper are as follows:

  • We propose AdaAlter, a new SGD algorithm with adaptive learning rates and provable convergence.

  • We propose a variant of AdaAlter: local AdaAlter, which reduces the communication overhead via infrequent synchronization.

  • We prove the convergence of the proposed algorithms for non-convex problems and non-IID workers.

  • We show empirically that the proposed algorithms converge quickly and scale well in practical applications.

2 Related Work

In this paper, we consider a centralized server-worker architecture, also known as the Parameter Server (PS) architecture (Li et al., 2014a, b; Ho et al., 2013). In general, PS is a distributed key-value store, which can be used for exchanging blocks of model parameters between the workers and the servers (Peng et al., 2019). A common alternative to PS is the AllReduce algorithm, which is typically implemented with MPI (Sergeev and Balso, 2018; Walker and Dongarra, 1996). Most existing large-scale distributed deep-learning frameworks, such as TensorFlow (Abadi et al., 2016), PyTorch (Steiner et al., 2019), and MXNet (Chen et al., 2015), support either PS or AllReduce.

Similar to local SGD, there are other SGD variants that also reduce the communication overhead by skipping synchronization rounds. For example, federated learning (Konečný et al., 2016; McMahan et al., 2016) adopts local SGD with heterogeneous numbers of local steps and worker subsampling to train models on edge devices. EASGD (Zhang et al., 2014) periodically synchronizes the models on the workers and the servers with a moving average.

In addition to communication compression, there are other approaches that improve scalability and accelerate training. For example, decentralized SGD (Shi et al., 2014; Yuan et al., 2013; Lian et al., 2017) removes the central server and lets each worker communicate only with its neighbors, which avoids congesting a single server node and improves scalability. Another technique is pipelining (Li et al., 2018), which overlaps the computation and the communication to hide the communication overhead.

In this paper, we focus on synchronous training, which blocks the global update until all the workers respond. In contrast, asynchronous training (Zinkevich et al., 2009; Niu et al., 2011; Zhao and Li, 2016) updates the global model immediately after any worker responds. Theoretical and empirical analysis (Dutta et al., 2018) suggests that synchronous training is more stable with less noise, but can also be slowed down by the global barrier across all the workers. Asynchronous training is generally faster, but needs to address instability and noisiness due to staleness.

3 Problem Formulation

We consider the following optimization problem:

$$\min_{x} F(x), \qquad F(x) := \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{z \sim \mathcal{D}_i}\, f(x; z),$$

where, for each $i$, $z$ is sampled from the local dataset $\mathcal{D}_i$ on the $i$th worker.

We solve this problem in a distributed manner with $n$ workers. Each worker trains the model on its local dataset. In each iteration, the $i$th worker samples a mini-batch of independent data points from $\mathcal{D}_i$ and computes a stochastic gradient of its local loss.

Note that different devices have different local datasets, i.e., $\mathcal{D}_i \neq \mathcal{D}_j$ in general. Thus, samples drawn from different workers may have different expectations, i.e., in general, $\mathbb{E}_{z \sim \mathcal{D}_i} \nabla f(x; z) \neq \mathbb{E}_{z \sim \mathcal{D}_j} \nabla f(x; z)$.

Notation Description
Model parameter

Overall loss function in expectation

Total number and index of iterations
Stochastic gradient
The th coordinate of ,
The th coordinate of ,
on the th worker, ,
The th coordinate of ,
Hadamard (coordinate-wise) product

(all vectors are column vectors)

Table 1: Notations

4 Methodology

Before formally introducing the proposed algorithms, we review two SGD variants that are closely related to our work: AdaGrad and local SGD. We then propose a new SGD variant with adaptive learning rates, AdaAlter, and combine it with the concept of local SGD, which yields another new variant: local AdaAlter.

4.1 Preliminary

To help understand our proposed algorithms, we first introduce the classic SGD variant with adaptive learning rates: AdaGrad. The detailed algorithm is shown in Algorithm 1. The general idea is to accumulate the squared gradients in a coordinate-wise manner, and to use this accumulation as the denominator that normalizes the gradients. Since the accumulation grows with the number of iterations, there is no need to explicitly decrease the learning rate.

1:  Initialize , ,
2:  for all iteration  do
3:     for all workers in parallel do
4:        ,
5:        
6:        
7:        
8:     end for
9:  end for
Algorithm 1 Distributed AdaGrad
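For reference, a minimal NumPy sketch of one step of standard distributed AdaGrad (illustrative names such as `lr` and `eps`; the placement of `eps` follows one common convention rather than the paper's exact formula):

```python
import numpy as np

def adagrad_step(x, b, grads, lr=0.1, eps=1e-10):
    """One distributed AdaGrad step.

    x     : model parameters (np.ndarray)
    b     : running coordinate-wise sum of squared gradients
    grads : list of stochastic gradients, one per worker
    """
    g = sum(grads) / len(grads)             # synchronize: average the workers' gradients
    b = b + g * g                           # accumulate the squared gradient first
    x = x - lr * g / (np.sqrt(b) + eps)     # then normalize the step coordinate-wise
    return x, b
```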

In addition to AdaGrad, we also adopt the concept of local SGD to reduce the communication overhead. The vanilla local SGD algorithm is shown in Algorithm 2. Local SGD skips communication rounds and synchronizes/averages the model parameters only once every $H$ iterations, where $H$ denotes the synchronization period. Thus, on average, the communication overhead is reduced to $1/H$ of that of fully synchronous SGD.

1:  Initialize
2:  for all iteration  do
3:     for all workers in parallel do
4:        ,
5:        
6:        if  then
7:           
8:        else
9:           Synchronize:
10:        end if
11:     end for
12:  end for
Algorithm 2 Local SGD
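A minimal sketch of the periodic-averaging pattern in Algorithm 2 (the names `grad_fn`, `lr`, `H`, and `T` are illustrative; the averaging is written as a plain mean over workers, standing in for an all-reduce or parameter-server step):

```python
import numpy as np

def local_sgd(models, grad_fn, lr=0.01, H=4, T=100):
    """models: list of per-worker parameter vectors (np.ndarray).
    grad_fn(i, x): stochastic gradient for worker i at parameters x.
    Each worker takes purely local SGD steps; every H iterations the
    models are synchronized by averaging across workers."""
    for t in range(1, T + 1):
        models = [x - lr * grad_fn(i, x) for i, x in enumerate(models)]
        if t % H == 0:                             # synchronization round
            avg = sum(models) / len(models)        # average the models
            models = [avg.copy() for _ in models]  # every worker restarts from the average
    return models
```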

4.2 AdaAlter

In this section, we formally introduce AdaAlter, which is an alternative to AdaGrad. AdaAlter accumulates the denominators similarly to AdaGrad. The major difference is that AdaAlter updates the model parameters before accumulating the denominator. The detailed algorithm is shown in Algorithm 3. Note that AdaGrad updates the denominator first and then updates the model parameters, while AdaAlter updates the model parameters first and then updates the denominator. This simple modification keeps the behavior of AdaAlter close to that of AdaGrad, yet makes it much easier to combine with local SGD. The practical importance of switching the order of these operations is discussed in detail after we introduce the local AdaAlter algorithm in the next section.

1:  Initialize , ,
2:  for all iteration  do
3:     for all workers in parallel do
4:        ,
5:        
6:        
7:        
8:     end for
9:  end for
Algorithm 3 Distributed AdaAlter
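Based on the description above (update the parameters first with the previous accumulator, then accumulate), one step of AdaAlter could be sketched as follows. This is our reading of Algorithm 3, not the authors' exact update; in particular, treating the stability constant `eps` as the placeholder added under the square root is an assumption:

```python
import numpy as np

def adaalter_step(x, b, grads, lr=0.1, eps=1.0):
    """One distributed AdaAlter step (sketch).

    The model is updated using the *previous* accumulator b, with eps
    acting as a placeholder for the squared gradient that has not been
    added yet; the accumulator is only updated afterwards."""
    g = sum(grads) / len(grads)           # synchronize: average the workers' gradients
    x = x - lr * g / np.sqrt(b + eps)     # step with the old accumulator plus placeholder
    b = b + g * g                         # accumulate after the parameter update
    return x, b
```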

4.3 Local AdaAlter

We propose a variant of AdaAlter, namely local AdaAlter, which skips synchronization rounds and averages the model parameters and the accumulated denominators only once every $H$ iterations. The detailed algorithm is shown in Algorithm 4. Note that in the communication rounds, AdaAlter has to synchronize not only the model parameters, but also the accumulated denominators across the workers. Thus, compared to distributed AdaGrad (Algorithm 1), local AdaAlter (Algorithm 4) reduces the communication overhead to $1/H$ on average.

In AdaGrad, a small positive constant $\epsilon$ is added for numerical stability, in case the denominator is too small. In AdaAlter, however, this constant acts as a placeholder for the accumulation that has not yet been added to the denominator. Thus, after $H$ local steps without synchronization, the placeholder grows to $H$ times its original size. The denominators are updated only in the synchronization rounds, which guarantees that the denominators remain identical across the workers during the local iterations.

1:  Initialize , ,
2:  for all iteration  do
3:     for all workers in parallel do
4:        
5:        ,
6:        
7:        
8:        if  then
9:           ,
10:        else
11:           Synchronize:
12:           Synchronize:
13:        end if
14:     end for
15:  end for
Algorithm 4 Local AdaAlter
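Putting the pieces together, a rough sketch of Algorithm 4 under our reading of the text: between synchronizations, each worker divides by the last synchronized accumulator plus a placeholder that grows with the number of local steps, and synchronization rounds average both the model parameters and the locally accumulated denominators. All names and the exact placeholder arithmetic are assumptions, not the authors' code:

```python
import numpy as np

def local_adaalter(models, grad_fn, lr=0.1, H=4, T=100, eps=1.0):
    """models: list of per-worker parameter vectors (np.ndarray).
    grad_fn(i, x): stochastic gradient for worker i at parameters x.
    b is the synchronized denominator; each worker also keeps a local
    accumulator of squared gradients since the last synchronization."""
    b = np.zeros_like(models[0])                  # synchronized accumulated denominator
    local_acc = [np.zeros_like(b) for _ in models]
    steps_since_sync = 0
    for t in range(1, T + 1):
        placeholder = (steps_since_sync + 1) * eps
        for i, x in enumerate(models):
            g = grad_fn(i, x)
            models[i] = x - lr * g / np.sqrt(b + placeholder)  # same denominator on all workers
            local_acc[i] = local_acc[i] + g * g                # lazy, local accumulation
        steps_since_sync += 1
        if t % H == 0:                            # synchronization round
            avg = sum(models) / len(models)       # average the model parameters
            b = b + sum(local_acc) / len(models)  # fold the averaged accumulators into b
            models = [avg.copy() for _ in models]
            local_acc = [np.zeros_like(b) for _ in models]
            steps_since_sync = 0
    return models, b
```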

Similar to fully synchronous AdaAlter, local AdaAlter updates the denominator after updating the model parameters. Switching the order of these updates is essential for local AdaAlter: it enables lazy updates of the denominator while keeping the denominator synchronized across the workers. The key idea is to substitute the placeholder for the actual accumulated denominator before synchronization. This is also the key to the convergence proof of local AdaAlter.

5 Theoretical Analysis

In this section, we prove the convergence of Algorithm 3 and Algorithm 4 for smooth but non-convex problems, with a constant learning rate.

5.1 Assumptions

First, we introduce some assumptions, and a useful lemma for our convergence analysis.

Assumption 1.

(Smoothness) We assume that the overall loss and the local losses are $L$-smooth, i.e., $\|\nabla F(x) - \nabla F(y)\| \le L \|x - y\|$ for all $x, y$, and likewise for each local loss.

Assumption 2.

(Bounded gradients) For any stochastic gradient $g$, we assume that its coordinates are bounded, i.e., $\|g\|_\infty \le G$ for some constant $G > 0$.

Lemma 1.

(Zou et al. (2019), Lemma 15) For any non-negative sequence , we have

5.2 Main results

Based on the assumptions and lemma above, we have the following convergence guarantees. Detailed proofs can be found in the appendix.

We first prove the convergence of fully synchronous AdaAlter for smooth but non-convex problems.

Theorem 1.

(Convergence of AdaAlter (Algorithm 3)) Under Assumptions 1 and 2, Algorithm 3 converges to a critical point: the bound on the expected gradient norm vanishes as the total number of iterations $T$ grows.

Thus, AdaAlter converges to a critical point as $T \to \infty$, and increasing the number of workers reduces the variance.

Now, we prove the convergence of local AdaAlter for smooth but non-convex problems. To analyze Algorithm 4, we introduce the following auxiliary variable:
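In local-SGD-style analyses, the auxiliary variable is typically the across-worker average of the local models; assuming the standard choice here (an assumption on our part), it reads

$$\bar{x}_t \;=\; \frac{1}{n}\sum_{i=1}^{n} x_t^{i},$$

which coincides with the synchronized model at every communication round.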

We show that this auxiliary sequence converges to a critical point.

Theorem 2.

(Convergence of local AdaAlter (Algorithm 4)) Under Assumptions 1 and 2, Algorithm 4 converges to a critical point: the bound on the expected gradient norm vanishes as the total number of iterations $T$ grows.

Thus, local AdaAlter converges to a critical point as $T \to \infty$, and increasing the number of workers reduces the variance. Compared to fully synchronous AdaAlter, local AdaAlter incurs extra noise that grows with the synchronization period $H$, which means that less frequent synchronization results in larger noise. Thus, there is an inevitable trade-off between reducing the noise and reducing the communication overhead.

6 Experiments

In this section, we empirically evaluate the proposed algorithms.

6.1 Datasets and Model Architecture

We conduct experiments on the 1B Word Benchmark dataset (Chelba et al., 2013), a publicly available benchmark for language modeling. The dataset is composed of about 0.8B words with a vocabulary of 793,471 words, including sentence boundary markers. As a standard pre-processing step, all the sentences are shuffled and duplicates are removed. We train the so-called Big LSTM model with 10% dropout (LSTM-2048-512 in Józefowicz et al. (2016)).

6.2 Evaluation Setup

Our experiments are conducted on a cluster of machines, each equipped with 8 NVIDIA V100 GPUs (16GB memory). In the default setting, the model is trained on a single machine with 8 GPU workers, where the local batch size on each GPU is 256. We tune the learning rates over a range of values and report the best results on the test dataset. Each experiment consists of 50 epochs, and each epoch processes a fixed number of data samples. We repeat each experiment 5 times and report the average.

The typical measure used for language models is perplexity (PPL), the exponentiated negative average per-word log-probability on the test dataset:

$$\mathrm{PPL} = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i)\Big),$$

where $p(w_i)$ is the probability that the language model assigns to word $w_i$, and $N$ is the total number of words. We follow the standard procedure and sum over all the words in the test dataset.
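A minimal sketch of how this metric is computed, assuming natural-log per-word probabilities from the model (not tied to the authors' evaluation code):

```python
import math

def perplexity(log_probs):
    """log_probs: iterable of per-word log-probabilities log p(w_i)
    over the test set. Perplexity is the exponentiated negative
    average per-word log-probability."""
    log_probs = list(log_probs)
    return math.exp(-sum(log_probs) / len(log_probs))

# Example with three words of probability 0.25, 0.5, and 0.1:
print(perplexity([math.log(0.25), math.log(0.5), math.log(0.1)]))  # ~4.31
```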

6.2.1 Practical Remarks for AdaAlter

There are some additional remarks for using (local) AdaAlter in practice.

Warm-up Learning Rates: When using AdaAlter, we observe that its behavior is almost the same as that of AdaGrad. The only exception is that, at the very beginning of training, the denominator of AdaAlter is too small. Thus, we add a warm-up mechanism for AdaAlter: during the first $T_w$ iterations, where $T_w$ is a hyperparameter, the learning rate gradually increases from a small initial value to its full value. We use this warm-up in our default setting of 8 GPU workers with a batch size of 256 per GPU.
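A sketch of the kind of linear warm-up schedule this describes (the linear ramp, the name `warmup_steps`, and the example numbers are assumptions for illustration, not the paper's exact schedule or values):

```python
def warmup_lr(base_lr, step, warmup_steps):
    """Linearly ramp the learning rate up to base_lr over the first
    warmup_steps iterations, then keep it constant."""
    return base_lr * min(1.0, step / float(warmup_steps))

# Hypothetical usage: base learning rate 0.5 with 1000 warm-up steps.
for step in (1, 500, 1000, 2000):
    print(step, warmup_lr(0.5, step, 1000))
```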

Scaling Learning Rates: The original baseline is conducted on 4 GPU workers, with a correspondingly smaller overall batch size and learning rate. When the overall batch size increases by a factor of $k$, it is a common strategy to re-scale the learning rate by $k$ or $\sqrt{k}$ (Goyal et al., 2017; You et al., 2017a, b, 2019). In our experiments, we use 8 GPU workers with a batch size of 256 per GPU, i.e., an overall batch size of 2048. Thus, it is reasonable to re-scale the baseline learning rate by a factor between $\sqrt{k}$ and $k$. When tuning the learning rate in this range, we select the value that results in the best performance.
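To make the scaling rule concrete, a small sketch with hypothetical numbers (an overall batch size going from 512 to 2048, i.e. $k = 4$; the baseline learning rate of 0.5 is also hypothetical):

```python
import math

def rescaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Re-scale a learning rate when the overall batch size changes.
    rule='linear' multiplies by k, rule='sqrt' by sqrt(k),
    where k = new_batch / base_batch."""
    k = new_batch / base_batch
    return base_lr * (k if rule == "linear" else math.sqrt(k))

print(rescaled_lr(0.5, 512, 2048, rule="sqrt"))    # 1.0
print(rescaled_lr(0.5, 512, 2048, rule="linear"))  # 2.0
```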

Figure 1: Time per epoch versus the number of workers. For all experiments, the batch size on each GPU worker is 256, and each epoch processes the same number of data samples.
Figure 2: Throughput versus the number of workers. For all experiments, the batch size on each GPU worker is 256.
Figure 3: Perplexity on the test dataset for the different algorithms: (a) test perplexity versus training time; (b) test perplexity versus epochs. The experiments use 8 GPU workers with a batch size of 256 on each GPU, the same learning rate for all runs, and the same number of warm-up steps for both distributed AdaAlter and local AdaAlter.

6.3 Empirical Results

We evaluate the following performance metrics to test the reduction of communication overhead and the convergence of the proposed algorithms:

  • The time consumed by one epoch versus the number of GPU workers.

  • The throughput (the overall number of samples processed per second) versus the number of GPU workers.

  • Perplexity on the test dataset versus time.

  • Perplexity on the test dataset versus the number of epochs.

Note that all the experiments use the same hyperparameter settings.

6.3.1 Reduction of Communication

We first evaluate the reduction of the communication overhead achieved by the proposed algorithms. In Figures 1 and 2, we report the time consumed by one epoch and the throughput for different numbers of workers and different algorithms. We test local AdaAlter with different synchronization periods $H$. The results show that local AdaAlter efficiently reduces the communication overhead.

The baseline “Local AdaAlter, ” is evaluated by manually removing the communication, i.e., the synchronization never happens. The baseline “Ideal computation-only overhead” is evaluated by manually removing not only the communication, but also the data-loading, which uses a batch of dummy data to avoid the overhead of loading real data samples. These two baselines illustrate the ideal lower bounds of the training time, by removing the overheads other than the computation.

6.3.2 Convergence

We next test the convergence of the proposed algorithms. In Figure 3, we report the perplexity on the test dataset for the different algorithms. Compared to vanilla distributed AdaGrad, local AdaAlter converges almost identically over the same number of epochs, but takes much less wall-clock time. To reach the same perplexity, local AdaAlter reduces the training time by almost 30%.

In Table 2, we report the perplexity and the consumed time at the end of training for the different algorithms. Local AdaAlter achieves performance on the test dataset comparable to fully synchronous AdaGrad and AdaAlter, with much less training time and acceptable variance. Note that we do not show the standard deviation in Figure 3, since it is too small to be visible.

Method              Test PPL    Time (hours)
AdaGrad                          98.05
AdaAlter                         98.47
Local AdaAlter                   69.17
Local AdaAlter                   67.41
Local AdaAlter                   65.49
Local AdaAlter                   64.22
Table 2: Test PPL and time (hours) at the end of training.

6.4 Discussion

The fully synchronous AdaGrad and AdaAlter are very slow. Local AdaAlter reduces the training time by almost 30% compared to these fully synchronous baselines.

As expected, Figure 3 and Table 2 show that a larger synchronization period $H$ reduces more communication overhead, but also results in worse perplexity, which validates our theoretical analysis in Theorem 2: when $H$ increases, the noise in the convergence also increases. A moderate value of $H$ gives the best trade-off between the communication overhead and the variance.

Interestingly, as shown in Figure 3(b), local AdaAlter achieves slightly better perplexity on the test dataset than fully synchronous AdaGrad and AdaAlter. Although our theorems indicate that local AdaAlter has a larger variance than the fully synchronous version, this conclusion only applies to the training loss. In fact, previous work (Lin et al., 2018) shows that local SGD can generalize better than fully synchronous SGD, so our results are not surprising. We also notice that when $H$ is too large, this benefit is overwhelmed by the larger noise.

We also observe that almost none of the algorithms scale well when the number of workers increases from 4 to 8. The major reason is that all the workers are placed on the same machine, which has limited CPU resources. When there are too many workers, data loading also becomes a bottleneck, as shown by the gap between the “Local AdaAlter” baseline with communication removed and the “Ideal computation-only overhead” baseline in Figure 1. That is also the reason why different values of $H$ do not show much difference when using 8 GPU workers.

7 Conclusion

We propose a novel SGD algorithm, AdaAlter, and a variant with reduced communication, namely local AdaAlter. We show that the proposed algorithms provably converge. Our empirical results also show that the proposed algorithms accelerate training. In future work, we will apply our algorithms to other datasets and applications, and further optimize the system performance.

References


Appendix A Proofs

Theorem 1.

Under Assumptions 1 and 2, Algorithm 3 converges to a critical point; the bound on the expected gradient norm vanishes as the total number of iterations $T$ grows.

Proof.

For convenience, we denote as the th coordinate of the gradient . Using -smoothness, we have

Note that , where .

Conditional on and , taking expectation on both sides, using , we have

Note that , thus, we have . Then, we have

If , then we have

Otherwise, denoting , we have

Thus, denoting , we have

By re-arranging the terms, we have