The recent success of deep learning approaches for domains like speech recognition (dnn-speech) and computer vision (batchnorm) stems from many algorithmic improvements but also from the fact that the size of available training data has grown significantly over the years, together with the computing power, in terms of both CPUs and GPUs. While a single GPU often provides algorithmic simplicity and speed up to a given scale of data and model, there exists an operating point where a distributed implementation of training algorithms for deep architectures becomes necessary.
Currently, popular distributed training algorithms include mini-batch versions of stochastic gradient descent (SGD) and other stochastic optimization algorithms such as AdaGrad (duchi2011adaptive), RMSProp (tieleman2012lecture), and ADAM (kingma2014adam). Unfortunately, bulk-synchronous implementations of stochastic optimization are often slow in practice because each synchronous batch must wait for its slowest machine. To circumvent this problem, practitioners have resorted to asynchronous approaches, which emphasize speed by using potentially stale information for computation. While asynchronous training methods have proven to be faster than their synchronous counterparts, they often converge to poorer results.
In this paper (an extension of our ICLR 2016 workshop extended abstract, chen2016revisiting), we revisit synchronous learning and propose a method for mitigating stragglers in synchronous stochastic optimization. Specifically, we synchronously compute a mini-batch gradient with only a subset of worker machines, thus alleviating the straggler effect while avoiding any staleness in our gradients. The primary contributions of our paper are:
Illustration of how gradient staleness in asynchronous training negatively impacts test accuracy and is exacerbated by deep models.
Measurement of machine response times for synchronous stochastic optimization in a large deployment of 100 GPUs, showing how stragglers in the tail end affect convergence speed.
Proposal of synchronous stochastic optimization with backup workers to mitigate straggler effects without gradient staleness.
Demonstration of the need to measure both speed of convergence and test accuracy of the optimum for empirical validation.
Empirical demonstration that our proposed synchronous training method outperforms asynchronous training by converging faster and to better test accuracies.
Section 2 describes asynchronous stochastic optimization and presents experimental evidence of gradient staleness in deep neural network models. We present our approach in Section 3, and exhibit straggler effects that motivate the approach. We then empirically evaluate our approach in Section 4. Related work is discussed in Section 5, and we conclude in Section 6.
1.1 Preliminaries and Notation
Given a dataset $\mathcal{X} = \{x_i : i = 1, \ldots, |\mathcal{X}|\}$, our goal is to learn the parameters $\theta$ of a model with respect to an empirical loss function $f$, defined as $f(\theta) = \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} F(x_i; \theta)$, where $F(x_i; \theta)$ is the loss with respect to a datapoint $x_i$ and the model $\theta$.
A first-order stochastic optimization algorithm achieves this by iteratively updating $\theta$ using a stochastic gradient $G = \nabla F(x_i; \theta)$ computed at a randomly sampled $x_i$, producing a sequence of models $\theta^{(0)}, \theta^{(1)}, \ldots$. Stochastic optimization algorithms differ in their update equations. For example, the update of SGD is $\theta^{(t+1)} = \theta^{(t)} - \gamma_t G^{(t)}$, where $\gamma_t$ is the learning rate or step size at iteration $t$. A mini-batch version of the stochastic optimization algorithm computes the stochastic gradient over a mini-batch of size $B$ instead of a single datapoint, i.e., $G = \frac{1}{B} \sum_{i=1}^{B} \nabla F(\tilde{x}_i; \theta)$, where the $\tilde{x}_i$'s are randomly sampled from $\mathcal{X}$. We will often evaluate performance on an exponential moving average $\bar{\theta}^{(t)} = \alpha \bar{\theta}^{(t-1)} + (1 - \alpha) \theta^{(t)}$ with decay rate $\alpha$.
Our interest is in distributed stochastic optimization using $N$ worker machines in charge of computing stochastic gradients that are sent to $M$ parameter servers. Each parameter server $j$ is responsible for storing a subset $\theta[j]$ of the model, and performing updates on $\theta[j]$. In the synchronous setting, we will also introduce $b$ additional backup workers for straggler mitigation.
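As a rough illustration of the setup above, the following sketch runs mini-batch SGD on a toy one-dimensional squared loss and tracks an exponential moving average of the parameters. The function names, the constant learning rate, the sampling with replacement, and the toy data are assumptions made for this example; they are not taken from our experiments.

```python
import random


def minibatch_sgd(data, grad_fn, theta0, lr, batch_size, steps, ema_decay, seed=0):
    """Mini-batch SGD on a scalar parameter, also tracking its moving average."""
    rng = random.Random(seed)
    theta = theta0
    theta_ema = theta0
    for _ in range(steps):
        # Sample a mini-batch uniformly at random (with replacement).
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Mini-batch gradient: average of the per-datapoint gradients.
        grad = sum(grad_fn(x, theta) for x in batch) / batch_size
        # SGD update with a constant step size (an assumption of this sketch).
        theta = theta - lr * grad
        # Exponential moving average of the iterates.
        theta_ema = ema_decay * theta_ema + (1.0 - ema_decay) * theta
    return theta, theta_ema


def squared_loss_grad(x, theta):
    """Gradient in theta of the toy loss 0.5 * (theta - x) ** 2."""
    return theta - x


toy_data = [1.0, 2.0, 3.0, 4.0]
theta, theta_ema = minibatch_sgd(
    toy_data, squared_loss_grad, theta0=0.0, lr=0.05,
    batch_size=2, steps=2000, ema_decay=0.99,
)
```

For this toy loss the empirical minimizer is the mean of the data, so both the final iterate and its moving average should end up close to 2.5, with the moving average fluctuating less.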
2 Asynchronous Stochastic Optimization
An approach for a distributed stochastic gradient descent algorithm was presented in dean:2012, consisting of two main ingredients. First, the parameters of the model are distributed on multiple servers, depending on the architecture. This set of servers is called the parameter servers. Second, there can be multiple workers processing data in parallel and communicating with the parameter servers. Each worker processes a mini-batch of data independently of the others, as follows:
The worker fetches from the parameter servers the most up-to-date parameters of the model needed to process the current mini-batch;
It then computes gradients of the loss with respect to these parameters;
Finally, these gradients are sent back to the parameter servers, which then update the model accordingly.
Since each worker communicates with the parameter servers independently of the others, this is called Asynchronous Stochastic Gradient Descent (Async-SGD), or more generally, Asynchronous Stochastic Optimization (Async-Opt). A similar approach was later proposed by adam:2014. Async-Opt is presented in Algorithms 1 and 2.
In practice, the updates of Async-Opt differ from those of serially running the stochastic optimization algorithm for two reasons. Firstly, the read operation (Algo 1 Line 1) on a worker may be interleaved with updates by other workers to different parameter servers, so the resulting parameters may not be consistent with any single parameter incarnation. Secondly, model updates may have occurred while a worker is computing its stochastic gradient; hence, the resulting gradients are typically computed with respect to outdated parameters. We refer to these as stale gradients, and to their staleness as the number of updates that have occurred between the corresponding read and update operations.
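The notion of staleness can be made concrete with a small event-driven simulation. The sketch below is a simplification: it assumes that each worker takes a fixed, hypothetical time per gradient and that reads and updates are atomic, which is not the case in a real parameter-server deployment. The function name and its arguments are chosen for this illustration only.

```python
import heapq


def simulate_async_staleness(compute_times, num_updates):
    """Return the staleness of each applied gradient, in order of application.

    compute_times[w] is the hypothetical time worker w needs per gradient.
    """
    version = 0  # number of updates applied to the parameters so far
    # Each event is (finish_time, worker_id, version_read_at_start).
    events = [(t, w, 0) for w, t in enumerate(compute_times)]
    heapq.heapify(events)
    staleness = []
    while len(staleness) < num_updates:
        now, worker, read_version = heapq.heappop(events)
        # Staleness: updates applied between this worker's read and its update.
        staleness.append(version - read_version)
        version += 1  # the parameter servers apply the gradient
        # The worker immediately re-reads the parameters and starts again.
        heapq.heappush(events, (now + compute_times[worker], worker, version))
    return staleness


staleness = simulate_async_staleness([1.0, 1.0, 1.0, 1.0], num_updates=8)
```

With four equally fast workers, the simulated staleness settles at three, i.e. one less than the number of workers, which is consistent with staleness growing as more workers are added.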
Understanding the theoretical impact of staleness is difficult work and the topic of many recent papers, e.g. recht2011hogwild; duchi2013estimation; leblond2016asaga; reddi2015variance; de2015taming; mania2015perturbed, most of which focus on individual algorithms, under strong assumptions that may not hold up in practice. This is further complicated by deep models with multiple layers, since the times at which model parameters are read and at which gradients are computed and sent depend on the depth of the layers (Figure 2). To better understand this dependence in real models, we collected staleness statistics on an Async-Opt run with 40 workers on an 18-layer Inception model (inception_v3) trained on the ImageNet Challenge dataset (imagenet), as shown in Table 1.
Despite the above-mentioned problems, Async-Opt has been shown to scale well up to a few dozen workers for some models. However, at larger scales, increasing the number of machines (and thus the staleness of gradients) can result in poorer trained models.
2.1 Impact of staleness on test accuracy
We explore how increased staleness contributes to training of poorer models. In order to mimic the setting on a smaller scale, we trained a state-of-the-art MNIST CNN model but simulated staleness by using old gradients for the parameter updates. Details of the model and training are provided in Appendix A.1.
The best final classification error on a test set was 0.36%, which increases to 0.47% with average gradient staleness of 20 steps, and up to 0.79% with 50 steps (see Figure 2).
Once the average simulated staleness was chosen to be more than 15 steps, the results started to deteriorate significantly and the training itself became much less stable. We had to employ the following tricks to prevent the results from blowing up:
Slowly increase the staleness over the first 3 epochs of training. This mimics increasing the number of asynchronous workers and is also very important in practice for some of the models we experimented with (e.g. large word-level language models). The trick was not relevant with a simulated staleness less than 15 but became crucial for larger values.
Use lower initial learning rates when staleness is at least 20, which reduces the frequency of explosions (train error goes to 90%). This observation is similar to what we found in other experiments: we were able to use much larger learning rates with synchronous training, and the results were also more stable.
Even with the above tricks, divergence occurs occasionally, and we found that restarting training from random weights can lead to more successful runs. The best results were then chosen based on validation set performance.
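The effect of stale gradients and of the two tricks above can be reproduced qualitatively on a toy problem. The sketch below is not the MNIST experiment: it assumes a noise-free gradient of the one-dimensional quadratic 0.5 * theta ** 2, and the function names, step sizes, and ramp length are hypothetical choices made for illustration.

```python
def sgd_with_simulated_staleness(grad_fn, theta0, lr, staleness_at, steps):
    """Gradient descent where each update uses a gradient from an older iterate."""
    history = [theta0]  # history[t] holds the parameters at step t
    for t in range(steps):
        # A gradient cannot be older than the start of training.
        delay = min(staleness_at(t), t)
        stale_grad = grad_fn(history[t - delay])
        history.append(history[t] - lr * stale_grad)
    return history[-1]


def ramped_staleness(target, ramp_steps):
    """Staleness schedule growing linearly from 0 to `target` over `ramp_steps`."""
    def schedule(t):
        if t >= ramp_steps:
            return target
        return (target * t) // ramp_steps
    return schedule


def toy_grad(theta):
    """Gradient of the toy objective 0.5 * theta ** 2."""
    return theta


fresh = sgd_with_simulated_staleness(toy_grad, 1.0, 0.5, lambda t: 0, 50)
stale = sgd_with_simulated_staleness(toy_grad, 1.0, 0.5, lambda t: 20, 500)
stale_small_lr = sgd_with_simulated_staleness(
    toy_grad, 1.0, 0.05, ramped_staleness(20, 100), 2000)
```

In this toy setting, a step size that converges quickly with fresh gradients diverges with a staleness of 20, while a ten times smaller step size remains stable.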
3 Revisiting Synchronous Stochastic Optimization
Both dean:2012 and adam:2014 use versions of Async-SGD where the main potential problem is that each worker computes gradients over a potentially old version of the model. In order to remove this discrepancy, we propose here to reconsider a synchronous version of distributed stochastic gradient descent (Sync-SGD), or more generally, Synchronous Stochastic Optimization (Sync-Opt), where the parameter servers wait for all workers to send their gradients, aggregate them, and send the updated parameters to all workers afterward. This ensures that the actual algorithm is a true mini-batch stochastic gradient descent, with an effective batch size equal to the sum of all the mini-batch sizes of the workers.
While this approach solves the staleness problem, it also introduces the potential problem that the actual update time now depends on the slowest worker. Although workers have equivalent computation and network communication workload, slow stragglers may result from failing hardware, or contention on shared underlying hardware resources in data centers, or even due to preemption by other jobs.
To alleviate the straggler problem, we introduce backup workers (tail-at-scale) as follows: instead of having only $N$ workers, we add $b$ extra workers, but as soon as the parameter servers receive gradients from any $N$ workers, they stop waiting and update their parameters using those $N$ gradients. The gradients of the $b$ slowest workers will be dropped when they arrive. Our method is presented in Algorithms 3 and 4.
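A single update of this scheme can be sketched as follows. This is a sequential simulation under stated assumptions rather than the distributed implementation: arrival times are supplied by the caller instead of being measured, the model is a scalar, and the function and argument names are hypothetical.

```python
def sync_step_with_backups(theta, worker_batches, grad_fn, lr,
                           num_required, arrival_times):
    """One synchronous update using only the first `num_required` gradients."""
    # All workers read the same parameters, so no gradient is stale.
    grads = [sum(grad_fn(x, theta) for x in batch) / len(batch)
             for batch in worker_batches]
    # Workers ordered by simulated arrival time at the parameter server.
    order = sorted(range(len(grads)), key=lambda w: arrival_times[w])
    used = order[:num_required]
    dropped = order[num_required:]  # stragglers: discarded when they arrive
    avg_grad = sum(grads[w] for w in used) / num_required
    # The update can proceed once the last required gradient has arrived.
    iteration_time = arrival_times[used[-1]]
    return theta - lr * avg_grad, iteration_time, dropped


new_theta, iteration_time, dropped = sync_step_with_backups(
    theta=0.0,
    worker_batches=[[1.0], [2.0], [3.0], [100.0]],
    grad_fn=lambda x, theta: theta - x,
    lr=1.0,
    num_required=3,
    arrival_times=[0.5, 0.7, 0.6, 9.0],
)
```

In this example the fourth worker is a straggler: the update completes at time 0.7 rather than 9.0, and its gradient is dropped.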
3.1 Straggler effects
The use of backup workers is motivated by the need to mitigate slow stragglers while maximizing computation. We investigate the effect of stragglers on Sync-Opt model training here.
We ran Sync-Opt with workers, backups, and 19 parameter servers on the Inception model. Using one variable as a proxy, we collected for each iteration both the start time of the iteration and the time when the $k$-th gradient of that variable arrived at the parameter server. These times are presented in Figure 4 for . Note that the 98th gradient arrives in under 2s in 80% of iterations, whereas the final gradient does so in only 30%. Furthermore, the time to collect the final few gradients grows exponentially, resulting in idle resources and time wasted waiting for the slowest gradients. This exponential increase is also seen in Figure 4.
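Statistics of this kind can be computed from per-iteration arrival times with a few lines of code. The sketch below assumes the measurements are available as one list of arrival times (relative to the iteration start) per iteration; the function name and the small data set are made up for illustration and are not our measurements.

```python
def fraction_arriving_within(arrival_times_per_iter, k, threshold):
    """Fraction of iterations whose k-th fastest gradient arrives in time.

    `k` is 1-indexed; `threshold` is in the same unit as the arrival times.
    """
    hits = 0
    for times in arrival_times_per_iter:
        kth_arrival = sorted(times)[k - 1]
        if kth_arrival <= threshold:
            hits += 1
    return hits / len(arrival_times_per_iter)


# Hypothetical arrival times in seconds: 4 iterations, 3 workers each.
toy_times = [
    [1.0, 1.2, 1.5],
    [1.1, 1.3, 4.0],
    [0.9, 1.0, 1.4],
    [1.0, 1.9, 6.5],
]
second_fastest = fraction_arriving_within(toy_times, k=2, threshold=2.0)
slowest = fraction_arriving_within(toy_times, k=3, threshold=2.0)
```

On this made-up data the second gradient always arrives within two seconds, whereas the last one does so in only half of the iterations.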
Thus, one might choose to drop slow stragglers to decrease the iteration time. However, using fewer machines implies a smaller effective mini-batch size and thus greater gradient variance, which in turn could require more iterations for convergence. We examine this relationship by running Sync-Opt with and , and note the number of iterations required for convergence in Figure 6. (Since we are interested in gradient quality and convergence behavior rather than running time in this experiment, the backups serve only to reduce our data collection time and do not affect our analysis.) Additional details of this training are provided in Appendix A.2. As the number of workers is doubled from 50 to 100, the number of iterations to converge nearly halves from to .
Hence, there is a trade-off between dropping more stragglers to reduce iteration time, and waiting for more gradients to improve the gradient quality. Consider a hypothetical setting where we have machines, and we wish to choose the best configuration of workers $N$ and backups $b$ to minimize running time to convergence. For each configuration, we can estimate the iterations required from Figure 6 (linearly interpolating for values of $N$ for which we did not collect data). We can multiply this with the mean iteration times (Figure 4) to obtain the running time required to converge for each setting of $N$ and $b$. These results are shown in Figure 6, indicating that , converges fastest. This motivates our choice to use a few backup workers for mitigating stragglers.
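The estimate described above amounts to a product of two measured quantities followed by a minimum over configurations. The sketch below illustrates the arithmetic only: all numbers are invented placeholders rather than values read from our figures, and the function names are hypothetical.

```python
def interpolate(points, x):
    """Linearly interpolate y at x from (x, y) pairs sorted by x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the measured range")


def best_configuration(total_machines, iters_to_converge, mean_iter_time):
    """Pick the (workers, backups) split with the lowest estimated running time.

    iters_to_converge: (workers, iterations) pairs sorted by workers.
    mean_iter_time: maps the number of backups to the mean seconds per iteration.
    """
    best = None
    for num_backups, seconds in mean_iter_time.items():
        num_workers = total_machines - num_backups
        # Estimated time to converge = iterations needed * time per iteration.
        total = interpolate(iters_to_converge, num_workers) * seconds
        if best is None or total < best[2]:
            best = (num_workers, num_backups, total)
    return best


# Placeholder measurements, for illustration only.
toy_iters = [(6, 1400.0), (8, 1100.0), (10, 900.0)]
toy_iter_time = {0: 3.0, 1: 2.0, 2: 1.9}
choice = best_configuration(10, toy_iters, toy_iter_time)
```

With these placeholder numbers, a single backup wins: waiting for every machine makes each iteration slow, while dropping two machines saves little time per iteration and costs more iterations.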
4 Experiments
In this section, we present our empirical comparisons of synchronous and asynchronous distributed stochastic optimization algorithms as applied to models such as Inception and PixelCNN. All experiments in this paper use the TensorFlow system (tensorflow2015-whitepaper).
4.1 Metrics of comparison: Faster convergence, Better optimum
We are interested in two metrics of comparison for our empirical validation: (1) test error or accuracy, and (2) speed of convergence. We point out that for non-convex deep learning models, it is possible to converge faster to a poorer local optimum. Here we show a simple example with Inception using different learning rates.
We ran Sync-Opt on Inception with and , but varied the initial learning rate between 1.125 and 9.0. (Learning rates are exponentially decreased with iterations.) Table 2 shows that smaller initial learning rates converge faster, but to poorer test precisions. Focusing on speed in an early phase of training could lead to misleading conclusions if we fail to account for eventual convergence. For example, Figure 6(b) shows that using reaches precision faster than , but is slower for , and fails to reach higher precisions.
4.2 Inception Experiments
We conducted experiments on the Inception model (inception_v3) trained on the ImageNet Challenge dataset (imagenet), where the task is to classify images out of 1000 categories. We used several configurations, varying the number of workers from 53 to 212. Additional details of the training are provided in Appendix A.3. An epoch is a synchronous iteration for Sync-Opt, or a full pass of updates for Async-Opt, which represent similar amounts of computation. Results of this experiment are presented in Figure 8.
Figure 7(b) shows that Sync-Opt outperforms Async-Opt in test precision: Sync-Opt attains 0.5% better test precision than Async-Opt for a comparable number of workers. Furthermore, Sync-Opt converges 6h and 18h faster than Async-Opt for 106 and 212 workers respectively, and is 3h slower when 53 workers are used, as seen in Figure 7(d). This difference in speed is largely due to the fewer epochs needed by Sync-Opt (Figure 7(c)), combined with a comparable or better epoch time (Figure 7(e)).
4.3 PixelCNN Experiments
The second model we experimented on is PixelCNN (oord2016conditional), a conditional image generation deep neural network, which we train on the CIFAR-10 (krizhevsky2009learning) dataset. Configurations of workers were used; for Sync-Opt, we always used backup worker. Additional details are provided in Appendix A.4.
Convergence of the test negative log likelihood (NLL) on PixelCNN is shown in Figure 8(a), where lower is better. Observe that Sync-Opt obtains lower NLL than Async-Opt; in fact, Async-Opt is even outperformed by serial RMSProp with a single worker, with performance degrading as the number of workers increases from 8 to 16. Figure 8(b) further shows the time taken to reach test NLL. Sync-Opt reduces the time to reach from 40h to h; this NLL is not even achieved by Async-Opt.
5 Related Work
Multicore and distributed optimization algorithms have received much attention in recent years. Asynchronous algorithms include recht2011hogwild; duchi2013estimation; zhang2015deep; reddi2015variance; leblond2016asaga. Implementations of asynchronous optimization include xing2015petuum; li2014scaling; adam:2014. Attempts have also been made in zinkevich2010parallelized and zhang2015splash to algorithmically improve synchronous SGD.
An alternative solution, “softsync”, was presented in zhang2015staleness, which proposed batching gradients from multiple machines before performing an asynchronous SGD update, thereby reducing the effective staleness of gradients. Similar to our proposal, softsync avoids stragglers by not forcing updates to wait for the slowest worker. However, softsync allows the use of stale gradients, whereas our method does not. The two solutions provide different explorations of the trade-off between high accuracy (by minimizing staleness) and fast throughput (by avoiding stragglers).
watcharapichat2016ako introduces a distributed deep learning system without parameter servers, by having workers interleave gradient computation and communication in a round-robin pattern. Like Async-Opt, this approach suffers from staleness. We also note that in principle, workers in Sync-Opt can double as parameter servers and execute the update operations, thereby avoiding the need to partition hardware resources between workers and servers.
das2016distributed analyzes distributed stochastic optimization and optimizes the system by solving detailed system balance equations. We believe this approach is complementary to our work, and could potentially be applied to guide the choice of system configurations for Sync-Opt.
keskar2016large suggests that large batch sizes for synchronous stochastic optimization lead to poorer generalization. Our effective batch size increases linearly with the number of workers. However, we did not observe this effect in our experiments; we believe we are not yet in the large batch size regime examined by keskar2016large.
6 Conclusion and Future Work
Distributed training strategies for deep learning architectures will become ever more important as the size of datasets increases. In this work, we have shown how both synchronous and asynchronous distributed stochastic optimization suffer from their respective weaknesses of stragglers and staleness. This has motivated our development of synchronous stochastic optimization with backup workers, which we show to be a viable and scalable strategy.
We are currently experimenting with different kinds of datasets, including word-level language models where parts of the model (the embedding layers) are often very sparse, which involves very different communication constraints. We are also working on further improving the performance of synchronous training, for example by combining gradients from multiple workers sharing the same machine before sending them to the parameter servers, to reduce the communication overhead. An alternative of using time-outs instead of backup workers is also being explored.
Appendix A Details of Models and Training
A.1 MNIST CNN, Section 2.1
The model used in our experiments is a 4-layer CNN that has 3x3 filters with max-pooling and weight normalization in every layer. We trained the model with SGD for 25 epochs and evaluated performance on the exponential moving average using a decay rate of . The initial learning rate was set to 0.1 and linearly annealed to 0 in the last 10 epochs. We also used small image rotations and zooms as a data augmentation scheme.
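The learning rate schedule described above can be written as a small function. This is a sketch under one assumption not stated in the text: the rate is annealed continuously as a function of a possibly fractional epoch count, rather than being changed once per epoch. The function name is hypothetical.

```python
def linear_anneal_lr(epoch, total_epochs=25, anneal_epochs=10, initial_lr=0.1):
    """Constant learning rate, then linear annealing to 0 over the last epochs.

    `epoch` may be fractional, e.g. 17.5 for halfway through the 18th epoch.
    """
    anneal_start = total_epochs - anneal_epochs
    if epoch <= anneal_start:
        return initial_lr
    # Never return a negative rate if called past the end of training.
    remaining = max(total_epochs - epoch, 0.0)
    return initial_lr * remaining / anneal_epochs


schedule = [linear_anneal_lr(e) for e in range(26)]
```

With the defaults, the rate stays at 0.1 through epoch 15 and then decreases by 0.01 per epoch until it reaches 0 at epoch 25.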
A.2 Inception, Section 3.1
For our straggler experiments, we trained the Inception (inception_v3) model on the ImageNet Challenge dataset (imagenet). 10 parameter servers were used, and each worker was equipped with a K40 GPU.
The underlying optimizer was RMSProp with momentum, with decay of 0.9 and momentum of 0.9. Mini-batch size was used. Initial learning rates were set at , which we found to provide good test precisions for Inception. Learning rates were also exponentially decreased with decay rate as , where is the number of mini-batches in the dataset.
Test precisions were evaluated on the exponential moving average using .
A.3 Inception, Section 4.2
For experiments comparing Async-Opt and Sync-Opt on the Inception model in Section 4.2, each worker is equipped with a K40 GPU. For workers, 17 parameter servers were used; for workers, we used 27 parameter servers; and 37 parameter servers were used for .
In the asynchronous training mode, gradient clipping is also needed for stabilization, which requires each worker to collect the gradient across all layers of the deep model, compute the global norm, and then clip all gradients accordingly. However, synchronous training turns out to be very stable, so gradient clipping is no longer needed, which means that we can pipeline the update of parameters in different layers: the gradients of the top layers’ parameters can be sent to the parameter servers while the gradients for the lower layers are still being computed.
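Clipping by global norm, as described above, can be sketched in a few lines. The code below is an illustrative stand-alone version operating on plain Python lists; it is not the routine used in our experiments, and the function name and the clipping threshold are hypothetical. It also makes the pipelining obstacle visible: the scale factor depends on the gradients of every layer, so nothing can be sent before all layers have been processed.

```python
import math


def clip_by_global_norm(layer_grads, max_norm):
    """Rescale all layer gradients so that their joint L2 norm is at most max_norm.

    layer_grads: one list of floats per layer.
    Returns the (possibly rescaled) gradients and the original global norm.
    """
    # The global norm needs the gradients of all layers.
    global_norm = math.sqrt(sum(g * g for layer in layer_grads for g in layer))
    if global_norm <= max_norm:
        return [list(layer) for layer in layer_grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in layer] for layer in layer_grads], global_norm


clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Here the two layers have a joint norm of 5, so both are scaled by the same factor of 0.2, which preserves the direction of the overall gradient.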
The underlying optimizer is RMSProp with momentum, with decay of 0.9 and momentum of 0.9. Mini-batch size was used. Initial learning rates for Async-Opt were set to 0.045; for Sync-Opt, we found as a rule-of-thumb that a learning rate of worked well for this model. Learning rates were then exponentially decayed with decay rate as for Async-Opt, where is the number of mini-batches in the dataset. For Sync-Opt, learning rates were also exponentially decreased at rate of , so that the learning rates after computing the same number of datapoints are comparable for Async-Opt and Sync-Opt.
Test precisions were evaluated on the exponential moving average using .
A.4 PixelCNN, Section 4.3
The PixelCNN (oord2016conditional) model was trained on the CIFAR-10 (krizhevsky2009learning) dataset. Configurations of workers each with a K80 GPU, and 10 parameter servers were used. For Sync-Opt, we always used backup worker. The underlying optimizer is RMSProp with momentum, using decay of 0.95 and momentum of 0.9. Initial learning rates were set to and slowly decreased to after 200,000 iterations. Mini-batch size was used.