The parameter server framework [dean2012large][ho2013more]
has been widely adopted for distributing the training of large deep neural network (DNN) models [chen2015mxnet][zhang2017poseidon]. The framework consists of multiple workers and a logical server
that maintains globally shared parameters, typically represented as dense or sparse vectors and matrices [li2014scaling], and it supports two approaches: model parallelism and data parallelism [chen2014big]. In this paper we focus on data parallelism. Data parallelism refers to partitioning (sharding) large training data into smaller, equal-sized shards and assigning them to workers. The entire DNN model is then replicated to each worker. During training, each worker trains its replica of the model using its assigned data shard, sends the locally computed gradients to the server that maintains the globally shared parameters (weights) via a push operation, and receives back the updated global weights from the server via a pull operation. This weight synchronization step is critical, as it provides the server with a means of controlling both the iteration throughput (to boost the convergence speed in wall-clock time) and the quality of convergence (i.e., the accuracy).
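The push/pull cycle described above can be illustrated with a minimal toy sketch (ours, not the paper's implementation; the learning rate and list-based weights are arbitrary choices for illustration):

```python
def server_update(weights, grads_from_workers, lr=0.1):
    """Average the gradients pushed by all workers and apply one SGD
    step to the global weights, as in a single synchronization."""
    n = len(grads_from_workers)
    avg = [sum(g[i] for g in grads_from_workers) / n
           for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers push their local gradients; the server updates the
# global weights, which the workers then pull back.
w = server_update([1.0, 2.0], [[0.5, 1.0], [1.5, 3.0]])
# w is approximately [0.9, 1.8]
```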
Due to its importance, a number of synchronization models have been proposed, the most important of which are the asynchronous parallel (ASP), the bulk synchronous parallel (BSP), and the stale synchronous parallel (SSP). ASP [dean2012large] is the simplest model, as it assumes no weight synchronization: workers may receive different versions of the weights from the server at every iteration. BSP [gerbessiotis1994direct] is the most celebrated synchronization model. A critical component of it is the barrier synchronization, where workers reaching a barrier have to wait until all other workers have reached it as well (see Figure 1). During the training of a DNN model, each worker, at each iteration, computes the model gradients based on its local data shard and its local weights (originally from the server) and sends the gradients to the server. The server aggregates the gradients of all workers, performs the weight update (as one synchronization) and signals the workers to retrieve the latest weights for the next iteration. The workers replace their local weights with the latest weights from the server and start a new iteration. SSP [ho2013more] provides an intermediate approach between the two extremes of the ASP and BSP models. It performs synchronization, but mitigates the strict synchronization requirement of BSP. In principle, it monitors the iteration difference between the fastest and the slowest workers and restricts it to stay within a threshold s, enforcing synchronization on both workers whenever the threshold is exceeded.
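As a toy illustration of the SSP admission rule, the following sketch (our own assumption of the bookkeeping, not the authors' code; formulations differ on whether the comparison is strict) checks whether a worker may start its next iteration:

```python
def ssp_can_proceed(worker_clocks, worker_id, s):
    """SSP admission rule: a worker may start its next iteration only
    if it is fewer than s iterations ahead of the slowest worker
    (some formulations use <= instead of <)."""
    return worker_clocks[worker_id] - min(worker_clocks.values()) < s

clocks = {"w1": 10, "w2": 9, "w3": 7}   # iteration counts per worker
ssp_can_proceed(clocks, "w1", 3)  # False: w1 must wait for w3
ssp_can_proceed(clocks, "w2", 3)  # True:  w2 may proceed
```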
The aforementioned models exhibit certain limitations. In ASP there is no synchronization, so the waiting time overhead of the workers is eliminated. However, the convergence of the training might be dramatically affected due to inconsistent weight updates. On the other hand, a prevalent shortcoming of BSP is the strict synchronization requirement it imposes. As shown in Figure 1, all workers wait for each other at a synchronization barrier. Each barrier represents the time of the weight synchronization among workers, and a superstep represents the time between subsequent barriers. In BSP-like models the superstep is fixed to a number of iterations (a superstep of a single iteration is typical) and all workers have to wait for the straggler at the end of their iterations, such as in [wangadaptive]. In SSP, while the strict synchronization requirement of BSP is removed, there is still a requirement to manually set the threshold s that controls the iteration difference among workers, and this threshold remains fixed throughout the training period. Further, SSP does not consider the computational capacity of each worker, but merely counts the number of iterations of each worker.
To ameliorate the shortcomings of current synchronization models, we propose ElasticBSP, a model that aims to relax the strict synchronization requirement of the classic BSP for better convergence. Contrary to SSP, the proposed model considers the computational capacity of the workers and, accordingly, offers more flexibility and adaptability during the training phase, without sacrificing the accuracy of the trained model. The key idea of ElasticBSP is that the time at which the barrier is imposed varies and each superstep can permit a different number of iterations per worker, offering elasticity (see Figure 1). We also propose an efficient method that materializes the model, named ZipLine. The method consists of two phases. First, future iteration intervals (timestamps) of each worker are predicted at run time based on their most recent intervals, assuming a stable environment. Then, ZipLine, a one-pass greedy algorithm with lookahead, operates over the predicted intervals of all workers to determine the next synchronization time (i.e., a time at which the overall workers' waiting time overhead is minimized). The method can effectively balance the trade-off between accuracy and convergence speed, in order to accommodate different environments or applications. The major contributions of this work are as follows:
We propose ElasticBSP, a novel synchronization model for scaling the training of distributed deep learning models. ElasticBSP replaces the strict synchronization requirement of other BSP-like models with online decision making about the best time to impose the next synchronization barrier. The model guarantees convergence for a large number of iterations.
We design and develop ZipLine, a one-pass algorithm that efficiently materializes the ElasticBSP model. ZipLine performs online optimization with lookahead to predict the next best synchronization time, and it outperforms sensible baselines.
We present a thorough experimental evaluation of our ElasticBSP model, materialized by ZipLine, on two deep learning models and two popular image classification datasets. The results demonstrate that ElasticBSP converges much faster than BSP and to a higher accuracy than BSP and other state-of-the-art alternatives.
The remainder of the paper is organized as follows. Section II introduces our proposed ElasticBSP model and its properties. Section III formally defines the problem of interest. In Section IV, we present algorithmic details of sensible baselines and our proposed method ZipLine to materialize ElasticBSP. Section V presents an experimental evaluation of the methods. We review the related work in Section VI and conclude in Section VII.
II Elastic Bulk Synchronous Parallel Model
In this section, we propose a novel synchronization model that promises to ameliorate the drawbacks of current models without sacrificing their benefits.
The BSP model guarantees convergence when training DNN models, since the distributed training logically functions like training on a single machine. However, it introduces a large waiting time overhead due to having to wait for the slowest worker in every single iteration (a mini-batch). On the other hand, the ASP model does not perform any synchronization, so the waiting time for synchronization is minimal; however, it is risky to use due to its asynchronous scheme, which renders the convergence uncertain [zhou2018distributed]. The SSP model offers an intermediate solution between the above two extremes. It guarantees convergence [ho2013more] when the number of iterations is large and the user-specified threshold s is small. However, it depends on manually fine-tuning that hyper-parameter, which is non-trivial.
Motivated by the limitations of the current state-of-the-art synchronization models, we propose ElasticBSP. ElasticBSP aims to relax the strict synchronization requirement of BSP. The key properties of ElasticBSP are the following:
The server deals with sequential decision making regarding the best time at which the next synchronization barrier should be imposed (a time at which the minimum waiting time for the entire system is achieved). The decision is based on a prediction model that utilizes the most recent time interval of each worker available to the server to predict its future intervals. The prediction is based on an online optimization with lookahead and assumes a specific limit R on how many future intervals of each worker should be considered. The need for such a limit comes from the need to control the algorithm's run time, since the number of candidate synchronization points grows rapidly as the lookahead limit increases.
The convergence guarantee of the model follows the theoretical analysis of SSP [ho2013more], where a bounded iteration difference exists within some period (a superstep). In the case of ElasticBSP, the iteration difference is bounded by the lookahead limit R within a period that is defined by the next best synchronization time. By the end of that period, the synchronization barrier is imposed on all the workers, gradient aggregation is carried out on the server and, similarly to BSP, the weights are synchronized.
ElasticBSP offers elasticity in the sense that the distance between two consecutive synchronization barriers is not fixed, but is determined online. In addition, the waiting time is not determined by a fixed iteration difference between the fastest and the slowest workers (as in SSP), but by the optimal time to synchronize so as to minimize the waiting time. Moreover, the synchronization time is always bounded by the lookahead limit R, so the model will not degenerate into the ASP model.
III Problem Framework
Most data centers follow high-availability practices [benz2013dependability], so it is realistic to assume that the cluster runs in a stable environment where each iteration time interval (including batch processing and gradient computation) of a worker is similar over a short period. If a worker does not respond within a reasonable time, it is taken out of the distributed system (and of our algorithm, in our case). Note that our algorithm is orthogonal to the fault tolerance problem. We can then heuristically predict the future iteration intervals of the workers (see Figure 2) based on their most recent iterations.
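A minimal sketch of this prediction under the stable-environment assumption (the linear extrapolation shown is our illustration of the idea, not necessarily the paper's exact predictor):

```python
def predict_end_timestamps(last_end, recent_interval, R):
    """Extrapolate the ending timestamps of a worker's next R
    iterations from its most recent iteration interval, assuming a
    stable environment (similar interval in the short term)."""
    return [last_end + i * recent_interval for i in range(1, R + 1)]

# A worker finished its last iteration at t = 100.0 s and its recent
# iterations took about 2.0 s each.
predict_end_timestamps(100.0, 2.0, 4)  # [102.0, 104.0, 106.0, 108.0]
```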
The Problem. For n workers in a cluster, each worker has to process many iterations during training, where each iteration time interval on the same worker is similar. Each iteration interval is measured by the starting and the ending timestamps of processing an iteration. Suppose we predict R future iterations for each worker. Any worker p then has a set containing a list of starting and ending timestamps of its iterations. For subsequent iterations, the ending timestamp of one iteration coincides with the starting timestamp of the next. Thus, we only need the ending timestamps to measure the iterations. Mathematically, we define the set S_p = {t_p^1, t_p^2, ..., t_p^R}, where t_p^i stands for the ending timestamp of the i-th predicted iteration of worker p and 1 <= p <= n. The set S_p contains the R predicted iterations of worker p. We need to find a set Z containing n ending timestamps, one from each set S_p, that are closest to each other on the timeline. The difference between the maximum and the minimum of these ending timestamps is the waiting time for a synchronization. The smallest timestamp indicates the time-spot at which the fastest worker starts waiting, whereas the largest timestamp indicates the synchronization barrier at which all workers have to stop for the synchronization.
From each of the sets S_p, we pick one element to form a new set Z. The difference between the maximum and the minimum elements of Z is defined as d(Z) = max(Z) - min(Z). The slowest and the fastest workers finish their current iterations at times max(Z) and min(Z), respectively, so d(Z) is the waiting time of the fastest worker. Thus, d(Z) dominates the overall waiting time for a synchronization, since the other workers' waiting times are overlapped by the fastest worker's. We are looking for the optimal set Z* which gives the minimum d(Z) over all possible combinations of elements, one from each S_p. Hence, our objective function is:

Z* = argmin_Z d(Z) = argmin_Z (max(Z) - min(Z)).
To solve the proposed problem, we first investigate the brute force approach. We analyze the naive brute force search (naive search) and develop an optimized version of the brute force algorithm, named FullGridScan, since the naive search is infeasible to run as the number of workers scales. We then introduce our approach, ZipLine, which brings down the computation complexity. Lastly, we summarize the computation and space complexity of the approaches in Table I.
Naive search. In order to find the minimum difference d(Z), a straightforward approach is brute force. It first checks all possible combinations of selecting a single element from each of the n sets S_p, where each set has R elements; there are R^n such combinations. Second, it computes their d(Z) values and finds the minimum among them. The set Z* which yields the minimum value is the object we are looking for. The computation complexity of this approach is O(nR^n), since each of the R^n combinations requires O(n) time to evaluate. The space complexity is O(nR^n) if all combinations are held in memory.
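The naive search can be sketched as follows, using itertools.product to enumerate the R^n combinations (the timestamp sets below are hypothetical values for illustration):

```python
from itertools import product

def d(Z):
    """Waiting time of a candidate set Z of ending timestamps."""
    return max(Z) - min(Z)

def naive_search(S):
    """Enumerate all R^n combinations (one ending timestamp per
    worker) and return the combination minimizing d(Z)."""
    return min(product(*S), key=d)

# n = 3 workers, R = 3 predicted ending timestamps each.
S = [[1.0, 5.0, 9.0],
     [2.0, 5.2, 9.6],
     [3.0, 5.4, 10.5]]
naive_search(S)  # (5.0, 5.2, 5.4), the tightest cluster on the timeline
```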
GridScan. An optimized heuristic brute force algorithm (Algorithm 1), used as a basis component of FullGridScan. We arrange the predicted iteration timestamps of the n workers in a matrix M, where each row of the matrix represents a worker and holds the R predicted iteration points (timestamps) of that worker. Designating any point in M, we can always find, in each other row, the point with the shortest distance to it. Collecting these closest points from the other rows along with the designated point into a set Z, we obtain d(Z). Accordingly, designating a whole row of points, we can find the R sets Z associated with the points of the designated row. Finally, we can find, among those R sets, the set with the minimum d(Z). To guarantee we do not miss any early point (on the timeline), we designate the row with the minimum (earliest) first timestamp as the designated row to start the search, which costs O(n). The total computation complexity is O(nR^2): the outer loop over the points of the designated row costs R iterations, and the inner loop, which constructs one combination of points from distinct workers by scanning the other n-1 rows, costs O(nR), as there are R points per row. During the search, we only need to keep the set with the minimum waiting time per point of the designated row, which requires O(n) storage. Along with the O(nR) storage for the points, the space complexity is O(nR).
FullGridScan. In GridScan, only R combinations (of Z) are constructed and the waiting time of each is computed, so some critical combinations (containing a smaller waiting time d(Z)) may be missed. In order to cover more candidate combinations during the search, FullGridScan designates each row of M in turn and repeats Algorithm 1 (without line 4) until all n rows (workers) have been covered. This increases the computation complexity to O(n^2R^2). FullGridScan therefore covers nR combinations in total, versus the R combinations explored by GridScan. The storage complexity, however, remains the same as GridScan.
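A compact sketch of GridScan and FullGridScan following the description above (our reconstruction; the designated-row bookkeeping of Algorithm 1 is simplified, and the matrix values are hypothetical):

```python
def closest(row, t):
    """Point of `row` nearest to timestamp t."""
    return min(row, key=lambda x: abs(x - t))

def grid_scan(M, designated):
    """For each point on the designated row, form a candidate set Z
    from the nearest point of every other row; keep the Z with the
    minimum waiting time max(Z) - min(Z)."""
    best = None
    for t in M[designated]:
        Z = [t if r == designated else closest(M[r], t)
             for r in range(len(M))]
        if best is None or max(Z) - min(Z) < max(best) - min(best):
            best = Z
    return best

def full_grid_scan(M):
    """Repeat GridScan with every row designated in turn, covering
    n*R candidate sets instead of R."""
    return min((grid_scan(M, r) for r in range(len(M))),
               key=lambda Z: max(Z) - min(Z))

M = [[1.0, 5.0, 9.0],
     [2.0, 5.2, 9.6],
     [3.0, 5.4, 10.5]]
full_grid_scan(M)  # [5.0, 5.2, 5.4]
```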
ZipLine. ZipLine scans through the data points only once, as shown in Figure 3. In ZipLine (Algorithm 2), we first merge all sets S_p into one large set L and sort its elements in ascending order of their value (ending timestamps), so that |L| = nR. We consider the elements sorted from left to right by position in L. Second, we define a set Z with the constraint that, at any time, it contains at most one timestamp per worker; we use Z to scan every element of L along the timeline from left to right. Intuitively, Z only checks the worker index of each element to prevent duplicates from the same worker p: when a new timestamp of worker p is added, the old (duplicate) timestamp of worker p in Z is removed. Third, we let the set Z scan the set L by consuming one element of L at a time. At the beginning of the scanning procedure, we initialize Z by filling in elements from the very left of L, while maintaining its constraint, until Z holds n timestamps, one from each of the n workers. Then, we compute the difference d(Z) between the minimum and the maximum elements of Z (i.e., the waiting time). At initialization, we store d(Z) to d_min and Z to Z_min. Next, we add one element from the left of the remaining L to Z per iteration until L is empty. In each iteration, we compute d(Z) and compare it with d_min; if d(Z) is smaller than d_min, we store d(Z) to d_min and Z to Z_min. After at most nR iterations, we attain the optimal set Z* = Z_min. The algorithm only uses O(n) space to store Z. In each iteration, we also iterate through the set Z to remove the duplicate element as the new one is added; this operation maintains the invariant (constraint) of Z and costs O(n). Therefore, the total computation complexity is O(n^2R) (plus O(nR log(nR)) for the initial sort) and the space complexity is O(nR) for storing L.
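ZipLine is essentially a single left-to-right scan over the merged timeline (closely related to the classic smallest-range-over-k-lists problem). A minimal sketch of Algorithm 2 with our own variable names and hypothetical data:

```python
def zipline(S):
    """One-pass scan of the merged timeline.  Z keeps the most recent
    timestamp seen from each worker; once Z covers all n workers,
    every newly consumed element yields a candidate barrier whose
    waiting time is (new element) - min(Z)."""
    n = len(S)
    # L holds all (timestamp, worker) pairs sorted by timestamp.
    L = sorted((t, p) for p, row in enumerate(S) for t in row)
    Z = {}                               # worker -> latest timestamp
    d_min, Z_min = float("inf"), None
    for t, p in L:
        Z[p] = t                         # replaces the worker's older timestamp
        if len(Z) == n:
            d = t - min(Z.values())      # O(n) scan of the window
            if d < d_min:
                d_min, Z_min = d, dict(Z)
    return d_min, Z_min

S = [[1.0, 5.0, 9.0],
     [2.0, 5.2, 9.6],
     [3.0, 5.4, 10.5]]
zipline(S)  # waiting time ~0.4 with barrier set {5.0, 5.2, 5.4}
```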
V Experimental Evaluation
In this section, we run experiments that aim to evaluate:
The runtime performance of ZipLine compared to the FullGridScan baseline algorithm, and the scalability of ZipLine as a function of the number of workers n and the parameter R.
The performance of ElasticBSP compared to the classic BSP and other state-of-the-art synchronization models. Which one converges faster and to a higher accuracy? Which one completes a fixed number of epochs faster?
Dataset: We generate the datasets based on realistic scenarios to evaluate the performance of the algorithms. Table II lists the different scales of dataset configurations used for the evaluation.
Environment: The overhead experiments of ZipLine and the baseline algorithms are run on a server with 24x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz and 64 GB RAM.
V-A ZipLine Performance Comparison
In Table II, we evaluate the algorithms with R = 15 predicted iterations per worker. We also use R = 150 predicted iterations to evaluate the scalability of the algorithms with respect to R. The computation time of each algorithm is the average of 10 trials.
[Table II: computation time of each algorithm for 10, 100 and 1000 workers.]
The number of combinations of elements from the matrix M increases exponentially as the number of workers n scales, or polynomially as the number of predicted iterations R increases, since there are R^n combinations, as described in Section IV. Table II shows that, as the number of workers increases, the computation time of FullGridScan grows much faster than that of ZipLine. For a fixed number of workers, when the number of predicted iterations per worker increases, the computation time of FullGridScan again grows much faster than the others. GridScan can be an alternative when a heuristic result is acceptable and the number of workers is larger than 10.
V-B Distributed Deep Learning using ElasticBSP
We compare the performance of ElasticBSP with BSP, SSP and ASP by training DNN models from scratch under each of them in a distributed environment. We set a small threshold s = 3 for SSP to ensure convergence and achieve higher accuracy [ho2013more]. For ElasticBSP, we set R, the number of predicted future iterations per worker, to 15, 30, 60, 120 and 240, respectively. We ran each experiment three times and chose the median result based on the test accuracy.
Environment: We implement ElasticBSP in MXNet [chen2015mxnet], which supports the BSP and ASP models. The experiments are run on 4 IBM POWER8 machines. Each machine has 4 NVIDIA P100 GPUs, 512 GB RAM and 210 cores.
Datasets & DNN models: We train downsized AlexNet [krizhevsky2012imagenet], ResNet-50 [he2016deep] on datasets CIFAR-10 and CIFAR-100 [krizhevsky2009learning].
V-B1 Downsized AlexNet
We set the mini-batch size to 128, the number of epochs to 400, the learning rate to 0.001 and the weight decay to 0.0005. ElasticBSP converges faster and to a higher accuracy than the other distributed paradigms (see Figure 4(a)). BSP converges slower than ASP and SSP but reaches a higher accuracy than both. Increasing R introduces more predicted elements to be processed by ZipLine to determine the optimal synchronization time and therefore increases the computation overhead. As a result, when R becomes too large, it offers no benefit but consumes more training time. For this model, R = 240 costs extra training time to finish the 400 epochs compared to the smaller R values. On this model, SSP, ASP, ElasticBSP (R = 15, 30) and BSP complete the fixed 400 epochs in ascending order of training time.
V-B2 ResNet-50
We set the mini-batch size to 128, the number of epochs to 300, and the learning rate to 0.5 with a decay of 0.1 at epoch 200. The results are shown in Figure 4(b). ElasticBSP converges faster and to a slightly higher accuracy than BSP. Although ASP and SSP converge faster than ElasticBSP and BSP, both cost much more training time to complete the 300 epochs. Moreover, ElasticBSP converges to a slightly higher accuracy than ASP and SSP. ASP and SSP have no bulk synchronization barriers and thus achieve higher iteration throughput, which causes faster convergence. However, a larger iteration throughput introduces more frequent communication between the workers and the server and so increases the number of weight updates. Since weight updates have to be computed in sequence (as mentioned in Section I), their tasks are queued on the server, which introduces extra delay. A thorough discussion on why ASP and SSP converge faster but take more training time than BSP can be found in [zhao2019dynamic]. On this model, ElasticBSP, BSP, SSP and ASP complete the fixed 300 epochs in ascending order of training time.
Discussion: The above DNN models show that ElasticBSP converges to a higher accuracy than BSP and takes less training time when R is not too large. Note that the different performance of ElasticBSP on the two DNN models is expected, since AlexNet contains 2 fully connected layers whereas ResNets have no fully connected layers. Fully connected layers require much less computation time than convolutional layers, but their representation requires many more parameters, which leads to a large model size. Convolutional networks without fully connected layers, such as ResNets, take much more computing time but consume less communication time, due to their smaller model size, compared to networks with fully connected layers. When the ratio of communication time to computation time is small, less training time can be saved. A more detailed analysis of the different behavior of DNN models with different ratios of computation and communication time can be found in [wangadaptive]. [zhao2019dynamic] also provides a detailed rationale for the different performance of distributed training using ASP, BSP and SSP on different DNN models.
VI Related Work
A number of important works closely related to our research have already been cited throughout the manuscript. Here, we elaborate on three alternative models that have been proposed to mitigate the slowdown caused by the straggler problem of the classic BSP. A-BSP [wang2018aggressive] handles the straggler problem by terminating the iteration job of the slowest worker once the faster workers have completed their jobs. That way, the waiting time is eliminated. The remaining data of the terminated job of the slowest worker is prioritized in the next iteration. This design is limited to CPU clusters, where samples are processed one after another. In a GPU cluster, by contrast, a batch of samples is processed all at once in parallel: the GPU takes a batch of samples per iteration and computes the gradients, so decreasing the data of a batch (iteration) does not reduce the computation time on the GPU. Furthermore, GPUs do not support preemption [bauer2011cudadma]; terminating a job (iteration) means losing all the computed results on that batch of data. Chen et al. [chen2016revisiting] deal with the straggler problem by adding b extra backup workers to the distributed training with n workers. In this approach, n + b workers run the model training, and for each iteration the server only accepts the first n gradient updates to arrive (from the faster workers) before moving on to the next iteration. The gradients from the slower workers are dropped. This saves the waiting time of the faster workers, but the computing resources of the slower workers are wasted in the iterations whose gradients are dropped. ADACOMM [wangadaptive] uses periodic-averaging SGD (PASGD) for bulk synchronization, in which workers perform local updates for τ iterations before a weight synchronization. That way, the communication cost of uploading gradients to and downloading weights from the server is paid only once every τ iterations. The straggler problem is not addressed in this work. ADACOMM estimates the optimal τ for a bulk synchronization of local weights based on the training loss. Our ElasticBSP predicts the optimal synchronization time for all workers, where each worker may run a different number of iterations within a superstep, as opposed to ADACOMM, where τ is uniformly assigned to all workers.
VII Conclusion
In this paper, we proposed ElasticBSP for distributed DNN model training using the parameter server framework. ElasticBSP relaxes the bulk synchronization requirement of the classic BSP and allows asynchronous gradient updates to a certain extent, while ensuring the quality of convergence and achieving higher accuracy. As a result, it increases the iteration throughput of the workers. ElasticBSP operates in two phases per weight synchronization: first, R future iterations of each worker are predicted; then, ZipLine is applied to determine the next synchronization barrier that minimizes the overall workers' waiting time overhead. ZipLine is a greedy one-pass algorithm that adds minimal overhead on the server, so it can be easily ported to popular distributed machine learning frameworks. The experimental results show that ElasticBSP provides faster convergence than the classic BSP and achieves higher (or comparable) accuracy on the test datasets than other state-of-the-art synchronization models.
This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), IBM Canada and the Big Data Research Analytics and Information Network (BRAIN) Alliance established by Ontario Research Fund - Research Excellence Program (ORF-RE).