Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

08/12/2019 ∙ by Shigang Li, et al. ∙ ETH Zurich Institute of Science and Technology Austria 0

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Motivation

Deep learning models are on a steep trajectory to becoming the most important workload on parallel and distributed computer systems. Early convolutional networks demonstrated groundbreaking successes in computer vision, ranging from image classification to object detection 

(Huang et al., 2017; Simonyan and Zisserman, 2014)

. More recent developments in recurrent and transformer networks enable impressive results in video classification, natural language processing for machine translation, question answering, text comprehension, and synthetic text generation. The latter models contain more than 1.5 billion parameters and take weeks to train 

(Devlin et al., 2018; Radford et al., 2018). Other demanding neural networks are trained on the largest supercomputers to achieve scientific breakthroughs (Mathuriya et al., 2018; Kurth et al., 2018). Furthermore, the models are growing exponentially in size, OpenAI is predicting a 10x growth each year (Amodei and Hernandez, 2018) potentially leading to artificial general intelligence. In order to support this development, optimizing the training procedure is most important.

The training procedure of deep learning is highly parallel but dominated by communication (Ben-Nun and Hoefler, 2018). Most parallel training schemes use data parallelism where full models are trained with parts of the dataset and parameters are synchronized at the end of each iteration. The total size of allreduce grows with the model size, which ranges from a few megabytes (Huang et al., 2017) to several gigabytes (Radford et al., 2018) and grows quickly. The allreduce operation is not atomic and it can be split into layer-wise reductions, which can easily be overlapped with the layer computation using non-blocking collectives (Hoefler et al., 2007; Awan et al., 2017). Yet, the optimal scaling of an allreduce of size is at best in processes (Patarasuk and Yuan, 2009; Hoefler and Moor, 2014; Renggli et al., 2018). Thus, growing process counts will reduce the parallel efficiency and eventually make the reduction a scaling bottleneck.

The communication aspects of deep learning have been investigated in many different contexts (Sergeev and Del Balso, 2018; Renggli et al., 2018), see the survey for an overview (Ben-Nun and Hoefler, 2018). In this work, we identify load imbalance as an additional barrier to scalability. When some processes finish the computation later than others, all processes will wait for the last one at the blocking allreduce function. Load imbalance can be caused by the system itself, for example, when training on multi-tenant cloud systems (Schad et al., 2010; Iosup et al., 2011; Jackson et al., 2010) or by system or network noise (Hoefler et al., 2009, 2010) in high-performance machines. A second, and more prominent cause of imbalance is inherent imbalance in the computation that causes varying load across different processes. While noise from the system is generally low on well-maintained HPC machines (Hoefler et al., 2010), the inherent load imbalance of the training workloads cannot easily be avoided. Natural language processing tasks have sentences of highly varying length while video processing tasks have videos with different number of frames. For example, the training dataset of UCF101 (Soomro et al., 2012) contains videos that range from 29 to 1,776 frames.

Several researchers have shown that the training process itself is quite robust with respect to bounded errors. In fact, data augmentations such as Cutout (Devries and Taylor, 2017) and Dropout (Ba and Frey, 2013) introduce random errors and omissions into the training process to improve generalization properties. Several packages take advantage of this robustness and employ three techniques in tandem: (1) communicated weights are quantized to more compact number representations (Seide et al., 2014; Strom, 2015), (2) only the most significant weights are sent during each allreduce (Renggli et al., 2018; Alistarh et al., 2018), and (3) updates are only sent to limited (random) neighborhoods using gossip algorithms (Lian et al., 2018). We propose to exploit this robustness in a new way: we perform the allreduce eagerly in that we ignore the input gradients of processes that come late in order to not delay all processes. The communication partners are selected based on their workload (which can be randomized) and the allreduce itself is performed with high-performance reduction topologies (Hoefler and Moor, 2014) in logarithmic depth. We call our method eager Stochastic Gradient Decent (eager-SGD), as a counterpart to synchronous SGD (synch-SGD) (Ben-Nun et al., 2019; Sergeev and Del Balso, 2018; Awan et al., 2017). Fig. 1 shows the difference between synch-SGD and eager-SGD.

Figure 1. Synch-SGD vs eager-SGD under load imbalance. w are the weights in training step .

Specifically, we propose to relax the allreduce operation to partial collectives in eager-SGD. A partial collective is an asynchronous operation where a subset of the processes can trigger and complete the collective operation. Absentee processes follow a predefined protocol, such as contributing potentially outdated data. We define two partial collectives — solo allreduce, a wait-free operation that one process triggers; and majority allreduce, in which the majority must participate.

Our theoretical analysis shows that solo allreduce does not guarantee bounded error, as necessary in SGD, yet empirically converges in cases of moderate load imbalance. Majority allreduce is proven to bound the error, but is not completely wait-free. The statistical guarantee, however, is sufficient to both train deep neural networks and avoid the delays. We show that solo and majority collectives are suitable for different cases, depending on load imbalance severity. Our main contributions are:

  • A detailed analysis of workload imbalance in deep learning training.

  • Definition and implementation of partial collectives, specifically solo and majority allreduce.

  • Eager-SGD for asynchronous decentralized distributed training of neural networks with proof of convergence.

  • An experimental study of convergence and training speed for multiple networks, achieving 1.27 speedup over SGD on a video classification task without losing accuracy.

2. Load-Imbalance in Deep Learning

Load imbalance widely exists in the training of deep learning models, which can be caused by either the applications or the system itself (Schad et al., 2010; Iosup et al., 2011; Jackson et al., 2010; Hoefler et al., 2009, 2010).

2.1. Video Processing

(a) Video length distribution.
(b) Runtime distribution on a P100 GPU (batch size=16).
Figure 2. Load imbalance in the training of an LSTM model on UCF101 (Soomro et al., 2012).

Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997)

is a type of unit cell in Recurrent Neural Networks (RNN). In video classification tasks, LSTMs are used 

(Yue-Hei Ng et al., 2015a; Donahue et al., 2014; Ballas et al., 2015)

to process a sequence of frames for a video as input (optionally following convolutional neural networks that preprocess the images to features), and output a probability distribution over a set of classes. Due to the recurrent structure of the network, the computational overhead is proportional to the number of frames in the input video.

Fig. 1(a) shows the video length distribution (number of frames) over all 9,537 videos in the training dataset of UCF101 (Soomro et al., 2012). The video length is distributed between 29 and 1,776 frames, with a median frame count of 167 and standard deviation of 97. Fig. 1(b)

shows the runtime distribution over the 1,192 sampled batches in two epochs to train a 2,048-wide single-layer LSTM model on video frame features. As is standard in variable-length training, videos with similar lengths are grouped into buckets for performance. The runtime is distributed from 201 ms to 3,410 ms, with a mean runtime of 1,235 ms and standard deviation of 706 ms. These statistics above show that training an LSTM model for video classification exhibits inherent load imbalance.

2.2. Language Processing

Figure 3. Runtime distribution on a P100 GPU (batch size = 64), using a Transformer model on WMT16.

Transformers (Vaswani et al., 2017) are sequence-to-sequence models that translate a sequence of words from one language to another. Different from RNN, a Transformer network replaces the recurrent structure with an attention mechanism. To train the Transformer model, the computation overhead increases with the length of the input (and output) sentences. Typically, the sentences in the training dataset for a language model have various lengths, and thus the workload is unbalanced across different batches. Fig. 3 shows the runtime distribution over the 20,653 randomly sampled batches in 1/3 epoch to train a Transformer on the WMT16 dataset. The runtime is distributed from 179 ms to 3,482 ms with a mean of 475 ms and standard deviation of 144 ms, which shows the inherent load imbalance in language model training.

2.3. Training in the Cloud

Figure 4. Runtime distribution on Google Cloud with 2xV100 GPUs (batch size=256, ResNet-50 on ImageNet).

Performance variability is common in cloud computing (Schad et al., 2010; Iosup et al., 2011; Jackson et al., 2010). Fig. 4 shows the runtime distribution over the sampled batches for 5 epochs of training for the classic ResNet-50 model (He et al., 2016) on ImageNet (Deng et al., 2009), on a standard Google Cloud instance (n1-standard-16 with 2x Nvidia V100 GPUs). The runtime is distributed from 399 ms to 1,892 ms with a mean of 454 ms and standard deviation of 116 ms. Since ResNet-50 on ImageNet has the same input size for different batches, the load imbalance is caused mainly by the system. Compared with imbalanced applications (e.g., Transformer, LSTM), the load imbalance on cloud servers is relatively light.

3. Distributed Deep Learning

Deep neural networks are continuously differentiable functions that are composed of multiple operators, representable by a directed acyclic graph (LeCun et al., 2015)

. The gradient of those functions can be computed using the backpropagation algorithm 

(LeCun et al., 1998)

, processing the nodes in the DAG in a reverse topological order. Deep learning frameworks, such as TensorFlow 

(Abadi et al., 2015), typically execute parallel operations in the DAG in arbitrary order.

Supervised deep neural network training typically involves first-order optimization in the form of Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951). SGD optimizes the expected loss value over the “true” distribution of input samples by descending in the direction of a random subset of the training samples (minibatch). In a distributed data-parallel setting, the SGD algorithm (Algorithm 1) consists of multiple learner processes, each of which updates a global view of the parameters according to a different random minibatch at the same time. Given an update rule and local minibatch of size , the learners modify the global view of the parameters by using an average of the gradients obtained by the agents.

1:for  to  do
2:       Sample elements from dataset
3:       Obtain parameters from global view
7:      Update global view of parameters to
8:end for
Algorithm 1 Distributed Minibatch SGD

A straightforward manner to maintain a global view is using a Parameter Server (PS) architecture (Dean et al., 2012), where one or several nodes assume the role of a PS, broadcasting up-to-date weights (line 3) to learners prior to each step and aggregating gradients from them (line 7). This enables the PS to asynchronously update the global view (Recht et al., 2011), or require a fraction of learners to send gradients before progressing to the next step (Ho et al., 2013).

As the PS model is generally not scalable, another mode of operation implements SGD using collective operations. In such implementations, accumulating the gradients (line 7) is done via an allreduce operation, where each learner contains its own local view of the weights (Ben-Nun and Hoefler, 2018). Horovod (Sergeev and Del Balso, 2018) is one such implementation over the TensorFlow framework, which also fuses several allreduce operations into one in order to reduce overhead. However, due to the arbitrary order of execution imposed by the frameworks, Horovod uses a master process for negotiation communication (achieving consensus on which parameters are sent).

A more scalable method, used in the Deep500 DSGD optimizer (Ben-Nun et al., 2019), is to ensure an order of communication execution by adding control dependencies into the computation DAG, as shown in Fig. 5. In the backward pass, the allreduce operations are executed in a specific order after finishing the local gradient computation. Note that synchronizing gradient order can be avoided completely using non-blocking collectives (Message Passing Interface Forum, 2015). In this mode, each gradient communication message is assigned to an agreed-upon numeric tag, and multiple allreduce operations may be in-flight concurrently. Prior to updating the local view of the weights, a waitall command must be issued. All in all, these approaches reduce overhead in imbalanced loads by overlapping communication and computation, but do not mitigate it completely.

Figure 5. Adding control dependency in the computation DAG, using a block of ResNet-50 as an example.

4. Partial Collective Operations

A collective communication involves a set of processes that cooperate to progress their internal state. Some of these operations, e.g., allreduce, implicitly synchronize the participants: the operation cannot terminate before the slowest process joins it. We define these collectives as synchronous and introduce a new class of partial collectives that relax the synchronization. We now discuss two variants of partial collectives: solo and majority.

4.1. Solo Collectives

A solo collective (Di Girolamo et al., 2015) is a wait-free operation, which forces the slow processes to execute the collective as soon as there is one process executing it. This process, called initiator, is in charge of informing the others to join the collective. While solo collectives remove the synchronization delays, they change the semantics of collective operations, which may now complete by using stale data from the slow processes.

4.1.1. Schedule Activation

We define a schedule as a set of operations that a process executes in order to globally progress the collective operation. In particular, a schedule is a directed acyclic graph (DAG) where the vertices are operations and the edges are happens-before dependencies among them. We define the following operations:

  • [noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt,leftmargin=*]

  • Point-to-point communications: sends and receives.

  • Computations: simple computations defined between two arrays of data items. The type of the data items is defined according to the MPI basic types (Message Passing Interface Forum, 2015).

  • Non-operations (NOP): complete immediately and are only used to build dependencies.

Operations can be dependent on zero, one, or more other operations (with and or or logic) of the same schedule.

The main difference between synchronous and solo collectives is the time at which processes activate (i.e., starts executing) their schedule. For synchronous collectives, the schedule is executed only when a process reaches the collective function call (e.g., MPI_Allreduce). We define this activation as internal. For solo collectives, an external activation is also possible: the processes start executing the schedule because of an activation message received from the initiator, which starts broadcasting it immediately after the internal activation of its schedule. In particular, a solo collective is composed of two schedules: one for broadcasting the activation and the other one for executing the collective operation.

In a solo collective, any process can become the initiator, hence any process must be capable of broadcasting the activation message. The activation broadcast is implemented as a modified version of the recursive doubling communication scheme: this is equivalent to the union of binomial trees (optimal for small message broadcast, like the activation) rooted at the different nodes.

Figure 6. Solo collective activation (left) and process schedule (right). Operations are represented by circles: blue = send, green = receive, orange = computation, white = NOP. A dashed border means the operation can be fired as soon as one of its dependencies are satisfied.
Activation example

Fig. 6 shows a solo allreduce example. On the left, we show the global communications view that is split in two phases: activation and allreduce. The highlighted communication shows the activation path if the initiator is, e.g., process P3. For the allreduce, we use a recursive doubling implementation. Note that any collective implementation that can be expressed as a schedule can be linked to the activation phase. On the right we show the internal schedule of process P3. An internal activation (i.e., P3 making the function call explicitly) translates in the execution of NOP 0 (N0): this leads to the send operations S0 and S1 being fired to start broadcasting the activation message and to the execution of N1, which signals the activation of the allreduce schedule. Alternatively, if P3 is not the initiator, it will receive a message in receive R0 or R1: if the activation is received by R0, then P3 has to forward the activation message to P1 with send S1 (i.e., P3 is an internal node of the activation binomial tree). Also in this case NOP N1 will be executed, leading to the execution of the allreduce schedule.

Multiple initiators

Multiple processes may join the collective at the same time: in this case we need to ensure that the collective is executed only once. To address this issue, we set the operations to be consumable, meaning that the same operation cannot be executed twice. For example, let us assume that nodes P2 and P3 reach their internal activation at the same time. When P3 receives the activation message from P2 (i.e., through R0) there are two possible cases: 1) S1 is still not consumed and then it is executed; 2) S1 has been fired due to the internal activation and will not be executed a second time. NOPs are also consumable, hence N1 (i.e., the activation) can be executed only once.

Persistent schedules

Processes can be asked to join a solo collective only once before they reach their internal activation: once the schedule is executed, it needs to be re-created by the application in order to be executed again. To enable multiple asynchronous executions of solo collectives, we introduce persistent schedules. Such schedules transparently replicate themselves once executed, able to serve a new solo collective without requiring application intervention. Multiple executions of the same solo collective overwrite the data in the receive buffer, which always contains the value of the latest execution.

4.2. Majority Collectives

An issue of solo collectives is that if one or few processes are always faster then the others, then the collective will always complete by taking the stale data of the slower processes. In cases like DNN training, this scenario may negatively impact the convergence because the training will advance only considering the updates of few processes. To overcome this issue, we introduce majority collectives, which requires at least half of the processes to join before completing. We implement majority collectives by not letting any process become the initiator, as in solo collectives. Instead, at each execution of a persistent schedule, the processes designate an initiator by randomly selecting a rank (consensus is achieved by using the same seed for all the processes). When a process joins the collective (i.e., internal activation), it checks whether it is the designated initiator: only in that case it keeps running the internal activation followed by the actual collective schedule.

We now discuss how the above described implementation can provide a statistical guarantee that at least half of the processes on average contribute to the collective. Suppose the same collective operation is called by many iterations, such as in model training. We sort all the processes by the time they reach a collective operation. Since the probability that any process is specified as the initiator is equal to 1/, the expectation of the randomly specified initiator is the /2-th process among the sorted processes, namely on average half of the processes reach the collective operation earlier than the initiator. For a workload distribution with one mode and a tail, such as in Figs. 23, and 4, the probability that part of the processes reach the collective at a similar time to the initiator is high; then, more than half of the processes on average actively participate in the operation.

4.3. Asynchronous Execution by Library Offloading

The schedule of a partial collective can be asynchronously executed with respect to the application. We develop a communication library that allows to express communication schedules and offload their execution to the library itself. The schedule execution can take place on the application thread (i.e., when the application enters the library), or on an auxiliary thread. Once the application creates and commits a schedule, the library starts executing all the operations that have no dependencies. The remaining ones are executed as their dependencies are satisfied.

4.4. Discussion

Offloading the schedule execution to the network interface card (NIC) can provide different advantages such as asynchronous execution, lower latency, and streaming processing. Di Girolamo et al. (Di Girolamo et al., 2015) show how solo collectives can be offloaded to Portals 4 (Barrett et al., 2018) NICs by using triggered operations. This approach is limited by the amount of NIC resources that bounds the number of times a persistent schedule can be executed without application intervention. In fact, persistent schedules cannot replicate themselves with Portals 4: if we want to asynchronously execute the same schedule times that we need to make and offload of its copies. After these executions, the host CPU needs to setup the schedule again. This limit can be removed by implementing the schedule execution with the sPIN programming model (Hoefler et al., 2017), which allows to execute user-defined code on the NIC. A sPIN implementation of partial collective operations would then be able to replicate the schedule on-the-fly upon completion.

5. Eager-SGD

Algorithm 2 illustrates the main procedure of eager-SGD. Instead of calling a synchronous allreduce in the distributed optimizer (Fig. 5) to accumulate the gradients, eager-SGD uses the partial allreduce operations (Line 7). Either solo or majority allreduce can be used depending on the severity of load imbalance.

Fig. 7 presents an example of how eager-SGD works with partial collectives, in which w and G represent the weights and the gradients calculated on process at training step , respectively, and represents the update rule. In step , suppose process P1 is faster than process P0. P1 finishes the computation of G and then triggers the partial allreduce operation. Since P0 does not finish the computation of G at this time, it only passively contributes null gradients G to the partial allreduce at step . After P0 finishes the computation of G, it finds out that the partial allreduce at step is already finished by checking the results in the receive buffer. P0 updates the weights of step using G stored in the receive buffer of the partial allreduce and G becomes the stale gradients. The stale gradient G is then stored in the send buffer. If P0 does not catch up with P1 at step , P0 will passively participate in the partial allreduce again and contribute G. If P0 catches up with P1 at step (as in the case shown in Fig. 7), P0 will add G and G (calculated in step ) together, and contribute the accumulated gradients G to the partial allreduce; P0 resets the send buffer to G after finishing allreduce.

1: is local batchsize for processes
2:for  to  do
3:       Each process samples elements from dataset
9:end for
Algorithm 2 Eager-SGD
Figure 7. Partial collective operations in eager-SGD.

In severe load imbalance situations, some slower processes may lag behind by more than one step. The data in the receive buffer of the partial allreduce will then be overwritten and only the latest data in the receive buffer can be seen, which results in different weights on different processes. This may result in slightly lower accuracy as shown in Section 6.2.2. Thus, we periodically synchronize the models across all processes to eliminate the side effect. Since we only synchronize the models every tens of epochs, the overhead can be ignored.

5.1. Correctness and Convergence Guarantees

5.1.1. System Model

In this section, we prove that, under a reasonable set of modeling assumptions, the eager-SGD algorithm will still converge. We assume a system with asynchronous processors indexed as , which take steps at different speeds.

For the purposes of analysis, we can split the execution in serial fashion into rounds, where each round can be mapped to the partial-allreduce of corresponding index. Without loss of generality, we assume that each processor participates in each round , since it eventually submits an update to the corresponding partial-allreduce, which we denote by , for asynchronous distributed sum. However, its update may or may not be delivered to the other processors. Each partial-allreduce has the following semantics:

  • (Invocation) Each process proposes a

    -dimensional vector

    , corresponding to its current proposed update, to .

  • (Response) Each process receives a tuple , where is the -dimensional update to the parameter set corresponding to round , as decided by the shared object , and is a boolean stating whether the update by process has been included in .

We can therefore rephrase the algorithm as having each process invoke the object in each round, with its current update. If its update is not “accepted” () then the processor simply adds it to its update in the next iteration. The ADS objects we implement provide the following guarantees.

Lemma 5.1 ().

Each ADS object ensures the following:

  1. (Liveness) The object eventually returns an output at every invoking process.

  2. (Safety) The output is consistent, in the sense that (1) it is a correct average of a subset of the proposed updates in the round; (2) the returned bits reflect its composition; and (3) the output is the same at every invoking process.

  3. (Quorum Size) The subset of proposed updates included in the output is of size , where is a lower bound parameter ensured by the algorithm.

  4. (Staleness Bound) There exists a bounded parameter such that any update by a process can be rejected by the ADS objects for at most consecutive rounds from the time it was generated before being accepted.

5.1.2. Convergence Proof

We now show that these properties are sufficient for eager-SGD to ensure convergence for a standard class of smooth non-convex objectives. In the following, all norms are -norms, unless otherwise stated.

Assumption 1 (Loss Function).

We assume that our objective loss function

satisfies the following:

  • (Lower Bound) The function is bounded from below, that is, there exists a finite value such that, .

  • (Smoothness) The function is -smooth, i.e.

Further, we make the following standard assumptions about the gradients generated by the nodes:

Assumption 2 (Gradients).

For any round and process , the gradients generated by the processes satisfy the following, where expectations are taken with respect to the random data sampling at round .

  • (Unbiasedness)

  • (Second Moment Bound)

    There exists s.t.:

Convergence Bound

We can now prove the following:

Theorem 5.2 (Eager-SGD Convergence).

Consider an arbitrary objective function and gradient sampling scheme satisfying Assumptions 1 and 2. Fix the success parameter . Then, there exists a small constant learning rate value

such that we execute the eager-SGD algorithm for iterations, we are guaranteed to reach an iterate with such that


We make the following observations regarding the bound. First, we note that, since we analyze non-convex objectives, we must settle for a weaker notion of convergence than in the convex case (where we can prove convergence to a global minimum): specifically, we prove that, for a given sequence of learning rates, the algorithm will converge to a point of negligible gradient. Second, we note the linear dependence in and for the number of iterations to convergence, i.e.:

Thus, we would like the maximum delay and the number of “missed” gradients per round to be minimized. However, obviously, having no stragglers would imply higher synchronization cost. This suggests that, in practice, the algorithm should trade off the additional performance cost of synchronization with the slower convergence due to delayed gradient information.

Tasks Models Parameters Train data size Batch size Epochs Processes
Hyperplane regression One-layer MLP 8,193 32,768 points 2,048 48 8
Cifar-10 ResNet-32 (He et al., 2016) 467,194 50,000 images 512 190 8
ImageNet (Deng et al., 2009) ResNet-50 (He et al., 2016) 25,559,081 1,281,167 images 8,192 90 64
UCF101 (Soomro et al., 2012) Inception+LSTM (Yue-Hei Ng et al., 2015b) 34,663,525 9,537 videos 128 50 8
Table 1. Neural networks used for evaluation.

6. Evaluation

Experiments are conducted on the CSCS Piz Daint supercomputer with Cray Aries interconnect. Each XC50 compute node contains a 12-core Intel Xeon E5-2690 CPU with 64 GiB RAM, and one NVIDIA Tesla P100 GPU. The communication library is Cray MPICH 7.7.2. We use one MPI process per node and utilize the GPU for acceleration in all following experiments. First, we evaluate the performance of the partial collective operations using a microbenchmark. Then, we use the different neural networks summarized in Table 1 to compare our eager-SGD with the state-of-the-art synch-SGD implementations (Horovod (Sergeev and Del Balso, 2018) and Deep500 (Ben-Nun et al., 2019)), under both simulated and real workload imbalance environments.

6.1. Partial Allreduce Operations

Figure 8. Microbenchmark used to test the latency of the collective operations.

We design a microbenchmark, shown in Fig. 8

, to evaluate the performance of partial allreduce operations and MPI_Allreduce (Cray MPICH) with unbalanced workload. All the processes are linearly skewed (line 4) before calling the collective operations and the average latency among all the processes is recorded (lines 7-10). The microbenchmark is a special case with severe load imbalance, which is useful to verify the statistical guarantee of majority allreduce. Experimental results on 32 processes are presented in Fig. 

9. Compared with MPI_Allreduce, solo and majority allreduce operations reduce the latency by on average 53.32x and 2.46x, respectively. This is because all the processes (except the slowest one) for MPI_Allreduce are delayed; solo allreduce is not delayed since the fastest process will trigger the operation immediately; and majority allreduce has to wait for a randomly specified process to trigger the operation, and thus it is moderately delayed.

Figure 9. Average latency comparison between MPI_Allreduce and partial allreduce running on 32 processes by 64 iterations. Processes are linearly skewed by injecting load imbalance from 1 ms to 32 ms.

For the partial collective operations, we refer to the initiator together with the processes that arrive at the operation before the initiator as the active processes, which contribute the latest data (line 5 in Fig. 8). The other processes only contribute null values (line 13). For solo allreduce, since the fastest process is the initiator and all the processes are fully skewed, the Number of Active Processes (NAP) is around 1, as shown in Fig. 9. For majority allreduce, since the initiator is randomly specified, the expectation of NAP is half of the total processes. On average 16 out of 32 processes for majority allreduce are active processes, which means half of the processes contribute the latest data when the processes are fully skewed.

Figure 10. Throughput and loss comparison between synch-SGD and eager-SGD for hyperplane regression using 8 processes. "synch/eager-SGD-200/300/400" represent 200/300/400 ms load imbalance injection, respectively. Each point is at the boundary of one epoch.
(a) Throughput comparison. Each point is at the boundary of one epoch.
(b) Top-1 training accuracy. Each point is at the boundary of one epoch.
(c) Top-1 test accuracy. Each point is at the boundary of every 10 epochs.
Figure 11. Comparisons between synch-SGD and eager-SGD for ResNet-50 on ImageNet using 64 processes. "synch/eager-SGD-300/460" represent 300/460 ms load imbalance injection, respectively.

6.2. Throughput and Convergence with Simulated Dynamic Workload Imbalance

We use three networks shown in Table 1

, including a multilayer perceptron (MLP), ResNet-32, and ResNet-50, to evaluate the performance of eager-SGD with simulated workload imbalance. From the application perspective, these three networks have balanced workload during the distributed training, since each batch has equivalent workload. We manually inject delays to simulate the dynamic load imbalance environment caused by the training system, as discussed in Section 


6.2.1. Hyperplane Regression, Light Load Imbalance

We generate both training and validation datasets for a 8,192-dimensional hyperplane regression using the equation: , where () is the input vector and is the label. An one-layer MLP is used to learn the coefficients () of the hyperplane. We use 8 processes with the total batch size of 2,048 to train the model for 48 epochs. To simulate the dynamic load imbalance environment, we randomly select one process out of the 8 processes at every training step to inject a certain amount of delay, according to the performance variability shown in Fig. 4. The throughput comparison between synch-SGD (Deep500) and eager-SGD (using solo allreduce) is shown in Fig. 10 (top half). With 200, 300, and 400 ms load imbalance injection, eager-SGD achieves 1.50x, 1.75x, and 2.01x speedup over synch-SGD, respectively. We observe that the more severe the load imbalance, the worse the performance of synch-SGD because of the synchronization overhead. On the other hand, the performance of eager-SGD is stable. Given that the throughput on a single GPU node with batch size of 2,048 is 0.64 steps/s, eager-SGD with 400 ms load imbalance injection still achieves 3.8x speedup in strong scaling on 8 GPU nodes.

Fig. 10 (bottom half) presents the validation loss (mean squared error) as a function of the training time, which shows that eager-SGD using solo allreduce converges with equivalent loss value (around 4.7) to synch-SGD but significantly reduces the training time. Since the processes are not severely skewed and the stale gradients are added to the next training iteration (as discussed in Section 5), using solo allreduce is enough for convergence. When using majority allreduce, the throughput of eager-SGD is lower than using solo allreduce (1.64 step/s vs 1.37 step/s with 200 ms load imbalance injection).

6.2.2. ResNet-50 on ImageNet, Light Load Imbalance

Residual Network (ResNet) (He et al., 2016) is widely used in computer vision tasks. To evaluate the performance of eager-SGD, we use 64 processes with a total batch size of 8,192 to train ResNet-50 on ImageNet for 90 epochs. To simulate the dynamic load imbalance environment, we randomly select 4 processes out of the 64 processes at every training step to inject a certain amount of delay, according to the performance variability on Cloud machines discussed in Section 2.3. Fig. 10(a) presents the throughput comparison between synch-SGD (Horovod and Deep500) and eager-SGD using solo allreduce. With 300 and 460 ms load imbalance injection, eager-SGD achieves 1.25x and 1.23x speedup over Deep500, respectively; 1.14x and 1.22x speedup over Horovod, respectively. As the load imbalance injection increases to 460 ms, the throughput of eager-SGD decreases to 1.15 steps/s. This is because severe load imbalance leads to higher overhead in the activation phase of solo allreduce. Given that the throughput of a single GPU node with batch size of 128 is 1.56 steps/s, eager-SGD running on 64 processes with 460 ms load imbalance injection still achieves 46.9x speedup in weak scaling.

Fig. 10(b) and Fig. 10(c) present the Top-1 train and test accuracy as a function of the training time, respectively. We train the model three times for each SGD, and obtain stable accuracy results. For top-1 accuracy, Deep500 achieves 79.1% train accuracy and 75.7% test accuracy, Horovod achieves 79.0% train accuracy and 75.8% test accuracy, while eager-SGD using solo allreduce achieves 78.4% train accuracy and 75.2% test accuracy on average over different load imbalance injections. Note that without model synchronization at every 10 epochs, the top-1 test accuracy of eager-SGD decreases to 74.1%. For top-5 accuracy, synch-SGD achieves 92.6% test accuracy, while eager-SGD using solo allreduce achieves 92.4% test accuracy on average. The experimental results on ResNet-50 demonstrate that eager-SGD (solo) significantly improves the training speed without losing accuracy for deep neural networks in light load imbalance environment.

6.2.3. ResNet-32 on Cifar-10, Severe Load Imbalance

To test the robustness of eager-SGD, we train ResNet-32 on Cifar-10 with 8 processes for 190 epochs in a severe load imbalance environment. All 8 processes are skewed by injecting load imbalance from 50 ms to 400 ms at every training step. The injection amount over the processes is shifted after each step. Fig. 12 presents the test accuracy as a function of the training time. Eager-SGD using solo allreduce has the highest training speed but with lower test accuracy. Solo allreduce only waits for the fastest process to inform the other processes to participate in allreduce, but most of them will contribute stale gradients. Majority allreduce can solve the lower accuracy problem caused by solo allreduce, which achieves approximately equivalent accuracy to synch-SGD with 1.29x speedup. The results demonstrate that eager-SGD using majority allreduce is tolerant to severe load imbalance.

Figure 12. Top-1 test accuracy of synch-SGD (Horovod) and eager-SGD for ResNet-32 on Cifar-10 using 8 processes. Each point is at the boundary of every 10 epochs.

6.3. Case Study: Video Classification

As discussed in Section 2.1, LSTM on UCF101 for video classification has inherent workload imbalance because of different workload for different batches. We use Inception v3 (Szegedy et al., 2016), a CNN model, to extract a 2,048-wide feature from each frame of the videos, and then pass the sequences of features to an LSTM model. The training time reported in the paper is only for the LSTM model, not including the preprocessing time using Inception v3.

(a) Training accuracy.
(b) Test accuracy.
Figure 13. Training results for LSTM on UCF101 using 8 processes. Each point is at the boundary of one epoch.

To evaluate eager-SGD, we use 8 processes with a total batch size of 128 to train LSTM on UCF101 for 50 epochs (more information is in Table 1). Fig. 12(a) and Fig. 12(b)

present the train and test accuracy as a function of the training time, respectively. For each SGD, we train the model four times and plot the curves using the average values. Colored areas around the curves are confidence intervals with the boundaries representing the standard deviation. Although eager-SGD using solo allreduce achieves 1.64x speedup over Horovod, it has lower accuracy. Eager-SGD (solo) achieves on average 60.6% (up to 70.4%) top-1 test accuracy while Horovod achieves on average 69.6%. This is because the workload of the video model is severely unbalanced, and solo allreduce introduces too many stale gradients. In contrast, eager-SGD using majority allreduce achieves 1.27x speedup over Horovod with equivalent accuracy. For example, Horovod achieves on average 69.6% top-1 test accuracy (up to 72.2%) and 90.4% top-5 test accuracy (up to 91.9%), while eager-SGD using majority allreduce achieves on average 69.7% top-1 test accuracy (up to 72.8%) and 90.0% top-5 test accuracy (up to 91.7%). Train accuracy results (in Fig. 

12(a)) show a similar trend as the test accuracy. Synch-SGD achieves on average 86.1% top-1 train accuracy and on average 96.6% top-5 train accuracy; eager-SGD using majority allreduce achieves on average 86.7% top-1 train accuracy and on average 96.1% top-5 train accuracy; eager-SGD using solo allreduce achieves on average 80.8% top-1 train accuracy and on average 90.0% top-5 train accuracy. All the accuracy results are consistent with that claimed in recent work (Yue-Hei Ng et al., 2015b). The training speed and accuracy for Deep500 (not plotted in figures) are similar to Horovod. The results show that majority allreduce, with its statistical guarantee, is sufficient to both speed up training and achieve good accuracy.

The total training time for 50 epochs using a single GPU node with batch size of 16 and 128 is 34,301 seconds and 6,314 seconds, respectively. In week scaling, Synch-SGD (Horovod) and eager-SGD using majority allreduce achieve 3.72x and 4.71x speedup on 8 GPU nodes, respectively. In strong scaling, synch-SGD and eager-SGD using majority allreduce do not have speedup on 8 GPU nodes; in contrast, eager-SGD using solo allreduce achieves 1.12x speedup on 8 GPU nodes in strong scaling, but with lower accuracy. Note that increasing the batch size can further improve the speedup in strong scaling for eager-SGD. However, large batch sizes commonly need carefully-tuned learning rate schedules to achieve good accuracy (You et al., 2018), which is out of scope.

7. Related Work

Deep learning

Parameter Server SGD implementations use synchronous (Li et al., 2014; Gupta et al., 2015), asynchronous (Dean et al., 2012; Chilimbi et al., 2014), stale- (Ho et al., 2013; Zhang et al., 2015), and approximate-synchronous (Hsieh et al., 2017) SGD, where the latter two limit the age and insignificance of the gradients, respectively. In a decentralized setting, asynchrony is achieved by performing communication on an explicit subset of the learners, e.g., forming a ring (Lian et al., 2018) or a general graph (Xie et al., 2016); or on a random subset using Gossip Algorithms (Daily et al., 2018; Jin et al., 2016). These modes achieve some degree of asynchrony, but take or (for ring or gossip-based schemes, respectively) update steps to disseminate the information to all learners. To the best of our knowledge, this is the first work that implements asynchronous and stale-synchronous decentralized SGD where the messages propagate to all nodes in one step.

Collective communication

Several algorithms can be used to implement allreduce operations, and the optimal algorithm depends on network topology, number of processes, and message size (Thakur et al., 2005). For large message sizes and large number of processes, practical implementations employ the ring-allreduce (Gibiansky, 2017) or the Rabenseifner’s Algorithm (Rabenseifner, 2004). Independently from the specific algorithm, the semantic of the allreduce implies processes synchronization. With eager-SGD we relax this synchronization by using solo and majority allreduce operations.

8. Conclusions

In this work, we show that load imbalance is prevalent in deep learning problems with variable-length inputs, and increasingly becoming an issue in cloud systems. To that end, we propose eager-SGD: an asynchronous decentralized version of SGD where fast processes contribute gradients without waiting for slow processes. Using the resilience of machine learning to bounded error, we implement two variants of partial collectives — solo and majority allreduce — enabling this behavior without a central parameter server.

The experimental results reaffirm our theoretical analysis, showing that eager-SGD using solo allreduce speeds up the training process in imbalanced environments (attaining up to 1.64 on UCF101) with a negligible loss of accuracy. Furthermore, as the load imbalance increases, the convergence rate degrades, in which case majority allreduce yields slightly lower performance (1.27 faster than SGD on UCF101) yet desirable generalization.

The research can extend in different directions. Firstly, the promising results make eager-SGD attractive for other applications as well, such as language models and object detection. Secondly, in order to provide different quorum sizes, it is possible to construct a spectrum between solo, majority, and full collectives. Lastly, partial collectives can be used for other algorithms beyond SGD.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 programme (grant agreement DAPP, No. 678880). We also would like to thank the Swiss National Supercomputing Center (CSCS) for providing the computing resources and for their excellent technical support.