Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

04/30/2020 ∙ by Shigang Li, et al. ∙ ETH Zurich ∙ Institute of Science and Technology Austria

Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates equivalent to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD, WAGMA-SGD significantly improves training throughput (by 2.1x on 1,024 GPUs) and achieves the fastest time-to-solution.


I Introduction

The introduction of deep learning is one of the most important advancements in science over the past two decades, powering industries from autonomous driving [15] to drug discovery [88]. With the rise of deep neural networks, their training evolved into a computationally-intensive task that consumes as many resources as modern complex high-performance computing problems [4]. As a result, an abundance of research has been conducted into its scaling and distribution [13].

The leading contenders for largest workloads in deep learning are Neural Language Models [53, 78], Deep Reinforcement Learning (RL) [100, 69] and Neural Architecture Search [67, 66]. In these regimes, computation time is measured in thousands of “GPU days”, with some utilizing hundreds of accelerators (GPUs, TPUs) for several weeks [99, 83, 79].

Distributed training is largely supported by data parallelism, where sample evaluation is partitioned across processors. In this mode of parallelism, all participants must exchange their gradients or model, resulting in an Allreduce operation across a cluster [72]. In practice, this exchange communication dominates the overall runtime [83], especially in large-minibatch SGD. Exacerbating the problem, certain datasets and environments are inherently imbalanced, e.g., with different sentence/video lengths [60] or heterogeneous environments in RL [103].

In order to mitigate the wait time for gradient/weight exchange, existing approaches attempt to relax model consistency between processors [13, 94]. Examples include synchronous gossip-based SGD [61, 5], asynchronous SGD [80, 73, 62, 22], and asynchronous SGD with bounded staleness [39, 107, 20, 92]. Gossip-based SGD replaces the global allreduce by communicating with randomly selected neighbors. Asynchronous SGD breaks the global synchronization to mitigate the effect of stragglers (slow processes). However, most of these approaches adversely impact convergence, necessitating an increase in the number of iterations [5, 74], sometimes to the point where synchronous waits are preferable.

In this paper, we solve this problem by introducing Wait-Avoiding Group Model Averaging (WAGMA) SGD, a novel optimizer that combines group collective communication with bounded staleness, in order to ensure competitive performance with decentralized and asynchronous methods, while retaining the convergence rate of synchronous model-averaging SGD. WAGMA-SGD locally communicates model updates across subgroups of processors, mitigating the need for global communication at every training iteration. Specifically, we propose to use a group allreduce operation for model averaging, in which the fastest process will trigger exchanges within all subgroups. Grouping is performed dynamically to facilitate model update propagation, and as a result not only speeds up communication, but also mitigates the effect of unbalanced workloads, all without harming convergence in practice.

We theoretically prove the convergence of WAGMA-SGD, showing that, for certain parameter values, its convergence rate is comparable to synchronous SGD with model averaging. Subsequently, we test the algorithm on a supercomputer equipped with GPUs for three different categories of deep learning: supervised image classification on the ImageNet dataset; semi-supervised language modeling on the WMT17 translation dataset; and deep reinforcement learning on the Habitat indoor navigation dataset. We show that both theoretically and empirically, WAGMA-SGD is favorable over other asynchronous algorithms and the baselines, which makes it an excellent approach for scaling up distributed deep learning.

Our main contributions are:

  • We propose WAGMA-SGD and realize it based on a wait-avoiding group allreduce operation.

  • We theoretically analyze the convergence of WAGMA-SGD.

  • Compared with state-of-the-art decentralized SGD, WAGMA-SGD improves the training throughput by up to 2.1x on 1,024 GPUs, and achieves the fastest time-to-convergence for all three evaluated tasks.

II Background and Related Work

Deep neural networks are primarily trained with mini-batch stochastic gradient descent [16]. Let $B$ be the batch size, $w_t$ the neural network weights at step $t$, $\{x_1,\ldots,x_B\}$ a set of samples of size $B$, and $\ell(w, x)$ a loss function. We compute the loss for each sample as $\ell(w_t, x_i)$ and then a stochastic gradient as $g_t = \frac{1}{B}\sum_{i=1}^{B}\nabla\ell(w_t, x_i)$. SGD then iterates steps such that $w_{t+1} = w_t - \eta\, g_t$, where $\eta$ is the learning rate. In more general terms, first-order stochastic gradient update rules can take different forms (e.g., by adding a momentum term), which is represented as $w_{t+1} = w_t + U(g_t, w_0, \ldots, w_t)$. In distributed environments with $P$ processors, $B$ denotes the local batch size per processor. We refer to Ben-Nun & Hoefler [13] for a general overview of distributed deep learning.
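
To make the update rule concrete, the following minimal sketch (our illustration with assumed helper names such as compute_gradient, not the implementation used in this work) shows one gradient-averaging data-parallel SGD step using mpi4py and NumPy; the later sketches below reuse comm, P, np, and MPI from this snippet.

    # One data-parallel SGD step: gradients are averaged globally
    # before the update is applied (blocking allreduce).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P = comm.Get_size()

    def sgd_step(w, local_batch, compute_gradient, lr=0.1):
        g_local = compute_gradient(w, local_batch)   # stochastic gradient on the local batch
        g_avg = np.empty_like(g_local)
        comm.Allreduce(g_local, g_avg, op=MPI.SUM)   # global synchronization point
        g_avg /= P                                   # average over all P processors
        return w - lr * g_avg                        # w_{t+1} = w_t - eta * g_t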

Thanks to the robustness of stochastic optimization, in distributed environments one can relax weight updates by varying several axes, trading off communication overhead for convergence. Data-parallel distributed SGD algorithms can be broadly identified by answering the following five questions:

Q1. What are we averaging?

There are two typical approaches for aggregating distributed updates: gradient averaging and model averaging. When performing gradient averaging, we simply compute $g_t$ as the average over the global batch size. With standard model averaging, the SGD update is applied locally at each node, and then the resulting model is averaged over all processors.
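
For contrast, a model-averaging step (again a hedged sketch with assumed helper names, reusing comm, P, and the imports from the previous snippet) applies the update locally and then averages the weights rather than the gradients:

    # Model averaging: update the local replica first, then average
    # the replicas across all processes.
    def model_averaging_step(w, local_batch, compute_gradient, lr=0.1):
        g_local = compute_gradient(w, local_batch)
        w_local = w - lr * g_local                   # local SGD update
        w_avg = np.empty_like(w_local)
        comm.Allreduce(w_local, w_avg, op=MPI.SUM)
        return w_avg / P                             # averaged model replica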

Complementary to these approaches is the degree of quantization or sparsity in the exchanged updates. As these concepts are out of the scope of this paper, we refer to Tang et al. [94] for a comprehensive survey.

Q2. Who is coordinating the averaging?

Earlier implementations of distributed SGD for deep learning [24] use a centralized coordination architecture, where a parameter server or other coordinator maintains a master copy of the model that workers use. As this approach does not scale to large numbers of processors, a decentralized global clock can be synchronized across workers, where each worker maintains a local replica of the model and communicates updates to other workers directly.

To mitigate the overheads of global communication and synchronization, several decentralized instances of SGD have been proposed, e.g., [61, 62, 5, 74], where each worker maintains a local model but communicates updates in separate schedules, rather than synchronizing globally.

Q3. How old (stale) can averaged components be?

In a synchronous system, model or gradient averaging occurs when all processes are on the same training iteration $t$. This alone does not guarantee that every worker uses the same parameters (i.e., a consistent model); however, standard parameter-server or globally-coordinated methods ensure all workers have a consistent model. In an asynchronous system, averaging can occur between workers at any point. We thus define the staleness of models/gradients by $\tau$, indicating how many iterations have passed since the produced value's model was updated. A bounded-staleness system mitigates convergence issues of asynchronous systems by ensuring that the difference in the number of training iterations between the slowest and fastest processor is bounded, using $\tau$ as a proxy.
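
In symbols (our notation, consistent with the definition above): if a value consumed at iteration $t$ was produced from worker $i$'s model at iteration $t_i$, then its staleness is

    \tau_i(t) = t - t_i,

and a bounded-staleness system enforces $\max_i \tau_i(t) \le \tau$ for all $t$.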

Q4. How often are we globally averaging?

While bounded- and unbounded-staleness SGD variants do not adhere to rigid communication schedules, some algorithms periodically synchronize all processors' model replicas. This ensures not only that the staleness is bounded by $\tau$, but also that model consistency is retained throughout training, mitigating divergence across processors. In other algorithms, this global consensus is achieved post-training, by choosing the model average or the model with the best generalization score. Note that under this nomenclature, the global averaging frequency of synchronous variants is one step.

Q5. How many learners are averaging at every step?

In the steps between the aforementioned global model averaging points, decentralized SGD variants perform local averages with a certain group (or quorum) size $s$, leveraging the fact that several averaging steps can be performed in parallel.

Coordination | Staleness | Gradient Averaging | Model Averaging
Centralized | None | Parameter server [59], P3 [49] | –
Centralized | Unbounded | Hogwild! [80], Downpour SGD [24], AASGD [35] | SAPS-PSGD [95]
Centralized | Bounded | SSP [39], Rudra [36], Softsync SGD [108], Gaia [45], K-async SGD [30], Qsparse-local-SGD [10], Hybrid sync/async [58] | EASGD [107], Federated learning [71, 55]
Decentralized (global collectives) | None | Allreduce-SGD* [34, 89, 75, 31] | BMUF [20]
Decentralized (global collectives) | Unbounded | – | One-shot SGD [70], SimuParallelSGD [109]
Decentralized (global collectives) | Bounded | Eager-SGD* [60], SFB [105], Gradient lag [57] | –
Decentralized (group collectives) | Bounded | – | WAGMA-SGD*
Decentralized (gossip/local) | None | – | D-PSGD* [61], SGP* [5]
Decentralized (gossip/local) | Unbounded | GossipGraD [22], Choco-SGD [54] | AD-PSGD* [62], Gossiping SGD [52], SwarmSGD [74]
Decentralized (gossip/local) | Bounded | CDSGD [51] | Local SGD* [92, 64, 37, 101, 81]
TABLE I: Classification of data-parallel SGD variants. Algorithms marked with * are used for comparison in this work.

Removing the global communication bottleneck in decentralized SGD has been shown to enable scaling to tens and even hundreds of nodes [61, 62, 5]. However, performing averaging in pairs does come at the cost of worse convergence: in particular, early proposals of decentralized algorithms [61, 62] lose accuracy with respect to the synchronous baseline at scale, while more recent work [5, 74] observes that the algorithms can achieve full accuracy if executed for more iterations than the synchronous baseline: in particular, they execute between two and four times more SGD iterations in total, relative to the synchronous baseline, erasing much of the speedup due to increased scalability. This degraded convergence behavior is connected to the analytical bounds provided by these algorithms: while the theoretical convergence rates suggest linear speedup with respect to the number of SGD steps and executing nodes, these rates only apply after a very large number of SGD steps have been taken, in order to allow the pairwise averaging process to “mix” well, thereby simulating all-to-all averaging. See Section IV-A for a detailed discussion.

II-A Training at Scale

An orthogonal challenge to distributed stochastic optimization is that of unbalanced workloads. Imbalance may be caused by the training system [47, 60, 62] or by the task itself [60, 103]. Training on multi-tenant cloud systems can suffer from performance variability due to resource sharing. Several deep learning tasks, such as video classification and machine translation, have inherent load imbalance, because input/output sequences have different lengths [60]. In deep reinforcement learning, an agent must interact with the environment to generate training data. For RL tasks using heterogeneous environments [103], the runtime of training data generation varies significantly.

Beyond deep learning, allreduces have a long history within the HPC community [96, 77, 9, 91, 7, 8, 18, 17, 48, 106, 6, 11] and nonblocking versions have been used to improve performance [44]. Particular implementations have become widely-used within the deep learning community, including Baidu-Allreduce [34], NCCL [75], Gloo [31], and Horovod [89]. Most deep learning frameworks incorporate support for distributed training, either via parameter servers or allreduces [24, 21, 1, 46]. Communication compression is another common (and complementary) approach to reducing communication overhead [87, 93, 29, 65, 2, 3, 102, 63, 14, 82]. Communication may also be impacted by different approaches to partitioning layer parameters, such as model parallelism [56, 97, 33, 27, 28, 50].

II-B Comparison Targets

In Table I we summarize and classify the distributed SGD algorithms most relevant to our work. Algorithms marked with an asterisk in the table are used for comparison in this work. Since they typically scale and perform better on large-scale systems, we limit our comparison to decentralized algorithms. The algorithms we compare against are chosen specifically to span the different answers to the above five questions, prioritizing popular algorithms with proven convergence, both in theory and in practical deep learning applications:

  • Allreduce-SGD is the standard data-parallel training.

  • Local SGD [20, 92, 64, 37, 101, 81] performs a fixed number of local iterations of SGD (a hyperparameter determined by the user) and then averages the models over all processes with a standard allreduce (see the sketch after this list). Several variants with different methods for determining the frequency of global averaging exist.

  • Decentralized parallel SGD (D-PSGD) [61] uses a ring topology, where each process averages its local model with its two neighbors. Processes advance synchronously with a single global clock.

  • Stochastic gradient push (SGP) [5] generalizes the topology used in D-PSGD to support more flexible, asymmetric communication patterns.

  • Eager-SGD [60] uses partial collective allreduces over the gradients, allowing at most half of the processes to contribute stale gradients if they are not ready.

  • Asynchronous decentralized parallel SGD (AD-PSGD) [62] extends the idea of D-PSGD by allowing processors to communicate updates at any point in time.

These cover nearly all varieties of consistency and averaging, as well as practical differences in communication patterns.
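
As an illustration of the baseline semantics, the following hedged sketch (assumed helper names, reusing the MPI setup from the sketches in Section II, not any of the reference implementations) shows the local SGD pattern: a fixed number of local steps followed by a global model average.

    # Local SGD: run H local updates, then average the models globally.
    def local_sgd(w, batches, compute_gradient, H=4, lr=0.1, steps=1000):
        for t in range(steps):
            g = compute_gradient(w, next(batches))
            w = w - lr * g                           # local update
            if (t + 1) % H == 0:                     # every H steps: global average
                w_avg = np.empty_like(w)
                comm.Allreduce(w, w_avg, op=MPI.SUM)
                w = w_avg / P
        return w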

II-C Discussion

Following the discussion on the impact of quorum size on convergence (Q5), it is natural to ask whether performing decentralized averaging in larger groups would be able to provide the best of both worlds, enabling the full convergence of the synchronous algorithm, and the scalability of fully decentralized ones. There are two main barriers to this solution: the first one is at the implementation level, since, to our knowledge, no efficient non-blocking implementation of group model averaging exists. The second is at the application level, since it is not clear whether group averaging would be able to achieve the same convergence as the synchronous solution (both in theory and in practice). In the following sections, we address both of these issues.

III Wait-Avoiding Group Communication

The allreduce collective operation [72] is defined as a reduction whose results are shared among all participants. Although several optimizations [34, 77] have been designed to improve the performance of this collective, allreduce poses an implicit global synchronization point, which makes it vulnerable to stragglers during deep learning training. On larger systems, the performance of the compute nodes can be impacted by different internal (e.g., load imbalance) and external factors (e.g., network [23] or OS [43] noise), potentially increasing the synchronization overhead. We define this collective as a synchronous allreduce. While non-blocking collectives [42] can alleviate the synchronization overhead, they do not fully remove it, and waiting is still required at completion. Even if the participating processes are perfectly synchronized, the running time of an allreduce of size $N$ scales at best as $O(\log P + N)$ for $P$ processes [76, 83]. Therefore, growing process counts will reduce the parallel efficiency and eventually make the reduction a scaling bottleneck.
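
For reference, a commonly used latency-bandwidth estimate for a large-message allreduce (the standard reduce-scatter/allgather cost model from the collective-communication literature, not a result specific to this work) is

    T_{\mathrm{allreduce}}(N, P) \approx 2\lceil\log_2 P\rceil\,\alpha \;+\; 2\,\frac{P-1}{P}\,N\,\beta \;+\; \frac{P-1}{P}\,N\,\gamma,

where $\alpha$ is the per-message latency, $\beta$ the per-byte transfer time, and $\gamma$ the per-byte reduction time; the logarithmic latency term and the nearly constant bandwidth terms explain why parallel efficiency degrades as $P$ grows.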

III-A Wait-Avoiding Group Allreduce

To overcome the synchronization overhead and overall collective cost, we introduce a new class of wait-avoiding group collectives, focusing on group allreduce for the purpose of this work. We relax synchronization by making the collectives externally-triggerable [26, 60], namely, a collective can be initiated without requiring that all the processes enter it, by externally activating the communication schedule of late processes with activation messages sent by the early ones. Once activated, a group allreduce does not perform a global reduction. Instead, it partially reduces the data within non-overlapping groups of processes, limiting the number of communications needed to implement the group collective.

III-A1 Collective activation

In a wait-avoiding group allreduce, any process can make progress regardless of what the other processes are working on. This wait-avoidance is achieved by the activation component. We call the process reaching the collective call first the activator. The activator is in charge of informing all the other processes that an allreduce operation has started and that they have to participate, regardless of whether they reached the collective call-site.

Fig. 1: Wait-avoiding group allreduce on four processes with a group size of two. P1 arrived first and activates the operation.

In a wait-avoiding group allreduce, any process can initiate the collective. We use a modified version of the recursive doubling algorithm that builds a butterfly topology, which can be seen as a set of overlapping binomial trees, one rooted at each process. Any node can activate the collective by sending activation messages along the binomial tree rooted at itself. Fig. 1 shows an example where P1 is the activator. In this case, P1 uses its broadcast tree and sends the activation messages to P0 and P3. Once activated, P0 first forwards the activation message to P2, after which it starts executing its group allreduce schedule.

It is possible that several processes arrive at the wait-avoiding group allreduce operation in close proximity, which means we may have more than one activator during the activation phase. To guarantee that a process does not execute the same collective twice, we assign each operation a version number that is increased every time the collective is executed. The collective version number is encoded in the tag of the activation messages: once an activation is received, the collective schedule is activated only if its version number is lower than or equal to the one carried by the activation message. The version-number check is also executed when a process reaches the collective call: if it fails, then the version of the collective that the process wants to activate has already been executed (and the process has passively participated in it). In this case, no activation messages are sent.
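
The activation logic can be summarized by the following hedged sketch (illustrative Python, not fflib's actual interface); the version counter guards against executing the same collective twice:

    # Illustrative activation/version logic of a wait-avoiding collective.
    class WaitAvoidingCollective:
        def __init__(self):
            self.version = 0            # how many times this collective has run locally

        def on_activation(self, msg_version, run_schedule):
            # Activation message received: run only if this instance
            # has not been executed yet (passive participation).
            if self.version <= msg_version:
                run_schedule()
                self.version = msg_version + 1

        def on_call(self, call_version, send_activations, run_schedule):
            # Reaching the collective call-site: if this version was already
            # executed passively, do nothing and send no activations.
            if self.version <= call_version:
                send_activations(call_version)   # along this process's broadcast tree
                run_schedule()
                self.version = call_version + 1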

III-A2 Asynchronous execution

To enable asynchronous execution of the custom collectives, we extend the fflib communication library [26], adding support for wait-avoiding group allreduce. fflib allows programmers to customize collective operations via a flexible, DAG-based representation of point-to-point and local compute operations, defined as schedules. The library provides a C-based interface for schedule creation and nonblocking invocation, using MPI as its primary backend, with additional support for network offloading engines such as sPIN [41]. Our defined schedule for group operations models both the activation and group allreduce phases.

III-B Dynamic Grouping Strategy

As discussed in Section II, in data-parallel SGD variants such as allreduce SGD [89, 12] and gossip SGD [61, 5, 62], each process must keep propagating its local model updates to all the other processes to make global progress. We propose a dynamic grouping strategy to reduce the latency (in iterations) of local update propagation. Together with the group allreduce operation, the grouping strategy guarantees that local updates are globally propagated within $\lceil\log_s P\rceil$ iterations, where $s$ is the group size and $P$ the number of processes. The larger the group size, the faster the updates are propagated. By carefully selecting the group size, we can achieve both lower propagation latency than gossip SGD and efficient communication by reducing contention.
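
Concretely, under our reading of this propagation argument, the number of iterations needed for a local update to reach all $P$ processes with group size $s$ is

    L_{\mathrm{prop}}(P, s) = \left\lceil \frac{\log_2 P}{\log_2 s} \right\rceil = \lceil \log_s P \rceil,

so, for example, $P = 1024$ and $s = 8$ give $\lceil 10/3 \rceil = 4$ iterations, compared to at least $\log_2 P = 10$ iterations for pairwise exchanges.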

1: Input: P total processes. s is the group size. t is the training iteration.
2: mask = 1,  global_phases = log2(P),  group_phases = log2(s)
3: shift = (t * group_phases) mod global_phases
4: make each process an individual group      // initialize P groups
5: for i = 1 to group_phases do
6:       mask = 1 << shift      // bitwise left shift
7:       for p = 0 to P - 1 do
8:             q = p XOR mask      // equivalence relation (p, q)
9:             Find groups: group_p containing p, group_q containing q
10:            if group_p != group_q then
11:                 Merge groups: group_merge = group_p ∪ group_q
12:            end if
13:      end for
14:      shift = (shift + 1) mod global_phases
15: end for      // processes are partitioned into P/s groups in iteration t
Algorithm 1 Dynamic grouping strategy

We define the dynamic grouping strategy in Algorithm 1. We assume the number of processes $P$ is a power of two, which is a common case in current distributed training systems. The group size $s$ is also set to a power of two. In line 2, we initialize the mask and calculate the number of phases in a butterfly topology for $s$ and $P$ processes, respectively. Line 3 initializes the shift. In each training iteration $t$, the algorithm first initializes $P$ groups, each of which contains one process (line 4). In line 8, an equivalence relation between each pair of processes is found using the bitwise XOR operation. For a pair of processes $(p, q)$ with an equivalence relation (i.e., $q = p \oplus \text{mask}$), we find the groups $p$ and $q$ belong to, respectively (line 9); if $p$ and $q$ are not in the same group, we merge the two groups into one using the union operation (lines 10–12). In line 15, all the processes will have been partitioned into $P/s$ groups in iteration $t$. Note that the initial value of shift changes periodically in every iteration (line 3), which, in turn, changes the group composition in every iteration.

To demonstrate dynamic grouping, we use $P = 8$ and $s = 4$ as an example. In iteration $t$, all processes are initially partitioned into 8 groups. The set of equivalence relations¹ includes $(0,1)$, $(2,3)$, $(4,5)$, $(6,7)$, $(0,2)$, $(1,3)$, $(4,6)$, and $(5,7)$. By recursively merging the two groups to which a pair of processes with an equivalence relation belongs, we obtain two non-overlapping groups, which contain the processor sets {0, 1, 2, 3} and {4, 5, 6, 7}. In iteration $t+1$, the set of equivalence relations changes; thus, the grouping changes accordingly (i.e., {0, 1, 4, 5} and {2, 3, 6, 7}). ¹The equivalence relations satisfy reflexivity, symmetry, and transitivity; thus, for instance, $(0,1)$, $(0,2)$, and $(2,3)$ imply that $(1,3)$ also holds. We still put such implied relations in the set for easier explanation.
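
The grouping computation can be sketched as follows (our reconstruction from Algorithm 1 and the worked example above; the shift/mask arithmetic and the union-find structure are our reading, not the authors' exact code). The two calls in the trailing comment reproduce the example groupings for two consecutive iterations.

    # Hedged sketch of the dynamic grouping strategy (Algorithm 1).
    import math

    def dynamic_groups(P, s, t):
        global_phases = int(math.log2(P))
        group_phases = int(math.log2(s))
        shift = (t * group_phases) % global_phases
        parent = list(range(P))                   # union-find over process ids
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for _ in range(group_phases):
            mask = 1 << shift
            for p in range(P):
                q = p ^ mask                      # equivalence relation (p, q)
                rp, rq = find(p), find(q)
                if rp != rq:
                    parent[rq] = rp               # merge the two groups
            shift = (shift + 1) % global_phases
        groups = {}
        for p in range(P):
            groups.setdefault(find(p), []).append(p)
        return list(groups.values())

    # dynamic_groups(8, 4, 0) -> [[0, 1, 2, 3], [4, 5, 6, 7]]
    # dynamic_groups(8, 4, 1) -> [[0, 1, 4, 5], [2, 3, 6, 7]]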

Note that we only use Algorithm 1 to formally describe the grouping strategy. The grouping strategy, together with the allreduce within each group, is implemented concisely by following the phases of the butterfly topology, namely each pair of processes with an equivalence relation in a phase exchanges messages. We use the shift variable to select the phases that should be executed in the current iteration. Fig. 2 presents the iterative execution of group allreduce with dynamic grouping in WAGMA-SGD; the grouping is shown on the right side. We can see that although the group size is fixed, the groups change dynamically across iterations. Within each group, the allreduce is conducted following phases of the butterfly topology. To maintain convergence with this communication scheme in data-parallel deep learning training, a standard synchronous allreduce across all the processes is conducted every $T$ iterations, bounding the staleness of the weights. In the following section, we present the algorithm in detail and further discuss this periodic synchronization.

Fig. 2: Communication scheme of WAGMA-SGD with 8 processes in total and a group size of 4. Every $T$ iterations, the algorithm synchronizes globally.

IV Wait-Avoiding Group Model Averaging SGD

1: Input: B is the local batch size for each of the P processes. s is the group size. T is the synchronization period.
2: for t = 0 to num_iterations do
3:       each process samples B elements from the dataset
4:       compute the local loss on the sampled batch
5:       compute the local stochastic gradient g_t
6:       derive the model update from the local gradient
7:       apply the update locally to obtain the locally updated model
8:       if t mod T != 0 then
9:             wait-avoiding group-allreduce of the local models      // t controls grouping
10:            if the local contribution was timely then
11:                  use the group-averaged model for the next iteration
12:            else
13:                  average the (stale) group result with the fresh local model
14:            end if
15:      else
16:            average the models across all P processes with a global allreduce
17:      end if
18: end for
Algorithm 2 WAGMA-SGD

Based on the insight that larger groups converge faster, and on the novel implementation of wait-avoiding group collectives, we design the Wait-Avoiding Group Model Averaging (WAGMA) SGD algorithm. The algorithm can be classified as a model-averaging, bounded-staleness, decentralized SGD with a group size of $s$ and a global communication period of $T$ steps. As listed in Algorithm 2, WAGMA-SGD is similar to minibatch SGD, but makes a few important distinctions.

In lines 3–7, each process calculates its local gradients, derives the model update, and applies it, as in distributed SGD. Subsequently, the wait-avoiding group model averaging is conducted (lines 8–17) using the aforementioned wait-avoiding communication scheme. From an algorithmic perspective, WAGMA-SGD does not rely on a particular choice of group members for the local collectives. However, instead of randomly choosing groups of processes, we use the butterfly strategy (Algorithm 1) for topology-aware, efficient, deterministic communication.

In each iteration, faster processes trigger the model averaging immediately without waiting (line 9, where the iteration index $t$ is used to control grouping), which may result in averaging the local models with some stale models from the slower processes. To both bound the staleness and mitigate divergence across local model replicas, we define a synchronization period $T$, at which the models are averaged across all the processes using a global allreduce (line 16). Empirically, we set the synchronization period to 10 training iterations, which balances model accuracy with training throughput, as we will show in Section V.
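
The overall control flow of Algorithm 2 can be sketched as follows (a simplified, blocking illustration using mpi4py sub-communicators and the dynamic_groups sketch from Section III-B; the wait-avoiding activation and the stale-model branch of lines 10–14 are deliberately omitted, so this is not the actual implementation):

    # Simplified WAGMA-SGD loop: group model averaging every iteration,
    # global model averaging every T iterations.
    def wagma_sgd(w, batches, compute_gradient, s, T, lr=0.1, steps=1000):
        for t in range(steps):
            g = compute_gradient(w, next(batches))
            w = w - lr * g                               # local update (lines 3-7)
            if t % T != 0:
                # determine this rank's group for iteration t, average within it
                groups = dynamic_groups(P, s, t)
                color = next(i for i, grp in enumerate(groups)
                             if comm.Get_rank() in grp)
                group_comm = comm.Split(color, comm.Get_rank())
                w_sum = np.empty_like(w)
                group_comm.Allreduce(w, w_sum, op=MPI.SUM)
                group_comm.Free()
                w = w_sum / s                            # group model average (line 11)
            else:
                w_sum = np.empty_like(w)
                comm.Allreduce(w, w_sum, op=MPI.SUM)     # periodic global sync (line 16)
                w = w_sum / P
        return w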

An execution snapshot of WAGMA-SGD ($P = 4$ and $s = 2$) is presented in Fig. 3. Suppose P1 is a straggler. When the group allreduce in iteration $t$ is triggered by any of the other three processes, P1 can only contribute its stale model parameters. In iteration $t$, P1 and P0 are in the same group; therefore, P1's stale model and P0's updated model will be added together to derive the group average. P0 will use the averaged model for the next iteration of training. P1 subsequently finishes the calculation of its locally updated model in iteration $t$, but finds out that the group allreduce of iteration $t$ has already finished. In this case, it will average the stale group result with its fresh local model (line 13 in Algorithm 2), and the averaged model will be used for the next iteration of training. Meanwhile, the data in the send buffer of P1 is updated with the fresh local model. If the group allreduce in the next iteration is triggered by some faster process at this time, P1 will continue to passively contribute this (now stale) model. When a standard allreduce is called at the synchronization point, it forces all the processes to contribute their model parameters after training for the same number of iterations. In Fig. 3, P1 eventually catches up with the other processes; thus, it will contribute its timely model to P3, as they are then in the same group.

Fig. 3: Execution snapshot of WAGMA-SGD for $P = 4$ and $s = 2$.

IV-A Proof of Convergence

Algorithm Modelling

For analysis purposes, we will model the algorithm's execution as follows. We proceed in steps, indexed by time $t$. Each node $i$ maintains its own local model $w_t^i$ and has a local partition of the data. In each step, a group of nodes of size $s$ is chosen to interact. Each node takes a local gradient step, and then the $s$ nodes average their models. This averaging step might be inconsistent, as per the above semantics.

In the analysis, we will assume that the group of interacting nodes is chosen uniformly at random—in the long run, the resulting interaction graph will have the same properties as the butterfly interaction strategy used in the implementation. While our analysis considers each interaction sequentially, in practice interaction steps can occur in parallel.
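
A toy simulation of this analytical model (a hedged NumPy sketch on synthetic quadratic losses, purely to illustrate the interaction semantics, not part of the proof) could look as follows:

    # Sequential simulation of the analysis model: at each step a random
    # group of s nodes takes a gradient step and then averages its models.
    import numpy as np

    def simulate(P=8, s=4, d=10, steps=2000, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        targets = rng.normal(size=(P, d))          # node i holds loss ||w - targets[i]||^2 / 2
        w = np.zeros((P, d))                       # local models
        for _ in range(steps):
            group = rng.choice(P, size=s, replace=False)
            w[group] -= lr * (w[group] - targets[group])   # local gradient steps
            w[group] = w[group].mean(axis=0)               # group model averaging
        return w.mean(axis=0)                      # model average, as in the theorem

    # The minimizer of the average loss is targets.mean(axis=0);
    # simulate() should approach it as the number of steps grows.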

Setup and Analytic Assumptions

We will assume a standard setting in which we are given a dataset of $m$ samples $\{x_1, \ldots, x_m\}$, and to each sample $x_j$ we associate a differentiable loss function $\ell_j(\cdot) = \ell(\cdot, x_j)$. Each node is given a random partition of the dataset, and we wish to solve an empirical risk minimization problem by finding

$$w^\star \in \operatorname*{arg\,min}_{w} f(w), \quad \text{where} \quad f(w) = \frac{1}{m}\sum_{j=1}^{m} \ell_j(w).$$

Let $f_i$ be the loss function corresponding to the dataset partition of the $i$th node, so that $f = \frac{1}{P}\sum_{i=1}^{P} f_i$. To make the analysis tractable, we will make the following standard assumptions on the loss function.

Assumption 1.

We assume the following hold:

  1. (Lipschitz gradients) All functions $f_i$ have $L$-Lipschitz gradients, for some constant $L > 0$.

  2. (Bounded Second Moment) There exists a constant $M > 0$ such that $\mathbb{E}\big\|\widetilde{\nabla} f_i(w)\big\|^2 \le M^2$ for any node $i$ and any model $w$, where $\widetilde{\nabla} f_i(w)$ denotes the stochastic gradient computed by node $i$.

  3. (Bounded Staleness) The staleness during the averaging step is upper bounded by a parameter $\tau$. That is, for any node $i$ participating in the averaging step at time $t$, averaging is performed with respect to a model $w_{t'}^i$ with $t - \tau \le t' \le t$, and every gradient update is applied globally at most $\tau$ steps after it was generated.

Convergence result

We can now state the main convergence result. For readability, we state a simplified variant that highlights the relationship between parameter values, in particular the relation between the convergence time, the number of processors $P$, and the size $s$ of the interacting group. The full statement and its proof are deferred to the full version of the paper.

Theorem 1.

Consider the setting described above, in which we wish to optimize a non-convex function $f$. Let $s$ be the size of a communication group, and assume that the maximum staleness $\tau$ is constant. Fix a success parameter $\epsilon > 0$. For each time $t$, we define $\mu_t = \frac{1}{P}\sum_{i=1}^{P} w_t^i$ to be the average of the local models at time $t$. Then there exists a setting of the learning rate such that, if the algorithm has taken a sufficiently large number of steps $T_\epsilon$ (whose exact expression is given in the full version of the proof), there exists an iterate $t \le T_\epsilon$ such that

$$\mathbb{E}\big[\|\nabla f(\mu_t)\|^2\big] \le \epsilon,$$

where the expectation is taken w.r.t. the randomness in the sampling and the interactions.

Discussion

At a high level, this claim shows that the algorithm will eventually reach a point where the model average has negligible gradient, i.e., is at a local minimum. While this does not guarantee convergence to a global minimum, it matches the best achievable guarantees for SGD in the non-convex setting [32]. The convergence proof follows the general decentralized asynchronous framework of Lian et al. [62], with differences due to the specific structure of the group communication pattern we employ, and the asynchronous nature of wait-avoiding collectives. Further, we note that the convergence guarantee can be extended to 1) apply to individual models instead of the model average; and 2) relax the bounded second-moment assumption to a bound on the variance. Both of these improvements come at the cost of additional technical complexity, so we defer them to the full version of our convergence proof. The current statement of the theorem obscures the rate at which convergence occurs: for standard parameter settings, the convergence speedup (i.e., the rate at which we reach a point of negligible gradient) with respect to the number of nodes is linear. This linear speedup is the best possible, and matches the rates for other decentralized algorithms, e.g., [61, 62]. We refer the interested reader to the full version of our convergence proof.

It is interesting to examine the impact of the group size on convergence: in particular, the time to convergence decreases quadratically in the group size $s$. Specifically, if the group size is small, say $s = O(1)$, and the other parameters are constant, the number of steps needed to reach a local optimum matches the best known bounds in the decentralized model for pairwise interactions [61, 62]. However, this step bound can exceed the number of SGD steps taken during regular training even for moderate node counts, making the bound meaningless; for example, the number of SGD steps when training ResNet-50 on ImageNet is fixed in advance, so such a bound is only meaningful for small node counts. For larger group sizes, e.g., on the order of the number of processes $P$, our analysis decreases this step bound quadratically, asymptotically matching the convergence rate and step bound obtained when model averaging is performed synchronously and globally (e.g., the bound of [61] for all-to-all communication), and is also practically more relevant.

V Experimental Evaluation

We conduct our experiments on the CSCS Piz Daint supercomputer. Each Cray XC50 compute node contains a 12-core Intel Xeon E5-2690 CPU with 64 GB RAM and one NVIDIA Tesla P100 GPU with 16 GB memory. The compute nodes are connected by a Cray Aries interconnect in a Dragonfly topology. The communication library is Cray MPICH 7.7.2. We use one MPI process per node and utilize the GPU for acceleration in all following experiments. We evaluate three different deep learning problems: image classification (ResNet-50 [38] on ImageNet [25]), machine translation (Transformer [98] on WMT17), and deep reinforcement learning (PPO [86, 103] for navigation in Habitat [84]). For throughput tests, we scale the number of nodes up to the point where the global batch size becomes too large to converge [90].

V-A Baselines

We compare WAGMA-SGD with state-of-the-art data-parallel SGD variants, including Allreduce-SGD [89, 12], local SGD [64, 20], gossip-based SGD variants (D-PSGD [61], AD-PSGD [62], and SGP [5]), and eager-SGD [60]. Unless mentioned otherwise, the synchronization period of local SGD is set to one, namely calling a standard allreduce to average the models in each training iteration, which is essentially synchronous SGD. For SGP, we evaluate its performance with different numbers of communication neighbors [5]. For a more detailed discussion of the baselines, please refer to Section II.

V-B Image Classification with Simulated Workload Imbalance

Residual Networks (ResNet) [38] are pervasively used in computer vision tasks. To evaluate their performance, we train ResNet-50 on ImageNet (25,559,081 trainable parameters in total) with a local batch size of 128 images. Although the training workload is balanced because the input size is fixed, performance variability is observed in practice when training on multi-tenant cloud systems [62, 60, 85]. To simulate the same degree of imbalance, we randomly select two processes at every training step and inject a certain amount of delay (320 ms), according to the performance variability observed on cloud machines [60]. For WAGMA-SGD, we set the synchronization period $T$ and the group size $s$; both are powers of two in our experimental configuration.

Fig. 4: Throughput comparison between different parallel SGD algorithms for ResNet-50 on ImageNet with simulated load imbalance. Local batch size is 128.

Fig. 4 shows the training throughput as the number of GPU nodes increases from 4 to 256, and the top of the rectangle wrapping each cluster indicates the ideal throughput without communication overhead. Compared with local SGD, Allreduce-SGD (implemented in Deep500 [12]), D-PSGD, SGP (two communication neighbors), and eager-SGD when training on 64 GPU nodes, WAGMA-SGD achieves 1.25x, 1.26x, 1.23x, 1.25x, and 1.13x speedup, respectively. The speedup becomes larger as the number of GPU nodes increases to 256: WAGMA-SGD achieves up to 1.37x speedup. The only algorithm with higher throughput than WAGMA-SGD is AD-PSGD, in which the asynchronous communication is completely overlapped with the computation. These results show that WAGMA-SGD can better handle the unbalanced workload than the synchronous SGD algorithms (i.e., local SGD, Allreduce-SGD, D-PSGD, and SGP), as well as the bounded-staleness eager-SGD variant. In the latter case, while staleness is bounded, the algorithm still conducts a global collective communication for gradient averaging in each training iteration. In contrast, WAGMA-SGD keeps the collectives within each group, and thus has a better parallel scalability.

Fig. 5: Top-1 validation accuracy of ResNet-50 on ImageNet, training for 90 epochs using 64 GPU nodes. Each point is at the boundary of every 10 epochs.

Fig. 5 presents the Top-1 validation accuracy (in accordance with MLPerf [68]) when training for 90 epochs on 64 nodes with a total batch size of 8,192. We can see that the accuracy of WAGMA-SGD (75.0%) is very close to that of standard Allreduce-SGD (75.8%) and local SGD (75.3%), but WAGMA-SGD significantly reduces the training time. Gossip-based SGD algorithms, such as D-PSGD and the higher-throughput AD-PSGD, attain much lower accuracy than the other variants. This can be explained by the fact that these algorithms have not fully converged, requiring more steps to achieve comparable accuracy [74]. For SGP, we tune the number of communication neighbors to achieve the highest generalization using a directed exponential graph [5], which causes it to achieve higher accuracy than D-PSGD and AD-PSGD, yet still lower than WAGMA-SGD. Note that the default setting for the number of communication neighbors in SGP is one, whereas we set it to two for better generalization performance. Overall, WAGMA-SGD achieves the highest accuracy-vs-time among all parallel SGD variants.

With the chosen group size $s$, WAGMA-SGD has a faster model-update propagation speed (global propagation within $\lceil\log_s P\rceil$ iterations) than the gossip-based algorithms (which require at least $\log_2 P$ iterations for global propagation), which helps WAGMA-SGD achieve higher accuracy. This is consistent with our analysis in Section IV-A.

To further analyze the convergence properties of WAGMA-SGD, we conduct additional experiments. ❶ In the first experiment, we remove the wait-avoiding group collectives in WAGMA-SGD and only keep the standard allreduce operations at the synchronization points, which is essentially equivalent to local SGD with a synchronization period of $T$ iterations. This causes the top-1 validation accuracy to drop sharply to 66.9%. ❷ In a second experiment, we execute group model averaging without the dynamic grouping strategy (i.e., the groups are fixed). In this case, the top-1 validation accuracy drops to 73.5%. ❸ We also experiment with increasing the group size to 64 (i.e., a global collective). While accuracy does not increase, the throughput drops by a factor of 1.07x. ❹ Lastly, we decrease the group size to 4 and observe that the top-1 validation accuracy drops to 72.8%.

The results from experiments ❶ and ❷ indicate that the combination of group allreduce operations and the dynamic grouping strategy is essential to achieve good generalization performance. The results from experiments ❸ and ❹ demonstrate that the chosen group size empirically exhibits the best performance among the evaluated group-size settings.

V-C Machine Translation

Transformers are sequence-to-sequence transducers that can be used to translate a sequence of words from one language to another. We use the standard-sized Transformer network [98], which has 61,362,176 trainable parameters, to train English-to-German translation on the WMT17 dataset. While training the model, the computation overhead changes with the length of the input and output sentences. The samples in the training dataset typically consist of sentences of various lengths, and thus the training workload is unbalanced. As shown in Fig. 6, even when using a bucketing strategy to group sentences with similar lengths, there is high variance in workload size between samples. Specifically, in our experiment each local batch contains an equal number of sentences sampled from a randomly selected bucket, where the maximum local batch size is set to 8,192 tokens. For WAGMA-SGD, we again set the synchronization period $T$ and the group size $s$ to powers of two.
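
The length-bucketing scheme just described can be sketched as follows (a generic illustration with assumed bucket boundaries and a token budget, not the exact input pipeline used in our experiments):

    # Group sentences into length buckets, then draw each batch from a
    # single randomly chosen bucket, capped at a token budget.
    import random

    def make_buckets(sentences, boundaries=(16, 32, 64, 128, 256)):
        buckets = {b: [] for b in boundaries}
        for s in sentences:
            for b in boundaries:
                if len(s) <= b:
                    buckets[b].append(s)
                    break
            else:
                buckets[boundaries[-1]].append(s)   # overly long sentences go to the last bucket
        return [b for b in buckets.values() if b]

    def sample_batch(buckets, max_tokens=8192):
        bucket = random.choice(buckets)
        batch, tokens = [], 0
        for s in random.sample(bucket, len(bucket)):
            if tokens + len(s) > max_tokens:
                break
            batch.append(s)
            tokens += len(s)
        return batch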

Fig. 6: Runtime distribution of different sentences on a P100 GPU for a Transformer network on WMT17.

Fig. 7: Throughput comparison between different parallel SGD algorithms for Transformer on WMT17.

Fig. 8: Uncased BLEU score for Transformer on WMT17 training for 10 epochs using 16 GPU nodes. Each point is at the boundary of one epoch.

Fig. 7 presents the training throughput as the number of GPU nodes increases from 4 to 64, where the top of the rectangle indicates the ideal throughput without communication overhead. On 16 GPU nodes, WAGMA-SGD achieves the highest throughput, compared with local SGD, Allreduce-SGD (implemented in Horovod [89]), D-PSGD, AD-PSGD, and SGP (one communication neighbor). When the number of GPU nodes increases to 64, as with image classification, WAGMA-SGD exhibits a lower throughput than AD-PSGD but higher than all the other variants. Observe that on 64 nodes, all the algorithms perform far worse than the ideal throughput. We believe that this effect stems from the ratio of the number of parameters (occupying 245 MB alone) to the operational intensity of backpropagation. Since transformer networks mostly consist of tensor contractions implemented as batched matrix products, which utilize GPUs well, communication overhead dominates and not even AD-PSGD manages to overlap communication with computation.

As for accuracy, Fig. 8 presents the BiLingual Evaluation Understudy (BLEU) score (higher is better) on the test dataset after training for 10 epochs on 16 nodes. As the total batch size is a relatively large number of tokens (131,072), it incurs a similar quality degradation as in other deep learning problems [90]. Still, among all SGD variants, WAGMA-SGD achieves the highest score using the shortest training time. Gossip-based SGD variants, including D-PSGD, AD-PSGD, and SGP (1n, i.e., one communication neighbor), have lower scores than the others, likely because of the slower model update propagation. To verify this claim, we increase the number of communication neighbors to two in SGP (2n), which improves the score to 24.5 (equivalent to local SGD). However, this accuracy increase comes at the cost of significantly reduced training speed compared with SGP (1n).

We conduct additional experiments for WAGMA-SGD, similarly to Section V-B: (1) Without using the dynamic grouping strategy (i.e., fixed groups), the score drops to 23.8; (2) By increasing the group size to 16 (i.e., global collective), accuracy does not improve and training throughput drops by a factor of 1.28x; and (3) By decreasing the group size to 2, the score drops to 23.3. These results reaffirm the conclusions from image classification.

V-D Deep Reinforcement Learning

Due to the inherent characteristics of the problem, reinforcement learning poses a more challenging training process than supervised and semi-supervised learning. This also applies to the heterogeneity in workloads during training: since the problems in question involve interacting with an environment in episodes (where failure terminates an episode early), a variety of episode lengths may occur within a single minibatch, in a way that cannot be anticipated or categorized into buckets.

We use the popular Proximal Policy Optimization (PPO) policy gradient optimizer [86] to train a model for robot navigation on a meta-dataset called Habitat [84], which is composed of multiple heterogeneous environments. We first confirm previous claims [103] and our own in Fig. 9, where we collate the runtime distribution of 5,000 training iterations. The runtime is very widely distributed: from 1.7 to 43.5 seconds, with a median below 2 seconds, which makes this task an excellent use case for the load-rebalancing properties of WAGMA-SGD.

Fig. 9: Runtime distribution of experience collecting on a P100 GPU in heterogeneous environments.

Fig. 10: Throughput comparison between different parallel SGD algorithms for DDPPO on Habitat.

To evaluate the performance, we train a standard ResNet-LSTM model for navigation. In particular, the network structure is composed of a ResNet-18 visual encoder connected to a stack of two Long Short-Term Memory (LSTM) [40] recurrent units functioning as the policy, containing 8,476,421 trainable parameters. The measured heterogeneous environments in Habitat, Gibson [104] and Matterport3D [19], consist of interactive RGB-D datasets. We set the experience steps to 128 and use two vectorized (namely, optimized) environments, which means each GPU node needs to collect 256 experience steps for each training iteration. We set the WAGMA-SGD synchronization period $T$ as before.

Fig. 10 presents the training throughput as the number of GPU nodes increases from 16 to 1024, where the top of the rectangle indicates the ideal throughput without communication overhead. Compared with local SGD, D-PSGD, and SGP (four communication neighbors) on 1,024 GPU nodes, WAGMA-SGD achieves 2.33x, 1.88x, and 2.10x speedup, respectively. The violin plot shows the throughput distribution. WAGMA-SGD only has lower throughput than AD-PSGD, since AD-PSGD is fully asynchronous. These results show that WAGMA-SGD excels in handling highly unbalanced workloads, achieving good scaling efficiency.

Complementary to performance, we study the Success weighted by Path Length (SPL) score (higher is better) after training the model for 10 hours on 64 GPUs. All models are tested four separate times to account for variability, and the average scores together with the standard deviation (shaded regions) are plotted in Fig. 11. As the figure shows, despite the scalability of AD-PSGD, it only achieves 0.051 SPL on average and appears to have plateaued, rendering it unusable for RL problems. On the other hand, WAGMA-SGD achieves the highest score over time, even over local SGD. A possible reason for this might be that asynchronous methods tend to overcome local convergence issues in deep reinforcement learning [73]. This is also seen in SGP, which scores higher than local SGD, but not as well as WAGMA-SGD, whose quorum size is larger.

Beyond our experiments, the current state-of-the-art SPL score is 0.922 [103]. This score is achieved after training on 2.5 billion experience steps. WAGMA-SGD consumes a total of 2.6 million experience steps after training for 10 hours using 64 GPUs, achieves on average 83.1% (up to 91.2%) of the SotA score, and its score appears to still be increasing. This indicates that we are able to achieve almost equivalent generalization with three orders of magnitude fewer experience steps.

Fig. 11: SPL score comparison between different parallel SGD algorithms for DDPPO on Habitat. Training on 64 GPU nodes for 10 hours. Each point is at the boundary of every 50 updates.

VI Collectives in Context

Collective operations have a core role in running applications efficiently at scale. As such, their optimization has led to several implementation and algorithmic variants.

Blocking collectives [72] constitute the basic class of operations. In this case, the collective call is allowed to return only when the calling process has completed all the actions needed for its participation in the operation. A first optimization to blocking collectives is to make them non-blocking [42], enabling processes to return immediately and overlap other activities with the ongoing collective.

Some collectives require all the processes to invoke them in order to complete; e.g., a reduction cannot be computed before knowing all the values to reduce. Hence, their completion time can be influenced by any skew (imbalance) among the processes.

Solo collectives [26] remove this synchronization overhead by making the collectives externally-triggerable: once a process joins the collective, it sends an activation message to all the others, making them start the collective independently of their state. An issue of solo collectives is that they make it possible to trigger the collective even if only a single process has joined it. Majority collectives [60] extend the solo idea by requiring that at least half of the processes join the collective before triggering it. While these collectives are not guaranteed to be equivalent to their blocking or non-blocking counterparts, they are suited to machine learning tasks, due to the robustness of stochastic optimization to staleness.

Both solo and majority collectives aim to minimize the synchronization overhead. However, once activated, the collective is fully performed, making the application pay the full operation cost plus the activation overhead. Wait-avoiding group collectives (this work) utilize the approach of solo collectives to achieve asynchrony, and further reduce the overall operation cost by dynamically selecting subgroups of processes, each of which executes the collective independently of the others.

VII Conclusion

We show, both theoretically and in practice, that stochastic optimization via group model averaging — averaging the learned weights across subgroups of nodes — functions well in large clusters. We prove that the algorithm converges under the standard conditions of SGD, and through a careful implementation of wait-avoiding collectives, we use the topology of the network to attain the best scaling results without losing accuracy. For the same number of steps, WAGMA-SGD achieves equivalent (or higher) generalization scores than synchronous SGD, while still achieving up to 2.1x speedup (on RL) over the previous state-of-the-art, gossip-based SGD. Similar results are observed in other models from various sub-fields, empirically demonstrating that this approach is the first to successfully tackle deep learning at extreme scales, dispensing with step-wise global synchronization and bringing SGD to the regime of supercomputers.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §II-A.
  • [2] A. F. Aji and K. Heafield (2017) Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §II-A.
  • [3] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §II-A.
  • [4] D. Amodei and D. Hernandez (2018-05) AI and compute. Note: https://openai.com/blog/ai-and-compute/ Cited by: §I.
  • [5] M. Assran, N. Loizou, N. Ballas, and M. Rabbat (2019) Stochastic gradient push for distributed deep learning. In Proceedings of the Thirty-sixth International Conference on Machine Learning (ICML), Cited by: §I, 4th item, TABLE I, §II, §II, §III-B, §V-A, §V-B.
  • [6] A. A. Awan, C. Chu, H. Subramoni, and D. K. Panda (2018) Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?. In EuroMPI, Cited by: §II-A.
  • [7] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C. Ho, S. Kipnis, and M. Snir (1995) CCL: a portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems 6 (2). Cited by: §II-A.
  • [8] M. Barnett, R. Littlefield, D. G. Payne, and R. Vandegeijn (1995) Global combine algorithms for 2-D meshes with wormhole routing. Journal of Parallel and Distributed Computing 24 (2). Cited by: §II-A.
  • [9] M. Barnett, L. Shuler, R. van De Geijn, S. Gupta, D. G. Payne, and J. Watts (1994) Interprocessor collective communication library (InterCom). In Proceedings of IEEE Scalable High Performance Computing Conference, Cited by: §II-A.
  • [10] D. Basu, D. Data, C. Karakus, and S. Diggavi (2019) Qsparse-local-SGD: distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: TABLE I.
  • [11] M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda (2017) Scalable reduction collectives with data partitioning-based multi-leader design. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11. Cited by: §II-A.
  • [12] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler (2019-05) A modular benchmarking infrastructure for high-performance and reproducible deep learning. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vol. , pp. 66–77. External Links: Document, ISSN 1530-2075 Cited by: §III-B, §V-A, §V-B.
  • [13] T. Ben-Nun and T. Hoefler (2019-08) Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. ACM Comput. Surv. 52 (4). External Links: ISSN 0360-0300, Link, Document Cited by: §I, §I, §II.
  • [14] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) signSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning (ICML), Cited by: §II-A.
  • [15] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to end learning for self-driving cars. External Links: 1604.07316 Cited by: §I.
  • [16] L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2). Cited by: §II.
  • [17] E. Chan, M. Heimlich, A. Purkayastha, and R. Van De Geijn (2007) Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19 (13). Cited by: §II-A.
  • [18] E. Chan, R. van de Geijn, W. Gropp, and R. Thakur (2006) Collective communication on architectures that support simultaneous communication over multiple links. In PPoPP, Cited by: §II-A.
  • [19] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §V-D.
  • [20] K. Chen and Q. Huo (2016) Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 ieee international conference on acoustics, speech and signal processing (icassp), pp. 5880–5884. Cited by: §I, 2nd item, TABLE I, §V-A.
  • [21] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman (2014-10) Project adam: building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO, pp. 571–582. External Links: ISBN 978-1-931971-16-4, Link Cited by: §II-A.
  • [22] J. Daily, A. Vishnu, C. Siegel, T. Warfel, and V. Amatya (2018) GossipGraD: scalable deep learning using gossip communication based asynchronous gradient descent. CoRR abs/1803.05880. External Links: Link, 1803.05880 Cited by: §I, TABLE I.
  • [23] D. De Sensi, S. Di Girolamo, and T. Hoefler (2019) Mitigating network noise on dragonfly networks through application-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–32. Cited by: §III.
  • [24] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng (2012) Large scale distributed deep networks. In Advances in neural information processing systems (NeurIPS), pp. 1223–1231. Cited by: §II-A, TABLE I, §II.
  • [25] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §V.
  • [26] S. Di Girolamo, P. Jolivet, K. D. Underwood, and T. Hoefler (2015) Exploiting offload enabled network interfaces. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 26–33. Cited by: §III-A2, §III-A, §VI.
  • [27] N. Dryden, N. Maruyama, T. Benson, T. Moon, M. Snir, and B. Van Essen (2019) Improving strong-scaling of CNN training by exploiting finer-grained parallelism. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Cited by: §II-A.
  • [28] N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Essen (2019) Channel and filter parallelism for large-scale CNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Cited by: §II-A.
  • [29] N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen (2016) Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), Cited by: §II-A.
  • [30] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar (2018) Slow and stale gradients can win the race: error-runtime trade-offs in distributed SGD. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: TABLE I.
  • [31] Facebook (2018) Gloo. External Links: Link Cited by: §II-A, TABLE I.
  • [32] S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §IV-A.
  • [33] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc (2018) Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures (SPAA), Cited by: §II-A.
  • [34] A. Gibiansky (2017) Bringing HPC techniques to deep learning. URL http://research.baidu.com/bringing-hpc-techniques-deep-learning. Cited by: §II-A, TABLE I, §III.
  • [35] D. Grishchenko, F. Iutzeler, J. Malick, and M. Amini (2018) Asynchronous distributed learning with sparse communications and identification. arXiv preprint arXiv:1812.03871. Cited by: TABLE I.
  • [36] S. Gupta, W. Zhang, and F. Wang (2016) Model accuracy and runtime tradeoff in distributed deep learning: a systematic study. In 2016 IEEE 16th International Conference on Data Mining (ICDM), Cited by: TABLE I.
  • [37] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe (2019) Local SGD with periodic averaging: tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: 2nd item, TABLE I.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §V-B, §V.
  • [39] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing (2013) More effective distributed ML via a stale synchronous parallel parameter server. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS’13, USA, pp. 1223–1231. External Links: Link Cited by: §I, TABLE I.
  • [40] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §V-D.
  • [41] T. Hoefler, S. D. Girolamo, K. Taranov, R. E. Grant, and R. Brightwell (2017-Nov.) sPIN: High-performance streaming Processing in the Network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC17), Cited by: §III-A2.
  • [42] T. Hoefler, A. Lumsdaine, and W. Rehm (2007) Implementation and performance analysis of non-blocking collective operations for MPI. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 52. Cited by: §III, §VI.
  • [43] T. Hoefler, T. Schneider, and A. Lumsdaine (2010) Characterizing the influence of system noise on large-scale applications by simulation. In SC’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11. Cited by: §III.
  • [44] T. Hoefler, J. Squyres, W. Rehm, and A. Lumsdaine (2006-Dec.) A Case for Non-Blocking Collective Operations. In Frontiers of High Performance Computing and Networking - ISPA’06 Workshops, Vol. 4331/2006, pp. 155–164. External Links: ISBN 978-3-540-49860-5 Cited by: §II-A.
  • [45] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu (2017) Gaia: geo-distributed machine learning approaching LAN speeds. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI’17, Berkeley, CA, USA, pp. 629–647. External Links: ISBN 978-1-931971-37-9, Link Cited by: TABLE I.
  • [46] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer (2016) FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In CVPR, Cited by: §II-A.
  • [47] A. Iosup, N. Yigitbasi, and D. Epema (2011) On the performance variability of production cloud services. In 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 104–113. Cited by: §II-A.
  • [48] N. Jain and Y. Sabharwal (2010) Optimal bucket algorithms for large MPI collectives on torus interconnects. In ICS, Cited by: §II-A.
  • [49] A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, and G. Pekhimenko (2019) Priority-based parameter propagation for distributed DNN training. In Proceedings of the 2nd SysML Conference, Cited by: TABLE I.
  • [50] Z. Jia, M. Zaharia, and A. Aiken (2019) Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning (SysML), Cited by: §II-A.
  • [51] Z. Jiang, A. Balu, C. Hegde, and S. Sarkar (2017) Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: TABLE I.
  • [52] P. H. Jin, Q. Yuan, F. N. Iandola, and K. Keutzer (2016) How to scale distributed deep learning?. In Workshop on Machine Learning Systems at NeurIPS 2016, Cited by: TABLE I.
  • [53] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. External Links: 2001.08361 Cited by: §I.
  • [54] A. Koloskova, S. U. Stich, and M. Jaggi (2019) Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340. Cited by: TABLE I.
  • [55] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. In NeurIPS Workshop on Private Multi-Party Machine Learning, Cited by: TABLE I.
  • [56] A. Krizhevsky (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997. Cited by: §II-A.
  • [57] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica, et al. (2018) Exascale deep learning for climate analytics. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Cited by: TABLE I.
  • [58] T. Kurth, J. Zhang, N. Satish, E. Racah, I. Mitliagkas, M. M. A. Patwary, T. Malas, N. Sundaram, W. Bhimji, M. Smorkalov, et al. (2017) Deep learning at 15PF: supervised and semi-supervised classification for scientific data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Cited by: TABLE I.
  • [59] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, Berkeley, CA, USA, pp. 583–598. External Links: ISBN 978-1-931971-16-4, Link Cited by: TABLE I.
  • [60] S. Li, T. Ben-Nun, S. D. Girolamo, D. Alistarh, and T. Hoefler (2020) Taming unbalanced training workloads in deep learning with partial collective operations. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Cited by: §I, 5th item, §II-A, TABLE I, §III-A, §V-A, §V-B, §VI.
  • [61] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, USA, pp. 5336–5346. External Links: ISBN 978-1-5108-6096-4, Link Cited by: §I, 3rd item, TABLE I, §II, §II, §III-B, §IV-A, §IV-A, §V-A.
  • [62] X. Lian, W. Zhang, C. Zhang, and J. Liu (2018-07) Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3043–3052. External Links: Link Cited by: §I, 6th item, §II-A, TABLE I, §II, §II, §III-B, §IV-A, §IV-A, §V-A, §V-B.
  • [63] H. Lim, D. G. Andersen, and M. Kaminsky (2018) 3LC: lightweight and effective traffic compression for distributed machine learning. arXiv preprint arXiv:1802.07389. Cited by: §II-A.
  • [64] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi (2018) Don’t use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217. Cited by: 2nd item, TABLE I, §V-A.
  • [65] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), Cited by: §II-A.
  • [66] H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. External Links: 1806.09055 Cited by: §I.
  • [67] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. External Links: 1808.07233 Cited by: §I.
  • [68] P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, et al. (2020) MLPerf: an industry standard benchmark suite for machine learning performance. IEEE Micro 40 (2), pp. 8–16. Cited by: §V-B.
  • [69] S. McCandlish, J. Kaplan, D. Amodei, and O. D. Team (2018) An empirical model of large-batch training. External Links: 1812.06162 Cited by: §I.
  • [70] R. Mcdonald, M. Mohri, N. Silberman, D. Walker, and G. S. Mann (2009) Efficient large-scale distributed training of conditional maximum entropy models. In Advances in neural information processing systems (NeurIPS), Cited by: TABLE I.
  • [71] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: TABLE I.
  • [72] Message Passing Interface Forum (2015) MPI: A Message-Passing Interface Standard Version 3.1. Cited by: §I, §III, §VI.
  • [73] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §I, §V-D.
  • [74] G. Nadiradze, A. Sabour, D. Alistarh, A. Sharma, I. Markov, and V. Aksenov (2019) SwarmSGD: scalable decentralized SGD with local updates. arXiv preprint arXiv:1910.12308. Cited by: §I, TABLE I, §II, §II, §V-B.
  • [75] NVIDIA (2020) NVIDIA collective communications library. External Links: Link Cited by: §II-A, TABLE I.
  • [76] P. Patarasuk and X. Yuan (2009-02) Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations. J. Parallel Distrib. Comput. 69 (2), pp. 117–124. External Links: ISSN 0743-7315, Link, Document Cited by: §III.
  • [77] R. Rabenseifner (2004) Optimization of collective reduction operations. In International Conference on Computational Science, pp. 1–9. Cited by: §II-A, §III.
  • [78] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2018) Language Models are Unsupervised Multitask Learners. Note: Unpublished manuscript External Links: Link Cited by: §I.
  • [79] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, (AAAI 2019), pp. 4780–4789. Cited by: §I.
  • [80] B. Recht, C. Re, S. Wright, and F. Niu (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pp. 693–701. Cited by: §I, TABLE I.
  • [81] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani (2019) Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: 2nd item, TABLE I.
  • [82] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler (2019) SparCML: High-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Cited by: §II-A.
  • [83] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler (2019) SparCML: High-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, New York, NY, USA. External Links: ISBN 9781450362290, Link, Document Cited by: §I, §I, §III.
  • [84] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied AI research. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347. Cited by: §V-D, §V.
  • [85] J. Schad, J. Dittrich, and J. Quiané-Ruiz (2010) Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment 3 (1-2), pp. 460–471. Cited by: §V-B.
  • [86] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §V-D, §V.
  • [87] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §II-A.
  • [88] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis (2020) Improved protein structure prediction using potentials from deep learning. Nature 577 (7792), pp. 706–710. External Links: ISSN 0028-0836, Document Cited by: §I.
  • [89] A. Sergeev and M. Del Balso (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799. Cited by: §II-A, TABLE I, §III-B, §V-A, §V-C.
  • [90] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl (2018) Measuring the effects of data parallelism on neural network training. External Links: 1811.03600 Cited by: §V-C, §V.
  • [91] M. Shroff and R. A. Van De Geijn (2000) CollMark: MPI collective communication benchmark. In International Conference on Supercomputing, Cited by: §II-A.
  • [92] S. U. Stich (2019) Local SGD converges fast and communicates little. In Proceedings of the Seventh International Conference on Learning Representations (ICLR), Cited by: §I, 2nd item, TABLE I.
  • [93] N. Strom (2015) Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §II-A.
  • [94] Z. Tang, S. Shi, X. Chu, W. Wang, and B. Li (2020) Communication-efficient distributed deep learning: a comprehensive survey. External Links: 2003.06307 Cited by: §I, §II.
  • [95] Z. Tang, S. Shi, and X. Chu (2020) Communication-efficient decentralized learning with sparsification and adaptive peer selection. arXiv preprint arXiv:2002.09692. Cited by: TABLE I.
  • [96] R. Thakur, R. Rabenseifner, and W. Gropp (2005) Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19 (1), pp. 49–66. Cited by: §II-A.
  • [97] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen (2015) LBANN: livermore big artificial neural network HPC toolkit. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments (MLHPC), Cited by: §II-A.
  • [98] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §V-C, §V.
  • [99] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, T. Ewalds, D. Horgan, M. Kroiss, I. Danihelka, J. Agapiou, J. Oh, V. Dalibard, D. Choi, L. Sifre, Y. Sulsky, S. Vezhnevets, J. Molloy, T. Cai, D. Budden, T. Paine, C. Gulcehre, Z. Wang, T. Pfaff, T. Pohlen, D. Yogatama, J. Cohen, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, C. Apps, K. Kavukcuoglu, D. Hassabis, and D. Silver (2019) AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. Note: https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ Cited by: §I.
  • [100] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. External Links: ISSN 1476-4687, Document, Link Cited by: §I.
  • [101] J. Wang and G. Joshi (2019) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Proceedings of the Second SysML Conference, Cited by: 2nd item, TABLE I.
  • [102] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li (2017) TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems (NeurIPS), Cited by: §II-A.
  • [103] E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra (2019) Decentralized distributed PPO: solving PointGoal navigation. arXiv preprint arXiv:1911.00357. Cited by: §I, §II-A, §V-D, §V-D, §V.
  • [104] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, Cited by: §V-D.
  • [105] P. Xie, J. K. Kim, Y. Zhou, Q. Ho, A. Kumar, Y. Yu, and E. Xing (2016) Lighter-communication distributed machine learning via sufficient factor broadcasting. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, Arlington, Virginia, United States, pp. 795–804. External Links: ISBN 978-0-9966431-1-5, Link Cited by: TABLE I.
  • [106] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng (2018) Image classification at supercomputer scale. In NeurIPS Systems for ML Workshop, Cited by: §II-A.
  • [107] S. Zhang, A. E. Choromanska, and Y. LeCun (2015) Deep learning with elastic averaging SGD. In Advances in neural information processing systems (NeurIPS), pp. 685–693. Cited by: §I, TABLE I.
  • [108] W. Zhang, S. Gupta, X. Lian, and J. Liu (2016) Staleness-aware async-SGD for distributed deep learning. In 25th International Joint Conference on Artificial Intelligence (IJCAI), Cited by: TABLE I.
  • [109] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola (2010) Parallelized stochastic gradient descent. In Advances in neural information processing systems (NeurIPS), Cited by: TABLE I.