I Introduction
The introduction of deep learning is one of the most important advancements in science over the past two decades, powering industries from autonomous driving [15] to drug discovery [88]. With the rise of deep neural networks, their training evolved into a computationally intensive task that consumes as many resources as modern complex high-performance computing problems [4]. As a result, an abundance of research has been conducted into its scaling and distribution [13]. The leading contenders for largest workloads in deep learning are Neural Language Models [53, 78], Deep Reinforcement Learning (RL) [100, 69], and Neural Architecture Search [67, 66]. In these regimes, computation time is measured in thousands of "GPU days", with some workloads utilizing hundreds of accelerators (GPUs, TPUs) for several weeks [99, 83, 79].
Distributed training is largely supported by data parallelism, where sample evaluation is partitioned across processors. In this mode of parallelism, all participants must exchange their gradients or model, resulting in an Allreduce operation across a cluster [72]. In practice, the exchange communication dominates the overall runtime [83], especially in large-minibatch SGD. To exacerbate the problem, certain datasets and environments are inherently imbalanced, e.g., with different sentence/video lengths [60] or heterogeneous environments in RL [103].
In order to mitigate the wait time for gradient/weight exchange, existing approaches attempt to relax model consistency between processors [13, 94]. Examples include synchronous gossip-based SGD [61, 5], asynchronous SGD [80, 73, 62, 22], and asynchronous SGD with bounded staleness [39, 107, 20, 92]. Gossip-based SGD replaces the global allreduce by communicating with randomly selected neighbors. Asynchronous SGD breaks the global synchronization to mitigate the effect of stragglers (slow processes). However, most of these approaches adversely impact convergence, necessitating an increase in the number of iterations [5, 74], sometimes to the point where synchronous waits are preferable.
In this paper, we solve this problem by introducing Wait-Avoiding Group Model Averaging (WAGMA) SGD, a novel optimizer that combines group collective communication with bounded staleness, in order to ensure competitive performance with decentralized and asynchronous methods, while retaining the convergence rate of synchronous model-averaging SGD. WAGMA-SGD locally communicates model updates across subgroups of processors, mitigating the need for global communication at every training iteration. Specifically, we propose to use a group allreduce operation for model averaging, in which the fastest process triggers exchanges within all subgroups. Grouping is performed dynamically to facilitate model update propagation, and as a result not only speeds up communication, but also mitigates the effect of unbalanced workloads, all without harming convergence in practice.
We theoretically prove the convergence of WAGMA-SGD, showing that, for certain parameter values, its convergence rate is comparable to synchronous SGD with model averaging. Subsequently, we test the algorithm on a supercomputer equipped with GPUs for three different categories of deep learning: supervised image classification on the ImageNet dataset; semi-supervised language modeling on the WMT17 translation dataset; and deep reinforcement learning on the Habitat indoor navigation dataset. We show that both theoretically and empirically, WAGMA-SGD is favorable over other asynchronous algorithms and the baselines, which makes it an excellent approach for scaling up distributed deep learning.
Our main contributions are:

We propose WAGMA-SGD and realize it based on a wait-avoiding group allreduce operation.

We theoretically analyze the convergence of WAGMA-SGD.

Compared with state-of-the-art decentralized SGD, WAGMA-SGD improves the training throughput by up to 2.1x on 1,024 GPUs, and achieves the fastest time-to-convergence for all three evaluated tasks.
II Background and Related Work
Deep neural networks are primarily trained with minibatch stochastic gradient descent (SGD) [16]. Let B be the batch size, w_t the neural network weights at step t, S_t a sampled set of size B, and ℓ a loss function. We compute the loss for each sample s ∈ S_t as ℓ(w_t, s), and then a stochastic gradient as g_t = (1/B) · Σ_{s ∈ S_t} ∇ℓ(w_t, s). SGD then iterates steps of the form w_{t+1} = w_t − η·g_t, where η is the learning rate. In more general terms, first-order stochastic gradient update rules can take different forms (e.g., by adding a momentum term), which we represent as w_{t+1} = w_t + U(g_t, w_0, …, w_t, t). In distributed environments with P processors, B/P denotes the local batch size per processor. We refer to Ben-Nun & Hoefler [13] for a general overview of distributed deep learning.
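To make the update rule above concrete, here is a minimal mini-batch SGD step in NumPy (an illustrative sketch, not the paper's implementation; `grad_loss` is a hypothetical user-supplied per-sample gradient function):

```python
import numpy as np

def sgd_step(w, samples, grad_loss, lr=0.1):
    """One mini-batch SGD step: average the per-sample gradients over the
    batch, then apply w <- w - lr * g. grad_loss(w, s) returns the gradient
    of the loss for a single sample s."""
    g = np.mean([grad_loss(w, s) for s in samples], axis=0)
    return w - lr * g

# Toy example: scalar least squares, loss(w, (x, y)) = 0.5 * (w*x - y)**2,
# so grad_loss(w, (x, y)) = (w*x - y) * x.
grad = lambda w, s: (w * s[0] - s[1]) * s[0]
w = 0.0
for _ in range(100):
    w = sgd_step(w, [(1.0, 2.0), (2.0, 4.0)], grad, lr=0.1)
# w now approximates 2.0, the minimizer of the summed losses
```

The same loop structure underlies all the distributed variants discussed below; only where and how the averaging happens changes.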
Thanks to the robustness of stochastic optimization, in distributed environments one can relax the weight updates along several axes, trading off communication overhead for convergence. Data-parallel distributed SGD algorithms can be broadly characterized by answering the following five questions:
Q1. What are we averaging?
There are two typical approaches for aggregating distributed updates: gradient averaging and model averaging. When performing gradient averaging, we simply compute the stochastic gradient g_t as the average over the global batch. With standard model averaging, the SGD update is applied locally at each node, and the resulting models are then averaged over all processors.
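To make the distinction concrete, the two schemes can be sketched as follows (schematic Python over scalar "models", simulating all workers in one process; not the paper's implementation):

```python
def gradient_averaging_step(ws, grads, lr):
    """ws: per-worker models; grads: per-worker stochastic gradients.
    Average the gradients across workers (the allreduce), then update."""
    g_avg = sum(grads) / len(grads)
    return [w - lr * g_avg for w in ws]

def model_averaging_step(ws, grads, lr):
    """Apply the SGD update locally first, then average the resulting models."""
    local = [w - lr * g for w, g in zip(ws, grads)]
    w_avg = sum(local) / len(local)
    return [w_avg] * len(local)
```

Starting from identical weights, a single plain-SGD step yields the same result under both schemes (up to floating-point rounding); they diverge once workers take multiple local steps between averages or use stateful update rules such as momentum.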
Complementary to these approaches is the degree of quantization or sparsity in the exchanged updates. As these concepts are out of the scope of this paper, we refer to Tang et al. [94] for a comprehensive survey.
Q2. Who is coordinating the averaging?
Earlier implementations of distributed SGD for deep learning [24] use a centralized coordination architecture, where a parameter server or other coordinator maintains a master copy of the model that the workers use. As this approach does not scale to large numbers of processors, an alternative is a decentralized architecture with a global clock synchronized across workers, where each worker maintains a local replica of the model and communicates updates to the other workers directly.
To mitigate the overheads of global communication and synchronization, several decentralized instances of SGD have been proposed, e.g., [61, 62, 5, 74], where each worker maintains a local model but communicates updates in separate schedules, rather than synchronizing globally.
Q3. How old (stale) can averaged components be?
In a synchronous system, model or gradient averaging occurs when all processes are on the same training iteration t. This does not guarantee that every worker uses the same parameters (i.e., a consistent model); however, standard parameter-server or globally-coordinated methods do ensure all workers have a consistent model. In an asynchronous system, averaging can occur between workers at any point. We thus define the staleness τ of models/gradients as the number of iterations that have passed since the produced value's model was updated. A bounded-staleness system mitigates the convergence issues of asynchronous systems by ensuring that the difference in the number of training iterations between the slowest and fastest processor is bounded, using τ as a proxy.
Q4. How often are we globally averaging?
While bounded- and unbounded-staleness SGD variants do not adhere to rigid communication schedules, some algorithms may periodically synchronize all processors' model replicas. This ensures not only that the staleness is bounded by the synchronization period, but also that the consistency of the model is retained throughout training, mitigating its divergence across processors. In other algorithms, this global consensus is achieved post-training, by choosing the model average or the model with the best generalization scores. Note that under this nomenclature, the synchronous variants' global averaging frequency is one step.
Q5. How many learners are averaging at every step?
In the steps between the aforementioned global model averaging periods, decentralized SGD variants perform local averages within groups (or quorums) of a certain size q, leveraging the fact that several averaging steps can be performed in parallel.
Coordination | Staleness | Gradient Averaging | Model Averaging
---|---|---|---
Centralized | None | Parameter server [59], P3 [49] | —
Centralized | Unbounded | Hogwild! [80], Downpour SGD [24], AASGD [35] | SAPSPSGD [95]
Centralized | Bounded | SSP [39], Rudra [36], Softsync SGD [108], Gaia [45], async SGD [30], Qsparse-local-SGD [10], Hybrid sync/async [58] | EASGD [107], Federated learning [71, 55]
Decentralized (global) | None | Allreduce-SGD [34, 89, 75, 31] | BMUF [20]
Decentralized (global) | Unbounded | — | One-shot SGD [70], SimuParallelSGD [109]
Decentralized (global) | Bounded | Eager-SGD [60], SFB [105], Gradient lag [57] | —
Decentralized (group) | None | — | —
Decentralized (group) | Unbounded | — | —
Decentralized (group) | Bounded | — | WAGMA-SGD
Decentralized (gossip) | None | — | D-PSGD [61], SGP [5]
Decentralized (gossip) | Unbounded | GossipGraD [22], Choco-SGD [54] | AD-PSGD [62], Gossiping SGD [52], SwarmSGD [74]
Decentralized (gossip) | Bounded | CDSGD [51] | Local SGD [92, 64, 37, 101, 81]
Removing the global communication bottleneck in decentralized SGD has been shown to enable scaling to tens and even hundreds of nodes [61, 62, 5]. However, performing averaging in pairs comes at the cost of worse convergence: early proposals for decentralized algorithms [61, 62] lose accuracy with respect to the synchronous baseline at scale, while more recent work [5, 74] observes that the algorithms can achieve full accuracy if executed for more iterations than the synchronous baseline; in particular, they execute between two and four times more SGD iterations in total, erasing much of the speedup due to increased scalability. This decreased convergence is connected to the analytical bounds provided for these algorithms: while the theoretical convergence rates suggest linear speedup with respect to the number of SGD steps and executing nodes, these rates only apply after a very large number of SGD steps have been taken, in order to allow the pairwise averaging process to "mix" well, thereby simulating all-to-all averaging. See Section IV-A for a detailed discussion.
II-A Training at Scale
An orthogonal challenge to distributed stochastic optimization is that of unbalanced workloads. Imbalance may be caused by the training system [47, 60, 62] or by the task itself [60, 103]. Training on multi-tenant cloud systems can suffer from performance variability due to resource sharing. Several deep learning tasks, such as video classification and machine translation, have inherent load imbalance, because input/output sequences have different lengths [60]. In deep reinforcement learning, an agent must interact with the environment to generate training data. For RL tasks using heterogeneous environments [103], the runtime of training data generation varies significantly.
Beyond deep learning, allreduces have a long history within the HPC community [96, 77, 9, 91, 7, 8, 18, 17, 48, 106, 6, 11], and non-blocking versions have been used to improve performance [44]. Particular implementations have become widely used within the deep learning community, including Baidu-Allreduce [34], NCCL [75], Gloo [31], and Horovod [89]. Most deep learning frameworks incorporate support for distributed training, either via parameter servers or allreduces [24, 21, 1, 46]. Communication compression is another common (and complementary) approach to reducing communication overhead [87, 93, 29, 65, 2, 3, 102, 63, 14, 82]. Communication may also be impacted by different approaches to partitioning layer parameters, such as model parallelism [56, 97, 33, 27, 28, 50].
II-B Comparison Targets
In Table I we summarize and classify the distributed SGD algorithms most relevant to our work. Algorithms in bold are used for comparison in this work. Since decentralized algorithms typically scale and perform better on large-scale systems, we limit our comparison to them. The algorithms we compare against are chosen specifically to be spread across the different answers to the above five questions, prioritizing popular algorithms with proven convergence, both in theory and in practical deep learning applications:
Allreduce-SGD is the standard data-parallel training.

Decentralized parallel SGD (D-PSGD) [61] uses a ring topology, where each process averages its local model with its two neighbors. Processes advance synchronously with a single global clock.

Stochastic gradient push (SGP) [5] generalizes the topology used in D-PSGD to support more flexible, asymmetric communication patterns.

Eager-SGD [60] uses partial collective allreduces over the gradients, allowing at most half of the processors to contribute stale gradients if they are not ready.

Asynchronous decentralized parallel SGD (AD-PSGD) [62] extends the idea of D-PSGD by allowing processors to communicate updates at any point in time.
These cover nearly all varieties of consistency and averaging, as well as practical differences in communication patterns.
II-C Discussion
Following the discussion on the impact of quorum size on convergence (Q5), it is natural to ask whether performing decentralized averaging in larger groups would be able to provide the best of both worlds, enabling the full convergence of the synchronous algorithm, and the scalability of fully decentralized ones. There are two main barriers to this solution: the first one is at the implementation level, since, to our knowledge, no efficient nonblocking implementation of group model averaging exists. The second is at the application level, since it is not clear whether group averaging would be able to achieve the same convergence as the synchronous solution (both in theory and in practice). In the following sections, we address both of these issues.
III Wait-Avoiding Group Communication
The allreduce collective operation [72] is defined as a reduction whose results are shared among all participants. Although several optimizations [34, 77] have been designed to improve the performance of this collective, allreduce poses an implicit global synchronization point, which makes it vulnerable to stragglers during deep learning training. On larger systems, the performance of the compute nodes can be impacted by different internal (e.g., load imbalance) and external factors (e.g., network [23] or OS [43] noise), potentially increasing the synchronization overhead. We define this collective as a synchronous allreduce. While non-blocking collectives [42] can alleviate the synchronization overhead, they do not fully remove it, and waiting for completion is still required. Even if the participating processes are perfectly synchronized, the runtime of an allreduce of size N scales at best as Ω(log P + N) for P processes [76, 83]. Therefore, growing process counts will reduce the parallel efficiency and eventually make the reduction a scaling bottleneck.
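As a back-of-the-envelope illustration of this scaling argument, consider the standard alpha-beta communication cost model (the constants below are assumed for illustration, not measured on any system):

```python
import math

def allreduce_cost(P, N, alpha=1e-6, beta=1e-9):
    """Approximate allreduce time in the alpha-beta model for N words over P
    processes (alpha: per-message latency, beta: per-word transfer time),
    using the classic reduce-scatter + allgather scheme. Illustrative only."""
    latency = 2 * alpha * math.log2(P)
    bandwidth = 2 * beta * N * (P - 1) / P
    return latency + bandwidth

# The latency term grows with log P while the bandwidth term approaches a
# constant 2*beta*N, so a fixed-size reduction loses parallel efficiency
# as the process count grows.
```

Under this model, doubling the node count for a fixed model size strictly increases the reduction time, which is the bottleneck WAGMA-SGD's group collectives aim to avoid.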
III-A Wait-Avoiding Group Allreduce
To overcome the synchronization overhead and overall collective cost, we introduce a new class of wait-avoiding group collectives, focusing on group allreduce for the purpose of this work. We relax synchronization by making the collectives externally-triggerable [26, 60], namely, a collective can be initiated without requiring that all the processes enter it, by externally activating the communication schedule of late processes with activation messages sent by the early ones. Once activated, a group allreduce does not perform a global reduction. Instead, it partially reduces the data within non-overlapping groups of processes, limiting the number of communications needed to implement the group collective.
III-A1 Collective activation
In a wait-avoiding group allreduce, any process can make progress regardless of what the other processes are working on. This wait-avoidance is achieved by the activation component. We call the process that reaches the collective call first the activator. The activator is in charge of informing all the other processes that an allreduce operation has started and that they have to participate, regardless of whether they have reached the collective call-site.
In a wait-avoiding group allreduce, any process can initiate the collective. We use a modified version of the recursive doubling algorithm that builds a butterfly topology, which can be seen as a set of overlapping binomial trees, one rooted at each process. Any node can activate the collective by sending activation messages along the binomial tree rooted at itself. Fig. 1 shows an example where P1 is the activator. In this case, P1 uses its broadcast tree and sends the activation messages to P0 and P3. Once activated, P0 first forwards the activation message to P2, after which it starts executing its group allreduce schedule.
It is possible that several processes arrive at the wait-avoiding group allreduce operation in close proximity, which means we may have more than one activator during the activation phase. To guarantee that a process does not execute the same collective twice, we assign each operation a version number that is increased every time the collective is executed. The collective version number is encoded in the tag of the activation messages: once an activation is received, the collective schedule is activated only if its version number is lower than or equal to the one carried by the activation message. The version-number check is also executed when a process reaches the collective call: if it fails, then the version of the collective that the process wants to activate has already been executed (and the process has passively participated in it). In this case, no activation messages are sent.
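The version-number logic can be sketched as follows (schematic Python, not fflib's actual API; `activation_targets` assumes a simplified recursive-doubling neighbor set):

```python
import math

class WaitAvoidingCollective:
    """Schematic version-number bookkeeping for wait-avoiding activation
    (illustrative only; the real logic lives in the library's schedules)."""

    def __init__(self, rank, num_procs):
        self.rank, self.P = rank, num_procs
        self.version = 0   # highest collective version executed so far
        self.calls = 0     # number of times we reached the call-site

    def activation_targets(self):
        # Neighbors of this rank in its binomial broadcast tree
        # (one per dimension of the butterfly; simplified here).
        return [self.rank ^ (1 << k) for k in range(int(math.log2(self.P)))]

    def on_activation(self, msg_version):
        """An activation message arrived: execute only if not yet executed."""
        if self.version < msg_version:
            self.version = msg_version
            return True    # activate the group-allreduce schedule
        return False       # this version already ran; ignore the duplicate

    def on_call(self):
        """This process reached the collective call-site itself."""
        self.calls += 1
        if self.version >= self.calls:
            return []      # we already participated passively: send nothing
        self.version = self.calls
        return self.activation_targets()  # act as activator: notify the tree
```

For example, a process that was externally activated for version 1 sends no activation messages when it later reaches that call-site itself, but does activate the next instance.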
III-A2 Asynchronous execution
To enable asynchronous execution of the custom collectives, we extend the fflib communication library [26], adding support for wait-avoiding group allreduce. fflib allows programmers to customize collective operations via a flexible, DAG-based representation of point-to-point and local compute operations, defined as schedules. The library provides a C-based interface for schedule creation and non-blocking invocation, using MPI as its primary backend, with additional support for network offloading engines such as sPIN [41]. Our schedule for group operations models both the activation and group allreduce phases.
III-B Dynamic Grouping Strategy
As discussed in Section II, in data-parallel SGD variants such as allreduce SGD [89, 12] and gossip SGD [61, 5, 62], each process keeps propagating local model updates to all the other processes at every iteration to make global progress. We propose a dynamic grouping strategy to reduce the latency (in steps) of local update propagation. Together with the group allreduce operation, the grouping strategy guarantees that the local updates can be globally propagated within ⌈log_q P⌉ iterations. The larger the group size q, the faster the updates are propagated. By carefully selecting the group size, we can achieve both lower latency than gossip SGD and efficient communication by reducing contention.
We define the dynamic grouping strategy in Algorithm 1. We assume the number of processes P is a power of two, which is a common case in current distributed training systems. The group size q is also set to a power of two. In line 2, we initialize the mask and calculate the number of phases in a butterfly topology for q and P processes, respectively. Line 3 initializes the shift. In each training iteration t, the algorithm first initializes P groups, each of which contains one process (line 4). In line 8, an equivalence relation between each pair of processes i and j is established using the bitwise XOR operation. For a pair of processes with an equivalence relation, we find the groups that i and j belong to, respectively (line 9); if i and j are not in the same group, we merge the two groups into one using the union operation (lines 10–12). By line 15, all the processes will have been partitioned into groups of size q for iteration t. Note that the initial value of shift changes periodically in every iteration (line 3), which, in turn, changes the group composition in every iteration.
To demonstrate dynamic grouping, we use P = 8 and q = 4 as an example. In iteration t, all processes are initially partitioned into 8 groups. The set of equivalence relations^1 includes (0,1), (2,3), (4,5), (6,7), (0,2), (1,3), (4,6), and (5,7). By recursively merging the two groups to which a pair of processes with an equivalence relation belongs, we obtain two non-overlapping groups, which contain the processor sets {0, 1, 2, 3} and {4, 5, 6, 7}. In iteration t+1, the set of equivalence relations changes; thus, the grouping changes accordingly (i.e., {0, 1, 4, 5} and {2, 3, 6, 7}).

^1 The equivalence relations satisfy reflexivity, symmetry, and transitivity; thus, some of the listed pairs are implied by the others. We still put them in the set for easier explanation.
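The grouping in this example can be reproduced with a short union-find sketch. Note that the phase schedule `shift = t*log2(q) mod log2(P)` below is an assumption inferred from the example above, not necessarily the paper's exact schedule:

```python
import math

def groups(t, P, q):
    """Partition ranks 0..P-1 into P//q groups of size q for iteration t.

    Sketch of the dynamic grouping: each iteration selects log2(q) of the
    log2(P) butterfly phases; ranks i and j = i XOR 2**k are merged
    (union-find) for every selected phase k."""
    lp, lq = int(math.log2(P)), int(math.log2(q))
    shift = (t * lq) % lp                 # assumed rotation of the phase window
    parent = list(range(P))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for k in range(lq):                   # merge along the selected phases
        phase = (shift + k) % lp
        for i in range(P):
            ri, rj = find(i), find(i ^ (1 << phase))
            if ri != rj:
                parent[ri] = rj
    out = {}
    for i in range(P):
        out.setdefault(find(i), []).append(i)
    return sorted(out.values())

print(groups(0, 8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(groups(1, 8, 4))  # [[0, 1, 4, 5], [2, 3, 6, 7]]
```

Both printed partitions match the worked example, and rotating the phase window changes the group composition every iteration while keeping the group size fixed.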
Note that we use Algorithm 1 only to formally describe the grouping strategy. The grouping strategy, together with the allreduce within each group, is implemented concisely following the phases of the butterfly topology, namely each pair of processes with an equivalence relation in a phase exchanges messages. We use the shift variable to select the phases that should be executed in the current iteration. Fig. 2 presents the iterative execution of group allreduce with dynamic grouping in WAGMA-SGD; the grouping is shown on the right side. We can see that although the group size is fixed, the groups dynamically change over the iterations. Within each group, the allreduce is conducted following the phases of the butterfly topology. To maintain convergence with this communication scheme in data-parallel deep learning training, a standard synchronous allreduce across all the processes is conducted periodically, bounding the staleness of the weights. In the following section, we present the algorithm in detail and further discuss this periodic synchronization.
IV Wait-Avoiding Group Model Averaging SGD
Based on the insight that larger groups converge faster, and on the novel implementation of wait-avoiding group collectives, we design the Wait-Avoiding Group Model Averaging (WAGMA) SGD algorithm. The algorithm can be classified as a model-averaging, bounded-staleness, decentralized SGD with a group size of q and a fixed global synchronization period. As listed in Algorithm 2, WAGMA-SGD is similar to minibatch SGD, but with a few important distinctions.
In lines 3–7, each process calculates the local gradients and then applies them to derive the local model update, as in distributed SGD. Subsequently, the wait-avoiding group model averaging is conducted (lines 8–17) using the aforementioned wait-avoiding communication scheme. From an algorithmic perspective, WAGMA-SGD does not rely on a particular choice of group members for the local collectives. However, instead of randomly choosing groups of processes, we use the butterfly strategy (Algorithm 1) for topology-aware, efficient, deterministic communication.
In each iteration, faster processes trigger the model averaging immediately without waiting (line 9, where the shift variable controls the grouping), which may incur averaging the local models with some stale models from slower processes. To both bound the staleness and mitigate divergence across the local model replicas, we define a synchronization period, at which the models are averaged across all the processes using a global allreduce (line 16). Empirically, we set the synchronization period to 10 training iterations, which balances model accuracy with training throughput, as we show in Section V.
An execution snapshot of WAGMA-SGD is presented in Fig. 3. Suppose P1 is a straggler. When the group allreduce in iteration t is triggered by any of the other three processes, P1 can only contribute its stale model parameters from the previous iteration. In iteration t, P1 and P0 are in the same group; therefore, P0's up-to-date model and P1's stale model are averaged, and P0 uses the averaged model for the next iteration of training. P1 subsequently finishes the calculation of its local updated model for iteration t, but finds out that the group allreduce of iteration t has already finished. In this case, it averages its fresh local model with the received result (line 13 in Algorithm 2), and uses the average for the next iteration of training. Meanwhile, the data in the send buffer of P1 is updated with the fresh model. If the group allreduce of the next iteration is triggered by some faster process at this time, P1 will again passively contribute a stale model. When a standard allreduce is called at the synchronization point, it forces all the processes to contribute their model parameters after training for the same number of iterations. In Fig. 3, P1 eventually catches up with the other processes; thus, it contributes a timely model to P3, as they are in the same group.
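Ignoring asynchrony and staleness, the data flow of one WAGMA-SGD iteration can be simulated synchronously as follows (a sketch of Algorithm 2's averaging structure over scalar models; in the real algorithm the group averaging is wait-avoiding and may mix stale models):

```python
def wagma_sgd_iteration(t, ws, grads, lr, groups_fn, sync_period):
    """Synchronous simulation of one WAGMA-SGD iteration over all workers.

    ws: per-worker models; grads: per-worker stochastic gradients;
    groups_fn(t): partition of ranks for iteration t (e.g., Algorithm 1);
    sync_period: global synchronization period bounding staleness."""
    ws = [w - lr * g for w, g in zip(ws, grads)]      # local SGD updates
    new_ws = list(ws)
    for group in groups_fn(t):                        # group model averaging
        avg = sum(ws[r] for r in group) / len(group)
        for r in group:
            new_ws[r] = avg
    if (t + 1) % sync_period == 0:                    # periodic global allreduce
        avg = sum(new_ws) / len(new_ws)               # restores consistency
        new_ws = [avg] * len(new_ws)
    return new_ws
```

With four workers and groups {0,1} and {2,3}, each pair ends the iteration with a shared average; on a synchronization step, all four collapse to the global average.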
IV-A Proof of Convergence
Algorithm Modelling
For analysis purposes, we model the algorithm's execution as follows. We proceed in steps, indexed by time t. Each node maintains its own local model, and has a local partition of the data. In each step, a group of nodes of size q is chosen to interact. Each node takes a local gradient step, and then the nodes in the group average their models. This averaging step might be inconsistent, as per the above semantics.
In the analysis, we will assume that the group of interacting nodes is chosen uniformly at random—in the long run, the resulting interaction graph will have the same properties as the butterfly interaction strategy used in the implementation. While our analysis considers each interaction sequentially, in practice interaction steps can occur in parallel.
Setup and Analytic Assumptions
We assume a standard setting in which we are given a dataset of m samples, and with each sample i we associate a differentiable loss function f_i. Each node is given a random partition of the dataset, and we wish to solve the empirical risk minimization problem of finding argmin_w f(w), where f(w) = (1/m) Σ_{i=1}^{m} f_i(w).
Let f_p be the loss function corresponding to the dataset partition of the p-th node, so that f = (1/P) Σ_{p=1}^{P} f_p. To make the analysis tractable, we make the following standard assumptions on the loss function.
Assumption 1.
We assume the following hold:

(Lipschitz gradients) All functions f_i have L-Lipschitz gradients, for some constant L > 0.

(Bounded Staleness) The staleness during the averaging step is upper bounded by a parameter τ. That is, for any node participating in the averaging step at time t, averaging is performed with respect to a model from some step t′ with t − τ ≤ t′ ≤ t, and every gradient update is applied globally at most τ steps after it was generated.
Convergence result
We can now state the main convergence result. For readability, we state a simplified variant that highlights the relationship between the parameter values, in particular the relation between the convergence time T, the number of processors P, and the size q of the interacting group. The full statement and its proof are deferred to the full version of our convergence proof.
Theorem 1.
Consider the setting described above, in which we wish to optimize a non-convex function f. Let q be the size of a communication group, and assume that the maximum staleness τ is constant. Fix a success parameter ε > 0. For each time t, we define μ_t to be the average of the local models at time t. Then there exists a setting of the learning rate such that, if the algorithm has taken a sufficiently large number of steps T (the precise bound is given in the full version of the proof),
then there exists an iterate μ_t, with t ≤ T, such that E‖∇f(μ_t)‖² ≤ ε,
where the expectation is taken w.r.t. the randomness in the sampling and interactions.
Discussion
At a high level, this claim shows that the algorithm will eventually reach a point where the model average has negligible gradient, i.e., is at a local minimum. While this does not guarantee convergence to a global minimum, it matches the best achievable guarantees for SGD in the non-convex setting [32]. The convergence proof follows the general decentralized asynchronous framework of Lian et al. [62], with differences due to the specific structure of the group communication pattern we employ and the asynchronous nature of wait-avoiding collectives. Further, we note that the convergence guarantee can be extended to 1) apply to individual models instead of the model average; and 2) relax the bounded second moment assumption to a bound on the variance. Both of these improvements come at the cost of additional technical complexity, so we defer them to the full version of our convergence proof. The current statement of the theorem obscures the rate at which convergence occurs: for standard parameter settings, the convergence speedup (i.e., the rate at which we reach a point of negligible gradient) with respect to the number of nodes is linear. This linear speedup is the best possible, and matches the rates for other decentralized algorithms, e.g., [61, 62]. We refer the interested reader to the full version of our convergence proof.

It is interesting to examine the impact of the group size q on convergence: in particular, the time to convergence decreases quadratically in q. Specifically, if the group size is small, say q = 2, and the other parameters are constant, the number of steps T to reach a local optimum matches the best known bounds in the decentralized model for pairwise interactions [61, 62]. However, this T can exceed the number of SGD steps taken during regular training even for moderate node counts, making the bound meaningless in practice. For example, given the number of SGD steps typically taken when training ResNet-50 on ImageNet, the bound is meaningful only for small numbers of processes. For larger group sizes, our analysis decreases this step bound so that it asymptotically matches the convergence rate and step bound obtained when model averaging is performed synchronously and globally (e.g., the bound of [61] for all-to-all communication), which is also practically more relevant.
V Experimental Evaluation
We conduct our experiments on the CSCS Piz Daint supercomputer. Each Cray XC50 compute node contains a 12-core Intel Xeon E5-2690 CPU with 64 GB RAM and one NVIDIA Tesla P100 GPU with 16 GB of memory. The compute nodes are connected by the Cray Aries interconnect in a Dragonfly topology. The communication library is Cray MPICH 7.7.2. We use one MPI process per node and utilize the GPU for acceleration in all of the following experiments. We evaluate three different deep learning problems: image classification (ResNet-50 [38] on ImageNet [25]), machine translation (Transformer [98] on WMT17), and deep reinforcement learning (PPO [86, 103] for navigation in Habitat [84]). For throughput tests, we scale the number of nodes up to the point where the global batch size becomes too large to converge [90].
V-A Baselines
We compare WAGMA-SGD with state-of-the-art data-parallel SGD variants, including Allreduce-SGD [89, 12], local SGD [64, 20], gossip-based SGD variants (D-PSGD [61], AD-PSGD [62], and SGP [5]), and eager-SGD [60]. Unless mentioned otherwise, the synchronization period of local SGD is set to one, namely calling a standard allreduce to average the models in each training iteration, which essentially makes it a synchronous SGD. For SGP, we evaluate its performance with different numbers of communication neighbors [5]. For a more detailed discussion of the baselines, please refer to Section II.
V-B Image Classification with Simulated Workload Imbalance
Residual Networks (ResNets) [38] are pervasively used in computer vision tasks. To evaluate performance, we train ResNet-50 on ImageNet (25,559,081 trainable parameters in total) with a local batch size of 128 images. Although the training workload is balanced, since the input size is fixed, performance variability is observed in practice when training on multi-tenant cloud systems [62, 60, 85]. To simulate the same degree of imbalance, we randomly select two processes at every training step and inject a certain amount of delay (320 ms), in accordance with the performance variability observed on cloud machines [60]. For WAGMA-SGD, we set the synchronization period to 10 training iterations (Section IV); both the number of processes and the group size are powers of two in our experimental configuration.

Fig. 4 shows the training throughput as the number of GPU nodes increases from 4 to 256; the top of the rectangle wrapping each cluster indicates the ideal throughput without communication overhead. Compared with local SGD, Allreduce-SGD (implemented in Deep500 [12]), D-PSGD, SGP (two communication neighbors), and eager-SGD when training on 64 GPU nodes, WAGMA-SGD achieves 1.25x, 1.26x, 1.23x, 1.25x, and 1.13x speedups, respectively. The speedup becomes larger as the number of GPU nodes increases to 256: WAGMA-SGD achieves up to 1.37x speedup. The only algorithm with higher throughput than WAGMA-SGD is AD-PSGD, in which the asynchronous communication is completely overlapped with the computation. These results show that WAGMA-SGD handles unbalanced workloads better than the synchronous SGD algorithms (i.e., local SGD, Allreduce-SGD, D-PSGD, and SGP), as well as the bounded-staleness eager-SGD variant. In the latter case, while staleness is bounded, the algorithm still conducts a global collective communication for gradient averaging in each training iteration. In contrast, WAGMA-SGD keeps the collectives within each group, and thus has better parallel scalability.
Fig. 5 presents the Top-1 validation accuracy (in accordance with MLPerf [68]) when training for 90 epochs on 64 nodes with a total batch size of 8,192. The accuracy of WAGMA-SGD (75.0%) is very close to that of standard Allreduce-SGD (75.8%) and local SGD (75.3%), but WAGMA-SGD significantly reduces the training time. Gossip-based SGD algorithms, such as D-PSGD and the higher-throughput AD-PSGD, attain much lower accuracy than the other variants. This can be explained by the fact that these algorithms have not fully converged, requiring more steps to achieve comparable accuracy [74]. For SGP, we tune the number of communication neighbors to achieve the highest generalization using a directed exponential graph [5], which makes it more accurate than D-PSGD and AD-PSGD, yet still less accurate than WAGMA-SGD. Note that the default number of communication neighbors in SGP is one, whereas we set it to two for better generalization. Overall, WAGMA-SGD achieves the best accuracy-vs-time among all parallel SGD variants.
With its chosen group size, WAGMA-SGD propagates model updates across all processes in fewer iterations than the gossip-based algorithms, which is what enables its higher accuracy. This is consistent with our analysis in Section IV-A.
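To make the propagation argument concrete, the toy simulation below tracks which workers' initial models have "influenced" each worker under a hierarchically shifted grouping. This is a simplified sketch of a dynamic grouping strategy, not the exact schedule used by WAGMA-SGD. With 16 workers, groups of 4 fully mix in 2 averaging rounds, whereas pairwise (gossip-like, group size 2) mixing needs 4:

```python
def rounds_to_full_mixing(num_workers, group_size):
    """Count averaging rounds until every worker's model has been
    influenced by every other worker, under a dynamic grouping that
    shifts group membership hierarchically between rounds (a simplified
    illustration of the strategy described in the text)."""
    influence = [{i} for i in range(num_workers)]
    rounds = 0
    stride = 1
    while any(len(s) < num_workers for s in influence):
        # Group workers whose indices differ only in the current "digit"
        # (base `group_size`), so each round mixes across a wider span.
        groups = {}
        for w in range(num_workers):
            key = (w // (stride * group_size)) * stride + w % stride
            groups.setdefault(key, []).append(w)
        for members in groups.values():
            merged = set().union(*(influence[m] for m in members))
            for m in members:
                influence[m] = set(merged)
        stride *= group_size
        rounds += 1
    return rounds
```

In general, this schedule needs on the order of log base G of P rounds for P workers and group size G, which is why a larger group size (up to a point) propagates updates faster than pairwise gossip.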
To further analyze the convergence properties of WAGMA-SGD, we conduct additional experiments. ❶ In the first experiment, we remove the wait-avoiding group collectives in WAGMA-SGD and only keep the standard allreduce operations at the synchronization points, which is essentially equivalent to local SGD with the same synchronization period. This causes the Top-1 validation accuracy to drop sharply to 66.9%. ❷ In a second experiment, we execute group model averaging without the dynamic grouping strategy (i.e., the groups are fixed). In this case, the Top-1 validation accuracy drops to 73.5%. ❸ We also experiment with increasing the group size to 64 (i.e., a global collective). While accuracy does not increase, the throughput drops by a factor of 1.07x. ❹ Lastly, we decrease the group size to 4 and observe that the Top-1 validation accuracy drops to 72.8%.
The results from experiments ❶ and ❷ indicate that the combination of group allreduce operations and the dynamic grouping strategy is essential for good generalization. The results from experiments ❸ and ❹ demonstrate that our chosen group size empirically performs best among the settings tested.
V-C Machine Translation
Transformers are sequence-to-sequence transducers that can be used to translate a sequence of words from one language to another. We use the standard-sized Transformer network [98], which has 61,362,176 trainable parameters, to train English-to-German translation on the WMT17 dataset. While training the model, the computation overhead changes with the length of the input and output sentences. The samples in the training dataset typically consist of sentences of various lengths, and thus the training workload is unbalanced. As shown in Fig. 6, even when using a bucketing strategy to group sentences with similar lengths, there is high variance in workload size between samples. Specifically, in our experiment each local batch contains an equal number of sentences sampled from a randomly selected bucket, where the maximum local batch size is set to 8,192 tokens. For WAGMA-SGD, we set the synchronization period and the group size.

Fig. 7 presents the training throughput as the number of GPU nodes increases from 4 to 64, where the top of the rectangle indicates the ideal throughput without communication overhead. On 16 GPU nodes, WAGMA-SGD achieves the highest throughput, compared with local SGD, Allreduce-SGD (implemented in Horovod [89]), D-PSGD, AD-PSGD, and SGP (one communication neighbor). When the number of GPU nodes increases to 64, as with image classification, WAGMA-SGD exhibits lower throughput than AD-PSGD but higher than all the other variants. Observe that on 64 nodes, all the algorithms perform far below the ideal throughput. We believe this effect stems from the balance between the number of parameters (occupying 245 MB alone) and the operational intensity of backpropagation: since Transformer networks mostly consist of tensor contractions implemented as batched matrix products, which utilize GPUs well, communication overhead dominates, and not even AD-PSGD manages to fully overlap communication with computation.
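The length-bucketing used for this workload can be sketched as follows; `bucket_width` and the helper names are our own illustrative choices, with only the 8,192-token batch cap taken from the experimental setup:

```python
from collections import defaultdict
import random

def bucket_by_length(sentences, bucket_width=10):
    """Group sentences of similar length into buckets (illustrative;
    `bucket_width` is an arbitrary choice for this sketch)."""
    buckets = defaultdict(list)
    for s in sentences:
        buckets[len(s) // bucket_width].append(s)
    return dict(buckets)

def sample_batch(buckets, batch_tokens=8192, rng=random.Random(0)):
    """Draw one batch from a randomly selected bucket, keeping the total
    token count below `batch_tokens`, as in the setup described above."""
    bucket = buckets[rng.choice(sorted(buckets))]
    batch, tokens = [], 0
    for s in bucket:
        if tokens + len(s) > batch_tokens:
            break
        batch.append(s)
        tokens += len(s)
    return batch
```

Even with such bucketing, batches drawn from different buckets contain sentences of different lengths, so the per-iteration compute still varies across processes.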
As for accuracy, Fig. 8 presents the Bilingual Evaluation Understudy (BLEU) score (higher is better) on the test dataset after training for 10 epochs on 16 nodes. As the total batch size is a relatively large number of tokens (131,072), it incurs quality degradation similar to that seen in other deep learning problems [90]. Still, among all SGD variants, WAGMA-SGD achieves the highest score in the shortest training time. Gossip-based SGD variants, including D-PSGD, AD-PSGD, and SGP (1n, i.e., one communication neighbor), have lower scores than the others, likely because of their slower model update propagation. To verify this claim, we increase the number of communication neighbors to two in SGP (2n), which improves the score to 24.5 (equivalent to local SGD). However, this accuracy increase comes at the cost of significantly reduced training speed compared with SGP (1n).
We conduct additional experiments for WAGMA-SGD, similarly to Section V-B: (1) without the dynamic grouping strategy (i.e., fixed groups), the score drops to 23.8; (2) increasing the group size to 16 (i.e., a global collective) does not improve accuracy, while training throughput drops by a factor of 1.28x; and (3) decreasing the group size to 2 drops the score to 23.3. These results reaffirm the conclusions from image classification.
V-D Deep Reinforcement Learning
Due to the inherent characteristics of the problem, reinforcement learning poses a more challenging training process than supervised and semi-supervised learning. This also applies to the heterogeneity of workloads during training: since the problems in question involve interacting with an environment in episodes (where failure terminates an episode early), a variety of episode lengths may occur within a single minibatch, in a way that cannot be anticipated or categorized into buckets.
We use the popular Proximal Policy Optimization (PPO) policy gradient optimizer [86] to train a model for robot navigation on Habitat [84], a meta-dataset composed of multiple heterogeneous environments. We first confirm previous claims [103] and our own in Fig. 9, where we collate the runtime distribution of 5,000 training iterations. The runtime is very widely distributed, from 1.7 to 43.5 seconds with a median below 2 seconds, which makes this an excellent use case for the load-rebalancing properties of WAGMA-SGD.
To evaluate performance, we train a standard ResNet-LSTM model for navigation. In particular, the network is composed of a ResNet-18 visual encoder connected to a stack of two Long Short-Term Memory (LSTM) [40] recurrent units functioning as the policy, containing 8,476,421 trainable parameters. The measured heterogeneous environments in Habitat, Gibson [104] and Matterport3D [19], consist of interactive RGB-D datasets. We set the number of experience steps to 128 and use two vectorized (namely, optimized) environments, which means each GPU node needs to collect 256 experience steps for each training iteration, and we set the WAGMA-SGD synchronization period accordingly.

Fig. 10 presents the training throughput as the number of GPU nodes increases from 16 to 1,024, where the top of the rectangle indicates the ideal throughput without communication overhead. Compared with local SGD, D-PSGD, and SGP (four communication neighbors) on 1,024 GPU nodes, WAGMA-SGD achieves 2.33x, 1.88x, and 2.10x speedup, respectively. The violin plot shows the throughput distribution. WAGMA-SGD only has lower throughput than AD-PSGD, since AD-PSGD is fully asynchronous. These results show that WAGMA-SGD excels at handling highly unbalanced workloads, achieving good scaling efficiency.
Complementary to performance, we study the Success weighted by Path Length (SPL) score (higher is better) after training the model for 10 hours on 64 GPUs. All models are tested four separate times to account for variability, and the average scores together with the standard deviation (shaded regions) are plotted in Fig. 11. As the figure shows, despite the scalability of AD-PSGD, it only achieves 0.051 SPL on average and appears to have converged there, rendering it unusable for RL problems. On the other hand, WAGMA-SGD achieves the highest score over time, even over local SGD. A possible reason is that asynchronous methods tend to overcome local convergence issues in deep reinforcement learning [73]. This is also seen in SGP, which scores higher than local SGD, but not as well as WAGMA-SGD, whose quorum size is larger.

Beyond our experiments, the current state-of-the-art SPL score is 0.922 [103], achieved after training on 2.5 billion experience steps. WAGMA-SGD consumes a total of 2.6 million experience steps after training for 10 hours on 64 GPUs, and achieves on average 83.1% (up to 91.2%) of the state-of-the-art score, with the score still increasing. This indicates that we are able to achieve almost equivalent generalization with three orders of magnitude fewer iterations.
VI Collectives in Context
Collective operations play a core role in running applications efficiently at scale. As such, their optimization has led to several implementation and algorithmic variants.
Blocking collectives [72] constitute the basic class of operations. In this case, the collective call is allowed to return only once the calling process has completed all the actions needed for its participation in the operation. A first optimization over blocking collectives is to make them non-blocking [42], enabling processes to return immediately and overlap other activities with the ongoing collective.
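The overlap enabled by non-blocking collectives can be illustrated with a thread-based stand-in; the sleep-based `simulated_allreduce` below is our own placeholder for a real operation such as `MPI_Iallreduce`, not an MPI binding:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def simulated_allreduce(data):
    """Stand-in for a communication phase (e.g., MPI_Iallreduce):
    sleeps briefly to mimic network time, then returns the average."""
    time.sleep(0.05)
    return sum(data) / len(data)

def step_nonblocking(data, compute):
    """Launch the collective, overlap it with local computation, and
    only then wait for its completion -- the start/overlap/wait pattern
    that non-blocking collectives enable."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(simulated_allreduce, data)  # "start" the collective
        local = compute()                                # overlapped local work
        return local, handle.result()                    # "wait" for completion
```

A blocking collective corresponds to calling `simulated_allreduce` directly before `compute()`, serializing communication and computation.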
Some collectives require all processes to invoke them in order to complete; e.g., a reduction cannot be computed before all the values to reduce are known. Hence, their completion time can be influenced by any skew (imbalance) among the processes.
Solo collectives [26] remove this synchronization overhead by making the collectives externally triggerable: once a process joins the collective, it sends an activation message to all the others, making them start the collective independently of their current state. An issue with solo collectives is that a single joining process suffices to trigger the collective. Majority collectives [60] extend the solo idea by requiring that a minimum number of processes join the collective before triggering it. While these collectives are not guaranteed to be equivalent to their blocking or non-blocking counterparts, they are well suited for machine learning tasks, owing to the robustness of stochastic optimization to staleness.
Both solo and majority collectives aim to minimize the synchronization overhead. However, once activated, the collective is fully performed, making the application pay the full operation cost plus the activation overhead. Wait-avoiding group collectives (this work) adopt the approach of solo collectives to achieve asynchrony, and further reduce the overall operation cost by dynamically selecting subgroups of processes, each of which executes the collective independently of the others.
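A simple latency-only cost model illustrates why keeping collectives within groups is cheaper: a tree-based allreduce over G group members needs on the order of log2(G) latency steps, versus log2(P) for a global collective over all P processes. The function below is an illustrative model of this effect, not a measurement:

```python
import math

def allreduce_latency_steps(participants):
    """Latency term of a tree-style allreduce: the number of
    communication rounds grows with the log of the participant count
    (a simplified alpha-only cost model)."""
    return math.ceil(math.log2(participants))
```

For example, a group collective over 8 of 1,024 processes costs 3 latency steps under this model, versus 10 for the global operation, on top of avoiding the global synchronization itself.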
VII Conclusion
We show, both theoretically and in practice, that stochastic optimization via group model averaging (averaging the learned weights across subgroups of nodes) functions well in large clusters. We prove that the algorithm converges under the standard conditions of SGD, and through a careful implementation of wait-avoiding collectives, we use the topology of the network to attain the best scaling results without losing accuracy. For the same number of steps, WAGMA-SGD achieves generalization scores equivalent to (or higher than) synchronous SGD, while achieving up to 2.1x speedup (on RL) over the previous state-of-the-art, gossip-based SGD. Similar results are observed in other models from various subfields, empirically demonstrating that this approach is the first to successfully tackle deep learning at extreme scales, dispensing with step-wise global synchronization and bringing SGD to the regime of supercomputers.
References
[1] (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] (2017) Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[3] (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NeurIPS).
[4] (2018) AI and compute. https://openai.com/blog/ai-and-compute/.
[5] (2019) Stochastic gradient push for distributed deep learning. In Proceedings of the Thirty-sixth International Conference on Machine Learning (ICML).
[6] (2018) Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?. In EuroMPI.
[7] (1995) CCL: a portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems 6(2).
[8] (1995) Global combine algorithms for 2D meshes with wormhole routing. Journal of Parallel and Distributed Computing 24(2).
[9] (1994) Interprocessor collective communication library (InterCom). In Proceedings of the IEEE Scalable High Performance Computing Conference.
[10] (2019) Qsparse-local-SGD: distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems (NeurIPS).
[11] (2017) Scalable reduction collectives with data partitioning-based multi-leader design. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11.
[12] (2019) A modular benchmarking infrastructure for high-performance and reproducible deep learning. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 66–77.
[13] (2019) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys 52(4).
[14] (2018) signSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[15] (2016) End to end learning for self-driving cars. arXiv:1604.07316.
[16] (2018) Optimization methods for large-scale machine learning. SIAM Review 60(2).
[17] (2007) Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19(13).
[18] (2006) Collective communication on architectures that support simultaneous communication over multiple links. In PPoPP.
[19] (2017) Matterport3D: learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV).
[20] (2016) Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5880–5884.
[21] (2014) Project Adam: building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571–582.
[22] (2018) GossipGraD: scalable deep learning using gossip communication based asynchronous gradient descent. arXiv:1803.05880.
[23] (2019) Mitigating network noise on dragonfly networks through application-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–32.
[24] (2012) Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1223–1231.
[25] (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
[26] (2015) Exploiting offload enabled network interfaces. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 26–33.
[27] (2019) Improving strong-scaling of CNN training by exploiting finer-grained parallelism. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[28] (2019) Channel and filter parallelism for large-scale CNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[29] (2016) Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).
[30] (2018) Slow and stale gradients can win the race: error-runtime trade-offs in distributed SGD. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
[31] (2018) Gloo.
[32] (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), pp. 2341–2368.
[33] (2018) Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures (SPAA).
[34] (2017) Bringing HPC techniques to deep learning. http://research.baidu.com/bringing-hpc-techniques-deep-learning.
[35] (2018) Asynchronous distributed learning with sparse communications and identification. arXiv:1812.03871.
[36] (2016) Model accuracy and runtime tradeoff in distributed deep learning: a systematic study. In 2016 IEEE 16th International Conference on Data Mining (ICDM).
[37] (2019) Local SGD with periodic averaging: tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems (NeurIPS).
[38] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[39] (2013) More effective distributed ML via a stale synchronous parallel parameter server. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pp. 1223–1231.
[40] (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
[41] (2017) sPIN: high-performance streaming processing in the network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC17).
[42] (2007) Implementation and performance analysis of non-blocking collective operations for MPI. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 52.
[43] (2010) Characterizing the influence of system noise on large-scale applications by simulation. In SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11.
[44] (2006) A case for non-blocking collective operations. In Frontiers of High Performance Computing and Networking (ISPA'06 Workshops), Vol. 4331/2006, pp. 155–164.
[45] (2017) Gaia: geo-distributed machine learning approaching LAN speeds. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI'17), pp. 629–647.
[46] (2016) FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In CVPR.
[47] (2011) On the performance variability of production cloud services. In 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 104–113.
[48] (2010) Optimal bucket algorithms for large MPI collectives on torus interconnects. In ICS.
[49] (2019) Priority-based parameter propagation for distributed DNN training. In Proceedings of the 2nd SysML Conference.
[50] (2019) Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning (SysML).
[51] (2017) Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems (NeurIPS).
[52] (2016) How to scale distributed deep learning?. In Workshop on Machine Learning Systems at NeurIPS 2016.
[53] (2020) Scaling laws for neural language models. arXiv:2001.08361.
[54] (2019) Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv:1902.00340.
[55] (2016) Federated learning: strategies for improving communication efficiency. In NeurIPS Workshop on Private Multi-Party Machine Learning.
[56] (2014) One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997.
[57] (2018) Exascale deep learning for climate analytics. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[58] (2017) Deep learning at 15PF: supervised and semi-supervised classification for scientific data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[59] (2014) Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14), pp. 583–598.
[60] (2020) Taming unbalanced training workloads in deep learning with partial collective operations. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[61] (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 5336–5346.
[62] (2018) Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 80, pp. 3043–3052.
[63] (2018) 3LC: lightweight and effective traffic compression for distributed machine learning. arXiv:1802.07389.
[64] (2018) Don't use large mini-batches, use local SGD. arXiv:1808.07217.
[65] (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In Proceedings of the Sixth International Conference on Learning Representations (ICLR).
[66] (2018) DARTS: differentiable architecture search. arXiv:1806.09055.
[67] (2018) Neural architecture optimization. arXiv:1808.07233.
[68] (2020) MLPerf: an industry standard benchmark suite for machine learning performance. IEEE Micro 40(2), pp. 8–16.
[69] (2018) An empirical model of large-batch training. arXiv:1812.06162.
[70] (2009) Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems (NeurIPS).
[71] (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).
[72] (2015) MPI: a Message-Passing Interface standard, version 3.1.
[73] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1928–1937.
[74] (2019) SwarmSGD: scalable decentralized SGD with local updates. arXiv:1910.12308.
[75] (2020) NVIDIA collective communications library.
[76] (2009) Bandwidth optimal allreduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69(2), pp. 117–124.
[77] (2004) Optimization of collective reduction operations. In International Conference on Computational Science, pp. 1–9.
[78] (2018) Language models are unsupervised multitask learners. Unpublished manuscript.
[79] (2019) Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 4780–4789.
[80] (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pp. 693–701.
[81] (2019) Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems (NeurIPS).
[82] (2019) SparCML: high-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[83] (2019) SparCML: high-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19).
[84] (2019) Habitat: a platform for embodied AI research. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9339–9347.
[85] (2010) Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment 3(1–2), pp. 460–471.
[86] (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
[87] (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
[88] (2020) Improved protein structure prediction using potentials from deep learning. Nature 577(7792), pp. 706–710.
[89] (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799.
[90] (2018) Measuring the effects of data parallelism on neural network training. arXiv:1811.03600.
[91] (2000) CollMark: MPI collective communication benchmark. In International Conference on Supercomputing (ICS).
[92] (2019) Local SGD converges fast and communicates little. In Proceedings of the Seventh International Conference on Learning Representations (ICLR).
[93] (2015) Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
[94] (2020) Communication-efficient distributed deep learning: a comprehensive survey. arXiv:2003.06307.
[95] (2020) Communication-efficient decentralized learning with sparsification and adaptive peer selection. arXiv:2002.09692.
[96] (2005) Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19(1), pp. 49–66.
[97] (2015) LBANN: Livermore big artificial neural network HPC toolkit. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments (MLHPC).
[98] (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008.
[99] (2019) AlphaStar: mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
[100] (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), pp. 350–354.
[101] (2019) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Proceedings of the Second SysML Conference.
[102] (2017) TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems (NeurIPS).
[103] (2019) Decentralized distributed PPO: solving PointGoal navigation. arXiv:1911.00357.
[104] (2018) Gibson Env: real-world perception for embodied agents. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[105] (2016) Lighter-communication distributed machine learning via sufficient factor broadcasting. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI'16), pp. 795–804.
[106] (2018) Image classification at supercomputer scale. In NeurIPS Systems for ML Workshop.
[107] (2015) Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (NeurIPS), pp. 685–693.
[108] (2016) Staleness-aware async-SGD for distributed deep learning. In 25th International Joint Conference on Artificial Intelligence (IJCAI).
[109] (2010) Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems (NeurIPS).