ddl-benchmarks: Benchmarks for Distributed Deep Learning
In recent years, distributed deep learning techniques have been widely deployed to accelerate the training of deep learning models by exploiting multiple computing nodes. However, the extensive communication among workers dramatically limits system scalability. In this article, we provide a systematic survey of communication-efficient distributed deep learning. Specifically, we first identify the communication challenges in distributed deep learning. We then summarize the state-of-the-art techniques in this direction and provide a taxonomy with three levels: optimization algorithm, system architecture, and communication infrastructure. Afterwards, we present a comparative study of seven distributed deep learning techniques on a 32-GPU cluster with both 10Gbps Ethernet and 100Gbps InfiniBand. We finally discuss some challenges and open issues for possible future investigations.
Deep learning (DL) has been widely used in many practical AI applications including computer vision, natural language processing, robotics, etc. In the DL pipeline, deep neural networks with learnable parameters (deep models) are trained for specific problems (e.g., image classification) on a data set, and the well-trained models are then deployed for inference (e.g., predicting the image label). The training process of deep models is essential to obtaining a satisfactory model for deployment. Mini-batch stochastic gradient descent (SGD) and its variants are generally adopted to train deep models. SGD is an iterative learning algorithm that aims to minimize the loss function by updating the model parameters with stochastic gradients calculated on a mini-batch of data sampled from the data set. Each iteration of the SGD algorithm typically consists of the following steps. First, it loads a mini-batch of data as the input to the deep model. Second, it calculates the loss value by feed-forwarding. Third, it calculates the first-order gradients of the model parameters by backpropagation. Finally, it updates the model parameters and then enters the next iteration. The training process is terminated under some condition, e.g., when the loss value has converged. With the increasing size of deep models (e.g., the BERT-xlarge language model has over 1 billion parameters) and training data (e.g., the BDD100K auto-driving data set has 120 million images), training deep models requires a significant amount of computation and may take days to months on a single GPU or TPU. A study from OpenAI reported that since 2012, the amount of compute used in AI training has been increasing exponentially with a 3.4-month doubling time, much faster than Moore's Law.
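The four steps above can be sketched as a minimal training loop. The toy linear least-squares model, data, and hyper-parameters below are illustrative stand-ins, not from the article:

```python
import numpy as np

# A minimal sketch of mini-batch SGD; the linear model and data are toy
# examples chosen so the four steps of each iteration are easy to see.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 10))   # the data set
w_true = rng.normal(size=10)
y_all = X_all @ w_true                # noiseless labels for illustration

w = np.zeros(10)                      # model parameters
lr, batch_size = 0.1, 32

for step in range(200):
    # 1) load a mini-batch of data as the input
    idx = rng.integers(0, len(X_all), size=batch_size)
    X, y = X_all[idx], y_all[idx]
    # 2) feed-forward: compute the loss value
    pred = X @ w
    loss = 0.5 * np.mean((pred - y) ** 2)
    # 3) backpropagation: first-order gradient of the loss w.r.t. w
    grad = X.T @ (pred - y) / batch_size
    # 4) update the model parameters with the stochastic gradient
    w -= lr * grad
```

In a real DL framework the feed-forward and backpropagation steps traverse many layers, but the iteration structure is the same.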
Therefore, it has become common practice to exploit multiple processors (throughout this article, "worker" and "processor" are used interchangeably, as one worker generally exploits one processor to do its calculations) to accelerate the training process with a variety of distributed training techniques. Distributed training requires iterative communication between processors in order to collaboratively update the model parameters. However, with more computing processors and the fast-growing computing power of the processors, the data communication among them gradually becomes the performance bottleneck and limits system scalability.
How to address the communication challenges in distributed DL has attracted great attention from both academia and industry in recent years.
In distributed DL, there are two main forms of parallelism, model parallelism and data parallelism, which enable multiple processors to collaborate on training a single model.
As shown in Fig. 1(a), model parallelism partitions the model parameters across the computing workers. Every worker holds a different part of the parameters and performs computations with its own part. Due to the high dependency between different neurons in the model, each worker must exchange its output results with other workers (the bold lines in Fig. 1(a)) before continuing the computation of the next layer. One main advantage of model parallelism is that it reduces the memory consumption at each worker. However, the unbalanced parameter sizes and the high computing dependency between different layers of the model limit its scalability. In practice, model parallelism is mainly used to train deep models that are too large to fit on a single processor.
In contrast, data parallelism distributes the computational workload of different data samples to different workers. It is more popular in distributed DL due to its better scalability and simpler implementation. The conventional synchronized SGD illustrated in Fig. 1(b) is one of the classical data-parallel algorithms; it has the same convergence performance (in terms of the number of iterations) as SGD on a single worker. In this method, all workers hold identical model parameters, and each worker loads different data to calculate its gradients independently. Before updating the model, a synchronization barrier requires that the distributed gradients be aggregated through the parameter server(s) or all-to-all communication. Distributed synchronized SGD is also known as bulk synchronous parallel (BSP) SGD, as it requires communication and synchronization in every update iteration.
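The BSP data-parallel scheme can be simulated in a few lines. This is a toy stand-in (a linear model with a simulated all-reduce via plain averaging), not a real distributed implementation; the setup of 4 workers is an assumption for illustration:

```python
import numpy as np

# Toy simulation of BSP synchronized SGD with data parallelism: every worker
# holds identical parameters, computes a gradient on its own local mini-batch,
# and a simulated all-reduce averages the gradients before the common update.

def local_grad(w, X, y):
    """Least-squares gradient on one worker's local mini-batch."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
n_workers, dim, local_bs = 4, 5, 16
w_true = rng.normal(size=dim)

w = np.zeros(dim)   # replicated model parameters, identical on all workers
lr = 0.5
for step in range(50):
    # each worker draws its own local mini-batch of data
    batches = [rng.normal(size=(local_bs, dim)) for _ in range(n_workers)]
    grads = [local_grad(w, X, X @ w_true) for X in batches]
    # synchronization barrier: all-reduce (here just an average) of gradients
    g_avg = np.mean(grads, axis=0)
    # every worker applies the same averaged gradient, so models stay identical
    w -= lr * g_avg
```

Because every worker applies the same averaged gradient, the replicas remain bitwise identical, which is exactly the model-consistency property BSP provides.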
Much research has recently been proposed to improve the efficiency and scalability of distributed DL, yet a thorough review and fair comparison of these techniques is lacking. In this article, we develop a taxonomy for describing communication-efficient techniques in distributed DL, and then present a survey of the state-of-the-art techniques. We identify the model intensity and the local batch size as two key factors that affect system scalability. We further conduct a comparative study of seven distributed training algorithms using a 32-GPU cluster with three different network configurations. Our evaluation method and results can serve as a reference for practitioners designing their own distributed DL platforms (our source code is publicly available at https://github.com/HKBU-HPML/ddl-benchmarks).
The rest of the article is organized as follows. We first identify the communication issues in distributed DL. Then we present a survey of the state-of-the-art techniques in addressing the communication challenge, followed by the experimental study. Finally, we discuss the challenges and possible future research directions.
Consider a training job of a deep model with M parameters that uses SGD with a mini-batch size of B. Assume the number of computational operations required for a single data sample in each training iteration is C. We define a new concept, the model intensity I = C/M, which is an intrinsic feature of the model that is independent of the distributed training method. For example, the model intensity of ResNet-50, BERT-Base, and BERT-Large is 470, 249, and 248, respectively. It tells how many computational operations are required for each model parameter, per sample and per iteration.
A data parallelism solution with N workers distributes the computational operations across the workers, so that each worker has a local mini-batch size of b = B/N. We define the communication-to-computation (C2C) ratio as the total amount of communication divided by the total amount of computation per worker. In practice, the total amount of communication is linear in the model size M and also depends on other factors such as the number of workers N, while the per-worker computation is b × C = b × I × M. Hence, the C2C ratio is inversely proportional to the model intensity I and the local batch size b. Our experimental results in the section "Comparative Study" verify that a model with lower model intensity and/or a smaller local batch size is more difficult to scale to large clusters.
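The effect of model intensity and local batch size can be made concrete with back-of-envelope arithmetic. Parameter counts, intensities, and batch sizes are from the article; the proportionality constant of the C2C ratio is omitted, so only relative values are meaningful:

```python
# Relative C2C comparison following the article's definitions:
# intensity I = ops per parameter per sample; C2C ratio ~ 1 / (I * b),
# where b is the local mini-batch size.  Constants are dropped.
models = {
    # name: (parameters, model intensity I, local batch size b)
    "ResNet-50":  (25.5e6, 470, 64),
    "BERT-Base":  (110e6,  249, 64),
    "BERT-Large": (336e6,  248, 8),
}
for name, (params, intensity, b) in models.items():
    ops_per_sample = intensity * params        # C = I * M
    relative_c2c = 1.0 / (intensity * b)       # higher -> harder to scale
    print(f"{name:10s}  ops/sample = {ops_per_sample:.2e}  "
          f"relative C2C = {relative_c2c:.2e}")
```

BERT-Large's relative C2C ratio comes out roughly an order of magnitude above the other two, consistent with its poor scalability in the experiments.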
We use the BERT-Large language model (with 336 million parameters) as an example to understand the communication challenges in distributed training. Given a local batch size of 8 (which is limited by the available GPU memory size), each iteration requires a large number of floating-point operations (FLOPs), which take 163ms on an Nvidia RTX2080Ti. Several communication challenges limit the system scalability of distributed training.
Communication Size: In each training iteration, the whole set of model parameters or their gradients must be exchanged across all workers, while the number of parameters could be millions to billions. For example, the BERT-Large model has a size of 1.34GB if the parameters are stored in 32-bit format. Given N workers, it is a grand challenge to average N such sets of data and synchronize the updated model within a short time period. For instance, when training BERT-Large on a server with 4 RTX2080Ti GPUs connected through PCIe 3.0, each iteration requires 441ms of communication time for the all-reduce operations, resulting in a poor speedup.
Deep models have a layered structure, and the parameters and their corresponding gradients are typically stored as tens to hundreds of tensors. These tensors are calculated layer by layer on the fly, creating intrinsic time dependency that limits the design space of distributed training. Another challenge is the diversity of tensor sizes. It is difficult to fully utilize the high network bandwidth when exchanging small messages. E.g., on our testbed, transmitting 1MB of message across the 10GbE (TCP/IP), 100GbE (TCP/IP), and 100GbIB (RDMA) takes 1.02ms, 0.51ms, and 0.1ms, respectively, corresponding to an effective throughput of 8.2Gbps, 16.5Gbps, and 83.2Gbps; while transmitting a smaller message of 16KB across the 10GbE, 100GbE, and 100GbIB takes 0.11ms, 0.03ms, and 0.008ms, respectively, resulting in much lower throughput of 1.2Gbps, 4.6Gbps, and 16.7Gbps. Optimally exchanging various tensors among a set of workers requires a co-design of system architecture and communication infrastructure that considers not only the network bandwidth but also the communication latency.
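The effective-throughput figures above follow directly from size over time. The helper below reproduces them from the measured transfer times (assuming 1MB = 2^20 bytes and 16KB = 16 × 2^10 bytes):

```python
# Effective throughput = message size / transfer time, using the testbed
# measurements quoted above, to show why small messages under-utilize the link.
def effective_gbps(size_bytes, time_ms):
    return size_bytes * 8 / (time_ms * 1e-3) / 1e9

print(effective_gbps(2**20, 1.02))       # 1MB over 10GbE   -> ~8.2 Gbps
print(effective_gbps(2**20, 0.10))       # 1MB over 100GbIB -> ~84 Gbps
print(effective_gbps(16 * 2**10, 0.11))  # 16KB over 10GbE  -> ~1.2 Gbps
print(effective_gbps(16 * 2**10, 0.008)) # 16KB over 100GbIB-> ~17 Gbps
```

The fixed per-message latency dominates short transfers, which is why merging small tensors (discussed later) pays off.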
Synchronization: When multiple workers participate in training a single model with BSP, the workers are synchronized at every iteration to keep the model consistent. The synchronization cost can be high for two reasons. First, in a heterogeneous environment, some workers may run much faster than others, but the fast workers need to wait for the slowest one to synchronize. Second, in a centralized architecture (e.g., where a central parameter server manages the model parameters), all workers send data to and receive data from the central server during synchronization, which easily makes the central server a bottleneck.
There are three directions to address the above communication challenges: (1) to reduce the C2C ratio; (2) to overlap the communication tasks with the computation tasks; (3) to improve the communication performance by more advanced design of communication primitives, network protocols, and hardware.
In Fig. 2, we develop a three-level taxonomy to describe communication-efficient distributed DL. At the top are the high-level optimization algorithms with different communication complexity, which can be classified into lossy and lossless algorithms. If an algorithm produces identical training results to single-worker SGD, it is lossless; otherwise it is considered lossy. A lossy algorithm may take more iterations to converge to the same loss value. The middle level covers the system architectures that define how the workers exchange information. At the bottom level, there are diverse communication infrastructures offering the fundamental data communication services. A distributed training method may involve six different aspects: communication synchronization, communication compression, scheduling, system architecture, communication protocol, and network topology, and can be described as "it exploits [a synchronization scheme] with/without [compression] and/or [scheduling], running on [a system architecture] built on [a communication protocol] and [a network topology]." By applying this taxonomy, we provide an overview of the state-of-the-art techniques in Fig. 3, and demystify the details in the following subsections.
The optimization algorithms try to address the communication issues in two directions: 1) Reduce the C2C ratio by increasing the computation workload (e.g., large-batch training with BSP) or by reducing the communication complexity with lossy algorithms. Regarding the lossy algorithms, one can relax the synchronization and communication frequency among the computing workers (e.g., stale synchronous parallel (SSP), local SGD, and asynchronous parallel (ASP) SGD), or compress the communication data with quantization and/or sparsification. 2) Schedule the communication tasks and computing tasks properly to hide part of the communication overhead.
One immediate way to reduce the C2C ratio is to increase the workload of each worker by enlarging its local mini-batch size. However, large-batch training can result in poor generalization, which calls for optimization tricks (e.g., layer-wise adaptive rate scaling (LARS)) to improve the generalization ability. The mini-batch size is also limited by the memory size of the processor, as larger mini-batch sizes require more memory to store temporary results.
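The core idea of LARS can be sketched in a few lines: each layer's step is rescaled by the ratio of its weight norm to its gradient norm, so no layer takes a disproportionately large step. This is a simplified sketch; the trust coefficient value and the omitted weight-decay and momentum terms are assumptions relative to the published method:

```python
import numpy as np

# Simplified sketch of layer-wise adaptive rate scaling (LARS).  For each
# layer l:  local_lr_l = trust * ||w_l|| / ||g_l||, and the update uses
# base_lr * local_lr_l.  Momentum and weight decay are omitted here.
def lars_update(weights, grads, base_lr=0.1, trust=0.001, eps=1e-9):
    new_weights = []
    for w, g in zip(weights, grads):
        local_lr = trust * np.linalg.norm(w) / (np.linalg.norm(g) + eps)
        new_weights.append(w - base_lr * local_lr * g)
    return new_weights
```

The rescaling makes the step size of every layer proportional to that layer's weight norm, regardless of how large its raw gradient is, which is what stabilizes very large batch sizes.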
SSP. SSP SGD allows some workers to run more iterations before synchronization. It well addresses the synchronization problem of BSP in a heterogeneous environment, where the fast workers always have to wait for the slowest one before continuing to the next iteration. In SSP, the maximum number of extra iterations (the staleness) between the fastest and the slowest worker is bounded to guarantee training convergence. As shown in Fig. 3, the fastest worker must wait for the slowest one at a barrier when the staleness reaches the bound. In practice, the bound can be predefined or dynamically adjusted during training according to the setting of the training task.
Local SGD. Local SGD  allows all workers to run a specific number of local updates independently before synchronization, which achieves the same training scalability as large-batch training, while preserving a good generalization performance.
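Local SGD can be simulated with the same toy linear model used above. The key difference from BSP is that workers average model parameters after several independent local steps, rather than averaging gradients every step; the number of local steps H and the other constants below are illustrative assumptions:

```python
import numpy as np

# Toy simulation of local SGD: each worker runs H local updates independently,
# then a synchronization round averages the model parameters (not gradients).
rng = np.random.default_rng(2)
dim, n_workers, H, lr = 5, 4, 8, 0.1
w_true = rng.normal(size=dim)

w_global = np.zeros(dim)
for comm_round in range(20):
    local_models = []
    for _ in range(n_workers):
        w = w_global.copy()
        for _ in range(H):                     # H independent local steps
            X = rng.normal(size=(16, dim))
            w -= lr * X.T @ (X @ w - X @ w_true) / 16
        local_models.append(w)
    # synchronization: average the parameter vectors across all workers
    w_global = np.mean(local_models, axis=0)
```

With H = 8 the workers communicate 8 times less often than BSP while still converging on this toy problem; in practice, too large an H can hurt the final accuracy, as discussed later.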
ASP. Asynchronous parallel (ASP)  SGD enables all workers to train the model without waiting for any other workers to update the model parameters. It is generally used with the parameter server (PS) architecture. In each communication round, the worker only needs to push the gradients to the PS, and then pulls the model parameters from the PS which has no requirement to lock the parameters to wait for other workers.
In summary, different communication synchronization algorithms try to relax the synchronization condition and lessen the communication frequency to reduce the communication cost. BSP has the strictest synchronization at each iteration so that it is lossless, but at the cost of high communication. ASP has no synchronization between workers and thus can achieve higher system scalability, but it is lossy and may sacrifice the convergence performance. SSP and local SGD are two intermediate lossy methods by relaxing the synchronization condition and reducing the communication frequency respectively.
In each communication round, the communication cost is generally linear in the size of the communicated data (gradients or parameters). Gradient quantization and gradient sparsification are two popular communication compression techniques for reducing the communication traffic.
Gradient Quantization. In the original distributed training algorithms such as BSP, the model parameters and gradients are typically represented by 32-bit single-precision floating-point numbers. Gradient quantization represents the gradients with only a few bits, with little impact on convergence, so that the communication traffic can be reduced by up to 32 times. For example, signSGD uses only 1 bit per coordinate to represent its sign instead of its floating-point value.
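A sketch of 1-bit sign quantization in the spirit of signSGD, with the workers' signs combined by majority vote; this is a toy stand-in, not a full implementation of the published algorithm:

```python
import numpy as np

# Each worker transmits only the sign of each gradient coordinate (1 bit of
# information per coordinate instead of 32), and the server takes a
# coordinate-wise majority vote over the workers' signs.
def quantize_sign(grad):
    return np.sign(grad).astype(np.int8)

def majority_vote(sign_grads):
    return np.sign(np.sum(sign_grads, axis=0))

grads = [np.array([0.3, -1.2, 0.05]),
         np.array([0.4, -0.1, -0.2]),
         np.array([0.1, -0.7, 0.3])]
agg = majority_vote([quantize_sign(g) for g in grads])
# majority signs per coordinate: +1, -1, +1
```

The update direction then uses only the aggregated signs, typically with a small fixed step size.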
Gradient Sparsification. An orthogonal method to gradient quantization is gradient sparsification. At each communication round, only a small proportion of the coordinates of the d-dimensional gradient vector (say k coordinates, with k ≪ d) contribute to the model updates. One of the representative gradient sparsification algorithms, TopK-SGD, can reduce the communication traffic by two to three orders of magnitude with a slight loss of model accuracy, which is much more aggressive than gradient quantization. In TopK-SGD, after all workers have calculated their local gradients, each worker locally selects its k most "significant" gradients to be averaged with those of all the other workers to update the model parameters, while the remaining gradients are accumulated into the next iteration.
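The Top-K selection with local error accumulation can be sketched as follows; the example gradient values are illustrative:

```python
import numpy as np

# Sketch of Top-K gradient sparsification with error accumulation: each
# worker sends only its k largest-magnitude coordinates and folds the unsent
# residual back into the next iteration's gradient.
def topk_with_residual(grad, residual, k):
    acc = grad + residual                 # add previously unsent gradient
    idx = np.argsort(np.abs(acc))[-k:]    # k largest-magnitude coordinates
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                # only these coordinates are sent
    new_residual = acc - sparse           # the rest is accumulated locally
    return sparse, new_residual

g = np.array([0.9, -0.1, 0.05, -2.0, 0.3])
sparse, res = topk_with_residual(g, np.zeros_like(g), k=2)
# sparse keeps only the two largest-magnitude entries (-2.0 and 0.9)
```

The residual term is what preserves convergence: every coordinate's contribution is eventually transmitted, just possibly several iterations late.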
During the training process of distributed DL, the computing tasks and communication tasks can be described by a directed acyclic graph (DAG). The layer-wise (or tensor-wise) structure of deep models makes it possible to schedule different tasks so that the computing and communication tasks can be overlapped to hide part of the communication cost, as shown in Fig. 3 .
Layer-wise Pipelining. Deep models are stacked with multiple layers, and the learnable parameters of each layer are generally represented by one or two tensors. During backpropagation (from the last layer to the first layer), once the gradients of layer l have been calculated, they can immediately be communicated to other workers for averaging, so that the communication task is pipelined with the computing task of layer l-1. This naive pipelining between communications and computations during backpropagation is also called wait-free backpropagation (WFBP). Considering that transmitting two small messages together is generally faster than transmitting them separately, the MG-WFBP algorithm optimally merges the gradients of several layers to minimize the iteration time.
Tensor Partitioning and Priority Scheduling. In the PS architecture, there are two directions of communication (i.e., the push of gradients and the pull of parameters), and for each layer the push of gradients is followed by the pull of parameters. If the current layer has a very large tensor, it will block other layers with small tensors. A more efficient scheduling strategy is therefore to partition a large tensor into multiple small tensors, thereby allowing the lower layers to be scheduled ahead of the higher layers. With tensor partitioning and priority scheduling, it becomes possible to overlap the communication tasks with both feed-forward and backpropagation computing tasks.
In distributed DL, PS and all-to-all are two main system architectures supporting the data communication as shown in Fig. 3 .
Parameter Server. In the parameter server (PS) architecture, the PS is logically a central server that aggregates the gradients from the workers, updates the model parameters with the gradients, and sends the latest model parameters back to the workers. It provides a very flexible framework for different optimization algorithms; for example, SSP and ASP can be easily implemented with the PS. However, since the PS needs to receive gradients from and send parameters to all workers, it can easily become the system bottleneck, especially in the BSP algorithm where all workers communicate with the PS simultaneously. The round-robin synchronous parallel scheme can alleviate the bandwidth bottleneck by delaying some workers' updates. We can also use multiple PSes to alleviate the communication pressure. BytePS (https://github.com/bytedance/byteps) is a highly optimized framework that supports multiple PSes by partitioning the gradient tensors in a load-balanced manner, so that each PS only needs to communicate a part of the gradient tensors.
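The load-balanced sharding idea behind multi-PS aggregation can be sketched as follows. This is a simplified illustration in the spirit of BytePS, not its actual implementation; the shard-then-average logic and all names are our own:

```python
import numpy as np

# Sketch of load-balanced gradient sharding across multiple parameter servers:
# the flattened gradient is split into roughly equal shards, and each PS
# aggregates (averages) only its own slice from all workers.
def ps_aggregate(worker_grads, n_servers):
    shards_per_worker = [np.array_split(g, n_servers) for g in worker_grads]
    averaged = [np.mean([w[s] for w in shards_per_worker], axis=0)
                for s in range(n_servers)]      # PS s averages shard s
    return np.concatenate(averaged)             # workers pull the full result

grads = [np.arange(10.0), np.arange(10.0) * 3]  # two workers' gradients
avg = ps_aggregate(grads, n_servers=4)          # elementwise average
```

Since every PS carries roughly 1/n of the traffic, no single server saturates its link the way a lone PS would.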
All-to-all. In BSP and local SGD, the average of the distributed gradient or parameter vectors can be calculated by an all-to-all operation, e.g., the all-reduce in the Message Passing Interface (MPI). In MPI, the all-reduce operation can be implemented with one reduction operation and one broadcast operation using the reduction-tree algorithm. The ring-based all-reduce collective is widely used in distributed DL, as it can fully utilize the bandwidth. However, ring-based all-reduce has a latency term that is linear in the number of workers, which is inefficient when scaling to large clusters. In the recent version 2.4 of Nvidia's high-performance communication library NCCL (https://developer.nvidia.com/nccl), the double binary trees algorithm is integrated for dense-GPU clusters; it delivers logarithmic latency while preserving bandwidth optimality. The double binary trees algorithm requires the data to be broken into multiple blocks and the workers to be arranged as two binary trees, so that different blocks can be executed in parallel. Horovod (https://github.com/horovod/horovod) is a popular distributed DL framework built for the all-to-all architecture, and it supports many state-of-the-art communication libraries (e.g., MPI, NCCL, and Gloo, https://github.com/facebookincubator/gloo).
At the bottom level, the communication infrastructures are crucial to providing communication-efficient operations, which include the communication protocols and the design of network topology.
As shown in Fig. 3, there are currently three sets of popular communication protocols: TCP/IP, InfiniBand (IB), and RoCE. TCP/IP is widely supported by network interface cards (NICs) and commodity Ethernet switches. However, it is inefficient for high-speed data communication in distributed training due to the high cost of copying data between the kernel buffer and the application buffer; TCP/IP also consumes CPU resources to process packets. On the other hand, RDMA (remote direct memory access) technology can deliver lower latency and higher throughput than TCP/IP. With RDMA, the NIC writes data directly to application memory without involving the operating system. IB was designed around the RDMA protocol, but it requires expensive NICs and switches. RoCE was developed to run RDMA over cheaper Ethernet, and it has two versions, RoCEv1 and RoCEv2. RoCEv1 is an Ethernet link-layer protocol, so only two nodes in the same Ethernet broadcast domain can communicate; RoCEv2 is an Internet-layer protocol, so RoCEv2 packets can be routed. In summary, RDMA delivers much lower latency and higher throughput than conventional TCP/IP, and can be used over both IB and Ethernet.
Fat-tree and BCube are two main network topologies in modern data centers; both provide moderate network robustness at low cost. As shown in Fig. 3, Fat-tree employs only switches for packet forwarding, while in BCube both servers and switches participate in packet forwarding. In both topologies, each server is guaranteed the full link bandwidth when communicating with any other server in the cluster. However, Fat-tree cannot simultaneously support two ring-based synchronizations, while BCube can when the number of levels and the number of ports per switch are even. In the PS architecture with multiple PSes, BCube performs slightly better than Fat-tree, and BCube is also friendlier to hierarchical synchronization. The Torus network topology has recently been deployed for distributed DL (e.g., in Google's TPU clusters). From Fig. 3, we can see that Torus has ring structures in both the horizontal and vertical directions, which makes it well suited to ring-based all-to-all algorithms: the original vector can be partitioned into two parts that are simultaneously all-reduced along the rings of the two directions, fully exploiting the link bandwidth.
To conclude this section, we summarize and compare different distributed DL techniques in Table I.
| Hierarchy | Technology | Method | Ref. | Lossy/Lossless | Sync. Freq. | Comm. Freq. | Comm. Traffic | Convergence | Efficiency | Robustness | Latency |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Optimization Algorithm | Communication Synchronization | BSP | - | Lossless | High | High | High | Easy | - | - | - |
| Communication Infrastructure | Communication Protocol | TCP/IP | - | - | - | - | - | - | - | - | High |
In this section, we evaluate and compare the system performance of seven state-of-the-art distributed training techniques listed in Table II, where we focus on the lossless algorithms which have the same training convergence speed as the vanilla SGD.
| Method | System Architecture | Pipelining | Tensor Fusion | Tensor Partition | Distributed Software | Common Libraries |
|---|---|---|---|---|---|---|
| BSP-PS | PS | ✘ | ✘ | ✘ | BytePS | PyTorch-1.4, CUDA-10.1, NCCL-2.4.8 |
| BSP-A2A [2, 13] | All-to-all | ✘ | ✘ | ✘ | Horovod | PyTorch-1.4, CUDA-10.1, NCCL-2.4.8 |
| WFBP-A2A [2, 13] | All-to-all | ✔ | ✘ | ✘ | Horovod | PyTorch-1.4, CUDA-10.1, NCCL-2.4.8 |
Hardware: Experiments are conducted on a GPU cluster with two sets of communication infrastructure: TCP/IP over 10Gbps Ethernet (10GbE), and RDMA with 100Gbps InfiniBand (100GbIB). The cluster consists of 8 nodes, and each node has 4 Nvidia RTX2080Ti GPUs (11GB RAM) connected with PCIe3.0 x16, two Intel(R) Xeon(R) Gold 6230 CPUs, and 512GB memory. To fairly compare the network protocols between RDMA and TCP/IP, we further configure an extra 100Gbps Ethernet (100GbE) interface atop the IB NIC through IPoIB.
Software: We use PyTorch-1.4 (https://pytorch.org/) as the backbone framework with the GPU libraries CUDA-10.1, cuDNN-7.6, and NCCL-2.4.8. We use the highly optimized BytePS and Horovod libraries for the PS and all-to-all architectures, respectively.
Deep Models: Three representative models, ResNet-50, BERT-Base, and BERT-Large, whose model intensities are 470, 249, and 248, respectively, are used to compare the performance of the different methods. BERT-Base (with 110 million parameters) has a model size about four times that of ResNet-50 (with 25.5 million parameters), while BERT-Large (with 336 million parameters) is around three times larger than BERT-Base. To maximize the utilization of GPU memory (11GB), the local mini-batch sizes are set to 64, 64, and 8 for ResNet-50, BERT-Base, and BERT-Large, respectively.
Measurements: We use system throughput (i.e., samples processed per second during training) to measure the performance of the evaluated methods. For ResNet-50, a sample is an image of fixed resolution; for BERT-Base and BERT-Large, a sample is a sentence of fixed length. We use SGD training on a single RTX2080Ti GPU as the baseline when calculating the speedups. Note that when comparing results across different numbers of workers, the effective batch sizes differ, so the convergence behavior might also differ.
The experimental results are shown in Fig. 4. Each result is the average of five independent experiments. For each run, we conduct 10 training iterations for warm-up, and another 100 iterations for measuring the average throughput.
ResNet-50 vs. BERT-Base: The model intensity of ResNet-50 is about twice that of BERT-Base, and their mini-batch sizes are both 64, so the C2C ratio of ResNet-50 is around half that of BERT-Base. Comparing Fig. 4(a)-(c) with Fig. 4(d)-(f), we can see that ResNet-50 has much better scalability than BERT-Base. For example, on intra-node training with 4 GPUs, we achieve a nearly optimal speedup on ResNet-50 but a much lower one on BERT-Base; on inter-node training with 32 GPUs using 100GbIB, ResNet-50 scales significantly better than BERT-Base. The results confirm that a model with higher intensity is easier to parallelize.
BERT-Base vs. BERT-Large: The model intensities of BERT-Base and BERT-Large are very close, but the mini-batch size for BERT-Base can be larger than that of BERT-Large due to its smaller GPU memory consumption. Therefore, the C2C ratio of BERT-Large is considerably higher than that of BERT-Base, which makes BERT-Large much more difficult to parallelize, as confirmed by comparing Fig. 4(g)-(i) with Fig. 4(d)-(f). For example, 4-GPU training achieves a much lower maximum speedup on BERT-Large than on BERT-Base. The small GPU memory of the RTX2080Ti and the bandwidth of PCIe3.0 are not suitable for distributed training of BERT-Large. In contrast, when training BERT-Large on a more expensive server with four Nvidia V100 GPUs (with 32GB memory) connected by NVLink (with much higher bandwidth than PCIe3.0), the mini-batch size can be as large as 128, and we achieved a significantly better speedup. The results indicate that larger memory supports larger mini-batch sizes, which reduce the C2C ratio and make the model easier to parallelize.
It is well known that the PS architecture with a single PS cannot scale well, so in our evaluation of the PS architecture we use the same number of PSes as worker servers. With this setting, the PS architecture consumes more network switch ports and more network bandwidth than A2A.
Regarding BSP-PS and BSP-A2A without pipelining, BSP-A2A outperforms BSP-PS in most cases. However, when exploiting the WFBP algorithm to pipeline communications with computations, WFBP-PS achieves much higher performance than WFBP-A2A, especially on 10GbE and 100GbE, which have higher latency than 100GbIB, or on 32 workers. The all-to-all architecture has a latency term that is logarithmic/linear in the number of workers with tree/ring-based algorithms, and WFBP requires the gradients to be aggregated tensor by tensor, which incurs many startup overheads. The tensor fusion technique can well address this startup problem. As we can observe from Fig. 4, MG-WFBP is on par with or better than WFBP-PS. In summary, PS is less latency-sensitive than all-to-all, so the PS architecture, given the availability of extra PSes, generally achieves higher performance than all-to-all on 10GbE and 100GbE, while all-to-all scales better than PS on 100GbIB with RDMA, whose latency is much lower than TCP/IP's.
With the WFBP algorithm, WFBP-PS and WFBP-A2A both run faster than BSP-PS and BSP-A2A, respectively, but WFBP-A2A may suffer from the startup problem because of small-tensor communications. MG-WFBP significantly improves the scalability of WFBP-A2A, especially on the high-latency 10GbE and 100GbE networks. ByteScheduler-A2A schedules communications in the opposite direction from MG-WFBP by partitioning tensors instead of merging them; since the all-to-all architecture is latency-sensitive, ByteScheduler-A2A does not achieve any speedups. However, with the PS architecture, ByteScheduler-PS achieves an obvious improvement over WFBP-PS on 100GbE with TCP/IP, as shown in Fig. 4(b), which indicates that, when partitioning tensors does not introduce heavy extra latency, communications of partitioned tensors can be well scheduled to overlap with both backpropagation and feed-forward computations. In summary, scheduling algorithms can significantly improve scalability; tensor fusion is more suitable for the all-to-all architecture, while tensor partitioning is more suitable for PS.
Network performance is one of the main factors affecting communication efficiency. By improving the raw communication speed from 10Gbps (Fig. 4(a)(d)(g)) to 100Gbps (Fig. 4(b)(e)(h)), the training throughput is improved substantially in most cases. The network protocol is another important factor. Comparing 100GbIB using RDMA (Fig. 4(c)(f)(i)) with 100GbE using TCP/IP (Fig. 4(b)(e)(h)), we can see that 100GbIB further boosts system scalability. For example, on ResNet-50 with 100GbIB, WFBP-PS and WFBP-A2A achieve a clear improvement over 100GbE; on BERT-Base and BERT-Large, MG-WFBP is nearly twice as fast on 100GbIB as on 100GbE. However, on BERT-Base and BERT-Large, with their relatively high C2C ratios, the scalability of all algorithms remains very low even with the low-latency, high-bandwidth RDMA network, as shown in Fig. 4(f)(i), mainly due to the slow PCIe3.0 within a node. It is also interesting to observe that no single algorithm works best in all scenarios. We argue that different optimization strategies should be applied for different communication infrastructures with different network bandwidths and communication latencies.
Even though various optimization techniques have been proposed to address the communication problem of distributed DL, many critical and challenging issues remain open in this research area.
The synchronization of multiple workers aims to keep the model consistent and guarantee the convergence of training. SSP and local SGD, which provide an intermediate consistency level between BSP and ASP, are the two main techniques for balancing synchronization frequency against convergence speed. However, the synchronization frequency cannot yet automatically adapt to the training configuration (such as the deep model, hyper-parameter settings, and communication speed of the cluster) to achieve the fastest convergence. For example, it has been reported that a lower synchronization frequency in local SGD can sacrifice test accuracy. On the other hand, a lower synchronization frequency means the algorithm runs faster with less synchronization overhead, but it may take more iterations to reach the target model accuracy. It remains a challenge to set the synchronization frequency so as to achieve the target accuracy within a shorter training time budget. One possible direction is to dynamically decrease the synchronization frequency from the highest to lower values during training according to the training or validation loss.
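The local SGD scheme described above can be illustrated with a toy sketch: each worker takes H local gradient steps on its own quadratic loss, then all workers average their models. Everything here (the scalar model, the loss, the function name) is an illustrative assumption, not a real training setup; the point is only that H controls how often the one communication step occurs.

```python
# Toy local SGD: each worker minimizes its own quadratic f_i(w) = (w - t_i)^2
# with H local steps, then all workers average their models (the single
# synchronization point). Illustrative sketch, not a real training loop.

def local_sgd(targets, h, rounds, lr=0.1):
    workers = [0.0 for _ in targets]              # one scalar model per worker
    for _ in range(rounds):
        for _ in range(h):                        # H local steps, no communication
            workers = [w - lr * 2 * (w - t) for w, t in zip(workers, targets)]
        avg = sum(workers) / len(workers)         # one model-averaging step
        workers = [avg] * len(workers)
    return workers[0]

# Two workers with different local minima (1.0 and 3.0); the periodically
# averaged model converges to the global minimum at the midpoint 2.0.
w = local_sgd([1.0, 3.0], h=4, rounds=50)
print(round(w, 3))  # → 2.0
```

Raising `h` cuts the number of averaging (communication) rounds per local step, which is exactly the frequency-versus-accuracy trade-off discussed above.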
Relaxed synchronization and communication compression can both reduce the C2C ratio, but they are separate optimization algorithms, each incurring lower communication traffic than BSP. It is still unknown whether communication compression techniques are applicable to SSP, local SGD, and ASP. Obviously, if model/gradient compression could be used in relaxed synchronization methods, the C2C ratio could be further reduced. Yet it is non-trivial to guarantee convergence and preserve the convergence rate when combining the two techniques. The key insight behind communication compression in distributed DL is that only a small proportion of “significant” gradients contribute to the model update. Theoretically, it should be possible to compress the communicated data in SSP and local SGD if the communicated gradients follow distributions similar to those under BSP.
Gradient compression techniques, including quantization and sparsification, can reduce the communication size (and thus the C2C ratio). However, with a very high compression ratio, training generally requires more iterations to achieve the target optimization error. Similar to the problem of choosing the synchronization frequency, it is challenging to balance the compression ratio against the convergence speed. One should set the compression ratio for a specific cluster such that the training budget is minimized. One possible direction is to measure the layer-wise C2C ratios of deep models dynamically, and then set different compression ratios for different layers so that the system throughput is maximized under the current training environment.
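A common sparsification scheme in this family is Top-k with local error feedback: only the k largest-magnitude entries are transmitted, and the untransmitted residual is added back to the next iteration's gradient so that no information is permanently lost. The following is a minimal pure-Python sketch (real systems operate on GPU tensors and communicate index-value pairs); the helper name is illustrative.

```python
# Sketch of Top-k gradient sparsification with error feedback: the
# residual of unsent entries is accumulated locally and re-added to the
# next gradient. Illustrative; real systems send (index, value) pairs.

def topk_sparsify(grad, residual, k):
    """Keep the k largest-magnitude entries of grad + residual;
    return the sparse update and the new local residual."""
    acc = [g + r for g, r in zip(grad, residual)]
    idx = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    keep = set(idx)
    sparse = [v if i in keep else 0.0 for i, v in enumerate(acc)]
    new_residual = [0.0 if i in keep else v for i, v in enumerate(acc)]
    return sparse, new_residual

g = [0.1, -2.0, 0.3, 1.5, -0.05]
sparse, res = topk_sparsify(g, [0.0] * 5, k=2)
print(sparse)  # only the two largest-magnitude entries are communicated
print(res)     # the rest stays local and is re-added next iteration
```

With k = 2 out of 5 entries, the communicated volume shrinks by 60% at the cost of the extra selection computation, which is the compression-ratio trade-off the paragraph above describes.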
According to the characteristics of distributed DL, various scheduling algorithms have been proposed to maximize the overlap between computing tasks and communication tasks. However, these algorithms were built upon the directed acyclic graph (DAG) of BSP with three types of tasks (i.e., feed-forward computing tasks, backpropagation computing tasks, and gradient communication tasks). Scheduling is very useful when the communication time is comparable to the computing time, but it brings only marginal improvement if the communication time far exceeds the computing time. Although communication compression can reduce the communication cost to a level comparable to the computing time, current scheduling methods are not directly applicable to BSP with gradient compression (e.g., TopK-SGD), because gradient compression introduces extra non-negligible computational cost alongside smaller communication traffic, which makes scheduling more difficult. One possible solution is to design a generic scheduler for configurable DAGs; note that the DAG changes when tensors are partitioned or merged. For a configured DAG, the scheduler can use heuristic algorithms to search for higher-performance configurations. Furthermore, current scheduling techniques such as MG-WFBP and ByteScheduler take two opposite directions (i.e., tensor fusion and tensor partitioning), and in practice neither is always optimal. A smarter scheduler could adapt to the training environment and dynamically determine whether tensors should be merged or partitioned to achieve higher performance.
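The tensor-partitioning direction can be sketched with a toy priority scheduler: each layer's gradient is split into fixed-size chunks, and chunks for layers that are needed earliest by the next feed-forward pass (those nearest the input) are transmitted first. This is a hypothetical simplification of ByteScheduler-style scheduling, assuming a simple static priority per layer; names and the chunking policy are illustrative.

```python
import heapq

# Sketch of priority scheduling over partitioned tensors: split each
# layer's gradient into chunks and transmit higher-priority chunks first
# (lower priority value = sent earlier). Illustrative, not a real API.

def partition(tensor_len, chunk):
    """Split the index range [0, tensor_len) into chunk-sized sub-ranges."""
    return [(s, min(s + chunk, tensor_len)) for s in range(0, tensor_len, chunk)]

def schedule(layers, chunk):
    """layers: list of (name, length) ordered by priority (input layer first).
    Returns (priority, name, range) tuples in transmission order."""
    heap = []
    for prio, (name, length) in enumerate(layers):
        for rng in partition(length, chunk):
            heapq.heappush(heap, (prio, name, rng))
    return [heapq.heappop(heap) for _ in range(len(heap))]

# layer1 (nearest the input) preempts layer2, and each tensor is sent
# as small chunks so preemption can happen mid-tensor.
order = schedule([("layer1", 5), ("layer2", 3)], chunk=2)
print(order)
```

Because chunks rather than whole tensors are the unit of scheduling, an urgent small tensor never has to wait behind one huge transfer, which is the benefit partitioning buys at the cost of extra per-chunk latency.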
The PS and A2A architectures are widely deployed for the BSP algorithm in both industry and academia. Intuitively, the A2A architecture is more efficient than PS because it needs no central server. However, one can use multiple PSes to relieve the bandwidth pressure on any single server. One interesting problem is how to configure the number of PSes for a given cluster such that the performance is comparable with, or higher than, the A2A counterpart.
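A back-of-the-envelope alpha-beta cost model illustrates why the number of PSes matters. Under the usual textbook estimates (these are standard analytical forms, not measurements from the experiments above), ring all-reduce takes 2(p-1) steps of n/p elements each, while a PS setup sharded over s servers bottlenecks on a server exchanging p·(n/s) elements in each direction.

```python
# Alpha-beta cost model: alpha = per-message latency (s), beta = per-element
# transfer time (s/element). Standard analytical estimates, not measurements.

def ring_allreduce_time(p, n, alpha, beta):
    """Ring all-reduce on p workers: 2(p-1) steps of n/p elements each."""
    return 2 * (p - 1) * (alpha + (n / p) * beta)

def sharded_ps_time(p, s, n, alpha, beta):
    """PS sharded over s servers: the bottleneck server pushes and pulls
    n/s elements to/from each of the p workers."""
    return 2 * (alpha + p * (n / s) * beta)

# With s = p servers, the PS bandwidth term matches ring all-reduce's,
# illustrating how adding PSes can close the gap with A2A.
p, n, alpha, beta = 32, 25_000_000, 1e-5, 1e-9  # ~ResNet-50-sized gradient
print(ring_allreduce_time(p, n, alpha, beta))
print(sharded_ps_time(p, p, n, alpha, beta))
```

Sweeping `s` in this model gives a first-order answer to how many PSes a given cluster needs before the central-server bottleneck disappears.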
The torus network topology has been successfully deployed in Google's TPU clusters, where it alleviates the long-latency issue of ring-based all-reduce collectives. Nvidia, on the other hand, integrated the double binary trees algorithm into its high-performance communication library NCCL, which achieves full bandwidth with logarithmic latency (much lower than rings). However, the existing binary trees algorithm cannot fully utilize the link bandwidth of the torus topology. In a torus, each node has four directly connected neighbors, which would seem well matched to the two binary trees, in which each node has in-degree two and out-degree two. It would be interesting to design new all-reduce algorithms for the torus topology, or new network topologies that better support distributed DL.
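The latency gap between the two schemes is easy to quantify by counting sequential communication steps (the alpha term only). The constants below follow the usual textbook estimates of ~2(p-1) steps for a ring and ~2·log2(p) for a reduce-then-broadcast over double binary trees; they are analytical approximations, not NCCL's exact internals.

```python
import math

# Count sequential communication steps (latency term only) for a ring
# all-reduce versus double binary trees on p nodes. Textbook estimates.

def ring_steps(p):
    """Ring all-reduce: (p-1) reduce-scatter + (p-1) all-gather steps."""
    return 2 * (p - 1)

def double_tree_steps(p):
    """Double binary trees: ~reduce down one tree, broadcast up the other,
    each taking about ceil(log2 p) sequential steps."""
    return 2 * math.ceil(math.log2(p))

for p in (8, 64, 1024):
    print(p, ring_steps(p), double_tree_steps(p))
```

At 1024 nodes the ring needs about 2046 sequential steps versus roughly 20 for the trees, which is why the logarithmic-latency algorithms matter at scale.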
In this article, we gave an overview of the techniques to address the communication challenges in distributed deep learning. We first analyzed the communication problems in distributed training of deep learning models, and then presented a taxonomy and survey of the existing state-of-the-art technologies, including optimization algorithms, system architectures, and communication infrastructures. We also compared the performance of seven representative distributed training methods by real-world experiments on different network infrastructures. Finally, we discussed the challenges and possible future research directions in this area.
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “S-Caffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters,” in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017, pp. 193–205.
International Conference on Machine Learning, 2018, pp. 560–569.