1. Distributed DNN training is communication bound
The goal of this work is to accelerate distributed DNN training in cloud environments. We focus on data parallelism, where workers process different samples and share the same model. A training iteration in this paradigm has two main components: computation-heavy forward and backward passes, and a communication-heavy model update step. As DNN models grow larger and faster accelerators emerge, the performance bottleneck of distributed DNN training has shifted from computation to communication.
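The structure of one data-parallel iteration can be sketched in a few lines of plain NumPy (this is an illustrative toy, not any framework's API; the worker callables are hypothetical):

```python
import numpy as np

def train_iteration(workers, model, lr=0.01):
    """One data-parallel iteration: every worker computes a gradient on its
    own mini-batch, gradients are averaged across workers, and each replica
    applies the same update, keeping all copies of the model in sync."""
    grads = [compute_grad(model) for compute_grad in workers]  # compute-heavy
    avg_grad = np.mean(grads, axis=0)                          # communication-heavy
    return model - lr * avg_grad                               # model update

# Toy run: two "workers" whose gradients are fixed vectors.
workers = [lambda m: np.ones(4), lambda m: 3 * np.ones(4)]
model = train_iteration(workers, np.zeros(4))
# average gradient is [2, 2, 2, 2], so the model becomes [-0.02] * 4
```

The averaging step is where all inter-worker communication happens, which is why it dominates once the forward and backward passes become fast.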
Larger DNN models require more gradient communication per iteration. GPU throughput on ResNet, a recent DNN, has increased by 35x across recent GPU generations (Figure 0(a)), which effectively demands a similar increase in network bandwidth at a fixed batch size. However, network bandwidth in compute instances on major cloud providers such as EC2 has not improved across generational upgrades (ec2, [n. d.]). Further, existing parameter exchange mechanisms struggle to scale total throughput on a standard cloud network stack (Table 1). The compound effect of these factors dramatically increases communication overhead during DDNN training.
Figure 0(b) summarizes the throughput of modest-scale DNN training with 8 workers and 8 colocated PSs on EC2 with 10Gbps links and a per-GPU batch size of 4 (maximizing GPU memory usage on the GRID K520): although modern DNN training frameworks can overlap backward passes with model updates, they can no longer hide the latency of communication behind the now-faster computation. One workaround is to increase the per-GPU batch size, but with a fixed number of GPUs this raises the global batch size, and large global batch sizes hurt statistical efficiency (Goyal et al., 2017; IBM, [n. d.]; Keskar et al., 2016); GPUs also have limited memory. Techniques such as (Chen et al., 2016) can alleviate the memory pressure, but at a higher computational cost.
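A back-of-envelope calculation shows why the communication phase is hard to hide. Assuming a ResNet-50-sized model (roughly 25.6M parameters) with fp32 gradients on a 10Gbps cloud link:

```python
# Rough communication model; the parameter count is the commonly cited
# figure for ResNet-50, the link speed matches the EC2 setting above.
PARAMS = 25.6e6          # ~25.6M parameters
BYTES_PER_PARAM = 4      # fp32 gradients
LINK_GBPS = 10

grad_bytes = PARAMS * BYTES_PER_PARAM               # ~102 MB per worker
push_time_s = grad_bytes * 8 / (LINK_GBPS * 1e9)    # one direction, one worker

print(f"{grad_bytes / 1e6:.0f} MB of gradients, "
      f"{push_time_s * 1e3:.0f} ms just to push them at 10 Gbps")
```

Roughly 80ms per direction per iteration, before any aggregation or pull traffic, which is comparable to or larger than the backward pass on a fast GPU.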
Communication overhead will likely worsen as the gap between computation and communication capability widens. New accelerators continue to reduce computation time, but networks are not getting faster at the same rate. Over the last 5 years, 100 Gbps networks have become available, but they pose high cost and have limited deployment.
These observations suggest that DDNN training has shifted from a compute-bound problem to one that also has a significant network-bound component. It is critical to perform model updates efficiently.
2. PHub: optimized parameter server
[Table 1: training throughput of each framework, run locally and with 2, 4, and 8 workers; data rows not recovered]
Model updates are usually performed in a parameter server (PS), a key-value store for the current model (Smola and Narayanamurthy, 2010; Li et al., 2014a; Li et al., 2014b; Zhang and Ré, 2014). We base our work on MXNet, a widely used, state-of-the-art DDNN training framework that is known to be fast (Table 1, (Cha, [n. d.]; Shi et al., 2016)) and supports native distributed training. Our profiling of MXNet reveals two problems: (1) insufficient network bandwidth (more so with a colocated PS, where the PS process and the worker training process reside in the same machine, than with non-colocated servers) and (2) an inefficient PS software stack. We found that data copy, the aggregator, and the optimizer are the main bottlenecks in the model update process: they prevent the PS from scaling to higher throughput on high-bandwidth networks.
We propose PHub, a high performance PS design for DDNN training. We briefly summarize the main optimizations in PHub.
Network Stack: Optimized InfiniBand support for lower network overhead, with one-shot registration, zero copy, and minimized metadata, so that virtually all bandwidth is dedicated to gradient payload.
Aggregation and Optimization: Fine-grained key chunking (32KB) to maximize overlap of gradient processing with network transfer and to balance load across processor cores, plus a locality-preserving, vectorized implementation of the aggregator and optimizer.
Gradient Memory Layout: A NUMA-aware, balanced scheme for assigning each key chunk to a processor core, via a series of load-balanced, locality-preserving assignments of queue pairs, interfaces, and completion queues to cores (Figure 1(a)). PHub incurs zero synchronization between cores or between NUMA domains.
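As a rough illustration of these two ideas together (the 32KB chunk size and the 5-interfaces-per-domain figure come from the text; the core counts and the modulo assignment rule are hypothetical):

```python
import numpy as np

CHUNK_BYTES = 32 * 1024   # PHub's fine-grained chunk size
NUMA_DOMAINS = 2          # hypothetical machine layout
CORES_PER_DOMAIN = 8
IFACES_PER_DOMAIN = 5     # as in the PBox prototype

def chunk_key(grad: np.ndarray, chunk_bytes: int = CHUNK_BYTES):
    """Split one gradient tensor (a PS key) into fixed-size chunks so that
    aggregating early chunks overlaps with transferring later ones."""
    flat = grad.ravel()
    elems = chunk_bytes // flat.itemsize
    return [flat[i:i + elems] for i in range(0, flat.size, elems)]

def owner_core(chunk_id: int, iface_id: int) -> int:
    """Statically assign each chunk to one core in the NUMA domain of the
    interface it arrives on; a single owner per chunk means no locking and
    no cross-domain memory traffic."""
    domain = iface_id // IFACES_PER_DOMAIN
    return domain * CORES_PER_DOMAIN + chunk_id % CORES_PER_DOMAIN

g = np.zeros(100_000, dtype=np.float32)   # a 400 KB gradient key
chunks = chunk_key(g)                     # 8192 fp32 elements per chunk
owners = {owner_core(c, 7) for c in range(len(chunks))}
# chunks arriving on interface 7 (second NUMA domain) land on cores 8-15 only
```

Because ownership is a pure function of chunk and interface IDs, every core can aggregate its chunks independently, which is what makes the zero-synchronization property possible.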
These software optimizations benefit both centralized and sharded PS configurations (a sharded PS runs multiple PS processes, each in charge of a partition of keys). However, to scale up a central PS, software alone is not sufficient: the hardware in a typical server is unbalanced, with significantly more computing resources than network resources, and a single interface/connection in the server must typically handle traffic for all participating workers. We propose PBox, a new server architecture that balances IO, memory, and network bandwidth. Our prototype PBox is built from a server with ten 56Gbps network interfaces, 5 per NUMA domain (Figure 1(b)). PBox takes full advantage of the PHub software and essentially forms micro-shards inside a single box. We integrated PHub with MXNet; Figure 3 shows the speedup of PHub over an MXNet colocated sharded baseline when training ImageNet-winning networks with 8 workers. PHub on PBox provides up to 3.8x speedup over the state of the art on a cloud-like 10Gbps network (all interfaces negotiated a speed of 10Gbps with the switch in this experiment). The speedup with 56Gbps links is similar, ranging from 2x to 7x depending on the DNN being trained.
PHub shows linear scaling on our 8-worker cluster across all workloads, and provides higher throughput than other parameter exchange patterns using MPI or collectives, because PHub uses only one round of communication and minimal total data transfer per iteration. To understand its limits, we used a special ZeroComputeEngine that simulates infinitely fast computation in MXNet, performing only parameter exchange operations. We found that PBox performance is limited only by the bandwidth between the PCIe controller and the memory system (Figure 5). This limit is hard to hit in real training: we estimate that a single PBox can support up to 120 workers training ResNet-50 with a batch size of 32 per GPU. If each worker includes 4 GPUs, that translates to a global batch size of 15K, surpassing the maximum suggested in (Goyal et al., 2017). Recent work suggests such large batch sizes may impede training (Goyal et al., 2017; IBM, [n. d.]; Keskar et al., 2016; LeCun et al., [n. d.]); if higher scalability is desired, sharding or new platforms with more PCIe throughput (e.g., (AMD, [n. d.])) would let PHub provide higher throughput.
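The global batch size quoted above follows directly from the capacity estimate, using only figures stated in the text:

```python
workers = 120          # estimated capacity of one PBox for ResNet-50
gpus_per_worker = 4
batch_per_gpu = 32

global_batch = workers * gpus_per_worker * batch_per_gpu
print(global_batch)    # 15360, i.e. the ~15K global batch size quoted above
```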
3. In-network Aggregation
The PBox results show the benefit of a high-bandwidth centralized PS. Recent programmable switches (Liu et al., 2017; Li et al., 2016b; Li et al., 2016a) enable a new way to build centralized PS designs by allowing gradient aggregation to be performed in the network. Figure 5 shows how PHub running in a top-of-rack (ToR) switch can exploit the full bisection bandwidth inside a server rack to aggregate gradients centrally within the rack; only a single aggregated stream must then be sent to higher-level switches for further aggregation across racks. This hybrid synchronization reduces bandwidth usage and localizes data movement in the data center.
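The hierarchical scheme can be sketched as follows (a functional toy with NumPy standing in for switch hardware; the two-level layout matches the text, the rack sizes are arbitrary):

```python
import numpy as np

def rack_aggregate(worker_grads):
    """In-network step: the ToR switch sums the gradients of the workers
    in its rack, so only one gradient-sized stream leaves the rack."""
    return np.sum(worker_grads, axis=0)

def hierarchical_aggregate(racks):
    """Cross-rack step: aggregate the per-rack partial sums at a higher
    level. `racks` is a list of lists of per-worker gradient arrays."""
    partials = [rack_aggregate(r) for r in racks]
    return np.sum(partials, axis=0)

# Two racks of two workers each; traffic to the core switch is one message
# per rack instead of one per worker.
racks = [[np.ones(3), np.ones(3)],
         [np.full(3, 2.0), np.full(3, 2.0)]]
total = hierarchical_aggregate(racks)   # -> [6., 6., 6.]
```

With k workers per rack, cross-rack traffic shrinks by a factor of k, which is the bandwidth saving the hybrid synchronization exploits.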
Current switches have limited computational capabilities: most can perform only integer operations, offer little on-switch storage, and can operate on only a small region of each packet. Our emulation-based experiments show that these limits lead to unsatisfactory throughput on current switches; our future work includes exploring the hardware requirements for efficient in-network DDNN training.
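One standard way to work around integer-only switch arithmetic is fixed-point quantization of gradients at the end hosts; the sketch below illustrates the idea (the scale factor and the 32-bit payload width are hypothetical choices, not something the text prescribes):

```python
import numpy as np

SCALE = 1 << 16  # hypothetical fixed-point scale factor

def to_fixed(grad: np.ndarray) -> np.ndarray:
    """Quantize fp32 gradients to integers before they enter the switch,
    since programmable switches can typically only add integers."""
    return np.round(grad * SCALE).astype(np.int64)

def from_fixed(acc: np.ndarray) -> np.ndarray:
    """Rescale the switch-aggregated integer sum back to floating point."""
    return acc.astype(np.float64) / SCALE

# The switch sums the integer payloads; the end host rescales the result.
g1 = np.array([0.5, -0.25])
g2 = np.array([0.25, 0.75])
summed = from_fixed(to_fixed(g1) + to_fixed(g2))   # ≈ [0.75, 0.5]
```

The scale factor trades precision against overflow headroom in the switch's accumulator, which is one of the hardware constraints the future work above would need to quantify.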
- AMD ([n. d.]) [n. d.]. AMD EPYC. http://www.amd.com/en/products/epyc. ([n. d.]).
- ec2 ([n. d.]) [n. d.]. EC2Instances.info Easy Amazon EC2 Instance Comparison. https://www.ec2instances.info/?region=us-west-2. ([n. d.]).
- IBM ([n. d.]) [n. d.]. IBM Research achieves record deep learning performance with new software technology. https://www.ibm.com/blogs/research/2017/08/distributed-deep-learning/. ([n. d.]).
- Cha ([n. d.]) [n. d.]. Performance of Distributed Deep Learning using ChainerMN. https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html. ([n. d.]).
- Chen et al. (2016) Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
- Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
- Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).
- LeCun et al. ([n. d.]) Y LeCun, L Bottou, and G Orr. [n. d.]. Efficient BackProp in Neural Networks: Tricks of the Trade (Orr, G. and Müller, K., eds.). Lecture Notes in Computer Science 1524 ([n. d.]).
- Li et al. (2016a) Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016a. Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, Berkeley, CA, USA, 467–483. http://dl.acm.org/citation.cfm?id=3026877.3026914
- Li et al. (2014a) Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014a. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, USA, 583–598. http://dl.acm.org/citation.cfm?id=2685048.2685095
- Li et al. (2014b) Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. 2014b. Communication Efficient Distributed Machine Learning with the Parameter Server. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). MIT Press, Cambridge, MA, USA, 19–27. http://dl.acm.org/citation.cfm?id=2968826.2968829
- Li et al. (2016b) Xiaozhou Li, Raghav Sethi, Michael Kaminsky, David G. Andersen, and Michael J. Freedman. 2016b. Be Fast, Cheap and in Control with SwitchKV. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, Santa Clara, CA, 31–44. https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/li-xiaozhou
- Liu et al. (2017) Ming Liu, Liang Luo, Jacob Nelson, Luis Ceze, Arvind Krishnamurthy, and Kishore Atreya. 2017. IncBricks: Toward In-Network Computation with an In-Network Cache. SIGOPS Oper. Syst. Rev. 51, 2 (April 2017), 795–809. https://doi.org/10.1145/3093315.3037731
- Shi et al. (2016) Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking State-of-the-Art Deep Learning Software Tools. (2016). arXiv:arXiv:1608.07249
- Smola and Narayanamurthy (2010) Alexander Smola and Shravan Narayanamurthy. 2010. An Architecture for Parallel Topic Models. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 703–710. https://doi.org/10.14778/1920841.1920931
- Zhang and Ré (2014) Ce Zhang and Christopher Ré. 2014. DimmWitted: A Study of Main-memory Statistical Analytics. Proc. VLDB Endow. 7, 12 (Aug. 2014), 1283–1294. https://doi.org/10.14778/2732977.2733001