Deep learning has been shown to achieve far better predictive performance than earlier methods in image recognition, natural language processing, speech recognition, and many other fields where machine learning is applied. The neural network technology underlying deep learning has a long history dating back to the 1950s. Its breakthrough as “deep learning” in the 2010s is attributed to the successful combination of three advances: algorithms, large-scale data, and high computing power. Even today, it is difficult to achieve outstanding predictive performance with deep learning if any one of the three is lacking. In this article, we focus on one of these three pillars: computing performance.
Using highly efficient GPUs for training has become the standard approach in many deep learning tasks. Nevertheless, training is still time-consuming even with the latest GPUs because models have also grown massive and complex. For example, training ResNet-50 He2016 on the ImageNet dataset imagenet2009 typically takes as long as one week with a single GPU. Long training times limit the number of trial-and-error runs over models and parameters that are needed to achieve high accuracy, making it difficult to obtain good predictive performance. They also limit the amount of data that can be used. Thus, using multiple GPUs in parallel is crucial for accelerating computation.
We introduce ChainerMN, an add-on package that provides a distributed learning capability to Chainer Tokui2015, a programming framework for deep learning applications written in Python. In developing ChainerMN, we took the following features into consideration:
Flexibility: Chainer is a flexible framework based on its Define-by-Run approach, and ChainerMN is designed not to compromise that flexibility. This allows for easy distributed learning even in complex use cases such as dynamic neural networks, generative adversarial networks, and deep reinforcement learning.
High performance: From the very beginning of the design of ChainerMN, we selected technologies with practical deep learning workloads in mind, and we took care in the implementation so that hardware performance is fully utilized.
The rest of the paper is organized as follows. First, we explain the basic elements of distributed deep learning, followed by the design and implementation of ChainerMN. Finally, we present the results of our evaluation experiments and discuss related work.
2.1 Basics of Deep Learning
We can express the prediction of a neural network for input data $x$ as $f(x; \theta)$, where $\theta$ denotes the parameters of the network. Learning in neural networks using backpropagation and stochastic gradient descent or its variants is an iterative algorithm. Each iteration is composed of the following three steps: forward computation, backward computation, and optimization.
In the forward-computation step, the prediction $f(x; \theta)$ is first calculated for an input data point $x$. Then, the loss $E$ is calculated to represent the difference from the correct output for $x$. Cross entropy or other measures may be used here.
In the backward-computation step, the gradient $g = \partial E / \partial \theta$ of the loss $E$ with respect to the parameter $\theta$ is calculated; moving $\theta$ against this gradient decreases the loss. Gradients for all parameters are calculated using the chain rule while going backward from the output layer to the input layer.
In the optimization step, the parameter $\theta$ is updated using the gradient $g$. The simplest rule is to update $\theta$ to $\theta - \mu g$, where $\mu$ is a parameter called the learning rate.
In practice, instead of using a single training example per iteration, the forward and backward computations are performed simultaneously on multiple training examples, and optimization is executed using the average of the gradients over all of them. The set of input examples used in an iteration is called a minibatch, and its size is called the batch size. A typical batch size ranges from several tens to several hundred.
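The three steps of one iteration can be sketched in plain Python. This is a minimal illustration only, assuming a one-parameter linear model with a squared-error loss; the function names, the toy data, and the hyperparameters are all hypothetical and are not taken from Chainer.

```python
import random

def forward(theta, batch):
    """Forward step: predictions and mean squared-error loss over a minibatch."""
    preds = [theta * x for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    return preds, loss

def backward(theta, batch, preds):
    """Backward step: gradient of the mean loss with respect to theta."""
    return sum(2.0 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)

def optimize(theta, grad, lr):
    """Optimization step: move theta against the gradient, scaled by the learning rate."""
    return theta - lr * grad

def train(data, theta=0.0, lr=0.1, batch_size=4, iterations=200, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)   # draw one minibatch
        preds, _ = forward(theta, batch)
        grad = backward(theta, batch, preds)   # averaged over the minibatch
        theta = optimize(theta, grad, lr)
    return theta

# Toy data generated from y = 3x, so training should drive theta toward 3.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]
theta = train(data)
```

Because the gradient is averaged over the minibatch, changing `batch_size` changes how noisy each update is but not the form of the update rule.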
Note that the above description assumes standard supervised learning. Nonetheless, when neural networks are applied to other settings such as unsupervised and semi-supervised learning, the parallelization method explained below is still applicable and ChainerMN remains usable.
2.2 Data Parallel and Model Parallel approaches
There are two major approaches to parallelizing training by distributed processing: data parallel and model parallel. In the data-parallel approach, each worker has a replica of the model and calculates gradients on a different minibatch. Workers use these gradients to update the model collaboratively. In the model-parallel approach, each worker has a portion of the model and works in cooperation with the others to do the calculation for one minibatch. Figure 1 shows the difference between the two approaches.
The model-parallel approach was actively used in the days when GPU memory was small. At present, model parallelism is rarely used in its basic form because the data-parallel approach has become the common choice. Meanwhile, some issues with the data-parallel approach have surfaced, and research on new forms of model parallelism is underway. The two approaches can also be used at the same time.
2.3 Synchronous vs. Asynchronous
In this subsection, we focus on the data-parallel approach, which is commonly used now. The data-parallel approach is roughly divided into synchronous and asynchronous types; we explain the former first. Each iteration in synchronous, data-parallel deep learning is composed of the following four steps: forward computation, backward computation, Allreduce communication, and optimization. Figure 2 illustrates the four steps.
Compared with the regular iteration described earlier, this adds one step, Allreduce. In this step, workers communicate with each other to compute the average of the gradients calculated by the individual workers and distribute that average to all of them. All workers then update the model using the gradient obtained through this communication. If we define the batch size processed by each worker as $b$ and the number of workers as $n$, the gradient obtained through communication is equivalent to the gradient computed with a batch size of $bn$. This means gradients are calculated using more training data in one iteration, improving the gradient quality and accelerating the learning process.
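The equivalence between averaging per-worker gradients and computing one gradient over the combined batch can be checked with a small simulation. The sketch below is an in-process stand-in, not real communication: `allreduce_mean`, the toy loss, and the values chosen for the per-worker batch size `b` and the worker count `n` are assumptions made for illustration.

```python
def local_gradient(theta, batch):
    """Mean gradient of the squared error (theta * x - y)^2 over one minibatch."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

def allreduce_mean(values):
    """Stand-in for Allreduce: every worker receives the average of all inputs."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

theta = 0.5
b, n = 4, 3  # per-worker batch size and number of workers (illustrative values)
examples = [(float(i), 2.0 * i + 1.0) for i in range(b * n)]

# Each worker gets its own minibatch of b examples.
shards = [examples[w * b:(w + 1) * b] for w in range(n)]

per_worker = [local_gradient(theta, shard) for shard in shards]
reduced = allreduce_mean(per_worker)             # what each worker holds after Allreduce
large_batch = local_gradient(theta, examples)    # gradient over all b * n examples
```

The equality holds here because every worker processes the same number of examples; with unequal shard sizes a weighted average would be needed.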
The asynchronous type, on the other hand, uses special workers called parameter servers. A parameter server manages the model parameters. Normal workers send gradients to the parameter server once they have obtained them through forward and backward computations. The parameter server receives the gradients and uses them to update the model. Workers then receive the new model parameters and calculate gradients again.
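The pull, compute, and push cycle of the asynchronous type can be illustrated with a sequential simulation. This is only a sketch under strong assumptions: a single hypothetical `ParameterServer` class, two workers, and a toy one-parameter model. The staleness bookkeeping is added for illustration and is not something the text specifies.

```python
class ParameterServer:
    """Hypothetical parameter server holding one scalar parameter."""

    def __init__(self, theta, lr):
        self.theta = theta
        self.lr = lr
        self.version = 0  # number of updates applied so far

    def pull(self):
        """A worker fetches the current parameter and its version."""
        return self.theta, self.version

    def push(self, grad):
        """A worker sends a gradient; the server applies it immediately."""
        self.theta -= self.lr * grad
        self.version += 1

def gradient(theta, batch):
    """Mean gradient of the squared error (theta * x - y)^2 over a batch."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]  # toy data from y = 3x
shards = [data[0::2], data[1::2]]                           # one shard per worker
server = ParameterServer(theta=0.0, lr=0.05)

staleness = []
for _ in range(200):
    # Both workers pull the same parameters, then push one after the other,
    # so the second gradient was computed from parameters that are already outdated.
    pulled = [server.pull() for _ in shards]
    for (theta_w, version_w), shard in zip(pulled, shards):
        staleness.append(server.version - version_w)
        server.push(gradient(theta_w, shard))
```

Unlike the synchronous scheme, no worker waits for the others here, which is why some gradients are applied to parameters newer than the ones they were computed from.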
3 Design and Implementation
3.1 Parallelization Approaches
We discuss the design decisions of ChainerMN in this section. As discussed in Section 2, there are two major parallelization approaches and two synchronization approaches. We adopt the synchronous, data-parallel approach for ChainerMN.
We use the data-parallel approach because existing deep learning applications can be extended easily and because data parallelism was highly expected to speed up training. Data parallelization is tantamount to increasing the minibatch size in a typical deep learning application, and it has the advantage of being applicable without significant changes to the algorithms and code of existing programs.
Whether the synchronous or asynchronous type is preferable is a nontrivial question, since implementations have taken different strategies and results vary depending on tasks and settings. The paper Xinghao2017 reports experimental results showing that the asynchronous type is less stable with respect to convergence, whereas the synchronous type converges faster. In addition, with the synchronous type we can benefit from the optimized and proven collective communication mechanisms of MPI, the de facto standard communication library interface, whereas asynchronous implementations generally rely on a parameter server.
Chainer is a framework characterized by its Define-by-Run approach. Define-by-Run is a model in which learning models and computational flows are defined at runtime, taking advantage of the flexibility of scripting languages. In a Define-and-Run approach, on the other hand, the structure of the network is defined in advance, after which data is fed in and the calculation is performed. While potentially easier to optimize for performance, this approach is said to lack flexibility.
Thanks to its Define-by-Run approach, Chainer provides programming models that make it possible to define complex neural networks flexibly and to modify them at runtime. This lets researchers and engineers work on new or complex models through trial and error with ease, and it therefore suits research and development in machine learning. We carefully designed the ChainerMN API so that existing Chainer programs can be ported easily without limiting the flexibility of Chainer.
3.3 API Design
We describe the library interface of ChainerMN by walking through the minimal steps needed to extend an existing deep learning program written in Chainer to support distributed execution with ChainerMN.
Listing 1 shows a simplified ChainerMN program for a model that solves the MNIST classification problem mnist2009. For the complete program code, refer to ChainerMN’s code repository chainermn. There are three major steps: (1) add a communicator component, (2) create and use multi_node_optimizer, and (3) add code to distribute a dataset.
Modifying an application starts with adding a communication component called Communicator to an existing Chainer program. A communicator is a central component of ChainerMN; it is modeled on MPI’s communicator concept and controls all inter-process communication in a ChainerMN program.
multi_node_optimizer is the most important component in ChainerMN. It wraps Chainer’s normal optimizer and exchanges the gradients across processes using an Allreduce operation before optimizing the model. multi_node_optimizer behaves identically to the original optimizer except for the communication, so the extension is seamlessly integrated into Chainer’s existing Trainer ecosystem.
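The wrapping pattern described above can be sketched without any framework. The classes below are hypothetical and are not ChainerMN's implementation: `SGD`, `InProcessCommunicator`, and `MultiNodeOptimizerSketch` are illustrative names, and the communicator only simulates Allreduce inside one process by holding the gradients of the other workers.

```python
class SGD:
    """Plain single-process optimizer: param <- param - lr * grad."""

    def __init__(self, lr):
        self.lr = lr

    def update(self, params, grads):
        return [p - self.lr * g for p, g in zip(params, grads)]

class InProcessCommunicator:
    """Hypothetical stand-in for an MPI-style communicator.

    It stores the gradients of the other workers so that averaging can be
    simulated locally; a real communicator would exchange them between processes.
    """

    def __init__(self, peer_grads):
        self.peer_grads = peer_grads  # one gradient list per remote worker

    def allreduce_mean(self, local_grads):
        all_grads = [local_grads] + self.peer_grads
        return [sum(column) / len(all_grads) for column in zip(*all_grads)]

class MultiNodeOptimizerSketch:
    """Wraps an optimizer and averages gradients across workers before updating."""

    def __init__(self, actual_optimizer, communicator):
        self.actual_optimizer = actual_optimizer
        self.communicator = communicator

    def update(self, params, grads):
        averaged = self.communicator.allreduce_mean(grads)     # communication step
        return self.actual_optimizer.update(params, averaged)  # unchanged update rule

comm = InProcessCommunicator(peer_grads=[[3.0, 4.0]])
optimizer = MultiNodeOptimizerSketch(SGD(lr=0.5), comm)
new_params = optimizer.update([1.0, 2.0], [1.0, 0.0])
```

Since the wrapper exposes the same `update` call as the optimizer it wraps, code that drives the optimizer does not need to know whether communication happens.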
Beyond this, basic porting only requires adding a scattering step, which distributes the data for data-parallel computation. One needs to split the dataset into equal chunks and distribute them over the processes. This operation is also known as Scatter in MPI. Other parts, such as Iterator, Updater, and Evaluator, do not need to be changed in basic use cases. This API design allows various Chainer programs to be ported with minimal modifications while making the most of the advantages of Define-by-Run.
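The splitting part of the scattering step can be sketched as follows. The helper `split_dataset` is hypothetical and only shows how a dataset might be divided into one chunk per process; the handling of a remainder (chunk sizes differing by at most one) is an assumption, since the text only speaks of equal chunks, and the actual transfer between processes is left out.

```python
def split_dataset(dataset, n_workers):
    """Split a dataset into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(dataset), n_workers)
    chunks, start = [], 0
    for rank in range(n_workers):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks take one more item
        chunks.append(dataset[start:start + size])
        start += size
    return chunks

dataset = list(range(10))
chunks = split_dataset(dataset, n_workers=4)
# Worker `rank` would then train only on chunks[rank].
```

Every example ends up in exactly one chunk, so one pass over all chunks together corresponds to one epoch over the full dataset.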
3.4 Implementation and Performance Optimization
The communication pattern of synchronous, data-parallel deep learning applications is relatively simple from the point of view of HPC applications. Roughly speaking, the only major communication is Allreduce, a process that exchanges gradients, which are the results of the training and evaluation computations. Auxiliary parts include Scatter, which distributes the necessary data over the processes before training starts.
As mentioned above, one of the design goals of ChainerMN is to achieve high performance by leveraging existing and proven HPC technologies. Allreduce in particular needs to be fast because it is called in every training iteration and has to process a large amount of data. We attempt to minimize the communication time by using the NCCL library nccl developed by NVIDIA. NCCL is a highly optimized communication library that provides a fast Allreduce operation between NVIDIA GPUs within and across nodes.
4.1 Experimental Environment and Settings
We conducted our experiments on our in-house cluster, which consists of 32 computing nodes. Each node is equipped with two Intel Xeon CPUs (E5-2623 v3, 3.00 GHz, four cores each), 128 GB of main memory, and four GeForce GTX TITAN X GPUs. Thus, we used 128 GPUs in total. The nodes are interconnected by Mellanox InfiniBand FDR 4X. We used CUDA version 8, Python version 3.5, MVAPICH2 2.2, and Chainer version 1.2 running on Ubuntu 14.04 LTS.
To demonstrate the performance and scalability of ChainerMN, we use the ResNet-50 He2016 model and the ImageNet imagenet2009 dataset. Since the dataset is large and most accesses are reads, we copied the entire dataset to the local SSD of every computing node in advance.
We used a batch size of 32 per GPU, which amounts to 4096 for 128 GPUs. One of the factors that make distributed deep learning difficult is that improving throughput does not necessarily mean better learning efficiency. We note that a batch size of 4096 is a healthy setting in which the learning efficiency and the resulting model accuracy are maintained, as shown by Goyal et al. Goyal2017.
4.2 Scalability Result
Figure 3 shows the scalability of ChainerMN up to 128 GPUs. As the figure shows, ChainerMN scales well up to 128 GPUs. Table 4.2 shows the runtimes relative to one-GPU execution. According to this table, ChainerMN on 128 GPUs achieves 79% and 90% parallelization efficiency relative to the one-GPU and one-node (four GPUs) executions, respectively. This means that the parallelization efficiency of ChainerMN on 128 GPUs is as high as that of the state of the art Goyal2017.
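To make the efficiency figures concrete, the short calculation below converts them into speedups. It assumes the usual definition, in which parallelization efficiency is the measured speedup divided by the ratio of GPU counts; the text does not state the definition used, so the derived speedups are illustrative rather than reported results. The function name is hypothetical.

```python
def speedup_from_efficiency(efficiency, n_gpus, n_baseline_gpus=1):
    """Speedup over the baseline implied by efficiency = speedup / (n_gpus / n_baseline_gpus)."""
    return efficiency * n_gpus / n_baseline_gpus

# 79% efficiency relative to one GPU: about 101x faster than a single GPU.
speedup_vs_one_gpu = speedup_from_efficiency(0.79, 128)

# 90% efficiency relative to one node (four GPUs): about 28.8x faster than one node.
speedup_vs_one_node = speedup_from_efficiency(0.90, 128, n_baseline_gpus=4)

# Global batch size from the experimental settings: 32 examples per GPU on 128 GPUs.
global_batch = 32 * 128
```

The gap between the two efficiency figures reflects the choice of baseline: the one-node baseline already includes the cost of communication among four GPUs.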
We have described the design and implementation of ChainerMN and demonstrated its scalability. Chainer and ChainerMN are designed to offer both high flexibility and scalability, with the primary objective of accelerating research and development in deep learning. We will continue making improvements by tackling challenges such as model parallelism, overlapping communication and computation, asynchronous computation among workers, optimized communication through gradient compression, and fault tolerance.
The authors thank K. Ueno, T. Mitsuishi, and N. Yoshifuji for their help with the development of ChainerMN. We also thank T. Sudo, Y. Doi, G. Watanabe, R. Okuta, and M. Sakata for their help with the experiments. We are grateful to T. Miyato and S. Tokui for fruitful discussions as well.
-  ChainerMN. https://github.com/chainer/chainermn, 2017.
-  NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl, 2017.
-  TSUBAME e-Science Journal. http://www.gsic.titech.ac.jp/TSUBAME_ESJ, 11 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  Y. LeCun and C. Cortes. The MNIST database of handwritten digits.
-  X. Pan, J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. ICLR Workshop Track, 2016, 02 2017.
-  S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In LearningSys, 2015.