Recent progress in machine learning has enabled deep neural networks (DNNs) to advance the state of the art in a wide range of problem domains, from computer vision to high energy physics. As the applicability of DNNs has broadened, there have been efforts to develop user-friendly tools for building them. Software packages such as Keras and TFLearn facilitate the construction and training of deep neural networks, offering a flexible interface for combining common model components and configuring the optimization process.
Large model sizes and long training times have motivated the development of distributed training algorithms for DNNs. These algorithms work by splitting the training task across multiple concurrent processes, which can be threads on a single machine or jobs spread across the nodes of a cluster. The speed-up provided by distributed algorithms is relevant when fast training is critical, such as when iterating on model choice during development, or when retraining a model on new data in a production environment.
Despite the rise of convenient model-building software packages such as Keras, there are few tools for interfacing these packages with distributed training algorithms. In this paper we introduce a lightweight Python framework, mpi_learn, that provides a straightforward means of training Keras models in a distributed fashion. The framework is built on the Message Passing Interface (MPI) protocol and can operate on personal machines, multi-GPU servers, and large supercomputing sites alike.
II Related Work
The package described here was written during the summer of 2016 and was motivated by the need for a mechanism to parallelize the training of models that took several days to converge. It has been used in work for publications and conferences since early 2017. This package, built within the MPI framework, was developed concurrently with similar work on running distributed training of Keras models with Spark.
Since our experiments demonstrating the scaling of the algorithm, numerous articles have studied the scaling of distributed training of deep neural networks, both theoretically and experimentally, targeting different training frameworks, including TensorFlow.
The authors do not claim that their framework is better than any other framework. This package was written for practical reasons in the observed absence of other tools fulfilling the same purpose.
III Package Overview
Python packages. Support for the Theano and TensorFlow backends to Keras is provided.
III-A Training Algorithms
The package supports two main distributed training algorithms based on stochastic gradient descent (SGD). The default algorithm is Downpour SGD, in which worker processes compute gradients of a loss function and send updates to a master process. An alternate algorithm, Elastic Averaging SGD, is also available.
In Downpour SGD, one process is assigned to be the master and the others are assigned to be workers. The master and each worker have a copy of the model to be trained. Each worker has access to a subset of the training data.
During training, a worker reads one batch of training data and computes the gradient of the loss function on that batch. The worker sends the gradient to the master, which uses it to update its model weights. The master sends the updated model weights to the worker, which then repeats the process with the next batch of training data.
See Fig. 1 for an illustration of the Downpour SGD training procedure.
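The exchange between master and workers described above can be sketched as a single-process simulation. This is only an illustrative sketch, not the mpi_learn implementation: the `Master` and `Worker` classes, the toy linear model, and all parameter values are assumptions made for the example, and the MPI messages are replaced by ordinary method calls. Workers are picked in a random order to mimic asynchronous arrival of gradients, so each worker usually computes its gradient on weights that are a few updates old.

```python
import random


def gradient(weights, batch):
    """Gradient of the mean squared error for the toy model y = w * x + b."""
    w, b = weights
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


class Master:
    """Holds the central copy of the weights and applies incoming gradients."""

    def __init__(self, weights, learning_rate):
        self.weights = list(weights)
        self.learning_rate = learning_rate

    def apply_gradient(self, grad):
        # Plain SGD step; the updated weights are returned to the sender.
        self.weights = [w - self.learning_rate * g
                        for w, g in zip(self.weights, grad)]
        return list(self.weights)


class Worker:
    """Owns a shard of the data and a (possibly stale) copy of the weights."""

    def __init__(self, data, batch_size, weights):
        self.data = data
        self.batch_size = batch_size
        self.weights = list(weights)
        self.position = 0

    def next_batch(self):
        batch = self.data[self.position:self.position + self.batch_size]
        self.position = (self.position + self.batch_size) % len(self.data)
        return batch

    def step(self, master):
        # Compute a gradient on the local weights, send it to the master,
        # and replace the local weights with the master's updated copy.
        grad = gradient(self.weights, self.next_batch())
        self.weights = master.apply_gradient(grad)


def train(n_workers=4, n_steps=2000, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(400)]
    data = [(x, 3.0 * x + 1.0) for x in xs]  # noise-free target: w=3, b=1
    shards = [data[i::n_workers] for i in range(n_workers)]
    master = Master([0.0, 0.0], learning_rate=0.05)
    workers = [Worker(shard, 10, master.weights) for shard in shards]
    for _ in range(n_steps):
        # A randomly chosen worker reports next (simulated asynchrony).
        rng.choice(workers).step(master)
    return master.weights


if __name__ == "__main__":
    print(train())
```

In a real deployment each `step` would correspond to a pair of messages between two MPI processes rather than a method call.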
In the Elastic Averaging SGD algorithm, worker processes are connected to a master via an elastic force that periodically ‘pulls’ the weights closer to one another. Workers train independently and communicate with the master only via the elastic force, allowing each worker to explore a different region of the model parameter space.
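The elastic "pull" can be illustrated with the symmetric update from the Elastic Averaging SGD literature, in which a worker and the master each move toward the other by a fraction of their difference. The sketch below is an assumption-laden illustration rather than the mpi_learn code: the function name `elastic_exchange` and the parameter name `alpha` (the strength of the elastic force) are chosen for this example, and weights are represented as flat lists of floats.

```python
def elastic_exchange(worker_weights, master_weights, alpha):
    """Apply one elastic update between a worker and the master.

    Both sides move toward each other by alpha times their difference,
    computed from the weights held before the exchange.
    """
    diffs = [alpha * (w - m) for w, m in zip(worker_weights, master_weights)]
    new_worker = [w - d for w, d in zip(worker_weights, diffs)]
    new_master = [m + d for m, d in zip(master_weights, diffs)]
    return new_worker, new_master


if __name__ == "__main__":
    worker, master = elastic_exchange([1.0, 3.0], [0.0, 1.0], alpha=0.25)
    print(worker, master)
```

Between exchanges, a worker would take ordinary local SGD steps on its own data; a small `alpha` lets workers drift further from the master before being pulled back.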
Training proceeds asynchronously by default, with worker processes exchanging weight information with the master one by one. Synchronous training is also available; in this case the master processes weights from all workers simultaneously. In addition to the canonical training configuration with one master process and several workers, the mpi_learn framework also supports a hierarchical configuration in which there are several master processes, each coordinating a group of workers and reporting to a higher-level master.
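One way to picture the hierarchical configuration is as a mapping from process ranks to roles. The layout below is purely hypothetical and is not taken from the mpi_learn source: it assumes rank 0 is the higher-level master, the next ranks are group masters, and the remaining ranks are workers distributed round-robin over the groups. The function name `assign_roles` and the role labels are inventions for this sketch.

```python
def assign_roles(n_processes, n_group_masters):
    """Map each rank to a (role, parent_rank) pair for a two-level hierarchy."""
    if n_group_masters < 1:
        raise ValueError("need at least one group master")
    if n_processes < 1 + 2 * n_group_masters:
        raise ValueError("each group master needs at least one worker")

    roles = {0: ("top_master", None)}
    for rank in range(1, n_group_masters + 1):
        roles[rank] = ("group_master", 0)
    first_worker = n_group_masters + 1
    for rank in range(first_worker, n_processes):
        # Distribute workers over the group masters in round-robin order.
        parent = 1 + (rank - first_worker) % n_group_masters
        roles[rank] = ("worker", parent)
    return roles


if __name__ == "__main__":
    for rank, (role, parent) in assign_roles(7, 2).items():
        print(rank, role, parent)
```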
III-B User Interface
The user interface to the mpi_learn code consists of three main components, each handled via a Python class:
-  The training procedure is specified via an Algo class that stores information such as the batch size, choice of optimization algorithm, loss function, and any tunable training parameters such as the learning rate.
-  The DNN model is specified via a ModelBuilder class that provides instructions for constructing a Keras model. The model architecture can be read from a JSON file or specified via Keras code. Using the TensorFlow backend to Keras, it is possible to achieve model parallelism by specifying a device (GPU or CPU) for each layer of the model individually.
-  Input data is specified via a Data class that provides a data generator for use during the training phase. The user may provide a list of input file paths, which are divided evenly among all worker processes during training.
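The even division of input files among workers can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of the Data class itself: the helper name `split_files` and the file names are hypothetical, and a strided split is used so that the number of files per worker differs by at most one.

```python
def split_files(file_paths, n_workers):
    """Return one list of file paths per worker, as evenly sized as possible."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    # Worker i receives files i, i + n_workers, i + 2 * n_workers, ...
    return [file_paths[i::n_workers] for i in range(n_workers)]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    paths = ["data_%03d.h5" % i for i in range(100)]
    for worker_id, shard in enumerate(split_files(paths, 8)):
        print(worker_id, len(shard))
```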
More details on the code can be found on the mpi_learn GitHub page.
The training time speedup for a benchmark neural network model was evaluated on two systems:
-  A Supermicro server with 28 cores and eight NVIDIA GTX 1080 GPUs. Communication between processes is accomplished via shared memory, as all processes are on the same node.
-  The ALCF Cooley GPU cluster, with 126 nodes, each having 16 cores and one NVIDIA K80 GPGPU. Nodes are interconnected with FDR Infiniband.
The performance results reported in this paper are in no way a comparison of the systems detailed above; they simply demonstrate the speedup of the training procedure when mpi_learn is used to distribute the training over multiple GPUs. Further performance improvement could be obtained by tuning the software to the specific architecture of the system used.
The mpi_learn17]. The model consists of an LSTM network with 20 hidden units, followed by a softmax output over three different categories of collision events. The dataset was created using the Delphes simulation framework. The input data consists of 100 files of 9500 samples each, totaling 50 GB. This model takes several hours to train on a node with a single GPU.
The purpose of this paper is not to evaluate the performance of the model but rather to evaluate how much faster this model can be trained when multiple GPUs are utilized. As shown in Fig. 2, the model accuracy degrades slightly as the number of workers increases. This occurs because of the so-called stale gradient issue: workers training on outdated model parameters produce suboptimal gradient updates. The issue can be mitigated by a suitable choice of SGD momentum.
The model is trained several times with various numbers of worker processes, using a batch size of 100 samples. The data in the training set is divided evenly among all workers. Training continues until each worker has processed its training data a fixed number of times (ten, in this case). For each batch of training data, a worker must compute the gradient of the loss function, send the gradient to the master process, and receive updated weights from the master after it applies the gradient update.
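The number of gradient exchanges implied by this setup can be estimated with a short calculation. The sketch below rests on assumptions not stated in the text: it treats all 100 files of 9500 samples as training data (ignoring the held-out test set), assumes incomplete batches are dropped, and uses a helper name, `updates_per_worker`, invented for this example.

```python
def updates_per_worker(n_samples, n_workers, batch_size, n_epochs):
    """Number of gradients one worker sends to the master during training."""
    samples_per_worker = n_samples // n_workers
    batches_per_epoch = samples_per_worker // batch_size  # drop partial batch
    return batches_per_epoch * n_epochs


if __name__ == "__main__":
    n_samples = 100 * 9500  # assumed size of the training set
    for n_workers in (1, 20):
        per_worker = updates_per_worker(n_samples, n_workers, 100, 10)
        print(n_workers, per_worker, per_worker * n_workers)
```

Under these assumptions the total number of updates handled by the master is the same for 1 and for 20 workers, so adding workers shortens the time each worker spends on its share of the data but does not reduce the master's workload.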
Validation of the model’s accuracy is performed by the master process using a held-out test set. Validation can be a bottleneck in the training process because it is performed serially; the frequency of validation can be adjusted as needed to minimize its impact on the total training time.
The time needed to train the model with mpi_learn and a single worker process is also compared to the training time obtained using Keras alone. The times are similar, indicating that the training overhead from the mpi_learn framework itself is small.
For up to 10 worker processes, the speedup is roughly linear with the number of workers. This indicates that the training framework can fully exploit the resources of a multi-GPU node such as the Supermicro server used here.
The speedup deviates from linearity with increasing number of workers. For 60 worker nodes, we observe a speedup of 30 with respect to the nominal training time for this choice of batch size. The deviation from linearity is driven by the time needed for the master process to update the weights of the network and transmit them back to the workers. Because the frequency of weight updates is inversely proportional to the batch size, increasing the batch size can alleviate this bottleneck and speed up the training procedure, as shown in Table I for the example of 20 worker processes.
The more time is spent on validation, the earlier the linear scaling breaks down, because validation takes a constant amount of time that cannot be reduced by adding more workers to the training. This is confirmed by the trend toward better speedup as the amount of validation is decreased.
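The effect of a fixed, non-parallelizable cost on the speedup can be illustrated with Amdahl's law. This model is an assumption introduced here for illustration and is not used in the measurements above: it lumps validation and master-side weight updates into a single constant serial fraction, which is a simplification. The function names are chosen for this sketch.

```python
def parallel_efficiency(speedup, n_workers):
    """Fraction of the ideal linear speedup that is achieved."""
    return speedup / n_workers


def amdahl_speedup(serial_fraction, n_workers):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


def implied_serial_fraction(speedup, n_workers):
    """Serial fraction that reproduces an observed speedup under Amdahl's law."""
    return (n_workers / speedup - 1.0) / (n_workers - 1.0)


if __name__ == "__main__":
    # Observed on the cluster: a speedup of 30 with 60 workers.
    print(parallel_efficiency(30, 60))
    fraction = implied_serial_fraction(30, 60)
    print(fraction, amdahl_speedup(fraction, 60))
```

Within this simplified picture, a serial share of under 2% of the single-worker training time is already enough to halve the efficiency at 60 workers, which is consistent with the sensitivity to validation frequency noted above.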
The mpi_learn package provides a convenient interface for training Keras models in a distributed fashion using the MPI protocol. The system is straightforward to use with most models and can be flexibly customized.
Performance has been evaluated on up to eight GPUs on a single server and on up to 60 GPUs on the ALCF Cooley cluster. The results demonstrate a linear speedup with the number of workers in a certain regime and, in particular, show that the framework allows full use of the resources of multi-GPU servers. By providing this training framework, we hope to make it easier for researchers in the sciences and other fields to fully harness available computing resources and benefit from existing distributed training algorithms. The framework is lightweight enough to be used without extensive configuration. It can be used on any MPI-enabled machine or cluster, making it especially practical for training using supercomputing resources. These properties facilitate quick prototyping of large deep neural models and training using many GPUs and/or CPUs, an ability that will become more important as deep learning continues to spread to new application areas.
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We are grateful to Venkatram Vishwanath and Andrew Cherry for their support with Cooley. We acknowledge NVIDIA, Supermicro and the Kavli Foundation for their support of "iBanks", the Caltech HEP AI Tower. This work is partially supported by the United States Department of Energy, Office of High Energy Physics under Contract No. DE-SC0011925 and DOE OHEP Research Technology, Computational HEP and Fermilab under Contract No. DE-AC02-07CH11359.
-  Joeri R. Hermans, Distributed Keras: Distributed Deep Learning with Apache Spark and Keras, CERN IT-DB, https://github.com/cerndb/dist-keras, http://joerihermans.com/work/distributed-keras/
-  Alex Sergeev, Mike Del Balso, Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow, https://github.com/uber/horovod
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature 521, 436-444 (2015).
-  P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high energy physics with deep learning. https://arxiv.org/pdf/1402.4735.pdf (2014).
-  https://github.com/fchollet/keras
-  https://github.com/tflearn/tflearn
-  J. Dean, et al. Large scale distributed deep networks. In F. Pereira, et al. (Eds.), Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS’12), 1223-1231 (2012).
-  S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. https://arxiv.org/abs/1412.6651 (2014).
-  S. Hadjis, C. Zhang, I. Mitliagkas, and C. Re. Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs. https://arxiv.org/abs/1606.04487 (2016).
-  MPI Forum. MPI: a message passing interface standard. Technical Report (1994).
-  https://github.com/duanders/mpi_learn
-  E. Gabriel, et al. Open MPI: goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users’ Group Meeting, (2004).
-  L. Dalcin, et al. MPI for Python: performance improvements and MPI-2 extensions, Journal of Parallel and Distributed Computing, 68(5):655-662 (2008).
-  Theano Development Team. Theano: a Python framework for fast computation of mathematical expressions. http://arxiv.org/abs/1605.02688 (2016).
-  M. Abadi, et al. TensorFlow: large-scale machine learning on heterogeneous systems. tensorflow.org (2015).
-  https://www.alcf.anl.gov/resources-expertise/analytics-visualization
-  L. Evans and P. Bryant. LHC Machine. JINST 3, S08001 (2008). doi:10.1088/1748-0221/3/08/S08001
-  S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Comput. 9 (8), 1735-1780 (1997). doi:10.1162/neco.1997.9.8.1735
-  The DELPHES 3 collaboration, J. de Favereau, C. Delaere, et al. DELPHES 3: a modular framework for fast simulation of a generic collider experiment. J. High Energ. Phys. 57 (2014). doi:10.1007/JHEP02(2014)057
-  D. Weitekamp, et al. Deep topology classifiers for a more efficient trigger selection at the LHC. NIPS, Deep Learning for Physical Sciences (2017). https://dl4physicalsciences.github.io Proceedings to appear.