DeepAI AI Chat
Log In Sign Up

Benchmarking network fabrics for data distributed training of deep neural networks

by   Siddharth Samsi, et al.

Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.


On Scale-out Deep Learning Training for Cloud and HPC

The exponential growth in use of large deep neural networks has accelera...

Distributed Machine Learning for Computational Engineering using MPI

We propose a framework for training neural networks that are coupled wit...

Data Engineering for HPC with Python

Data engineering is becoming an increasingly important part of scientifi...

An introduction to distributed training of deep neural networks for segmentation tasks with large seismic datasets

Deep learning applications are drastically progressing in seismic proces...

Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

Using multiple nodes and parallel computing algorithms has become a prin...

Democratizing Production-Scale Distributed Deep Learning

The interest and demand for training deep neural networks have been expe...

Distributed Deep Learning for Precipitation Nowcasting

Effective training of Deep Neural Networks requires massive amounts of d...