Benchmarking network fabrics for data distributed training of deep neural networks

08/18/2020
by Siddharth Samsi, et al.

Artificial Intelligence/Machine Learning applications require training complex models on large amounts of labelled data. The large computational requirements of training deep models have driven the development of new methods for faster training. One such approach is data-parallel training, in which the training data is distributed across multiple compute nodes. This approach is simple to implement and is supported by most commonly used machine learning frameworks. The data-parallel approach leverages MPI to communicate gradients across all nodes. In this paper, we examine the effects of different physical hardware interconnects and network-related software primitives on data-distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on training times for commonly used deep neural network architectures, or for traditional HPC applications such as Computational Fluid Dynamics.
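As a concrete illustration of the data-parallel pattern described above, the sketch below shows a minimal PyTorch DistributedDataParallel training loop in which the communication backend (NCCL versus an MPI/Gloo backend) is selected at launch time. The model, synthetic dataset, and the DIST_BACKEND/LOCAL_RANK environment variables are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal data-parallel training sketch (illustrative; not the paper's exact setup).
# The communication backend is the knob a fabric benchmark would vary:
# "nccl" uses GPU-aware collectives, "mpi"/"gloo" route gradient traffic differently.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    backend = os.environ.get("DIST_BACKEND", "nccl")  # assumed env var, set per experiment
    dist.init_process_group(backend=backend)          # rank/world size come from the launcher
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    if device.type == "cuda":
        torch.cuda.set_device(device)

    # Synthetic stand-in for the real training data; each rank sees its own shard.
    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(data)                # partitions the dataset across ranks
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
    model = DDP(model, device_ids=[device.index] if device.type == "cuda" else None)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                      # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()           # gradients are all-reduced here
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with a distributed launcher such as torchrun across several nodes, changing only the backend (or the underlying fabric) while holding the model and data fixed allows end-to-end training times to be compared, which is the style of comparison reported in the abstract.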

