A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

08/04/2022
by Jose González-Abad, et al.

Deep learning has been postulated as a solution to numerous problems across different branches of science. Given the resource-intensive nature of these models, they often need to be trained on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In academia, researchers typically gain access to such resources through High Performance Computing (HPC) clusters. These infrastructures complicate the training of deep learning models due to their multi-user nature and the limited permissions granted to users. In addition, different HPC clusters have different peculiarities (e.g., library dependencies) that can entangle the research cycle. In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters that provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not require any special permissions, allowing researchers to run the entire workflow without depending on an administrator. Horovod ensures the efficient distribution of training independently of the deep learning framework used. Moreover, thanks to containerization and specific features of the workflow, researchers obtain a cluster-agnostic way of running their models. The experiments carried out show that the workflow offers good scalability in the distributed training of the models and that it adapts easily to different clusters.
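The abstract's two technical pillars, Horovod for multi-GPU distribution and udocker for unprivileged containers, follow a well-established usage pattern. The sketch below illustrates a typical Horovod training loop with a PyTorch backend; the linear model and synthetic dataset are placeholders of ours, not the paper's, and the script assumes Horovod is installed with PyTorch support.

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

# Minimal sketch of Horovod-distributed training (PyTorch backend assumed).
hvd.init()                                    # start Horovod: one process per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin this process to its local GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and synthetic data; replace with a real network and loader.
model = nn.Linear(100, 10).to(device)
data = torch.randn(512, 100)
target = torch.randint(0, 10, (512,))
dataset = torch.utils.data.TensorDataset(data, target)

# Shard the dataset so each worker trains on a distinct partition.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Scale the learning rate by the number of workers (common Horovod practice).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via ring-allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial weights and optimizer state so all workers start identically.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    sampler.set_epoch(epoch)                  # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:                       # log only from the first worker
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Such a script is conventionally launched with `horovodrun -np <num_gpus> python train.py` (or through MPI); in a workflow like the one described here, each worker process would execute inside a udocker container pulled and created without administrator privileges.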


research
05/13/2020

Literature Review and Implementation Overview: High Performance Computing with Graphics Processing Units for Classroom and Research Use

In this report, I discuss the history and current state of GPU HPC syste...
research
06/04/2020

Portability of Scientific Workflows in NGS Data Analysis: A Case Study

The analysis of next-generation sequencing (NGS) data requires complex c...
research
10/05/2019

Parallelizing Training of Deep Generative Models on Massive Scientific Datasets

Training deep neural networks on large scientific data is a challenging ...
research
08/17/2023

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

Modern large-scale scientific discovery requires multidisciplinary colla...
research
12/15/2021

or2yw: Modeling and Visualizing OpenRefine Histories as YesWorkflow Diagrams

OpenRefine is a popular open-source data cleaning tool. It allows users ...
research
11/30/2022

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Modern Deep Learning (DL) models have grown to sizes requiring massive c...
research
08/30/2022

The BioExcel methodology for developing dynamic, scalable, reliable and portable computational biomolecular workflows

Developing complex biomolecular workflows is not always straightforward....
