Analyzing I/O Performance of a Hierarchical HPC Storage System for Distributed Deep Learning

01/04/2023
by Takaaki Fukai, et al.
Deep learning is now an essential technology in everyday life. To solve more complex problems with deep learning, both training datasets and neural networks are growing in size. Training a model with large datasets and networks requires distributed deep neural network (DDNN) training. HPC clusters are a promising computation environment for large-scale DDNN training. In large-scale DDNN training on HPC clusters, I/O performance is critical because it is becoming a bottleneck. Most flagship-class HPC clusters have hierarchical storage systems. Designing future HPC storage systems requires quantifying how much a hierarchical storage system improves the performance of these workloads. This paper presents a quantitative performance analysis of the hierarchical storage system for a DDNN workload on a flagship-class supercomputer. Our analysis shows how much performance improvement and capacity increase the storage will need to meet the performance goal.
