FanStore: Enabling Efficient and Scalable I/O for Distributed Deep Learning

09/27/2018
by Zhao Zhang, et al.

Emerging Deep Learning (DL) applications introduce heavy I/O workloads on computer clusters. Their inherently long-lasting, repeated, and random file access patterns can easily saturate the metadata and data services and negatively impact other users. In this paper, we present FanStore, a transient runtime file system that optimizes DL I/O on existing hardware/software stacks. FanStore distributes datasets to the local storage of compute nodes and maintains a global namespace. Using system call interception, distributed metadata management, and generic data compression, FanStore provides a POSIX-compliant interface with native hardware throughput in an efficient and scalable manner. Users do not have to make intrusive code changes to use FanStore and benefit from the optimized I/O. Our experiments with benchmarks and real applications show that FanStore can scale DL training to 512 compute nodes with over 90% scaling efficiency.
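
As a rough illustration of a global namespace over node-local storage with compressed file contents, the Python sketch below simulates several "nodes" inside one process. It is not FanStore's implementation: the class `ToyStore`, the function `owner_of`, and the hash-based placement rule are hypothetical names and choices made for this example, and `zlib` merely stands in for "generic data compression". The abstract does not state how files are placed or which codec is used, and the sketch does not intercept system calls.

```python
import zlib
from hashlib import sha1


def owner_of(path: str, num_nodes: int) -> int:
    """Pick the node that stores a path (assumed rule: hash of the path)."""
    digest = sha1(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ToyStore:
    """In-process stand-in for a dataset spread over node-local storage."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        # One dict per simulated node: path -> compressed bytes.
        self.local = [dict() for _ in range(num_nodes)]
        # Global namespace: path -> (owning node, uncompressed size).
        self.metadata = {}

    def put(self, path: str, data: bytes) -> None:
        node = owner_of(path, self.num_nodes)
        self.local[node][path] = zlib.compress(data)
        self.metadata[path] = (node, len(data))

    def read(self, path: str) -> bytes:
        # A metadata lookup finds the owner; data is decompressed on read.
        node, size = self.metadata[path]
        data = zlib.decompress(self.local[node][path])
        if len(data) != size:
            raise IOError(f"size mismatch for {path}")
        return data

    def listdir(self, prefix: str) -> list:
        # Every path is visible regardless of which node holds it.
        return sorted(p for p in self.metadata if p.startswith(prefix))


store = ToyStore(num_nodes=4)
store.put("/data/train/img0.jpg", b"\x00" * 4096)
store.put("/data/train/img1.jpg", b"\x01" * 4096)
print(store.listdir("/data/train/"))
print(len(store.read("/data/train/img0.jpg")))
```

In this toy version the caller sees a single namespace through `listdir` and `read`, while each file's bytes live compressed on exactly one simulated node.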


research
07/05/2021

Data Lake Ingestion Management

Data Lake (DL) is a Big Data analysis solution which ingests raw data in...
research
01/07/2020

High Performance I/O For Large Scale Deep Learning

Training deep learning (DL) models on petascale datasets is essential fo...
research
06/20/2023

λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions

The metadata service (MDS) sits on the critical path for distributed fil...
research
12/01/2020

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Deep learning (DL) applications are increasingly being deployed on HPC s...
research
05/18/2020

HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy

The pervasive adoption of Deep Learning (DL) and Graph Processing (GP) m...
research
05/31/2019

DFS: A Dataset File System for Data Discovering Users

Many research questions can be answered quickly and efficiently using da...
research
11/18/2016

GaDei: On Scale-up Training As A Service For Deep Learning

Deep learning (DL) training-as-a-service (TaaS) is an important emerging...
