Clairvoyant Prefetching for Distributed Machine Learning I/O

01/21/2021
by   Roman Böhringer, et al.

I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers. Optimal data ingestion pipelines differ between systems, and increasing efficiency requires a delicate balance between access to local storage, external filesystems, and remote workers; yet existing frameworks fail to efficiently utilize such resources. We observe that, given the seed generating the random access pattern for training with SGD, we have clairvoyance and can exactly predict when a given sample will be accessed. We combine this with a theoretical analysis of access patterns in training and performance modeling to produce a novel machine learning I/O middleware, HDMLP, to tackle the I/O bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that delivers better performance than state-of-the-art approaches while requiring very few changes to existing codebases and supporting a broad range of environments.
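
The clairvoyance observation can be illustrated with a minimal sketch. This is not HDMLP's API or implementation; the function names, the per-epoch seeding rule, and the round-robin sharding below are all assumptions made for illustration. The point is only that once the shuffle seed is fixed, every worker's full access sequence can be computed before training starts, so a prefetcher can look up when any sample will next be needed.

```python
import random


def epoch_order(seed, epoch, num_samples):
    """Return the shuffled sample order for one epoch, derived only from the seed."""
    # Hypothetical seeding rule: one deterministic generator per epoch.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def worker_schedule(seed, epochs, num_samples, num_workers, rank):
    """List the (epoch, step, sample) accesses of one worker, in access order."""
    schedule = []
    for epoch in range(epochs):
        order = epoch_order(seed, epoch, num_samples)
        # Assumed round-robin sharding: worker `rank` takes every
        # num_workers-th sample of the shuffled order.
        for step, sample in enumerate(order[rank::num_workers]):
            schedule.append((epoch, step, sample))
    return schedule


def next_access(schedule, sample, after=0):
    """Return the schedule position of the next access to `sample`, or None."""
    for pos in range(after, len(schedule)):
        if schedule[pos][2] == sample:
            return pos
    return None


if __name__ == "__main__":
    sched = worker_schedule(seed=42, epochs=2, num_samples=8, num_workers=2, rank=0)
    for epoch, step, sample in sched:
        print(f"epoch {epoch} step {step}: sample {sample}")
```

Because the schedule depends only on the seed and the worker layout, recomputing it on any node yields the same result, which is what lets a prefetcher decide ahead of time which samples to fetch and which cached samples will not be needed again soon.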


