Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud

12/03/2018
by Christian Pinto, et al.

Deep learning system architects strive to design a balanced system in which the computational accelerator (FPGA, GPU, etc.) is never starved for data. Feeding training data fast enough to keep accelerator utilization high is difficult when using dedicated hardware such as GPUs: as accelerators get faster, the storage media and data buses that feed them have not kept pace, and the ever-increasing size of training data further compounds the problem. We describe the design and implementation of Hoard, a distributed caching system that stripes data across the fast local disks of multiple GPU nodes using a distributed file system and feeds it to the accelerators efficiently, ensuring minimal degradation in GPU utilization due to I/O starvation. Hoard can populate its cache from a central storage system either before a job starts or during the job's initial execution, and it then serves the cached data for subsequent epochs of the same job as well as for other jobs that share the same data requirements, e.g., hyper-parameter tuning. Hoard exposes a POSIX file system interface, so existing deep learning frameworks can take advantage of the cache without modification. We show that Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16-GPU cluster (4 nodes, 4 GPUs per node) for a challenging AlexNet ImageNet image classification benchmark with a 150GB input dataset. As a result of the caching, Hoard eliminates the I/O bottlenecks introduced by the shared storage and doubles system utilization compared to using the shared storage without the cache.
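Below is a minimal Python sketch of how a training job might use a Hoard-style cache, assuming the cache is mounted as an ordinary POSIX path. The mount points (/mnt/nfs, /mnt/hoard) and the explicit warm-up helper are illustrative assumptions, not details taken from the paper; the point is that because the cache is exposed through a POSIX file system interface, an unmodified framework data loader can simply read from the cache mount instead of the central storage mount.

import os
import shutil

NFS_ROOT = "/mnt/nfs/imagenet"      # hypothetical central (shared) storage mount
CACHE_ROOT = "/mnt/hoard/imagenet"  # hypothetical distributed cache mount

def warm_cache(src_root: str, dst_root: str) -> None:
    """Optionally pre-populate the cache before training starts;
    alternatively, the first epoch can fill it lazily."""
    for dirpath, _, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            dst = os.path.join(dst_root, rel, name)
            if not os.path.exists(dst):               # skip files already cached
                shutil.copy(os.path.join(dirpath, name), dst)

if __name__ == "__main__":
    warm_cache(NFS_ROOT, CACHE_ROOT)
    # Subsequent epochs, and other jobs sharing the same dataset (e.g. hyper-parameter
    # tuning runs), read from CACHE_ROOT with no framework changes, for example
    # torchvision.datasets.ImageFolder(CACHE_ROOT, ...).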
