Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks

04/18/2023
by Ping Gong, et al.

In this paper, we focus on understanding the data preprocessing pipeline for DNN training in the public cloud. First, we run experiments to measure the performance implications of the two major data preprocessing methods, which use either raw data or record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a highly optimized data preprocessing library. Second, we identify the potential causes, evaluate a variety of optimization methods, and present their pros and cons. We hope this work will shed light on the co-design of “data storage, loading pipeline” and “training framework”, and on flexible resource configurations between them, so that resources can be fully exploited and performance maximized.


