InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

08/13/2023
by Kabir Nagrecha, et al.

Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into DLRM training pipeline bottlenecks and challenges. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing pipeline optimizers. We find that current tooling yields sub-optimal performance, crashes frequently, or requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus state-of-the-art data pipeline optimizers while also improving both CPU and GPU utilization.
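
To make the idea of an agent that learns a CPU allocation concrete, the toy sketch below uses tabular Q-learning to move workers between three pipeline stages. It is not InTune's implementation: the abstract does not specify the agent's state, actions, or reward, so the stage names, per-worker rates, the "throughput is limited by the slowest stage" model, and all hyperparameters here are assumptions chosen purely for illustration.

```python
import random
from itertools import permutations

# Hypothetical per-worker throughput (samples/sec) for three pipeline stages.
# These numbers are invented for illustration; they do not come from the paper.
STAGE_RATES = {"load": 400.0, "decode": 100.0, "batch": 200.0}
STAGES = list(STAGE_RATES)
TOTAL_CPUS = 8  # assumed CPU budget of one trainer machine

def throughput(alloc):
    """Toy model: the pipeline runs at the speed of its slowest stage."""
    return min(STAGE_RATES[s] * n for s, n in zip(STAGES, alloc))

# Actions: do nothing, or move one CPU from stage src to stage dst.
ACTIONS = [None] + list(permutations(range(len(STAGES)), 2))

def step(alloc, action):
    """Apply an action; every stage must keep at least one CPU."""
    if action is None:
        return alloc
    src, dst = action
    if alloc[src] <= 1:
        return alloc
    new = list(alloc)
    new[src] -= 1
    new[dst] += 1
    return tuple(new)

def random_alloc(rng):
    """Random valid allocation that uses the whole CPU budget."""
    alloc = [1] * len(STAGES)
    for _ in range(TOTAL_CPUS - len(STAGES)):
        alloc[rng.randrange(len(STAGES))] += 1
    return tuple(alloc)

def train(episodes=300, steps=20, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {}
    n = len(ACTIONS)
    for _ in range(episodes):
        state = random_alloc(rng)
        for _ in range(steps):
            if rng.random() < eps:
                a = rng.randrange(n)
            else:
                a = max(range(n), key=lambda i: q.get((state, i), 0.0))
            nxt = step(state, ACTIONS[a])
            reward = throughput(nxt) / 100.0  # scaled throughput as reward
            best_next = max(q.get((nxt, i), 0.0) for i in range(n))
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
    return q

def greedy_rollout(q, start, steps=20):
    """Follow the learned policy and return the best allocation visited."""
    state = best = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))
        state = step(state, ACTIONS[a])
        if throughput(state) > throughput(best):
            best = state
    return best

if __name__ == "__main__":
    q = train()
    start = (3, 3, 2)  # a roughly even split as the starting point
    best = greedy_rollout(q, start)
    print(dict(zip(STAGES, start)), throughput(start))
    print(dict(zip(STAGES, best)), throughput(best))
```

In this toy setting the reward is simply the simulated throughput after each move, so the agent is pushed toward giving more workers to whichever stage is currently the bottleneck. A real system would have to measure throughput from the live pipeline instead of a closed-form model.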

research
12/08/2020

The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

With deep reinforcement learning (RL) methods achieving results that exc...
research
09/22/2021

A Survey on Reinforcement Learning for Recommender Systems

Recommender systems have been widely applied in different real-life scen...
research
04/11/2022

Heterogeneous Acceleration Pipeline for Recommendation System Training

Recommendation systems are unique as they show a conflation of compute a...
research
10/18/2021

RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender System

Reinforcement learning based recommender systems (RL-based RS) aims at l...
research
01/03/2023

Offline Evaluation for Reinforcement Learning-based Recommendation: A Critical Issue and Some Alternatives

In this paper, we argue that the paradigm commonly adopted for offline e...
research
08/20/2021

Understanding and Co-designing the Data Ingestion Pipeline for Industry-Scale RecSys Training

The data ingestion pipeline, responsible for storing and preprocessing t...
research
06/03/2018

Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

Scalable and efficient processing of genome sequence data, i.e. for vari...