Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards

05/10/2022
by Youngeun Kwon, et al.

Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which is at odds with the slow CPU memory and causes performance overheads. Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such a cache design. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only past but also "future" cache accesses. ScratchPipe exploits this property to guarantee that the active working set of embedding layers is "always" captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.
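To make the "look forward" idea concrete, below is a minimal Python sketch of a lookahead embedding cache, assuming the training input pipeline exposes the sparse indices of upcoming batches before the GPU consumes them. The LookaheadEmbeddingCache class, its prefetch/lookup methods, and the simple evict-then-fill policy are illustrative assumptions for exposition, not ScratchPipe's actual implementation.

import numpy as np

class LookaheadEmbeddingCache:
    """Toy lookahead ("future-aware") embedding cache, an illustrative
    sketch rather than ScratchPipe's real design. Because RecSys training
    batches come from a pre-assembled input pipeline, the sparse indices
    of upcoming batches are known ahead of time; the cache prefetches
    exactly those rows so every lookup during training hits fast memory."""

    def __init__(self, cpu_table, capacity):
        self.cpu_table = cpu_table   # large embedding table in slow CPU memory
        self.capacity = capacity     # number of rows that fit in fast memory
        self.fast = {}               # row id -> embedding row in "GPU" memory

    def prefetch(self, upcoming_batches):
        """Stage the working set of the lookahead window into fast memory."""
        working_set = set()
        for batch in upcoming_batches:
            working_set.update(int(i) for i in batch)
        assert len(working_set) <= self.capacity, "lookahead window too large"
        # Evict rows no future batch needs, then copy in the missing rows.
        for row in list(self.fast):
            if row not in working_set:
                del self.fast[row]
        for row in working_set:
            if row not in self.fast:
                self.fast[row] = self.cpu_table[row].copy()

    def lookup(self, batch):
        """Gather embeddings for one batch; always a hit after prefetch."""
        return np.stack([self.fast[int(i)] for i in batch])

# Usage: prefetch the next two batches, then consume them at cache speed.
table = np.random.rand(100_000, 64).astype(np.float32)
cache = LookaheadEmbeddingCache(table, capacity=4_096)
batches = [np.random.randint(0, 100_000, size=256) for _ in range(2)]
cache.prefetch(batches)
embeddings = [cache.lookup(b) for b in batches]   # no stalls on CPU memory

Unlike a reactive cache that infers the working set from past misses, the prefetch step above is driven entirely by known future accesses, which is what makes guaranteed hits possible.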

Related research

08/08/2022 - A Frequency-aware Software Cache for Large Recommendation System Embeddings
Deep learning recommendation models (DLRMs) have been widely applied in ...

12/05/2021 - Boosting Mobile CNN Inference through Semantic Memory
Human brains are known to be capable of speeding up visual recognition o...

10/25/2020 - Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training
Personalized recommendations are one of the most widely deployed machine...

05/12/2020 - Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations
Personalized recommendations are the backbone machine learning (ML) algo...

04/17/2021 - ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table
Because of the superior feature representation ability of deep learning,...

01/14/2023 - Failure Tolerant Training with Persistent Memory Disaggregation over CXL
This paper proposes TRAININGCXL that can efficiently process large-scale...

08/08/2019 - TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning
Recent studies from several hyperscalars pinpoint to embedding layers as...
