ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table

04/17/2021
by Huifeng Guo, et al.

Because of the superior feature representation ability of deep learning, various deep Click-Through Rate (CTR) models have been deployed in commercial systems by industrial companies. To achieve better performance, these deep CTR models must be trained efficiently on huge volumes of training data, which makes speeding up the training process an essential problem. Unlike models with dense training data, the training data for CTR models is usually high-dimensional and sparse. To transform the high-dimensional sparse input into low-dimensional dense real-valued vectors, almost all deep CTR models adopt an embedding layer, whose parameters easily reach hundreds of GB or even TB. Since a single GPU cannot accommodate all the embedding parameters, data parallelism alone is not sufficient for distributed training. Therefore, existing distributed training platforms for recommendation adopt model parallelism: they use the CPU (host) memory of servers to maintain and update the embedding parameters, and use GPU workers to perform the forward and backward computations. Unfortunately, these platforms suffer from two bottlenecks: (1) the latency of pull and push operations between the host and the GPU; (2) parameter update and synchronization on the CPU servers. To address these bottlenecks, in this paper we propose ScaleFreeCTR (SFCTR): a MixCache-based distributed training system for CTR models. Specifically, SFCTR also stores the huge embedding table in CPU memory, but uses the GPU instead of the CPU to perform embedding synchronization efficiently. To reduce the latency of both GPU-host and GPU-GPU data transfer, we propose the MixCache mechanism and the Virtual Sparse Id operation. Comprehensive experiments and ablation studies are conducted to demonstrate the effectiveness and efficiency of SFCTR.
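To make the idea concrete, below is a minimal, hedged sketch (not the authors' implementation) of a mixed-cache embedding lookup: a small "device" cache stands in for GPU memory and holds hot rows, while the full table lives in "host" memory, and raw sparse ids in a batch are first compacted so only unique rows are transferred, in the spirit of the Virtual Sparse Id idea. All names and sizes (e.g. DEVICE_CACHE_ROWS, the admission policy) are illustrative assumptions, not details from the paper.

```python
import numpy as np

EMB_DIM = 8                # embedding dimension (illustrative)
TABLE_ROWS = 100_000       # full embedding table kept in host memory
DEVICE_CACHE_ROWS = 1_024  # small cache standing in for GPU memory

host_table = np.random.randn(TABLE_ROWS, EMB_DIM).astype(np.float32)
device_cache = np.zeros((DEVICE_CACHE_ROWS, EMB_DIM), dtype=np.float32)
cache_slot_of = {}         # sparse id -> slot in the device cache
next_slot = 0

def compact_ids(sparse_ids):
    """Map raw sparse ids in a batch to compact 'virtual' ids [0, n_unique),
    so only the unique rows need to be gathered and transferred
    (an assumption-level illustration of the Virtual Sparse Id operation)."""
    unique_ids, virtual_ids = np.unique(sparse_ids, return_inverse=True)
    return unique_ids, virtual_ids

def lookup(sparse_ids):
    """Gather embeddings, serving cached rows from the device cache and
    pulling misses from the host table."""
    global next_slot
    unique_ids, virtual_ids = compact_ids(sparse_ids)
    gathered = np.empty((len(unique_ids), EMB_DIM), dtype=np.float32)
    for i, sid in enumerate(unique_ids):
        slot = cache_slot_of.get(int(sid))
        if slot is not None:                   # cache hit: no host transfer
            gathered[i] = device_cache[slot]
        else:                                  # cache miss: pull from host
            gathered[i] = host_table[sid]
            if next_slot < DEVICE_CACHE_ROWS:  # naive admission policy
                device_cache[next_slot] = host_table[sid]
                cache_slot_of[int(sid)] = next_slot
                next_slot += 1
    return gathered[virtual_ids]               # expand back to batch order

batch = np.array([3, 7, 3, 42, 7])
print(lookup(batch).shape)  # (5, 8)
```

In a real system the cache would use a frequency- or recency-based eviction policy and the gather/scatter would run on the GPU; the sketch only shows how caching hot rows and deduplicating ids can reduce host-device traffic.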


