Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

11/04/2020
by Michael Lui, et al.

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, are unable to support this scale. One approach to supporting this scale is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. This work is a first step for the systems research community toward developing novel model-serving solutions for this huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To this end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. It specifically explores latency-bounded inference systems, in contrast to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution and the sparsity of input inference requests. We further evaluate three embedding table mapping strategies on three DLRM-like models and specify challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe only a marginal latency overhead when data-center-scale recommendation models are served in a distributed inference manner: P99 latency increases by only 1% in the best-case configuration. The latency overheads are largely a result of the commodity infrastructure used and the sparsity of embedding tables. Even more encouragingly, we also show how distributed inference can enable efficiency improvements in data-center-scale recommendation serving.
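To make the core mechanism concrete, the sketch below illustrates one way embedding tables might be mapped onto memory-constrained inference servers: a greedy largest-table-first placement under per-server capacity budgets. This is a minimal illustrative assumption, not the paper's implementation; the `Shard` class and `place_tables` function are hypothetical names, and the paper evaluates three distinct mapping strategies rather than this particular policy.

```python
# Minimal sketch (hypothetical, not the authors' code) of capacity-driven
# embedding-table sharding: each table is assigned to one inference server
# so that no server exceeds its memory budget; lookups for a table are then
# routed to the server that owns it.
from dataclasses import dataclass


@dataclass
class Shard:
    server_id: int
    capacity_bytes: int   # memory budget of this inference server
    used_bytes: int = 0   # bytes consumed by tables placed so far


def place_tables(table_sizes: dict, shards: list) -> dict:
    """Greedy largest-first placement: assign each table to the
    least-loaded shard that can still hold it."""
    placement = {}
    for name, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        candidates = [s for s in shards if s.used_bytes + size <= s.capacity_bytes]
        if not candidates:
            raise MemoryError(f"table {name} ({size} bytes) fits on no shard")
        target = min(candidates, key=lambda s: s.used_bytes)
        target.used_bytes += size
        placement[name] = target.server_id
    return placement


# Example: three 8 GB servers hosting tables too large for any single one.
shards = [Shard(i, 8 * 2**30) for i in range(3)]
sizes = {"user_id": 6 * 2**30, "item_id": 5 * 2**30, "category": 1 * 2**30}
print(place_tables(sizes, shards))
# -> {'user_id': 0, 'item_id': 1, 'category': 2}
```

Because placement like this is static, a skewed or sparse stream of lookup requests can load shards unevenly, which is consistent with the paper's finding that overheads stem largely from the static table distribution and request sparsity.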


research
03/14/2022

Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation

Personalized recommendation is an important class of deep-learning appli...
research
06/03/2021

JIZHI: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu

In modern internet industries, deep learning based recommender systems h...
research
10/17/2022

A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models

Recommendation systems are of crucial importance for a variety of modern...
research
01/25/2022

Serving Deep Learning Models with Deduplication from Relational Databases

There are significant benefits to serve deep learning models from relati...
research
02/21/2023

MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation

Deep learning recommendation systems serve personalized content under di...
research
08/29/2023

Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping

Mixture of experts (MoE) is a popular technique in deep learning that im...
research
01/08/2023

FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models

Sequence-based deep learning recommendation models (DLRMs) are an emergi...
