
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance

by Udit Gupta, et al.

Deep learning recommendation systems must provide high-quality, personalized content under strict tail-latency targets and high system loads. This paper presents RecPipe, a system to jointly optimize recommendation quality and inference performance. Central to RecPipe is decomposing recommendation models into multi-stage pipelines to maintain quality while reducing compute complexity and exposing distinct parallelism opportunities. RecPipe implements an inference scheduler to map multi-stage recommendation engines onto commodity, heterogeneous platforms (e.g., CPUs, GPUs). While the hardware-aware scheduling improves ranking efficiency, the commodity platforms suffer from many limitations requiring specialized hardware. Thus, we design RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail-latency, and system throughput. RPAccel is designed specifically to exploit the distinct design space opened via RecPipe. In particular, RPAccel processes queries in sub-batches to pipeline recommendation stages, and implements dual static and dynamic embedding caches, a set of top-k filtering units, and a reconfigurable systolic array. Compared to prior art and at iso-quality, we demonstrate that RPAccel improves latency and throughput by 3x and 6x, respectively.
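
To make the multi-stage idea in the abstract concrete, here is a minimal sketch of a two-stage ranking pipeline with top-k filtering between stages. It is not RecPipe's implementation: the function names (`top_k`, `multi_stage_rank`), the toy scorers, and the stage sizes are all hypothetical, chosen only to show how a cheap first stage can shrink the candidate set before a costlier second stage runs.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[int], float]

# Counts how many times each (hypothetical) model is invoked.
calls: Dict[str, int] = {"cheap": 0, "heavy": 0}


def cheap_score(item: int) -> float:
    """Toy stand-in for a small, fast model (assumption, not from the paper)."""
    calls["cheap"] += 1
    return -abs(item - 50)


def heavy_score(item: int) -> float:
    """Toy stand-in for a large, accurate model (assumption, not from the paper)."""
    calls["heavy"] += 1
    return -abs(item - 47)


def top_k(items: Sequence[int], scorer: Scorer, k: int) -> List[int]:
    """Score every item once and keep the k highest-scoring ones, best first."""
    return sorted(items, key=scorer, reverse=True)[:k]


def multi_stage_rank(items: Sequence[int],
                     stages: Sequence[Tuple[Scorer, int]]) -> List[int]:
    """Pass candidates through successive (scorer, k) stages.

    Each stage only sees the survivors of the previous stage, so later,
    more expensive scorers run on far fewer items.
    """
    survivors = list(items)
    for scorer, k in stages:
        survivors = top_k(survivors, scorer, k)
    return survivors


candidates = range(100)
result = multi_stage_rank(candidates, [(cheap_score, 10), (heavy_score, 3)])
print(result, calls)
```

In this toy setup the cheap scorer runs on all 100 candidates while the heavy scorer runs on only the 10 survivors, which is the compute reduction a multi-stage design aims for. Whether quality is preserved depends on how well the early stage retains the items the later stage would have ranked highest.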




Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation

Personalized recommendation is an important class of deep-learning appli...

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

The widespread application of deep learning has changed the landscape of...

MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

Deep neural networks are widely used in personalized recommendation syst...

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Neural personalized recommendation is the corner-stone of a wide collect...

Accelerating Recommender Systems via Hardware "scale-in"

In today's era of "scale-out", this paper makes the case that a speciali...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Transformer is a deep learning language model widely used for natural la...