
DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.



I Introduction

Recommendation algorithms are used pervasively to improve and personalize user experience across a variety of web-services. Search engines use recommendation algorithms to order results, social networks to suggest friends and content, e-commerce websites to suggest purchases, and video streaming services to recommend movies. As the sophistication of recommendation tasks increases with larger amounts of better quality data, recommendation algorithms have evolved from simple rule-based or nearest neighbor-based designs [50] to deep learning approaches [23, 41, 7, 64, 63, 62].

Deep learning-based personalized recommendation algorithms enable a plethora of use cases [56]. For example, Facebook’s recommendation use cases require more than 10× the datacenter inference capacity compared to common computer vision and natural language processing tasks. As a result, over 70% of machine learning inference cycles at Facebook’s datacenter fleets are devoted to recommendation and ranking inference [17]. Similar capacity demands can be found at Google [29], Amazon [10, 56], and Alibaba [64, 63]. And yet, despite their importance and the significant research on optimizing deep learning based AI workloads [46, 18, 5, 57, 55] from the systems and architecture community, relatively little attention has been devoted to solutions for recommendation [34]. In fact, deep learning-based recommendation inference poses unique challenges that demand unique solutions.

Fig. 1: State-of-the-art recommendation models span diverse performance characteristics compared to CNNs and RNNs. Based on their use case, recommendation models have unique architectures introducing model-level heterogeneity.

First, recommendation models exhibit unique compute, memory, and data reuse characteristics. Figure 1(a) compares the compute intensity of industry-representative recommendation models [17, 41, 63, 64, 7, 62, 23] (Section III describes the eight recommendation models in detail) to state-of-the-art convolutional (CNN) [22] and recurrent (RNN) neural networks [60]. Compared to CNNs and RNNs, recommendation models, highlighted in the shaded yellow region, tend to be memory intensive as opposed to compute intensive. Furthermore, recommendation models exhibit higher storage requirements (GBs) and irregular memory accesses [17]. This is because recommendation models operate over not only continuous but also categorical input features. Compared to the continuous features (e.g., vectors, matrices, images), categorical features are processed by inherently different operations. This unique characteristic of recommendation models exposes new system design opportunities to enable efficient inference.

Next, depending on the use case, major components of a recommendation model are sized differently. This introduces model-level heterogeneity across the recommendation models. By focusing on memory access breakdown, Figure 1(b) shows diversity among recommendation models themselves. For instance, dense feature processing that incurs regular memory accesses dominates for Google’s WnD [62, 7], NCF [23], Facebook’s DLRM-RMC3 [17], and Alibaba’s DIEN [63]. In contrast, categorical, sparse feature processing that incurs irregular memory accesses dominates for other recommendation models such as Facebook’s DLRM-RMC1/RMC2 [17] and Alibaba’s DIN [64]. These diverse characteristics of recommendation models expose system optimization design opportunities.

Finally, recommendation models are deployed across web-services that require solutions to consider effects of executing at-scale in datacenters. For instance, it is commonly known that requests for web-based services follow Poisson and log-normal distributions for arrival and working set size respectively [36]. Similar characteristics are observed for arrival rates of recommendation queries. However, working set sizes for personalized recommendation queries follow a distinct distribution with heavier tail effects. This difference in query size distribution leads to varying optimization strategies for recommendation inference at-scale. Optimizations based on production query size distributions, compared to log-normal, improve system throughput by up to 1.7× for at-scale recommendation inference.

To enable design optimizations for the diverse collection of industry-relevant recommendation models, this paper presents DeepRecInfra – an end-to-end infrastructure that enables researchers to study at-scale effects of query size and arrival patterns. First, we perform an in-depth characterization of eight state-of-the-art recommendation models that cover commercial video recommendation, e-commerce, and social media [23, 17, 64, 63, 7, 62]. Next, we profile recommendation services in a production datacenter to instrument an inference load generator for modeling recommendation queries.

Built on top of the performance characterization of the recommendation models and dynamic query arrival patterns (rate and size), we propose a hill-climbing based scheduler – DeepRecSched – that splits queries into mini-batches based on the query size and arrival pattern, the recommendation model, and the underlying hardware platform. DeepRecSched maximizes system load under a strict tail-latency target by trading off request versus batch-level parallelism. Since it is also important to consider the role of hardware accelerators for at-scale AI infrastructure efficiency, DeepRecSched also evaluates the impact of specialized hardware for neural recommendation by emulating its behavior running on state-of-the-art GPUs.

The important contributions of this work are:

  1. This paper describes a new end-to-end infrastructure, DeepRecInfra, that enables system design and optimization across a diverse set of recommendation models. To take into account realistic datacenter-scale execution behavior, we characterize and integrate query arrival patterns and size distributions observed in production datacenters into DeepRecInfra (Section III).

  2. We propose a simple, yet effective scheduler – DeepRecSched – that co-designs the degree of request- versus batch-level parallelism based on the dynamic query arrival pattern (rate and size), recommendation model architecture, and service-level latency target (Section IV). Evaluated with DeepRecInfra, DeepRecSched doubles system throughput under strict latency targets. In addition, we implement and evaluate the design on a production datacenter with live recommendation query traffic, showing significant performance improvement.

  3. GPU accelerators can be appealing for recommendation inference, where not all queries are equal. The inflection point at which GPU execution becomes beneficial varies across the different recommendation models under different system loads and latency targets, showing the value of DeepRecSched dynamically determining the optimal configuration. However, compared to CPUs, GPUs do not necessarily improve power efficiency for recommendation inference (Section VI).

Systems research for personalized recommendation is still a nascent field. To enable follow-on work studying and optimizing recommendation at-scale, we will open source the proposed DeepRecInfra infrastructure (the open-source implementation will be available upon acceptance of the publication). Open-source DeepRecInfra will include neural personalized recommendation models representative of industry implementations, as well as query arrival rate and size distributions presented in this paper.

Fig. 2: General architecture of personalized recommendation models. Configuring the key parameters (red) yields different implementations of industry-representative models.

II Neural recommendation models

| Model | Company | Domain | Dense-FC | Predict-FC | Embedding Tables | Embedding Lookups | Embedding Pooling |
|---|---|---|---|---|---|---|---|
| NCF [23] | - | Movies | - | 256-256-128 | 4 | 1 | Concat |
| Wide&Deep [7] | Google | Play Store | - | 1024-512-256 | Tens | 1 | Concat |
| MT-Wide&Deep [62] | Youtube | Video | - | N x (1024-512-256) | Tens | 1 | Concat |
| DLRM-RMC1 [17] | Facebook | Social Media | 256-128-32 | 256-64-1 | 10 | 80 | Sum |
| DLRM-RMC2 [17] | Facebook | Social Media | 256-128-32 | 512-128-1 | 40 | 80 | Sum |
| DLRM-RMC3 [17] | Facebook | Social Media | 2560-512-32 | 512-128-1 | 10 | 20 | Sum |
| DIN [64] | Alibaba | E-commerce | - | 200-80-2 | Tens | Hundreds | Attention+FC |
| DIEN [63] | Alibaba | E-commerce | - | 200-80-2 | Tens | Tens | Attention+RNN |

TABLE I: Architectural features of state-of-the-art personalized recommendation models.

Recommendation is the task of selecting the content most relevant to a user based on preferences and prior interactions. Recommendation is used across many services including search, video and movie content, e-commerce, and advertisements. However, accurately modeling preferences based on previous interactions can be challenging because users only interact with a small subset of all possible items. For example, for streaming services, a user only watches a small subset of accessible videos. As a result, unlike inputs to traditional deep neural networks (DNNs), inputs to recommendation models include both dense and sparse features – this affects how recommendation models are constructed.

II-A Key Components in Neural Recommendation Models

To accurately model user preference, state-of-the-art recommendation models use deep learning solutions. Figure 2 depicts a generalized architecture of DNN-based recommendation models with dense and sparse features as inputs.

Features. Dense features describe continuous inputs, such as characteristics of a specific user. The dense features are often processed with a stack of MLP layers (i.e., fully-connected layers), similar to classic DNN approaches. On the other hand, sparse features represent categorical inputs, such as the collection of products a user has previously purchased or the movies the user has liked. Since the number of positive interactions for a categorical feature is often small compared to the feature’s cardinality (all available products), the binary vector representing such interactions ends up very sparse.

Embedding Tables. Each sparse feature has a corresponding embedding table that is composed of a collection of latent embedding vectors. The number of vectors, or rows in the table, is determined by the number of categories in the given feature – this can vary from tens to billions. The number of elements in each vector, or the column dimension of the table, is determined by the number of latent features for the category representation. This latent dimension is on the order of tens of elements (e.g., 16, 32, or 64). Thus, in total, embedding tables often require storage on the order of tens of GBs.
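As a rough illustration of how table shapes translate into storage, the sketch below multiplies rows by latent dimension by element size. The table count, row count, and fp32 element size are hypothetical values chosen for the arithmetic, not configurations reported in this paper.

```python
def embedding_table_bytes(num_rows, latent_dim, bytes_per_element=4):
    """Bytes needed to store one dense embedding table (fp32 by default)."""
    return num_rows * latent_dim * bytes_per_element


# Hypothetical model: 10 tables, each with 10 million rows of 32 elements.
table_shapes = [(10_000_000, 32)] * 10
total_bytes = sum(embedding_table_bytes(rows, dim) for rows, dim in table_shapes)
print(f"{total_bytes / 1e9:.1f} GB")  # 12.8 GB
```

Under these assumed shapes the tables alone occupy about 12.8 GB, which is in line with the tens-of-GBs scale mentioned above.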

Embedding Table Access. While embedding tables themselves are dense data structures, embedding table operations incur sparse, irregular memory accesses – especially in the context of personalized recommendation. Each sparse input is encoded either as one-hot or multi-hot encoded vectors, which are used to index into specific rows of the corresponding embedding table. The resulting embedding table vectors are combined with a sparse feature pooling operation such as sum, dot product, or multiplication. Note that while embedding lookups could be encoded as a sparse matrix-matrix multiplication, it is more computationally efficient to implement the operation as a table lookup followed by a pooling operation.
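The equivalence between the two formulations can be sketched in a few lines. The example below is a minimal pure-Python illustration with sum pooling; the function names and the tiny table are hypothetical and are not taken from any framework discussed in this paper.

```python
def pooled_lookup(table, indices):
    """Gather only the rows named by `indices` and sum them element-wise."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for idx in indices:
        row = table[idx]
        for j in range(dim):
            pooled[j] += row[j]
    return pooled


def multi_hot_matmul(table, indices):
    """Same result as a multi-hot vector times the table; visits every row."""
    multi_hot = [0.0] * len(table)
    for idx in indices:
        multi_hot[idx] += 1.0
    dim = len(table[0])
    return [
        sum(multi_hot[i] * table[i][j] for i in range(len(table)))
        for j in range(dim)
    ]


# Toy table: 6 categories, 2 latent dimensions.
table = [[float(i), float(2 * i)] for i in range(6)]
indices = [1, 4, 5]  # multi-hot sparse input
print(pooled_lookup(table, indices))     # [10.0, 20.0]
print(multi_hot_matmul(table, indices))  # [10.0, 20.0]
```

The lookup variant touches three rows, while the matrix form touches all six; with millions of rows and only tens of lookups the gap grows accordingly.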

Feature Interaction. The outputs of the dense and sparse features are combined before being processed by subsequent predictor-DNN stacks. Typical operations for feature interaction include concatenation, sum, and averaging.

Product Ranking. The output of the predictor-DNN stacks is the click through rate (CTR) probability for a single user-item pair. To serve relevant content to users, the CTRs of thousands of potential items are evaluated for each user. All CTRs are then ranked and the top-N choices are presented to the user. As a result, deploying recommendation models requires running the models with non-unit batch sizes.
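The ranking step can be sketched as a top-N selection over predicted CTRs. The item names and CTR values below are made up for illustration, and `rank_top_n` is a hypothetical helper rather than part of any production ranking stack.

```python
import heapq


def rank_top_n(ctr_by_item, n):
    """Return the n (item, ctr) pairs with the highest predicted CTR."""
    return heapq.nlargest(n, ctr_by_item.items(), key=lambda pair: pair[1])


# One query: a batch of candidate items scored for a single user.
predicted_ctr = {"item_a": 0.02, "item_b": 0.31, "item_c": 0.07, "item_d": 0.18}
print(rank_top_n(predicted_ctr, 2))  # [('item_b', 0.31), ('item_d', 0.18)]
```

The number of candidates in `predicted_ctr` is what the rest of the paper refers to as the query size.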

III DeepRecInfra: At-scale Recommendation

To better understand the distinct characteristics of and design system solutions for neural recommendation models, we developed an infrastructure, DeepRecInfra, to model and evaluate at-scale recommendation inference. DeepRecInfra is implemented as a highly extensible framework enabling us to consider a variety of recommendation models and use cases. In particular, DeepRecInfra consists of three key components: (1) a suite of industry-representative recommendation models, (2) industry-representative application level tail latency targets, and (3) real-time query serving based on arrival rates and working set size distributions profiled from recommendation running in a production datacenter. The following subsections detail these components.

III-A Industry-scale recommendation models

Recent publications from Google, Facebook, and Alibaba present notable differences across their recommendation models [62, 64, 63, 17, 23]. The generalized recommendation model architecture shown in Figure 2 can be customized by configuring key parameters in order to realize different implementations of recommendation services that exhibit a variety of distinct performance characteristics.

III-A1 State-of-the-art neural recommendation models

To capture the diversity, DeepRecInfra composes a collection of eight state-of-the-art recommendation models. We describe the unique aspects of the recommendation model architecture below and summarize key parameter configurations for each implementation in Table I.

  • Neural Collaborative Filtering (NCF) is a generalization of matrix factorization (MF) techniques popularized by the Netflix Prize [33, 42] with multi-layer perceptrons (MLPs) and non-linearities. Following Figure 2, NCF only considers one-hot encoded sparse features and does not implement a Dense-FC stack. The model comprises four embedding tables (two for users and two for items) and a relatively small predictor stack. Following the embedding table operations, sparse pooling implements a generalized MF (GMF) whose outputs are processed by the final predictor stack of MLPs.

  • Wide and Deep (WnD) considers both sparse and dense input features. Deployed in Google’s Play Store, WnD uses dense features such as user ages and number of applications installed on a mobile platform. Combined, the dense features have dimension of 1000. Following Figure 2, dense input features in WnD bypass the Dense-FC stack and are directly concatenated with the output of one-hot encoded embedding table lookups. Finally, a relatively large Predict-FC stack produces an output click-through-rate (see Table I).

    Fig. 3: Operator breakdown of state-of-the-art personalized recommendation models with a batch-size of 64. The large diversity in bottlenecks leads to varying design optimizations.
  • Multi-Task Wide and Deep (MT-WnD) extends WnD by evaluating multiple output objectives including predicted click-through rate (CTR), comment rate, likes, and ratings. Leveraging multi-objective modeling in personalizing recommendations for users, MT-WnD enables a finer grained and improved user experience [3]. Building upon WnD, MT-WnD implements parallel Predict-FC stacks for the different tasks or objectives.

  • Deep Learning Recommendation Model (DLRM RMC1, RMC2, RMC3) is a set of neural recommendation models from Facebook that differs from the aforementioned examples in its large number of embedding lookups [41]. In addition, based on Figure 2, DLRM first processes the dense features with a DNN-stack. Based on the configurations shown in [17], varying the number of lookups per embedding table and the size of the FC layers yields three different architectures: DLRM-RMC1, DLRM-RMC2, and DLRM-RMC3 (see Table I).

  • Deep Interest Network (DIN) uses attention – implemented as local activation units on top of embedding tables – to model user interests. With respect to Figure 2, DIN does not consider dense input features. The model comprises tens of embedding tables of varying sizes. Smaller embedding tables process one-hot encoded user and item features while the larger embedding tables (up to rows) process multi-hot encoded inputs with hundreds of lookups. The outputs of these multi-hot encoded embedding operations are combined as a weighted sum by a local activation unit (i.e., attention) and then concatenated before being processed by the top predictor stack.

    Fig. 4: GPU speedup over CPU for representative recommendation models. GPUs typically have higher performance than CPU at larger batch-sizes (annotated above). The batch-size at which GPUs start to outperform CPUs and their speedup at large batch-sizes varies across models.
  • Deep Interest Evolution Network (DIEN) captures evolving user interests over time by augmenting DIN with gated recurrent units (GRUs) [63]. Inputs to the model are one-hot encoded sparse features. The outputs of embedding table operations are processed by attention-based multi-layer GRUs. The outputs of the GRUs are concatenated with the remaining embedding vectors and processed by a relatively small predictor FC-stack.

| Model | Runtime Bottleneck | SLA target (ms) |
|---|---|---|
| DLRM-RMC1 | Embedding dominated | 100 |
| DLRM-RMC2 | Embedding dominated | 400 |
| DLRM-RMC3 | MLP dominated | 100 |
| NCF | MLP dominated | 5 |
| WND | MLP dominated | 25 |
| MT-WND | MLP dominated | 25 |
| DIN | Embedding + Attention dominated | 100 |
| DIEN | Attention-based GRU dominated | 35 |

TABLE II: Summarizing performance implications of different personalized recommendation and latency targets used to illustrate design space tradeoffs for DeepRecSched.

III-A2 Operator diversity

The apparent diversity of these industry-representative recommendation models leads to a range of performance bottlenecks. Figure 3 compares the performance characteristics of recommendation models running on a server class Intel Broadwell, shown as fractions of time spent on Caffe2 operators for a fixed batch size of 64. As expected, inference runtime for models with high degrees of dense feature processing (i.e., DLRM-RMC3, NCF, WND, MT-WND) is dominated by the MLP layers. On the other hand, inference runtime for models dominated by sparse feature processing (i.e., DLRM-RMC1 and DLRM-RMC2) is dominated by embedding table lookups.

Interestingly, inference runtime for attention based recommendation models is dominated by neither FC nor embedding table operations. For instance, inference run time for DIN is split between concatenation, embedding table, sum, and FC operations. This is a result of the attention units, which (1) concatenate user and item embedding vectors, (2) perform a small FC operation, and (3) use the output of the FC operation to weight the original user embedding vector. Similarly, the execution time of DIEN is dominated by recurrent layers. This is a result of fewer embedding table lookups whose outputs are processed by a series of relatively large attention layers.

Fig. 5: Queries for personalized recommendation models follow a unique distribution not captured by traditional workload distributions (i.e. normal, log-normal) considered for web-services. The heavy tail of query sizes found in production recommendation services leads to unique design optimizations.
Fig. 6: Aggregated execution time over the query set based on the size distribution for CPU and GPU. GPUs readily accelerate larger queries; however, the optimal inflection point and speedup differ across models.

III-A3 Acceleration opportunity with specialized hardware

Figure 4 illustrates the speedup of GPUs over CPUs across different representative recommendation models at various batch sizes. While GPUs offer higher compute intensity and memory bandwidth, transferring inputs from the CPU to the GPU can consume a significant fraction of time. For instance, across all batch sizes, data loading time consumes on average 60 to 80% of the end-to-end inference time on the GPU for all recommendation models. GPUs do, however, provide significant performance benefits at higher batch sizes, especially for compute intensive models like WnD. Between different classes of recommendation models, (1) the speedup at large batch sizes (i.e., 1024) and (2) the batch size required to outperform CPU-only hardware platforms vary widely (see Figure 4).

III-B Service level requirement on tail latency

Personalized recommendation models are used in many Internet services deployed at a global scale. They must service a large number of requests across the datacenter while meeting strict latency targets set by the Service Level Agreements (SLAs) of various use cases. In this paper, we measure throughput as the number of queries per second (QPS) that can be processed under a tail-latency requirement.
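The latency-bounded throughput metric can be sketched as follows. This is a minimal illustration that assumes the tail is taken at the 95th percentile with a nearest-rank estimate; the percentile choice, the function names, and the sample latencies are all assumptions made for the example.

```python
import math


def tail_latency(latencies_ms, percentile=95.0):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def latency_bounded_qps(latencies_by_load, sla_ms, percentile=95.0):
    """Highest offered load (QPS) whose tail latency stays within the SLA."""
    feasible = [
        qps
        for qps, latencies in latencies_by_load.items()
        if tail_latency(latencies, percentile) <= sla_ms
    ]
    return max(feasible, default=0)


# Made-up latency samples (ms) collected at three offered loads.
samples = {
    100: [10] * 100,
    200: [10] * 90 + [50] * 10,
    400: [10] * 50 + [200] * 50,
}
print(latency_bounded_qps(samples, sla_ms=100))  # 200
```

In this toy data the 400 QPS load has a tail latency of 200 ms, so it is rejected under a 100 ms target even though half of its queries finish quickly.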

Diverse set of tail latency targets: We find that the tail latency target varies based on the applications that use recommendation models (e.g., search, entertainment, social-media, e-commerce) and their service-level agreements (SLA). These differences can result in distinct system design decisions for at-scale recommendation. Table II summarizes the tail latency targets for each of the recommendation models [17, 62, 7, 64, 63]. For instance, the Google Play store imposes an SLA target of tens of milliseconds on Wide&Deep [7, 29]. On the other hand, Facebook’s social media platform requires DLRM-RMC1, DLRM-RMC2, and DLRM-RMC3 run within an SLA target of hundreds of milliseconds [17]. Alibaba’s e-commerce platform requires DIN and DIEN to run within an SLA target of tens of milliseconds (using a collection of CPUs and specialized hardware) [64, 63]. In this paper, we use the published targets and profile each model on a server-class Intel Broadwell CPU to set the particular tail-latency target. Section VI then presents system throughput evaluation for a wider range of tail latency targets on optimization strategies and infrastructure efficiency.

Fig. 7: Performance distribution of recommendation inference at datacenter scale compared to individual machines. Excluding network and geographic effects, individual machines follow the datacenter-scale inference distribution to within 10%.
Fig. 8: DeepRecInfra implements an extensible framework that considers industry-representative recommendation models, application level tail latency targets, and real-time query serving (rate and size). Built upon DeepRecInfra, DeepRecSched optimizes system throughput (QPS) under strict latency targets by optimizing per-request batch-size (request versus batch parallelism) and accelerator query size threshold (parallelizing queries across specialized hardware).

III-C Real-Time Query Serving for Recommendation Inference

It is crucial to model real-time query serving for inference. DeepRecInfra takes into account two important dimensions of real-time query serving: arrival rate and working set sizes.

Query Arrival Pattern: Arrival times for queries for services deployed in the datacenter are determined by the inter-arrival time between consecutive requests. This inter-arrival time can be modeled using a variety of distributions, including a fixed value, a normal distribution, or a Poisson distribution [20, 14, 1]. Previous work has shown that these distributions can lead to different system design optimizations [36, 30]. As with web-services, by profiling services in a production datacenter, we find that query arrival rates for recommendation services follow a Poisson distribution [20, 14, 31, 36, 1].
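A Poisson arrival process has exponentially distributed inter-arrival times, which makes it straightforward to emulate in a load generator. The sketch below is an illustrative stand-in, not DeepRecInfra's actual load generator; the function name, rate, and seed are hypothetical.

```python
import random


def poisson_arrival_times(rate_qps, num_queries, seed=0):
    """Arrival timestamps (seconds) for a Poisson process with mean rate_qps."""
    rng = random.Random(seed)
    now = 0.0
    arrivals = []
    for _ in range(num_queries):
        # Exponential inter-arrival gaps with mean 1 / rate_qps.
        now += rng.expovariate(rate_qps)
        arrivals.append(now)
    return arrivals


arrivals = poisson_arrival_times(rate_qps=100.0, num_queries=20_000)
mean_gap = arrivals[-1] / len(arrivals)
print(f"mean inter-arrival time: {mean_gap * 1000:.2f} ms")  # close to 10 ms
```

Sweeping `rate_qps` upward while recording per-query latency is one way to locate the highest load that still meets a tail-latency target.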

Query Working Set Size Pattern: Not all recommendation queries are created equally. The size of queries for recommendation inference relates to the number of potential items provided to a user. Given that the potential number of items to be served depends heavily on the user and their interaction with the web-service, the size of queries varies. Related work on designing system solutions for web services typically assumes working set sizes of queries follow a fixed, normal, or log-normal distribution [36]. However, Figure 5 illustrates that query sizes for recommendation exhibit a heavier tail compared to canonical log-normal distributions. As a result, while DeepRecInfra’s load generator supports a variety of query distributions, the results in the remainder of this paper focus on the query size distribution representative of the production datacenter (Figure 5).

Figure 6 illustrates the execution time breakdown for queries smaller than the p75 size versus larger queries. Despite the long tail, the collection of small queries constitutes over half the CPU execution time, while the largest 25% of queries contribute nearly 50% of total execution time. This unique query size distribution with a long tail makes GPUs an interesting accelerator target. Figure 6 shows that, across all models, GPUs can effectively accelerate the execution of large queries in particular. While offloading the large queries can reduce execution time, the amount of speedup varies based on the model architecture. The optimal threshold for offloading also varies across models, motivating a design that can automatically tune the offloading decision for recommendation inference.
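The breakdown by query size can be approximated with a short calculation. The sketch below assumes, purely for illustration, that execution time is proportional to query size; the query sizes are made up and do not reproduce the production distribution in Figure 5.

```python
import math


def large_query_time_share(query_sizes, percentile=75):
    """Fraction of total work from queries strictly above the percentile size.

    Assumes execution time scales linearly with query size (an illustrative
    simplification, not a measured property).
    """
    ordered = sorted(query_sizes)
    cutoff = ordered[math.ceil(percentile * len(ordered) / 100) - 1]
    large_work = sum(size for size in ordered if size > cutoff)
    return large_work / sum(ordered)


# Made-up heavy-tailed sample: most queries are small, two are very large.
sizes = [10, 10, 10, 20, 20, 30, 100, 200]
print(large_query_time_share(sizes))  # 0.75
```

In this toy sample the two queries above the p75 size account for three quarters of the work, which is the kind of skew that makes a size-based offloading threshold attractive.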

III-D Subsampling datacenter fleet with single-node servers

To serve potentially billions of users across the world, recommendation models are typically run across thousands of machines. However, it may not always be possible to access and deploy design optimizations across a production-scale datacenter. We show that we can use a handful of machines to study and optimize tail performance of recommendation inference. Figure 7 shows the cumulative distribution of two different recommendation models running on server-class Intel Skylake and Broadwell machines. We find that the datacenter scale performance distribution (black) is tracked by the distribution measured on a handful of machines (red). The tail-latency trends for recommendation inference across a subset of machines are within 10% of the performance across machines in a datacenter, making the subset representative of larger scale systems.

III-E Putting It All Together

To study at-scale characteristics of recommendation, it is important to use representative infrastructure. This includes representative recommendation models, query arrival rates, and query working set size distributions. Putting it all together, we developed DeepRecInfra, as depicted in Figure 8, by incorporating an extensible load generator to model query arrival rate and size patterns for a diverse set of recommendation models. This enables efficient and representative design and optimization strategies catered to at-scale recommendation.

IV DeepRecSched design

In order to consider a variety of recommendation use cases (i.e., model architectures, tail latency targets, real-time query serving, hardware platforms), we design, implement, and evaluate DeepRecSched on top of DeepRecInfra as shown (Figure 8). By exploiting the aforementioned unique characteristics of recommendation models and real-time query distributions, the proposed DeepRecSched maximizes system throughput while meeting strict tail latency targets of recommendation services. Central to DeepRecSched is the observation that working set sizes for recommendation queries follow a unique distribution with a heavy tail. Intuitively, large queries, which take the longest to process, limit the throughput (QPS) a system can handle given a strict latency target. DeepRecSched addresses this bottleneck with two key design optimizations. First, large queries are split into multiple requests of smaller batch sizes that are processed by parallel cores. This requires carefully balancing batch-level and SIMD-level parallelism, cache contention, and the potential increase in queuing delay from a larger number of smaller-sized requests. Second, large queries are offloaded to specialized AI hardware in order to accelerate at-scale recommendation inference.

IV-A Optimal batch size varies

While all queries can be processed by a single core, splitting queries across cores to exploit hardware parallelism is often advantageous. Thus, DeepRecSched splits queries into individual requests. However, this sacrifices parallelism within a request, since each request has a smaller batch size.
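Query splitting can be sketched as a simple partition of the query's items into fixed-size requests. The helper below is a hypothetical illustration of the idea, not the DeepRecSched implementation; the query size and batch size are arbitrary.

```python
def split_query(query_size, batch_size):
    """Split one query into request sizes of at most batch_size items each."""
    if query_size < 0 or batch_size <= 0:
        raise ValueError("query_size must be >= 0 and batch_size must be > 0")
    full_requests, remainder = divmod(query_size, batch_size)
    return [batch_size] * full_requests + ([remainder] if remainder else [])


# A 1000-item query split for four cores' worth of requests.
print(split_query(1000, 256))  # [256, 256, 256, 232]
```

A smaller `batch_size` yields more requests that can run on separate cores (request-level parallelism), while a larger one keeps more items together within each request (batch-level parallelism).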

Fig. 9: Optimal request vs. batch parallelism varies based on the use case. (Top) Optimal batch size varies across tail latency targets for DLRM-RMC3. (Bottom) Optimal batch size varies across recommendation models i.e., DLRM-RMC2 (embedding-dominated), DLRM-RMC3 (MLP-dominated), DIEN (attention-dominated).

The optimal batch size that maximizes the system QPS throughput varies based on (1) tail latency targets and (2) recommendation models. Figure 9(top) illustrates that, for DLRM-RMC3, the optimal batch size increases from 128 to 256 as the tail latency target is relaxed from 66 ms (low) to 100 ms (medium). Furthermore, Figure 9(bottom) shows that the optimal batch size for DIEN (attention-based), DLRM-RMC3 (FC heavy), and DLRM-RMC1 (embedding table heavy) is 64, 128, and 256, respectively.

This design space is further expanded considering the heterogeneity of CPUs found in production datacenters [21]. Recent work shows that recommendation models are run across a variety of server class CPUs such as Intel Broadwell and Skylake [17]. Key architectural features across these servers can impact the optimum tradeoff between request- and batch-level parallelism. First, Intel Broadwell implements SIMD units based on 256-bit AVX2 while Skylake implements AVX-512. Higher batch sizes are typically required to exploit the benefits of the wider SIMD units in Intel Skylake [17]. Next, Intel Broadwell implements an inclusive L2/L3 cache hierarchy while Skylake implements an exclusive one. While inclusive cache hierarchies simplify cache coherence protocols, they are more susceptible to cache contention and performance degradation from parallel cores [28, 27]. In the context of recommendation, this contention can be reduced by trading off request for batch parallelism. Section VI provides a more detailed analysis into the implication of hardware heterogeneity on trading off request- versus batch-level parallelism.

Fig. 10: The optimal query size threshold, and thus the fraction of queries processed by the GPU, varies across recommendation models, i.e., DLRM-RMC2 (embedding-dominated), DLRM-RMC3 (MLP-dominated), DIEN (attention-dominated).
Fig. 11: Compared to a static scheduler based on production recommendation services, the top figure shows performance, measured in system throughput (QPS) across a range of latency targets, while the bottom shows power efficiency (QPS/Watt), for DeepRecSched-CPU and DeepRecSched-GPU.

IV-B Leverage parallelism with specialized hardware

In addition to balancing request- versus batch-level parallelism on general purpose CPUs, in the presence of specialized AI hardware, DeepRecSched improves system throughput by offloading queries that can best leverage parallelism in the available specialized hardware. We evaluate the role of accelerators for at-scale recommendation with state-of-the-art GPUs. Trading off processing queries on CPUs versus GPUs requires careful optimization. Intuitively, offloading queries to the GPU incurs significant data transfer overheads. To amortize this cost, GPUs often require higher batch sizes to exhibit speedup over general-purpose CPUs, as shown in Figure 4 [58]. Consequently, DeepRecSched improves system throughput by offloading the largest queries for recommendation inference to the GPU. This can be accomplished by tuning the query-size threshold. Queries larger than this threshold are offloaded to the GPU while smaller ones are processed by the CPU cores. Figure 10 illustrates the impact of query-size threshold (x-axis) on the achievable QPS (y-axis) across a variety of recommendation models. The optimal threshold varies across the three recommendation models, DLRM-RMC3, DLRM-RMC1, and DIEN. In fact, we find that the threshold not only varies across model architectures, but also across tail latency targets (see Section VI for more details).
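The threshold-based offloading policy can be sketched as below. The function names are hypothetical, and treating a query whose size equals the threshold as a GPU query is an assumption, chosen so that a unit threshold sends every query to the accelerator.

```python
def route_query(query_size: int, threshold: int) -> str:
    """Return the device that serves a query under a query-size threshold.

    Assumption: queries at or above the threshold go to the GPU.
    """
    return "gpu" if query_size >= threshold else "cpu"


def gpu_work_fraction(query_sizes: list[int], threshold: int) -> float:
    """Fraction of all items (not queries) that end up on the GPU."""
    total = sum(query_sizes)
    offloaded = sum(q for q in query_sizes if route_query(q, threshold) == "gpu")
    return offloaded / total if total else 0.0


# Illustrative, made-up query sizes; only the two largest are offloaded.
sizes = [10, 50, 200, 800]
print([route_query(q, 100) for q in sizes])  # ['cpu', 'cpu', 'gpu', 'gpu']
print(gpu_work_fraction(sizes, 100))
```

Because query sizes are heavy-tailed, a small fraction of queries can account for a large fraction of the offloaded work, which is why the two quantities are computed separately here.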

IV-C DeepRecSched

One option for identifying the optimal batch size that balances the effects of batch- and request-level parallelism is to apply a control-theoretic approach. However, based on the detailed characterization results observed in Figures 9 and 10, we find that a simple hill-climbing algorithm is sufficient to find the optimal batch size and query-size threshold across the variety of recommendation models and hardware platforms.

DeepRecSched starts with a unit batch size to serve recommendation inference queries in DeepRecInfra and increases the batch size, while maintaining the target tail latency, until the achievable QPS degrades. DeepRecSched then tunes the query-size threshold for offloading recommendation inference queries to specialized hardware. Starting with a unit query-size threshold (i.e., all queries are processed on the accelerator), DeepRecSched applies hill-climbing to gradually increase the threshold until the achievable QPS degrades. As Section VI later shows, by automatically tuning the per-request batch size and GPU query-size threshold, DeepRecSched optimizes infrastructure efficiency of at-scale recommendation across a variety of different model architectures, tail latency targets, query-size distributions, and the underlying hardware.
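The hill-climbing search can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the step rule (doubling), the upper `limit`, and the `measure_qps` callback are all hypothetical. The callback is assumed to return the latency-bounded QPS for a given knob value, i.e., the throughput achieved while still meeting the tail latency target.

```python
import math
from typing import Callable


def hill_climb(measure_qps: Callable[[int], float],
               start: int = 1, limit: int = 1024) -> int:
    """Increase a knob from `start` until measured QPS stops improving.

    The knob is doubled at each step (an assumption of this sketch).
    """
    best, best_qps = start, measure_qps(start)
    candidate = start * 2
    while candidate <= limit:
        qps = measure_qps(candidate)
        if qps <= best_qps:  # throughput degraded or plateaued: stop
            break
        best, best_qps = candidate, qps
        candidate *= 2
    return best


# Toy throughput curve that peaks at 128; it is made up, not measured data.
def toy_qps(batch_size: int) -> float:
    return -(math.log2(batch_size) - 7) ** 2


print(hill_climb(toy_qps))  # 128
```

The same routine could be applied twice, first to the per-request batch size and then to the GPU query-size threshold, each time with a callback that measures QPS for the knob being tuned.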

V Methodology

We implement and evaluate DeepRecSched with DeepRecInfra across a variety of different hardware systems and platforms. We then compare the performance and power efficiency results with a production-scale baseline.

DeepRecInfra. As discussed in Section III, DeepRecInfra comprises three key components:

  • Model Implementation: We implement all the recommendation models (shown in Table I) in Caffe2 with Intel MKL as the backend library for CPUs [26] and CUDA/cuDNN 10.1 for GPUs [43]. All CPU experiments are conducted with a single Caffe2 worker and Intel MKL thread, unless otherwise specified.

  • SLA Latency Targets: Table II presents the SLA targets for each recommendation model. To explore the design tradeoffs over a range of latency targets, we consider three latency targets for each recommendation model (Low, Medium, and High), where the Low and High tail latency targets are set to be 50% lower and 50% higher than that of Medium, respectively.

  • Real-Time Query Patterns: Following Section III, query patterns in DeepRecInfra are configurable on two axes: arrival rate and query size. The arrival pattern is fitted on a Poisson distribution whereas the query sizes are drawn from the production distribution (Figure 5).
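The query pattern component above can be sketched as a small load generator. Poisson arrivals are modeled with exponentially distributed inter-arrival times. The production query-size distribution of Figure 5 is not reproduced here, so the sketch takes the size sampler as a parameter and uses a clipped lognormal purely as a stand-in; the function names and the values of `mu`, `sigma`, and the arrival rate are assumptions.

```python
import random
from typing import Callable


def lognormal_size(rng: random.Random, mu: float = 4.0, sigma: float = 1.0,
                   max_size: int = 1000) -> int:
    """Stand-in query-size sampler, clipped to [1, max_size]."""
    return max(1, min(max_size, round(rng.lognormvariate(mu, sigma))))


def generate_queries(rate_qps: float, n: int,
                     size_sampler: Callable[[random.Random], int] = lognormal_size,
                     seed: int = 0) -> list[tuple[float, int]]:
    """Return n (arrival_time_seconds, query_size) pairs.

    Arrivals form a Poisson process with mean rate `rate_qps`.
    """
    rng = random.Random(seed)
    now = 0.0
    queries = []
    for _ in range(n):
        now += rng.expovariate(rate_qps)  # exponential inter-arrival gap
        queries.append((now, size_sampler(rng)))
    return queries


trace = generate_queries(rate_qps=100.0, n=5)
for arrival, size in trace:
    print(f"t={arrival:.4f}s size={size}")
```

Swapping `size_sampler` for a sampler fitted to a measured distribution changes the query-size axis without touching the arrival-rate axis, mirroring the two configurable axes described above.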

Experimental System Setup. To consider the implications of hardware heterogeneity found in datacenters [21, 17, 59], we evaluate DeepRecSched with two generations of dual-socket server-class Intel CPUs: Broadwell and Skylake. Broadwell comprises 28 cores running at 2.4GHz with AVX2 SIMD units and implements an inclusive L2/L3 cache hierarchy. Its TDP is 120W. Skylake comprises 40 cores running at 2.0GHz with AVX-512 SIMD units and implements an exclusive L2/L3 cache hierarchy. Its TDP is 125W.

To consider the implications of AI hardware accelerators, we extend the design space to take into account a GPU accelerator model based on real empirical characterization. The accelerator performance model is constructed with the performance profiles of each recommendation model across the range of query sizes on a real-hardware GPU: a server-class NVIDIA GTX 1080Ti with 3584 CUDA cores, 11GB of GDDR5X memory, and an optimized cuDNN backend library (see Figure 4). This includes both data loading and model computation, capturing end-to-end recommendation inference.

Production-scale baseline. We compare DeepRecSched to the baseline that implements a fixed batch size configuration. This fixed batch size configuration is typically set by splitting the largest query evenly across all available cores on the underlying hardware platform. Given the maximum query size of 1000 (Figure 5), the static batch size configuration is determined as 25 for a server-class 40-core Intel Skylake.
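The baseline's fixed batch size follows from simple arithmetic, sketched below. The function name is hypothetical, and rounding up when the largest query does not divide evenly across cores is an assumption; the 40-core case reproduces the value of 25 stated above.

```python
def static_batch_size(max_query_size: int, num_cores: int) -> int:
    """Batch size that spreads the largest query across all cores.

    Rounds up so that every item is covered (assumption of this sketch).
    """
    return -(-max_query_size // num_cores)  # ceiling division


print(static_batch_size(1000, 40))  # 25, the 40-core Skylake configuration
```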

Fig. 12: Exploiting the unique characteristics of at-scale recommendation yields efficiency improvements given the optimal batch size varies across SLA targets and query size distributions (left), models (middle), and hardware platforms (right).

VI DeepRecSched Evaluation

This section first presents the overall efficiency improvements of DeepRecSched over the baseline across all eight state-of-the-art recommendation models using DeepRecInfra. Next, we describe the design tradeoffs and benefits of DeepRecSched by diving into (1) the tradeoffs in request- versus batch-level parallelism, (2) a case study demonstrating the benefits of the design optimizations in a real production datacenter, and (3) the parallelization opportunities unlocked by offloading requests to specialized hardware.

Performance. Figure 11(top) compares the throughput of DeepRecSched-CPU and DeepRecSched-GPU against a baseline static scheduler across the three tail latency configurations, all normalized to the measured QPS of the baseline at the low tail latency target. Overall, DeepRecSched-CPU achieves 1.7×, 2.1×, and 2.7× higher QPS across all models for the low, medium, and high tail latency targets, respectively. DeepRecSched-CPU is able to increase the overall system throughput by operating at the optimal batch size configuration. Furthermore, DeepRecSched-GPU increases the performance improvement to 4.0×, 5.1×, and 5.8× at the low, medium, and high tail latency targets, respectively. Thus, parallelizing requests across general-purpose CPUs and specialized hardware provides additional performance improvement for recommendation inference at scale.

Power efficiency. Figure 11(bottom) compares the QPS-per-watt power efficiency of DeepRecSched-CPU and DeepRecSched-GPU, again normalizing the measured QPS/Watt to the low tail latency case of the baseline static scheduler. Given higher performance under the same TDP power budget as the baseline, DeepRecSched-CPU achieves 1.7×, 2.1×, and 2.7× higher QPS/Watt for all models under the low, medium, and high tail latency targets, respectively. Aggregated across all models, DeepRecSched-GPU raises the power efficiency improvement to 2×, 2.6×, and 2.9× for each latency target. Relative to its performance improvement, DeepRecSched-GPU provides only a marginal improvement in power efficiency due to the power overhead of GPU acceleration. In fact, while DeepRecSched-GPU improves system QPS across all recommendation models and latency targets compared to DeepRecSched-CPU, it does not globally improve QPS/Watt. In particular, the power efficiency improvement of DeepRecSched-GPU is more pronounced for compute-intensive models (i.e., WnD, MT-WnD, NCF). For memory-intensive models (i.e., DLRM-RMC1, DIN), the power overhead of offloading recommendation inference to GPUs outweighs the performance gain, degrading the overall power efficiency. Thus, judicious optimization of offloading queries across CPUs and specialized AI hardware can improve infrastructure efficiency for recommendation at scale.

VI-A Balance of Request and Batch Parallelism

Compared to the fixed static baseline, DeepRecSched-CPU improves QPS by balancing the request- versus batch-level parallelism across varying tail latency targets, query size distributions, recommendation models, and hardware platforms.

Optimizing across SLA targets. Figure 12(a) illustrates the tradeoff between request- and batch-level parallelism across varying tail latency targets for DLRM-RMC1. Under lower, stricter tail latency targets, QPS is optimized at lower batch sizes, favoring request-level parallelism. On the other hand, at more relaxed tail latency targets, DeepRecSched-CPU finds the optimal configuration to be at a higher batch size, favoring batch-level parallelism. As previously shown in Figure 11(top), optimizing the per-request batch size yields DeepRecSched-CPU’s QPS improvements over the static baseline across tail latency targets.

Optimizing across query size distributions. Figure 12(a) also shows that the optimal batch size for DLRM-RMC1 varies across query working set size distributions (lognormal and the production distribution). The optimal batch size across all tail latency targets is strictly lower for the lognormal distribution than for the query size distribution found in production recommendation use cases. This is because, as shown in Figure 5, query sizes in production recommendation use cases follow a distribution with a heavier tail. In fact, applying the optimal batch-size configuration for the lognormal query size distribution to the production distribution degrades the performance of DeepRecSched-CPU by 1.2×, 1.4×, and 1.7× at low, medium, and high tail latencies for DLRM-RMC1. Thus, built on top of DeepRecInfra, DeepRecSched-CPU carefully optimizes request- versus batch-level parallelism for recommendation inference in production datacenters.

Optimizing across recommendation models. Figure 12(b) illustrates that the optimal batch size varies across recommendation models with distinct compute and memory characteristics. For compute-intensive models (e.g., DLRM-RMC3, WnD), system throughput is optimized at lower batch sizes compared to memory-intensive models (e.g., DLRM-RMC1, DIN). At the high SLA targets, DLRM-RMC3 and WnD have an optimal batch size of 256 and 128, respectively. This is a result of the compute-intensive models being accelerated by the data-parallel SIMD units (i.e., AVX-512 in Intel Skylake, AVX2 in Intel Broadwell). In addition to leveraging the data-parallel per-core SIMD units, recommendation inference can be further accelerated by processing parallel requests across the chip-multiprocessor (CMP) cores. Running the models with smaller batch sizes results in better request-level parallelism and CMP core utilization. On the other hand, DLRM-RMC1 and DIN are optimized at a larger batch size of 1024. This is because the primary performance bottleneck of models with heavy embedding table accesses lies in DRAM bandwidth utilization. In addition to request-level parallelism, memory bandwidth utilization can be improved significantly by running recommendation inference at a higher batch size. By exploiting characteristics of the models to optimize the per-request batch size, DeepRecSched-CPU achieves higher QPS for a variety of distinct recommendation models.

Fig. 13: Exploiting the request vs. batch-level parallelism optimization demonstrated by DeepRecSched in a real production datacenter improves performance of at-scale recommendation services. Across models and servers, optimizing batch size reduces p95 and p99 latency by 1.39× (left) and 1.31× (right).

Optimizing across hardware platforms. Figure 12(c) shows that the optimal batch size for DLRM-RMC3 varies across server architectures (Intel Broadwell and Skylake machines). The optimal batch size, across all tail latency targets, is strictly higher on Intel Broadwell compared to Skylake. For example, at a latency target of 175, the optimal batch size on Intel Broadwell and Skylake is 1024 and 256, respectively. This is a result of the varying cache hierarchies on the two platforms. In particular, Intel Broadwell implements an inclusive L2/L3 cache hierarchy while Intel Skylake implements an exclusive L2/L3 cache hierarchy. As a result, Intel Broadwell suffers from higher cache contention with more active cores, leading to performance degradation. For example, at a latency target of 175 and per-request batch sizes of 16 (request-parallel) and 1024 (batch-parallel), Intel Broadwell has an L2 cache miss rate of 55% and 40%, respectively. To compensate for this performance penalty, DeepRecSched-CPU runs recommendation models with higher batch sizes (fewer requests and active cores per query) on Intel Broadwell. Overall, DeepRecSched enables a fine balance between request- and batch-level parallelism across not only varying tail latency targets, query size distributions, and recommendation models, but also the underlying hardware platforms.

VI-B Tail Latency Reduction for At-Scale Production Execution

In addition to evaluating the proposed design using DeepRecInfra, we deploy the proposed design and demonstrate that the optimizations translate to higher performance in a real production datacenter. Figure 13 illustrates the impact of varying the batch size on the measured tail latency of recommendation models running in a production datacenter. The results are aggregated across a wide collection of recommendation models and server-class Intel CPUs used in the production datacenter fleets. Experiments are conducted on a cluster consisting of hundreds of machines. These machines are configured to receive a fraction of real-time production traffic. To account for the diurnal production traffic as well as intra-day query variability, we deploy and evaluate DeepRecSched over the course of 24 hours. Compared to the baseline configuration with a fixed batch size, the optimal batch size provides a 1.39× and 1.31× reduction in p95 and p99 tail latencies, respectively. This reduction in the tail latency can be used to increase the system throughput (QPS) serviced by the cluster of machines, as demonstrated by DeepRecSched, translating to datacenter capacity savings.
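A tail latency reduction factor of this kind can be computed from two sets of latency samples as sketched below. The nearest-rank percentile definition and the function names are assumptions, and the sample data is synthetic; none of it is taken from the production measurements.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def reduction_factor(baseline: list[float], tuned: list[float], p: int) -> float:
    """Ratio of baseline to tuned tail latency at percentile p."""
    return percentile(baseline, p) / percentile(tuned, p)


# Synthetic latencies: the tuned configuration is uniformly twice as fast.
tuned = [float(x) for x in range(1, 101)]
baseline = [2.0 * x for x in tuned]
print(percentile(tuned, 95), percentile(tuned, 99))  # 95.0 99.0
print(reduction_factor(baseline, tuned, 99))         # 2.0
```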

Fig. 14: (Top) System throughput increases by scheduling queries across both CPUs and GPUs. The percent of work processed by the GPU decreases at higher tail latency targets. (Bottom) While QPS strictly improves, the optimal configuration based on QPS/Watt varies with the tail latency target. GPUs are optimal at low tail latencies while CPUs provide better power efficiency at higher tail latency targets.

VI-C Leverage Parallelism with Specialized Hardware

In addition to trading off request vs. batch-level parallelism, DeepRecSched-GPU leverages additional parallelism by offloading recommendation inference queries to GPUs.

Performance improvements. GPUs are often treated as throughput-oriented accelerators as compared to CPUs. However, in the context of personalized recommendation, we find that GPUs can unlock lower tail latency targets that could not be achieved by the CPU. Figure 14(a) illustrates the performance impact of scheduling requests across both CPUs and GPUs. While the lowest achievable tail latency target for DLRM-RMC1 on CPUs is 57, GPUs can achieve a tail latency target as low as 41 (a 1.4× reduction). This is a result of recommendation models exhibiting high compute and memory intensity, as well as the heavy tail of query sizes in production use cases (Figure 5).

Next, in addition to achieving lower tail latencies, parallelization across both the CPU and the specialized hardware continues to increase system throughput. For instance, Figure 14(a) shows that across all tail latency targets, DeepRecSched-GPU achieves higher QPS than DeepRecSched-CPU. This is a result of executing the larger queries on GPUs, which enables higher system throughput. Interestingly, the percent of work processed by the GPU decreases with higher tail latency targets. This is because, at a low latency target, DeepRecSched-GPU optimizes system throughput by setting a low query size threshold and offloading a large fraction of queries to the GPU. Under a more relaxed tail latency constraint, more inference queries can be processed by the CMPs. This leads to a higher query size threshold for DeepRecSched-GPU. At a tail latency target of 120, the optimal query size threshold is 324 and the percent of work processed by the GPU falls to 18%. As shown in Figure 11(top), optimizing the query size threshold yields DeepRecSched-GPU’s system throughput improvements over the static baseline and DeepRecSched-CPU across the different tail latency targets and recommendation models.

Infrastructure efficiency implications. While GPUs can enable lower latency and higher QPS, power efficiency is not always optimized with GPUs as the specialized AI accelerator. For instance, Figure 14(b) shows the QPS/Watt of both DeepRecSched-CPU and DeepRecSched-GPU for DLRM-RMC1, across a variety of tail latency targets. At low tail latency targets, QPS/Watt is maximized by DeepRecSched-GPU — parallelizing queries across both CPUs and GPUs. However, under more relaxed tail-latency targets, we find QPS/Watt is optimized by processing queries on CPUs only. Despite the additional power overhead of the GPU, DeepRecSched-GPU does not provide commensurate system throughput benefits over DeepRecSched-CPU at higher tail latencies.

More generally, power efficiency is co-optimized by considering both the tail latency target and the recommendation model. For instance, Figure 11(b) illustrates the power efficiency for the collection of recommendation models across different tail latency targets. We find that DeepRecSched-GPU achieves higher QPS/Watt across all latency targets for compute-intensive models (i.e., NCF, WnD, MT-WnD), where the performance improvement of specialized hardware outweighs the increase in power footprint. Similarly, for DLRM-RMC2 and DIEN, DeepRecSched-GPU provides a marginal power efficiency improvement compared to DeepRecSched-CPU. On the other hand, the optimal configuration for maximizing power efficiency of DLRM-RMC1 and DLRM-RMC3 varies based on the tail latency target. As a result, as shown in Figure 11(b), in order to maximize infrastructure efficiency, it is important to consider a variety of recommendation use cases, including model architecture and tail latency targets.

VII Related Work

While the system and computer architecture community has devoted significant efforts to characterize and optimize deep neural network (DNN) inference efficiency, relatively little work has explored running recommendation at-scale.

DNN accelerator designs. Currently available benchmarks for DNNs primarily focus on FC, CNNs, and RNNs [49, 65, 11, 12, 58]. Building upon the performance bottlenecks, a variety of software and hardware solutions have been proposed to optimize traditional DNNs [44, 55, 5, 18, 46, 54, 25, 24, 6, 61, 52, 48, 4, 13, 38, 8, 51, 32, 37, 39, 15, 45, 35, 9, 2]. While the benchmarks and accelerator designs consider a variety of DNN use cases and systems, prior solutions do not apply to the wide collection of state-of-the-art recommendation models presented in this paper. For example, recent characterization of Facebook’s DLRM implementation demonstrates that DNNs for recommendation have unique compute and memory characteristics [41, 17]. These implementations are included, as DLRM-RMC 1-3, within DeepRecInfra. In addition, MLPerf, an industry-academic benchmark suite for machine learning, provides NCF as a training benchmark [1]. NCF, however, is not continued in the latest release; MLPerf is developing a recommendation benchmark that is more representative of industry e-commerce tasks for the next submission round [40, 47]. Finally, a unique and important aspect of the end-to-end infrastructure presented in this paper is that it takes into account at-scale inference request characteristics (arrival rate and size), which are particularly important for recommendation.

Optimizations for personalized recommendation. There are a few recent works that explored the design optimization opportunities for recommendation models. For instance, TensorDimm proposes and evaluates a near memory processing solution for recommendation models similar to DLRM-RMC 1-3 and NCF [34]. Ginart et al. and Shi et al. [16, 53] propose optimization techniques to compress embedding tables in recommendation models while maintaining the model accuracy. In contrast, this paper optimizes the at-scale inference performance of a wider collection of recommendation models by considering the effect of inference query characteristics as well as tail latency targets specific to distinct use cases.

Machine learning at-scale. Finally, prior work has examined the performance characteristics and optimization techniques for ML running at scale on warehouse-scale machines. Sirius and DjiNN-and-Tonic explore the implications of ML in warehouse-scale computers [20, 19]. However, the unique properties of recommendation inference and query patterns have not been the focus of the prior work. Li et al. [36] exploit task- and data-level parallelism to meet SLA targets of latency-critical applications, i.e., Microsoft’s Bing search and finance workloads. Furthermore, recent work has open-sourced benchmarks for studying the performance implications of at-scale execution of latency-critical datacenter workloads and cloud micro-services [14, 31]. In contrast, this paper provides an end-to-end infrastructure (DeepRecInfra) and design solutions (DeepRecSched) specialized for at-scale recommendation inference. DeepRecInfra provides an even baseline for state-of-the-art recommendation models. It models real-time query patterns, representative of the distinct working set size distribution in production datacenter fleets. The unique characteristics lead to the design of DeepRecSched, providing significant performance improvement for at-scale recommendation, an important yet understudied class of AI inference.

VIII Conclusion

Given the growing ubiquity of web-based services that use recommendation algorithms, such as search, social media, e-commerce, and video streaming, deep learning-based personalized recommendation comprises the majority of AI inference capacity and cycles in production datacenters. We propose DeepRecInfra, an extensible infrastructure to study a variety of at-scale recommendation inference. The infrastructure comprises eight state-of-the-art recommendation models, SLA targets, and query patterns. Built upon this framework, DeepRecSched exploits the unique characteristics of at-scale recommendation inference in order to optimize system throughput under a strict tail latency constraint. Across eight recommendation models and under a variety of SLA targets, we demonstrate that DeepRecSched improves system throughput by 2×. In addition to evaluating DeepRecSched on DeepRecInfra, the design optimizations are evaluated in a real production datacenter, demonstrating similar performance benefits. Finally, through judicious optimizations, DeepRecSched can leverage additional parallelism by offloading queries across CPUs and specialized AI hardware in order to achieve higher system throughput and infrastructure efficiency.

IX Acknowledgements

We would like to thank Cong Chen and Ashish Shenoy for the valuable feedback and numerous discussions on the at-scale execution of personalized recommendation systems in Facebook’s datacenter fleets. The collaboration leads to insights which we use to refine the proposed design presented in this paper. It also results in design implementation, testing, and evaluation of the proposed idea for production use cases.


  • [1] (2019) A broad ml benchmark suite for measuring performance of ml software frameworks, ml hardware accelerators, and ml cloud platforms. Note: Cited by: §III-C, §VII.
  • [2] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos (2017) Bit-pragmatic deep neural network computing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382–394. Cited by: §VII.
  • [3] J. Baxter (1997-07-01) A bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28 (1), pp. 7–39. External Links: ISSN 1573-0565, Document, Link Cited by: 3rd item.
  • [4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, Cited by: §VII.
  • [5] Y. Chen, T. Krishna, J. Emer, and V. Sze (2016) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. In ISSCC, Cited by: §I, §VII.
  • [6] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam (2014) DaDianNao: a machine-learning supercomputer. In MICRO, Cited by: §VII.
  • [7] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: §I, §I, §I, §I, TABLE I, §III-B.
  • [8] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie (2016) Prime: a novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News, Vol. 44, pp. 27–39. Cited by: §VII.
  • [9] Y. Choi and M. Rhu (2019) PREMA: a predictive multi-task scheduling algorithm for preemptible neural processing units. arXiv preprint arXiv:1909.04548. Cited by: §VII.
  • [10] M. Chui, J. Manyika, M. Miremadi, N. Henke, R. Chung, P. Nel, and S. Malhotra (2018) NOTES from the ai frontier insights from hundreds of use cases. McKinsey Global Institute. Retrieved from McKinsey online database. Cited by: §I.
  • [11] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia Dawnbench: an end-to-end deep learning benchmark and competition. Cited by: §VII.
  • [12] Deep bench. External Links: Link Cited by: §VII.
  • [13] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam (2015) ShiDianNao: shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, Vol. 43, pp. 92–104. Cited by: §VII.
  • [14] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou (2019) An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, pp. 3–18. External Links: ISBN 978-1-4503-6240-5 Cited by: §III-C, §VII.
  • [15] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis (2017) Tetris: scalable and efficient neural network acceleration with 3d memory. In ACM SIGARCH Computer Architecture News, Vol. 45, pp. 751–764. Cited by: §VII.
  • [16] A. Ginart, M. Naumov, D. Mudigere, J. Yang, and J. Zou (2019) Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810. Cited by: §VII.
  • [17] U. Gupta, X. Wang, M. Naumov, C. Wu, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, B. Jia, H. S. Lee, et al. (2019) The architectural implications of facebook’s dnn-based personalized recommendation. arXiv preprint arXiv:1906.03109. Cited by: §I, §I, §I, §I, TABLE I, 4th item, §III-A, §III-B, §IV-A, §V, §VII.
  • [18] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. CoRR abs/1602.01528. External Links: Link, 1602.01528 Cited by: §I, §VII.
  • [19] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge, R. G. Dreslinski, J. Mars, and L. Tang (2015-06) DjiNN and tonic: DNN as a service and its implications for future warehouse scale computers. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. External Links: Document, ISSN 1063-6897 Cited by: §VII.
  • [20] J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, A. Khurana, R. G. Dreslinski, T. Mudge, V. Petrucci, L. Tang, et al. (2015) Sirius: an open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers. In ACM SIGPLAN Notices, Vol. 50, pp. 223–238. Cited by: §III-C, §VII.
  • [21] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang (2018-02) Applied machine learning at facebook: a datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620–629. Cited by: §I, §IV-A, §V.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §I.
  • [23] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, Republic and Canton of Geneva, Switzerland, pp. 173–182. External Links: ISBN 978-1-4503-4913-0, Link, Document Cited by: §I, §I, §I, §I, TABLE I, §III-A.
  • [24] K. Hegde, R. Agrawal, Y. Yao, and C. W. Fletcher (2018-10) Morph: flexible acceleration for 3d cnn-based video understanding. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 933–946. Cited by: §VII.
  • [25] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher (2019) ExTensor: an accelerator for sparse tensor algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, pp. 319–333. External Links: ISBN 978-1-4503-6938-1 Cited by: §VII.
  • [26] (2018) Intel math kernel library. Note: Cited by: 1st item.
  • [27] A. Jaleel, E. Borch, M. Bhandaru, S. C. Steely Jr, and J. Emer (2010) Achieving non-inclusive cache performance with inclusive caches: temporal locality aware (tla) cache management policies. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 151–162. Cited by: §IV-A.
  • [28] A. Jaleel, J. Nuzman, A. Moga, S. C. Steely, and J. Emer (2015) High performing cache hierarchies for server workloads: relaxing inclusion to capture the latency benefits of exclusive caches. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 343–353. Cited by: §IV-A.
  • [29] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. Cited by: §I, §III-B.
  • [30] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken (2009) The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement, pp. 202–208. Cited by: §III-C.
  • [31] H. Kasture and D. Sanchez (2016) Tailbench: a benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pp. 1–10. Cited by: §III-C, §VII.
  • [32] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay (2016) Neurocube: a programmable digital neuromorphic architecture with high-density 3d memory. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 380–392. Cited by: §VII.
  • [33] Y. Koren, R. Bell, and C. Volinsky (2009-08) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37. Cited by: 1st item.
  • [34] Y. Kwon, Y. Lee, and M. Rhu (2019) TensorDIMM: a practical near-memory processing architecture for embeddings and tensor operations in deep learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 740–753. Cited by: §I, §VII.
  • [35] Y. Kwon and M. Rhu (2018) Beyond the memory wall: a case for memory-centric hpc system for deep learning. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 148–161. Cited by: §VII.
  • [36] J. Li, K. Agrawal, S. Elnikety, Y. He, I. Lee, C. Lu, K. S. McKinley, et al. (2016) Work stealing for interactive services to meet target latency. In ACM SIGPLAN Notices, Vol. 51, pp. 14. Cited by: §I, §III-C, §III-C, §VII.
  • [37] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong (2016) RedEye: analog convnet image sensor architecture for continuous mobile vision. In ACM SIGARCH Computer Architecture News, Vol. 44, pp. 255–266. Cited by: §VII.
  • [38] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen (2015) Pudiannao: a polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News, Vol. 43, pp. 369–381. Cited by: §VII.
  • [39] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh (2016) Tabla: a unified template-based framework for accelerating statistical machine learning. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 14–26. Cited by: §VII.
  • [40] P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, C. Wu, L. Xu, C. Young, and M. Zaharia (2019) MLPerf training benchmark. arXiv preprint arXiv:1910.01500. Cited by: §VII.
  • [41] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy (2019) Deep learning recommendation model for personalization and recommendation systems. CoRR abs/1906.00091. External Links: Link Cited by: §I, §I, 4th item, §VII.
  • [42] Netflix update: try this at home. Cited by: 1st item.
  • [43] (2019) NVIDIA cuda deep neural network library. Cited by: 1st item.
  • [44] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. S. Emer, S. W. Keckler, and W. J. Dally (2017) SCNN: an accelerator for compressed-sparse convolutional neural networks. CoRR abs/1708.04485. External Links: Link, 1708.04485 Cited by: §VII.
  • [45] L. Pentecost, M. Donato, B. Reagen, U. Gupta, S. Ma, G. Wei, and D. Brooks (2019) MaxNVM: maximizing dnn storage density and inference efficiency with sparse encoding and error mitigation. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, New York, NY, USA, pp. 769–781. External Links: ISBN 978-1-4503-6938-1, Link, Document Cited by: §VII.
  • [46] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G. Wei, and D. Brooks (2016) Minerva: enabling low-power, highly-accurate deep neural network accelerators. In ISCA. Cited by: §I, §VII.
  • [47] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. St. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou (2019) MLPerf inference benchmark. arXiv preprint arXiv:1911.02549. Cited by: §VII.
  • [48] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler (2016) VDNN: virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 18. Cited by: §VII.
  • [49] R. Adolf, S. Rama, B. Reagen, G. Wei, and D. Brooks (2016) Fathom: reference workloads for modern deep learning methods. In IISWC ’16. External Links: Link Cited by: §VII.
  • [50] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2001) Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pp. 285–295. External Links: ISBN 1-58113-348-0 Cited by: §I.
  • [51] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar (2016) ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44 (3), pp. 14–26. Cited by: §VII.
  • [52] H. Sharma, J. Park, E. Amaro, B. Thwaites, P. Kotha, A. Gupta, J. K. Kim, A. Mishra, and H. Esmaeilzadeh Dnnweaver: from high-level deep network models to fpga acceleration. Cited by: §VII.
  • [53] H. M. Shi, D. Mudigere, M. Naumov, and J. Yang (2019) Compositional embeddings using complementary partitions for memory-efficient recommendation systems. arXiv preprint arXiv:1909.02107. Cited by: §VII.
  • [54] F. Silfa, G. Dot, J. Arnau, and A. Gonzàlez (2018) E-pur: an energy-efficient processing unit for recurrent neural networks. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, PACT ’18, pp. 18:1–18:12. External Links: ISBN 978-1-4503-5986-3 Cited by: §VII.
  • [55] U. Gupta, B. Reagen, L. Pentecost, M. Donato, T. Tambe, A. M. Rush, G. Wei, and D. Brooks (2019-09-23) MASR: a modular accelerator for sparse rnns. In International Conference on Parallel Architectures and Compilation Techniques. Cited by: §I, §VII.
  • [56] C. Underwood (2019) Use cases of recommendation systems in business – current applications and methods. External Links: Link Cited by: §I.
  • [57] V. Volkov and J. W. Demmel (2008) Benchmarking gpus to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Cited by: §I.
  • [58] G. Wei and D. Brooks (2019) Benchmarking tpu, gpu, and cpu platforms for deep learning. arXiv preprint arXiv:1907.10701. Cited by: §IV-B, §VII.
  • [59] D. Wong and M. Annavaram (2012) KnightShift: scaling the energy proportionality wall through server-level heterogeneity. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, Washington, DC, USA, pp. 119–130. External Links: ISBN 978-0-7695-4924-8, Link, Document Cited by: §V.
  • [60] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §I.
  • [61] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen (2016-10) Cambricon-X: an accelerator for sparse neural networks. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12. Cited by: §VII.
  • [62] Z. Zhao, L. Hong, L. Wei, J. Chen, A. Nath, S. Andrews, A. Kumthekar, M. Sathiamoorthy, X. Yi, and E. Chi (2019) Recommending what video to watch next: a multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, New York, NY, USA, pp. 43–51. External Links: ISBN 978-1-4503-6243-6, Link, Document Cited by: §I, §I, §I, §I, TABLE I, §III-A, §III-B.
  • [63] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai (2019) Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5941–5948. Cited by: §I, §I, §I, §I, §I, TABLE I, 6th item, §III-A, §III-B.
  • [64] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. Cited by: §I, §I, §I, §I, §I, TABLE I, §III-A, §III-B.
  • [65] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko (2018) Benchmarking and analyzing deep neural network training. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pp. 88–100. Cited by: §VII.