The Architectural Implications of Facebook's DNN-based Personalized Recommendation

06/06/2019 ∙ by Udit Gupta, et al.

The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.




1 Introduction

Deep learning has become a cornerstone in many production-scale data center services. As web-based applications continue to expand globally, so does the amount of compute and storage resources devoted to deep learning training and inference [17, 36, 30]. Personalized recommendation is an important class of these services. Deep learning based recommendation systems are broadly used throughout industry to predict rankings for news feed posts and entertainment content [21, 24]. For instance, in 2018, McKinsey and Tech Emergence estimated that recommendation systems were responsible for driving up to 35% of Amazon’s revenue [16, 51, 48].

The systems and computer architecture community has made significant strides in optimizing the performance, energy efficiency, and memory consumption of deep neural networks (DNNs). Optimizations span the entire system stack, from algorithmic innovations (e.g., efficient DNN architectures [31, 33, 18] and the use of reduced precision variables [28, 44, 37, 20, 26]) to system-level techniques (e.g., heavily parallelized training/inference [25, 53]) and the design and deployment of hardware accelerators [36, 44, 13, 27, 54]. These solutions primarily target convolutional (CNN) [31, 45] and recurrent (RNN) [9, 10] neural networks. However, the benefits from these optimization techniques often cannot be realized by recommendation models as the core compute patterns of the models are intrinsically different, introducing unique memory and computational challenges.

Figure 1: Production-scale recommendation systems have orders of magnitude longer inference latency, larger embedding tables (memory intensity) and FC layers (compute intensity). The parameters of three models (RM1, RM2, and RM3), which are representative of ones used at Facebook, are normalized to MLPerf-NCF to highlight the gaps.

The recommendation models available today are not representative of production systems in terms of memory and compute behavior. Figure 1 quantifies the differences between an available recommendation benchmark, i.e., neural-collaborative filtering in MLPerf (MLPerf-NCF) [4], and three at-scale recommendation models, which are representative of models used in production at Facebook: RM1, RM2, and RM3. Compared to MLPerf-NCF, the production-scale models have orders of magnitude higher inference latency. This is a result of discrepancies in the size and scale of two important features of recommendation systems: embedding tables (memory intensive) and fully-connected layers (compute intensive). MLPerf-NCF has up to 10× fewer embedding tables; embedding tables in MLPerf-NCF consume tens of MBs whereas those in production-scale models are typically on the order of GBs. In addition, MLPerf-NCF implements fewer and smaller fully-connected (FC) layers. Therefore, the insights and solutions derived using these smaller recommendation models may be neither applicable to nor representative of production systems.

In this paper, we present a set of personalized recommendation models representative of datacenter-scale workloads used at Facebook. First, we identify quantitative metrics to evaluate the performance of these recommendation workloads. Next, we design a set of synthetic recommendation models to conduct detailed performance analysis. Because inference in the data center is run across a variety of CPUs [30], we focus the design tradeoff studies on Intel Haswell, Broadwell, and Skylake servers, representative of state-of-the-art system infrastructure. Finally, we study performance characteristics of running recommendation models in production environments. The insights from this analysis can be used to motivate broader system and architecture optimization for at-scale recommendation. For example, by leveraging server heterogeneity and inference latency across model types, we can maximize latency-bounded throughput by scheduling inference requests to execute on the most suitable platform.

The in-depth analysis of production-scale recommendation systems presented in this paper provides the following insights for future system design:

  • The current practice of using only latency for benchmarking inference performance is insufficient. At the data center scale, the metric of system throughput under a latency constraint is more representative. The latency-bounded throughput measure determines the number of items that can be ranked given service level agreement (SLA) requirements (Section 3).

  • Inference latency varies across several generations of Intel server architectures (Haswell, Broadwell, Skylake) that co-exist in data centers. With unit batch size, inference latency is optimized on high-frequency Broadwell machines. On the other hand, batched inference (throughput) is optimized with Skylake machines as batching increases the compute density of FC layers. Compute-intensive recommendation models are more readily accelerated with AVX-512 instructions in Skylake machines, as compared to AVX-2 instructions in Haswell and Broadwell (Section 5).

  • Co-locating multiple recommendation models on a single machine can improve throughput. However, this introduces a tradeoff between single model latency and aggregated system throughput. We characterize this tradeoff and find that processors with inclusive L2/L3 cache hierarchies (i.e., Haswell, Broadwell) are particularly susceptible to latency degradation due to co-location. This introduces additional scheduling optimization opportunities at the data center scale (Section 6).

  • Across at-scale recommendation models and different server architectures, the fraction of time spent on compute intensive operations, like FC, varies from 30% to 95%. Thus, existing solutions for accelerating FC layers only [36, 44, 27, 54] will translate to limited inference latency improvement for end-to-end recommendation. This is especially true of recommendation models dominated by embedding tables (Section 5).
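To make the latency-bounded throughput metric from the first insight concrete, the following is a minimal sketch, not the measurement harness used in this paper. It assumes that per-batch latencies have already been measured; the function name and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple


def latency_bounded_throughput(
    latency_ms_by_batch: Dict[int, float], sla_ms: float
) -> Optional[Tuple[int, float]]:
    """Return (batch_size, items_per_second) with the highest throughput
    among batch sizes whose latency meets the SLA, or None if none do."""
    best = None
    for batch_size, latency_ms in latency_ms_by_batch.items():
        if latency_ms > sla_ms:
            continue  # this batch size violates the latency constraint
        items_per_second = batch_size * 1000.0 / latency_ms
        if best is None or items_per_second > best[1]:
            best = (batch_size, items_per_second)
    return best


# Hypothetical measurements (batch size -> latency in ms), not from the paper.
measured = {1: 0.5, 16: 2.0, 128: 8.0, 256: 20.0}
best_config = latency_bounded_throughput(measured, sla_ms=10.0)
print(best_config)
```

In this sketch the largest batch size is rejected because it exceeds the latency target, even though it would give the highest raw throughput. This is the reason latency alone, or throughput alone, is an incomplete benchmark.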

Open-source: To facilitate future work on at-scale recommendation systems for the systems and computer architecture community, we have open-sourced a suite of synthetic models, representative of production use cases. Together with the detailed performance analysis performed in this paper, the open-source implementations can be used to further understand the compute requirements, storage capacity, and memory access patterns, enabling optimization and innovation for at-scale recommendation systems.

2 Background

In this section we provide an overview of the overall task of personalized recommendation and the architecture of at-scale recommendation models. We also discuss distinguishing features of DNN-based recommendation models, compared to other DNN workloads, in terms of their compute density, storage capacity, and memory access patterns.

Figure 2: Simplified model architecture reflecting at-scale recommendation models used at Facebook. Inputs to the model are a collection of dense and sparse features. Sparse features, unique to recommendation models, are transformed to a dense representation using embedding tables, shown in blue. The number and size of embedding tables, the number of sparse feature (ID) lookups per table, and the depth and width of Bottom-FC and Top-FC layers vary based on the use case.

2.1 Recommendation Task

Personalized recommendation is the task of recommending new content to users based on their preferences [21, 24]. Estimates show that up to 75% of movies watched on Netflix and 60% of videos consumed on YouTube are based on suggestions from their recommendation systems [16, 51, 48].

Central to these services is the ability to accurately and efficiently rank content based on users’ preferences and previous interactions (e.g., clicks on social media posts, ratings, purchases). Building highly accurate personalized recommendation systems poses unique challenges as user preferences and past interactions with content are represented as both dense and sparse features [23, 41].

Inputs to DNNs used for image classification and speech recognition are image or audio samples, which are represented as dense matrices and vectors. The dense matrices and vectors are processed by a series of FC, CNN, or RNN layers. In contrast, inputs to recommendation systems represent interactions between any two general entities (e.g., user preferences for online videos). These inputs are a mix of dense and sparse features.

For instance, in the case of ranking videos (e.g., Netflix, YouTube), there may be tens of thousands of potential videos that have been seen by millions of viewers. However, individual users interact with only a handful of videos. This means interactions between users and videos are sparse. Sparse features not only make training more challenging but also require intrinsically different operations (e.g., embedding tables) which impose unique compute, storage capacity, and memory access pattern challenges (see Section 2.2 for details).

Figure 3: Compared to FC and CNN layers, embedding table operations (SparseLengthsSum, SPS, in Caffe2), seen in recommendation systems, exhibit high LLC cache miss rate (left) and low compute density (right).

2.2 Recommendation Models

Figure 2 shows a simplified architecture of state-of-the-art DNNs for personalized recommendation models used at Facebook. The model comprises a variety of operations such as FC layers, embedding tables (which transform sparse inputs to dense representations), pooling, and non-linearities, such as ReLU. At a high level, dense and sparse input features are separately transformed using FC layers and embedding tables, respectively. The outputs of these transformations are then combined and processed by a final set of FC layers. More advanced architectures for personalized recommendation can be found in [12, 1].

Execution Flow: The inputs, for a single user and single post, to recommendation models are a set of dense and sparse features. The output is the predicted click-through-rate (CTR) of the user and post. Dense features are first processed by a series of FC layers, shown as the Bottom-FCs in Figure 2. On the other hand, sparse input features, represented as multiple vectors of sparse IDs, must first be made dense. While each vector of sparse IDs could be transformed to a dense vector using FC layers, the compute demands of doing so would be significant. Instead, we use embedding tables. Each vector is paired with an embedding table, as shown in Figure 2, and each sparse ID is used to look up a unique row in the embedding table. The rows of the embedding table are then combined into a single vector, typically with a dimension of 32 or 64, using element-wise gather operations. Finally, these vectors and the output of the Bottom-FC layers are concatenated and processed by the Top-FC layers shown in Figure 2. The output is a single value representing the predicted CTR of the user-post pair.
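The execution flow above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the Caffe2 implementation used in production: all layer sizes, table sizes, and sparse IDs are hypothetical, the weights are random, and the final sigmoid is an assumption made here to map the output to a value between 0 and 1.

```python
import numpy as np


def fc_relu(x, weight, bias):
    """Fully connected layer followed by a ReLU non-linearity."""
    return np.maximum(x @ weight + bias, 0.0)


def pooled_embedding(table, sparse_ids):
    """Gather one row per sparse ID and sum the rows element-wise."""
    return table[sparse_ids].sum(axis=0)


def predict_ctr(dense, sparse_ids_per_table, bottom_fc, tables, top_fc):
    # Dense features go through the Bottom-FC stack.
    x = dense
    for weight, bias in bottom_fc:
        x = fc_relu(x, weight, bias)
    # Each vector of sparse IDs is pooled through its own embedding table.
    pooled = [pooled_embedding(t, ids)
              for t, ids in zip(tables, sparse_ids_per_table)]
    # Concatenate both paths and run the Top-FC stack.
    z = np.concatenate([x] + pooled)
    for weight, bias in top_fc[:-1]:
        z = fc_relu(z, weight, bias)
    weight, bias = top_fc[-1]
    logit = (z @ weight + bias)[0]
    # Assumption: a sigmoid maps the final value to a probability-like CTR.
    return float(1.0 / (1.0 + np.exp(-logit)))


# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)


def make_layer(n_in, n_out):
    return 0.1 * rng.standard_normal((n_in, n_out)), np.zeros(n_out)


bottom_fc = [make_layer(13, 64), make_layer(64, 32)]
tables = [rng.standard_normal((1000, 32)) for _ in range(2)]
top_fc = [make_layer(32 + 2 * 32, 16), make_layer(16, 1)]

dense = rng.standard_normal(13)
sparse_ids = [np.array([3, 17, 256]), np.array([42, 999])]
ctr = predict_ctr(dense, sparse_ids, bottom_fc, tables, top_fc)
print(ctr)
```

Note that the embedding path performs only row gathers and additions, while the FC path performs dense matrix-vector products. This split is what gives the two halves of the model such different memory and compute behavior.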

Processing multiple posts: At the data center scale, recommendations for many users and posts must be ranked simultaneously. Thus, it is important to note that the vectors of sparse IDs shown in Figure 2 correspond to inputs for a single user and single post. To compute the CTR of many user-post pairs at once, requests are batched to improve overall throughput in the data center.

The depth and width of FC layers, number and size of embedding tables, number of sparse IDs per input, and typical batch-sizes depend on the use case of the recommendation model (see Section 3 for more details).

Figure 4: Content for recommendation systems is ranked hierarchically in two steps: filtering and ranking. Filtering reduces the number of total items to a smaller subset using lightweight machine learning techniques or smaller DNN-based recommendation models (RM1). Ranking performs finer grained ranking using larger DNN-based recommendation models (RM2 and RM3).

2.3 Embedding Tables

A key distinguishing feature of DNNs for recommendation systems, compared to CNNs and RNNs, is the use of embedding tables. As shown in Figure 2, embedding tables are used to transform sparse input features to dense ones. The dense representations are subsequently processed by a series of more traditional DNN layers including FC, pooling, and ReLU non-linearities. However, the embedding tables impose unique challenges to efficient execution in terms of their large storage capacity, low compute density, and irregular memory access pattern.

Model | Description | Bottom FC dims | Top FC dims | Emb. tables: number | Emb. tables: input dim. | Emb. tables: output dim. | Emb. tables: lookups
RM1 | Small FC; few emb. tables; small emb. tables | Layer1: 8, Layer2: 4, Layer3: 1 | Layer1: 4, Layer2: 2, Layer3: 1 | 1 to 3 | 1 to 180 | 1 | User: 4, Posts: Nx4
RM2 | Small FC; many emb. tables; small emb. tables | Layer1: 8, Layer2: 4, Layer3: 1 | Layer1: 4, Layer2: 2, Layer3: 1 | 8 to 12 | 1 to 180 | 1 | User: 4, Posts: Nx4
RM3 | Large FC; few emb. tables; large emb. tables | Layer1: 80, Layer2: 8, Layer3: 4 | Layer1: 4, Layer2: 2, Layer3: 1 | 1 to 3 | 10 to 180 | 1 | 1
Table 1: Model architecture parameters representative of production scale recommendation workloads for three example recommendation models used at Facebook, highlighting their diversity in terms of embedding table and FC sizes. Each parameter (column) is normalized to the smallest instance. For example, Bottom and Top FC sizes are normalized to layer 3 in RM1. Number, input dimension, and output dimension of embedding tables are normalized to the RM1 model. Number of lookups are normalized to RM3.

Storage capacity: The size of a single embedding table seen in production-scale recommendation models varies from tens of MBs to several GBs. Furthermore, the number of embedding tables varies from 4 to 40, depending on the particular use case of the recommendation model (see Section 3 for details). In aggregate, embedding tables for a single recommendation model can consume up to 20GB of memory. Thus, systems running production-scale recommendation models require large, off-chip storage such as DRAM or dense non-volatile memory [23].
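The storage figures above follow from simple arithmetic: a table with a given number of rows and a given embedding dimension stores rows × dimension values. The sketch below assumes fp32 values (4 bytes each); the row count, embedding dimension, and number of tables are hypothetical and are not the configuration of any production model.

```python
def embedding_table_bytes(num_rows: int, embedding_dim: int,
                          bytes_per_value: int = 4) -> int:
    """Storage for one embedding table; 4 bytes per value assumes fp32."""
    return num_rows * embedding_dim * bytes_per_value


# Hypothetical model: 10 tables, each with 5 million rows of dimension 32.
hypothetical_tables = [(5_000_000, 32)] * 10
total_bytes = sum(embedding_table_bytes(rows, dim)
                  for rows, dim in hypothetical_tables)
print(f"{total_bytes / 1e9:.2f} GB")
```

Even this modest hypothetical configuration lands in the multi-GB range, far beyond on-chip cache capacity, which is why the tables must reside in off-chip memory.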

As shown in Figure 3, compared to typical FC and CNN layers, embedding tables exhibit low compute density and irregular memory access patterns. Recall that the embedding table operation (implemented as the SparseLengthsSum operator in Caffe2 [2] in production-scale recommendation models) entails reading a small subset of rows in the embedding table. The rows, indexed based on input sparse IDs, are combined using element-wise sum. While the entire embedding table is not read for a given input, the accesses follow a highly irregular memory access pattern. On an Intel Broadwell server present in production-environment data centers, this results in a high LLC cache miss rate. For instance, Figure 3 (left) shows that a typical SparseLengthsSum operator in production-scale recommendation models has an LLC cache miss rate of 8 MPKI [23], compared to 0.2 MPKI and 0.06 MPKI in an FC layer found in recommendation models and a CNN layer found in ResNet50 [31], respectively. Furthermore, the element-wise sum is a low-compute intensity operation. For instance, as shown in Figure 3 (right), SparseLengthsSum (SPS) has a compute density of 0.25 FLOPS/Byte compared to 18 FLOPS/Byte and 141 FLOPS/Byte for FC and CNN layers. Due to their highly irregular memory access pattern and low compute density, improving the efficiency of embedding table operations requires unique solutions, compared to the software and hardware acceleration approaches applied to FC and CNN layers.
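A back-of-the-envelope model helps explain the gap in compute density. The sketch below uses a simplified counting model that is an assumption of this example, not the methodology behind Figure 3: for the embedding sum it counts one addition per fp32 element read and ignores index and output traffic; for an FC layer it counts one multiply and one add per weight per input, and assumes weights, inputs, and outputs each move through memory once. The layer shapes are hypothetical.

```python
def sls_flops_per_byte(num_lookups: int, dim: int,
                       bytes_per_value: int = 4) -> float:
    """Compute density of an element-wise sum over looked-up rows."""
    flops = num_lookups * dim                       # one add per element read
    bytes_read = num_lookups * dim * bytes_per_value
    return flops / bytes_read


def fc_flops_per_byte(batch: int, in_dim: int, out_dim: int,
                      bytes_per_value: int = 4) -> float:
    """Compute density of a fully connected layer under the stated model."""
    flops = 2 * batch * in_dim * out_dim            # multiply + add
    values_moved = in_dim * out_dim + batch * in_dim + batch * out_dim
    return flops / (values_moved * bytes_per_value)


print(sls_flops_per_byte(num_lookups=80, dim=32))
print(fc_flops_per_byte(batch=1, in_dim=512, out_dim=512))
print(fc_flops_per_byte(batch=64, in_dim=512, out_dim=512))
```

Under this counting model the embedding sum yields 0.25 FLOPs per byte regardless of the number of lookups or the embedding dimension, which is consistent with the value reported for SparseLengthsSum. The FC density, in contrast, grows with batch size because the weights are reused across the batch.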

3 At-scale Personalization

In this section we describe model architectures for three classes of production-scale recommendation models, referred to as RM1, RM2, and RM3. The three model types are used across two different services and have different configurations based on their use-case. Model configurations vary in terms of number and size of embedding tables, the number of sparse IDs per embedding table, and the size of FC layers. These differences affect important execution characteristics such as execution time bottlenecks, compute density, and memory access patterns, which may lead to different system and micro-architecture solutions for efficient execution.

3.1 Production Recommendation Pipeline

As shown in Figure 4, personalized recommendation at Facebook is accomplished by hierarchically ranking content. Let's consider the example of recommending social media posts. When the user interacts with the web-based social media platform, a request is made for relevant posts. At any given time there may be thousands of relevant posts. Based on user preferences, the platform must recommend the top tens of posts. This is accomplished in two steps, filtering and ranking [15].

First, the set of thousands of possible posts is filtered down by orders of magnitude. This is accomplished using lightweight machine learning techniques such as logistic regression. Compared to using heavier DNN-based solutions, using lightweight techniques trades off higher accuracy for lower run-time. DNN-based recommendation models are used in the filtering step when higher accuracy is needed. One such example is recommendation model 1 (RM1).

Next, the subset of posts is ranked and the top tens of posts are shown to the user. This is accomplished using DNN-based recommendation models. Compared to recommendation models used for filtering content, DNN-based recommendation models for finer grained ranking are typically larger in terms of FC and embedding tables. For instance, in the case of ranking social media posts, the heavyweight recommendation model (RM3) comprises larger Bottom-FC layers. This is a result of the service using more dense features. In contrast, the heavyweight recommendation model (RM2) comprises more embedding tables as it processes content with more sparse features.
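The two-step structure can be summarized as a short sketch. This is a schematic of the filter-then-rank pattern only: the function name, the two scoring callables, and the cut-off sizes are hypothetical stand-ins for the lightweight and heavyweight models described above.

```python
from typing import Callable, List, Sequence


def recommend(candidates: Sequence[int],
              light_score: Callable[[int], float],
              heavy_score: Callable[[int], float],
              keep: int, top_k: int) -> List[int]:
    """Filter with a cheap scorer, then rank the survivors with a costly one."""
    # Filtering: the cheap scorer runs over every candidate.
    shortlisted = sorted(candidates, key=light_score, reverse=True)[:keep]
    # Ranking: the expensive scorer runs only over the shortlist.
    return sorted(shortlisted, key=heavy_score, reverse=True)[:top_k]


# Toy scorers standing in for a lightweight model and a large DNN.
posts = list(range(100))
top_posts = recommend(posts,
                      light_score=lambda post: float(post),
                      heavy_score=lambda post: -float(post),
                      keep=10, top_k=3)
print(top_posts)
```

The expensive scorer is invoked only `keep` times rather than once per candidate, which is the source of the run-time savings that motivates the hierarchy.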

SLA requirements: Note that in both steps, lightweight filtering and heavyweight ranking, many posts must be considered per user query. Each query must be processed within strict latency constraints set by the SLA. Based on the use case, the SLA requirements can vary from tens to hundreds of milliseconds [36, 15, 43]. Thus, when analyzing and optimizing recommendation systems in production data centers, it is important to consider not only single model latency but also throughput metrics under SLA requirements. In the data center, balancing throughput with strict latency requirements is accomplished by batching queries and co-locating multiple inferences on the same machine (see Section 5 and Section 6 for details).

3.2 Diversity of Recommendation Models

Table 1 shows representative architectural parameters for three classes of recommendation models that are used at Facebook: RM1, RM2, and RM3. As many variants of each type of model exist across production-scale recommendation systems, we provide a range of parameters for RM1, RM2, and RM3. While all three types of models follow the general RM architecture, shown in Figure 2, they are quite diverse in terms of number and size of embedding tables, embedding table lookups, and depth and width of FC layers. To highlight these differences we normalize each model feature to the smallest instance across all three models. Bottom and Top FC sizes are normalized to layer 3 in RM1. Number of embedding tables, and their input and output dimensions, are normalized to RM1 as well. The number of lookups (sparse IDs) per embedding table is normalized to RM3. Generally, RM1 is smaller in terms of FCs and embedding tables, RM2 has many embedding tables (memory intensive), and RM3 has larger FC layers (compute intensive).

Embedding tables: The number and size of embedding tables differ across the three classes of recommendation models. For instance, RM2 can have up to an order of magnitude more embedding tables than RM1 and RM3. This is because RM1 is a lightweight recommendation model used in the initial filtering step and RM3 is used in applications with fewer sparse features. Furthermore, while the output dimension of embedding tables is the same across the recommendation models (between 24 and 40), RM3 has the largest embedding tables in terms of input dimension. In aggregate, assuming 32-bit floating point datatypes, the storage capacity of embedding tables is on the order of 100MB, 10GB, and 1GB for RM1, RM2, and RM3, respectively. Thus, systems that run any of the three at-scale recommendation model types require dense, large, off-chip memory systems like DRAM or non-volatile memory.

Embedding table lookups: Embedding tables in RM1 and RM2 have more lookups (i.e., more sparse IDs) per input compared to RM3. This is because RM1 and RM2 are used in services with many sparse features while RM3 is used in recommending social media posts, which has fewer sparse features. Thus, RM1 and RM2 models perform more irregular memory accesses, leading to higher cache miss rates on off-the-shelf Intel server architectures found in the data center.

MLP layers: Bottom-FC layers for RM3 are generally much wider than those of RM1 and RM2. This is a result of using more dense features in ranking social media posts (RM3) compared to services powered by RM1 and RM2. Thus, RM3 is a more compute intensive model than RM1 and RM2. Finally, it is important to note that the width of FC layers is not necessarily a power of 2, or cache-line aligned, as the number of learned dense and sparse features is not necessarily an even power of 2.

Machines | Haswell | Broadwell | Skylake
Frequency | 2.5 GHz | 2.4 GHz | 2.0 GHz
Cores per socket | 12 | 14 | 20
Sockets | 2 | 2 | 2
L1 cache size | 32 KB | 32 KB | 32 KB
L2 cache size | 256 KB | 256 KB | 1 MB
L3 cache size | 30 MB | 35 MB | 27.5 MB
L2/L3 inclusive or exclusive | Inclusive | Inclusive | Exclusive
DRAM capacity | 256 GB | 256 GB | 256 GB
DDR frequency | 1600 MHz | 2400 MHz | 2666 MHz
DDR bandwidth per socket | 51 GB/s | 77 GB/s | 85 GB/s
Table 2: Description of machines present in data centers and used to run recommendation models

4 Experimental Setup

Server Architectures: Generally, data centers are composed of a heterogeneous set of server architectures with differences in compute and storage capabilities. Services are mapped to racks of servers to match their compute and storage requirements. For instance, ML inference in Facebook datacenters is run on CPU-based servers such as large dual-socket server-class Intel Haswell, Broadwell, or Skylake CPUs. These servers comprise large amounts of DRAM and wide-SIMD instructions that are used for running the memory and compute intensive ML inferences.

Table 2 describes key architecture features of the Intel CPU server systems considered in this paper. Compared to Skylake, Haswell and Broadwell servers have higher operating frequencies. For consistency, turbo boost is disabled for all experiments in this paper. On the other hand, the Skylake architecture has support for AVX-512 instructions, more parallel cores, and larger L2 caches. Furthermore, Haswell and Broadwell implement an inclusive L2/L3 cache hierarchy, while Skylake implements a non-inclusive/exclusive cache hierarchy [35, 34]. (For the remainder of this paper we will refer to Skylake’s L2/L3 cache hierarchy as exclusive.) Section 5 and Section 6 detail the tradeoff between the system and micro-architecture designs, and their impact on inference latency and throughput in the datacenter.

Synthetic recommendation models: To study the performance characteristics of recommendation models, we consider representative implementations of the three model types (RM1, RM2, and RM3) shown in Table 1. We analyze inference performance using a benchmark [42], which accurately represents the execution flow of production-scale models (Section 7). The benchmark is implemented in Caffe2 with Intel MKL as a backend library. All experiments are run with a single Caffe2 worker and Intel MKL thread.

We note that to maximize throughput (i.e., number of posts) processed under strict SLA requirements, inputs and models must be processed in parallel. This is accomplished by using non-unit batch-sizes and co-locating multiple models on a single system (see Section 5 and Section 6 for details). Finally, we point out that all input data and model parameters are stored in fp32 format with the NCHW data layout.

5 Understanding Inference Performance of a Single Model

In this section we analyze the performance of a single production-scale recommendation model running on server class Intel CPU systems. We highlight the following results.

Figure 5: (Left) Inference latency of three production-scale recommendation models (RM1, RM2, RM3) on an Intel Broadwell server, with unit batch size, varies by an order of magnitude. (Right) Breakdown of time spent in each operator also varies significantly across the three models.

Takeaway-message 1: Inference latency varies by 15× across production-scale recommendation models.

Figure 5 (left) shows the inference latency of the three classes of production-scale models, with unit batch-size, on an Intel Broadwell server. RM1 and RM2 have a latency of 0.04 ms and 0.30 ms, respectively. This is a result of RM2 having an order of magnitude more embedding tables. Compared to RM1 and RM2, however, RM3 has a much higher latency of 0.60 ms. This is because of the much larger FC layers found in RM3. Furthermore, we find significant latency differences between small and large implementations of each type of recommendation model. For instance, a large RM1 has a 2× longer inference latency as compared to a small RM1 model, due to more embedding tables and larger FC layers (see Table 1 for details).

Takeaway-message 2: While memory requirements are set by embedding tables, no single operator determines the execution time bottleneck across production-scale recommendation models.

Figure 5 (right) shows the breakdown of execution time for the three classes of production-scale models running on an Intel Broadwell server. The trends of operator level breakdown across the three recommendation models hold for different Intel server architectures (across Haswell, Broadwell, Skylake). When running compute intensive recommendation models, such as RM3, over 96% of the time is spent in either the BatchMatMul or FC operators. However, the BatchMatMul and FC operators comprise only 61% of the run-time for RM1. The remainder of the time is consumed by running SparseLengthsSum (20%), which corresponds to embedding table operations in Caffe2, Concat (6.5%), and element-wise activation functions. In contrast, for memory-intensive production-scale recommendation models, like RM2, SparseLengthsSum consumes 80% of the execution time for the model.

Thus, software and hardware acceleration of matrix multiplication operations alone (e.g., BatchMatMul and FC) will provide limited benefits on end-to-end performance across all three recommendation models. Solutions for optimizing the performance of recommendation models must consider efficient execution of non-compute intensive operations such as embedding table lookups.

Takeaway-message 3: Running production-scale recommendation models on Intel Broadwell optimizes single model inference latency.

Figure 6 compares the inference latency of running the recommendation models on Intel Haswell, Broadwell, and Skylake servers. We vary the input batch-size from 16, 128, to 256 for all three recommendation models RM1 (top), RM2 (center), and RM3 (bottom). For a small batch size of 16, inference latency is optimized when the recommendation models are run on the Broadwell architecture. For instance, compared to the Haswell and Skylake architectures, Broadwell sees 1.4× and 1.5× performance improvement for RM1, 1.3× and 1.4× performance improvement for RM2, and 1.32× and 1.65× performance improvement on RM3.

Figure 6: Inference latency of running RM1 (Top), RM2 (Center), and RM3 (Bottom) with batch sizes of 16, 128, and 256. While Broadwell is optimal for running inferences with low batch-sizes, Skylake demonstrates higher performance with larger batch-sizes. This is a result of wider-SIMD support with AVX-512 instructions on Skylake architectures. The horizontal line threshold indicates SLA requirements for models in low-latency recommendation systems (e.g., search [36, 15]).

At low batch sizes, Broadwell outperforms Skylake due to a higher clock frequency. As shown in Table 2, Broadwell has a 20% higher clock frequency compared to Skylake. While Skylake has wider-SIMD support with AVX-512 instructions, recommendation models with smaller batch sizes (e.g., less than 16) are memory bound and do not efficiently exploit the wider-SIMD instructions. For instance, we can measure the SIMD throughput by measuring the number of fp_arith_inst_retired.512b_packed_single instructions using the Linux perf utility. The SIMD throughput with a batch-size of 4 and 16 is 2.9× (74% of theoretical) and 14.5× (91% of theoretical) higher, respectively, compared with that at unit batch-size. As a result, at low batch-sizes Broadwell outperforms Skylake running recommendation models, due to higher clock frequency and inefficient use of AVX-512 support in Skylake.

Broadwell machines outperform Haswell machines due to a higher DRAM frequency. Haswell’s longer execution time compared to Broadwell is caused by slower performance running the SparseLengthsSum operator. Recall that the SparseLengthsSum operator is memory intensive. For instance, the LLC miss rate of the SparseLengthsSum operator itself is between 1 and 10 MPKI (see Figure 3). This corresponds to less than 1GB/s of DRAM bandwidth utilization, well under the 51GB/s DRAM bandwidth limit in Haswell (see Table 2). As a result, the performance difference between Broadwell and Haswell for the SparseLengthsSum operator comes from differences in memory latency (DRAM frequency) but not throughput (DRAM bandwidth). Haswell implements a slower DRAM (DDR3 at 1600MHz) as compared to Broadwell (DDR4 at 2400MHz), accounting for the performance difference between the two machines.
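The step from a miss rate in MPKI to a DRAM bandwidth estimate can be sketched as follows. This is rough arithmetic under explicit assumptions rather than a measurement: each LLC miss is assumed to fetch one 64-byte cache line, and the instruction retirement rate is a hypothetical input, since it is not stated in the text.

```python
def dram_bandwidth_gb_per_s(mpki: float, instructions_per_second: float,
                            cache_line_bytes: int = 64) -> float:
    """Estimate DRAM traffic from an LLC miss rate.

    Assumes every LLC miss transfers exactly one cache line from DRAM.
    """
    misses_per_second = mpki / 1000.0 * instructions_per_second
    return misses_per_second * cache_line_bytes / 1e9


# Hypothetical retirement rate of 1e9 instructions per second on one thread.
estimated_gb_per_s = dram_bandwidth_gb_per_s(mpki=10.0,
                                             instructions_per_second=1e9)
print(estimated_gb_per_s)
```

With these assumed inputs the estimate stays below 1 GB/s even at the upper end of the miss-rate range, which is in line with the argument that the operator is bound by memory latency rather than by bandwidth.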

Takeaway-message 4: While Skylake has wider-SIMD support, which should provide performance benefits on batched and compute-intensive inference, its throughput is sub-optimal due to irregular memory access patterns from embedding table lookups.

Recall that in production data centers, recommendation queries for many users and posts must be ranked simultaneously. One solution to improving overall system throughput is batching. As shown in Figure 6, Skylake exhibits lower run-time with higher batch-sizes. As a result, for use cases with strict latency constraints (e.g., around 10 ms for search [36, 15]), Skylake can process recommendations with higher batch-sizes.

This is a result of the Skylake architecture accelerating FC layers using wider-SIMD support with AVX-512 instructions compared to Broadwell and Haswell. However, exploiting the benefits of AVX-512 requires much higher batch-sizes, at least 128, for memory intensive production-scale recommendation models, such as RM1 and RM2. For compute-intensive models, like RM3, Skylake outperforms both Haswell and Broadwell starting at a batch-size of 64. These benefits are sub-optimal given Skylake (AVX-512) has a 2× and 4× wider SIMD width compared to Broadwell (AVX-2) and Haswell (AVX-2), respectively. For instance, Skylake runs the memory-intensive RM1 model 1.3× faster than Broadwell. This is due to the irregular memory access patterns from the embedding table lookups. In fact, the SparseLengthsSum operator becomes the run-time bottleneck in RM1 with sufficiently high batch-sizes.

Takeaway-message 5: Designers must consider a unique set of performance and resource requirement characteristics when accelerating DNN-based recommendation models.

First, solutions must balance low latency, for use cases with stricter SLAs (e.g., search [15, 36]), and high throughput, for use cases such as web-scale services. Thus, even for inference, hardware solutions must consider batching. This can affect whether performance bottlenecks come from the memory-intensive embedding-table lookups or compute-intensive FC layers. Next, optimizing end-to-end model performance of recommendation workloads requires full-stack optimization given the diverse memory capacity, compute intensity, and memory access pattern characteristics seen in representative implementations (e.g., RM1, RM2, RM3). For instance, a combination of aggressive compression and novel memory technologies [23] is needed to reduce the memory capacity requirements, set by large embedding tables, by orders of magnitude. Existing solutions of standalone FC accelerators [14, 36, 27, 13, 44, 54] will provide limited performance, area, and energy benefits to end-to-end recommendation workloads. Finally, accelerator architectures must balance flexibility with efficient resource utilization, in terms of memory capacity, bandwidth, and FLOPs, to support the diverse set of recommendation models used in data centers.

Figure 7: Impact of co-locating production-scale recommendation models on an Intel Broadwell server. As we increase the number of co-located models, the per-model latency increases due to shared system resources. We find that RM2 is the most affected by co-location, followed by RM1 and RM3. The increase in model latency (shown for RM1) due to co-location is mainly caused by an increase in time spent on FC (1.6×) and SparseLengthsSum (3×).

6 Understanding Effects of Co-locating Models

In addition to batching multiple items into a single inference, multiple RM inferences at Facebook are simultaneously run on the same server in order to service billions of requests world-wide. This translates to higher resource utilization. Co-locating multiple production-scale recommendation models on a single machine can however significantly degrade inference serving latency, trading off single model latency with server throughput.

We analyze the impact of co-location on both per-model latency and overall throughput. We find that the effects of co-location on latency and throughput depend not only on the type of production-scale recommendation model but also on the underlying server architecture. For instance, processor architectures with inclusive L2/L3 cache hierarchies (i.e., Haswell, Broadwell) are particularly susceptible to performance degradation and worse performance variability, as compared to processors with exclusive L2/L3 cache hierarchies (i.e., Skylake). This exposes opportunities for request scheduling optimization at the data center scale [39, 11].

Takeaway-message 6: Per-model latency degrades due to co-locating many production-scale recommendation models on a single machine. In particular, RM2 suffers from latency degradation more than RM1 and RM3 due to a higher degree of irregular memory accesses.

Figure 7 shows the degradation in per-model latency of RM1, RM2, and RM3 as we co-locate multiple instances of the recommendation models on a single machine. The per-model latency is normalized to the latency of running a single instance (N=1) of the recommendation models, to highlight the relative degradation in model latencies across the three types of models. All experiments are run on an Intel Broadwell architecture. Compared to RM1 and RM3, we find that RM2 suffers higher latency degradation. For instance, compared to a single inference per machine, per-model latency when simultaneously co-locating 8 production-scale models is 1.3×, 2.6×, and 1.6× slower for RM1, RM2, and RM3, respectively. At the data center scale, this introduces opportunities for optimizing the number of co-located models per machine in order to balance inference latency with overall throughput, that is, the number of items ranked under a strict latency constraint given by the SLA requirements.
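The latency/throughput trade-off can be illustrated with the slowdowns quoted above. The sketch assumes that aggregate throughput scales as the number of co-located instances divided by the per-model slowdown, a simplification that ignores scheduling and queuing effects and is not the throughput metric measured in this paper.

```python
# Per-model latency slowdown with 8 co-located models, relative to N=1
# (values quoted in the text for Broadwell).
SLOWDOWN_AT_8 = {"RM1": 1.3, "RM2": 2.6, "RM3": 1.6}

def relative_throughput(num_models, slowdown):
    """Aggregate throughput relative to a single instance.

    Assumption: every co-located instance completes inferences
    independently at 1/slowdown of the single-instance rate.
    """
    return num_models / slowdown

for model, slowdown in SLOWDOWN_AT_8.items():
    gain = relative_throughput(8, slowdown)
    print(f"{model}: {slowdown}x slower per model, ~{gain:.1f}x aggregate throughput")
```

Under this simplification RM2 gains the least from co-location, since its per-model slowdown is the largest of the three.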

Figure 7 also shows that the per-model latency degradation due to co-location is caused by lower performance for the FC and, in particular, SparseLengthsSum operators. As seen in RM1 and RM2, the fraction of time spent running the SparseLengthsSum operator increases with higher degrees of co-location. RM3 remains dominated by FC layers regardless of the number of co-located jobs. For instance, for RM2, compared to running a single model per machine, time spent on the FC and SparseLengthsSum operators increases by 1.6× and 3×, respectively. While the time spent on remaining operators, accumulated as "Rest", also increases by a factor of 1.6, the impact on the overall run-time of the production-scale recommendation model is marginal. Similarly, for RM1 the fraction of time spent running SparseLengthsSum increases from 15% to 35% when going from 1 job to 8 co-located RM1 inferences. The greater impact of co-location on SparseLengthsSum is due to its higher degree of irregular memory accesses, which, compared to FC, exhibit less cache reuse. Thus, while co-location improves overall throughput of high-end server architectures, it can shift performance bottlenecks when running production-scale recommendation models. This in turn can translate to less efficient resource utilization.

Figure 8: Tradeoff between latency and throughput as the number of co-located models increases. Starting from no co-location, latency quickly degrades on all three architectures before plateauing. As with batching, Broadwell performs best under low co-location (latency optimal) while Skylake is optimal under high co-location (throughput optimal). Skylake degradation around 18 co-located jobs is due to a sudden increase in LLC miss rate. Experiments are shown for RM2.
(a) FC operator in production-environment.
(b) Same FC operator under co-location
(c) Larger FC operator under co-location
Figure 9: (a) Performance distribution of an FC operator that fits in Skylake L2 cache and Broadwell LLC. The three highlighted modes correspond to Broadwell with low, medium, and high co-location. (b) Mean latency of the same FC operator (solid line) increases with more co-location. Gap between p5 and p99 latency (shaded region) increases drastically on Broadwell with high co-location and more gradually on Skylake. (c) Larger FC operator highlights the difference in Broadwell’s drastic p99 latency degradation compared to Skylake’s more gradual degradation. Differences between Broadwell and Skylake under high co-location are due to L2/L3 cache sizes and inclusive versus exclusive hierarchies.

Takeaway-message 7: Processor architectures with inclusive L2/L3 cache hierarchies (i.e., Haswell, Broadwell) are more susceptible to per-model latency degradation as compared to architectures with exclusive cache hierarchies (i.e., Skylake) due to the high degree of irregular memory accesses in production-level recommendation models.

Figure 8 shows the impact of co-locating a production-scale recommendation model on both latency and throughput across the Intel Haswell, Broadwell, and Skylake architectures. While the results shown are for RM2, the takeaways hold for RM1 and RM3 as well. Throughput is measured by the number of inferences per second and bounded by a strict latency constraint, set by the SLA requirements, of 450.

No co-location: Recall that in the case of running a single inference per machine, differences in model latency across servers are determined by operating frequency, support for wide-SIMD instructions, and DRAM frequency (see Section 5 for details). Similarly, when the number of co-located inferences is small (i.e., ), the Broadwell architecture has a 10% higher throughput and lower latency compared to Skylake.

Co-locating models: Increasing the number of co-located inferences, we find that Skylake outperforms both Haswell and Broadwell in terms of latency and throughput. Co-locating inferences on a single machine stresses the shared memory system causing the latency to degrade. This is particularly true for co-locating production-scale recommendation models that exhibit a high degree of irregular memory accesses. In contrast, traditional DNN inference exploits higher reuse in L1 and L2 caches. As a result we find that, in use cases with strict latency bounds (e.g., 3), Skylake provides the highest throughput by co-locating recommendation models on a single machine.

Skylake’s higher performance with more co-located inferences is a result of implementing an exclusive L2/L3 cache hierarchy as opposed to an inclusive one, as found in Haswell and Broadwell machines. Inclusive cache hierarchies suffer from a higher L2 cache miss rate compared to exclusive caches, due to the irregular memory access patterns in recommendation models. For instance, the L2 miss rate on the Broadwell architecture increases by 29% when running 16 co-located inferences (22 MPKI) compared to a single inference (17 MPKI). The Skylake architecture has not only a lower L2 miss rate (13 MPKI for single inference), but also a smaller L2 miss rate increase (10%). This is caused not only by a smaller L2 cache size in Broadwell machines, but also by a higher degree of cache back-invalidation due to an inclusive L2/L3 cache hierarchy. For instance, while Broadwell sees a 21% increase in L2 read-for-ownership (RFO) miss rate, L2 RFO miss rate increases by only 9% on Skylake. We also find that with a high number of co-located inferences (over 18), Skylake suffers from a sudden latency degradation. This is caused by a 5.5% increase in LLC miss rate.

Simultaneous multithreading/hyperthreading: In addition to co-locating recommendation models across cores in chip-multiprocessors, multithreading/hyperthreading improves resource utilization by enabling time-multiplexing execution for two models running on a physical core. Prior work has shown that simultaneous multithreading/hyperthreading in modern processors generally improves system throughput [47, 46].

However, hyperthreading degrades p99 latency for recommendation models, especially for compute-intensive ones (i.e., RM3). The results shown in Figure 8 are without hyperthreading, with one production-scale recommendation model per physical core. Enabling hyperthreading doubles the number of inference workloads per physical core. This causes FC and SparseLengthsSum run-times to degrade by 1.6× and 1.3×, respectively. The FC operator suffers more performance degradation as it exploits hardware for wide-SIMD instructions (i.e., AVX-2, AVX-512) that are time-shared across threads on the physical core. As a result, latency degradation due to hyperthreading is more pronounced in compute-intensive recommendation models (i.e., RM3) than memory-intensive ones (i.e., RM1, RM2). Note that latency degrades only on cores with hyperthreading enabled. At the data center scale, in the general case, not all cores will run two hyperthreaded recommendation models. Thus, hyperthreading impacts p99 latency more than average latency (see Section 6.1 for details on p99 latency degradation in production environments).
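The difference between the effect on mean and tail latency can be seen in a small deterministic example. It assumes a hypothetical fleet in which 5% of inferences land on hyperthreaded cores and run 1.6 times slower; the normalized baseline latency and the 5% share are illustrative and not measured values.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

BASELINE = 1.0      # normalized latency without hyperthreading
HT_SLOWDOWN = 1.6   # FC slowdown quoted in the text
HT_SHARE = 0.05     # hypothetical share of inferences on shared cores

n = 1000
num_slow = round(n * HT_SHARE)
latencies = [BASELINE] * (n - num_slow) + [BASELINE * HT_SLOWDOWN] * num_slow

mean = sum(latencies) / n
p99 = percentile(latencies, 99)
print(f"mean = {mean:.2f}, p99 = {p99:.2f}")
```

In this toy setting the mean rises by about 3% while the p99 latency rises by the full slowdown factor, which is the sense in which a minority of slowed inferences affects the tail far more than the average.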

6.1 At-Scale Recommendation Execution in Production Environment

The experiments thus far have used synthetic model implementations to study average latency and throughput across production-scale recommendation models, system architectures, and run-time configurations (e.g., batch-size, number of co-located models). However, data center scale production-environments are concerned with not only the average case but also tail performance [38, 22]. Here we study the impact of running at-scale inferences for recommendation systems in production-environments. In particular, we study the impact of co-located inferences on the run-time of individual operators. Following the earlier results using synthetic model implementations, production-scale data shows that Broadwell sees a larger performance degradation due to co-location as compared to Skylake.

Furthermore, inference for recommendation models running in the data center suffers from high performance variability. Prior work observed significant DGEMM performance variability on the Intel Xeon Platinum processor (Skylake) [40]. The variability is caused by increased DRAM traffic from snoop filter evictions. While Skylake implements an exclusive L2/L3 cache hierarchy, an “inclusive” snoop filter tracks lines held in the cache, leading to high performance variability. While we do not see performance variability in stand-alone recommendation models (Section 5 and Section 6), we find pronounced performance variability for recommendation models co-located in the production environment. In fact, p99 latency degrades faster, as the number of co-located inferences increases, on Broadwell machines (inclusive L2/L3 caches) as compared to Skylake. This exposes opportunities for optimizing data center level scheduling decisions to trade off latency and throughput with performance variability.

Takeaway-message 8: While co-locating production-scale recommendation models with irregular memory accesses may increase the overall throughput, it introduces significant performance variability in production environments.

As an example of performance variability in recommendation systems in production environments, Figure 9(a) shows the latency distribution of a single FC operator typically found in all three types of production-scale recommendation models (i.e., RM1, RM2, and RM3). The production environment has a number of differences compared to co-locating inferences using the synthetic model implementation, including using a job scheduler that implements a thread pool with a separate queuing model for mapping inferences to cores, and exploiting hyper-threading. Despite fixing the input and output dimensions of the FC operator to 512, the performance distribution varies significantly across Broadwell and Skylake architectures. In particular, Skylake sees a single common mode (45) whereas Broadwell follows a multi-modal distribution with modes at 40, 58, and 75. This is a result of co-locating inferences on both architectures in production environments.

Figure 9(b) shows the impact on latency for the same FC operator under varying degrees of co-located inferences on Broadwell and Skylake in the production data-center environment. All inferences co-locate the FC operator with end-to-end RM1 inferences. Inferences are first co-located to separate physical cores (i.e., 24 for Broadwell, 40 for Skylake) and then exploit hyper-threading. The solid lines illustrate the average operator latency on Broadwell (red) and Skylake (blue), while the shaded regions represent the p5 (bottom) and p99 (top) latencies.

Three key observations are made here. First, as expected, average latency generally increases with a higher number of co-located jobs. On the Broadwell architecture, the average latency of the FC operator can be categorized into three regions: latency of around 40 under no co-location, latency of around 60 under low co-location (5-15 jobs), and latency of around 100 under high co-location (over 20 jobs). This roughly corresponds to the modes seen in the overall performance distributions in Figure 9(a). Second, the p99 latency increases significantly with high co-location (over 20 jobs) on the Broadwell architecture. Thus, while the average throughput of the system increases with co-location, it sacrifices predictably meeting SLA requirements due to performance variability. Third, compared to Broadwell, the average and p99 latency increases more gradually on Skylake. Recall that this is a result of a larger L2 cache and an exclusive L2/L3 cache hierarchy in Skylake: co-locating production-scale recommendation models with irregular memory accesses has a smaller impact on the shared memory system.

Figure 9(c) runs the same experiments for a much larger FC operator to further highlight the key observations: (1) three regions (no co-location, 10-15 co-located jobs, more than 20 co-located jobs) of operator latency on Broadwell, (2) large increase in p99 latency under high co-location, and (3) gradual degradation of average and p99 latency on Skylake. Thus, in production environments Broadwell suffers from higher performance variability as compared to Skylake. This exposes opportunities for optimizing request level scheduling decisions in the data center to balance latency, throughput, and performance variability.

7 Open-Source Benchmark

Recall that at-scale recommendation models pose unique challenges. First, embedding table operations, which are central to recommendation models, exhibit qualitatively different memory and compute characteristics compared to traditional DNNs, as shown in Figure 3. Second, a diverse set of recommendation models (e.g., RM1, RM2, RM3) is found in data center production environments. Finally, the performance bottlenecks and the optimal system configurations of this diverse set of recommendation models change under different run-time characteristics (Section 5 and Section 6).

Furthermore, DNN-based recommendation systems built on NCF [32] are not representative of the ones used in the data center. For instance, compared to production-scale recommendation workloads, the NCF workload from MLPerf [4] has orders of magnitude smaller embedding tables and fewer FC parameters (Figure 1). Consequently, FC comprises over 90% of the execution time in NCF; in contrast, SparseLengthsSum comprises around 80% of the cycles in RM1 (with batching) and RM2. The higher fraction of cycles devoted to SparseLengthsSum is due to RM1 and RM2 implementing tens of embedding tables with dozens of sparse index lookups (irregular memory accesses) per embedding table.
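For readers unfamiliar with the operator, the following pure-Python sketch mirrors the semantics of a SparseLengthsSum-style pooled embedding lookup: for each output row, it gathers a variable number of rows from an embedding table and sums them. It is a reference illustration only, not the optimized Caffe2 kernel, and the table contents and indices are made up.

```python
def sparse_lengths_sum(table, indices, lengths):
    """Sum groups of embedding rows.

    table:   list of embedding rows (each a list of floats)
    indices: flat list of row ids to look up
    lengths: number of lookups belonging to each output row
    """
    if sum(lengths) != len(indices):
        raise ValueError("lengths must sum to len(indices)")
    dim = len(table[0])
    output, start = [], 0
    for count in lengths:
        pooled = [0.0] * dim
        # Each lookup touches an arbitrary row: the irregular access pattern.
        for row_id in indices[start:start + count]:
            for j, value in enumerate(table[row_id]):
                pooled[j] += value
        output.append(pooled)
        start += count
    return output

# Toy 4-row table with 2-dimensional embeddings (made-up values).
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = sparse_lengths_sum(table, indices=[0, 2, 3, 1], lengths=[3, 1])
print(result)
```

Because consecutive row ids are unrelated, each gather may land on a different cache line, which is why this operator shows far less cache reuse than the dense matrix multiplications in FC layers.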

In this section, we describe the parameters of an open-source benchmark that was used to represent data center scale implementations of deep learning recommendation models [42]. In this paper we use the benchmark's Caffe2 implementation; Caffe2 is a highly optimized deep learning framework used in production-scale ML services [2]. The goal here is to close the gap between currently available and realistic production-scale benchmarks.

The open-source benchmark was designed with the flexibility to not only study the production scale models seen in this paper (i.e., RM1, RM2, RM3), but also a wider set of realistic recommendation models used in data centers. For example, recommendation systems applied to personalizing entertainment content (e.g., ranking video) [21] have a similar overall architecture and therefore can also potentially be modeled with this benchmark.

Figure 10: Overall architecture of the open-source recommendation model system. All configurable parameters are outlined in blue.

7.1 Configuring the open-source benchmark

To facilitate ease of use and maximize flexibility, the open-source DLRM benchmark implementation provides a suite of tunable parameters to define an end-to-end, inference-only recommendation system in Caffe2 [2]. Figure 10 illustrates the configurable parameters in the open-source implementation which can be set to resemble production scale RM. The set of configurable parameters includes: (1) the number of embedding tables, (2) input and output dimensions of embedding tables, (3) number of sparse lookups per embedding table, (4) depth and width of MLP layers for dense features (Bottom-MLP), and (5) depth and width of MLP layers after combining dense and sparse features (Top-MLP). Together these can be used to represent the production scale recommendation models shown in Table 1.

Example configurations for RM1, RM2, and RM3: As an example of how to configure the open-source benchmark to represent production-scale recommendation workloads, let's consider an RM1 model shown in Table 1. In this model the number of embedding tables can be set to 5, with input and output dimensions of and , the number of sparse lookups to , depth and width of Bottom-MLP layers to and , and the depth and width of Top-MLP layers to and . The RM2 and RM3 models have been configured similarly.
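The five groups of configurable parameters could be captured in a small configuration object, sketched below. The sketch is hypothetical: the class and field names are not the benchmark's actual options, and apart from the five embedding tables mentioned above, the numeric values are placeholders rather than the RM1 settings from Table 1. It also estimates embedding storage, assuming 4-byte (fp32) entries.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecModelConfig:
    num_tables: int          # (1) number of embedding tables
    table_rows: int          # (2) input dimension of each table
    embedding_dim: int       # (2) output dimension of each table
    lookups_per_table: int   # (3) sparse lookups per table
    bottom_mlp: List[int]    # (4) layer widths of the Bottom-MLP
    top_mlp: List[int]       # (5) layer widths of the Top-MLP

    def embedding_bytes(self, bytes_per_value: int = 4) -> int:
        """Storage for all tables, assuming fp32 entries by default."""
        return self.num_tables * self.table_rows * self.embedding_dim * bytes_per_value

    def lookups_per_item(self) -> int:
        """Embedding rows gathered for one ranked item."""
        return self.num_tables * self.lookups_per_table

# Placeholder values for illustration only (not taken from Table 1).
cfg = RecModelConfig(
    num_tables=5, table_rows=1_000_000, embedding_dim=32,
    lookups_per_table=80, bottom_mlp=[128, 64, 32], top_mlp=[256, 64, 1],
)
print(cfg.embedding_bytes() / 1e6, "MB of embeddings")
print(cfg.lookups_per_item(), "row lookups per item")
```

Keeping the table shape, lookup count, and MLP widths as separate fields makes it straightforward to sweep one dimension at a time, for example to move a model from FC-dominated to embedding-dominated behavior.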

7.2 Using the open-source benchmark

Finally, we note that in this paper the open-source DLRM benchmark is used to study the performance of recommendation workloads on server-class CPUs. By varying the batch, FC, and embedding table configurations, it can also be used to study the compute and bandwidth requirements of a diverse set of recommendation models. More generally, the benchmark has also been used to analyze scheduling decisions, such as running recommendation models across many nodes (distributed inference) or many threads. The open-source benchmark can also be used to study memory systems and help answer questions about intelligent pre-fetching/caching techniques and emerging memory technologies (e.g., non-volatile memories).

8 Related Work

While the systems and computer architecture community has devoted significant efforts to performance analysis and optimization for DNNs, relatively little focus has been devoted to personalized recommendation systems. This section first reviews DNN-based solutions for personalized recommendation. This is followed by a discussion on state-of-the-art performance analysis and optimizations for DNNs with context on how the proposed techniques relate to personalized recommendation systems.

DNN-based solutions to personalized recommendation: Compared to image classification [31], object detection [45], and speech recognition [29, 9, 10], which process input features, inputs to personalized recommendation are a mix of both dense and sparse features. NCF [32] uses a combination of embedding tables, FC layers, and ReLU non-linearities using the open-source MovieLens-20m dataset [5]. Dense and sparse features are combined using a series of matrix-factorization and FC layers. In [21], the authors discuss applying this model architecture to YouTube video recommendation. A similar model architecture is applied to predict click-through rates [49]. More generally, [15] explores the accuracy tradeoff of wide (few FC layers and embedding tables) and deep (many FC layers and embedding tables) networks for serving recommendation in the Google Play Store mobile application. The authors find that accuracy is optimized using a combination of wide and deep neural networks, similar to the production-scale recommendation models (i.e., RM1, RM2, RM3) considered in this paper. While ongoing research is exploring the use of CNNs and RNNs in recommendation systems [52], for the purposes of this paper we focus on production-scale recommendation models which use a combination of FC layers, embedding tables, element-wise operations, and non-linearities.

DNN performance analysis and optimization: Current publicly available benchmarks [19, 55, 8, 3] for DNNs focus on neural networks with FC, CNN, and RNN layers only. In combination with open-source implementations of state-of-the-art networks in high-level deep learning frameworks [7, 2, 6], the benchmarks have enabled thorough performance analysis and optimization. However, the resulting software and hardware solutions [36, 44, 13, 27, 54, 14] do not directly apply to production-scale recommendation workloads. In particular, production-scale recommendation workloads pose unique challenges in terms of memory capacity, irregular memory accesses, diversity in compute-intensive and memory-intensive models, and high-throughput and low-latency optimization targets. Furthermore, available implementations of DNN-based recommendation systems (i.e., MLPerf NCF [4]) are not representative of production-scale systems. Some of the limitations, such as memory capacity, are discussed in previous work [43]. To alleviate memory capacity and bandwidth constraints, Eisenman et al. propose storing recommendation models in non-volatile memories with a small amount of DRAM to cache embedding-table queries [23]. The detailed performance analysis in this paper will enable future work to consider a broader set of solutions to optimize end-to-end personalized recommendation systems currently running in data centers and motivate additional optimization techniques that address challenges specifically for mobile [50].

9 Conclusion

In this paper we provide detailed performance analysis of recommendation models on server-scale systems present in the data center. The analysis demonstrates that DNNs for recommendation pose unique challenges to efficient execution as compared to traditional CNN and RNN architectures, which have been the focus of the systems and computer architecture community. In particular, recommendation systems require much larger storage capacity, produce irregular memory accesses, and consist of a diverse set of operator-level performance bottlenecks. The analysis also shows that based on the performance target (i.e., latency versus throughput) and the recommendation model being run, the optimal platform and run-time configuration varies.

Micro-architectural platform features, such as processor frequency and core count, SIMD width and utilization with varying batch-size, cache capacity, inclusive versus exclusive cache hierarchies (when co-locating models), and DRAM configurations, expose request level scheduling optimization opportunities for running recommendation model inference in the data center. As an example, for the server-scale systems considered in this paper, Broadwell achieves up to 40% lower latency while Skylake achieves 30% higher latency-bounded throughput with batching. In addition to the architectural implications for stand-alone recommendation systems, this paper studies the effect of inference co-location and hyperthreading, as mechanisms to improve resource utilization, on performance variability in the data center. The detailed performance analysis of production-scale recommendation models lays the foundation for future full-stack hardware solutions targeting personalized recommendation.