TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

08/08/2019, by Youngeun Kwon, et al. (KAIST)

Recent studies from several hyperscalars pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithm deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-data processing cores tailored for DL tensor operations. These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, which GPUs can utilize for scalable memory bandwidth and capacity expansion. A prototype implementation of our proposal on real DL systems shows an average 6.0-15.7x performance improvement on state-of-the-art recommender systems.


I. Introduction

Machine learning (ML) algorithms based on deep neural networks (DNNs), also known as deep learning (DL), are scaling up rapidly. To satisfy the computation needs of DL practitioners, GPUs or custom-designed accelerators for DNNs, also known as neural processing units (NPUs), are being widely deployed for accelerating DL workloads. Despite prior studies on enhancing the compute throughput of GPUs/NPUs [1, 2, 3, 4, 5, 6, 7, 8], a research theme that has received relatively less attention is how computer system architects should go about tackling the "memory wall" problem in DL: emerging DL algorithms demand both high memory capacity and bandwidth, limiting the types of applications that can be deployed under practical constraints. In particular, recent studies from several hyperscalars [9, 10] pinpoint embedding lookups and tensor manipulations (aka embedding layers, Section II-C) as the most memory-intensive algorithm deployed in datacenters, already reaching several hundreds of GBs of memory footprint, even for inference. Common wisdom in conventional DL workloads (e.g., convolutional and recurrent neural networks, CNNs and RNNs) was that convolutions or matrix multiplications account for the majority of inference time [5, 11]. However, emerging DL applications employing embedding layers exhibit drastically different characteristics, as the embedding lookups and tensor manipulations (i.e., feature interaction in Figure 1) account for a significant fraction of execution time. Facebook states that embedding lookups and tensor manipulation operations take up a significant fraction of the execution time of all DL workloads deployed in their datacenters [9].


Fig. 1: Topological structure of emerging DL applications. The figure is reproduced from Facebook's keynote speech at the Open Compute Project summit [12], which calls for architectural solutions providing "High memory bandwidth and capacity for embeddings". This paper specifically addresses this important memory wall problem in emerging DL applications: i.e., the non-MLP portions (yellow) in this figure.

Given this landscape, this paper focuses on addressing the memory capacity and bandwidth challenges of embedding layers (Figure 1). Specifically, we focus our attention on recommender systems [13] using embeddings, which are among the most common DL workloads deployed in today's datacenters for numerous application domains such as advertisements, movie/music recommendations, and news feeds. (While this paper uses recommender systems as a driving example to demonstrate the merits of our proposal, TensorDIMM is applicable to any DNN application utilizing embeddings, e.g., transformers using attention [14, 15] and memory-augmented neural networks [16].) As detailed in Section II-C, the model size of embedding layers (typically several hundreds of GBs [10, 9]) far exceeds the memory capacity of GPUs. As a result, the solution vendors take is to store the entire embedding lookup table inside the capacity-optimized, low-bandwidth CPU memory and deploy the application 1) by using only CPUs for the entire computation of DNN inference, or 2) by employing a hybrid CPU-GPU approach where the embedding lookups are conducted on the CPU while the rest are handled on the GPU. We observe that both these approaches leave significant performance on the table, experiencing substantial slowdowns compared to a hypothetical GPU-only version which assumes the entire embeddings can be stored in GPU memory (Section III-B). Through a detailed application characterization study, we root-cause this performance loss to the following factors. First, the embedding vectors are read out of the low-bandwidth CPU memory, incurring significant latency overheads compared to when the embeddings are read out of the bandwidth-optimized GPU memory. Second, the low computation throughput of CPUs can significantly slow down the computation-heavy DNN execution step when solely relying on CPUs, whereas the hybrid CPU-GPU version suffers from the latency of copying the embeddings from CPU to GPU memory over the thin PCIe channel.

To tackle these challenges, we present a vertically integrated, hardware/software co-design that fundamentally addresses the memory (capacity and bandwidth) wall problem of embedding layers. Our proposal encompasses multiple levels in the hardware/software stack as detailed below.

(Micro)architecture. We present TensorDIMM, which is based on commodity buffered DIMMs but further enhanced with near-data processing (NDP) units customized for key DL tensor operations, such as embedding gathers and reductions. The NDP units in TensorDIMM are designed to conduct embedding gathers and reduction operations “near-DRAM” which drastically reduces the latency in fetching the embedding vectors and reducing them, providing significant improvements in effective communication bandwidth and performance. Additionally, our proposal leverages commodity DRAM devices as-is, so another key advantage of TensorDIMM as opposed to prior NDP architectures [17, 18, 19] is its practicality and ease of implementation.

ISA extension and runtime system. Building on top of TensorDIMM, we propose a custom tensor ISA and runtime system that provide scalable memory bandwidth and capacity expansion for embedding layers. Our proposal entails 1) a carefully designed ISA tailored for DL tensor operations (TensorISA), 2) an efficient address mapping scheme for embeddings, and 3) a runtime system that effectively utilizes TensorDIMMs for tensor operations. TensorISA has been designed from the ground up to conduct key DL tensor operations in a memory bandwidth-efficient manner. An important challenge with conventional memory systems is that, regardless of how many DIMMs are physically available per memory channel, the maximum memory bandwidth provided to the memory controller is fixed. TensorISA has been carefully co-designed with both TensorDIMM and the address mapping scheme so that the aggregate memory bandwidth provided to our NDP units increases proportionally to the number of TensorDIMMs. In effect, our proposal offers a platform for scalable memory bandwidth expansion for embedding layers. Compared to the baseline system, our default TensorDIMM configuration offers a significant increase in effective memory bandwidth for key DL tensor operations.

System architecture. Our final proposition is to aggregate a pool of TensorDIMM modules into a disaggregated memory node (henceforth referred to as TensorNode) in order to provide scalable memory capacity expansion. A key aspect of our proposal is that TensorNode is interfaced inside the NVLINK-compatible, GPU-side high-bandwidth interconnect. (While we discuss TensorDIMM's merits under the context of a high-bandwidth GPU-side interconnect, the effectiveness of our proposal remains intact for NPUs, e.g., Facebook's Zion high-bandwidth interconnect [20].) In state-of-the-art DL systems, the GPUs are connected to a high-bandwidth switch such as NVSwitch [21], which allows high-bandwidth, low-latency data transfers between any pair of GPUs. Our proposal employs a pooled memory architecture "inside" the high-bandwidth GPU-side interconnect, fully populated with capacity-optimized memory DIMMs (in our case, the TensorDIMMs). The benefits of interfacing TensorNode within the GPU interconnect are clear: by storing the embedding lookup tables inside the TensorNode, GPUs can copy embeddings in/out much faster than the conventional CPU-GPU based approaches (i.e., substantially faster than over PCIe, assuming NVLINK(v2) [22]). Furthermore, coupled with the memory bandwidth amplification effects of TensorISA, our TensorNode offers scalable expansion of "both" memory capacity and bandwidth. Overall, our vertically integrated solution provides significant average speedups in inference time compared to the CPU-only and hybrid CPU-GPU implementations of recommender systems. To summarize our key contributions:

  • To the best of our knowledge, this work is the first to explore architectural solutions for embedding layers, an important building block in emerging DL applications.

  • We propose TensorDIMM, a practical NDP architecture built on top of commodity DRAMs which offers a scalable increase in both memory capacity and bandwidth for tensor operations.

  • We propose a TensorDIMM-based disaggregated memory system for DL inference called TensorNode. The efficiency of our solution is demonstrated with a proof-of-concept software prototype of TensorNode on a high-end GPU system, achieving significant performance improvements over conventional approaches.

II. Background

II-A. Buffered DRAM Modules

In order to balance memory capacity and bandwidth, commodity DRAM devices that are utilized in unison compose a rank. One or more ranks are packaged into a memory module, the most popular form factor being the dual-inline memory module (DIMM), which exposes a set of data I/O (DQ) pins. Because a memory channel is typically connected to multiple DIMMs, high-end CPU memory controllers often need to drive hundreds of DRAM devices in order to deliver the command/address (C/A) signals through the memory channel. Because modern DRAMs operate in the GHz range, having hundreds of DRAM devices driven by a handful of memory controllers leads to signal integrity issues. Consequently, server-class DIMMs typically employ a buffer device per DIMM (e.g., registered DIMM [23] or load-reduced DIMM [24]) which repeats the C/A signals to reduce the high capacitive load and resolve signal integrity issues. Several prior works from both industry [25] and academia [26, 27, 28] have explored the possibility of utilizing this buffer device space to add custom logic designs that address specific application needs. IBM's Centaur DIMM [25], for instance, utilizes the buffer device to add an eDRAM L4 cache and a custom interface between the DDR PHY and IBM's proprietary memory interface.

II-B. System Architectures for DL

As the complexity of DL applications skyrockets, there has been a growing trend towards dense, scaled-up system node designs with multiple PCIe-attached co-processor devices (i.e., DL accelerators such as GPUs/NPUs [29, 30]) to address the problem size growth. A multi-accelerator solution typically works on the same problem in parallel with occasional inter-device communication to share intermediate data [31]. Because such inter-device communication often lies on the critical path of parallelized DL applications, system vendors are employing high-bandwidth interconnection fabrics that utilize custom high-bandwidth signaling links (e.g., NVIDIA's DGX-2 [32] or Facebook's Zion system interconnect fabric [20]). NVIDIA's DGX-2 [32], for instance, contains sixteen GPUs, all of which are interconnected using an NVLINK-compatible high-radix (crossbar) switch called NVSwitch [21]. NVLINK provides 25 GB/sec of full-duplex uni-directional bandwidth per link [22], so any given GPU within DGX-2 can communicate with any other GPU at a full uni-directional bandwidth of up to 150 GB/sec via NVSwitch. Compared to the thin uni-directional bandwidth of 16 GB/sec (x16) under the CPU-GPU PCIe(v3) bus, such high-bandwidth GPU-side interconnects enable an order of magnitude faster data transfers.

II-C. DL Applications with Embeddings

Recommender systems. Conventional DL applications for inference (e.g., CNNs and RNNs) generally share a common property: their overall memory footprint fits within the (tens of GBs of) GPU/NPU physical memory capacity. However, recent studies from several hyperscalars [9, 10] call out imminent system-level challenges in emerging DL workloads that are extremely memory (capacity and bandwidth) limited. Specifically, both hyperscalars pinpoint embedding layers as the most memory-intensive algorithm deployed in their datacenters. One of the most widely deployed DL applications using embeddings is the recommender system [13], which is used in numerous application domains such as advertisements (Amazon, Google, eBay), social networking services (Facebook, Instagram), movie/music/image recommendations (YouTube, Spotify, Fox, Pinterest), news feeds (LinkedIn), and many others. Recommendation is typically formulated as the problem of predicting the probability of a certain event (e.g., the probability of a Facebook user clicking "like" for a particular post), where an ML model estimates the likelihood of one or more events happening at the same time. Events or items with the highest probability are ranked higher and recommended to the user.

Without going into a comprehensive review of numerous prior studies, we emphasize that current state-of-the-art recommender systems have (not surprisingly) evolved into utilizing DNNs. While there exist variations in how the DNNs are constructed, a commonly employed topological structure for recommender systems is the neural-network-based collaborative filtering algorithm [33]. Further advances led to the development of more complex models with wider and deeper vector dimensions, which have successfully been applied and deployed in commercial user-facing products [9].


Fig. 2: A DNN-based recommender system. The embedding layer typically consists of the following two steps. (1) In the embedding lookup stage, embeddings are gathered from the (potentially multiple; two in this example) lookup tables, up to the batch size, to form (batched) embedding tensors. (2) These tensors go through several tensor manipulation operations to form the final embedding tensor that is fed into the DNNs. Our proposal utilizes custom tensor ISA extensions (GATHER/REDUCE/AVERAGE) to accelerate this process, which we detail in Section IV-D.

Embedding lookups and tensor manipulation. Figure 2 illustrates the usage of embedding layers in recommender systems that incorporate DNNs [33]. The inputs to the DNNs (which are typically constructed using fully-connected layers or multi-layer perceptrons, FCs or MLPs) are a combination of dense and sparse features. The dense features are commonly represented as a vector of real numbers, whereas sparse features are initially represented as indices of one-hot encoded vectors. These one-hot indices are used to query the embedding lookup table, projecting the sparse indices into a dense vector space. The contents stored inside these lookup tables are called embeddings, which are trained to extract deep learning features (e.g., Facebook trains embeddings to capture which pages a particular user liked, which are then utilized to recommend relevant content or posts to the user [9]). The embeddings read out of the lookup table are combined with other dense embeddings (feature interaction in Figure 1) from other lookup tables using tensor concatenation or tensor "reduction" operations such as element-wise addition/multiplication/averaging to generate an output tensor, which is then forwarded to the DNNs to derive the final event probability (Figure 2).
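For concreteness, the following minimal NumPy sketch (ours, for illustration only; the table sizes, dimensions, and batch size are made-up values) mirrors this dataflow, with a batched gather from two lookup tables followed by an element-wise reduction:

```python
import numpy as np

# Hypothetical sizes for illustration only.
NUM_ITEMS, EMBED_DIM, BATCH = 1000, 64, 4

# Two embedding lookup tables (trained parameters in a real deployment).
table_a = np.random.rand(NUM_ITEMS, EMBED_DIM).astype(np.float32)
table_b = np.random.rand(NUM_ITEMS, EMBED_DIM).astype(np.float32)

# Sparse features arrive as indices (the positions of the one-hot bits).
idx_a = np.random.randint(0, NUM_ITEMS, size=BATCH)
idx_b = np.random.randint(0, NUM_ITEMS, size=BATCH)

# (1) Embedding lookup: batched row gathers form dense embedding tensors.
emb_a = table_a[idx_a]            # shape: (BATCH, EMBED_DIM)
emb_b = table_b[idx_b]            # shape: (BATCH, EMBED_DIM)

# (2) Feature interaction via an element-wise reduction (a sum here;
#     averaging or concatenation are common alternatives).
reduced = emb_a + emb_b           # shape: (BATCH, EMBED_DIM)
# 'reduced' is then forwarded to the downstream MLP/FC layers.
```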

Memory capacity limits of embedding layers. A key reason why embedding layers consume significant memory capacity is that each user (or each item) requires a unique embedding vector inside the lookup table. The total number of embeddings therefore scales proportionally to the number of users/items, causing the overall memory footprint to exceed several hundreds of GBs just to keep the model weights themselves, even for inference. Despite their high memory requirements, embedding layers are favored in DL applications because they help improve model quality: given enough memory capacity, users seek to further increase the model size of these embedding layers using larger embedding dimensions or by using multiple embedding tables to combine multiple dense features via tensor reduction operations.
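As a rough, back-of-the-envelope illustration (the row count and dimension below are assumed values, not figures from the paper), the lookup-table footprint grows as rows × dimension × bytes per element:

```python
# Back-of-the-envelope embedding table footprint (illustrative numbers).
num_rows   = 500_000_000   # e.g., users or items; assumed value
embed_dim  = 64            # embedding vector dimension; assumed value
bytes_elem = 4             # FP32

footprint_bytes = num_rows * embed_dim * bytes_elem
print(f"{footprint_bytes / 2**30:.1f} GiB per table")  # ~119.2 GiB

# Several such tables (one per sparse feature) quickly exceed the tens
# of GBs of bandwidth-optimized GPU/NPU memory, even for inference.
```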

III. Motivation


Fig. 3: Model size growth of the neural collaborative filtering (NCF [33]) based recommender system when the MLP layer dimension size (x-axis) and the embedding vector dimension size (y-axis) are scaled up. The experiment assumes each embedding lookup table contains millions of users and items. As shown, larger embeddings, rather than larger MLP dimensions, cause a much more dramatic increase in model size.

III-A. Memory (Capacity) Scaling Challenges

High-end GPUs or NPUs commonly employ bandwidth-optimized, on-package 3D stacked memory (e.g., HBM [34] or HMC [35]) in order to deliver the highest possible memory bandwidth to the on-chip compute units. Compared to the capacity-optimized DDRx most commonly adopted in CPU servers, bandwidth-optimized stacked memory is capacity-limited, only available with several tens of GBs of storage. While one might expect future solutions in this line of products to benefit from higher memory density, there are technological constraints and challenges in increasing the capacity of these 3D stacked memories in a scalable manner. First, stacking more DRAM layers vertically is constrained by the chip pinout required to drive the added DRAM stacks, its wireability on top of silicon interposers, and thermal constraints. Second, the current generation of GPUs/NPUs is already close to the reticle limits of processor die size (NVIDIA's V100, for instance, has already reached the reticle limit in die area, forcing researchers to explore alternative options such as multi-chip-module solutions [1] to continue computational scaling), so adding more 3D stack modules within a package inevitably sacrifices area budget for compute units.

III-B. Memory Limits in Recommender Systems

As discussed in Section II-C, the model size of embedding lookup tables is on the order of several hundreds of GBs, far exceeding the memory capacity limits of GPUs/NPUs. Due to the memory scaling limits of on-package stacked DRAMs, the solution vendors take today is to first store the embedding lookup tables in the (capacity-optimized but bandwidth-limited) CPU memory and read out the embeddings from there. Two possible implementations beyond this step are as follows. The CPU-only version (CPU-only) goes through the rest of the inference process using the CPU without relying upon the GPU. A hybrid CPU-GPU approach (CPU-GPU) [36], on the other hand, copies the embeddings to the GPU memory over PCIe using cudaMemcpy, and once the CPU→GPU data transfer is complete, the GPU initiates various tensor manipulation operations to form the input tensors to the DNN, followed by the actual DNN computation step (Figure 2).

Given these circumstances, DL practitioners as well as system designers are faced with a conundrum when trying to deploy recommender systems for inference. From a DL algorithm developer's perspective, one seeks to add more embeddings (i.e., more embedding lookup tables to combine various embedding vectors using tensor manipulations, such as tensor reduction) and to increase embedding dimensions (i.e., larger embeddings) for complex feature interactions, as this improves model quality. Unfortunately, fulfilling the needs of these algorithm developers bloats up overall memory usage (Figure 3) and inevitably results in relying on the capacity-optimized CPU memory to store the embedding lookup tables. Through a detailed characterization study, we root-cause the following three factors as key limiters of prior approaches (Figure 4) that rely on CPU memory for storing embeddings:


Fig. 4: Performance of the baseline CPU-only and hybrid CPU-GPU versions of the recommender systems, normalized to an oracular GPU-only version. B(N) represents an inference with batch size N. CPU-only exhibits a performance advantage over the CPU-GPU version for some low-batch inference scenarios, but both CPU-only and CPU-GPU generally suffer significant performance loss relative to the oracular GPU-only. Section V details our evaluation methodology.
  1. As embedding lookup tables are stored in the low-bandwidth CPU memory, reading out embeddings (i.e., the embedding gather operation) adds significant latency compared to an unbuildable, oracular GPU-only version (GPU-only) which assumes infinite GPU memory capacity. Under GPU-only, the entire embeddings can be stored locally in the high-bandwidth GPU memory, so gathering embeddings can be done much faster than when using CPU memory. This is because embedding reads are a memory bandwidth-limited operation.

  2. CPU-only versions can sidestep the added latency that hybrid CPU-GPU versions experience during the PCIe communication process of transferring embeddings, but the (relatively) low computation throughput of CPUs can significantly lengthen the DNN computation step.

  3. Hybrid CPU-GPU versions, on the other hand, can reduce the DNN computation latency, but this comes at the cost of additional CPU→GPU communication latency when copying the embeddings over PCIe using cudaMemcpy.

III-C. Our Goal: A Scalable Memory System

Overall, the key challenge arises from the fact that the source operands of key tensor manipulations (i.e., the embeddings subject to tensor concatenation or tensor reduction) are initially located inside the embedding lookup tables, all of which are stored inside the capacity-optimized but bandwidth-limited CPU memory. Consequently, prior solutions suffer from low-bandwidth embedding read operations over CPU memory, which add significant latency. Furthermore, CPU-only and CPU-GPU versions must trade off the computational bottleneck of low-throughput CPUs against the communication bottleneck of PCIe, which adds additional latency overheads (Figure 1). What is more troubling is that future projections of applications utilizing embedding layers assume an even larger number of embedding lookups and even larger embeddings themselves [9, 10], with complex tensor manipulations to combine embedding features for improving algorithmic performance. Overall, both current and future memory requirements of embedding layers point to an urgent need for a system-level solution that provides scalable memory capacity and bandwidth expansion. In the next section, we detail our proposed solution that addresses both the computational bottleneck of low-throughput CPUs and the communication bottleneck of low-bandwidth CPU-GPU data transfers.

IV. TensorDIMM: An NDP DIMM Design for Embeddings & Tensor Ops

IV-A. Proposed Approach

Our proposal is based on the following key observations that open up opportunities to tackle the system-level bottlenecks in deploying memory-limited recommender systems.

  1. An important tensor operation used in combining embedding features is the element-wise operation, which is equivalent to a tensor-wide reduction among N embeddings. Rather than having the N embeddings individually copied over to the GPU memory for the reduction operation to be initiated using the GPU, we can conduct the N-way tensor-wide reduction "near-DRAM" first and copy a single, reduced tensor to GPU memory. Such an NDP approach effectively reduces the data transfer size by a factor of N and alleviates the communication bottleneck of CPU→GPU cudaMemcpy (Figure 5); see the sketch after this list.

  2. DL system vendors are deploying GPU/NPU-centric interconnection fabrics [32, 20] that are decoupled from the legacy host-device PCIe. Such technology allows vendors to employ custom high-bandwidth links (e.g., NVLINK, providing substantially higher bandwidth than PCIe) for fast inter-GPU/NPU communication. More importantly, it is possible to tightly integrate "non"-accelerator components (e.g., a disaggregated memory pool [37, 38]) as separate interconnect endpoints (or nodes), allowing accelerators to read/write from/to these non-compute nodes over high-bandwidth links.
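The data-transfer saving from the first observation can be quantified with a small sketch (ours; N, the batch size, and the dimension are illustrative assumptions):

```python
import numpy as np

N, BATCH, EMBED_DIM = 8, 64, 128   # assumed values for illustration
tensors = [np.random.rand(BATCH, EMBED_DIM).astype(np.float32) for _ in range(N)]

# Baseline: copy all N gathered tensors to the GPU, reduce on the GPU.
bytes_baseline = sum(t.nbytes for t in tensors)

# NDP: reduce the N tensors near-DRAM, copy only the single result.
reduced = np.sum(tensors, axis=0)           # done by the NDP cores
bytes_ndp = reduced.nbytes

print(bytes_baseline // bytes_ndp)          # -> 8: the transfer shrinks N-fold
```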


Fig. 5: Our proposed approach: compared to the (a) hybrid CPU-GPU version, (b) our solution conducts the tensor-wide reduction locally near-DRAM first, and then transfers the reduced tensor over the high-bandwidth NVLINK. Consequently, our proposal significantly reduces both the latency of gathering embeddings from the lookup table and the time taken to transfer the embeddings from the capacity-optimized CPU/TensorNode memory to the bandwidth-optimized GPU memory. The CPU-only version does not experience this transfer latency but suffers from significantly longer DNN execution time than CPU-GPU and our proposed design. N embedding tensors are reduced into one in this example.


Fig. 6: High-level overview of our proposed system. (a) An NDP core is implemented inside the buffer device of each (b) TensorDIMM, multiples of which are employed as a (c) disaggregated memory pool, called TensorNode. The TensorNode is integrated inside the high-bandwidth GPU-side interconnect. The combination of NVLINK and NVSwitch enables GPUs to read/write data to/from TensorNode at a communication bandwidth substantially higher than PCIe.

Based on these observations, we first propose TensorDIMM, a custom DIMM design including an NDP core tailored for element-wise tensor "gather" and "reduction" operations. We then propose a disaggregated memory system called TensorNode, which is fully populated with our TensorDIMMs. Our last proposition is a software architecture that effectively parallelizes the tensor gather/reduction operations across the TensorDIMMs, providing scalable memory bandwidth and capacity expansion. The three most significant performance limiters in conventional CPU-only or hybrid CPU-GPU recommender systems are 1) the embedding gather operation over the low-bandwidth CPU memory, 2) the compute-limited DNN execution using CPUs (for CPU-only approaches), and 3) the CPU→GPU embedding copy operation over the PCIe bus (for hybrid CPU-GPU), all of which add severe latency overheads (Section III-B). Our vertically integrated solution fundamentally addresses these problems thanks to the following three key innovations (Figure 5). First, by storing the entire embedding lookup table inside the TensorDIMMs, the embedding gather operation can be conducted using the ample memory bandwidth available across the TensorDIMMs, which is far higher than that of the low-bandwidth CPU memory. Second, the TensorDIMM NDP cores conduct the N-way tensor reduction before sending the result to the GPU, reducing the TensorNode→GPU communication volume by a factor of N and effectively overcoming the communication bottleneck. Furthermore, the high-bandwidth TensorNode↔GPU links (i.e., NVLINK) enable further latency reduction, proportional to the bandwidth difference between PCIe and NVLINK. Lastly, all DNN computations are conducted using the GPU, overcoming the computation bottleneck of CPU-only implementations. As detailed in Section VI, our proposal achieves significant average performance improvements over the CPU-only and hybrid CPU-GPU designs, approaching the performance of an unbuildable, oracular GPU-only implementation with infinite memory capacity. We detail each component of our proposal below.

IV-B. TensorDIMM for Near-DRAM Tensor Ops

Our TensorDIMM is architected with three key design objectives in mind. First, TensorDIMM should leverage commodity DRAM chips as-is while remaining usable as a normal buffered DIMM device when it is not used for DL acceleration. Second, tensor reduction operations should be conducted "near-DRAM" using a lightweight NDP core, incurring minimal power and area overheads on the buffered DIMM device. Third, in addition to memory capacity, the amount of memory bandwidth available to the NDP cores should also scale up proportionally to the number of TensorDIMM modules employed in the system.

Architecture. Figure 6(a,b) shows the TensorDIMM architecture, which consists of an NDP core and the associated DRAM chips. As depicted, TensorDIMM does not require any changes to commodity DRAMs since all modifications are limited to a buffer device within a DIMM (Section II-A). The NDP core includes a DDR interface, a vector ALU, and an NDP-local memory controller which includes input/output SRAM queues to stage-in/out the source/destination operands of tensor operations. The DDR interface is implemented with a conventional DDR PHY and a protocol engine.

TensorDIMM usages. For non-DL use-cases, the processor's memory controller sends/receives DRAM C/A and DQ signals to/from this DDR interface, which directly interacts with the DRAM chips. This allows TensorDIMM to function as a normal buffered DIMM device and be utilized by conventional processor architectures for servicing load/store transactions. Upon receiving a TensorISA instruction for tensor gather/reduction operations, however, the instruction is forwarded to the NDP-local memory controller, which translates the TensorISA instruction into low-level DRAM C/A commands to be sent to the DRAM chips. Specifically, the TensorISA instruction is decoded (detailed in Section IV-D) in order to calculate the physical memory address of the target tensor's location, and the NDP-local memory controller generates the necessary RAS/CAS/activate/precharge DRAM commands to read/write data from/to the TensorDIMM DRAM chips. The data read out of the DRAM chips are temporarily stored inside the input SRAM queues until the ALU reads them out for tensor operations. (While one can envision utilizing our TensorDIMM NDP cores, or an extension with higher math throughput, for accelerating memory bandwidth-limited MLP layers, we limit the scope of our work to using our NDP units only for accelerating embedding lookups and tensor operations.) In terms of NDP compute, the minimum data access granularity over the eight x8 DRAM chips is 64 bytes for a burst length of 8, which amounts to sixteen 4-byte scalar elements. As detailed in Section IV-D, the tensor operations accelerated with our NDP cores are element-wise arithmetic operations (e.g., add, subtract, average) which exhibit data-level parallelism across the sixteen scalar elements. We therefore employ a 16-wide vector ALU which conducts the element-wise operation over the data read out of the input SRAM queues. The vector ALU checks for any newly submitted pair of data (64 bytes each) and, if so, pops out a pair for the tensor operation, the result of which is stored into the output SRAM queue. The NDP memory controller checks the output queue for newly inserted results and drains them back into DRAM, finalizing the tensor reduction process. For tensor gathers, the NDP core forwards the data read out of the input queues to the output queue to be committed back into DRAM.

Implementation and overhead. TensorDIMM leverages existing DRAM chips and the associated DDR PHY interface as-is, so the additional components introduced with our TensorDIMM design are the NDP-local memory controller and the 16-wide vector ALU. In terms of the memory controller, decoding a TensorISA instruction into a series of DRAM commands is implemented as FSM control logic, so the major area/power overheads come from the input/output SRAM queues. These buffers must be large enough to hold the bandwidth-delay product of the memory sourcing the data to remove idle periods in the output tensor generation stream. A conservative estimate of the latency from the time the NDP-local memory controller requests data to the time it arrives at the SRAM queue is used to size the queue capacity: assuming the baseline PC4-25600 DIMM that provides 25.6 GB/sec of memory bandwidth, the required SRAM storage amounts to only a few KB overall for both input/output queues. The 16-wide vector ALU is clocked at a modest frequency that provides enough computation throughput to seamlessly conduct the element-wise tensor operations over the data read out of the input queues. Note that the IBM Centaur buffer device [25] already occupies a non-trivial die area and power budget; compared to such a design point, our NDP core microarchitecture adds negligible power and area overheads, which we quantitatively evaluate in Section VI-E.
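As a sanity check on this sizing, the queue capacity follows a simple bandwidth-delay product; the latency below is an assumed placeholder, not a value reported in the paper:

```python
# SRAM queue sizing via bandwidth-delay product (illustrative only).
DIMM_BW_GBPS       = 25.6    # PC4-25600: 25,600 MB/s per DIMM
ASSUMED_LATENCY_NS = 100     # placeholder DRAM->queue latency (assumption)

bytes_in_flight = DIMM_BW_GBPS * 1e9 * ASSUMED_LATENCY_NS * 1e-9
print(f"{bytes_in_flight / 1024:.1f} KB per queue")   # ~2.5 KB with these inputs
# Doubling for separate input/output queues still keeps the SRAM
# footprint in the single-digit KB range.
```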

Memory bandwidth scaling. An important design objective of TensorDIMM is to provide scalable memory bandwidth expansion for NDP tensor operations. A key challenge with conventional memory systems is that the maximum bandwidth per memory channel is fixed (i.e., per-pin signaling bandwidth × number of data pins per channel), regardless of the number of DIMMs (or ranks) per channel. For instance, the maximum CPU memory bandwidth available under the baseline CPU system (i.e., NVIDIA DGX [39]) can never exceed the aggregate peak bandwidth of its eight memory channels, irrespective of the number of DIMMs actually populated, because the physical memory channel bandwidth is time-multiplexed across multiple DIMMs. As detailed in the next subsection, our TensorDIMM is utilized as a basic building block in composing a disaggregated memory system (i.e., TensorNode). The key innovation of our proposal is that, combined with our TensorISA address mapping function (Section IV-D), the amount of aggregate memory bandwidth provided to all the NDP cores within TensorNode increases proportionally to the number of TensorDIMMs employed, yielding a multi-fold increase over the baseline CPU memory system under our default configuration (Table I). Such memory bandwidth scaling is possible because each NDP core accesses its TensorDIMM-internal DRAM chips "locally" within its own DIMM, without having to share its local memory bandwidth with other TensorDIMMs. Naturally, the more TensorDIMMs employed inside the memory pool, the larger the aggregate memory bandwidth available to the NDP cores.
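The contrast between channel-limited and DIMM-local bandwidth can be expressed in a few lines; the TensorDIMM count below is an assumed example, and the per-DIMM rate follows from the PC4-25600 rating:

```python
# Conventional channel vs. TensorDIMM-local bandwidth scaling (illustrative).
PER_DIMM_BW_GBPS = 25.6          # PC4-25600 DIMM peak rate
NUM_CHANNELS     = 8             # baseline CPU-side memory channels
NUM_TENSORDIMMS  = 32            # assumed TensorNode population

# Conventional: DIMMs on a channel time-share that channel's bandwidth
# (the channel peak equals one DIMM's peak rate), so adding DIMMs adds
# capacity but not bandwidth.
cpu_bw = NUM_CHANNELS * PER_DIMM_BW_GBPS                 # fixed, ~205 GB/s

# TensorDIMM: each NDP core streams from its own DIMM-local DRAM chips,
# so aggregate bandwidth grows with the number of TensorDIMMs.
tensornode_bw = NUM_TENSORDIMMS * PER_DIMM_BW_GBPS       # ~819 GB/s here

print(cpu_bw, tensornode_bw)
```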

IV-C. System Architecture

Figure 6 provides a high-level overview of our proposed system architecture. We propose to construct a disaggregated memory pool within the high-bandwidth GPU system interconnect. Our design is referred to as TensorNode because it functions as an interconnect endpoint, or node. GPUs can read/write data from/to TensorNode over the NVLINK-compliant PHY interface, either using fine-grained CC-NUMA accesses or in bulk, coarse-grained data transfers using P2P cudaMemcpy (CC-NUMA access and P2P cudaMemcpy among NVLINK-compatible devices are already available in commercial systems, e.g., Power9 [40] and the GPUs within DGX-2 [32]; our TensorNode leverages such technology as-is to minimize design complexity). Figure 6(c) depicts our TensorNode design populated with multiple TensorDIMM devices. The key advantage of TensorNode is threefold. First, TensorNode provides a platform for increasing memory capacity in a scalable manner, as the disaggregated memory pool can be expanded independently using density-optimized DDRx, irrespective of the GPU's local, bandwidth-optimized (but capacity-limited) 3D stacked memory. As such, it is possible to store multiple embedding lookup tables entirely inside TensorNode because the multitude of TensorDIMM devices (each equipped with a density-optimized LR-DIMM [24]) enables a scalable memory capacity increase. Second, the aggregate memory bandwidth available to the TensorDIMM NDP cores has been designed to scale up proportionally to the number of DIMMs provisioned within the TensorNode (Section IV-D). This allows our TensorNode and TensorDIMM design to fulfill not only the current but also future memory (capacity and bandwidth) needs of recommender systems, which combine multiple embeddings whose sizes are expected to become even larger moving forward (Section III-B). Recent projections from several hyperscalars [9, 10] state that the memory capacity requirements of embedding layers will increase by hundreds of times beyond the already hundreds of GBs of memory footprint. TensorNode is a scalable, future-proof, system-level solution that addresses the memory bottlenecks of embedding layers. Third, the communication channels to/from the TensorNode are implemented using high-bandwidth NVLINK PHYs, so transferring embeddings between a GPU and a TensorNode becomes much faster than over PCIe.

IV-D. Software Architecture

We now discuss the software architecture of our proposal: the address mapping scheme for embeddings, TensorISA for conducting near-DRAM operations within TensorDIMM, and the runtime system.

Fig. 7: (a) Proposed DRAM address mapping scheme for embeddings, assuming each embedding vector is split evenly across the ranks (or DIMMs). (b) Example showing how rank-level parallelism is utilized to interleave and map each embedding across the TensorDIMMs. Such a rank-level-parallelism-centric address mapping scheme enables the memory bandwidth available to the NDP cores to increase proportionally to the number of TensorDIMMs employed.

Address mapping architecture. One of the key objectives of our address mapping function is to provide scalable performance improvement whenever additional TensorDIMMs (i.e., additional DIMMs and hence NDP cores) are added to our TensorNode. To achieve this goal, it is important that all NDP cores within the TensorNode concurrently work on distinct subsets of the embedding vectors for (gather/reduce) tensor operations. Figure 7(a) illustrates our proposed address mapping scheme, which utilizes rank-level parallelism to maximally utilize both the TensorDIMMs' NDP computation throughput and memory bandwidth. Because each TensorDIMM has its own NDP core, maximally utilizing aggregate NDP compute throughput requires all TensorDIMMs to work in parallel. Our address mapping function accomplishes this by having consecutive 64-byte chunks within each embedding vector be interleaved across different ranks (we assume TensorDIMM is built using a x64 DIMM; with a burst length of 8, the minimum data access granularity becomes 64 bytes), allowing each TensorDIMM to independently work on its own slice of the tensor operation concurrently (Figure 7(b)). As our target algorithm contains abundant data-level parallelism, our address mapping technique effectively partitions and load-balances the tensor operation across all the TensorDIMMs within the TensorNode. In Section VI-A, we quantitatively evaluate the efficacy of our address mapping function in terms of maximum DRAM bandwidth utilization.
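A minimal sketch of such rank-interleaved placement is shown below (ours, for illustration; the exact bit fields of the actual mapping function are defined in Figure 7(a), and the rank count is an assumption):

```python
# Rank-interleaved placement of one embedding vector (illustrative).
CHUNK_BYTES = 64          # minimum access granularity (x64 DIMM, burst 8)
NUM_RANKS   = 8           # assumed number of TensorDIMMs/ranks

def chunk_to_rank(vector_base, chunk_idx):
    """Map the chunk_idx-th 64B chunk of an embedding to (rank, local offset)."""
    rank         = chunk_idx % NUM_RANKS                 # round-robin across ranks
    local_offset = vector_base + (chunk_idx // NUM_RANKS) * CHUNK_BYTES
    return rank, local_offset

# A 2 KB embedding (32 chunks) spreads 4 chunks onto each of the 8 ranks,
# so all 8 NDP cores stream their slices in parallel.
placement = [chunk_to_rank(0x0, i) for i in range(32)]
print(placement[:4])      # [(0, 0), (1, 0), (2, 0), (3, 0)]
```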

It is worth pointing out that our address mapping function is designed to address not just the current but also future projections of how DL practitioners seek to use embeddings. As discussed in Section III-B, given the capability, DL practitioners are willing to adopt larger embedding vector dimensions to improve model quality. This translates into a larger number of bits to reference any given embedding vector (e.g., doubling the embedding size adds one more bit to the intra-embedding offset in Figure 7(a)), leading to both a larger memory footprint to store the embeddings and higher computation and memory bandwidth demands to conduct tensor reductions. Our system-level solution has been designed from the ground up to effectively handle such user needs: system architects can provision more TensorDIMM ranks within TensorNode and increase memory capacity proportionally to the increase in embedding size, which is naturally accompanied by an increase in NDP compute throughput and memory bandwidth.


Fig. 8: Instruction formats for GATHER, REDUCE, and AVERAGE.

Fig. 9: Pseudo code explaining the functional behavior of (a) GATHER, (b) REDUCE, and (c) AVERAGE.

TensorISA. The near-DRAM tensor operations are initiated using our custom ISA extension called TensorISA. There are three key TensorISA primitives supported in TensorDIMM: the GATHER instruction for embedding lookups and the REDUCE and AVERAGE instructions for element-wise operations. Figure 8 and Figure 9 summarize the instruction formats of these three instructions and pseudo code describing their functional behavior. Consider the example in Figure 2, which assumes an embedding layer that uses two embedding lookup tables and a given batch size to compose two tensors, followed by an element-wise operation between the two for reduction. As discussed in Section II-C, an embedding layer starts with an embedding lookup phase that gathers multiple embeddings, up to the batch size, from the embedding lookup table, followed by various tensor manipulations. Under our proposed system, the GPU executes this embedding layer as follows. First, the GPU sends three instructions, two GATHERs and one REDUCE, to the TensorNode. Each TensorISA instruction is broadcast to all the TensorDIMMs because each NDP core is responsible for locally conducting its share of the embedding lookups as well as its slice of the tensor operation (Figure 9(a,b)). For instance, assuming the address mapping function in Figure 7(a), a single GATHER instruction has each TensorDIMM gather its rank-interleaved share of each embedding (in units of 64-byte chunks) into a contiguous physical address space, a process orchestrated by the TensorDIMM's NDP-local memory controller using DRAM read/write transactions (Section IV-B). The GATHER process is undertaken twice to prepare the two tensor slices per TensorDIMM rank. Once the two tensors are gathered, a REDUCE instruction is executed by the NDP cores (Section IV-B).
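To convey the intended semantics only (the actual instruction fields and per-rank orchestration are specified in Figures 8 and 9), a simplified functional sketch of the three primitives might look as follows:

```python
import numpy as np

def GATHER(table: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Functional view of GATHER: batched embedding lookup into a dense tensor.
    On hardware, each TensorDIMM executes this for its own rank-interleaved
    slice of every embedding vector."""
    return table[indices]                      # shape: (batch, embed_dim)

def REDUCE(tensor_a: np.ndarray, tensor_b: np.ndarray, op: str = "add") -> np.ndarray:
    """Functional view of REDUCE: element-wise combination of two gathered tensors."""
    return tensor_a + tensor_b if op == "add" else tensor_a * tensor_b

def AVERAGE(tensors) -> np.ndarray:
    """Functional view of AVERAGE: element-wise mean across gathered tensors."""
    return np.mean(tensors, axis=0)

# Example mirroring Figure 2: two lookup tables, one reduction.
table_a = np.random.rand(1000, 64).astype(np.float32)
table_b = np.random.rand(1000, 64).astype(np.float32)
batch_idx = np.array([1, 5, 9, 12])
out = REDUCE(GATHER(table_a, batch_idx), GATHER(table_b, batch_idx))
```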

Runtime system. DL applications are typically encapsulated as a directed acyclic graph (DAG) data structure in major DL frameworks [41, 42, 43]. Each node within the DAG represents a DNN layer, and the DL framework compiles the DAG down into a sequence of host-side CUDA kernel launches that the GPU executes one layer at a time. The focus of this paper is on recommender systems, which utilize embedding lookups and various tensor manipulations. Under our proposed system, embedding layers are still executed using normal CUDA kernel launches, but the kernel itself is wrapped with information that our TensorDIMM runtime system utilizes for near-DRAM tensor operations. Specifically, as part of the embedding layer's CUDA kernel context, information such as the number of table lookups, the embedding dimension size, the tensor reduction type, and the input batch size is encoded per the TensorISA instruction format (Figure 8) and sent to the GPU as part of the CUDA kernel launch. When the GPU runtime receives these instructions, they are forwarded to the TensorNode for near-DRAM processing as discussed in Section IV-B.
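A schematic of the kind of per-kernel metadata the runtime could attach is sketched below; the field names and encoding are ours and only approximate the actual format defined in Figure 8:

```python
from dataclasses import dataclass
from enum import Enum

class TensorOp(Enum):
    GATHER = 0
    REDUCE = 1
    AVERAGE = 2

@dataclass
class TensorISAInstr:
    """Hypothetical, simplified view of the per-kernel TensorISA metadata."""
    op: TensorOp
    num_lookup_tables: int     # how many embedding tables this layer reads
    embed_dim: int             # embedding vector dimension
    batch_size: int            # inference batch size
    table_base_addr: int       # physical base of the lookup table in TensorNode
    index_list_addr: int       # where the lookup indices reside

# The DL framework would emit one such descriptor per TensorISA instruction,
# and the GPU runtime forwards it to TensorNode for near-DRAM execution.
instrs = [
    TensorISAInstr(TensorOp.GATHER, 2, 64, 64, 0x1000_0000, 0x2000_0000),
    TensorISAInstr(TensorOp.GATHER, 2, 64, 64, 0x1800_0000, 0x2000_1000),
    TensorISAInstr(TensorOp.REDUCE, 2, 64, 64, 0x3000_0000, 0x0),
]
```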

As TensorNode is a remote, disaggregated memory pool from the GPU's perspective, the runtime system should be able to (de)allocate memory inside this remote memory pool. Our proposal builds upon our prior work [44], which proposes several CUDA runtime API extensions for remote memory (de)allocation under a GPU-side disaggregated memory system. We refer interested readers to [44, 45] for further details on the runtime CUDA APIs required for memory (de)allocation within a disaggregated memory system.

V. Evaluation Methodology

Architectural exploration of TensorDIMM and its system-level implications within TensorNode using a cycle-level simulator is challenging for several reasons. First, running a single-batch inference for DL applications can take up to several milliseconds even on high-end GPUs, so running cycle-level simulation on several tens to hundreds of batches of inference leads to an intractable amount of simulation time. Second, our proposal covers multiple levels in the hardware/software stack, so a cycle-level hardware performance model of TensorDIMM and TensorNode alone will not properly reflect the complex interaction of (micro)architecture, runtime system, and system software, potentially resulting in misleading conclusions. Interestingly, we note that the key DL operations utilized in embedding layers that are of interest to this study are completely memory bandwidth limited. This allows us to utilize existing DL hardware/software systems to "emulate" the behavior of TensorDIMM and TensorNode on top of state-of-the-art real DL systems. Recall that the embedding lookups (GATHER) and the tensor reduction operations (REDUCE/AVERAGE) have an extremely low compute-to-memory ratio (i.e., all three operations are effectively streaming applications), leaving the execution of these three operations bottlenecked by the available memory bandwidth. Consequently, the effectiveness of our proposal primarily lies in 1) the effectiveness of our address mapping scheme in maximally utilizing the aggregate memory bandwidth for the tensor operations across all the TensorDIMMs, and 2) the impact of "PCIe vs. NVLINK" on the communication latency of copying embeddings between the capacity-optimized "CPU vs. TensorNode" memory and the bandwidth-optimized GPU memory (Figure 5). We introduce a novel, hybrid evaluation methodology that utilizes both cycle-level simulation and a proof-of-concept prototype developed on real DL systems to quantitatively demonstrate the benefits of our proposal.

DRAM specification: DDR4 (PC4-25600)
Number of TensorDIMMs: N (default configuration)
Memory bandwidth per TensorDIMM: 25.6 GB/sec
Memory bandwidth across TensorNode: N × 25.6 GB/sec
TABLE I: Baseline TensorNode configuration.

Cycle-level simulation. As the performance of TensorDIMM and TensorNode is bounded by how well they utilize DRAM bandwidth, an evaluation of our proposal's memory bandwidth utilization is needed. We develop a memory tracing function that hooks into the DL frameworks [41, 42] to gather the read/write memory transactions required to execute GATHER/REDUCE/AVERAGE operations for embedding lookups and tensor operations. The traces are fed into Ramulator [46], a cycle-accurate DRAM simulator, configured to model 1) the baseline CPU-GPU system configuration with eight CPU-side memory channels, and 2) our proposed address mapping function (Section IV-D) and TensorNode configuration (Table I). We use this setup to measure the effective memory bandwidth utilization when executing the three tensor operations under the baseline and under TensorNode.
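As an illustration of what such a trace can look like, the simplified generator below emits address/command pairs for a single GATHER (the sizes are assumed, and the format is generic rather than Ramulator's exact trace syntax):

```python
# Emit a simplified read trace for one GATHER over rank-interleaved chunks.
CHUNK_BYTES, NUM_RANKS = 64, 8          # x64 DIMM burst, assumed rank count
EMBED_BYTES = 2048                      # assumed embedding size for illustration

def gather_trace(table_base, row_ids):
    lines = []
    for row in row_ids:
        vec_base = table_base + row * EMBED_BYTES
        for chunk in range(EMBED_BYTES // CHUNK_BYTES):
            addr = vec_base + chunk * CHUNK_BYTES   # consecutive chunks land
            lines.append(f"0x{addr:x} R")           # on consecutive ranks
    return lines

trace = gather_trace(0x1000_0000, [3, 17, 256])
print(trace[:2])    # ['0x10001800 R', '0x10001840 R']
```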


Fig. 10: Emulation of TensorDIMM and TensorNode using a real GPU: the GPU cores (aka SMs [47]) and the GPU-local memory channels (and bandwidth) correspond to the NDP cores within the TensorDIMMs and the aggregate memory bandwidth available across the TensorNode, respectively.

“Proof-of-concept” prototype. We emulate the behavior of our proposed system using the state-of-the-art NVIDIA DGX [39] machine, which includes eight V100 GPUs [29]. Each V100 GPU provides 900 GB/sec of local memory bandwidth and six NVLINKs for communicating with other GPUs at up to 150 GB/sec [22]. To emulate the system-level effects of the high-bandwidth NVLINK communication channels between the TensorNode and a GPU, we use a pair of GPUs in DGX and treat one of them as our proposed TensorNode while the other acts as a normal GPU (Figure 10). Because the tensor operations accelerated using our TensorDIMMs are memory bandwidth-limited, streaming workloads, the performance difference between a (hypothetical) real TensorNode populated with N TensorDIMMs and the single V100 that we emulate as our TensorNode will be small, provided the V100's local memory bandwidth matches that of our assumed TensorNode configuration. As such, when evaluating the effective memory bandwidth utilization of TensorNode, we configure the number of ranks (i.e., TensorDIMMs) such that the aggregate memory bandwidth within a single TensorNode approximately matches that of a single V100 (Table I). After validating the effectiveness of TensorNode in utilizing memory bandwidth commensurate with that of a V100, we implement a software prototype of an end-to-end recommender system (configurable to the three DL applications discussed below) using Intel's Math Kernel Library (MKL) [48], cuDNN [49], cuBLAS [50], and our in-house CUDA implementation of embedding layers (including GATHER/REDUCE/AVERAGE) as well as other layers not covered by MKL, cuDNN, or cuBLAS. We cross-validated our in-house implementation of these memory bandwidth-limited layers by comparing the measured performance against the ideal, upper-bound performance, which exhibited little variation. Under our CUDA implementation of tensor reduction operations, the GPU cores (called SMs in CUDA [47]) effectively function as the NDP cores within TensorNode because the CUDA kernels of REDUCE/AVERAGE stage the tensors in/out between the SMs and GPU local memory in a streaming fashion, as done in TensorDIMM. The TensorNode↔GPU communication is orchestrated using P2P cudaMemcpy over NVLINK when evaluating our proposal. When studying the sensitivity of TensorDIMM to the TensorNode↔GPU communication bandwidth, we artificially increase (decrease) the data transfer size to emulate the behavior of smaller (larger) communication bandwidth and study its implication on system performance (Section VI-D).
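The emulation strategy can be pictured with a short PyTorch sketch (ours; the actual prototype uses in-house CUDA kernels plus MKL/cuDNN/cuBLAS, the device IDs are arbitrary, and two GPUs are assumed):

```python
import torch

node = torch.device("cuda:0")   # GPU emulating TensorNode (holds the tables)
gpu  = torch.device("cuda:1")   # GPU running the DNN portion

table_a = torch.rand(1_000_000, 64, device=node)
table_b = torch.rand(1_000_000, 64, device=node)
idx     = torch.randint(0, 1_000_000, (64,), device=node)   # batch of 64 lookups

# GATHER + REDUCE executed on the emulated TensorNode (bandwidth-bound,
# streaming work standing in for the NDP cores).
reduced = table_a[idx] + table_b[idx]

# Only the reduced tensor crosses the NVLINK-connected P2P path,
# mimicking the TensorNode->GPU copy of our proposal.
reduced_on_gpu = reduced.to(gpu, non_blocking=True)
# ...downstream MLP/FC layers then run on 'gpu'.
```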

Benchmarks. We choose three neural-network-based recommender system applications using embeddings to evaluate our proposal: (1) the neural collaborative filtering (NCF) [33] based recommender system available in MLPerf [51], (2) the YouTube recommendation system [52] (YouTube), and (3) the Fox movie recommendation system [53] (Fox). To choose an inference batch size representative of real-world deployment scenarios, we refer to the recent study from Facebook [9] on the batch sizes with which datacenter recommender systems are commonly deployed. Based on this prior work, we use a fixed default batch size but sweep across smaller and larger batch sizes when running sensitivity studies. All three workloads are configured with a default embedding vector dimension, with other key application configurations summarized in Table II. We specifically note when deviating from these default configurations when running sensitivity studies in Section VI-C.

Area/power. The implementation overheads of TensorDIMM are measured with synthesized implementations using Verilog HDL, targeting a Xilinx Virtex UltraScale+ VCU1525 acceleration dev board. The system-level power overheads of TensorNode are evaluated using Micron’s DDR4 power calculator [54]. We detail these results in Section VI-E.

Network Lookup tables Max reduction FC/MLP layers
NCF 4
YouTube 4
Fox 1
TABLE II: Evaluated benchmarks and default configuration.

VI. Evaluation

We explore five design points of recommender systems: 1) the CPU-only version (CPU-only), 2) the hybrid CPU-GPU version (CPU-GPU), 3) a TensorNode-style pooled memory system interfaced inside the high-bandwidth GPU interconnect but populated with regular capacity-optimized DIMMs rather than NDP-enabled TensorDIMMs (PMEM), 4) our proposed TensorNode with TensorDIMMs (TDIMM), and 5) an unbuildable, oracular GPU-only version (GPU-only) which assumes that the entire embeddings can be stored inside GPU local memory, obviating the need for cudaMemcpy.

VI-A. Memory Bandwidth Utilization

To validate the effectiveness of TensorNode in amplifying effective memory bandwidth, we measure the aggregate memory throughput achieved with TensorNode and the baseline CPU-based system, both of which utilize the same total number of DIMMs (Figure 11). TensorNode significantly outperforms the baseline CPU system, with a large average increase in memory bandwidth utilization. As discussed in Section IV-B, conventional memory systems time-multiplex the memory channel across multiple DIMMs, so a larger number of DIMMs only provides enlarged memory capacity, not bandwidth. Our TensorDIMM is designed to ensure that the aggregate memory bandwidth scales proportionally to the number of TensorDIMMs, achieving significant memory bandwidth scaling. The benefits of our proposal are more pronounced for more future-looking scenarios with enlarged embedding dimension sizes. Figure 12 shows the effective memory bandwidth for tensor operations when the embedding dimension size is scaled up, necessitating a larger number of DIMMs to house the proportionally larger embedding lookup tables. As depicted, the baseline CPU memory system's bandwidth saturates because of the fundamental limits of conventional memory systems, whereas TensorNode's bandwidth keeps scaling into the TB/sec range, a many-fold increase over the baseline.


Fig. 11: Memory bandwidth utilization for the three tensor operations. TensorNode assumes the default configuration in Table I. For CPU-only and CPU-GPU, the tensor operations are conducted over the CPU memory system, with all eight CPU-side memory channels populated with DIMMs (multiple ranks per memory channel).


Fig. 12: Memory throughput as a function of the number of DIMMs employed within the CPU memory system and TensorNode. The evaluation assumes the embedding size is increased from the default value, which proportionally increases the embedding lookup table size and requires a proportional increase in memory capacity (i.e., more DIMMs) to store these lookup tables.

VI-B. System-level Performance

TensorDIMM significantly reduces the latency of memory-limited embedding layers, thanks to its high memory throughput and the communication bandwidth amplification effects of both near-DRAM tensor operations and the high-bandwidth NVLINK. Figure 13 shows a latency breakdown of our studied workloads under the default batch size. With our proposed solution, all three applications enjoy significant reductions in both the embedding lookup latency and the embedding copy latency. This is because of the high-bandwidth TensorDIMM tensor operations and the fast communication links utilized for moving the embeddings to GPU memory. Figure 14 summarizes the normalized performance of the five design points across different batch sizes. While CPU-only occasionally achieves better performance than CPU-GPU for low-batch inference scenarios, the oracular GPU-only consistently performs best on average, highlighting the advantages of DNN acceleration using GPUs (and NPUs). TensorDIMM comes close to the performance of this unbuildable, oracular GPU across the studied batch sizes, demonstrating its performance merits and robustness, and it achieves substantial average speedups over both CPU-only and CPU-GPU.


Fig. 13: Breakdown of latencies during inference with the default batch size, normalized to the slowest design point (i.e., CPU-only or CPU-GPU).


Fig. 14: Performance of the five design points of recommender systems, normalized to the oracular GPU (GPU-only).


Fig. 15: TensorDIMM performance with larger embedding sizes. Results are averaged across the three studied recommender systems.


Fig. 16: Performance sensitivity of PMEM (i.e., pooled memory without NDP acceleration) and TensorDIMM to the communication bandwidth. Results are averaged across the three studied recommender systems and normalized to the default communication bandwidth configuration.

VI-C. TensorDIMM with Large Embeddings

A key motivation of our work is to provide a scalable memory system for embeddings and their tensor operations. So far, we have assumed the default embedding size of each workload as summarized in Table II. With the availability of our pooled memory architecture, DL practitioners can provision much larger embeddings to develop recommender systems with superior model quality. Because our evaluation is conducted over an emulated version of TensorDIMMs, we are not able to demonstrate the model quality improvements larger embeddings would bring, as we cannot train these scaled-up algorithms (i.e., memory capacity is still constrained by the GPU memory size of the emulation platform). Nonetheless, we conduct a sensitivity study of TensorDIMM for scaled-up embedding sizes, shown in Figure 15. With larger embeddings, the embedding layers cause a much more serious performance bottleneck. TensorNode shows even higher performance benefits under these settings, achieving even larger average performance improvements over CPU-only and CPU-GPU, respectively.

VI-D TensorDIMM with Low-bandwidth System Interconnects

To highlight the maximum potential of TensorDIMM, we have so far discussed its merits in the context of a high-bandwidth GPU-side interconnect. Nonetheless, TensorDIMM can still be utilized in conventional, CPU-centric disaggregated memory systems. Concretely, one can envision a system that contains a pooled memory like TensorNode interfaced over a low-bandwidth system interconnect (e.g., PCIe). Such a design point resembles the hybrid CPU-GPU baseline except for one key distinction: the tensor operations are done using the NDP cores, and only the reduced tensor must be copied to the GPU over the slower communication channel. Figure 16 summarizes the sensitivity of our proposal to the TensorNode-GPU communication bandwidth. We study both PMEM (i.e., disaggregated memory “without” TensorDIMMs) and TDIMM under low-bandwidth system interconnects to highlight the robustness TensorDIMM brings. Overall, PMEM is much more sensitive to the communication bandwidth than TDIMM because the benefit of near-data reduction is lost with PMEM, whereas our TensorDIMM design (TDIMM) experiences only a modest performance loss even with a much lower communication bandwidth. These results highlight the robustness and the wide applicability of TensorDIMM.
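The sketch below illustrates this distinction with a simple bytes-over-link-bandwidth estimate: PMEM must move every gathered embedding across the TensorNode-GPU link, while TensorDIMM moves only the reduced tensor. All workload sizes and link bandwidths are hypothetical and serve only to show why the gap widens as the link slows down.

# Why PMEM is far more sensitive to interconnect bandwidth than TensorDIMM:
# without near-data reduction, every gathered embedding crosses the link;
# with NDP reduction, only one reduced vector per sample does.

def copy_time_us(batch, lookups_per_sample, embedding_bytes, link_gbs, ndp_reduce):
    per_sample = embedding_bytes if ndp_reduce else lookups_per_sample * embedding_bytes
    return batch * per_sample / (link_gbs * 1e3)  # GB/s -> bytes per microsecond

batch, lookups, emb = 64, 80, 512                 # hypothetical workload
for link_gbs in (16, 32, 64, 150):                # PCIe-class up to NVLINK-class
    pmem = copy_time_us(batch, lookups, emb, link_gbs, ndp_reduce=False)
    tdimm = copy_time_us(batch, lookups, emb, link_gbs, ndp_reduce=True)
    print(f"{link_gbs:4d} GB/s link: PMEM {pmem:7.1f} us vs. TDIMM {tdimm:5.2f} us")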

VI-E Design Overheads

TensorDIMM requires no changes to the DRAM chips themselves but adds a lightweight NDP core inside the buffer device (Section IV-B). We implemented and synthesized the major components of our NDP core in Verilog HDL on a Xilinx Virtex UltraScale+ FPGA board. We confirm that the added area/power overheads of our NDP core are mostly negligible, as the core is dominated by the small SRAM queues and the vector ALU (Table III). From a system-level perspective, our work utilizes GPUs as-is, so the major overhead comes from the disaggregated TensorNode design. Assuming each TensorDIMM uses a high-capacity load-reduced DIMM [24], we estimate its power consumption using Micron's DDR4 system power calculator [54]. For a TensorNode populated with multiple TensorDIMMs, the aggregate power overhead is the per-DIMM power multiplied by the number of DIMMs. Recent specifications for accelerator interconnection fabric endpoints (e.g., the Open Compute Project [55] open accelerator module [20]) budget a generous per-module TDP, so the power overhead of TensorNode is expected to be acceptable.
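For concreteness, the sketch below reproduces the shape of this power-budget argument with placeholder numbers; the per-DIMM wattage, DIMM counts, and module TDP are assumptions for illustration, not the values estimated in our study.

# Sanity check mirroring the power-budget argument: total TensorNode power is
# per-DIMM power times the DIMM count, compared against a module TDP budget.
# All values below are hypothetical placeholders.

def tensornode_power_w(per_dimm_w, num_dimms):
    return per_dimm_w * num_dimms

MODULE_TDP_W = 350.0        # hypothetical accelerator-module budget
PER_DIMM_W = 13.0           # hypothetical LRDIMM power estimate

for num_dimms in (8, 16, 32):
    total = tensornode_power_w(PER_DIMM_W, num_dimms)
    verdict = "within" if total <= MODULE_TDP_W else "exceeds"
    print(f"{num_dimms:2d} DIMMs -> {total:6.1f} W ({verdict} the {MODULE_TDP_W:.0f} W budget)")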

               LUT [%]   FF [%]   DSP [%]   BRAM [%]
SRAM queues    0.00      0.00     0.00      0.01
FPU            0.19      0.01     0.20      0.00
ALU            0.09      0.01     0.01      0.00

TABLE III: FPGA utilization of a single NDP core (i.e., the SRAM queues and the vector ALU with single-precision floating-point (FPU) and fixed-point (ALU) units) on the Xilinx Virtex UltraScale+ VCU1525 acceleration development board.

VII Related Work

Disaggregated memory [37, 38] is typically deployed as a remote memory pool connected over PCIe, which helps increase the CPU-accessible memory capacity. Prior work [44, 56, 45] proposed system-level solutions that embrace the idea of memory disaggregation within a high-bandwidth GPU interconnect, which bears similarity to our TensorNode. However, the focus of these prior works is on DL training, whereas our study primarily targets DL inference. Similar to our TensorDIMM, several prior works [25, 26, 27, 28] explored the possibility of utilizing the DIMM buffer device to add custom acceleration logic. The scope of all these prior studies is significantly different from that of our work. To the best of our knowledge, this paper is the first to identify and address the memory capacity and bandwidth challenges of embedding layers, which several hyperscalars [10, 9, 12] deem one of the most crucial challenges in emerging DL workloads.

Other than these closely related studies, a large body of prior work has explored the design of single-GPU/NPU architectures for DL [7, 6, 57, 58, 59, 60, 61, 3, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], with recent interest in leveraging DNN sparsity for further energy-efficiency improvements [11, 4, 72, 8, 73, 74, 75, 76, 77, 5, 78, 79]. A scale-out acceleration platform for training DL algorithms was proposed by Park et al. [80], and a network-centric DL training platform was proposed by Li et al. [81]. These prior studies are orthogonal to our proposal and can be adopted alongside it for additional enhancements.

VIII Conclusion

In this paper, we propose a vertically integrated hardware/software co-design that addresses the memory (capacity and bandwidth) wall problem of embedding layers, an important building block of emerging DL applications. Our TensorDIMM architecture synergistically combines NDP cores with commodity DRAM devices to accelerate DL tensor operations. Built on top of a disaggregated memory pool, TensorDIMM provides memory capacity and bandwidth scaling for embeddings, achieving significant average performance improvements over conventional CPU-only and hybrid CPU-GPU implementations of recommender systems. To the best of our knowledge, TensorDIMM is the first to quantitatively explore architectural solutions tailored for embeddings and tensor operations.

Acknowledgment

This research is supported by Samsung Research Funding Center of Samsung Electronics (SRFC-TB1703-03).

References

  • [1] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa, A. Jaleel, C. Wu, and D. Nellans, “MCM-GPU: Multi-chip-module GPUs for Continued Performance Scalability,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2017.
  • [2] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter Performance Analysis of a Tensor Processing Unit,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2017.
  • [3] Y. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in Proceedings of the International Solid State Circuits Conference (ISSCC), 2016.
  • [4] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [5] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2017.
  • [6] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning Supercomputer,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2014.
  • [7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operation Systems (ASPLOS), 2014.
  • [8] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.
  • [9] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, J. Pino, M. Schatz, A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen, U. Diril, D. Dzhulgakov, K. Hazelwood, B. Jia, Y. Jia, L. Qiao, V. Rao, N. Rotem, S. Yoo, and M. Smelyanskiy, “Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications,” in arxiv.org, 2018.
  • [10] J. Hestness, N. Ardalani, and G. Diamos, “Beyond Human-Level Accuracy: Computational Challenges in Deep Learning,” in Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPOPP), 2019.
  • [11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [12] V. Rao, “Accelerating Infrastructure - together.” https://2019ocpglobalsummit.sched.com/event/Jiis/accelerating-infrastructure-together-presented-by-facebook.
  • [13] ACM, “The ACM Conference Series on Recommendation Systems.” https://recsys.acm.org/, 2019.
  • [14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in arxiv.org, 2017.
  • [15] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in arxiv.org, 2018.
  • [16] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” in arxiv.org, 2014.
  • [17] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. Strachan, M. Hu, R. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with in-situ Analog Arithmetic in Crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016.
  • [18] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute,” IEEE Journal of Solid-State Circuits, vol. PP, pp. 1–11, 03 2019.
  • [19] F. Tu, W. Wu, S. Yin, L. Liu, and S. Wei, “RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2018.
  • [20] Facebook, “Accelerating Facebook’s infrastructure with Application-Specific Hardware.” https://code.fb.com/data-center-engineering/accelerating-infrastructure/, 2019.
  • [21] NVIDIA, “NVSwitch: Leveraging NVLink to Maximum Effect,” 2018.
  • [22] NVIDIA, “NVLINK High-Speed Interconnect,” 2018.
  • [23] Samsung, “(8GB, 1Gx72 Module) 288pin Registered DIMM based on 4Gb E-die,” 2016.
  • [24] Hynix, “128 GB 3DS LRDIMM: The World’s First Developed 3DS LRDIMM,” 2017.
  • [25] P. Meaney, L. Curley, G. Gilda, M. Hodges, D. Buerkle, R. Siegl, and R. Dong, “The IBM z13 Memory Subsystem for Big Data,” IBM Journal of Research and Development, 2015.
  • [26] A. Farmahini-Farahani, J. Ahn, K. Morrow, and N. Kim, “NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules,” in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2015.
  • [27] H. Asghari-Moghaddam, Y. Son, J. Ahn, and N. Kim, “Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.
  • [28] M. Alian, S. Min, H. Asgharimoghaddam, A. Dhar, D. Wang, T. Roewer, A. McPadden, O. O’Halloran, D. Chen, J. Xiong, D. Kim, W. Hwu, and N. Kim, “Application-Transparent Near-Memory Processing Architecture with Memory Channel Network,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2018.
  • [29] NVIDIA, “NVIDIA Tesla V100,” 2018.
  • [30] Google, “Cloud TPUs: ML accelerators for TensorFlow,” 2017.
  • [31] A. Krizhevsky, “One Weird Trick For Parallelizing Convolutional Neural Networks.” https://arxiv.org/abs/1404.5997, 2014.
  • [32] NVIDIA, “The NVIDIA DGX-2 Deep Learning System,” 2017.
  • [33] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, “Neural Collaborative Filtering,” in Proceedings of the International Conference on World Wide Web (WWW), 2017.
  • [34] JEDEC, “High Bandwidth Memory (HBM2) DRAM,” 2018.
  • [35] Micron, “Hybrid Memory Cube (HMC),” 2018.
  • [36] NVIDIA, “Accelerating Recommendation System Inference Performance with TensorRT.” https://devblogs.nvidia.com/accelerating-recommendation-system-inference-performance-with-tensorrt/, 2018.
  • [37] L. Kevin, C. Jichuan, M. Trevor, R. Parthasarathy, S. Reinhardt, and T. Wenisch, “Disaggregated Memory for Expansion and Sharing in Blade Servers,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
  • [38] L. Kevin, T. Yoshio, R. Jose, A. Alvin, C. Jichuan, R. Parthasarathy, and T. Wenisch, “System-level Implications of Disaggregated Memory,” in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2012.
  • [39] NVIDIA, “The NVIDIA DGX-1V Deep Learning System,” 2017.
  • [40] IBM, “IBM Power9 Microprocessor,” 2017.
  • [41] Tensorflow. https://www.tensorflow.org, 2016.
  • [42] PyTorch. http://pytorch.org, 2019.
  • [43] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems,” in Proceedings of the Workshop on Machine Learning Systems, 2015.
  • [44] Y. Kwon and M. Rhu, “Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2018.
  • [45] E. Choukse, M. Sullivan, M. O’Connor, M. Erez, J. Pool, D. Nellans, and S. Keckler, “Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs.” https://arxiv.org/abs/1903.02596, 2019.
  • [46] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator,” in IEEE Computer Architecture Letters, 2015.
  • [47] NVIDIA, “NVIDIA CUDA Programming Guide,” 2016.
  • [48] Intel, “Intel Math Kernel Library.” https://software.intel.com/en-us/mkl, 2019.
  • [49] NVIDIA, “cuDNN: GPU Accelerated Deep Learning,” 2016.
  • [50] NVIDIA, “cuBLAS Library,” 2008.
  • [51] mlperf.org, “MLPerf.” https://mlperf.org/, 2018.
  • [52] P. Covington, J. Adams, and E. Sargin, “Deep Neural Networks for Youtube Recommendations,” in Proceedings of the ACM Conference on Recommender Systems (RECSYS), 2016.
  • [53] M. Campo, C. Hsieh, M. Nickens, J. Espinoza, A. Taliyan, J. Rieger, J. Ho, and B. Sherick, “Competitive Analysis System for Theatrical Movie Releases Based on Movie Trailer Deep Video Representation.” https://arxiv.org/abs/1807.04465, 2018.
  • [54] Micron, “Micron: System Power Calculator (DDR4),” 2017.
  • [55] Facebook, “Open Compute Project.” https://www.opencompute.org/.
  • [56] Y. Kwon and M. Rhu, “A Case for Memory-Centric HPC System Architecture for Training Deep Neural Networks,” in IEEE Computer Architecture Letters, 2018.
  • [57] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2015.
  • [58] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, “PuDianNao: A Polyvalent Machine Learning Accelerator,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operation Systems (ASPLOS), 2015.
  • [59] Z. Du, D. Rubin, Y. Chen, L. He, T. Chen, L. Zhang, C. Wu, and O. Temam, “Neuromorphic Accelerators: A Comparison Between Neuroscience and Machine-Learning Approaches,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2015.
  • [60] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. Lee, J. Miguel, H. Lobato, G. Wei, and D. Brooks, “Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [61] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [62] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An Instruction Set Architecture for Neural Networks,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [63] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [64] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [65] R. LiKamWa, Y. Hou, M. Polansky, Y. Gao, and L. Zhong, “RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
  • [66] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdan-bakhsh, J. Kim, and H. Esmaeilzadeh, “TABLA: A unified Template-based Framework for Accelerating Statistical Machine Learning,” in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2016.
  • [67] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. Kim, C. Shao, A. Misra, and H. Esmaeilzadeh, “From High-level Deep Neural Models to FPGAs,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.
  • [68] D. Moss, E. Nurvitadhi, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. Leong, “High Performance Binary Neural Networks on the Xeon+FPGA Platform,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2017.
  • [69] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operation Systems (ASPLOS), 2017.
  • [70] D. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. Leong, “A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study,” in Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays (FPGA), 2018.
  • [71] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.
  • [72] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An Accelerator for Sparse Neural Networks,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2016.
  • [73] J. Albericio, A. Delmas, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic Deep Neural Network Computing,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2017.
  • [74] G. Venkatesh, E. Nurvitadhi, and D. Marr, “Accelerating Deep Convolutional Networks using Low-precision and Sparsity,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  • [75] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong, Y. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh, “Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?,” in Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays (FPGA), 2017.
  • [76] P. Whatmough, S. Lee, H. Lee, S. Rama, D. Brooks, and G. Wei, “A 28nm SoC with a 1.2 GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with 0.1 Timing Error Rate Tolerance for IoT Applications,” in Proceedings of the International Solid State Circuits Conference (ISSCC), 2017.
  • [77] P. Whatmough, S. Lee, N. Mulholland, P. Hansen, S. Kodali, D. Brooks, and G. Wei, “DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses,” in Hot Chips: A Symposium on High Performance Chips, 2017.
  • [78] A. Delmas, P. Judd, D. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, and A. Moshovos, “Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How.” https://arxiv.org/abs/1803.03688, 2018.
  • [79] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, “Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks,” in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2018.
  • [80] J. Park, H. Sharma, D. Mahajan, J. Kim, P. Olds, and H. Esmaeilzadeh, “Scale-Out Acceleration for Machine Learning,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2017.
  • [81] Y. Li, J. Park, M. Alian, Y. Yuan, Q. Zheng, P. Pan, R. Wang, A. G. Schwing, H. Esmaeilzadeh, and N. S. Kim, “A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2018.