The complexity of deep neural network (DNN) based deep learning (DL) algorithms is scaling up rapidly. To meet the demands of these computation-hungry algorithms, accelerator-centric systems based on GPUs or custom-designed ASICs for DNNs, often referred to as neural processing units (NPUs), are being widely used to accelerate DL. Similar to how GPUs have evolved into a mainstream processor architecture, NPUs are expected to become first-class citizens in heterogeneous computing platforms due to the increasing number of application domains they are expected to accelerate.
Based on these trends, an important challenge that arises is how NPUs should be exposed to the end-user and how the memory address space should be exposed to the NPUs. Traditionally, I/O-attached accelerators such as GPUs had a memory address space separate from that of the CPU, forcing programmers to manage these distinct address regions through manual memory allocations, data copies, etc. As GPUs evolved to include a proper memory management unit (MMU) [78, 98, 77], programmers are now given the illusion of a unified CPU-GPU memory address space [68, 38], allowing the CPU and GPU to share a globally addressable memory regardless of whether the physical memory is shared or separate. Other key features enabled by GPU MMUs include memory oversubscription [57, 87], NUMA [1, 71, 73], and spatial sharing of a single GPU substrate while supporting page-granularity protections. Unfortunately, these features are not yet available for NPUs because they currently do not have an MMU for decoupling virtual addresses from physical addresses. Consequently, the range of applications that can utilize these accelerators is limited. For instance, NPUs can neither page-fault on missing pages nor oversubscribe the NPU memory, so the working set of a target DNN application must fit within the physical memory capacity: otherwise, the runtime crashes.
Given this landscape, we argue that future NPUs will need dedicated architectural support for virtual-to-physical address translation services. Conventional wisdom regarding DNN memory access characteristics is that they are highly regular and predictable, allowing the compiler to optimally decide how much memory to allocate and when/where to read/write data for NPU processing, obviating the need for fine-grained, page-level translations or NUMA capabilities. While this property held true for regular, “dense” DNNs (e.g., convolutional, recurrent, and multi-layer perceptrons, CNNs/RNNs/MLPs), emerging DL workloads employing embedding layers exhibit a highly “sparse”, irregular, and random memory access pattern over a large embedding lookup table (Figure 1). Recent studies from Baidu and Facebook [35, 76] state that their production-level DL workloads using embeddings already reached close to GBs of memory footprint because of these embedding tables, even for inference (Section III-A). Because current NPUs incorporate only tens of GBs of local memory, these memory-limited DL applications must partition their memory usage across CPU↔NPU (or NPU↔NPU under multi-NPU systems [24, 30]), incurring frequent CPU↔NPU (or NPU↔NPU) data transfers. Because of the irregular, random memory access nature of embedding layers, an MMU-less NPU must rely on the CPU to manually orchestrate data transfers on its behalf, experiencing significant performance overheads (Section V). Similar to how demand paging and NUMA have been crucial components in CPUs (and now GPUs, especially under CPU-GPU or multi-GPU systems), we believe that a robust NPU address translation service will enable a more diverse range of emerging, memory-hungry DL applications to be seamlessly executed on NPUs without falling into the pitfalls and performance overheads of manually managing NPU physical memory.
To this end, this paper explores the design space of NPU MMUs and identifies and addresses the unique challenges in adding architectural support for address translation based on a data-driven approach. Our study consists of two main parts:
We start by employing prior GPU-centric MMU solutions [77, 78] that utilize I/O memory management units (IOMMUs) to handle NPU address translations for conventional, dense DNNs. Interestingly, our analysis shows that, due to the fundamental architectural differences between GPUs and NPUs, a naive IOMMU address translation incurs significant performance overhead even for these dense DNNs. Concretely, while GPUs commonly use the on-chip SRAM for register-files and caches, NPUs almost exclusively utilize their SRAM for software-managed scratchpads. The activations and weights the NPUs operate on are typically multi-dimensional tensors, mapped to a traditional, linear (1D) memory subsystem. These tensors are much larger than the scratchpad, so the DMA unit blocks the activations/weights into tiles and sequences them across multiple (tile) fetch operations via double-buffering. As these tiles are also multi-dimensional tensors, fetching them into the scratchpad involves projecting the multi-dimensional coordinates into the linear space of DRAM memory. A single tile is therefore decomposed into a minimum number of linearized memory transactions, which can still number up to several thousands because a tile is sized at several MBs to maximally utilize the scratchpad. Consequently, a single tile fetch invokes significant bursts of page translations that conventional MMUs fail to effectively capture, leading to an average performance overhead.
Overall, we observe that the bursty nature of scratchpad-based NPU address translation traffic renders the translation throughput of the baseline IOMMU’s (multiple) page-table walkers (PTWs) a key performance bottleneck. As a result, unlike GPU MMUs, which are optimized for translation locality [77, 78], we argue that NPU MMUs should be designed for high translation throughput first and locality second. In this respect, we propose a throughput-centric NPU MMU (NeuMMU) design that effectively handles large bursts of translation requests. NeuMMU is designed to reduce the address translation overhead by leveraging the deterministic memory access behavior of dense DNNs and the inherent translation locality therein. Concretely, while TLBs are not as effective for NPUs as they are for CPUs or GPUs, we identify how translation burst locality exists within a given tile fetch in DNNs. To capture such intra-tile translation locality, we first propose our novel pending request merging buffer (PRMB) microarchitecture as a translation bandwidth filtering mechanism that reduces the number of distinct page-table walks concurrently in-flight, boosting effective translation bandwidth. In addition to the PRMB microarchitecture, the MMU must also sustain a high number of concurrent address translations, so we evaluate the need for a larger number of PTWs to maximize translation throughput. Unfortunately, the large number of translations can incur significant power overheads due to the additional memory accesses involved in the translation process. We make the key observation that dense DNN layers exhibit a regular dataflow, rendering their memory access patterns highly deterministic with only a handful of key data structures being manipulated (i.e., input/output activations, weights). This allows us to employ a lightweight translation path “register”, unlike the traditional MMU cache, that effectively filters down the number of translation-invoked DRAM accesses (a reduction of average ).
Putting everything together, while a naive IOMMU causes an average performance overhead, our NeuMMU effectively closes this gap, incurring only an average performance loss (Section IV).
With our robust NeuMMU design in hand, we then demonstrate the usefulness of the MMU’s address translation feature for handling embedding layers. We show that an MMU-less NPU suffers from (redundant) manual data copy operations between CPU and NPUs, leading to an average performance loss when executing embedding layers. Our NeuMMU again effectively closes this performance gap using direct NUMA accesses across remote memory regions, achieving significant performance improvements over an MMU-less NPU design (Section V). To summarize our key contributions:
To our knowledge, this work is the first to explore architectural support for NPU MMUs, a significant first step in exploring this emerging design space.
We conduct a detailed, data-driven analysis on conventional (dense) DNNs and emerging (sparse) embedding layers, root-causing the limitations of GPU-centric IOMMUs in handling the bursty address translations of NPUs.
We propose our throughput-centric NeuMMU design based on our novel pending request merging buffer, a high throughput parallel page-table walker, and a translation path register, only incurring an average performance loss.
Using DNN-based recommendation system models as driving examples, we showcase how efficiently sparse embedding layers can be handled using fine-grained NUMA or page-migrations, enabled by NeuMMU.
II Background & Methodology
II-A Address Translation in SPM-centric NPUs
NPUs generally utilize most of their on-chip SRAM as a scratchpad memory (SPM; see Footnote 1), whereas GPUs allocate the bulk of this space as register-files in order to spawn as many threads as possible for latency tolerance (Figure 2). In contrast, NPUs commonly leverage task-level parallelism to double-buffer the SPM so that the latency to fetch the input activations (IA) and weight filters (W) is hidden inside the latency to execute a layer. Because the size of IA and W can be hundreds to thousands of MBs, the DMA unit blocks the IA and W into smaller tiles and sequences them in and out of the SPM (e.g., typically tens of MBs in state-of-the-art NPUs [44, 72]) across multiple iterations (Figure 3). A key reason why NPUs prefer an SPM is the predictable performance it delivers: once the data is brought in from memory to the SPM, the latency to read/write data from/to the SPM is much more deterministic than with caches (i.e., SPM hit rate is ). From an MMU standpoint, address translation is not required when the processing elements (PEs) access the SPM during the compute phase for layer computations. In other words, the PEs need not query an MMU for address translations when accessing the SPM. During the memory phases, however, the DMA unit does require information regarding where inside the NPU physical address space the IA and W are located, as detailed below.
Footnote 1: While it is possible for future NPUs to adopt user-transparent caches, we argue that SPMs are a more natural fit for NPUs. This is because DNNs exhibit a highly deterministic dataflow (i.e., locality and data reuse information is statically available), making them more amenable to optimizations using software-managed scratchpads for maximal resource utilization.
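The latency-hiding effect of double-buffering can be illustrated with a small timing model. This is only a sketch under simplifying assumptions that are not taken from the paper: each tile has a single fetch latency and a single compute latency, tile i+1 is fetched while tile i computes, and a barrier separates consecutive steps. The function names and the cycle counts are hypothetical.

```python
def double_buffered_cycles(fetch, compute):
    """Cycles to process all tiles when tile i+1 is fetched while tile i computes."""
    if len(fetch) != len(compute) or not fetch:
        raise ValueError("need one fetch and one compute latency per tile")
    total = fetch[0]  # the first tile's fetch has nothing to hide behind
    for i, comp in enumerate(compute):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        # Barrier: the next step starts only when both phases have finished.
        total += max(comp, next_fetch)
    return total


def serialized_cycles(fetch, compute):
    """Cycles when memory and compute phases never overlap."""
    return sum(fetch) + sum(compute)


if __name__ == "__main__":
    compute = [10, 10, 10]                                # hypothetical cycle counts
    print(double_buffered_cycles([4, 4, 4], compute))     # fetch hidden behind compute
    print(double_buffered_cycles([12, 12, 12], compute))  # fetch exposed
    print(serialized_cycles([4, 4, 4], compute))
```

In this model, anything that lengthens a tile fetch beyond the compute latency of the previous tile (for example, stalls on address translation) shows up directly in the total.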
II-B NPU Programming Model and IOMMUs
Current NPU programming model. NPUs generally feature a private, physically-addressed memory. Consequently, the CPU must explicitly copy the necessary data structures (e.g., IA and W) from the host memory to the NPU (physical) memory address space. After the CPU→NPU data transfer is complete, the NPU-side DMA unit is given the target layer’s IA and W mapping information within the NPU physical address space. Concretely, the DMA unit is given the base (base) and the boundary value (bound) of the allocated data, which are used to derive the physical address of target data elements, obviating the need for a separate address translation. Such an approach is similar to the “old” GPGPU programming model and suffers from the same problems its users had to face: 1) the working set must fit within the NPU physical memory, preventing DNNs that oversubscribe NPU memory from being executed (e.g., large batch DNNs), and 2) it becomes challenging to support “pointer-is-a-pointer” semantics, reducing programmability and complicating situations where the CPU and NPU (or multiple NPUs) share data. To tackle these limitations, the I/O MMU (IOMMU) can be utilized to service accelerator-side virtual-to-physical address translations.
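A minimal sketch of the base/bound addressing described above follows. It assumes, purely for illustration, that `bound` is the exclusive end address of the allocation and that elements are laid out contiguously; the function name and the example addresses are hypothetical.

```python
def element_address(base, bound, index, elem_bytes):
    """Physical address of element `index` in an allocation spanning [base, bound)."""
    addr = base + index * elem_bytes
    # No translation step: the address is valid only if it stays inside the
    # physically contiguous region handed to the DMA unit.
    if index < 0 or addr + elem_bytes > bound:
        raise IndexError("access outside the [base, bound) allocation")
    return addr


if __name__ == "__main__":
    # Hypothetical 4 KB allocation of 4-byte elements.
    print(hex(element_address(0x1000, 0x2000, 3, 4)))
```

Because every address is derived from `base` alone, data that does not reside in the local physical region (for example, on the host or on another NPU) cannot be named at all, which is the limitation the paragraph points out.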
IOMMU hardware/software architecture. The IOMMU is granted the access privilege to walk the CPU’s page-tables, allowing the CPU and the (GPU/NPU) accelerators to share a unified global address space. When the accelerator is not able to locate a proper translation for its virtual address (VA), a translation request is sent as an ATS (address translation service) request packet over PCIe to the IOMMU. The IOMMU can include its own TLB hierarchy (called IOTLB), which is checked first when an ATS request is received. When the IOTLB misses, a hardware page-table walker (PTW) inside the IOMMU walks the CPU page-tables to retrieve the translated physical address (PA). Because a single IOMMU block is designed to be shared by multiple accelerators (e.g., GPUs, DSPs, ISPs, and NPUs), current IOMMUs employ multiple PTWs (typically but can be up to ), allowing multiple translations to be in-flight.
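The IOTLB-then-page-walk flow can be sketched with a toy model. The assumptions here are mine rather than the paper's: 4 KB pages, a 4-level radix page table with 9 index bits per level (as in x86-64 4-level paging), an unbounded IOTLB, and nested dictionaries standing in for page-table memory. The class and method names are hypothetical.

```python
PAGE_SHIFT = 12   # 4 KB pages
INDEX_BITS = 9    # 512 entries per page-table level (assumed, x86-64 style)
LEVELS = 4        # assumed number of page-table levels


def split_va(va):
    """Split a virtual address into per-level indices (root first) and page offset."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    mask = (1 << INDEX_BITS) - 1
    indices = [(vpn >> (INDEX_BITS * lvl)) & mask for lvl in reversed(range(LEVELS))]
    return indices, offset


class ToyIOMMU:
    def __init__(self):
        self.root = {}    # nested dicts model the page-table levels
        self.iotlb = {}   # VPN -> physical frame number, never evicted here
        self.walks = 0

    def map_page(self, va, pfn):
        indices, _ = split_va(va)
        node = self.root
        for idx in indices[:-1]:
            node = node.setdefault(idx, {})
        node[indices[-1]] = pfn

    def translate(self, va):
        indices, offset = split_va(va)
        vpn = va >> PAGE_SHIFT
        if vpn not in self.iotlb:          # IOTLB miss: walk the page table
            node = self.root
            for idx in indices:            # one lookup per level
                node = node[idx]           # a KeyError here models a page fault
            self.iotlb[vpn] = node
            self.walks += 1
        return (self.iotlb[vpn] << PAGE_SHIFT) | offset
```

A second access to the same page is served from the IOTLB, so `walks` only grows with the number of distinct pages touched, provided the first walk has already completed.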
II-C Evaluation Methodology
Baseline NPU architecture. Our baseline NPU architecture assumes a Google TPU-style systolic-array microarchitecture (Figure 2), which we modeled as a detailed, cycle-level performance simulator by cross-referencing publicly disclosed documents and patents from Google [44, 83, 84, 86, 85]. Our performance model is cross-validated against Google Cloud TPU, achieving an average correlation in terms of effective throughput. The baseline NPU model employs an SPM-based on-chip memory hierarchy and a weight-stationary dataflow, as implemented in the original TPU (Table I). Similar to discrete NPUs such as Intel-Nervana’s Neural Network Processor or Habana’s Gaudi, our NPU model utilizes a local, high-bandwidth memory (e.g., HBM). Similar to prior work [3, 46, 74, 16], we modeled the memory system as having fixed latency and bandwidth rather than employing a cycle-level DRAM simulator [82, 13] to reduce simulation time. When modeling IOMMUs, we assume an x86-64 style, hierarchical -level page-tables with key configuration parameters following those from prior related literature [77, 78, 32]. While the remainder of this paper assumes a systolic-array based NPU for our discussions, the effectiveness of our NeuMMU design remains intact for other NPU designs, such as spatial-array based microarchitectures [17, 16, 3, 74, 75], as these NPUs are also based on an SPM-centric memory hierarchy. We discuss the implication of alternative NPU architectures and DNN dataflows on our MMU proposal in Section VI-B.
|Operating frequency of PE||GHz|
|Scratchpad size (activations/weights)||/ MB|
|Number of memory channels|
|Memory access latency||cycles|
|Number of TLB entries|
|TLB hit latency||cycles|
|Number of page-table walkers|
|Latency to walk page-tables||cycles per level|
|NUMA access latency across system interconnect||cycles|
|CPU↔NPU Interconnect Bandwidth||GB/sec|
|NPU↔NPU Interconnect Bandwidth||GB/sec|
Benchmarks. We study six DL applications as part of our dense DNN workloads. We chose AlexNet, GoogLeNet, and ResNet [49, 92, 33] as our CNN application suite (denoted as CNN-1/CNN-2/CNN-3, respectively) because they cover a wide range of filter and activation sizes. We also include three RNNs from DeepBench: one regular GEMV (general matrix-vector multiplication) based RNN (RNN-1) and two LSTM based RNNs (RNN-2/RNN-3). For these workloads, we observe an intractable amount of simulation time when the batch size is larger than, so our analysis assumes a batch size of 1/4/8 (denoted as b01/b04/b08), which is reasonable for inference scenarios. To accommodate training scenarios, we experiment with a subset of the layers (i.e., a common layer configuration exhibited in each of our DNNs) with large batch sizes in Section VI-C as a sensitivity study to explore the implication of address translation on large batch training. When studying the effectiveness of NeuMMU in handling sparse DNN layers (e.g., embedding layers) in Section V, we use two recommendation system models: the neural collaborative filtering (NCF) based recommendation system from MLPerf and the recently open-sourced DLRM (deep learning recommendation model) from Facebook.
Page sizes (small vs. large). A key consideration in designing a virtual memory system is its page size. Compared to baseline 4 KB pages, large (MB) pages can potentially reduce translation-invoked stalls by increasing TLB reach and reducing TLB misses. As we discuss in Section VI-A, we find that MB large pages do in fact decrease the performance overhead of address translations for conventional, dense CNNs/RNNs, which exhibit highly regular dataflows. Unfortunately, for emerging DL applications employing sparse embedding layers with irregular memory access patterns (Section III-B), we find that demand paging with large page sizes incurs significant performance loss compared to small pages (an average vs. performance loss for small vs. large pages). Consequently, large pages alone are no silver bullet in designing a virtual memory system for NPUs, motivating the importance of robust address translation for small pages. As large pages perform well for conventional CNNs/RNNs, we assume the baseline 4 KB pages for our default evaluation. We revisit the implication of large pages on address translations, and their pitfalls for emerging, sparse DNN layers, in Section VI-A.
III Motivation: Why MMU for NPUs?
III-A Emerging, Memory-limited DL Workloads
A common property that conventional DL applications share (for both training and inference) is that their working set always fits within the tens of GBs of NPU local memory budget, an artifact of the physically-addressed NPU memory. However, recent studies from several hyperscalers [35, 76] project that emerging DL workloads are heavily memory “capacity” limited, exhibiting several hundreds of GBs of memory footprint. DL applications such as recommendation systems, word/character language models, and speech recognition, for instance, employ embedding layers which require several tens to hundreds of GBs of memory to store just the model weights themselves, even for inference. Figure 4 illustrates the usage of embedding layers in recommendation systems that incorporate neural networks, which are the current state-of-the-art algorithms being deployed for news feed, search, and ads. Facebook, for instance, stores deep learning features (e.g., the pages a particular user liked) as vectors called embeddings, which are utilized to recommend relevant posts or contents to users. Each user has a unique embedding vector, so the total number of vectors scales in proportion to the number of users. Embedding layers therefore house billions of weight parameters, which leads to their tens to hundreds of GBs of memory usage. As shown in Figure 4, recommender systems consist of two phases: 1) an embedding “lookup” phase that gathers multiple embedding vectors from potentially multiple lookup tables (e.g., two tables in Figure 4) and batches them into a single tensor, and 2) an execution phase that feeds the batched tensor through several multi-layer perceptron (MLP) layers. Because the model size of these embedding lookup tables is far beyond the memory capacity limits of GPUs/NPUs, the solutions vendors predominantly take are:
III-B NPU MMU for Remote Memory Access
Figure 5 shows a DNN-based recommendation system parallelized in an “accelerator-centric” fashion (Section III-A). That is, the compute-dominated MLPs are parallelized using data-parallelism to improve their performance, whereas the memory-capacity limited embedding tables are model-parallelized to overcome the constraints of (only tens of GBs of) accelerator local memory capacity. Assuming each accelerator is allocated a different lookup table, an all-to-all communication is required in order to shuffle the results of an embedding lookup of an entire minibatch on each accelerator into parts of a minibatch of embedding lookups on all accelerators. If the accelerators we assume here are GPUs, they have several options that enable the all-to-all communication process: 1) all GPUs can be passed a (shared) pointer to each embedding table, potentially stored in a remote GPU’s local memory, which allows any GPU with the pointer to directly load data in a CC-NUMA fashion (over NVLINK [1, 70]), or 2) use P2P cudaMemcpy to initiate direct GPU↔GPU DMA copy operations without having to utilize host-side pinned memory as an intermediate step. Unfortunately, neither of these options is available for an MMU-less NPU because it does not have the ability to address memory that is outside its local, physical memory address space. In other words, the NPU is not able to reference data that is not already available within its physical memory. As such, the CPU runtime must manually copy the embeddings from the source NPU memory to an intermediate CPU-side pinned memory, and then do another copy of these embeddings into the destination NPU memory. As we quantitatively detail in Section V, such multi-step data copies and data duplication add significant latency, leading to an average performance overhead.
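The difference between the staged copy path and a direct remote access can be sketched functionally. This is an illustrative model only: Python lists stand in for NPU and host memory, the byte counts ignore all protocol overheads, and the function names and vector size are hypothetical rather than taken from the paper.

```python
def gather(table, indices):
    """Embedding lookup: pick one row (vector) of the table per index."""
    return [list(table[i]) for i in indices]


def staged_remote_gather(remote_table, indices, vec_bytes):
    """MMU-less path: source NPU -> CPU pinned buffer -> destination NPU."""
    pinned_host = gather(remote_table, indices)   # copy 1: NPU to host
    local = [list(v) for v in pinned_host]        # copy 2: host to NPU
    bytes_moved = 2 * len(indices) * vec_bytes
    return local, bytes_moved


def direct_remote_gather(remote_table, indices, vec_bytes):
    """NUMA-style path: the destination NPU loads remote vectors directly."""
    return gather(remote_table, indices), len(indices) * vec_bytes


if __name__ == "__main__":
    table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 3 x 2 table
    print(staged_remote_gather(table, [2, 0], vec_bytes=8))
    print(direct_remote_gather(table, [2, 0], vec_bytes=8))
```

Both paths return the same vectors; the staged path moves every byte over the interconnect twice and additionally holds a duplicate in host memory.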
Given this landscape, we argue that NPUs are in urgent need of architectural support for robust address translations. In the remainder of this section, we first discuss the fundamental architectural differences between GPUs and NPUs and the limitations of blindly employing prior GPU-centric MMUs as-is. We then motivate the need for an NPU-optimized MMU design based on a data-driven approach. We revisit the usefulness of our NPU MMU design in handling DL applications using sparse embedding layers in Section V, which improves the performance of an MMU-less NPU by .
III-C Data-driven Analysis of NPU MMUs
Translation bursts in SPM-centric NPUs. As discussed in Section II-A, the data movements between main memory and SPM are conducted in coarse-grained tile chunks, which can be several MBs. For instance, our baseline NPU employs MB of SPM each for IA and W, so the tile size of IA and W can be as large as (/)= MB. Putting this number into perspective, assuming NPUs have an MMU that enables VA-to-PA translations, a single tile request by the DMA seeking to fully populate the MB on-chip SPM will need to access a minimum of ( MB/ KB) = K distinct pages under the baseline 4 KB page. The actual number of pages accessed can be much larger than this minimum number because the DMA is not necessarily fetching data in page-granularity in a dense fashion (i.e., worst case, the DMA fetches only a single word from a single page in a sparse manner). Figure 6 illustrates the average and maximum number of distinct pages accessed by a single tile requested by the DMA.
Note that the IA/W tiles are multi-dimensional tensors mapped to a traditional, linear (1D) DRAM memory. Consequently, a single tile tensor can be decomposed into multiple, linearized memory transactions by the DMA unit. Each of these memory transactions requires address translation to determine which page it belongs to, so the actual number of translations invoked can be much larger than the number of pages accessed (Figure 6). To make matters worse, these address translation requests are generated in large bursts within a short timeframe (henceforth referred to as translation bursts), which causes significant translation bandwidth pressure on the MMU (Figure 7). While these numbers might at first glance seem surprising, we observe that this is a natural outcome of NPU architectures optimized for data-/task-level parallelism using an SPM-based on-chip memory hierarchy. State-of-the-art NPUs typically contain tens of thousands of ALUs on-chip, so the SPM must be large enough to seamlessly feed these processing engines with useful work. As there is an implicit barrier enforced at the boundaries of a particular tile’s compute and memory phase (i.e., any given tile’s computation can be initiated only when the entire tile is fully fetched into the SPM, see Figure 3), the DMA unit tries to concurrently launch the data read requests to DRAM to maximize memory-level parallelism and fetch IA/W tiles as soon as possible, inevitably leading to translation bursts.
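The gap between the number of translations and the number of distinct pages can be made concrete with a small counting sketch. It assumes a 2D tile inside a larger row-major tensor, 4 KB pages, and a fixed transaction size of 64 bytes; the transaction size, the function name, and the example shapes are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # baseline 4 KB page


def tile_footprint(base, row_pitch, tile_rows, tile_row_bytes, txn_bytes=64):
    """Return (memory transactions, distinct pages) for one 2D tile fetch.

    base:           address of the tile's first element
    row_pitch:      bytes between consecutive rows of the enclosing tensor
    tile_row_bytes: contiguous bytes fetched per tile row
    """
    pages = set()
    txns = 0
    for row in range(tile_rows):
        start = base + row * row_pitch
        end = start + tile_row_bytes
        for addr in range(start, end, txn_bytes):
            txns += 1  # every transaction needs its own translation
            last_byte = min(addr + txn_bytes, end) - 1
            pages.update(range(addr // PAGE_SIZE, last_byte // PAGE_SIZE + 1))
    return txns, len(pages)


if __name__ == "__main__":
    # Rows packed back to back: many transactions, a single page.
    print(tile_footprint(base=0, row_pitch=256, tile_rows=4, tile_row_bytes=256))
    # Rows strided by 8 KB: same transaction count, one page per row.
    print(tile_footprint(base=0, row_pitch=8192, tile_rows=4, tile_row_bytes=256))
```

In both cases the transaction count exceeds the page count, which is the intra-tile translation reuse that the rest of the section tries to exploit.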
Pitfalls of GPU-centric MMUs for NPUs. As convolutions or matrix-multiplication operations are well-known to exhibit high data reuse thanks to their regular dataflow, one might think that conventional TLBs should effectively capture the translation reuse with high TLB hit rates. However, NPUs have fundamental architectural differences from GPUs, rendering prior GPU-centric MMU solutions ineffective in handling the aforementioned translation bursts. Recall that state-of-the-art NPU architectures are based on an SPM-centric memory hierarchy. The (PE↔SPM) data traffic does not require address translations because SPM is addressed using VAs rather than PAs. Consequently, unlike a GPU where a per-core, post-coalescing TLB can effectively reduce a substantial amount of GPU translation requests [77, 78], a per-PE TLB cannot help in filtering out the translation burst bandwidth pressure of NPUs. However, intra-tile translation locality does exist for (SPM↔DRAM) traffic when the DMA unit invokes multiple data fetch requests from memory to the SPM that fall under the same page. We observe, however, that such translation locality is not adequately captured with a conventional TLB hierarchy because the bursts of translations often query the TLB even before the PTW delivers the VA-to-PA translations. This phenomenon is a unique characteristic of SPM-centric NPUs: for GPUs, memory read/write operations are initiated through load and store instructions, which only amount to of the instruction mixes and are therefore likely to be distanced apart in time when sent over to the MMU for address translations. The SPM-centric NPU, however, invokes bursts of translation request traffic to the MMU within a short time window, leading to its high translation throughput requirements.
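Why a burst defeats the TLB even when all requests target the same page can be shown with a tiny timing sketch. The model is deliberately simplified and its assumptions are mine: one translation request arrives per cycle, the TLB is unbounded, and a translation becomes visible in the TLB only `walk_latency` cycles after the first miss to its page. The function name and cycle values are hypothetical.

```python
def burst_tlb_misses(vpns, walk_latency):
    """Count TLB misses when request i (for page vpns[i]) arrives at cycle i."""
    ready_at = {}  # VPN -> cycle at which its translation reaches the TLB
    misses = 0
    for cycle, vpn in enumerate(vpns):
        if vpn in ready_at and cycle >= ready_at[vpn]:
            continue  # TLB hit: the walk for this page already completed
        misses += 1   # the walk has not started or is still in flight
        ready_at.setdefault(vpn, cycle + walk_latency)
    return misses


if __name__ == "__main__":
    burst = [0] * 8  # eight back-to-back requests to the same page
    print(burst_tlb_misses(burst, walk_latency=100))  # all arrive before the fill
    print(burst_tlb_misses(burst, walk_latency=4))    # later requests hit
```

When the walk latency exceeds the burst length, every request misses despite perfect page-level locality, so a larger TLB does not help; absorbing the requests that target an in-flight walk does.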
Translation throughput vs. translation locality. In general, we observe that NPUs that utilize a naive, strawman IOMMU design enhanced with some key GPU-centric optimizations as-is (i.e., per-PE TLBs, parallel PTWs, a local multi-level TLB hierarchy) are not able to properly handle the NPU translation bursts because such a design is optimized for translation locality rather than translation throughput, experiencing severe performance slowdown as shown in Figure 8. In effect, the sheer volume of translations requested to the IOMMU leads to a large number of page-table walks, even after being filtered by the TLBs. This massive number of address translations eventually becomes bottlenecked by the limited parallelism provided by a handful of (eight) shared IOMMU PTWs. Overall, our data-driven analysis shows that conventional MMUs, which are primarily designed to capture translation locality (i.e., TLBs) rather than translation throughput (i.e., number of PTWs), are inadequate in handling the translation bursts in NPUs. To validate whether the TLB can be a primary target for improvement in NPU MMUs, we sweep the number of TLB entries on top of our baseline IOMMU with eight hardware PTWs. Even with an unrealistically large TLB with K entries ( increase over baseline K TLB entries, Table I), the NPU fails to completely filter out the bursts of translation requests, achieving less than performance improvement over the baseline IOMMU. Overall, we conclude that NPU local TLBs, while beneficial, are not sufficient to filter out most of the translations. This is because the bursts of address translations cause a significant number of page-table walks to be instantiated, which are then bottlenecked by the translation throughput provided by IOMMUs. This is in stark contrast with GPUs, where prior work [77, 78] has shown TLBs to be effective in capturing an average 70-80% of translations.
One might think that by having the DMA unit send data requests in a less bursty fashion (e.g., only allowing up to a limited number of data and address translation requests that the IOMMU can sustain), the effectiveness of TLBs can be restored and the performance loss reduced. Unfortunately, such a design decision will inevitably reduce memory-level parallelism and memory bandwidth utilization, significantly slowing down memory-limited applications like RNNs.
Proposed approach: throughput-centric MMU. Based on our data-driven analysis, we conclude that translation throughput should be the primary design target for NPU MMUs. This is because of the SPM-centric NPU’s unique architectural characteristic, where the tile-based bulk DMA transfers invoke translation bursts which are not adequately captured using the locality-optimized, GPU-centric MMUs. In the following section, we propose a “throughput”-centric NPU MMU design that effectively balances translation throughput while also adequately capturing translation locality.
IV NeuMMU: Designing an MMU for NPUs
IV-A PRMB: Translation Bandwidth Filter
As discussed in Section III-C, a key challenge with NPU address translation is that the SPM-centric memory hierarchy invokes a burst of several thousands of address translations when moving data in/out of main memory from/to SPM. Our first proposal is based on the key observation that a significant fraction of the requests in a translation burst hit in the same page that is already being translated by a PTW. To capture such translation locality within translation bursts, we propose a PTW design enhanced with a pending request merging buffer (PRMB) that absorbs the page translation requests falling under an already inflight, pending translation initiated by that same PTW. Figure 9 illustrates the microarchitecture of the proposed PTW with our PRMB, assuming that it can merge up to identical page translations per PTW. Any memory transaction that misses in the TLB is first routed to the pending translation scoreboard (PTS) to check whether any one of the parallel PTWs is currently in the process of translating the corresponding page. The PTS is a fully-associative cache with cache entries (equivalent to the number of PTWs) and is tagged with the virtual page number (VPN). A hit in the PTS implies that a VA-to-PA translation for this particular VPN is currently inflight. If there are vacant PRMB mergeable slots within the PTW, the PTS-hit request is merged inside the PRMB and waits until the translation comes back. A PTS miss, however, implies that neither the TLB nor any one of the PTWs contains the translation for this VPN. The PTS therefore assigns one of the vacant PTWs (if any) as the designated translation unit to walk the page-tables, and registers the VPN information inside one of the PTS entries so that future translations to this particular VPN can be merged. When all the PTWs as well as all possible PRMB mergeable slots are full, any further translation requests are blocked until translation bandwidth becomes available.
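The PTS/PRMB decision flow described above can be sketched as a small behavioral model. This is not the authors' implementation: it tracks only bookkeeping (no timing), it treats a PTS hit with a full PRMB as a blocked request (the text only specifies blocking when all PTWs and all mergeable slots are full), and the class name, method names, and the example configuration are hypothetical.

```python
class PRMBWalkerPool:
    """Behavioral sketch of parallel PTWs fronted by a PTS, each with a PRMB."""

    def __init__(self, num_ptws, prmb_slots):
        self.num_ptws = num_ptws
        self.prmb_slots = prmb_slots
        # PTS: VPN -> [initiating request, merged requests...]; one entry per busy PTW.
        self.pts = {}
        self.walks_started = 0

    def on_tlb_miss(self, req_id, vpn):
        """Route a TLB-missing request; returns 'walk', 'merged', or 'blocked'."""
        waiting = self.pts.get(vpn)
        if waiting is not None:                   # PTS hit: walk already in flight
            if len(waiting) - 1 < self.prmb_slots:
                waiting.append(req_id)            # absorbed by that PTW's PRMB
                return "merged"
            return "blocked"                      # assumption: full PRMB stalls
        if len(self.pts) < self.num_ptws:         # PTS miss and a PTW is vacant
            self.pts[vpn] = [req_id]
            self.walks_started += 1
            return "walk"
        return "blocked"                          # no translation bandwidth left

    def on_walk_done(self, vpn):
        """Walk finished: free the PTW and return every request it satisfies."""
        return self.pts.pop(vpn)


if __name__ == "__main__":
    pool = PRMBWalkerPool(num_ptws=2, prmb_slots=2)  # hypothetical sizes
    for req, vpn in enumerate([10, 10, 10, 10, 11, 12]):
        print(req, vpn, pool.on_tlb_miss(req, vpn))
    print(pool.on_walk_done(10), pool.walks_started)
```

Each merged request consumes neither a PTW nor a page-table memory access, which is the filtering effect the PRMB is meant to provide.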
Because the translations that are merged inside the PRMB do not send a separate page-walk request and instead wait for the already inflight translation request to come back, our PRMB microarchitecture saves not only memory bandwidth but, more importantly, the PTW “translation bandwidth”, effectively functioning as a translation bandwidth filtering mechanism. Once the translation is available, the PTW controller queries the PRMB and returns the merged requests back to the DMA unit on a cycle-by-cycle basis. Figure 10 shows the effect of the PRMB on overall performance as a function of how many mergeable entries are provisioned inside each PRMB. As depicted, for our studied DNNs, having - mergeable slots per PTW captures a significant portion of the translation burst locality, thereby preventing redundant translation requests from wasting memory and translation bandwidth. This allows our PRMB-enhanced NPU to achieve an average (max ) performance of an oracular MMU, a significant improvement over the baseline IOMMU. Nonetheless, there still exists a significant performance gap of motivating us toward our second proposition.
IV-B (Translation) Throughput-centric MMU
The PRMB-enhanced PTW helps capture translation burst locality while minimizing waste in memory and translation throughput. Nonetheless, the low average TLB hit rate and the sheer volume of required address translations place significant pressure on the meager IOMMU page-table walkers. While Power et al. similarly observed that adding parallelism to the PTWs helps improve the GPU’s translation throughput, adding more parallel PTWs was only able to achieve, on average, of the oracular MMU design point under the GPU context. This is because leveraging translation locality, using the per-core/post-coalescing TLB and multi-level TLB hierarchy, was shown to be more important than enhancing raw translation throughput for GPUs [77, 78].
Our work, on the other hand, makes the unique observation that the bursty nature of NPU translation requests, coupled with the relatively low TLB hit rates, “mandates” a throughput-centric MMU as a primary design objective. The key insight our data-driven analysis delivers is that SPM-centric NPUs should be designed for improving translation throughput first, and translation locality second. Accordingly, our second proposition is that the NPU MMU should be further enhanced for translation throughput by adding a larger number of PTWs. Figure 11 shows the NPU performance sensitivity to address translation throughput, where increasing the number of PTWs from to closes the performance gap from an average to for baseline KB pages: PTWs turned out to be a good design point for the set of benchmarks we have evaluated, but a larger or smaller number of PTWs might be required for alternative NPU configurations. We discuss the sensitivity of NeuMMU for alternative design points in Section VI-C. As noted in Section III-C, the IOMMU is designed to be shared by multiple accelerators. To make sure the NPU alone does not saturate the address translation throughput, we argue that the number of PTWs be sufficiently provisioned such that it does not become a performance hotspot. Studying efficient MMU resource allocation strategies across multiple accelerators for QoS is beyond our scope and we leave it as future work.
It is worth pointing out that blindly increasing the number of PTWs alone, without employing our translation bandwidth filtering PRMB microarchitecture, can cause significant overheads in energy-efficiency. Figure 12(a) shows the performance of the baseline IOMMU enhanced with a larger number of PTWs without our PRMB design adopted. With PTWs with no PRMB, the performance does in fact match the performance of NeuMMU with PRMB and PTWs. Such a design point, however, consumes significantly more energy as shown in Figure 12(b). Without the translation bandwidth filtering effects of PRMB, a significant fraction of the translations that walk the page-tables are redundant and cause up to more energy consumption than the nominal PRMB and PTWs of NeuMMU. Our novel PRMB microarchitecture and the throughput-centric parallel PTW design effectively balance performance and energy-efficiency, reaching of the performance of oracle while consuming much less energy than relying solely on a large number of PTWs without PRMB.
IV-C Translation Path “Registers” (not Caches)
An important challenge with the NeuMMU design so far is that a significant fraction of translations still require a page-table walk. While the abundant address translation throughput provided by the NeuMMU design so far effectively hides the latency of translations, the number of page-table walks is still relatively high because of the low TLB hit rate. As our study assumes x86-64 style, hierarchical 4-level page-tables, a single page-table walk operation incurs up to four memory transactions with significant power overheads. For power-limited environments, the overhead of adding address translations can be prohibitive, which leads to our last proposal: a lightweight translation path “register” (TPreg) that allows PTWs to skip some page-table walking steps. Our TPreg microarchitecture is inspired by the well-known MMU caches (aka translation path caches), widely adopted in CPUs/GPUs, but TPreg leverages the unique characteristics of DNNs to minimize its implementation overheads (i.e., less than bytes per PTW) while reducing the number of memory transactions invoked by page-table walks by more than .
Benefits of caching translation paths. Under an x86-64 based translation system, paged virtual memory is implemented using a radix tree for the page-tables. The translation path caches accelerate the page-table walking process by allowing the processor (in our case, the NPU PTW) to skip over one or more levels of the radix tree. A virtual address is decomposed into a page number and a page offset, where the page number is further split into a sequence of indices, four in x86-64. The first index (L4) is used to select an entry from the root of the radix tree, which could potentially contain a pointer to a node in the next lower level (i.e., L3) of the tree. If a valid entry is found, the next index value is used to jump to the next tree level, which can again potentially find a pointer to the node in the next lower level (L2) of the tree. This procedure is repeated until the selected entry is invalid or the tree search finalizes at a data page using the PA. As x86-64 currently uses 48 bits out of the 64 memory addressable bits, a baseline 4 KB page size utilizes the lower 12 bits as the page offset and the remaining 36 bits are divided into four 9-bit indices. Because page-table walks to two consecutive VA pages will most likely use the same L4/L3/L2 entries, significant translation locality exists across spatially close VA regions.
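To make the index decomposition concrete, the following sketch splits a virtual address under standard x86-64 4-level paging with 4 KB pages (a 48-bit VA with a 12-bit page offset and four 9-bit indices). The function names `split_va` and `shares_upper_path` are hypothetical and chosen only for this illustration.

```python
OFFSET_BITS = 12   # 4 KB page offset
INDEX_BITS = 9     # 512 entries per page-table node
INDEX_MASK = (1 << INDEX_BITS) - 1


def split_va(va: int):
    """Return (L4, L3, L2, L1, offset) for a 48-bit x86-64 virtual address."""
    offset = va & ((1 << OFFSET_BITS) - 1)
    l1 = (va >> OFFSET_BITS) & INDEX_MASK
    l2 = (va >> (OFFSET_BITS + INDEX_BITS)) & INDEX_MASK
    l3 = (va >> (OFFSET_BITS + 2 * INDEX_BITS)) & INDEX_MASK
    l4 = (va >> (OFFSET_BITS + 3 * INDEX_BITS)) & INDEX_MASK
    return l4, l3, l2, l1, offset


def shares_upper_path(va_a: int, va_b: int) -> bool:
    """True if both addresses use the same L4/L3/L2 page-table entries."""
    return split_va(va_a)[:3] == split_va(va_b)[:3]
```

Two addresses that fall within the same 2 MB-aligned region (512 consecutive 4 KB pages) differ only in the L1 index and the offset, which is the locality that translation path caching exploits.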
Design space of translation path caches. x86-64 processor vendors already employ private, low-latency translation path caches that store upper-levels of the page-table entries [41, 5], and the tradeoffs of alternative translation path cache designs are well-understood through prior literature. A full design space exploration of all available options for our NPU MMU design is beyond the scope of this work. Nevertheless, we briefly discuss two representative design points that are inspired by the translation path caches employed by CPUs from Intel/AMD, which drive our proposed TPreg design. The most intuitive implementation of a translation path cache is to store page-table entries tagged by the corresponding entry’s physical address in memory. Entries from different levels of the (L4/L3/L2/L1) page-tables are mixed and shared inside a unified cache, all indexed and tagged by their physical address. Such a unified page-table cache (UPTC) is known to be adopted in AMD’s processor designs. Intel, on the other hand, employs a translation cache design that is tagged using the virtual address. The translation-path cache (TPC) microarchitecture, for instance, is tagged by the L4/L3/L2 indices of the virtual address. The key intuition behind the TPC design is that the three separate UPTC cache entries allocated to keep track of the three page-table lookups can be merged into a single TPC lookup when: 1) all physical page numbers are concatenated and merged into a single data entry, and 2) the entry is tagged using a concatenation of the virtual L4/L3/L2 indices. Under such a design, a single TPC entry corresponds to an entire path, including all of the intermediate entries for a given page-table walk operation.
Our study reveals that TPC is much more effective than UPTC in capturing the NPU address translation locality. On average, the L4/L3/L2 tag hit rate of TPC was // across the studied workloads whereas UPTC achieved an average hit rate. This allows TPC to reduce less page table walks when compared to UPTC.
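The structural difference between the two tagging schemes can be illustrated with a toy page-table model. The sketch below is a simplification under several assumptions: `Memory`, `map_page`, `walk_uptc`, and `walk_tpc` are hypothetical names, both caches are unbounded Python dictionaries, the UPTC here caches only the upper three levels to keep the comparison direct, and the TPC keeps just the final L1-table pointer and matches on the full L4/L3/L2 tag only (no partial-path hits).

```python
ENTRY_BYTES = 8  # size of one page-table entry


def indices(va):
    """L4, L3, L2, L1 indices of a 48-bit VA with 4 KB pages."""
    return [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]


class Memory:
    """Toy physical memory: maps a page-table entry address to the base it points to."""

    def __init__(self):
        self.entries = {}
        self.next_free = 0x1000
        self.reads = 0  # counts page-table memory transactions

    def alloc(self):
        base = self.next_free
        self.next_free += 0x1000
        return base

    def read(self, addr):
        self.reads += 1
        return self.entries[addr]


def map_page(mem, root, va):
    """Install a 4-level mapping for `va`; returns the data frame base."""
    base = root
    for idx in indices(va):
        addr = base + idx * ENTRY_BYTES
        if addr not in mem.entries:
            mem.entries[addr] = mem.alloc()
        base = mem.entries[addr]
    return base


def walk_uptc(mem, root, va, cache):
    """UPTC style: one cache entry per upper-level PTE, tagged by its physical address."""
    base = root
    for level, idx in enumerate(indices(va)):
        addr = base + idx * ENTRY_BYTES
        if level < 3 and addr in cache:
            base = cache[addr]          # one probe per level
        else:
            base = mem.read(addr)
            if level < 3:
                cache[addr] = base
    return base


def walk_tpc(mem, root, va, cache):
    """TPC style: one cache entry per path, tagged by the virtual L4/L3/L2 indices."""
    l4, l3, l2, l1 = indices(va)
    tag = (l4, l3, l2)
    if tag not in cache:
        base = root
        for idx in (l4, l3, l2):
            base = mem.read(base + idx * ENTRY_BYTES)
        cache[tag] = base               # pointer to the L1 table for this path
    return mem.read(cache[tag] + l1 * ENTRY_BYTES)
```

In this toy setup, walking two consecutive pages costs four memory reads for the first page and one for the second under either scheme, but the UPTC holds three entries and probes three times per walk, whereas the TPC holds a single entry and resolves the whole upper path with one probe.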
Translation path “registers” (not caches). Based on our design space exploration above, we conclude that TPC-based translation caching is a more robust architecture than UPTC. As shown in Figure 13, employing a single translation path “register” (TPreg) per each PTW (which caches the L4/L3/L2 entries as done in TPC) can capture most of the performance benefits of translation caching while removing a significant fraction of the redundant page-table walks. As such, TPreg can be a lightweight, cost-effective solution to reduce the number of memory transactions for NPU page-table walks. Below we detail the key insights behind the effectiveness of our TPreg microarchitecture.
The number of distinct VA regions accessed is confined within a handful of large segments in the VA space (i.e., IA and W), so translations to VA pages that fall under the same (IA/W) segment are highly likely to share a common L4/L3/L2 translation entry.
Another important observation we make is that the DMA unit initiates tile fetch requests for IA and W one at a time, meaning the data fetch request is not interleaved across IA and W. This implies that the majority of address translation requests invoked in the memory-to-SPM data fetch process will naturally share the L4/L3/L2 entries. Figure 14 illustrates the virtual addresses accessed in time, confirming the high temporal locality of address translations which translation caching can effectively take advantage of.
While the upper L4/L3/L2 entries exhibit high temporal locality, the locality in the lower entries can be low because the VAs accessed over time follow a streaming access pattern, as shown in Figure 14. This allows a TPC-style translation cache with only a handful of entries to capture most of the performance benefits of translation path caches (Figure 13), motivating our lightweight translation path “register” rather than a full-blown cache microarchitecture.
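A back-of-the-envelope sketch of why a single register suffices for a streaming access pattern is given below. It counts page-table memory transactions for a sequential tile fetch, assuming 4 KB pages, a 4-level walk, and a register that matches only on the full L4/L3/L2 tag. The names `TPreg.walk_cost` and `stream_walk_cost` are hypothetical, and the counts are properties of this toy model, not results from the evaluation.

```python
PAGE_BYTES = 4096        # assumption: 4 KB pages
UPPER_SHIFT = 21         # bits above this form the L4/L3/L2 tag
FULL_WALK = 4            # memory transactions for a full 4-level walk


class TPreg:
    """Single-entry translation path register (tag only, for counting)."""

    def __init__(self):
        self.tag = None

    def walk_cost(self, va: int) -> int:
        """Return the number of page-table memory transactions for `va`."""
        tag = va >> UPPER_SHIFT
        if tag == self.tag:
            return 1             # upper path skipped, only the L1 entry is read
        self.tag = tag           # replace the register contents
        return FULL_WALK


def stream_walk_cost(base: int, num_pages: int, use_tpreg: bool = True) -> int:
    """Total page-table transactions when walking `num_pages` consecutive pages."""
    reg = TPreg()
    total = 0
    for page in range(num_pages):
        va = base + page * PAGE_BYTES
        total += reg.walk_cost(va) if use_tpreg else FULL_WALK
    return total
```

For a 1024-page stream starting at a 2 MB-aligned address, the register misses only twice (once per 512-page region), so nearly every walk in this model needs one memory transaction instead of four.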
Energy-efficiency improvements. While the effect of TPreg on performance is small, its impact on energy-efficiency is substantial. Using the energy table for a 45 nm CMOS process, we derive the energy overheads of walking the page-tables for the two design points: our NeuMMU with PTWs/ PRMB entries with and without the single-entry TPreg. Our lightweight TPreg substantially reduces the energy overheads by an average thanks to the high translation hit rates and the resulting smaller number of memory transactions to walk the page-tables.
IV-D Putting Everything Together
Overall, we show that the baseline IOMMU fails to capture the translation locality and throughput requirements of NPU MMUs, causing an average performance overhead for baseline pages. Based on a data-driven application characterization study, we motivated the need for a throughput-centric MMU and proposed three unique solutions tailored for the algorithmic nature of DNNs and the SPM-centric NPU architecture. Putting all three solutions together, our NeuMMU design incurs an average performance overhead for baseline/small pages when compared against an oracular MMU that assumes all translations hit in the TLB with no additional TLB access latency. Furthermore, NeuMMU consumes less energy than the baseline IOMMU, thanks to the PRMB and TPreg, which reduce the number of memory transactions invoked by page-table walks by .
IV-E Implementation Overhead
We measure NeuMMU’s design overhead using CACTI and synthesized implementations over an FPGA board. The additional SRAM storage required for the per-PTW PRMB and TPreg is as follows. Each PRMB entry is conservatively estimated to be bytes, so a total of () = KB SRAM storage is needed across the PTWs, PRMB entries per each PTW. Regarding the TPreg, each consumes bytes so the PTWs consume KB. The PTS is a fully-associative cache with cache entries, each entry sized at bytes. All these amount to an area of mm under nm with mW of leakage power consumption when estimated with CACTI. We also synthesize both the baseline IOMMU and NeuMMU on a Xilinx Virtex UltraScale+ VCU1525 dev board and compare their resource usage. The amount of additional resources NeuMMU consumes is less than of the available resources, incurring negligible overheads.
V Case Study: NUMA NPUs for Sparse Embedding Layers
As discussed in Section III-A, state-of-the-art recommendation systems using embeddings exhibit highly sparse, irregular memory accesses over a large embedding table (Figure 4). To overcome the memory capacity bottleneck, Facebook’s DLRM, for instance, employs an accelerator-centric parallelization strategy where the embedding table is model-parallelized across multiple GPUs (in our case the NPUs). This however comes at the cost of an all-to-all communication among the GPUs (NPUs) to gather all the embeddings from remote memory regions (Figure 5). Because current MMU-less NPUs cannot address memory outside their local, physical memory, an intermediate solution that facilitates remote embedding gathers would be to have the CPU runtime manually copy the (remote) embedding vectors to a buffer allocated in CPU memory, which is then copied over to the (local) NPU memory. Since the embedding vectors are sparse with only several hundreds of bytes in size, current GPUs can instead utilize their MMUs to directly access remote GPU memory for embedding gathers in a CC-NUMA fashion or migrate the missing pages into their local memory, obviating the need for a CPU-involved, intermediate data transfer and copy operation. With our proposed NeuMMU architecture in hand, the NPU can page-fault on missing pages mapped to a remote NPU’s physical memory, and either 1) directly fetch the embeddings using fine-grained NUMA accesses, or 2) migrate the missing page into its local memory (Section VI-A). Figure 15 shows the performance advantage of utilizing NUMA accesses for handling sparse embedding layers. As depicted, the baseline MMU-less NPU suffers from a significant increase in latency because of the (redundant) manual data copies over CPU memory.
NeuMMU is able to reduce the latency spent in gathering the embeddings as this process is undertaken directly over the (PCIe or NVLINK) system interconnect in a fine-grained NUMA fashion, achieving an average and latency reduction compared to the baseline NPU without an MMU. These results highlight the merits of featuring address translations in NPUs, which we believe will be of utmost importance as DL workloads evolve into having high capacity demands with irregular memory access behaviors.
VI-A NeuMMU with Large Pages
The computation and memory access characteristics of conventional dense CNNs/RNNs are highly regular with their tensor data (i.e., IA and W) sized at several hundreds of MBs. We observe that MB large pages can substantially decrease the performance overhead of baseline IOMMU address translations for the dense CNNs/RNNs (average , worst case ). Our NeuMMU architecture successfully removes such performance overheads, as with the baseline KB pages. Given the high efficiency of IOMMU based address translation under large pages, one might presume that NPU MMUs should simply employ large pages exclusively without baseline small pages. However, for DL workloads that exhibit sparse data access patterns with a large memory footprint (such as our studied sparse embedding layers), large pages can incur a much more significant performance overhead than small pages. Large pages increase the data transfer size of each demand-paged request (recall that a single embedding is only hundreds of bytes with low temporal/spatial locality, Figure 4), which not only causes significant (redundant) communication traffic on the system interconnect and hurts performance, but also wastes NPU physical memory by causing memory bloat via internal fragmentation. Rather than using NUMA to gather sparse embeddings (as assumed in Section V), Figure 16 summarizes the performance of small and large pages when utilizing NeuMMU to page-fault and migrate the missing (sparse) data using demand-paging into the NPU physical memory. For small pages, NeuMMU performs well to recover the lost performance and improves performance from up to an average of oracle. Unfortunately, the performance overhead of large pages cannot be recovered with NeuMMU because of the (redundant) prefetching effects of large pages over the sparse access patterns of NCF and DLRM. These results highlight the importance of providing robust address translation service for both small and large pages for current and future DL workloads.
Prior work by Ausavarungnirun et al. that synergistically combines small and large pages concurrently can be a promising solution to address these issues. Nonetheless, our paper focuses on efficient address translation support, so evaluating efficient page-fault handling and demand paging solutions that close such a performance gap for large pages is beyond our scope and we leave it as future work.
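The redundant-traffic argument for large pages under sparse gathers can be illustrated with simple arithmetic. The sketch below assumes each touched page is migrated exactly once, that embedding vectors are aligned so none straddles a page boundary, and uses 4 KB and 2 MB as example small and large page sizes; these values and the names `demand_paged_bytes` and `useful_bytes` are assumptions for illustration only.

```python
SMALL_PAGE = 4 * 1024          # assumed small page size
LARGE_PAGE = 2 * 1024 * 1024   # assumed large page size


def demand_paged_bytes(rows, vector_bytes, page_bytes):
    """Bytes migrated when every page touched by a gather is faulted in once."""
    touched_pages = {(row * vector_bytes) // page_bytes for row in rows}
    return len(touched_pages) * page_bytes


def useful_bytes(rows, vector_bytes):
    """Bytes the gather actually consumes."""
    return len(set(rows)) * vector_bytes
```

For example, gathering three 256-byte embeddings at table rows 0, 10000, and 20000 consumes 768 bytes, yet migrates 12288 bytes with small pages and 6291456 bytes with large pages in this model, since each gathered row lands on a different page in both cases.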
VI-B NeuMMU with Alternative NPU Architectures
There have been numerous NPU designs proposed in prior literature, so it is challenging to define a “generic” NPU architecture that represents all design points in such a fast-evolving space. As such, our baseline NPU assumed Google’s systolic-array microarchitecture as it is to date the most successfully deployed NPU design. To study the applicability of NeuMMU on alternative NPU designs, we also developed a cycle-level performance model that follows several representative prior works based on the spatial architecture design paradigm [17, 16, 3, 74, 75]. Our spatial-array based NPU design is modeled similar to DaDianNao or Eyeriss, which employ a two-dimensional grid of PEs, each of which contains a vector ALU that handles dot-product operations. These spatial NPU architectures also employ an SPM-centric on-chip memory hierarchy, which our NeuMMU design is founded upon, and our evaluation showed that our NeuMMU architecture is able to similarly close the performance gap of baseline IOMMUs, only incurring an average performance overhead. We omit the results due to space limitations.
The robustness of NeuMMU has been studied over several different architecture configurations as well as different application batch sizes. For NeuMMU design space exploration, we sweep the number of PRMB mergeable slots ( to ), parallel PTWs ( to ), TPreg entries, and total TLB entries ( to ). Across all the sensitivity studies, the performance achieved was never less than with an average of the oracular MMU. We also studied our workloads with large batch sizes of , , and . As mentioned in Section II-C, large batches lead to an intractable amount of simulation time, so we limit our evaluation to the common layer configuration of each DNN. Similar to small batches, the baseline IOMMU achieves an average of oracle. NeuMMU successfully closes this performance gap for large batches, reaching of oracular MMU. These results demonstrate the robustness of our throughput-centric NeuMMU for SPM-centric NPUs.
VII Related Work
Our work builds on top of prior studies on hardware and software mechanisms to accelerate MMUs. Here we first summarize closely related work on CPU/GPU MMUs, followed by a summary on prior literature designing ML accelerator architectures.
Address Translation for CPUs. As application memory footprints increase, commercial CPUs have started including multi-level TLB hierarchies [90, 40] with per-core MMU caches to accelerate the page-table walking process. Barr et al. explored the design space of MMU caches, showing that the most effective ones are unified translation caches. Prefetching translations [42, 88, 47], shared TLBs, and MMU caches have also been studied in the literature to alleviate translation overheads in various CPU contexts.
Address Translation for Accelerators. There has been pioneering work by Power et al. and Pichai et al. that explored the benefits of GPUs in utilizing the IOMMU for VA-to-PA address translations. Both of these studies proposed a per-core, post-coalescer TLB, dedicated logic to walk the page-tables, a multi-level shared TLB hierarchy, and a page translation cache, similar to our proposal. Hao et al. studied the utilization of IOMMUs for accelerator-centric systems, similarly proposing shared TLBs, parallel PTWs, and MMU caches. Our work differentiates itself from all these prior studies as we specifically target NPUs with a carefully designed, throughput-centric MMU that is tailored for DNNs. As discussed in Section IV, we quantitatively demonstrated that prior translation locality-centric MMUs are not able to sufficiently handle the translation bursts of DNNs. Our study provides the key insight that NPU MMUs should be designed for enhancing translation throughput, rather than translation locality, leading to our novel PRMB and TPreg microarchitecture on top of a massively parallel page-table walker design.
Accelerator architectures for ML. Aside from this closely related prior work on MMUs, there has been a large body of prior work exploring the design space of ML accelerator architectures [14, 17, 22, 59, 23, 79, 18, 16, 60, 89, 48, 58, 61, 91, 65, 25, 64, 80, 52, 53, 51, 19, 54] with recent interest in sparsity-optimized solutions for further energy-efficiency improvements [31, 3, 97, 46, 2, 94, 67, 95, 96, 74, 20, 81]. Our work on NPU MMUs is orthogonal to this prior art as our primary focus is on adding new features to these ML accelerator designs.
As the computation demands for DL workloads increase, we expect NPUs to evolve into first-class citizens in heterogeneous computing platforms. We make a case for providing address translation capabilities for NPUs, an important first step in evolving these devices as primary compute engines. Through a data-driven application characterization study, we root-cause the challenges in prior GPU-centric MMU solutions and propose three novel architecture designs tailored for the application behavior of DNNs. Compared to an oracular MMU design, our proposal achieves only an average performance overhead while allowing CPUs and NPUs to share a unified global address space.
-  (2015) Page Placement Strategies for GPUs within Heterogeneous Memory Systems. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §I, §I, §III-B, Fig. 15.
-  (2017) Bit-pragmatic Deep Neural Network Computing. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2016) . In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §II-C, §VI-B, §VII.
-  (2018) AMD’s I/O Virtualization Technology (IOMMU) Specification. Cited by: §II-B.
-  (2018) Developer Guides, Manuals and ISA Documents – AMD. Cited by: §IV-C.
-  (2017) Mosaic: a GPU memory manager with application-transparent support for multiple page sizes. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VI-A.
-  (2018) MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §I.
-  (2017) DeepBench: Benchmarking Deep Learning Operations on Different Hardware. Note: https://github.com/baidu-research/DeepBench Cited by: §II-C.
-  (2010) Translation Caching: Skip, Don’t Walk (the Page Table). In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §I, §IV-C, §IV-C, §VII.
-  (2011) Shared Last-level TLBs for Chip Multiprocessors. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §VII.
-  (2013) Large-reach Memory Management Unit Caches. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2016) Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §III-A.
-  (2012) USIMM: the Utah SImulated Memory Module. Cited by: §II-C.
DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §VII.
-  (2016) Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §II-C.
-  (2016) Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. In Proceedings of the International Solid State Circuits Conference (ISSCC), Cited by: §II-C, §III-C, §VI-B, §VII.
-  (2014) DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §II-C, §VI-B, §VII.
-  (2016) A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2020-02) PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §VII.
-  (2018) Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How. Note: https://arxiv.org/abs/1803.03688 Cited by: §VII.
-  (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In arxiv.org, Cited by: Fig. 1.
-  (2015) ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2015) Neuromorphic Accelerators: A Comparison Between Neuroscience and Machine-Learning Approaches. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2019) Accelerating Facebook’s infrastructure with Application-Specific Hardware. Note: https://code.fb.com/data-center-engineering/accelerating-infrastructure/ Cited by: §I.
-  (2017) TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §VII.
Cloud TPUs: ML accelerators for TensorFlow. Cited by: §II-B.
-  (2018) Google Cloud TPU Beta Release. Note: https://cloud.google.com/tpu/docs/release-notes Cited by: §II-C.
-  (2014) Neural Turing Machines. In arxiv.org, Cited by: Fig. 1.
-  (2019) The Architectural Implications of Facebook’s DNN-based Personalized Recommendation. In arxiv.org, Cited by: item 1.
-  (2019) Gaudi Training Platform White Paper. Cited by: §I, §II-C.
-  (2016) EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2017) Supporting Address Translation for Accelerator-Centric Architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §II-C, §VII.
-  (2016) Deep Residual Learning for Image Recognition. In , Cited by: §II-C.
-  (2017) Neural Collaborative Filtering. In Proceedings of the International Conference on World Wide Web (WWW), Cited by: Fig. 1, Fig. 4, §II-C, §III-A, Fig. 16.
-  (2019) Beyond Human-Level Accuracy: Computational Challenges in Deep Learning. In Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPOPP), Cited by: §I, Fig. 5, item 2, §III-A.
-  (2013) Energy Table for 45nm Process. Cited by: Fig. 12, §IV-C.
-  (2016) CACTI: An Integrated Cache and Memory Access Time, Cycle Time, Area, Leakage, and Dynamic Power Model. Note: http://www.hpl.hp.com/research/cacti/ Cited by: Fig. 12, §IV-E.
-  (2018) Heterogeneous System Architecture. Cited by: §I.
-  (2018) Intel Nervana Hardware: Neural Network Processor (Lake Crest). Cited by: §II-C.
-  (2013) 4th Generation Intel Core Processor, Codenamed Haswell. Cited by: §VII.
-  (2018) Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide Part 1.. Cited by: §IV-C.
-  (1998) A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §VII.
-  (2018) High Bandwidth Memory (HBM2) DRAM. Cited by: §II-C.
-  (2017) In-datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: Fig. 2, §II-A, §II-C.
-  (2016) Exploring the Limits of Language Modeling. In arxiv.org, Cited by: §III-A.
-  (2016) Stripes: Bit-serial Deep Neural Network Computing. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §II-C, §VII.
-  (2002) Going the Distance for TLB Prefetching: An Application-driven Study. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2016) Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Cited by: §II-C.
-  (2014) One Weird Trick For Parallelizing Convolutional Neural Networks. Note: https://arxiv.org/abs/1404.5997 Cited by: item 2.
-  (2019) TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2018) A Case for Memory-Centric HPC System Architecture for Training Deep Neural Networks. In IEEE Computer Architecture Letters, Cited by: §VII.
-  (2018) Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2019) A Disaggregated Memory System for Deep Learning. In IEEE Micro, Cited by: §VII.
-  (2016) Coordinated and Efficient Huge Page Management with Ingens. In OSDI, Cited by: §VI-A.
-  (2013) GPUWattch : Enabling Energy Optimizations in GPGPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §III-C.
-  (2019) A Framework for Memory Oversubscription Management in Graphics Processing Units. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §I.
-  (2016) RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2015) PuDianNao: A Polyvalent Machine Learning Accelerator. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §VII.
-  (2016) Cambricon: An Instruction Set Architecture for Neural Networks. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2016) TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §VII.
-  (2017) Beyond the Socket: NUMA-Aware GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: Fig. 15.
-  (2019) MLPerf: A Broad ML Benchmark Suite for Measuring Performance of ML Software Frameworks, ML Hardware Accelerators, and ML Cloud Platforms. Note: https://github.com/mlperf/inference/tree/master/cloud Cited by: Fig. 1, §II-C.
-  (2018) A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays (FPGA), Cited by: §VII.
-  (2017) High Performance Binary Neural Networks on the Xeon+FPGA Platform. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), Cited by: §VII.
-  (2019) Deep Learning Recommendation Model for Personalization and Recommendation Systems. In arxiv.org, Cited by: Fig. 1, §II-C, Fig. 5, item 1, item 2, §V, Fig. 16.
-  (2017) Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays (FPGA), Cited by: §VII.
-  (2013) Unified Memory in CUDA 6. Cited by: §I.
-  (2016) NVIDIA CUDA Programming Guide. Cited by: §II-A.
-  (2016) NVLINK High-Speed Interconnect. Cited by: §III-B, Fig. 15.
-  (2017) The NVIDIA DGX-2 Deep Learning System. Cited by: §I, §I.
-  (2018) NVIDIA Tesla V100. Cited by: §II-A.
-  (2018) NVSwitch: Leveraging NVLink to Maximum Effect. Cited by: §I.
-  (2017) SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §II-C, §VI-B, §VII.
-  (2018) Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §II-C, §VI-B.
-  (2018) Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications. In arxiv.org, Cited by: §I, §III-A.
-  (2014) Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cited by: §I, §I, §I, §II-C, §III-C, §III-C, §IV-B, §VII.
-  (2014) Supporting x86-64 Address Translation for 100s of GPU Lanes. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §I, §I, §I, §II-C, §III-C, §III-C, §IV-B, §VII.
-  (2016) Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2016) vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §I, §II-B, §VII.
-  (2018) Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §VII.
-  (2011) DRAMSim2: A Cycle Accurate Memory System Simulator. Cited by: §II-C.
-  (2015-05) Neural Network Processor. Note: Patent US 9747546B2 Cited by: §II-C.
-  (2015-05) Computing Convolutions Using a Neural Network Processor. Note: Patent US 9697463B2 Cited by: §II-C.
-  (2015-05) Rotating Data for Neural Network Computations. Note: Patent US 9747548B2 Cited by: §II-C.
-  (2015-05) Prefetching Weights for Use in a Neural Network Processor. Note: Patent US 9805304B2 Cited by: §II-C.
-  (2017) Unified Memory on Pascal and Volta. Cited by: §I.
-  (2000) Recency-based TLB-preloading. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2016) ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proceedings of the International Symposium on Computer Architecture (ISCA), Cited by: §VII.
-  (2012) Sparc T4: A Dynamically Threaded Server-on-a-chip. In IEEE Micro, Cited by: §VII.
-  (2016) From High-level Deep Neural Models to FPGAs. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2015) Going Deeper with Convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-C.
-  (2017) Attention Is All You Need. In arxiv.org, Cited by: Fig. 1.
-  (2017) Accelerating Deep Convolutional Networks using Low-precision and Sparsity. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §VII.
-  (2017-02) A 28nm SoC with a 1.2 GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications. In Proceedings of the International Solid State Circuits Conference (ISSCC), Cited by: §VII.
-  (2017-08) DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses. In Hot Chips: A Symposium on High Performance Chips, Cited by: §VII.
-  (2016-10) Cambricon-X: An Accelerator for Sparse Neural Networks. In Proceedings of the International Symposium on Microarchitecture (MICRO), Cited by: §VII.
-  (2016-03) Toward High-Performance Paged-Memory for GPUs. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Cited by: §I, Fig. 15.