Bulk data movement, the movement of thousands or millions of bytes between two memory locations, is a common operation performed by an increasing number of real-world applications (e.g., [37, 57, 74, 82, 85, 88, 94, 99, 110, 89, 58, 6]). Therefore, it has been the target of several architectural optimizations (e.g., [4, 35, 88, 103, 110, 58, 86, 70, 40, 6]). In fact, bulk data movement is important enough that modern commercial processors are adding specialized support to improve its performance, such as the Enhanced REP MOVSB/STOSB (ERMSB) feature recently added to the x86 ISA.
In today’s systems, to perform a bulk data movement between two locations in memory, the data needs to go through the processor even though both the source and destination are within memory. To perform the movement, the data is first read out one cache line at a time from the source location in memory into the processor caches, over a pin-limited off-chip channel (typically 64 bits wide). Then, the data is written back to memory, again one cache line at a time over the pin-limited channel, into the destination location. By going through the processor, this data movement incurs a significant penalty in terms of latency and energy consumption.
To address the inefficiencies of traversing the pin-limited channel, a number of mechanisms have been proposed to accelerate bulk data movement (e.g., [35, 63, 88, 110]). The state-of-the-art mechanism, RowClone, performs data movement completely within a DRAM chip, avoiding costly data transfers over the pin-limited memory channel. However, its effectiveness is limited because RowClone can enable fast data movement only when the source and destination are within the same DRAM subarray. A DRAM chip is divided into multiple banks (typically 8), each of which is further split into many subarrays (16 to 64), shown in Figure (a), to ensure reasonable read and write latencies at high density [8, 32, 33, 45, 101]. (Footnote 1: We refer the reader to our prior works [9, 45, 57, 56, 44, 43, 10, 22, 11, 54, 58, 55, 61, 60, 75, 8, 88, 89, 21, 41, 42, 39] for a detailed background on DRAM.) Each subarray is a two-dimensional array with hundreds of rows of DRAM cells, and contains only a few megabytes of data (e.g., 4MB in a rank of eight 1Gb DDR3 DRAM chips with 32 subarrays per bank). While two DRAM rows in the same subarray are connected via a wide (e.g., 8K bits) bitline interface, rows in different subarrays are connected via only a narrow 64-bit data bus within the DRAM chip (Figure (a)). Therefore, even for previously-proposed in-DRAM data movement mechanisms such as RowClone, inter-subarray bulk data movement incurs long latency and high memory energy consumption even though data does not move out of the DRAM chip.
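The 4MB subarray capacity quoted above follows directly from the example organization. The short Python sketch below reproduces the arithmetic; the constant and function names are illustrative, and the bank count of 8 is the "typical" value mentioned in the text rather than a property of any specific part.

```python
# Example organization from the text: a rank of eight 1Gb DDR3 chips,
# 8 banks per chip (the "typical" value), and 32 subarrays per bank.
CHIPS_PER_RANK = 8
CHIP_CAPACITY_GBIT = 1
BANKS_PER_CHIP = 8
SUBARRAYS_PER_BANK = 32


def subarray_capacity_mb(chips=CHIPS_PER_RANK,
                         chip_gbit=CHIP_CAPACITY_GBIT,
                         banks=BANKS_PER_CHIP,
                         subarrays=SUBARRAYS_PER_BANK):
    """Capacity (in MB) of one subarray, counted across all chips in the rank."""
    rank_bytes = chips * chip_gbit * 2**30 // 8   # gigabits -> bytes
    return rank_bytes / (banks * subarrays) / 2**20


print(subarray_capacity_mb())  # 4.0
```

Changing the subarray count to 16 or 64 in this sketch gives 8MB or 2MB, which is consistent with the "few megabytes" characterization.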
While it is clear that fast inter-subarray data movement can have several applications that improve system performance and memory energy efficiency [37, 74, 82, 85, 88, 110, 89, 55, 6], there is currently no mechanism that performs such data movement quickly and efficiently. This is because no wide datapath exists today between subarrays within the same bank (i.e., the connectivity of subarrays is low in modern DRAM). Our goal is to design a low-cost DRAM substrate that enables fast and energy-efficient data movement across subarrays.
2 Low-Cost Inter-Linked Subarrays (LISA)
We make two key observations that allow us to improve the connectivity of subarrays within each bank in modern DRAM. First, accessing data in DRAM causes the transfer of an entire row of DRAM cells to a buffer (i.e., the row buffer, where the row data temporarily resides while it is read or written) via the subarray’s bitlines. Each bitline connects a column of cells to the row buffer, interconnecting every row within the same subarray (Figure (a)). Therefore, the bitlines essentially serve as a very wide bus that transfers a row’s worth of data (e.g., 8K bits in a chip) at once. Second, subarrays within the same bank are placed in close proximity to each other. Thus, the bitlines of a subarray are very close to (but are not currently connected to) the bitlines of neighboring subarrays (as shown in Figure (a)).
Key Idea. Based on these two observations, we introduce a new DRAM substrate, called Low-cost Inter-linked SubArrays (LISA). LISA enables low-latency, high-bandwidth inter-subarray connectivity by linking neighboring subarrays’ bitlines together with isolation transistors, as illustrated in Figure (b). We use the new inter-subarray connection in LISA to develop a new DRAM operation, row buffer movement (RBM), which moves data that is latched in an activated row buffer in one subarray into an inactive row buffer in another subarray, without having to send data through the narrow internal data bus in DRAM. RBM exploits the fact that the activated row buffer has enough drive strength to induce charge perturbation within the idle (i.e., precharged) bitlines of neighboring subarrays, allowing the destination row buffer to sense and latch this data when the isolation transistors are enabled. We describe the detailed operation of RBM in our HPCA 2016 paper.
By using a rigorous DRAM circuit model that conforms to the JEDEC standards  and ITRS specifications [30, 31], we show that RBM performs row buffer movement at 26x the bandwidth of a modern 64-bit DDR4-2400 memory channel (500 GB/s vs. 19.2 GB/s), even after we conservatively add a large (60%) timing margin to account for process and temperature variation.
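The 26x figure can be sanity-checked from the channel parameters. The following Python sketch computes the peak bandwidth of a 64-bit DDR4-2400 channel (2400 million transfers per second, 8 bytes per transfer) and compares it against the 500 GB/s RBM bandwidth reported above; the function name is illustrative only.

```python
def channel_bandwidth_gbs(mega_transfers_per_sec, bus_width_bits):
    """Peak bandwidth in GB/s of a memory channel."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9


RBM_BANDWIDTH_GBS = 500.0                      # value reported in the text
ddr4_2400_gbs = channel_bandwidth_gbs(2400, 64)
ratio = RBM_BANDWIDTH_GBS / ddr4_2400_gbs

print(f"DDR4-2400 channel: {ddr4_2400_gbs:.1f} GB/s")
print(f"RBM / channel:     {ratio:.1f}x")
```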
Die Area Overhead. To evaluate the area overhead of adding isolation transistors, we use area values from prior work, which adds isolation transistors to disconnect bitlines from sense amplifiers . That work shows that adding an isolation transistor to every bitline incurs a total of 0.8% die area overhead in a 28nm DRAM process technology. Similar to prior work that adds isolation transistors to DRAM [57, 73], our LISA substrate also requires additional control logic outside the DRAM banks to control the isolation transistors, which incurs a small amount of area and is non-intrusive to the cell arrays.
3 Applications of LISA
We exploit LISA’s fast inter-subarray movement capability to enable many applications that can improve system performance and energy efficiency. In our HPCA 2016 paper , we implement and evaluate three applications of LISA, which significantly improve system performance in different ways.
3.1 Rapid Inter-Subarray Bulk Data Copying (LISA-RISC)
Due to the narrow memory channel width, bulk copy operations used by applications and operating systems are performance limiters in today’s systems [35, 37, 88, 110, 55]. These operations are commonly performed through the memcpy and memmove library functions. Recent work reported that these two functions consume 4-5% of all of Google’s data center cycles, making them an important target for lightweight hardware acceleration.
Our goal is to design a new mechanism that enables low-latency and energy-efficient memory copy between rows in different subarrays within the same bank. To this end, we propose a new in-DRAM copy mechanism that uses LISA to exploit the high-bandwidth links between subarrays. The key idea, step by step, is to: (1) activate a source row in a subarray; (2) rapidly transfer the data in the activated source row buffers to the destination subarray’s row buffers, through LISA’s RBM operation; and (3) activate the destination row, which enables the contents of the destination row buffers to be latched into the destination row. We call this inter-subarray row-to-row copy mechanism LISA-Rapid Inter-Subarray Copy (LISA-RISC).
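The three steps above can be sketched as a toy functional model in Python. This is only an illustration under simplifying assumptions, not the circuit-level mechanism: each subarray has a single row buffer holding a whole row, `None` represents the precharged state, multi-hop copies are modeled as a chain of RBM operations between adjacent subarrays, and all touched row buffers are precharged at the end. The names `Bank`, `rbm`, and `lisa_risc_copy` are hypothetical.

```python
class Bank:
    """Toy model of one DRAM bank: rows grouped into subarrays, one row buffer each."""

    def __init__(self, num_subarrays, rows_per_subarray):
        self.rows = [[""] * rows_per_subarray for _ in range(num_subarrays)]
        self.row_buffer = [None] * num_subarrays   # None = precharged

    def activate(self, sa, row):
        if self.row_buffer[sa] is None:
            # Normal activation: the row's data is latched into the row buffer.
            self.row_buffer[sa] = self.rows[sa][row]
        else:
            # The row buffer already holds data (delivered by RBM):
            # activation writes the buffer contents into the row.
            self.rows[sa][row] = self.row_buffer[sa]

    def rbm(self, src_sa, dst_sa):
        """Row buffer movement between two adjacent subarrays."""
        if abs(src_sa - dst_sa) != 1:
            raise ValueError("RBM links only neighboring subarrays")
        if self.row_buffer[src_sa] is None or self.row_buffer[dst_sa] is not None:
            raise ValueError("source must be activated, destination precharged")
        self.row_buffer[dst_sa] = self.row_buffer[src_sa]

    def precharge(self, sa):
        self.row_buffer[sa] = None


def lisa_risc_copy(bank, src, dst):
    """Copy row src=(subarray, row) to dst=(subarray, row); returns the hop count."""
    (src_sa, src_row), (dst_sa, dst_row) = src, dst
    if src_sa == dst_sa:
        raise ValueError("LISA-RISC targets rows in different subarrays")
    step = 1 if dst_sa > src_sa else -1
    bank.activate(src_sa, src_row)                 # step (1)
    for sa in range(src_sa, dst_sa, step):         # step (2): one RBM per hop
        bank.rbm(sa, sa + step)
    bank.activate(dst_sa, dst_row)                 # step (3)
    for sa in range(src_sa, dst_sa + step, step):  # assumed cleanup
        bank.precharge(sa)
    return abs(dst_sa - src_sa)


bank = Bank(num_subarrays=4, rows_per_subarray=2)
bank.rows[0][1] = "payload"
hops = lisa_risc_copy(bank, src=(0, 1), dst=(3, 0))
print(hops, bank.rows[3][0])
```

Note that no step in `lisa_risc_copy` touches a shared data bus; in the model, as in the text, data moves only between neighboring row buffers.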
3.1.1 DRAM Latency and Energy Consumption
Figure 4 shows the DRAM latency and DRAM energy consumption of memcpy (i.e., the baseline system), RowClone (state-of-the-art work), and LISA-RISC for copying a row of data (8KB). The exact latency and energy numbers are listed in Table 1. For LISA-RISC, we define the number of hops as the number of subarrays that the data must cross to move from the source subarray to the destination subarray. For example, if the source and destination subarrays are adjacent to each other, the number of hops is 1. The DRAM chips we evaluate have 16 subarrays per bank, so the maximum number of hops is 15.
| Copy Commands (8KB) | Latency (ns) | Energy (µJ) |
|---|---|---|
| memcpy (via mem. channel) | 1366.25 | 6.2 |
| RC-InterSA / Bank / IntraSA | 1363.75 / 701.25 / 83.75 | 4.33 / 2.08 / 0.06 |
| LISA-RISC (1 / 7 / 15 hops) | 148.5 / 196.5 / 260.5 | 0.09 / 0.12 / 0.17 |
We make two observations from these numbers. First, although inter-subarray RowClone (RC-InterSA) incurs similar latencies as memcpy, it consumes 1.43x less energy, as it does not transfer data over the channel and DRAM I/O for each copy operation. However, as we discuss in Section 4.1 of our HPCA 2016 paper , RC-InterSA incurs a higher system performance penalty because it is a blocking long-latency memory command. Second, copying between subarrays using LISA reduces the copy latency by 9x and copy energy by 48x compared to RowClone, even though the total latency of LISA-RISC grows linearly with the hop count. An additional benefit of using LISA-RISC is that its inter-subarray copy operations are performed completely inside a bank. As the internal DRAM data bus is untouched, other banks can concurrently serve memory requests, exploiting bank-level parallelism.
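The ratios quoted above, and the linear growth of LISA-RISC latency with hop count, can be checked against the tabulated numbers. The Python sketch below is a back-of-the-envelope model only: the per-hop latency increment is inferred from the 1-hop and 15-hop data points rather than taken from the circuit model, and the function and variable names are illustrative.

```python
# (latency in ns, energy as listed in the table) for an 8KB row copy.
MEMCPY = (1366.25, 6.2)
RC_INTER_SA = (1363.75, 4.33)
LISA_RISC = {1: (148.5, 0.09), 7: (196.5, 0.12), 15: (260.5, 0.17)}

# Assumption: latency grows by a constant amount per extra hop,
# inferred from the 1-hop and 15-hop entries.
PER_HOP_NS = (LISA_RISC[15][0] - LISA_RISC[1][0]) / (15 - 1)


def lisa_risc_latency_ns(hops):
    """Linear latency model: 1-hop latency plus a fixed cost per additional hop."""
    if not 1 <= hops <= 15:
        raise ValueError("hops must be between 1 and 15 for 16 subarrays per bank")
    return LISA_RISC[1][0] + PER_HOP_NS * (hops - 1)


energy_gain_rc_vs_memcpy = MEMCPY[1] / RC_INTER_SA[1]        # about 1.43x
latency_gain_vs_rc = RC_INTER_SA[0] / LISA_RISC[1][0]        # about 9x
energy_gain_vs_rc = RC_INTER_SA[1] / LISA_RISC[1][1]         # about 48x

print(f"per-hop increment: {PER_HOP_NS} ns")
print(f"7-hop latency (model): {lisa_risc_latency_ns(7)} ns")
print(f"RC-InterSA vs. memcpy energy: {energy_gain_rc_vs_memcpy:.2f}x")
print(f"LISA-RISC vs. RC-InterSA: {latency_gain_vs_rc:.1f}x latency, "
      f"{energy_gain_vs_rc:.1f}x energy")
```

The model reproduces the tabulated 7-hop latency exactly, which supports reading the latency as linear in the hop count. The tabulated energy values are rounded to two decimals, so the sketch does not attempt to fit them.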
We briefly summarize the system performance improvement due to LISA-RISC on a quad-core system. We evaluate our system using Ramulator [41, 83], an open-source cycle-accurate DRAM simulator, driven by traces generated from Pin . Our workload evaluation results show that LISA-RISC outperforms RowClone and memcpy: its average performance improvement and energy reduction over the best performing inter-subarray copy mechanism (i.e., memcpy) are 66.2% and 55.4%, respectively, on a quad-core system, across 50 workloads that perform bulk copies. We refer the reader to Section 9 of our HPCA 2016 paper  for detailed evaluation and analysis.
3.2 In-DRAM Caching Using Heterogeneous Subarrays (LISA-VILLA)
Our second application aims to reduce the DRAM access latency for frequently-accessed (hot) data. We propose to introduce heterogeneity within a bank by designing heterogeneous-latency subarrays. We call this heterogeneous DRAM design VarIabLe LAtency DRAM (VILLA-DRAM). To design a low-cost fast subarray, we take an approach similar to prior work, attaching fewer cells to each bitline to reduce the parasitic capacitance and resistance. This reduces the latency of the three fundamental DRAM operations (activation, precharge, and restoration) when accessing data in the fast subarrays [57, 67, 94]. Activation “opens” a row of DRAM cells to access stored data. Precharge “closes” an activated row. Restoration restores the charge level of each DRAM cell in a row to prevent data loss. Together, these three operations predominantly define the latency of a memory request [9, 45, 57, 56, 44, 43, 10, 22, 11, 54, 58, 55, 61, 60, 75, 8, 88, 89, 21, 41, 42, 39]. In this work, we focus on managing the fast subarrays in hardware, as doing so offers better adaptivity to dynamic changes in the hot data set.
In order to take advantage of VILLA-DRAM, we rely on LISA-RISC to rapidly copy rows across subarrays, which significantly reduces the caching latency. We call this synergistic design, which builds VILLA-DRAM using LISA, LISA-VILLA. Nonetheless, the cost of transferring data to a fast subarray is still non-negligible, especially if the fast subarray is far from the subarray where the data to be cached resides. Therefore, an intelligent cost-aware mechanism is required to make astute decisions on which data to cache and when.
3.2.1 Caching Policy for LISA-VILLA
We design a simple epoch-based caching policy to evaluate the benefits of caching a row in LISA-VILLA. Every epoch, we track the number of accesses to rows by using a set of 1024 saturating counters for each bank. (Footnote 2: The hardware cost of these counters is low, requiring only 6KB of storage in the memory controller; see Section 7.1 of our HPCA 2016 paper.) The counter values are halved every epoch to prevent staleness. At the end of an epoch, we mark the 16 most frequently-accessed rows as hot, and cache them when they are accessed the next time. For our cache replacement policy, we use the benefit-based caching policy proposed by Lee et al. Specifically, it uses a benefit counter for each row cached in the fast subarray: whenever a cached row is accessed, its counter is incremented. The row with the least benefit is replaced when a new row needs to be inserted. Note that a large body of work proposes various caching policies (e.g., [20, 23, 26, 34, 38, 66, 79, 87, 104, 91, 78, 100, 59, 106]), each of which can potentially be used with LISA-VILLA.
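A minimal Python sketch of this policy for a single bank is shown below. It follows the description above (1024 saturating counters, halving at each epoch boundary, 16 hot rows per epoch, benefit-based replacement), but several details are assumptions made purely for illustration: counters are keyed directly by row ID and new rows are simply not tracked once all counters are in use, the saturation value is 255, the fast-subarray capacity is a constructor parameter, and all class and method names are hypothetical.

```python
class HotRowTracker:
    """Per-bank access counters that are halved at every epoch boundary."""

    def __init__(self, num_counters=1024, num_hot=16, counter_max=255):
        self.num_counters = num_counters
        self.num_hot = num_hot
        self.counter_max = counter_max   # assumed saturation value
        self.counts = {}                 # row ID -> saturating access count

    def record(self, row):
        if row not in self.counts and len(self.counts) >= self.num_counters:
            return                       # assumption: no free counter, row untracked
        self.counts[row] = min(self.counts.get(row, 0) + 1, self.counter_max)

    def end_epoch(self):
        """Return this epoch's hot rows, then halve all counters."""
        ranked = sorted(self.counts, key=self.counts.get, reverse=True)
        hot = set(ranked[:self.num_hot])
        self.counts = {r: c // 2 for r, c in self.counts.items() if c // 2 > 0}
        return hot


class FastSubarrayCache:
    """Rows cached in the fast subarray, with one benefit counter per row."""

    def __init__(self, capacity_rows):
        self.capacity_rows = capacity_rows
        self.benefit = {}

    def lookup(self, row):
        if row in self.benefit:
            self.benefit[row] += 1       # a hit increases the row's benefit
            return True
        return False

    def insert(self, row):
        victim = None
        if len(self.benefit) >= self.capacity_rows:
            victim = min(self.benefit, key=self.benefit.get)  # least benefit
            del self.benefit[victim]
        self.benefit[row] = 0
        return victim


class LisaVillaPolicy:
    def __init__(self, capacity_rows):
        self.tracker = HotRowTracker()
        self.cache = FastSubarrayCache(capacity_rows)
        self.hot = set()

    def access(self, row):
        """Returns True on a fast-subarray hit."""
        self.tracker.record(row)
        hit = self.cache.lookup(row)
        if not hit and row in self.hot:
            self.cache.insert(row)       # in hardware: a LISA-RISC copy
            self.hot.discard(row)
        return hit

    def end_epoch(self):
        self.hot = self.tracker.end_epoch()


policy = LisaVillaPolicy(capacity_rows=2)
for r in [7, 7, 7, 3, 3, 9]:
    policy.access(r)
policy.end_epoch()
first = policy.access(7)    # miss: row 7 is hot, so it is cached now
second = policy.access(7)   # hit in the fast subarray
print(first, second)
```

The sketch caches a hot row on its next access rather than eagerly at the epoch boundary, mirroring the text; a real memory controller would additionally account for the copy cost, which this model ignores.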
Figure 5 shows the system performance improvement of LISA-VILLA over a baseline without any fast subarrays in a four-core system. It also shows the hit rate in VILLA-DRAM, i.e., the fraction of accesses that hit in the fast subarrays. We make two main observations. First, by exploiting LISA-RISC to quickly cache data in VILLA-DRAM, LISA-VILLA improves system performance for a wide variety of workloads, by up to 16.1%, with a geometric mean of 5.1%. This is mainly due to reduced DRAM latency of accesses that hit in the fast subarrays. The performance improvement heavily correlates with the VILLA cache hit rate. Second, the VILLA-DRAM design, which consists of heterogeneous subarrays, is not practical without LISA. Figure 5 shows that using RC-InterSA (i.e., RowClone copying data across subarrays) to move data into the cache reduces performance by 52.3% due to slow data movement, which overshadows the benefits of caching. The results indicate that LISA is an important substrate to enable not only fast bulk data copy, but also a fast in-DRAM caching scheme.
3.3 Fast Precharge Using Linked Precharge Units (LISA-LIP)
Our third application aims to accelerate the process of precharge. The precharge time for a subarray is determined by the drive strength of the precharge unit (i.e., the circuitry in a subarray’s row buffer that precharges the connected subarray). We observe that in modern DRAM, while a subarray is being precharged, the precharge units (PUs) of other subarrays remain idle.
We propose to exploit these idle PUs to accelerate a precharge operation by connecting them to the subarray that is being precharged. Our mechanism, LISA-LInked Precharge (LISA-LIP), precharges a subarray using two sets of PUs: one from the row buffer that is being precharged, and a second set from a neighboring subarray’s row buffer (which is already in the precharged state), by enabling the links between the two subarrays.
To evaluate the accelerated precharge process, we use the same DRAM circuit model described in Section 2 and simulate the linked precharge operation in SPICE. Our SPICE simulation reports that LISA-LIP significantly reduces the precharge latency by 2.6x compared to the baseline (5ns vs. 13ns). Our system evaluation shows that LISA-LIP improves performance by 10.3% on average, across 50 four-core workloads. We refer the reader to Section 6 of our HPCA 2016 paper  for a detailed analysis of LISA-LIP.
3.4 Evaluation: Putting Everything Together
As all three proposed applications are complementary to each other, we evaluate the effect of putting them together on a four-core system. Figure 6 shows the system performance improvement of adding LISA-VILLA to LISA-RISC, as well as combining all three optimizations, compared to our baseline using memcpy and standard DDR3-1600 memory across 50 workloads. We refer the reader to our full paper for the detailed configuration and workloads. We draw several key conclusions. First, the performance benefits from each scheme are additive. On average, adding LISA-VILLA improves performance by 16.5% over LISA-RISC alone, and adding LISA-LIP further provides an 8.8% gain over LISA-(RISC+VILLA). Second, although LISA-RISC alone provides a majority of the performance improvement over the baseline (59.6% on average), the use of both LISA-VILLA and LISA-LIP further improves performance, resulting in an average performance gain of 94.8% and memory energy reduction (not plotted) of 49.0%. Taken together, these results indicate that LISA is an effective substrate that enables a wide range of high-performance and energy-efficient applications in the DRAM system.
We conclude that LISA is an effective substrate that can greatly improve system performance and reduce system energy consumption by synergistically enabling multiple different applications. Our HPCA 2016 paper  provides many more experimental results and analyses confirming this finding.
4 Related Work
To our knowledge, this is the first work to propose a DRAM substrate that supports fast data movement between subarrays in the same bank, which enables a wide variety of applications for DRAM systems. We now discuss prior works that focus on each of the optimizations that LISA enables.
4.1 Bulk Data Transfer Mechanisms
Prior works [16, 17, 36, 7, 108] propose to add scratchpad memories to reduce CPU pressure during bulk data transfers, which can also enable sophisticated data movement (e.g., scatter-gather ), but they still require data to first be moved on-chip. A patent proposes a DRAM design that can copy a page across memory blocks , but lacks concrete analysis and evaluation of the underlying copy operations. Intel I/O Acceleration Technology  allows for memory-to-memory DMA transfers across a network, but cannot transfer data within main memory.
Zhao et al.  propose to add a bulk data movement engine inside the memory controller to speed up bulk-copy operations. Jiang et al.  design a different copy engine, placed within the cache controller, to alleviate pipeline and cache stalls that occur when these transfers take place. However, these works do not directly address the problem of data movement across the narrow memory channel.
A concurrent work by Lu et al. proposes a heterogeneous DRAM design similar to VILLA-DRAM, called DAS-DRAM, but with a very different data movement mechanism from LISA. It introduces a row of migration cells into each subarray to move rows across subarrays. Unfortunately, the latency of DAS-DRAM is not scalable with movement distance, because it requires writing the migrating row into each intermediate subarray’s migration cells before the row reaches its destination, which prolongs data transfer latency. In contrast, LISA provides a direct path to transfer data between the row buffers of adjacent subarrays without requiring intermediate data writes into any subarray.
4.2 Cached DRAM
Several prior works (e.g., [20, 23, 26, 38, 109]) propose to add a small SRAM cache to a DRAM chip to lower the access latency for data that is kept in the SRAM cache (e.g., frequently or recently used data). There are two main disadvantages of these works. First, adding an SRAM cache into a DRAM chip is very intrusive: it incurs a high area overhead (38.8% for 64KB in a 2Gb DRAM chip) and design complexity [57, 45]. Second, transferring data from DRAM to SRAM uses a narrow global data bus, internal to the DRAM chip, which is typically 64 bits wide. Thus, installing data into the DRAM cache incurs high latency. Compared to these works, our LISA-VILLA design enables low latency without significant area overhead or complexity.
4.3 Heterogeneous-Latency DRAM
Prior works propose DRAM architectures that provide heterogeneous latency either spatially (dependent on where in the memory an access targets) or temporally (dependent on when an access occurs).
Spatial Heterogeneity. Prior work introduces spatial heterogeneity into DRAM, where one region has a fast access latency but fewer DRAM rows, while the other has a slower access latency but many more rows [57, 94]. Recent works show that latency heterogeneity inherent in DRAM chips due to process or design-induced variation can also naturally enable such heterogeneous-latency substrates [9, 54]. The fast region in DRAM can be utilized as a caching area, for the frequently or recently accessed data. We briefly describe two state-of-the-art works that offer different heterogeneous-latency DRAM designs.
CHARM  introduces heterogeneity within a rank by designing a few fast banks with (1) shorter bitlines for faster data sensing, and (2) closer placement to the chip I/O for faster data transfers. To exploit these low-latency banks, CHARM uses an OS-managed mechanism to statically map hot data to these banks, based on profiled information from the compiler or programmers. Unfortunately, this approach cannot adapt to program phase changes, limiting its performance gains. If it were to adopt dynamic hot data management, CHARM would incur high migration costs over the narrow 64-bit bus that internally connects the fast and slow banks.
TL-DRAM provides heterogeneity within a subarray by dividing it into fast (near) and slow (far) segments that have short and long bitlines, respectively, using isolation transistors. The fast segment can be managed as an OS-transparent hardware cache. The main disadvantage is that it needs to cache each hot row in two near segments, as each subarray uses two row buffers on opposite ends to sense data in the open-bitline architecture (as discussed in our HPCA 2016 paper). This prevents TL-DRAM from using the full near segment capacity. As we can see, neither CHARM nor TL-DRAM strikes a good design balance for heterogeneous-latency DRAM. Our proposal, LISA-VILLA, is a new heterogeneous DRAM design that offers fast data movement with a low-cost and easy-to-implement design.
Temporal Heterogeneity. Prior work observes that DRAM latency can vary depending on when an access occurs. The key observation is that a recently-accessed or refreshed row has nearly full electrical charge in the cells, and thus the following access to the same row can be performed faster [22, 21, 92]. We briefly describe two state-of-the-art works that focus on providing heterogeneous latency temporally.
ChargeCache  enables faster access to recently-accessed rows in DRAM by tracking the addresses of recently-accessed rows. NUAT  enables accesses to recently-refreshed rows at low latency because these rows are already highly-charged. In contrast to ChargeCache and NUAT, LISA does not require data to be recently-accessed/refreshed in order to reduce DRAM latency. Adaptive-Latency DRAM (AL-DRAM)  adapts the DRAM latency of each DRAM module to temperature, observing that each module can be operated faster at lower temperatures. LISA is orthogonal to AL-DRAM. The ideas of LISA can be employed in conjunction with works that exploit the temporal heterogeneity of DRAM latency.
4.4 Other Latency Reduction Mechanisms
Many prior works propose memory scheduling techniques, which generally reduce latency to access DRAM [43, 44, 96, 97, 72, 71, 102, 3, 68, 13, 98, 51, 53, 29, 69, 15, 52]. Other works propose mechanisms to perform in-memory computation to reduce data movement and access latency [25, 89, 5, 24, 40, 6, 88, 46, 76, 18, 2, 1, 107, 77, 62, 95]. LISA is complementary to these works, and it can work synergistically with in-memory computation mechanisms by enabling fast aggregation of data.
Our HPCA 2016 paper  proposes a new DRAM substrate that significantly improves the performance and efficiency of bulk data movement in modern systems. In this section, we briefly discuss the expected future impact of our work, and discuss several research directions that our work motivates.
5.1 Potential Industry Impact
We believe that our LISA substrate can have a large impact on mobile systems as well as data centers that consume a significant amount of cycle time performing bulk data movement. A recent study  by Google reports that memcpy() and memmove() library functions alone represent 4-5% of their data center cycles even though Google has a significant workload diversity running within their data centers. Another recent study shows that 62.7% of system energy is spent on data movement on consumer devices (e.g., smartphones, wearable devices, web-based computers such as Chromebooks) . In this work, we demonstrate that one potential application of using the LISA substrate is to accelerate memcpy() and memmove(), as discussed in Section 3.1. Our detailed DRAM circuit model reports that LISA reduces the latency and DRAM energy of these functions by 9x and 69x compared to today’s systems, respectively. Hence, we expect LISA can improve the efficiency and performance of both mobile and data center systems.
5.2 Future Research Directions
This work opens up several avenues for future research. In this section, we describe several directions that can enable researchers to tackle other problems related to memory systems based on the LISA substrate.
Reducing Subarray Conflicts via Remapping. When two memory requests access two different rows in the same bank, they have to be served serially, even if they are to different subarrays. To mitigate such bank conflicts, Kim et al.  propose subarray-level parallelism (SALP), which enables multiple subarrays to remain activated at the same time. However, if two accesses are to the same subarray, they still have to be served serially. This problem is exacerbated when frequently-accessed rows reside in the same subarray. To help alleviate such subarray conflicts, LISA can enable a simple mechanism that efficiently remaps or moves the conflicting rows to different subarrays by exploiting fast RBM operations.
Enabling LISA to Perform 1-to-N Memory Copy or Move Operations. A typical memcpy or memmove call only allows the data to be copied from one source location to one destination location. To copy or move data from one source location to multiple different destinations, repeated calls are required. The problem is that such repeated calls incur long latency and high bandwidth consumption. One potential application that can be enabled by LISA is performing memcpy or memmove from one source location to multiple destinations completely in DRAM without requiring multiple calls of these operations.
By using LISA, we observe that moving data from the source subarray to the destination subarray latches the source row’s data in all the intermediate subarrays’ row buffers. As a result, activating these intermediate subarrays would copy their row buffers’ data into the specified row within these subarrays. By extending LISA to perform multi-point (1-to-N) copy or move operations, we can significantly increase system performance of several commonly-used system operations. For example, forking multiple child processes can utilize 1-to-N copy operations to efficiently copy memory pages from the parent’s address space to all the children. As another example, LISA can extend the range of in-DRAM bulk bitwise operations [89, 85]. Thus, LISA can efficiently enable architectural support for a new, useful system and programming primitive: 1-to-N bulk memory copy/movement.
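The observation above can be illustrated with a toy functional model in Python. This sketch is a hypothetical rendering of a proposed future direction, not an evaluated mechanism: it assumes one row buffer per subarray, models row buffer movement as a copy between neighboring buffers, and restricts all destinations to lie on one side of the source so that a single chain of RBM operations passes through every destination subarray. The function name and data layout are illustrative.

```python
def one_to_n_copy(subarrays, src_sa, src_row, dests):
    """Copy one source row into one row of each destination subarray.

    subarrays: list of subarrays, each a list of row contents.
    dests:     dict mapping destination subarray index -> destination row index.
    Returns the number of RBM hops performed.
    """
    if not dests or src_sa in dests:
        raise ValueError("need at least one destination outside the source subarray")
    lo, hi = min(dests), max(dests)
    if lo < src_sa < hi:
        # Assumption of this sketch: one RBM chain in a single direction.
        raise ValueError("destinations must lie on one side of the source")
    farthest = hi if hi > src_sa else lo
    step = 1 if farthest > src_sa else -1

    # Activate the source row: its data is latched in the source row buffer.
    row_buffers = {src_sa: subarrays[src_sa][src_row]}

    # Chain of RBM operations: every intermediate row buffer latches the data.
    hops = 0
    for sa in range(src_sa, farthest, step):
        row_buffers[sa + step] = row_buffers[sa]
        hops += 1

    # Activating a destination row writes that subarray's row buffer into it.
    for sa, row in dests.items():
        subarrays[sa][row] = row_buffers[sa]
    return hops


subarrays = [["page"], [None], [None], [None]]
hops = one_to_n_copy(subarrays, src_sa=0, src_row=0, dests={1: 0, 3: 0})
print(hops, subarrays)
```

In this model, copying to subarrays 1 and 3 costs the same 3 hops as a single copy to subarray 3, whereas two separate row-to-row copies would traverse 1 + 3 = 4 hops and require two source activations.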
In-Memory Computation with LISA. One important requirement of efficient in-memory computation is being able to move data from its stored location to the computation units with very low latency and energy. We believe using the LISA substrate can enable a new in-memory computation framework. The idea is to add a small computation unit inside each or a subset of banks, and connect these computation units to the neighboring subarrays which store the data. Doing so allows the system to utilize LISA to move bulk data from the subarrays to the computation units with low latency and low area overhead.
Extending LISA to Non-Volatile Memory. In this work, we focus only on DRAM technology. Non-volatile memory (NVM) is a class of emerging memory technologies that can retain data without a power supply. We believe that the LISA substrate can be extended to NVM (e.g., PCM [48, 50, 81, 105, 49, 80, 104] and STT-MRAM [47, 19, 12]) since the memory organization of NVM mostly resembles that of DRAM. A potential application of LISA in NVM is an efficient file copy operation that does not incur costly I/O data transfer. We believe LISA can provide further benefits when main memory becomes persistent.
We present a new DRAM substrate, low-cost inter-linked subarrays (LISA), that expedites bulk data movement across subarrays in DRAM. LISA achieves this by creating a new high-bandwidth datapath at low cost between subarrays, via the insertion of a small number of isolation transistors. We describe and evaluate three applications that are enabled by LISA. First, LISA significantly reduces the latency and memory energy consumption of bulk copy operations between subarrays over state-of-the-art mechanisms . Second, LISA enables an effective in-DRAM caching scheme on a new heterogeneous DRAM organization, which uses fast subarrays for caching hot data in every bank. Third, we reduce precharge latency by connecting two precharge units of adjacent subarrays together using LISA. We experimentally show that the three applications of LISA greatly improve system performance and memory energy efficiency when used individually or together, across a variety of workloads and system configurations.
We conclude that LISA is an effective substrate that enables several useful applications. We believe that this substrate, which enables low-cost interconnections between DRAM subarrays, can pave the way for other applications that can further improve system performance and energy efficiency through fast data movement in DRAM. We greatly encourage future work to 1) investigate new applications and benefits of LISA, and 2) develop new low-cost interconnection substrates within a DRAM chip to improve internal connectivity and data transfer ability.
We thank the anonymous reviewers and SAFARI group members for their helpful feedback. We acknowledge the support of Google, Intel, NVIDIA, Samsung, and VMware. This research was supported in part by the ISTC-CC, SRC, CFAR, and NSF (grants 1212962, 1319587, and 1320531). Kevin Chang was supported in part by the SRCEA/Intel Fellowship.
-  J. Ahn et al., “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing,” in ISCA, 2015.
-  J. Ahn et al., “PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture,” in ISCA, 2015.
-  R. Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” in ISCA, 2012.
-  S. Blagodurov et al., “A Case for NUMA-Aware Contention Management on Multicore Systems,” in USENIX ATC, 2011.
-  A. Boroumand et al., “LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory,” CAL, 2016.
-  A. Boroumand et al., “Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks,” in ASPLOS, 2018.
-  J. Carter et al., “Impulse: Building a Smarter Memory Controller,” in HPCA, 1999.
-  K. K. Chang et al., “Improving DRAM Performance by Parallelizing Refreshes with Accesses,” in HPCA, 2014.
-  K. K. Chang et al., “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS, 2016.
-  K. K. Chang et al., “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM,” in HPCA, 2016.
-  K. K. Chang et al., “Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMETRICS, 2017.
-  M. T. Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM,” in HPCA, 2013.
-  E. Ebrahimi et al., “Parallel Application Memory Scheduling,” in MICRO, 2011.
-  S. Eyerman and L. Eeckhout, “System-Level Performance Metrics for Multiprogram Workloads,” IEEE Micro, 2008.
-  S. Ghose et al., “Improving Memory Scheduling via Processor-Side Load Criticality Information,” in ISCA, 2013.
-  M. Gschwind, “Chip Multiprocessing and the Cell Broadband Engine,” in CF, 2006.
-  J. Gummaraju et al., “Architectural Support for the Stream Execution Model on General-Purpose Processors,” in PACT, 2007.
-  Q. Guo et al., “3D-Stacked Memory-Side Acceleration: Accelerator and System Design,” in WONDP, 2014.
-  X. Guo et al., “Resistive Computation: Avoiding the Power Wall with Low-Leakage, STT-MRAM Based Computing,” in ISCA, 2010.
-  C. A. Hart, “CDRAM in a Unified Memory Architecture,” in Intl. Computer Conference, 1994.
-  H. Hassan et al., “SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
-  H. Hassan et al., “ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality,” in HPCA, 2016.
-  H. Hidaka et al., “The Cache DRAM Architecture,” IEEE Micro, 1990.
-  K. Hsieh et al., “Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems,” in ISCA, 2016.
-  K. Hsieh et al., “Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation,” in ICCD, 2016.
-  W.-C. Hsu and J. E. Smith, “Performance of Cached DRAM Organizations in Vector Supercomputers,” in ISCA, 1993.
-  Intel Corp., “Intel® I/O Acceleration Technology,” http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html.
-  Intel Corp., “Intel 64 and IA-32 Architectures Optimization Reference Manual,” 2012.
-  E. Ipek et al., “Self-Optimizing Memory Controllers: A Reinforcement Learning Approach,” in ISCA, 2008.
-  ITRS, http://www.itrs.net/ITRS1999-2014Mtgs,Presentations&Links/2013ITRS/2013Tables/FEP_2013Tables.xlsx, 2013.
-  ITRS, http://www.itrs.net/ITRS1999-2014Mtgs,Presentations&Links/2013ITRS/2013Tables/Interconnect_2013Tables.xlsx, 2013.
-  JEDEC, “DDR3 SDRAM Standard,” 2010.
-  JEDEC, “DDR4 SDRAM Standard,” 2012.
-  X. Jiang et al., “CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms,” in HPCA, 2010.
-  X. Jiang et al., “Architecture Support for Improving Bulk Memory Copying and Initialization Performance,” in PACT, 2009.
-  J. A. Kahle et al., “Introduction to the Cell Multiprocessor,” IBM JRD, 2005.
-  S. Kanev et al., “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
-  G. Kedem and R. P. Koganti, “WCDRAM: A Fully Associative Integrated Cached-DRAM with Wide Cache Lines,” Duke Univ. Dept. of Computer Science, Tech. Rep. CS-1997-03, 1997.
-  J. S. Kim et al., “The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
-  J. S. Kim et al., “GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
-  Y. Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2015.
-  Y. Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014.
-  Y. Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” in HPCA, 2010.
-  Y. Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO, 2010.
-  Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
-  P. M. Kogge, “EXECUBE-A New Architecture for Scaleable MPPs,” in ICPP, 1994.
-  E. Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” in ISPASS, 2013.
-  B. C. Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” in ISCA, 2009.
-  B. C. Lee et al., “Phase Change Memory Architecture and the Quest for Scalability,” CACM, vol. 53, no. 7, pp. 99–106, 2010.
-  B. C. Lee et al., “Phase-Change Technology and the Future of Main Memory,” IEEE Micro, vol. 30, no. 1, pp. 143–143, 2010.
-  C. J. Lee et al., “Prefetch-Aware DRAM Controllers,” in MICRO, 2008.
-  C. J. Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep. TR-HPS-2010-002, 2010.
-  C. J. Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” in MICRO, 2009.
-  D. Lee et al., “Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIGMETRICS, 2017.
-  D. Lee et al., “Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM,” in PACT, 2015.
-  D. Lee et al., “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” in HPCA, 2015.
-  D. Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
-  D. Lee et al., “Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface,” TACO, 2016.
-  Y. Li et al., “Utility-Based Hybrid Memory Management,” in CLUSTER, 2017.
-  J. Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA, 2013.
-  J. Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
-  Z. Liu et al., “Concurrent Data Structures for Near-Memory Computing,” in SPAA, 2017.
-  S.-L. Lu et al., “Improving DRAM Latency with Dynamic Asymmetric Subarray,” in MICRO, 2015.
-  C.-K. Luk et al., “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” in PLDI, 2005.
-  J. Meza et al., “A Case for Efficient Hardware/Software Cooperative Management of Storage and Memory,” in WEED, 2013.
-  J. Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” CAL, 2012.
-  Micron Technology, Inc., “576Mb: x18, x36 RLDRAM3,” 2011.
-  T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-core Systems,” in USENIX Security, 2007.
-  J. Mukundan and J. F. Martinez, “MORSE: Multi-objective Reconfigurable Self-Optimizing Memory Scheduler,” in HPCA, 2012.
-  O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” IMW, 2013.
-  O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” in MICRO, 2007.
-  O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
-  S. O et al., “Row-Buffer Decoupling: A Case for Low-Latency DRAM Microarchitecture,” in ISCA, 2014.
-  J. K. Ousterhout, “Why Aren’t Operating Systems Getting Faster as Fast as Hardware?” in USENIX Summer Conf., 1990.
-  M. Patel et al., “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
-  D. Patterson et al., “A Case for Intelligent RAM,” IEEE Micro, 1997.
-  A. Pattnaik et al., “Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities,” in PACT, 2016.
-  M. Qureshi et al., “A Case for MLP-Aware Cache Replacement,” in ISCA, 2006.
-  M. K. Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” in ISCA, 2007.
-  M. K. Qureshi et al., “Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap Wear Leveling,” in MICRO, 2009.
-  M. K. Qureshi et al., “Scalable High Performance Main Memory System Using Phase-change Memory Technology,” in ISCA, 2009.
-  M. Rosenblum et al., “The Impact of Architectural Trends on Operating System Performance,” in SOSP, 1995.
-  SAFARI Research Group, “Ramulator – GitHub Repository,” https://github.com/CMU-SAFARI/ramulator.
-  S.-Y. Seo, “Methods of Copying a Page in a Memory Device and Methods of Managing Pages in a Memory System,” U.S. Patent Application 20140185395, 2014.
-  V. Seshadri et al., “Fast Bulk Bitwise AND and OR in DRAM,” CAL, 2015.
-  V. Seshadri et al., “Page Overlays: An Enhanced Virtual Memory Framework to Enable Fine-Grained Memory Management,” in ISCA, 2015.
-  V. Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,” in PACT, 2012.
-  V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization,” in MICRO, 2013.
-  V. Seshadri et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
-  V. Seshadri et al., “Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
-  V. Seshadri et al., “Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks,” TACO, vol. 11, no. 4, pp. 51:1–51:22, 2015.
-  W. Shin et al., “NUAT: A Non-Uniform Access Time Memory Controller,” in HPCA, 2014.
-  A. Snavely and D. Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” in ASPLOS, 2000.
-  Y. H. Son et al., “Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations,” in ISCA, 2013.
-  H. S. Stone, “A Logic-in-Memory Computer,” IEEE TC, 1970.
-  L. Subramanian et al., “BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling,” IEEE TPDS, 2016.
-  L. Subramanian et al., “The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost,” in ICCD, 2014.
-  L. Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” in HPCA, 2013.
-  K. Sudan et al., “Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement,” in ASPLOS, 2010.
-  G. Tyson et al., “A Modified Approach to Data Cache Management,” in MICRO, 1995.
-  A. N. Udipi et al., “Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores,” in ISCA, 2010.
-  H. Usui et al., “DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators,” TACO, vol. 12, no. 4, pp. 65:1–65:28, 2016.
-  S. Wong et al., “A Hardware Cache memcpy Accelerator,” in FPT, 2006.
-  H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” in ICCD, 2012.
-  H. Yoon et al., “Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories,” TACO, vol. 11, no. 4, pp. 40:1–40:25, 2014.
-  X. Yu et al., “Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation,” in MICRO, 2017.
-  D. Zhang et al., “TOP-PIM: Throughput-Oriented Programmable Processing in Memory,” in HPDC, 2014.
-  L. Zhang et al., “The Impulse Memory Controller,” IEEE TC, vol. 50, no. 11, pp. 1117–1132, 2001.
-  Z. Zhang et al., “Cached DRAM for ILP Processor Memory Access Latency Reduction,” IEEE Micro, vol. 21, no. 4, Jul. 2001.
-  L. Zhao et al., “Hardware Support for Bulk Data Movement in Server Platforms,” in ICCD, 2005.