A Memory Controller with Row Buffer Locality Awareness for Hybrid Memory Systems

04/30/2018 · by HanBin Yoon, et al.

Non-volatile memory (NVM) is a class of promising scalable memory technologies that can potentially offer higher capacity than DRAM at the same cost point. Unfortunately, the access latency and energy of NVM are often higher than those of DRAM, while the endurance of NVM is lower. Many DRAM-NVM hybrid memory systems use DRAM as a cache to NVM, to achieve the low access latency, low energy, and high endurance of DRAM, while taking advantage of the large capacity of NVM. A key question for a hybrid memory system is what data to cache in DRAM to best exploit the advantages of each technology while avoiding the disadvantages of each as much as possible. We propose a new memory controller design that improves hybrid memory performance and energy efficiency. We observe that both DRAM and NVM banks employ row buffers that act as a cache for the most recently accessed memory row. Accesses that are row buffer hits incur similar latencies (and energy consumption) in both DRAM and NVM, whereas accesses that are row buffer misses incur longer latencies (and higher energy consumption) in NVM than in DRAM. To exploit this, we devise a policy that caches in DRAM heavily-reused data that frequently misses in the NVM row buffers. Our policy tracks the row buffer miss counts of recently-used rows in NVM, and caches in DRAM the rows that are predicted to incur frequent row buffer misses. Our proposed policy also takes into account the high write latencies of NVM, in addition to row buffer locality, and preferentially places write-intensive pages in DRAM instead of NVM.


1 Introduction

Multiprogrammed and multithreaded workloads on chip multiprocessors require large amounts of main memory to support the working sets of many concurrently-executing threads. The demand for memory is increasing rapidly, as the number of cores or accelerators (collectively called agents) on a chip continues to increase and data-intensive applications become more widespread [35, 109, 91, 87]. Dynamic Random Access Memory (DRAM) is used to compose main memory in modern computers. Though strides in DRAM manufacturing process technology have enabled DRAM to scale to smaller feature sizes, and, thus, higher densities (capacity per unit area), it is predicted that DRAM density scaling will result in higher costs and lower reliability as the process technology feature size continues to decrease [117, 99, 75, 19, 45, 87, 64, 91, 50, 88]. Satisfying increasingly higher memory demands with DRAM alone will soon become prohibitive in terms of both cost and energy. (We refer the reader to our prior works [51, 52, 49, 64, 63, 40, 67, 18, 20, 107, 17, 61, 105, 68, 19, 39, 53, 65, 62, 50, 93, 48] for a detailed background on DRAM.)

1.1 Non-Volatile Memory

Emerging non-volatile memory (NVM) technologies such as phase-change memory (PCM) [55, 57, 123, 56, 97, 126, 79], spin-transfer torque magnetic RAM (STT-MRAM) [54, 38, 21, 92], resistive RAM (ReRAM) [24, 110, 70], and 3D XPoint [81], have shown promise for future main memory system designs that can meet the increasing memory capacity demands of data-intensive workloads. With projected scaling trends, NVM cells can be manufactured more easily at smaller feature sizes than DRAM cells, achieving high density and capacity [55, 57, 54, 70, 99, 25, 26, 119, 97, 131, 56, 123, 38, 21, 92, 126, 79]. This is due to two reasons: (1) while a DRAM cell stores data in the form of charge, an NVM cell represents data using resistance values, a mechanism that is expected to scale to smaller feature sizes; and (2) unlike DRAM, several NVM devices use multi-level cell technology, which stores more than one bit of data per memory cell.

For example, PCM is a non-volatile memory technology that stores data by varying the electrical resistance of a material known as chalcogenide [123, 99, 55]. A PCM memory cell is programmed by applying heat (via electrical current) to the chalcogenide and then cooling it at different rates, depending on the data to be stored. Rapid quenching places the chalcogenide into an amorphous state which has high resistance, representing the bit value of ‘0’ in single-level cell PCM, and slow cooling places the chalcogenide into a crystalline state which has low resistance, representing the bit value of ‘1’ in single-level cell PCM. Multi-level cell PCM can store multiple bits of data by providing more than two distinguishable resistance levels for each cell, very similar to the MLC NAND flash technology that is prevalent in modern storage systems [6, 7, 15, 10, 11, 16, 73, 14, 12, 13, 9, 8, 126, 96].

However, NVM has a number of disadvantages. Compared to DRAM, NVM typically has a longer access latency, higher write energy, and lower endurance [55, 97]. For example, the long cooling duration required to crystallize chalcogenide leads to high PCM write latency, and PCM also incurs higher read (sensing) latency, read energy, and write energy than DRAM [77]. Furthermore, the repeated thermal expansions and contractions of a PCM cell during programming lead to finite write endurance, which is estimated at approximately 10⁸ writes, an issue not present in DRAM [55].

1.2 Hybrid Memory Systems

Hybrid memory systems [97, 29, 100, 129, 76, 66, 4, 95, 34, 30, 1, 94, 5, 22] aim to combine the strengths of DRAM and emerging memory technologies (e.g., NVM, reduced-latency DRAM [80, 104, 64], reduced-reliability DRAM [97, 74]). Many previous DRAM-NVM hybrid memory system designs employ DRAM as a small cache [97] or write buffer [29, 129] in front of a large-capacity NVM. In this work, we utilize PCM to provide increased overall memory capacity (which leads to fewer page faults in the system), while the DRAM cache serves a large portion of the memory requests at low latency and low energy with high endurance. The combined effect increases overall system performance and energy efficiency [97]. A key question in the design of a DRAM-PCM hybrid memory system is how to place data between DRAM and PCM to best exploit the strengths of each technology while avoiding their weaknesses as much as possible.

1.3 Memory Device Architecture

In our ICCD 2012 paper [125], we develop new mechanisms for deciding how data should be placed in a DRAM-PCM hybrid memory system. Our main observation is that both DRAM and PCM devices consist of banks that employ row buffer circuitry. The organization of a memory bank is illustrated in Figure 1. Cells (memory elements) are typically laid out in arrays of rows (cells sharing a common wordline) and columns (cells sharing a common bitline). Accesses to the array occur at the granularity of a row. To read from the array, a wordline is first asserted to select a row of cells. Then, through the bitlines, the contents of the selected cells are detected by sense amplifiers (labeled S/A in the figure) and latched by peripheral circuitry known as the row buffer.

Figure 1: Memory cells organized in a 2D array of rows and columns. Reproduced from [125].

Once the contents of a row are latched in the row buffer, subsequent memory requests to that row are served promptly from the row buffer, without having to bear the delay of accessing the array. Such memory accesses are called row buffer hits. However, if a row different from the one latched in the row buffer is requested, then the newly requested row is read from the array into the row buffer (replacing the row buffer’s previous contents). Such a memory access incurs the high latency and energy of activating the array, and is called a row buffer miss. Row buffer locality (RBL) refers to the repeated reference to a row while its contents are in the row buffer. Memory requests to data with high row buffer locality are served efficiently (at low latency and energy) without having to frequently re-activate the memory cell array.
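To make this latency asymmetry concrete, the sketch below models a bank's access latency under an open-row policy. It is a minimal illustration: the timing values are hypothetical placeholders chosen only to show that hit latency can be similar across technologies while miss (activation) latency differs greatly; they are not parameters from our evaluation.

```cpp
#include <cstdint>
#include <cstdio>

// A minimal open-row latency model (illustrative only).
struct BankTiming {
    uint32_t buffer_cycles;    // latency to access the row buffer
    uint32_t activate_cycles;  // latency to activate a row in the cell array
};

uint32_t access_latency(const BankTiming& t, uint64_t open_row, uint64_t req_row) {
    if (req_row == open_row)
        return t.buffer_cycles;                  // row buffer hit
    return t.activate_cycles + t.buffer_cycles;  // row buffer miss: activate first
}

int main() {
    BankTiming dram{10, 40};   // hypothetical DRAM bank timings
    BankTiming pcm{10, 150};   // hypothetical PCM timings: same hit cost, slower array
    std::printf("hit : DRAM=%u PCM=%u\n",
                access_latency(dram, 3, 3), access_latency(pcm, 3, 3));
    std::printf("miss: DRAM=%u PCM=%u\n",
                access_latency(dram, 3, 4), access_latency(pcm, 3, 4));
    return 0;
}
```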

2 Row Buffer Locality-Aware Caching Policy

Figure 2: DRAM-PCM hybrid memory system organization. Reproduced from [125].
Figure 3: Conceptual example showing the importance of row buffer locality-awareness in hybrid memory data placement decisions. Reproduced from [125].

Our ICCD 2012 paper [125] proposes Row Buffer Locality-Aware (RBLA) caching policies, which a hybrid memory controller can use to guide data placement. RBLA can be used in any hybrid memory system where each underlying memory technology consists of banks with row buffers. We study an example hybrid memory system that consists of a small DRAM cache backed by a large amount of PCM [97, 72, 76, 66], whose organization is shown in Figure 2. Our main observation is that memory requests that hit in the row buffer incur similar latencies and energy consumption in both DRAM and PCM [55, 57], whereas requests that miss in the row buffer incur higher latency and energy in PCM than in DRAM. As a result, placing data that mostly leads to row buffer hits (i.e., data that has high row buffer locality) in DRAM provides little benefit over placing the same data in PCM. On the other hand, placing heavily reused data that leads to frequent row buffer misses (i.e., data that has low row buffer locality) in DRAM avoids the high latency and energy of PCM array accesses.

This observation is illustrated in the example in Figure 3, which shows the service timelines for memory requests to rows A–D. Prior hybrid memory and cache management proposals seek to improve the reuse of data placed in the cache and reduce the access bandwidth of the next level of memory (e.g., [98, 43]). We call this approach to cache management conventional mapping. Conventional mapping (top half of Figure 3) can place rows A and B (which have low row buffer locality) both in PCM, causing the high PCM array latency to become a bottleneck. In contrast, row buffer locality-aware mapping (bottom half of Figure 3) places rows A and B in DRAM such that they can benefit from DRAM’s lower array latency, leading to faster overall memory service. (Even though the figure shows some requests being served in parallel, if the individual requests arrived in the same order at different times, the average request latency would still be improved significantly.) Placing rows C and D (which have high row buffer locality) in DRAM provides little benefit over placing them in PCM.

Based on this observation, we devise a hybrid memory caching policy that caches in DRAM the rows that mostly miss in the row buffer and are frequently reused. To implement this policy, the memory controller maintains a count of the row buffer misses for recently-used rows in PCM, and places in DRAM the data of rows whose row buffer miss counts exceed a certain threshold (dynamically adjusted at runtime in the RBLA-Dyn mechanism, which we describe in Section 2.3).

2.1 Measuring Row Buffer Locality

The RBLA mechanism tracks row buffer locality statistics for a small number of recently-accessed rows in a hardware structure called the stats store. The stats store resides in the memory controller, and is organized similarly to a cache; however, its data payload per entry is a single row buffer miss counter.

On each PCM access, the memory controller looks for an entry in the stats store using the address of the accessed row. If there is no corresponding entry, a new entry is allocated for the accessed row, possibly evicting an older entry. If the access results in a row buffer miss, the row’s row buffer miss counter is incremented. If the access results in a row buffer hit, no additional action is taken.
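A minimal sketch of this lookup-and-update logic is shown below, assuming the set-associative, LRU-replaced organization with 5-bit saturating miss counters described in Section 2.4. The class and method names are our own, and updating recency on every access is an assumption of this sketch.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the stats store: set-associative, LRU replacement, one
// 5-bit saturating row buffer miss counter per entry.
class StatsStore {
    struct Entry {
        bool valid = false;
        uint64_t tag = 0;        // row address
        uint32_t miss_count = 0; // 5-bit saturating counter (max 31)
        uint64_t lru_stamp = 0;  // larger = more recently used
    };
    size_t sets_, ways_;
    std::vector<Entry> entries_;
    uint64_t clock_ = 0;

public:
    StatsStore(size_t sets = 128, size_t ways = 16)
        : sets_(sets), ways_(ways), entries_(sets * ways) {}

    // Called on each PCM access: find or allocate the row's entry and
    // increment its counter on a row buffer miss. Returns the count.
    uint32_t update(uint64_t row_addr, bool row_buffer_miss) {
        Entry* slot = nullptr;
        Entry* lru = nullptr;
        size_t base = (row_addr % sets_) * ways_;
        for (size_t i = 0; i < ways_; ++i) {
            Entry& e = entries_[base + i];
            if (e.valid && e.tag == row_addr) { slot = &e; break; }
            if (!lru || e.lru_stamp < lru->lru_stamp) lru = &e;
        }
        if (!slot) {          // no entry for this row:
            *lru = Entry{};   // allocate one, evicting the set's LRU entry
            lru->valid = true;
            lru->tag = row_addr;
            slot = lru;
        }
        if (row_buffer_miss && slot->miss_count < 31) ++slot->miss_count;
        slot->lru_stamp = ++clock_;  // recency update (an assumption)
        return slot->miss_count;
    }

    // Periodic reset of all counters (Section 2.2).
    void reset_counters() {
        for (Entry& e : entries_) e.miss_count = 0;
    }
};
```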

2.2 Triggering Row Caching

Rows that exhibit low row buffer locality and high reuse will have high row buffer miss counter values. The RBLA mechanism selectively caches these rows by triggering the caching of a row in DRAM when the row’s row buffer miss counter exceeds a threshold value, MissThresh. Setting MissThresh to a lower value causes more rows, including rows with higher row buffer locality, to be cached.

Caching rows based on their row buffer locality attempts to migrate data between PCM and DRAM only when such data movement is beneficial. This affects system performance in three ways. First, placing in DRAM rows that have low row buffer locality improves average memory access latency, due to the lower row buffer miss latency of DRAM compared to PCM. Second, by selectively caching data that benefits from being migrated to DRAM, RBLA reduces unnecessary data movement between DRAM and PCM (i.e., data that frequently hits in the row buffer incurs the same access latency in PCM as in DRAM, and is thus left in PCM). This reduces memory bandwidth consumption, allowing more bandwidth to be used to serve demand requests, and enables better utilization of the DRAM cache space. Third, allowing data that frequently hits in the row buffer to remain in PCM contributes to balancing the memory request load between DRAM and PCM.

To prevent rows with low reuse from gradually building up large enough row buffer miss counts over an extended period of time to exceed MissThresh and trigger row caching, we apply a periodic reset to all of the row buffer miss count values. We set this reset interval to 10 million cycles empirically.
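Putting Sections 2.1 and 2.2 together, a hypothetical controller hook could look like the sketch below. It builds on the StatsStore sketch above; migrate_row_to_dram is an assumed routine standing in for the actual migration machinery.

```cpp
#include <cstdint>

void migrate_row_to_dram(uint64_t row_addr);  // assumed: performs the migration

constexpr uint64_t kResetInterval = 10'000'000;  // cycles (Section 2.2)

// Called by the memory controller on every PCM access.
void on_pcm_access(StatsStore& stats, uint64_t row_addr, bool row_buffer_miss,
                   uint32_t miss_thresh, uint64_t now_cycles) {
    static uint64_t last_reset = 0;
    // Periodically clear all miss counters so that low-reuse rows cannot
    // slowly accumulate enough misses to trigger caching.
    if (now_cycles - last_reset >= kResetInterval) {
        stats.reset_counters();
        last_reset = now_cycles;
    }
    // Trigger caching once the row's miss count exceeds MissThresh.
    if (stats.update(row_addr, row_buffer_miss) > miss_thresh)
        migrate_row_to_dram(row_addr);
}
```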

2.3 Dynamic Threshold Adaptation: RBLA-Dyn

We improve the adaptivity of RBLA to workload and system variations by dynamically determining the value of MissThresh. The key idea behind this scheme, which we call RBLA-Dyn, is that the number of cycles saved by caching rows in DRAM should outweigh the cost of migrating that data to DRAM. RBLA-Dyn estimates, on an interval basis, the first order cost and benefit of employing a certain MissThresh value, and increases or decreases the MissThresh value to maximize the net benefit (i.e., benefit minus cost).

Since data migration operations can delay demand requests, we approximate cost as the number of cycles spent migrating each row across the memory channels ($t_{migration}$) times the number of rows migrated ($NumRowsMigrated$):

$$\mathrm{Cost} = t_{migration} \times NumRowsMigrated \qquad (1)$$

If these data migrations are eventually beneficial, the access latency to main memory will decrease. Hence, we can compute the benefit of migration as the number of cycles saved by accessing the data from the DRAM cache as opposed to PCM:

$$\mathrm{Benefit} = NumReads_{DRAM} \times (t_{read,PCM} - t_{read,DRAM}) + NumWrites_{DRAM} \times (t_{write,PCM} - t_{write,DRAM}) \qquad (2)$$

In this equation, $NumReads_{DRAM}$ and $NumWrites_{DRAM}$ are the number of reads and writes performed in DRAM after migration, $t_{read,DRAM}$ and $t_{write,DRAM}$ are the read and write latencies of a DRAM row buffer miss, and $t_{read,PCM}$ and $t_{write,PCM}$ are the read and write latencies of a PCM row buffer miss. RBLA-Dyn accounts for reads and writes separately, as they incur different latencies in many NVM technologies, such as PCM.

RBLA-Dyn uses a simple hill-climbing algorithm (see Algorithm 1 in our ICCD 2012 paper [125]) to find the value of MissThresh that maximizes the net benefit. The algorithm is executed at the end of each interval (10 million cycles in our setup). We refer the reader to Section IV-C of our ICCD 2012 paper [125] for more details on the RBLA-Dyn mechanism.
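For illustration, the sketch below gives one plain hill-climbing interpretation of this interval-based adjustment, using Equations (1) and (2). The exact procedure is Algorithm 1 in [125]; the structure and variable names here are our own.

```cpp
#include <cstdint>

// Per-interval statistics gathered by the controller.
struct IntervalStats {
    uint64_t rows_migrated;  // NumRowsMigrated in Eq. (1)
    uint64_t dram_reads;     // NumReads_DRAM in Eq. (2)
    uint64_t dram_writes;    // NumWrites_DRAM in Eq. (2)
};

// Row buffer miss latencies, in cycles (device-dependent inputs).
struct Latencies {
    uint64_t migrate, dram_read, dram_write, pcm_read, pcm_write;
};

// Executed at the end of each interval; returns MissThresh for the next one.
uint32_t adjust_threshold(uint32_t miss_thresh, const IntervalStats& s,
                          const Latencies& t, int64_t& prev_net_benefit) {
    int64_t cost = int64_t(s.rows_migrated) * int64_t(t.migrate);  // Eq. (1)
    int64_t benefit =                                              // Eq. (2)
        int64_t(s.dram_reads)  * (int64_t(t.pcm_read)  - int64_t(t.dram_read)) +
        int64_t(s.dram_writes) * (int64_t(t.pcm_write) - int64_t(t.dram_write));
    int64_t net = benefit - cost;

    // Hill climbing: keep moving the threshold in the same direction while
    // the net benefit improves; reverse direction when it worsens.
    static int step = -1;  // start by trying a lower (more aggressive) threshold
    if (net < prev_net_benefit) step = -step;
    prev_net_benefit = net;

    uint32_t next = uint32_t(int64_t(miss_thresh) + step);
    return next == 0 ? 1 : next;  // keep MissThresh at least 1
}
```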

2.4 Implementation and Hardware Cost

The primary hardware cost incurred in implementing a row buffer locality-aware caching mechanism on top of an existing hybrid memory system is the stats store. We model a 16-way, 128-set, LRU-replacement stats store using 5-bit row buffer miss counters, which occupies a total of 9.25 KB. This stats store achieves within 0.3% of the system performance (and within 2.5% of the memory lifetime) of an unlimited-sized stats store for RBLA-Dyn.
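The reported size is consistent with a back-of-envelope check (our own arithmetic, assuming a 32-bit row tag per entry; the paper does not give this breakdown):

```latex
\begin{align*}
\text{entries} &= 128~\text{sets} \times 16~\text{ways} = 2048,\\
\text{bits per entry} &= \underbrace{32}_{\text{row tag (assumed)}} + \underbrace{5}_{\text{miss counter}} = 37,\\
\text{total} &= 2048 \times 37~\text{bits} = 75{,}776~\text{bits} = 9472~\text{B} = 9.25~\text{KB}.
\end{align*}
```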

3 Evaluation Methodology

We use a cycle-level in-house x86 multi-core simulator, whose front-end is based on Pin. The simulator is an early predecessor of Ramulator [53, 103] and the ThyNVM simulator [100]. We collect results using multiprogrammed workloads consisting of server- and cloud-type applications (including TPC-C/H [118], Apache Web Server, and video processing benchmarks) for a 16-core system. We compare our row buffer locality-aware caching policy (RBLA) against a policy that caches data that is frequently accessed (FREQ, similar in approach to [43]). We use this competitive baseline because we find that conventional LRU caching performs worse due to its high memory bandwidth consumption. FREQ caches a row when the number of accesses to the row exceeds a threshold value. FREQ-Dyn adopts the same dynamic threshold adjustment algorithm as RBLA-Dyn (Section 2.3). Our methodology and workloads are described in detail in Section VI of our ICCD 2012 paper [125].

4 Evaluation

Performance. Figure 4 shows the weighted speedup of the four caching techniques that we evaluate. As we observe from the figure, RBLA-Dyn provides the highest performance (14% improvement in weighted speedup over FREQ) among the four techniques. RBLA and RBLA-Dyn outperform FREQ and FREQ-Dyn, respectively, because the RBLA techniques place data with low row buffer locality in DRAM where it can be accessed at the lower DRAM array access latency, while keeping data with high row buffer locality in PCM where it can be accessed at the already-low row buffer hit latency.

Figure 4: Weighted speedup of the four caching techniques: FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn. Reproduced from [125].
Figure 5: Fairness of the four caching techniques: FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn (lower is better). Reproduced from [125].

Thread Fairness. Figure 5 shows the fairness of each caching technique. We measure fairness using maximum slowdown [28, 51, 52, 3, 27, 86, 115, 121, 122, 113, 112], which is the highest slowdown (reciprocal of speedup) experienced by any benchmark within the multiprogrammed workload. A lower maximum slowdown indicates higher fairness. We observe from the figure that RBLA-Dyn provides the highest thread fairness (6% improvement in maximum slowdown over FREQ) out of all evaluated policies. RBLA-Dyn throttles back non-beneficial data migrations, reducing the amount of memory bandwidth and DRAM space consumed by such migrations. Combined with the reduced average memory access latency, this reduces contention for memory bandwidth among co-running applications, providing higher fairness.
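For reference, weighted speedup and maximum slowdown follow the standard definitions used in the scheduling works cited above, based on each application's instructions per cycle (IPC) when running alone versus together:

```latex
\text{Weighted Speedup} = \sum_{i} \frac{\mathrm{IPC}_i^{\text{shared}}}{\mathrm{IPC}_i^{\text{alone}}},
\qquad
\text{Maximum Slowdown} = \max_{i} \frac{\mathrm{IPC}_i^{\text{alone}}}{\mathrm{IPC}_i^{\text{shared}}}
```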

Memory Energy Efficiency. Figure 6 shows that RBLA-Dyn achieves the highest memory energy efficiency (10% improvement over FREQ) among the evaluated policies, in terms of performance per Watt. This is because RBLA-Dyn places data with low row buffer locality in DRAM, making the energy cost of row buffer miss accesses lower than it would be if such data were placed in PCM. RBLA-Dyn also reduces energy consumption by reducing the number of non-beneficial data migrations.

Figure 6: Energy efficiency of the four caching techniques: FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn. Reproduced from [125].

We provide the following other evaluation results in Section VII of our ICCD 2012 paper [125]:

  • Impact of RBLA-Dyn on average memory latency (Section VII-A of [125]).

  • Impact of RBLA-Dyn on DRAM and PCM channel utilization (Section VII-A of [125]).

  • Memory access breakdown of each workload to DRAM and PCM (Section VII-A of [125]).

  • Impact of RBLA-Dyn on PCM lifetime (Section VII-D of [125]).

  • Comparison with all-PCM and all-DRAM systems (Section VII-E of [125]).

As we discuss in detail in our ICCD 2012 paper [125], RBLA-Dyn bridges the gap in performance between homogeneous all-DRAM and all-PCM memory systems of equal addressable capacity (achieving within 29% of the performance of an all-DRAM system, and improving performance by 31% over an all-PCM system), while providing close to seven years of memory lifetime. (Note that lifetime can be further improved by enabling more aggressive write optimization [106], and by taking advantage of application-level error tolerance [74].)

We conclude that taking row buffer locality into account enables new hybrid memory caching policies that achieve high performance and energy efficiency.

5 Related Work

To our knowledge, our ICCD 2012 paper [125] is the first work to observe that row buffer hit latencies are similar across different memory technologies, and to use this observation to devise a caching policy that improves the performance and energy efficiency of a hybrid memory system. No previous work, as far as we know, considered row buffer locality as a key metric for deciding what data to cache and what not to cache. We discuss related work on caching policies and hybrid memory systems.

Caching Based on Data Access Frequency. Jiang et al. [43] propose caching only the data that experiences a high number of accesses in an on-chip DRAM cache (in 4–8 KB block sizes), to reduce off-chip memory bandwidth consumption. Johnson and Hwu [44] use a counter-based mechanism to track data reuse at a granularity larger than a cache block. Cache blocks in a region with less reuse bypass a direct-mapped cache if that region conflicts with another that has more reuse. We propose to take advantage of row buffer locality in memory banks when employing off-chip DRAM and PCM. We exploit the fact that accesses to DRAM and PCM have similar average latencies for rows that have high row buffer locality.

Ramos et al. [98] adapt a buffer cache replacement algorithm to rank pages based on their frequency and recency of accesses, and place the highly-ranking pages in DRAM, in a DRAM-PCM hybrid memory system. Our work is orthogonal, because the page-ranking algorithm can be adapted to rank pages based on their frequency and recency of row buffer misses (not counting accesses that are row buffer hits), for which we expect improved performance.

Caching Based on Locality of Data Access. González et al. [37] propose placing data in one of two last-level caches depending on whether it exhibits spatial or temporal locality. They also propose bypassing the cache when accessing large data structures with large strides (e.g., big matrices) to prevent cache thrashing. Rivers and Davidson [101] propose separating data without temporal locality from data with temporal locality, and placing the former in a special buffer to prevent pollution of the L1 cache. These works are primarily concerned with on-chip L1/L2 caches that have access latencies on the order of a few to tens of processor clock cycles, where off-chip memory bank row buffer locality is less applicable.

There have been many works in on-chip caching to improve cache utilization (e.g., a recent one uses an evicted address filter to predict cache block reuse [108]), but none of these consider the row buffer locality of cache misses.

Caching Based on Other Criteria. Chatterjee et al. [22] observe that the first word of a cache block is critical to performance, and propose to store only the first word of each block in fast DRAM. Phadke and Narayanasamy [95] propose to classify applications into three categories based on memory-level parallelism (MLP): latency-sensitive, bandwidth-sensitive, and insensitive-to-both. To estimate MLP, they profile the misses per kilo-instruction (MPKI) and stall time of each application offline during the compilation stage. Applications with high MPKI but low stall time are considered to have good MLP.

Hybrid Memory Systems. Qureshi et al. [97] propose increasing the size of main memory by adopting PCM, and using DRAM as a conventional cache to PCM. The reduction in page faults due to the increase in main memory size brings performance and energy improvements to the system. Our ICCD 2012 paper [125] proposes a new, effective policy for caching data in DRAM, and studies performance effects in the absence of page faults.

Li et al. [66] propose UHM, a utility-based hybrid memory management mechanism that expands upon our RBLA policy. UHM estimates the utility of each page, which is the benefit to system performance of placing each page in different types of memory (e.g., DRAM and NVM). UHM migrates to the fast memory of a hybrid memory system only those pages whose utility would improve the most after migration.

Ren et al. [100] propose ThyNVM, which manages the DRAM and PCM spaces carefully and adapts the granularity of management to the access patterns in a manner that provides crash consistency in a persistent memory system.

Dhiman et al. [29] propose a hybrid main memory system that exposes DRAM and PCM addressability to the software (OS). If the number of writes to a particular PCM page exceeds a certain threshold, the contents of the page are copied to another page (either in DRAM or PCM), thus facilitating PCM wear-leveling. Mogul et al. [82] suggest that the OS exploit metadata information available to it to make data placement decisions between DRAM and non-volatile memory. Similar to [29], their data placement criteria are centered around the write frequency to data. Our proposal is complementary to this work, and row buffer locality information, if exposed, can be used by the OS to place pages in DRAM or PCM.

Bivens et al. [4] examine the various design concerns of a heterogeneous memory system, such as the memory latency, bandwidth, and endurance requirements of employing storage class memory (e.g., PCM, STT-MRAM, NAND flash memory). Their hybrid memory organization is similar to ours and that in [97], in that DRAM is used as a cache to a slower memory medium, transparently to software. Phadke and Narayanasamy [95] propose to profile the memory access patterns of individual applications in a multi-core system, and place their working sets in the particular type of DRAM that best suits each application’s memory demands. In contrast, RBLA dynamically makes fine-grained data placement decisions at a row granularity, depending on the row buffer locality characteristics of each page.

Agarwal et al. [1] propose a software-based approach to manage huge pages (e.g., 2MB pages) in hybrid memory systems. The mechanism profiles the memory access patterns of huge pages, and uses the profiling information to guide page migration between DRAM and NVM. Peña and Balaji [94] propose a profiling tool to assess the impact of distributing memory objects across memory devices in hybrid memory systems. Bock et al. [5] propose a scheme that allows concurrent migration of multiple pages between different types of memory devices without significantly affecting the memory bandwidth. Gai et al. [34] propose a data placement scheme that aims to minimize the energy consumption of hybrid memory systems. Liu et al. [69] propose a scheme that manages the entire memory hierarchy, which includes caches, memory channels, and DRAM/NVM banks. Dulloor et al. [30] propose a programmer-guided data placement tool, which requires programmers to modify the source code, and needs data from a representative profiling run of the application, prior to making placement decisions. Ideas from all of these works can be combined with RBLA for better performance and efficiency.

Exploiting Row Buffer Locality. Many previous works exploit row buffer locality to improve memory system performance, but none (to our knowledge) develop a cache data placement policy that considers the row buffer locality of the block to be cached. Lee et al. [55, 56, 57] propose to use multiple short row buffers in PCM devices, much like an internal device cache, to increase the row buffer hit rate. Meza et al. [77] examine the case for small row buffers for NVM devices. Sudan et al. [116] propose a mechanism that identifies frequently referenced sub-rows of data, and migrates them to reserved rows. By co-locating these frequently accessed sub-rows, this scheme aims to increase the row buffer hit rate of memory accesses, improving performance and reducing energy consumption. DRAM-aware last-level cache writeback schemes [111, 60] speculatively issue writeback requests that are predicted to hit in the row buffer. RBLA is complementary to these works, and can be applied together with them because RBLA targets a different problem.

Row buffer locality is also commonly exploited in memory scheduling algorithms. The First-Ready First-Come-First-Serve algorithm (FR-FCFS) [102, 132] prioritizes memory requests that hit in the row buffer, improving the latency, throughput, and energy cost of serving memory requests. Many other memory scheduling algorithms [89, 90, 51, 52, 59, 33, 85, 60, 113, 115, 120, 130, 41, 58, 111, 84, 36, 124, 71, 112, 114, 83, 86, 46, 47, 3, 42, 121, 31, 32] build upon this “row-hit first” principle.

Muralidhara et al. [86] use a thread’s row buffer locality as a metric to decide which channel the thread’s pages should be allocated to in a multi-channel memory system. Their goal is to reduce memory interference between threads, and as such their technique is complementary to ours.

6 Significance

Our ICCD 2012 paper [125] makes several novel contributions that we expect will have a long-term impact on the design of memory systems, and we believe that our work inspires several new research questions.

6.1 Long-Term Impact

The memory scaling bottleneck continues to be a significant hurdle to system performance and energy efficiency [87, 91, 88]. Emerging applications operate on increasingly large datasets, and require high-capacity, high-performance main memories, but the poor scaling of DRAM limits the ability of these applications to fit their entire working sets within a DRAM-based main memory. Because DRAM cannot keep pace with application needs, we expect that the demand for alternative memory technologies will continue to grow in the coming years.

Hybrid memory systems can allow systems to harness these alternative memory technologies without fully sacrificing the benefits of DRAM. By combining slower but larger memories (e.g., NVM) with faster but smaller memories (e.g., DRAM), a hybrid memory system has the potential to provide the illusion of a fast and large memory system at a reasonable cost. However, as we discuss, this potential can only be realized by carefully considering which pieces of data are placed in each of the constituent memories of a hybrid memory system. To our knowledge, our ICCD 2012 paper [125] is the first to show that the organization of the underlying memory technologies, such as the existence of row buffers, can be used to make more intelligent data placement decisions.

While our ICCD 2012 paper [125] shows the impact of our proposed data placement policy on a hybrid memory consisting of DRAM and PCM, it can be used to enable a wide range of hybrid memory systems. For example, STT-MRAM devices can make use of a row buffer [54, 77, 78, 2], and expensive reduced-latency DRAM devices [80, 104] also make use of a row buffer. RBLA can be used to improve the performance of hybrid memories that include any of these memory technologies, as our general observations on row buffer locality remain the same. We expect that this versatility will increase the potential impact of RBLA, as no single memory technology has yet emerged as the dominant replacement for DRAM.

6.2 Research Questions

As we show in our ICCD 2012 paper [125], the efficient management of hybrid memory systems requires the identification and consideration of the key similarities and trade-offs of each memory type. An open research question inspired by RBLA’s use of row buffer locality is what other properties of memory systems should hybrid memory management mechanisms consider? For example, one of our recent works [66] incorporates information on memory-level parallelism (MLP) into data placement decisions in hybrid memory management. As that work shows, we can use a combination of access frequency, row buffer locality, and MLP to predict the overall performance impact of migrating a page between each memory type. As future memory technologies are developed, we expect that other such properties will be important to consider, in order to maximize the benefits provided by the hybrid memory system.

Several works propose on-chip DRAM caches [43, 23, 127, 128], where a small amount of DRAM is used as a last-level cache to reduce the number of accesses to a larger off-chip DRAM. This is akin to the design of a hybrid memory system, but there are different trade-offs in each design. For example, while the row buffer hit latency is typically similar across memory technologies in hybrid memories, both a row buffer hit and a row buffer miss take longer when accessing the off-chip DRAM as opposed to accessing the on-chip DRAM cache. This inspires us to ask: how can principles of hybrid memory systems be applied to DRAM cache management, and vice versa? Extending upon this, can we design general mechanisms that can be applied to both hybrid memory systems and DRAM cache management? As one example, our recent work [66] on predicting the utility of data placement decisions is highly parameterized, and these parameters can easily be tuned to represent the trade-offs in both hybrid memory systems and in systems with a DRAM cache. We hope that future works will strive to develop other such general mechanisms.

7 Conclusion

Our ICCD 2012 paper [125] observes that row buffer access latency (and energy) in DRAM and PCM are similar, while PCM array access latency (and energy) is much higher than that of DRAM. Therefore, in a hybrid memory system where DRAM is used as a cache to PCM, it makes sense to place in DRAM the data that would cause frequent row buffer misses, as such data, if placed in PCM, would incur the high PCM array access latency. We develop a caching policy that achieves this effect by keeping track of rows that have high row buffer miss counts (i.e., low row buffer locality but high reuse) and placing only such rows in DRAM. Our final policy dynamically determines the threshold used to decide whether a row has low locality, based on cost-benefit analysis. Evaluations show that the proposed row buffer locality-aware caching policy provides better performance, fairness, and energy efficiency compared to caching policies that consider only access frequency or recency. Our mechanisms are applicable to, and can improve the performance of, other hybrid memory systems consisting of different technologies. We hope that our findings can help ease the adoption of emerging memory technologies in future systems, and inspire further research in data management policies.

Acknowledgments

We thank Saugata Ghose for his dedicated effort in the preparation of this article. We acknowledge the support of AMD, HP, Intel, Oracle, and Samsung. This research was partially supported by the NSF (grants 0953246 and 1212962), GSRC, Intel URO, and ISTC on Cloud Computing. HanBin Yoon was partially supported by the Samsung Scholarship.

References

  • [1] N. Agarwal and T. F. Wenisch, “Thermostat: Application-transparent Page Management for Two-tiered Main Memory,” in ASPLOS, 2017.
  • [2] T. W. Andre, J. J. Nahas, C. K. Subramanian, B. J. Garni, H. S. Lin, A. Omair, and W. L. Martino, Jr., “A 4-Mb 0.18-µm 1T1MTJ Toggle MRAM with Balanced Three Input Sensing Scheme and Locally Mirrored Unidirectional Write Drivers,” JSSC, 2005.
  • [3] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu, “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” in ISCA, 2012.
  • [4] A. Bivens et al., “Architectural design for next generation heterogeneous memory systems,” in IMW, 2010.
  • [5] S. Bock, B. R. Childers, R. Melhem, and D. Mossé, “Concurrent Migration of Multiple Pages in Software-Managed Hybrid Main Memory,” in ICCD, 2016.
  • [6] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives,” Proc. IEEE, 2017.
  • [7] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives,” arXiv:1706.08642 [cs.AR], 2017.
  • [8] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427 [cs.AR], 2017.
  • [9] Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, and E. F. Haratsch, “Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques,” in HPCA, 2017.
  • [10] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis,” in DATE, 2012.
  • [11] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu, “Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery,” in HPCA, 2015.
  • [12] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. Unsal, and K. Mai, “Flash Correct and Refresh: Retention Aware Management for Increased Lifetime,” in ICCD, 2012.
  • [13] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. Unsal, and K. Mai, “Error Analysis and Retention-Aware Error Management for NAND Flash Memory,” Intel Technology Journal, 2013.
  • [14] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai, “Neighbor Cell Assisted Error Correction in MLC NAND Flash Memories,” in SIGMETRICS, 2014.
  • [15] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis, and Modeling,” in DATE, 2013.
  • [16] Y. Cai, Y. Luo, S. Ghose, E. F. Haratsch, K. Mai, and O. Mutlu, “Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery,” in DSN, 2015.
  • [17] K. Chang, A. Yaglikci, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap, D. Lee, M. O’Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMETRICS, 2017.
  • [18] K. K. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes with Accesses,” in HPCA, 2014.
  • [19] K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhimenko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS, 2016.
  • [20] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu, “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM,” in HPCA, 2016.
  • [21] M. T. Chang, P. Rosenfeld, S. L. Lu, and B. Jacob, “Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM,” in HPCA, 2013.
  • [22] N. Chatterjee, M. Shevgoor, R. Balasubramonian, A. Davis, Z. Fang, R. Illikkal, and R. Iyer, “Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access,” in MICRO, 2012.
  • [23] C. C. Chou, A. Jaleel, and M. K. Qureshi, “CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache,” in MICRO, 2014.
  • [24] L. Chua, “Memristor—The Missing Circuit Element,” TCT, Sep. 1971.
  • [25] K. C. Chun, H. Zhao, J. Harms, T.-H. Kim, J.-P. Wang, and C. Kim, “A Scaling Roadmap and Performance Evaluation of In-Plane and Perpendicular MTJ Based STT-MRAMs for High-Density Cache Memory,” JSSC, 2013.
  • [26] S. Chung et al., “Fully Integrated 54nm STT-RAM with the Smallest Bit Cell Dimension for High Density Memory Application,” in IEDM, 2010.
  • [27] R. Das, R. Ausavarungnirun, O. Mutlu, A. Kumar, and M. Azimi, “Application-to-core Mapping Policies to Reduce Memory System Interference in Multi-core Systems,” in HPCA, 2013.
  • [28] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, “Application-Aware Prioritization Mechanisms for On-Chip Networks,” in MICRO, 2009.
  • [29] G. Dhiman et al., “PDRAM: a Hybrid PRAM and DRAM Main Memory System,” in DAC, 2009.
  • [30] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan, “Data Tiering in Heterogeneous Memory Systems,” in EuroSys, 2016.
  • [31] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Fairness via Source Throttling: A Configurable and High-performance Fairness Substrate for Multi-core Memory Systems,” in ASPLOS, 2010.
  • [32] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Prefetch-aware Shared Resource Management for Multi-core Systems,” in ISCA, 2011.
  • [33] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt, “Parallel Application Memory Scheduling,” in MICRO, 2011.
  • [34] K. Gai, M. Qiu, H. Zhao, and L. Qiu, “Smart Energy-Aware Data Allocation for Heterogeneous Memory,” in HPCC/SmartCity/DSS, 2016.
  • [35] W. Gao, L. Wang, J. Zhan, C. Luo, D. Zheng, Z. Jia, B. Xie, C. Zheng, Q. Yang, and H. Wang, “A Dwarf-based Scalable Big Data Benchmarking Methodology,” arXiv CoRR, 2017.
  • [36] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via Processor-side Load Criticality Information,” in ISCA, 2013.
  • [37] A. González et al., “A data cache with multiple caching strategies tuned to different types of locality,” in ICS, 1995.
  • [38] X. Guo, E. İpek, and T. Soyata, “Resistive Computation: Avoiding the Power Wall with Low-Leakage, STT-MRAM Based Computing,” in ISCA, 2009.
  • [39] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee, O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
  • [40] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality,” in HPCA, 2016.
  • [41] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-Optimizing Memory Controllers: A Reinforcement Learning Approach,” in ISCA, 2008.
  • [42] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, “A QoS-Aware Memory Controller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC,” in DAC, 2012.
  • [43] X. Jiang et al., “CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms,” in HPCA, 2010.
  • [44] T. L. Johnson and W.-m. Hwu, “Run-time adaptive cache hierarchy management via reference analysis,” in ISCA, 1997.
  • [45] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. Choi, “Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling,” in The Memory Forum, 2014.
  • [46] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding Memory Interference Delay in COTS-based Multi-core Systems,” in RTAS, 2014.
  • [47] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding and Reducing Memory Interference in COTS-based Multi-core Systems,” Real-Time Systems, vol. 52, no. 3, May 2016.
  • [48] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
  • [49] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
  • [50] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014.
  • [51] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” in HPCA, 2010.
  • [52] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO, 2010.
  • [53] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2015.
  • [54] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” in ISPASS, 2013.
  • [55] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory As a Scalable DRAM Alternative,” in ISCA, 2009.
  • [56] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture and the Quest for Scalability,” Communications of the ACM, 2010.
  • [57] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger, “Phase-Change Technology and the Future of Main Memory,” IEEE Micro, vol. 30, January 2010.
  • [58] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” in MICRO, 2009.
  • [59] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-aware DRAM Controllers,” in MICRO, 2008.
  • [60] C. J. Lee, V. Narasiman, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “DRAM-Aware Last-level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep. TR-HPS-2010-002, 2010.
  • [61] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIGMETRICS, 2017.
  • [62] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” TACO, 2016.
  • [63] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” in HPCA, 2015.
  • [64] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
  • [65] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM,” in PACT, 2015.
  • [66] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility-Based Hybrid Memory Management,” in CLUSTER, 2017.
  • [67] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
  • [68] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA, 2013.
  • [69] L. Liu, H. Yang, Y. Li, M. Xie, L. Li, and C. Wu, “Memos: A Full Hierarchy Hybrid Memory Management Framework,” in ICCD, 2016.
  • [70] T. Liu et al., “A 130.7mm² 2-Layer 32Gb ReRAM Memory Device in 24nm Technology,” JSSC, 2014.
  • [71] W. Liu, P. Huang, T. Kun, T. Lu, K. Zhou, C. Li, and X. He, “LAMS: A Latency-aware Memory Scheduling Policy for Modern DRAM Systems,” in IPCCC, 2016.
  • [72] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked DRAM caches,” in MICRO, 2011.
  • [73] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory,” JSAC, 2016.
  • [74] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu, B. Khessib, K. Vaid, and O. Mutlu, “Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-reliability Memory,” in DSN, 2014.
  • [75] J. A. Mandelman et al., “Challenges and Future Directions for the Scaling of Dynamic Random-Access Memory (DRAM),” IBM JRD, vol. 46, 2002.
  • [76] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” CAL, 2012.
  • [77] J. Meza, J. Li, and O. Mutlu, “A Case for Small Row Buffers in Non-Volatile Main Memories,” in ICCD Poster Session, 2012.
  • [78] J. Meza, J. Li, and O. Mutlu, “Evaluating Row Buffer Locality in Future Non-Volatile Main Memories,” Carnegie Mellon Univ., SAFARI Research Group, Tech. Rep. TR-SAFARI-2012-002, 2012.
  • [79] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A Case for Efficient Hardware/software Cooperative Management of Storage and Memory,” in WEED, 2013.
  • [80] Micron Technology, Inc., “576Mb: x18, x36 RLDRAM3,” 2011.
  • [81] Micron Technology, Inc., “Breakthrough Nonvolatile Memory Technology,” https://www.micron.com/about/our-innovation/3d-xpoint-technology, 2016.
  • [82] J. C. Mogul et al., “Operating system support for NVM+DRAM hybrid main memory,” in HotOS, 2009.
  • [83] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-core Systems,” in USENIX Security, 2007.
  • [84] T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and Its Application to Multi-core DRAM Controllers,” in PODC, 2008.
  • [85] J. Mukundan and J. F. Martinez, “MORSE: Multi-objective Reconfigurable Self-optimizing Memory Scheduler,” in HPCA, 2012.
  • [86] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda, “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” in MICRO, 2011.
  • [87] O. Mutlu, “Memory scaling: A systems architecture perspective,” in IMW, 2013.
  • [88] O. Mutlu, “The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser,” in DATE, 2017.
  • [89] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” in MICRO, 2007.
  • [90] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
  • [91] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory Systems,” SUPERFRI, 2015.
  • [92] H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, and J. Tschanz, “STT-RAM Scaling and Retention Failure,” Intel Technol. J., May 2013.
  • [93] M. Patel, J. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
  • [94] A. J. Peña and P. Balaji, “Toward the Efficient Use of Multiple Explicitly Managed Memory Subsystems,” in CLUSTER, 2014.
  • [95] S. Phadke and S. Narayanasamy, “MLP Aware Heterogeneous Memory System,” in DATE, 2011.
  • [96] M. K. Qureshi, M. M. Franceschini, L. A. Lastras-Montaño, and J. P. Karidis, “Morphable Memory System: A Robust Architecture for Exploiting Multi-level Phase Change Memories,” in ISCA, 2010.
  • [97] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main Memory System using Phase-change Memory Technology,” in ISCA, 2009.
  • [98] L. E. Ramos et al., “Page placement in hybrid memory systems,” in ICS, 2011.
  • [99] S. Raoux et al., “Phase-Change Random Access Memory: A Scalable Technology,” IBM JRD, 2008.
  • [100] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM: Enabling software-transparent crash consistency in persistent memory systems,” in MICRO, 2015.
  • [101] J. Rivers and E. Davidson, “Reducing conflicts in direct-mapped caches with a temporality-based design,” in ICPP, 1996.
  • [102] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access Scheduling,” in ISCA, 2000.
  • [103] SAFARI Research Group, “Ramulator: A DRAM Simulator – GitHub Repository,” https://github.com/CMU-SAFARI/ramulator.
  • [104] Y. Sato et al., “Fast cycle RAM (FCRAM): A 20-ns Random Row Access, Pipe-Lined Operating DRAM,” in VLSIC, 1998.
  • [105] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
  • [106] V. Seshadri, A. Bhowmick, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “The Dirty-Block Index,” in ISCA, 2014.
  • [107] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization,” in MICRO, 2013.
  • [108] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry, “The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,” in PACT, 2012.
  • [109] M. Silva, M. R. Hines, D. Gallo, Q. Liu, K. D. Ryu, and D. da Silva, “CloudBench: Experiment Automation for Cloud Environments,” in IC2E, 2013.
  • [110] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The Missing Memristor Found,” Nature, May 2008.
  • [111] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The Virtual Write Queue: Coordinating DRAM and Last-level Cache Policies,” in ISCA, 2010.
  • [112] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling,” in IEEE TPDS, 2016.
  • [113] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost,” in ICCD, 2014.
  • [114] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu, “The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-application Interference at Shared Caches and Main Memory,” in MICRO, 2015.
  • [115] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu, “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” in HPCA, 2013.
  • [116] K. Sudan et al., “Micro-pages: increasing DRAM efficiency with locality-aware data placement,” in ASPLOS, 2010.
  • [117] The International Technology Roadmap for Semiconductors, “Process integration, devices, and structures,” 2010.
  • [118] Transaction Performance Processing Council, “TPC Benchmarks,” http://www.tpc.org/.
  • [119] Y.-H. Tseng, C.-E. Huang, C. H. Kuo, Y. D. Chih, and C.-J. Lin, “High Density and Ultra Small Cell Size of Contact ReRAM (CR-RAM) in 90nm CMOS Logic Technology and Circuits,” in IEDM, 2009.
  • [120] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “SQUASH: Simple qos-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators,” arXiv CoRR, 2015.
  • [121] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators,” ACM TACO, vol. 12, no. 4, Jan. 2016.
  • [122] H. Vandierendonck and A. Seznec, “Fairness Metrics for Multi-threaded Processors,” IEEE CAL, Feb 2011.
  • [123] H. Wong et al., “Phase change memory,” Proc. of the IEEE, 2010.
  • [124] D. Xiong, K. Huang, X. Jiang, and X. Yan, “Memory Access Scheduling Based on Dynamic Multilevel Priority in Shared DRAM Systems,” ACM TACO, vol. 13, no. 4, Dec. 2016.
  • [125] H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, and O. Mutlu, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” in ICCD, 2012.
  • [126] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-change Memories,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 11, no. 4, p. 40, 2014.
  • [127] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation,” in MICRO, 2017.
  • [128] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation,” arXiv:1704.02677 [CoRR], 2017.
  • [129] W. Zhang et al., “Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures,” in PACT, 2009.
  • [130] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” in MICRO, 2014.
  • [131] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology,” in ISCA, 2009.
  • [132] W. K. Zuravleff and T. Robinson, “Controller for a Synchronous DRAM That Maximizes Throughput by Allowing Memory Requests and Commands to Be Issued Out of Order,” US Patent No. 5,630,096, 1997.