Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation

04/10/2017 ∙ by Xiangyao Yu, et al. ∙ ETH Zurich

Putting the DRAM on the same package with a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs mainly optimize for hit latency and do not consider off-chip bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM. We propose a new DRAM cache design, Banshee, that optimizes for both in- and off-package DRAM bandwidth efficiency without degrading access latency. The key ideas are to eliminate the in-package DRAM bandwidth overheads due to costly tag accesses through the virtual memory mechanism and to incorporate a bandwidth-aware frequency-based replacement policy that is biased to reduce unnecessary traffic to off-package DRAM. Our extensive evaluation shows that Banshee provides significant performance improvement and traffic reduction over state-of-the-art latency-optimized DRAM cache designs.

1 Introduction

In-package DRAM technology integrates the CPU and a high-capacity multi-GB DRAM in the same package, enabling much higher bandwidth than traditional off-package DRAM. For emerging memory bandwidth-bound applications (e.g., graph and machine learning algorithms, sparse linear algebra-based HPC codes), in-package DRAM can significantly boost system performance [1, 2]. Several hardware vendors are either offering or will soon offer processors with in-package DRAM (e.g., Intel’s Knights Landing [3], AMD’s Fiji [4], and Nvidia’s Pascal [5]), and a large number of designs have been proposed in both industry and academia [6, 7, 8, 9, 10, 11, 12].

One critical property of in-package DRAM is that, while it provides high bandwidth, its latency will still be similar to or even worse than that of off-package DRAM [13, 14]. This is one of the reasons why the first products incorporating it are all in the throughput computing space, where the target applications are typically latency-tolerant but very bandwidth-hungry. Many previous DRAM cache designs, however, assumed low-latency in-package DRAM and are therefore not necessarily the best fit.

In particular, many of the designs incur large amounts of traffic to in-package and/or off-package DRAM for metadata management (e.g., tags, LRU bits) and cache replacement. In page-granularity DRAM caches, previous works (e.g., Tagless DRAM cache, TDC [10, 15]) have proposed storing the page mapping information in the Page Table Entries (PTEs) and Translation Lookaside Buffers (TLBs), by giving different physical address regions to in- and off-package DRAMs. This completely removes the bandwidth overhead for tag lookups. However, the bandwidth inefficiency of DRAM cache replacement still remains. Some techniques have been proposed to improve replacement bandwidth efficiency (e.g., footprint cache [15, 16] and frequency-based replacement [17]), but existing solutions still incur significant overhead.

Supporting efficient replacement in PTE/TLB-based DRAM cache designs is inherently difficult due to the TLB coherence problem. When a page is remapped, an expensive mechanism is required to keep all TLBs coherent. Due to this complexity, previous work imposed restrictions on when replacement is allowed to happen (e.g., on every miss [10]), making it hard to design bandwidth-efficient replacement.

In this paper, we propose Banshee, a DRAM cache design aimed at maximizing the bandwidth efficiency of both in- and off-package DRAM, while also providing low access latency. Similar to TDC [10], Banshee avoids tag lookup by storing DRAM cache presence information in the page table and TLBs. Banshee’s key innovation over TDC is its bandwidth-efficient replacement policy, and the design decisions that enable its usage. Specifically, Banshee uses a hardware-managed frequency-based replacement (FBR) policy that only caches hot pages to reduce unnecessary data replacement traffic. To reduce the cost of accessing/updating frequency counters (which are stored in in-package DRAM), Banshee uses a new sampling approach to only read/write counters for a fraction of memory accesses. Since Banshee manages data at page granularity, sampling has minimal effect on the accuracy of frequency prediction. This strategy significantly brings down the bandwidth overhead of cache replacement. The new replacement policy also allows Banshee to support large (2 MB) pages efficiently with simple extensions. Traditional page-based DRAM cache algorithms, in contrast, fail to cache large pages due to the overhead of frequent page replacement [15].

To enable the usage of this replacement scheme, we need new techniques to simplify TLB coherence. Banshee achieves this by not updating the page table and TLBs for every page replacement, but only doing so lazily in batches to amortize the cost. The batch update mechanism is implemented through software/hardware co-design where a small hardware table (Tag Buffer) maintains the up-to-date mapping information at each memory controller, and triggers the software routine to update page tables and TLBs whenever the buffer is full.

Specifically, Banshee makes the following contributions:

  1. Banshee significantly improves the bandwidth efficiency for DRAM cache replacement through a bandwidth-aware frequency-based replacement policy implemented in hardware. It minimizes unnecessary data and meta-data movement.

  2. Banshee resolves the address consistency problem and greatly simplifies the TLB coherence problem present in previous PTE/TLB-based DRAM cache designs, via a new, lazy TLB coherence mechanism. This allows more efficient replacement policies to be implemented.

  3. By combining PTE/TLB-based page mapping management and efficient hardware replacement, Banshee significantly improves in-package DRAM bandwidth efficiency. Compared to three other state-of-the-art DRAM cache designs, Banshee outperforms the best of them (Alloy Cache [7]) by 15.0% and reduces in-package DRAM traffic by 35.8%.

  4. Banshee can efficiently support large pages (2 MB) using PTEs/TLBs. Replacement overhead of large pages is significantly reduced through our bandwidth-efficient replacement policy.

2 Background

In this section, we discuss the design space of DRAM caches, and where previous proposals fit in that space. We focus on two major considerations in DRAM cache designs: how to track the contents of the cache (Section 2.1), and how to change the contents (i.e., replacement, Section 2.2).

For our discussion, we assume the processor has an SRAM last-level cache (LLC) managed at cacheline (64 B) granularity. Physical addresses are mapped to memory controllers (MC) statically at page granularity (4 KB). We also assume the in-package DRAM is similar to the first-generation High Bandwidth Memory (HBM). The link width between the memory controller and HBM is 16B, but with a minimum data transfer size of 32B [1]. Thus, reading a 64B cache line plus the tag transfers at minimum 96B. We also assume the in- and off-package DRAMs have the same latency.
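To make the 96 B figure concrete: transferring a 64 B cacheline takes four 16 B transfers, and the accompanying tag read, however few bytes it actually needs, costs at least one additional minimum-sized 32 B transfer, so each tag-checked access moves at least 64 B + 32 B = 96 B, as reflected in Table 1.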

Scheme | DRAM Cache Hit | DRAM Cache Miss | Replacement Traffic | Replacement Decision | Large Page Caching
Unison | Traffic: at least 128 B (data + tag read/update); Latency: 1x | Traffic: at least 96 B (spec. data + tag read); Latency: 2x | On every miss: 32 B tag + footprint size | Hardware managed, way-associative, LRU | No
Alloy | Traffic: 96 B (data + tag read); Latency: 1x | Traffic: 96 B (spec. data + tag read); Latency: 2x | On some misses: 32 B tag + 64 B fill | Hardware managed, direct-mapped, stochastic [9] | Yes
TDC | Traffic: 64 B; Latency: 1x; TLB coherence | Traffic: 64 B; Latency: 1x; TLB coherence | On every miss: footprint size [15] | Hardware managed, fully-associative, FIFO | No
HMA | Traffic: 64 B; Latency: 1x | Traffic: 0 B; Latency: 1x | High replacement cost | Software managed | Yes
Banshee | Traffic: 64 B; Latency: 1x | Traffic: 0 B; Latency: 1x | Only for hot pages: 32 B tag + page size | Hardware managed, way-associative, frequency-based | Yes
Table 1: Behavior of different DRAM cache designs. Assumes perfect way prediction for Unison Cache. Latency is relative to access time for off-package DRAM.

2.1 Tracking DRAM Cache Contents

For each LLC miss, the memory controller determines whether to access the in-package or off-package DRAM. Therefore, the mapping of each data block must be stored somewhere in the system.

2.1.1 Using Tags

The most common technique for tracking the contents of a cache is explicitly storing the tags for cached data. However, the tag storage can be significant when the DRAM cache is large. A 16 GB DRAM cache, for example, requires 512 MB (or 8 MB) of tag storage if managed at cacheline (or page) granularity. As a result, state-of-the-art DRAM cache designs store tags in the in-package DRAM itself. These designs, however, have the bandwidth overhead of a tag lookup for each DRAM cache access.
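These sizes follow from simple arithmetic: a 16 GB cache holds 16 GB / 64 B = 256M cachelines, or 16 GB / 4 KB = 4M pages; at roughly 2 bytes of tag state per entry (an assumption chosen to match the totals above), this gives about 512 MB of tags at cacheline granularity but only about 8 MB at page granularity.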

Table 1 summarizes the behavior for some state-of-the-art DRAM cache designs, including two that store tags in the in-package DRAM, Alloy Cache [7] and Unison Cache [8].

Alloy Cache is a direct-mapped DRAM cache storing data in cacheline granularity. The tag and data for a set are stored adjacently. On a hit, data and tag are read together with latency roughly that of a single DRAM access. On a miss, we pay the cost of a hit plus the access to off-package DRAM and filling the data into the DRAM cache. Therefore, both latency and bandwidth consumption may double. The original paper proposed to issue requests to in- and off-package DRAMs in parallel to hide miss latency. We disable this optimization here since it hurts performance when off-package DRAM bandwidth is scarce.

Unison Cache [8] stores data in page granularity and supports set associativity. The design relies on way prediction to provide fast hit latency. On an access, the memory controller reads all of the tags for a set plus the data only from the predicted way. On a hit and correct way prediction, the latency is roughly that of a single DRAM access; the data and tags are loaded and the LRU bits are updated. On a miss, latency is doubled, and we need extra traffic for off-package DRAM accesses and maybe cache replacement.

2.1.2 Using Address Remapping

Another technique for tracking data in the DRAM cache is via the virtual-to-physical address mapping [10, 18] in the page tables and TLBs. In these designs, the physical address space is carved up between in- and off-package DRAMs. Where a page is mapped can be determined directly from its physical address, so a tag lookup is no longer required.

Besides the TLB coherence challenge mentioned in Section 1, TLB/PTE-based designs have another challenge that we call address consistency. When a page is remapped, its physical address is changed. Therefore, all of the on-chip caches must be scrubbed of cachelines on the remapped page to ensure consistent physical addresses. This leads to significant overhead for each page remapping.

Heterogeneous Memory Architecture (HMA [18]) uses a software based solution to these problems. Periodically, the operating system (OS) ranks all pages and moves hot pages into the in-package DRAM (and cold pages out). The OS updates all PTEs, flushes all TLBs for coherence, and flushes remapped pages from caches for address consistency. Due to the high cost, remapping can only be done at a very coarse granularity (100 ms to 1 s) in order to amortize the cost. Therefore, the replacement policy is not able to capture fine-grained temporal locality in applications. Also, all programs running in the system have to stop when the pages are moved, causing undesirable performance hiccups.

Tagless DRAM Cache (TDC [10]) also uses address remapping, but enables frequent cache replacement via hardware-managed TLB coherence. Specifically, TDC maintains a directory structure in main memory and updates it whenever an entry is inserted or removed from any TLB. Such fine-grained TLB coherence incurs extra design complexity. Further, the storage of the directory may be a potential scalability bottleneck as core count increases. The paper [10] does not discuss address consistency, so it is unclear which solution, if any, TDC employs.

2.2 DRAM Cache Replacement

Cache replacement is another big challenge in in-package DRAM designs. We discuss both hardware and software approaches presented in previous work.

2.2.1 Hardware-Managed

Hardware-managed caches are able to make placement decisions on each DRAM cache miss, and thus can adapt rapidly to changing workload behavior. Many designs, including Alloy Cache, Unison Cache and TDC, always place the data in the DRAM cache for each cache miss. Although this is common practice for SRAM caches, the incurred extra replacement traffic is quite expensive for DRAM. Some previous designs try to reduce replacement traffic with a stochastic mechanism [9], where replacement happens with a small probability on each access. For page-granularity DRAM cache designs, frequent replacement also causes overfetching, where a whole page is cached but only a subset is actually accessed before eviction. For this problem, previous works proposed to use a sector cache design [19] and rely on a “footprint predictor” [20, 15] to determine which blocks to load on a cache miss. We will show how Banshee improves bandwidth efficiency over these designs in Section 5.

When a cacheline/page is inserted, a replacement policy must select a victim cacheline/page. Alloy Cache is direct mapped, and so only has one choice. Conventional set-associative caches (e.g., Unison Cache) use least-recently-used (LRU) [8] or frequency-based (FBR) [17] replacement. These policies typically require additional metadata to track the relative age-of-access or access frequency for cachelines. Loading and updating the metadata incurs significant DRAM traffic. TDC implements a fully-associative DRAM cache but uses a FIFO replacement policy, which may hurt hit rate. Since Unison Cache and TDC do replacement at page granularity for each cache miss, they cannot support large pages efficiently.

2.2.2 Software-Managed

Software-implemented cache replacement algorithms (e.g., HMA [18]) can be fairly sophisticated, and so may do a better job than hardware mechanisms at predicting the best data to hold in the cache. However, they incur significant execution time overhead, and therefore, are generally invoked only periodically. This makes them much slower to adapt to changing application behavior.

3 Banshee DRAM Cache Design

Banshee aims to maximize bandwidth efficiency for both in- and off-package DRAM. To track DRAM cache contents, Banshee manages data mapping at page granularity using the page tables and TLBs, like TDC and software-based designs. Different from previous designs, however, Banshee does not change a page’s physical address when it is remapped. Extra bits are added to PTEs/TLBs to indicate whether the page is cached or not. This helps resolve the address consistency problem (cf. Section 2.1.2). Banshee also uses a simpler and more efficient TLB coherence mechanism through software/hardware co-design.

3.1 Banshee Architecture

Figure 1: Overall Architecture of Banshee. Changes to hardware/software components are highlighted in red.

Banshee implements a lazy TLB coherence protocol. Information about recently remapped pages is managed in hardware and periodically made coherent with the page tables and TLBs with software support. Unlike a software-based solution, the cache replacement decision can be made in hardware and take effect instantly. Unlike previous hardware-based solutions, Banshee avoids the need for cache scrubbing.

Specifically, Banshee achieves this by adding a small hardware table in each memory controller. The table, called the Tag Buffer, holds information on recently remapped pages that is not yet updated in the PTEs. When a page is inserted into or evicted from in-package DRAM, the tag buffer is updated but the corresponding PTEs and TLBs are not. Since all LLC misses to that page go through the memory controller, they will see the up-to-date mapping even if the request carries a stale mapping from a TLB. Therefore, there is no need to update the TLBs eagerly. When the tag buffer eventually gets filled up, we push the latest mapping information to the PTEs and TLBs through a software interface. Essentially, the tag buffer allows us to update the page table lazily in batches, eliminating the need for fine-grained TLB coherence.

Fig. 1 shows the architecture of Banshee. Changes made to both hardware and software (TLB/PTE extensions and the tag buffer) are highlighted in red. The in-package DRAM is a memory side cache and is not inclusive with respect to on-chip caches. We explain the components of the architecture in the rest of this section.

3.2 PTE Extension

The DRAM cache in Banshee is set-associative; each PTE is extended with mapping information indicating whether (cached bit) and where (way bits) a page is cached. The cached bit indicates whether a page is resident in the DRAM cache, and if so, the way bits indicate in which way the page is cached.

Every L1 miss carries the mapping information (i.e., cached bit and way bits) from the TLB through the memory hierarchy. If the access is satisfied before it reaches a memory controller, the cached bit and way bits are simply ignored. If the request misses the LLC and reaches a memory controller, it first looks up the tag buffer for the latest mapping. A tag buffer miss means the attached information is up-to-date. For a tag buffer hit, the mapping carried by the request is ignored and the mapping info from the tag buffer is used.
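To make this lookup flow concrete, the sketch below models in Python how a memory controller resolves the mapping for an LLC miss. It is a minimal illustration, not the paper's interface: the names (Mapping, TagBuffer, resolve_mapping) and the dictionary-based tag buffer model are assumptions.

    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class Mapping:
        cached: bool   # cached bit: is the page resident in the DRAM cache?
        way: int       # way bits: which way of its set (valid only if cached)

    class TagBuffer:
        """Holds the latest mappings of recently remapped pages (toy model)."""
        def __init__(self) -> None:
            self.entries: Dict[int, Mapping] = {}  # physical page number -> mapping

        def lookup(self, page: int) -> Optional[Mapping]:
            return self.entries.get(page)

    def resolve_mapping(page: int, tlb_mapping: Mapping, tag_buffer: TagBuffer) -> Mapping:
        # A tag buffer hit overrides the (possibly stale) mapping carried by the
        # request; a tag buffer miss means the TLB-provided mapping is up to date.
        latest = tag_buffer.lookup(page)
        return latest if latest is not None else tlb_mapping

    # Example: the TLB still says "not cached", but the page was remapped recently.
    tb = TagBuffer()
    tb.entries[0x1234] = Mapping(cached=True, way=2)
    print(resolve_mapping(0x1234, Mapping(cached=False, way=0), tb))  # -> cached=True, way=2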

Unlike previous PTE/TLB-based designs, which support a NUMA-style DRAM cache (i.e., in- and off-package DRAMs have separate physical address spaces), Banshee assumes inclusion between in- and off-package DRAMs and accesses memory with a single address space. We make this design decision because NUMA-style caching suffers from the address consistency problem discussed in Section 2.1.2. Namely, whenever a page is remapped, all cachelines in on-chip caches belonging to the page need to be updated or invalidated for consistency. This incurs significant overhead when cache replacement is frequent. In Banshee, however, remapping a page does not change its physical address, which avoids the address consistency issue.

Hardware prefetches from the L2 cache or lower present a complication. These caches typically operate in physical address space, and thus cannot access TLBs for their mapping information. In most systems, however, prefetches of this sort stop at a page boundary, since the data beyond that boundary in physical address space is likely unrelated to the previous page. Further, these prefetches are always triggered (directly or indirectly) by demand or prefetch requests coming from the core or L1. Thus, we can copy the mapping information from a triggering access to all prefetches it triggers.

3.3 Tag Buffer

A tag buffer resides in each memory controller and holds the mapping information of recently remapped pages belonging to that memory controller. Fig. 2 shows the architecture of a tag buffer. It is organized as a set associative cache with the physical address as the tag. The valid bit indicates whether the entry contains a valid mapping. For a valid entry, the cached bit and way bits indicate whether and where the page exists in DRAM cache. The remap bit is 1 if the mapping is not yet reflected in the page tables.

Most requests arriving at a memory controller carry mapping information, except for LLC dirty evictions. If the mapping of the evicted cacheline is not in the tag buffer, then the memory controller needs to probe the tags stored in the DRAM cache (cf. Section 4.1) to determine if this is a hit or miss. These probing operations consume DRAM cache bandwidth.

To reduce such tag probes for dirty eviction, we use otherwise empty entries in the tag buffer to hold mappings for pages cached in the LLC. On LLC misses that also miss in the tag buffer, we allocate an entry in the tag buffer for the page. While the valid bit is set to 1, indicating a useful mapping, the remap bit is set to 0, indicating the entry stores the same mapping as in the PTEs. Such entries can be replaced in the tag buffer without affecting correctness. We use an LRU replacement policy among entries with remap unset, which can be implemented by running the normal LRU algorithm with the remap bits as a mask.
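The sketch below illustrates this masked replacement within one tag buffer set in Python; it is a sketch under the assumptions above (the entry fields and the explicit LRU stamp are illustrative), not hardware RTL.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class TagBufferEntry:
        valid: bool = False
        remap: bool = False   # 1 => mapping not yet written back to the PTEs
        tag: int = 0          # physical page number
        cached: bool = False
        way: int = 0
        lru: int = 0          # smaller value = older access (illustrative LRU stamp)

    def pick_victim(entries: List[TagBufferEntry]) -> Optional[TagBufferEntry]:
        # Prefer invalid entries; otherwise run LRU only over entries whose remap
        # bit is unset, i.e., whose mapping the page tables already contain.
        for e in entries:
            if not e.valid:
                return e
        clean = [e for e in entries if not e.remap]
        if not clean:
            return None   # every entry holds a pending remap; wait for the PTE update
        return min(clean, key=lambda e: e.lru)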

3.4 Page Table and TLB Coherence

Figure 2: Tag buffer organization.

As the tag buffer fills, the mapping information stored in it needs to be migrated to the page table, to make space for future cache replacements. Since the tag buffer only contains the physical address of a page but page tables are indexed using virtual addresses, we need a mechanism to identify all the PTEs corresponding to a physical address.

TDC has proposed a hardware inverted page table to map a page’s physical address to its PTE [10]. This solution, however, cannot handle the page aliasing problem where multiple virtual addresses are mapped to the same physical address. To determine whether aliasing exists, internal OS structures (i.e., page descriptors) have to be accessed, which incurs significant extra overhead.

Figure 3: 4-way associative DRAM cache layout (not drawn to scale).

We observe, however, that a modern OS already has a reverse mapping mechanism to quickly identify the associated PTEs for a physical page, regardless of any aliasing. This functionality is necessary to implement page replacement between main memory and secondary storage (e.g., Disk or SSD) since reclaiming a main memory page frame requires accessing all the PTEs mapped to it. Reverse mapping can be implemented through an inverted page table (e.g., Ultra SPARC and Power PC [21]) or a special reverse mapping mechanism (e.g., Linux [22]). In Banshee, the PTE coherence scheme is implemented using reverse mapping.

When a tag buffer fills up to a pre-determined threshold, it sends an interrupt to one or more cores. The core(s) receiving the interrupt execute a software routine to update the PTEs of recently remapped pages. Specifically, all entries are read from the tag buffers in all memory controllers (which are memory mapped). For each tag buffer entry, the physical address is used to identify the corresponding PTEs through the reverse mapping mechanism. Then, the cached bit and way bits are updated for each PTE. During this process, the tag buffers can be locked so that no DRAM cache replacement happens, but the DRAMs can still be accessed and no program needs to be stopped.

After all tag buffer entries have been applied to the page table, the software routine issues a system wide TLB shootdown to enforce TLB coherence. After this, a message is sent to all tag buffers to clear the remap bits for all entries. Note that the mapping information can stay in the tag buffer to help reduce tag probing for dirty evictions (cf. Section 3.3).

Depending on a system’s software and hardware, the mechanism discussed above may take many cycles. However, since this cost only needs to be paid once a tag buffer is almost full, the cost of updating PTEs is amortized. Furthermore, as we will see in Section 4, remapping pages too often leads to poor performance due to high replacement traffic. Thus, our design tries to limit the frequency of page remapping, further reducing the cost of PTE updates.
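As a rough illustration of this flow, the Python sketch below strings together the steps of Sections 3.3–3.4. All names are placeholders (the routine receives the memory controllers, a reverse-mapping function, and a shootdown function as arguments); this is not an actual OS interface.

    def update_page_tables(memory_controllers, reverse_map, shootdown_all_tlbs):
        # 1. Collect pending (remap = 1) entries from every memory-mapped tag buffer.
        pending = [e for mc in memory_controllers
                     for e in mc.read_tag_buffer() if e.remap]

        # 2. Apply each new mapping to every PTE of the physical page; reverse
        #    mapping returns all of them, so page aliasing is handled naturally.
        for e in pending:
            for pte in reverse_map(e.tag):
                pte.cached, pte.way = e.cached, e.way

        # 3. One system-wide TLB shootdown makes all TLBs coherent.
        shootdown_all_tlbs()

        # 4. Clear the remap bits; the entries may stay in the tag buffer to help
        #    filter tag probes for LLC dirty evictions (Section 3.3).
        for mc in memory_controllers:
            mc.clear_remap_bits()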

4 Bandwidth-Efficient Cache Replacement

As discussed in Section 2.2, the cache replacement policy can significantly affect DRAM traffic. This is especially true for page-granularity DRAM cache designs due to the overfetching problem. In this section, we propose a frequency-based replacement (FBR) policy with sampling that achieves a good hit rate while minimizing DRAM traffic.

We first discuss the physical layout of the data and metadata in the DRAM cache in Section 4.1. We then describe Banshee’s cache replacement algorithm in Section 4.2.

4.1 DRAM Cache Layout

Many previously proposed tag-based DRAM cache schemes store the tags and data in the same DRAM row to exploit row buffer locality, since they always access tags along with data. Such an organization can be efficient for a cacheline granularity DRAM cache. For a page granularity DRAM cache, however, pages and tags do not align well within a DRAM row buffer [8], which incurs extra design complexity and inefficiency.

In Banshee, the tags are rarely accessed — only for cache replacement and LLC dirty evictions that miss in the tag buffer. Therefore, tags and data are stored separately for better alignment. Fig. 3 shows the layout of a data row and a tag row in a DRAM cache with row buffer size of 8 KB and page size of 4 KB. The tags and other metadata of each DRAM cache set take 32 bytes in a tag row. For a 4-way associative DRAM cache, each set contains 16 KB of data and 32 bytes of metadata, so the metadata overhead is only 0.2%.

Banshee tracks each page’s access frequency with a counter, stored in the metadata. We store counters not only for the pages in the DRAM cache, but also for some pages not in cache, which are candidates to bring into the cache. Intuitively, we want to cache pages that are most frequently accessed, and track pages that are less frequently accessed as candidates.

4.2 Bandwidth Aware Replacement Policy

A frequency-based replacement policy incurs DRAM cache traffic through reading and updating the frequency counters and through replacing data. In Section 4.2.1, we introduce a sampling-based counter maintenance scheme to reduce the counter traffic. In Section 4.2.2, we discuss the bandwidth aware replacement algorithm that attempts to minimize replacement traffic while maximizing hit rate.

4.2.1 Sampling-Based Counter Updates

In a standard frequency-based replacement policy [23, 24], each access increments the data’s frequency counter. We observe, however, that incrementing the counter for each access is not necessary. Instead, an access in Banshee only updates a page’s frequency counter with a certain sample rate. For a sample rate of 10%, for example, the frequency counters are accessed/updated only once for every 10 DRAM accesses. This reduces counter traffic by 10×. Furthermore, since sampling slows the incrementing of the counters, we can use fewer bits to represent each counter.

It may seem that updating counters based on sampling leads to inaccurate detection of “hot” pages. However, the vast majority of applications exhibit significant spatial locality. When a cacheline misses in the DRAM cache, other cachelines belonging to the same page are likely to be accessed soon as well. Each of these accesses to the same page has a chance to update the counter. In fact, without sampling, we find that counters quickly reach large values but only the high order bits are used for replacement decision. Sampling effectively discards the low-order bits of each counter, which have little useful information anyway.

We further observe that when the DRAM cache works well, i.e., it has low miss rate, replacement should be rare and the counters need not be frequently updated. Therefore, Banshee uses an adaptive sample rate which is the product of the cache miss rate and a constant rate (sampling coefficient).
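For illustration, with the default sampling coefficient of 10% (Section 5.1) and a hypothetical recent miss rate of 20%, the effective sample rate would be 0.10 × 0.20 = 0.02; counters are then read and written on only about 1 in 50 DRAM cache accesses, and the counter traffic shrinks further as the miss rate drops.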

4.2.2 Replacement Algorithm

DRAM cache replacement can be expensive, in terms of traffic, especially for page granularity designs. For each replacement, the memory controller transfers a whole page (assuming no footprint cache) from off-package DRAM to in-package DRAM. Even worse, if the evicted page is dirty, the memory controller must transfer it from in-package DRAM to off-package DRAM, doubling the traffic for the replacement. For cases where a page sees only a few accesses before being replaced, we may actually see higher off-package DRAM traffic (and worse performance) than if the DRAM cache was not present.

Frequency-based replacement does not inherently preclude this problem. In a pathological case for FBR, we may keep replacing the least frequently accessed page in the cache with a candidate whose counter has just exceeded it. When pages have similar counter values, a large number of such replacements can be triggered, thrashing the cache and wasting bandwidth.

Banshee solves this problem by only replacing a page when the candidate’s counter exceeds the victim’s counter by a certain threshold. This ensures that a page just evicted from the DRAM cache must be accessed many times (in expectation, roughly the threshold divided by the sample rate) before it can enter the cache again, thus preventing a page from entering and leaving the cache frequently. Note that reducing the frequency of replacement also increases the time between tag buffer overflows, indirectly reducing the overhead of updating PTEs.

Input: tag
# rand(): random number between 0 and 1.0
if rand() < recent_miss_rate * sampling_coeff then
    meta = dram_cache.loadMetadata(tag)
    if tag in meta then
        meta[tag].count += 1
        if tag in meta.candidates and meta[tag].count > meta.cached.minCount() + threshold then
            replace the cached page having the minimal counter with the accessed page
        end if
        if meta[tag].count == max_count then
            # counter overflow: halve every counter in the set
            for all t in meta.tags do
                meta[t].count /= 2
            end for
        end if
        dram_cache.storeMetadata(tag, meta)
    else
        victim = random page in meta.candidates
        if rand() < 1 / victim.count then
            victim.tag = tag
            victim.count = 1
            dram_cache.storeMetadata(tag, meta)
        end if
    end if
end if
Algorithm 1: Cache Replacement Algorithm

The complete cache replacement algorithm of Banshee is shown in Algorithm 1. For each request from the LLC, a random number is generated to determine whether the current access should be sampled. If it is not sampled, which is the common case, then the access is made to the proper DRAM (in- or off-package) directly. No metadata is accessed and no replacement happens.

If the current access is sampled, then the metadata for the corresponding set is loaded from the DRAM cache to the memory controller. If the currently accessed page exists in the metadata, its counter is incremented. Furthermore, if the current page is in the candidate part and its counter is greater than a cached page’s counter by a threshold, then cache replacement should happen. By default, the threshold is the product of the number of cachelines in a page and the sampling coefficient, divided by two (threshold = page_size × sampling_coeff / 2). Intuitively, this means replacement can happen only if the benefit of swapping the pages outweighs the cost of the replacement operation. If a counter saturates after being incremented, all counters in the metadata will be reduced by half using a shift operation in hardware.
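As an illustrative instance with the default parameters: a 4 KB page contains 64 cachelines and the sampling coefficient is 10%, so threshold = 64 × 0.1 / 2 ≈ 3, i.e., a candidate must lead the coldest cached page by roughly three sampled accesses before a swap is triggered.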

If the current page does not exist in the metadata, then a random page in the candidate part is selected as the victim. The current page can overtake the victim entry with a certain probability, which decreases as the victim’s counter gets larger. This way, it is less likely that a hot candidate page is evicted.

4.3 Supporting Large Pages

Large pages have been widely used to reduce TLB misses and therefore should be supported in DRAM caches. In Banshee, we manage large pages using PTEs and TLBs as with regular pages. We assume huge pages (1 GB) are managed purely in software and discuss the hardware support for large pages (2 MB) here.

In Banshee, the DRAM cache can be partitioned into two portions, for normal and large pages respectively. Partitioning can happen at context switch time, performed by the OS, which knows how many large pages each process is using. Partitioning can also be done dynamically using runtime statistics based on access counts and hit rates for different page sizes. Since most of our applications make either very heavy or very light use of large pages, partitioning would give either most or almost none of the cache, respectively, to large pages. We leave a thorough exploration of these partitioning policies for future work.

We force each page (regular or large) to map to a single MC (memory controller) to simplify the management of frequency counters and cache replacement. A memory request learns the size of the page being accessed from the TLB, and uses this information to determine which MC it should access. In order to figure out the MC mapping for LLC dirty evictions, a bit is appended to each on-chip cacheline to indicate its page size. When the OS reconfigures large pages, which happens very rarely [25], all lines within the affected pages should be flushed from the LLC and in-package DRAMs.
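The address-to-MC routing can be illustrated with the small Python sketch below. The modulo interleaving function and parameter values are assumptions for illustration only; the key property from the text is that every cacheline of a page, regular or large, maps to the same memory controller.

    def mc_index(phys_addr: int, is_large_page: bool, num_mcs: int = 4) -> int:
        # Select the memory controller from the page number so that all lines of a
        # 4 KB regular page or a 2 MB large page land on the same MC.
        page_shift = 21 if is_large_page else 12   # 2 MB vs. 4 KB pages
        return (phys_addr >> page_shift) % num_mcs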

In terms of the data and tag layout, a large page mapped to a particular way will span multiple cache sets taking the corresponding way in each set. One difference between regular and large pages is the cache replacement policy. Due to the higher cost of replacing a large page, the frequency counters need a greater threshold for replacement. We also reduce the sample rate of updating frequency counters to prevent counter overflow. Note that large pages do not work well for page-granularity schemes that replace on each DRAM cache miss. TDC, for example, disables caching of large pages.

5 Evaluation

We now evaluate the performance of Banshee and compare it to other DRAM cache designs. Section 5.1 discusses the methodology of the experiments. Section 5.2 and Section 5.3 compare the performance and DRAM traffic of different DRAM cache designs. Section 5.4 evaluates Banshee extensions, and Section 5.5 presents sensitivity studies.

5.1 Methodology

System Configuration
Frequency 2.7 GHz
Number of Cores N = 16
Core Model 4-Issue, Out-of-Order
Memory Subsystem
Cacheline Size 64 bytes
L1 I Cache 32 KB, 4-way
L1 D Cache 32 KB, 8-way
L2 Cache 128 KB, 8-way
Shared L3 Cache 8 MB, 16-way

Off-Package DRAM
Channel 1 channel
Bus Frequency 667 MHz (DDR 1333 MHz)
Bus Width 128 bits per channel
tCAS-tRCD-tRP-tRAS 10-10-10-24
In-Package DRAM
Capacity 1 GB
Channel 4 channels
Bus Frequency 667 MHz (DDR 1333 MHz)
Bus Width 128 bits per channel
tCAS-tRCD-tRP-tRAS 10-10-10-24
Table 2: System Configuration.
DRAM Cache and Tag
Ways 4
Page Size 4 KB
Tag Buffer 1 buffer per MC
8-way, 1024 entries
Flushed when 70% full
Tag Buffer Flush Overhead 20 us
TLB Shoot Down Overhead Initiator 4 us, slave 1 us
Cache Replacement Policy
Cache Set Metadata 4 cached pages
5 candidate pages
Frequency Counter 5 bits
Sampling Coefficient 10%
Table 3: Banshee Configuration.

We use ZSim [26] to simulate a multi-core processor whose configuration is shown in Table 2. The chip has one channel of off-package DRAM and four channels of in-package DRAM. We assume all the channels are the same to model the behavior of in-package DRAM [1, 3]. The maximum bandwidth that this configuration offers is 21 GB/s for off-package DRAM and 85 GB/s for in-package DRAM. In comparison, Intel’s Knights Landing [13] has roughly 4× the bandwidth and number of cores (72 cores, 90 GB/s off-package DRAM and 300+ GB/s in-package DRAM bandwidth), so we use the same bandwidth per core.
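For reference, these peak bandwidths follow from the bus parameters in Table 2: each channel transfers 128 bits (16 B) per beat at an effective 1333 MT/s, i.e., roughly 16 B × 1333 M/s ≈ 21.3 GB/s per channel, giving about 21 GB/s for the single off-package channel and about 85 GB/s for the four in-package channels combined.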

The default parameters of Banshee are shown in Table 3. Each PTE and TLB entry is extended with 3 bits for the mapping information. This is a small storage overhead (4%) for TLBs and zero storage overhead for PTEs since we are using otherwise unused bits. Each request in the memory hierarchy carries the 3 mapping bits. Each memory controller has an 8-way set associative tag buffer with 1024 entries, requiring only 5 KB storage per memory controller. Hardware triggers a “tag buffer full” interrupt when the buffer is 70% full. We assume the interrupt handler runs on a single randomly chosen core and takes 20 microseconds. For TLB shootdown, the initiating core pays an overhead of 4 microseconds and every other core pays 1 microsecond overhead [27].

The frequency counters are 5 bits long. The 32-byte per-set metadata holds information for 4 cached pages and 5 candidate pages. (With a 48-bit address space and the DRAM cache parameters, the tag size is 48 − 16 (2^16 sets) − 12 (page offset) = 20 bits. Each cached page has metadata of 20 + 5 + 1 + 1 = 27 bits and each candidate page has 25 bits of metadata; see Fig. 3.) The default sampling coefficient is 10%; the actual sample rate is this multiplied by the recent DRAM cache miss rate.
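As a sanity check on the 32-byte budget: 4 cached pages × 27 bits + 5 candidate pages × 25 bits = 108 + 125 = 233 bits, which fits within the 256 bits (32 B) of per-set metadata with a few bits to spare.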

5.1.1 Baselines

We compare Banshee to the following baselines.

No Cache: The system only contains off-chip DRAM.

Cache Only: The system only contains in-package DRAM with infinite capacity.

Alloy Cache [7]: A state-of-the-art cacheline-granularity design, described in Section 2. We also include the bandwidth efficient cache fills and the bandwidth efficient writeback probe optimizations from BEAR [9] to improve bandwidth efficiency. This includes a stochastic replacement mechanism that only does replacement with 10% probability. In some experiments, we show results from always replacing (Alloy 1), and replacing 10% of the time (Alloy 0.1).

Unison Cache [8]: A state-of-the-art page-granularity design, described in Section 2. We model an LRU replacement policy. We assume perfect way prediction and footprint prediction. For footprint prediction, we first profile each workload to collect the average number of blocks touched per page fill; the actual experiments charge this amount of replacement traffic. The footprint is managed at 4-line granularity. We assume the predictors incur no overhead.

Tagless DRAM Cache (TDC) [10]: A state-of-the-art page-granularity design described in Section 2. We modeled an idealized TDC configuration. Specifically, we assume a zero-overhead TLB coherence mechanism and ignore all the side effects of the mechanism (i.e., address consistency, page aliasing). We also implement a perfect footprint cache for TDC like we do for Unison Cache.

Name | Workloads
Mix1 | libq-mcf-soplex-milc-bwaves-lbm-omnetpp-gcc (×2)
Mix2 | libq-mcf-soplex-milc-lbm-omnetpp-gems-bzip2 (×2)
Mix3 | mcf-soplex-milc-bwaves-gcc-lbm-leslie-cactus (×2)
Table 4: Mixed SPEC workloads.

5.1.2 Benchmarks

We use SPEC CPU2006 [28] and graph analytics benchmarks [29]. Each experiment is simulated for 100 billion instructions or to completion, whichever happens first. By default, all benchmarks use small pages only.

We selected a subset of SPEC benchmarks that have large memory footprint and consider both homogeneous and heterogeneous workloads. For homogeneous workloads, each core in the simulated system executes one instance of a benchmark and all the instances run in parallel. Heterogeneous workloads model the multi-programming environment where the cores run a mixture of benchmarks. We use three randomly selected mixtures, shown in Table 4.

To represent throughput computing workloads, the target applications for the first systems employing in-package DRAM, we include multi-threaded graph analytics workloads. We use all graph workloads from [29], which span the spectrum of memory and compute intensity.

Figure 4: Speedup normalized to NoCache.
Figure 5: In-package DRAM traffic.
Figure 6: Off-package DRAM traffic.

Many benchmarks that we evaluated have very high memory bandwidth requirements. With the CacheOnly configuration, for example, 10 out of the 16 benchmarks have an average DRAM bandwidth consumption over 50 GB/s (bursts may exceed this). This bandwidth requirement exerts enough pressure on the in-package DRAM (with a maximum bandwidth of 85 GB/s) that bandwidth changes can significantly affect performance. Our memory-intensive benchmarks experience 2–4× higher memory access latency than compute-intensive benchmarks due to the bandwidth bottleneck.

5.2 Performance

Fig. 4 shows the speedup of different cache designs normalized to NoCache. The average bars indicate the geometric mean across all workloads. On average, Banshee provides a 68.9% speedup over Unison Cache, 26.1% over TDC, and 15.0% over Alloy Cache. The higher bandwidth efficiency is the major contributor to the performance improvement. Compared to Unison Cache and Alloy Cache, Banshee also reduces cache miss latency since the DRAM cache need not be probed to check presence.

Unison Cache and TDC have worse performance than the other designs on some benchmarks (e.g., omnetpp and milc) due to the lack of spatial locality. As a result, they incur a lot of DRAM traffic for cache replacement. Having a footprint predictor does not completely solve the problem since the footprint cannot be managed at cacheline granularity due to the storage overhead (we modeled 4-line granularity). Banshee also operates at page granularity, but its bandwidth-aware replacement policy offsets this inefficiency for these benchmarks.

On lbm, however, both Banshee and Alloy 0.1 give worse performance than other baselines. lbm has very good spatial locality on each page, but a page is only accessed a small number of times before it gets evicted. Alloy 1, Unison Cache and TDC have good performance on lbm since they do replacement for every DRAM cache miss, therefore exploiting more locality. Banshee and Alloy 0.1, in contrast, cannot leverage all the locality due to their selective data caching. One solution is to dynamically switch between different replacement policies based on a program’s access pattern. For example, some pre-determined sets in the cache may use different replacement policies and hardware selects the policy for the rest of the cache through set dueling [9, 30]. We leave exploration of this for future work.

The red dots in Fig. 4 show the misses per kilo-instruction (MPKI) for each DRAM cache scheme on different benchmarks. In general, Alloy Cache and Banshee achieve similar miss rates, while Unison Cache and TDC have very low miss rates since we assume perfect footprint prediction for them.

For some benchmarks (e.g., pagerank, omnetpp), Banshee performs even better than CacheOnly. This is because CacheOnly has no off-package DRAM, so its total available DRAM bandwidth is lower than that of Banshee, which has both in- and off-package DRAM. We discuss balancing DRAM bandwidth further in Section 5.4.2.

5.3 DRAM Traffic

Fig. 5 and Fig. 6 show the in- and off-package DRAM traffic respectively. Traffic is measured in bytes per instruction to convey memory intensity of a workload, in addition to comparative behavior of the cache designs.

In Fig. 5, HitData is the data transferred for DRAM cache hits, which is the only useful data transfer; everything else is overhead. For Alloy and Unison Cache, MissData is the speculative data loaded on a cache miss and Tag is the traffic for tag accesses. For Banshee, Tag also covers frequency counter accesses and tag probes for LLC dirty evictions. Replacement is the traffic for DRAM cache replacement.

Both Unison and Alloy Cache incur significant traffic for tag accesses. Alloy Cache also consumes considerable traffic for speculative loads at cache misses. Unison Cache has small speculative load traffic due to its low miss rate. Both schemes also require significant replacement traffic. Stochastic replacement can reduce Alloy Cache’s replacement traffic, but other overheads still remain.

TDC can eliminate the tag traffic by managing mapping information in PTE/TLBs. However, like Unison Cache, it still incurs significant traffic for DRAM cache replacement. For most benchmarks, the traffic difference between Unison and TDC is just the removal of Tag traffic. For some benchmarks (e.g., mcf, libquantum), TDC incurs less replacement traffic than Unison Cache because of its higher hit rate due to full associativity. On some other benchmarks (e.g., pagerank, tri_count), however, it incurs more traffic due to FIFO replacement. Overall, replacement traffic limits the performance of both Unison Cache and TDC.

Because of the bandwidth-aware replacement policy, Banshee provides significantly better efficiency in in-package DRAM (35.8% less traffic than the best baseline). Banshee achieves this without incurring extra off-package traffic, which is a necessity to provide better performance. On average, its off-package DRAM traffic is 3.1% lower than the best Alloy Cache scheme (Alloy 1), 42.4% lower than Unison Cache and 43.2% lower than TDC.

As mentioned earlier, graph codes are arguably more important for our modeled system. We note that for graph codes with high traffic (i.e., pagerank, tri_count and graph500), Banshee gives some of its largest gains, significantly reducing both in- and off-package DRAM traffic compared to all baseline schemes.

Figure 7: Performance (bars) and DRAM cache bandwidth (red dots) of different replacement policies on Banshee normalized to NoCache. Results averaged over all benchmarks.

5.4 Banshee Extensions

5.4.1 Supporting Large Pages

We evaluated the performance of large pages in Banshee for graph benchmarks. For simplicity, we assume all data resides on large (2 MB) pages. The sampling coefficient was chosen to be 0.001 and the replacement threshold was calculated accordingly (Section 4.2.2).

Our evaluation shows that with large pages, performance is on average 3.6% higher than the baseline Banshee with regular 4 KB pages. Here we assume perfect TLBs to only show the performance difference due to the DRAM subsystem. The gain comes from the more accurate hot page detection at larger page granularity as well as fewer frequency counter updates and PTE/TLB updates.

5.4.2 Balancing DRAM Bandwidth

Some related work [31, 32, 33] proposed to balance the accesses to in- and off-package DRAMs in order to maximize the overall bandwidth efficiency. These optimizations are orthogonal to Banshee and can be used on top of it.

We implemented the technique from BATMAN [31] which turns off parts of the in-package DRAM if it has too much traffic (i.e., over 80% of total DRAM traffic). On average, the optimization leads to 5% (up to 24%) performance improvement for Alloy Cache and 1% (up to 11%) performance improvement for Banshee. The gain is smaller in Banshee since it has less total bandwidth consumption. With bandwidth balancing, Banshee still outperforms Alloy Cache by 12.4%.

5.5 Sensitivity Study

In this section, we study the performance of Banshee with different design parameters.

5.5.1 DRAM Cache Replacement Policy

We show performance and DRAM cache bandwidth of different replacement policies in Fig. 7 to understand where the performance gain of Banshee comes from.

Banshee LRU uses an LRU policy similar to Unison Cache but does not use a footprint cache. It has poor performance and high bandwidth consumption due to frequent page replacement (on every miss).

Using FBR improves performance and bandwidth efficiency over LRU since only hot pages are cached. However, if the frequency counters are updated on every DRAM cache access (Banshee FBR no sample, similar to CHOP [17]), significant metadata traffic is incurred, leading to performance degradation. We conclude that both FBR and sampling-based counter management are needed to achieve good performance in Banshee.

5.5.2 Page Table Update Overhead

Update Cost (µs) | Avg. Perf. Loss | Max. Perf. Loss
10 | 0.11% | 0.76%
20 | 0.18% | 1.3%
40 | 0.31% | 2.4%
Table 5: Page table update overhead

One potential disadvantage of Banshee is the overhead of page table updates (cf. Section 3.4). However, this cost is paid only when the tag buffer fills up after many page remappings. Furthermore, our replacement policy intentionally slows remapping (cf. Section 4). On average, the page table update is triggered once every 14 milliseconds, which has low overhead in practice.

Table 5 shows the average and maximal performance degradation across our benchmarks, relative to free updates, for a range of update costs. The average overhead is less than 1%, and scales sublinearly with update cost. Note that doubling the tag buffer size has similar effect as reducing the page table update cost by half. Therefore, we do not study the sensitivity of tag buffer size here.

5.5.3 DRAM Cache Latency and Bandwidth

Figure 8: Sweeping DRAM cache (b) latency and (c) bandwidth. Default parameter setting highlighted on the x-axis.

Fig. 8 shows the performance (normalized to NoCache) of different DRAM cache schemes sweeping the DRAM cache latency and bandwidth. Each data point is the geometric mean performance over all benchmarks. The x-axis of each figure shows the latency and bandwidth of in-package DRAM relative to off-package DRAM. By default, we assume in-package DRAM has the same latency and bandwidth as off-package DRAM.

As the in-package DRAM’s latency decreases and bandwidth increases, performance of all DRAM cache schemes gets better. We observe that performance is more sensitive to bandwidth change than to zero-load latency change, since bandwidth is the bottleneck in these workloads. Although not shown in the figure, changing the core count in the system has a similar effect as changing DRAM cache bandwidth. Since Banshee’s performance gain over baselines is more significant when the bandwidth is more limited, we expect Banshee to perform better with more cores.

5.5.4 Sampling Coefficient

Figure 9: Sweeping the sampling coefficient in Banshee (default sampling coefficient = 0.1): (a) miss rate, (b) DRAM cache traffic.

Fig. 9 shows the DRAM cache miss rate and traffic breakdown sweeping the sampling coefficient in Banshee. As the sampling coefficient decreases, miss rate increases but only by a small amount.

Banshee incurs some traffic for loading and updating frequency counters, but this overhead becomes negligible for a sampling rate of 10%, which still provides a low miss rate.

5.5.5 Associativity

Number of Ways | 1 | 2 | 4 | 8
Miss Rate | 36.1% | 32.5% | 30.9% | 30.7%
Table 6: Cache miss rate vs. associativity in Banshee

In Table 6, we sweep the number of ways in Banshee and show the cache miss rate. Doubling the number of ways requires adding one more bit to each PTE, and doubles the per-set metadata. Higher associativity reduces the cache miss rate, though. Since we see quickly diminishing gains above four ways, we choose that as the default design point.

6 Related Work

Besides those discussed in Section 2, other DRAM cache designs have been proposed in the literature. PoM [11] and CAMEO [12] manage in- and off-package DRAM in different address spaces at cacheline granularity. Tag Tables [34] compresses the tag storage of Alloy Cache so that it fits in on-chip SRAM. Bi-Modal Cache [35] supports heterogeneous block sizes (cacheline and page) to get the best of both worlds. All these schemes focus on minimizing latency and incur significant traffic for tag lookups and/or cache replacement.

Similar to this paper, several other papers have proposed DRAM cache designs with bandwidth optimizations. CHOP [17] targets the off-package DRAM bandwidth bottleneck for page-granularity DRAM caches and uses frequency-based replacement instead of LRU. However, its scheme still incurs significant traffic for counter updates, whereas Banshee uses sampling-based counter management and bandwidth-aware replacement. Several other papers propose to reduce off-package DRAM traffic for page-granularity caches using a footprint cache [16, 8, 15]. As we showed, however, this alone cannot eliminate all unnecessary replacement traffic. That said, the footprint idea is orthogonal to Banshee and can therefore be incorporated into Banshee for even better performance.

BEAR [9] improves Alloy Cache’s DRAM cache bandwidth efficiency. Our implementation of Alloy Cache already includes some of the BEAR optimizations. These optimizations cannot eliminate all tag lookups, and as we have shown in Section 5.3, Banshee provides higher DRAM cache bandwidth efficiency.

Several other works have considered heterogeneous memory technologies beyond in-package DRAM. These include designs for hybrid DRAM and Phase Change Memory (PCM) [36, 37, 38] and a single DRAM chip with fast and slow portions [39, 40]. We believe the ideas in this paper can be applied to such heterogeneous memory systems, as well.

Among all previous designs, TDC [10] is the closest to Banshee. Both schemes use PTEs/TLBs to track data mapping at page granularity. The key innovation in Banshee is its bandwidth-efficient DRAM cache replacement policy and the associated design that enables it (lazy TLB coherence). Banshee significantly reduces both data and metadata replacement traffic through FBR and frequency counter sampling. This improves in- and off-package DRAM bandwidth efficiency, which leads to better performance.

The replacement policy used in Banshee, however, cannot be efficiently implemented on TDC due to its address consistency and TLB coherence problems. Since TDC uses different physical addresses for in- and off-package DRAMs, if a page is remapped after some of its cachelines have been cached, those previously loaded cachelines will have stale addresses. This makes the existing address consistency problem in TDC even worse.

7 Conclusion

In this paper, we proposed Banshee, a new DRAM cache design. Banshee aims at maximizing in- and off-package DRAM bandwidth efficiency and therefore performs better than previous latency-optimized DRAM cache designs. Banshee achieves this through a software/hardware co-design approach. Specifically, Banshee uses a new, lazy TLB coherence mechanism and a bandwidth-aware DRAM cache replacement policy. Our extensive experimental results show that Banshee provides significant improvement over state-of-the-art DRAM cache schemes.

References

  • [1] M. O’Connor, “Highlights of the high-bandwidth memory (HBM) standard,”
  • [2] AMD, “High-bandwidth memory (HBM).” https://www.amd.com/Documents/High-Bandwidth-Memory-HBM.pdf.
  • [3] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.
  • [4] “The road to the AMD “Fiji” GPU.” http://www.semicontaiwan.org/zh/sites/semicontaiwan.org/files/data15/docs/3_semicont_2015_-_black.pdf.
  • [5] “Nvlink, pascal and stacked memory: Feeding the appetite for big data.” https://devblogs.nvidia.com/parallelforall/nvlink-pascal-stacked-memory-feeding-appetite-big-data/.
  • [6] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked dram caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 454–464, ACM, 2011.
  • [7] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 235–246, IEEE, 2012.
  • [8] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 25–37, IEEE, 2014.
  • [9] C. Chou, A. Jaleel, and M. K. Qureshi, “BEAR: techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 198–210, ACM, 2015.
  • [10] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless dram cache,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 211–222, ACM, 2015.
  • [11] J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, “Transparent hardware management of stacked dram as part of memory,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 13–24, IEEE, 2014.
  • [12] C. Chou, A. Jaleel, and M. K. Qureshi, “Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE, 2014.
  • [13] “Intel®xeon phiTM processor “knights landing” architectural overview.” https://www.nersc.gov/assets/Uploads/KNL-ISC-2015-Workshop-Keynote.pdf.
  • [14] “Hybrid memory cube specification 2.1.” http://www.hybridmemorycube.org.
  • [15] H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless dram caches,” in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), 2016.
  • [16] D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), vol. 41, pp. 404–415, ACM, 2013.
  • [17] X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, Y. Solihin, and R. Balasubramonian, “CHOP: Adaptive filter-based dram caching for cmp server platforms,” in Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA), pp. 1–12, IEEE, 2010.
  • [18] M. R. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and G. H. Loh, “Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories,” in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 126–136, IEEE, 2015.
  • [19] J. B. Rothman and A. J. Smith, “Sector cache design and performance,” in Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 124–133, IEEE, 2000.
  • [20] S. Kumar and C. Wilkerson, “Exploiting spatial locality in data caches using spatial footprints,” in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA), vol. 26, pp. 357–368, IEEE Computer Society, 1998.
  • [21] W. Stallings, G. K. Paul, and M. M. Manna, Operating systems: internals and design principles, vol. 148. Prentice Hall Upper Saddle River, NJ, 1998.
  • [22] D. P. Bovet and M. Cesati, Understanding the Linux kernel. " O’Reilly Media, Inc.", 2005.
  • [23] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies,” IEEE transactions on Computers, no. 12, pp. 1352–1361, 2001.
  • [24] J. T. Robinson and M. V. Devarakonda, “Data cache management using frequency-based replacement,” in Proceedings SIGMETRICS, vol. 18, ACM, 1990.
  • [25] “Configuring huge pages in red hat enterprise linux 4 or 5.” https://goo.gl/lqB1uf.
  • [26] D. Sanchez and C. Kozyrakis, “ZSim: fast and accurate microarchitectural simulation of thousand-core systems,” in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), vol. 41, pp. 475–486, ACM, 2013.
  • [27] C. Villavieja, V. Karakostas, L. Vilanova, Y. Etsion, A. Ramirez, A. Mendelson, N. Navarro, A. Cristal, and O. S. Unsal, “Didi: Mitigating the performance impact of tlb shootdowns using a shared tlb directory,” in Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 340–349, IEEE, 2011.
  • [28] J. L. Henning, “Spec cpu2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
  • [29] X. Yu, C. Hughes, N. Satish, and S. Devadas, “IMP: Indirect memory prefetcher,” in Proceedings of the 48th International Symposium on Microarchitecture (MICRO), IEEE, 2015.
  • [30] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), vol. 35, pp. 381–391, ACM, 2007.
  • [31] C. Chou, A. Jaleel, and M. Qureshi, “Batman: Maximizing bandwidth utilization of hybrid memory systems,” Tech. Rep. TR-CARET-2015-01, 2015.
  • [32] N. Agarwal, D. Nellans, M. O’Connor, S. W. Keckler, and T. F. Wenisch, “Unlocking bandwidth for GPUs in cc-numa systems,” in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 354–365, IEEE, 2015.
  • [33] N. Agarwal, D. Nellans, M. Stephenson, M. O’Connor, and S. W. Keckler, “Page placement strategies for GPUs within heterogeneous memory systems,” in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), vol. 50, pp. 607–618, ACM, 2015.
  • [34] S. Franey and M. Lipasti, “Tag tables,” in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 514–525, IEEE, 2015.
  • [35] N. Gulur, M. Mehendale, R. Manikantan, and R. Govindarajan, “Bi-modal dram cache: Improving hit rate, hit latency and bandwidth,” in Proceedings of the 47th Annual International Symposium on Microarchitecture (MICRO), pp. 38–50, IEEE, 2014.
  • [36] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,” Computer Architecture Letters, vol. 11, no. 2, pp. 61–64, 2012.
  • [37] H. Yoon, J. Meza, R. Ausavarungnirun, R. A. Harding, and O. Mutlu, “Row buffer locality aware caching policies for hybrid memories,” in the 30th International Conference on Computer Design (ICCD), pp. 337–344, IEEE, 2012.
  • [38] G. Dhiman, R. Ayoub, and T. Rosing, “PDRAM: a hybrid pram and dram main memory system,” in the 46th Design Automation Conference (DAC), pp. 664–669, IEEE, 2009.
  • [39] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency dram: A low latency and low cost dram architecture,” in Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 615–626, IEEE, 2013.
  • [40] S.-L. Lu, Y.-C. Lin, and C.-L. Yang, “Improving dram latency with dynamic asymmetric subarray,” in Proceedings of the 48th International Symposium on Microarchitecture (MICRO), pp. 255–266, ACM, 2015.