Thread Batching for High-performance Energy-efficient GPU Memory Design

06/13/2019 ∙ by Bing Li, et al. ∙ Duke University, Florida International University, University of Pittsburgh

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Due to the rapid growth of thread-level parallelism in GPUs and the slowly improving peak memory bandwidth, memory has become a bottleneck of GPU performance and energy efficiency. In this work, we propose an integrated architectural scheme to optimize memory accesses and therefore boost the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor (SM) to dedicated memory banks. TEMP then dispatches the thread batch to an SM as a whole to ensure highly parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction for the evaluated GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads running on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.


1. Introduction

The use of Graphics Processing Units (GPUs) has been extended from fixed graphics acceleration to general-purpose computing, including image processing, computer vision, machine learning, and scientific computing. GPUs are widely employed in platforms ranging from embedded systems to high-performance computing systems (Shimpi et al., 2012).

GPUs heavily rely on massive threading to achieve high throughput. However, massive threading commonly incurs intensive memory accesses, which may limit the performance and energy efficiency of a GPU (Jog et al., 2013b) as a result of the high overhead of device memory accesses (in this work, we use device memory and memory interchangeably). Although large-capacity and low-overhead caches have been adopted by GPUs to alleviate the impact of inefficient memory accesses (Abdel-Majeed and Annavaram, 2013; Mao et al., 2014), the available cache capacity per thread is far below the demand of most GPU applications (Jia et al., 2012). The pressure on the device memory, i.e., DRAM, of GPUs thus remains severe.

Memory scheduling is one of the primary architectural techniques to improve memory efficiency, as it can optimize memory access parallelism and locality in multi-core systems (Mutlu and Moscibroda, 2007, 2008; Kim et al., 2010; Ebrahimi et al., 2011; Usui et al., 2016). However, the existing memory scheduling algorithms are usually associated with expensive implementations (Liu et al., 2012) and are also insufficient to handle the intensive memory accesses in GPUs (Yuan et al., 2009; Ausavarungnirun et al., 2012).

Memory partitioning (MP) based on operating system (OS) memory management is another viable approach to improve memory efficiency and reduce inter-thread memory interference. Memory partitioning generally divides memory resources and assigns them to threads, so that every thread accesses its own exclusive memory space (Mi et al., 2010; Liu et al., 2012; Jeong et al., [n. d.]; Xie et al., 2014; Suzuki et al., 2013). Memory partitioning is a promising way to improve memory efficiency in GPU systems for the following reasons: 1) the memory address space in a heterogeneous system is pageable, so memory pages can be allocated to GPU threads by the OS; and 2) the threads in a GPU are nearly homogeneous, so when they are evenly dispatched to stream multiprocessors (SMs), the fairness and parallelism of their memory accesses can be guaranteed. This assertion, however, may be invalid in other multi-core systems due to the disparity of memory bandwidth required by their threads (Xie et al., 2014).

Unfortunately, the existing memory partitioning mechanisms for multi-core systems cannot be directly applied to GPUs. Consider, for instance, memory bank partitioning (MBP) (Mi et al., 2010; Jeong et al., [n. d.]; Liu et al., 2012; Xie et al., 2014), which gives each thread exclusive access to one or more memory banks. MBP targets multi-program systems that run few parallel threads. In contrast, a GPU always runs massive threads, whose number is orders of magnitude larger than the number of available banks, so it is impossible to give every thread an exclusive memory bank. Moreover, all threads in a GPU application share a unified address space (NVIDIA, 2009); their memory accesses interweave and are difficult to separate using a memory partitioning technique.

To address the above problems, we propose an integrated solution to improve the performance and energy efficiency of GPU applications. The solution is composed of thread batch enabled memory partitioning (TEMP), which enhances memory access parallelism, and thread batch-aware scheduling (TBAS), which improves memory access locality. Specifically, TEMP assigns the majority of memory requests from the same SM to dedicated banks to ensure the parallelism of the threads' memory accesses. The thread blocks that share the same set of pages are grouped into a thread batch and dispatched to an SM as a whole. Meanwhile, by applying a page coloring mechanism, the accessed pages are mapped to the dedicated banks associated with that SM (Lin et al., 2008). In this way, TEMP minimizes the interference of memory accesses from different SMs and improves memory access parallelism. Moreover, TBAS prioritizes the execution of thread batches to preserve the locality of memory accesses: thread batches that access the same row in a bank are clustered and scheduled together. Accordingly, TBAS effectively alleviates the contention on memory controllers and the congestion on the reply network connecting the memory partitions to the SMs.

We compare TEMP and TBAS with several representative thread scheduling techniques, including the cache-conscious wavefront scheduler (CCWS) (Rogers et al., 2012), OWL (Jog et al., 2013a), and the bandwidth-aware page placement policy (BW-AWARE) (Agarwal et al., 2015). We set CCWS as our baseline and integrate OWL, BW-AWARE, and our techniques on top of CCWS. The benchmarks consist of not only GPU applications but also combined CPU-GPU applications. Experimental results show that after applying TEMP and TBAS, the GPU system achieves 10.3% performance improvement and 11.3% reduction of DRAM energy consumption for the evaluated GPU applications compared to the baseline. The results of the combined CPU-GPU workloads demonstrate that a simple yet effective solution can address the interference incurred by the CPU execution, ensuring high execution efficiency of GPU applications using TEMP and TBAS with negligible performance degradation on the CPU side.

The rest of this paper is organized as follows: Section 2 introduces the background of the heterogeneous CPU-GPU system and memory system; Section 3 and Section 4 describe the details of TEMP and TBAS, respectively; Section 5 summarizes our experimental setup; Section 6 presents the experimental results and related analyses; Section 7 discusses the related works; and Section 8 concludes our work.

2. Background

2.1. Heterogeneous CC-NUMA

Heterogeneous CPU-GPU integrated systems are evolving towards a unified memory address space (Chu, 2013). Because of their discrepant bandwidth requirements, it is anticipated that the GPU will still be physically attached to bandwidth-optimized DRAM, while the CPU is attached to capacity- and cost-optimized DRAM. The DRAMs of the GPU and the CPU share a unified memory address space (Agarwal et al., 2015). In such a heterogeneous cache-coherent non-uniform memory access (CC-NUMA) system, a computing unit has different access delays to local and remote memories even though it sees a unified address space. Fig. 1 shows a heterogeneous CC-NUMA system including several CPUs and a GPU. The system interconnection networks bridge the two memories and maintain coherence between the caches of the CPUs and the GPU.

Heterogeneous CC-NUMA allows better programmability and finer-grained memory management for the GPU, since the OS can allocate GPU pages in all memories. In this work, we use the default NUMA page placement policy in Linux, i.e., local, which places as many pages as possible in the local memory. By using the local policy, we avoid most bandwidth contention between the CPUs and the GPU in the heterogeneous CC-NUMA system.

GPU programming models such as CUDA (NVIDIA, [n. d.]a) and OpenCL (Inc., [n. d.]b) define the workload offloaded to a GPU as a kernel. A kernel is highly multi-threaded, and all of its threads are encapsulated in a grid. Within a grid, the threads are partitioned into up-to-three-dimensional thread blocks, each of which contains up to thousands of threads. During execution, each thread block is dispatched as a whole to an SM. Every SM holds a complete single instruction multiple data (SIMD) pipeline. Each thread block in the SM is further partitioned into fixed-size warps that are atomically scheduled by a warp scheduler and executed in SIMD fashion. The L2 caches of the CPUs and of the GPU are separate and placed in different memory partitions, each of which has its own memory channel. The on-chip caches, including the L1 data and instruction caches in the CPUs and the GPU, are connected to the L2 caches via a mesh network. In such a design, the GPU can use memory that allows page faults rather than being restricted to page-locked memory (Branover et al., 2012) and non-pageable memory (NVIDIA, 2009).
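As a concrete illustration of this thread hierarchy, the following minimal CUDA sketch launches a grid of 1D thread blocks; the kernel name, sizes, and the trivial computation are ours for illustration and are not taken from the paper.

```cuda
// A minimal CUDA illustration of the hierarchy described above: a kernel is
// launched as a grid of 1D thread blocks; each block is split by the SM into
// 32-thread warps executed in SIMD fashion.
#include <cuda_runtime.h>

__global__ void scale(float *data, float alpha, int n) {
  // Global thread index from the grid/block coordinates.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;  // each thread handles one element
}

int main() {
  const int n = 1 << 20;
  float *d = nullptr;
  cudaMalloc((void **)&d, n * sizeof(float));
  // 256 threads per block (8 warps); enough blocks to cover all n elements.
  dim3 block(256), grid((n + block.x - 1) / block.x);
  scale<<<grid, block>>>(d, 2.0f, n);
  cudaDeviceSynchronize();
  cudaFree(d);
  return 0;
}
```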

Figure 1. Organization of a heterogeneous CC-NUMA system.



2.2. DRAM Basics

Figure 2. (a) The organization of a DRAM channel; (b) The default DRAM address mapping used in this work; (c) The modified DRAM address mapping used for page coloring. The shadow segment is the color bits.


A modern JEDEC-compliant DDRx DRAM system consists of one or more channels, each of which has its own data, command, and address buses. Fig. 2(a) depicts the basic organization of a DRAM channel, which also has a memory controller (MC) to control the operations on the channel. A channel may include multiple DIMMs. Within each DIMM there are several ranks, each of which consists of multiple DRAM devices. In DDR3, a DRAM device contains eight banks. The data of each bank are always loaded into its private row buffer before being accessed.

DRAM address mapping complies with the DRAM organization. The address mapping scheme in Fig. 2(b) (Bakhoda et al., 2009) is the baseline DRAM address mapping used in our heterogeneous architecture. The address mapping scheme in Fig. 2(c) is used for the page coloring mechanism in our work. If the number of page offset bits is not greater than the sum of the column and byte offset bits, page coloring can map a GPU page to an arbitrary channel, rank, bank, or row within a bank.
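To make the color-bit idea concrete, here is a minimal host-side sketch of extracting the color of a physical page under a mapping in the spirit of Fig. 2(c); the field widths and helper names are illustrative assumptions, not the exact bit layout of the figure.

```cuda
// A minimal sketch of page-color extraction, assuming illustrative field widths.
#include <cstdint>
#include <cstdio>

constexpr int kByteOffsetBits = 2;   // assumed bus-width offset
constexpr int kColumnBits     = 10;  // assumed column index width
constexpr int kBankBits       = 3;   // 8 banks per DDR3 device
constexpr int kChannelBits    = 1;   // 2 GDDR5 channels (Table 4)

// With the modified mapping, the bank/channel ("color") bits sit just above
// the column + byte-offset bits, so they fall inside the physical page number
// whenever page_offset_bits <= column_bits + byte_offset_bits (12 = 10 + 2
// for a 4KB page here).
uint64_t color_of(uint64_t phys_addr) {
  uint64_t above_column = phys_addr >> (kByteOffsetBits + kColumnBits);
  return above_column & ((1ull << (kBankBits + kChannelBits)) - 1);
}

// The OS can steer a page to the banks bound to an SM by picking a physical
// frame whose color matches that SM's color.
bool frame_matches_sm(uint64_t phys_addr, uint64_t sm_color) {
  return color_of(phys_addr) == sm_color;
}

int main() {
  uint64_t frame = 0x12345000;  // hypothetical 4KB-aligned physical frame
  printf("color = %llu, belongs to SM color 0? %d\n",
         (unsigned long long)color_of(frame), frame_matches_sm(frame, 0));
  return 0;
}
```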

Memory usage efficiency is mainly determined by bank-level parallelism (BLP) (Mutlu and Moscibroda, 2008) and row locality, measured by the row buffer hit rate (RBHR). All the banks in a DRAM device can be accessed concurrently, as each bank has its own address decoder and sensing logic; however, only one bank can put/receive data on/from the shared bus at a time. All memory requests (reads and writes) go through the row buffer. Memory access latency and energy are reduced when an access hits in the row buffer, as no row activation is needed. In multi-core systems, a variety of memory schedulers (Mutlu and Moscibroda, 2007, 2008) have been proposed to improve BLP and row locality as well as to maximize access fairness. However, these designs are generally insufficient to handle the massively parallel memory requests of GPUs (Yuan et al., 2009). In this work, we propose TEMP and TBAS to improve DRAM efficiency in GPUs by minimizing the inter-SM interference of memory accesses, which is the root cause of low BLP and low row locality of DRAM accesses (Jeong et al., [n. d.]).

3. Thread Batch Enabled Memory Partitioning (TEMP)

A naïve GPU memory partitioning may bind each SM to one or more banks, so that all the pages accessed by a thread block can be placed in the banks bound to the SM where the thread block executes. Ideally, if there is no shared page among different thread blocks, the banks can be exclusively accessed by the associated SM. Unfortunately, page sharing between thread blocks commonly exists in GPU kernels, so the simple page placement mentioned above cannot separate the memory access streams originating from different SMs. To address this issue, we propose TEMP, which identifies and groups the thread blocks sharing pages (Section 3.1) and dispatches them to the same SM (Section 3.2) so as to minimize the inter-SM interference of memory accesses. A group of thread blocks sharing pages is termed a thread batch. The rest of this section details the design and implementation of TEMP.

3.1. Thread Batch Formation

By profiling the prevalent GPU benchmark suites, we observed two major types of thread-data mappings with page sharing patterns among thread blocks. (In this work, we only consider kernels constructed with 1D and 2D thread blocks/grids, because none of the profiled benchmarks employs a 3D thread block or grid; see Table 1.) In the first type of thread-data mapping, the data accessed by each thread block are clustered over a sequential address space. Fig. 3 shows the skeleton of the Mapper kernel in the MapReduce engine of Mars (He et al., 2008). This kernel employs fixed 1D thread blocks and scatters them over a 1D or 2D grid. Generally, consecutive thread blocks sequentially access the 1D vector inputKeys, and each thread block accesses a linear address space ranging from recordBase to terminate within inputKeys.

Figure 3. Annotated code snippet of Mapper kernel in Mars library.

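Since Fig. 3 is reproduced as an image, the following hedged CUDA sketch (not the actual Mars source) illustrates the access pattern it describes: each 1D thread block covers one contiguous slice of inputKeys. The names recordsPerBlock and recordSize, as well as the empty loop body, are illustrative assumptions.

```cuda
// A hedged sketch of the first thread-data mapping type: consecutive blocks
// cover consecutive linear ranges of inputKeys.
__global__ void mapper_like(const char *inputKeys, int recordsPerBlock,
                            int recordSize, int numRecords) {
  // Linearize the (possibly 2D) grid so that consecutive blocks cover
  // consecutive address ranges of inputKeys.
  int block = blockIdx.y * gridDim.x + blockIdx.x;
  int recordBase = block * recordsPerBlock;       // first record of this block
  int terminate  = recordBase + recordsPerBlock;  // one past the last record
  if (terminate > numRecords) terminate = numRecords;

  // Threads of the block touch only records in [recordBase, terminate), so the
  // block's memory footprint is a single linear span of pages.
  for (int r = recordBase + threadIdx.x; r < terminate; r += blockDim.x) {
    const char *record = inputKeys + (size_t)r * recordSize;
    // ... map(record) would emit key/value pairs here ...
    (void)record;
  }
}
```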

Figure 4. (a) A 2D grid – 1D thread blocks; (b) The accessed data matrix; The memory access footprint when the page size is (c) a matrix row, (d) two matrix rows without or (e) with thread batching.


Fig. 4 simplifies and visualizes the first type of thread-data mapping. In this example, we assume the grid of the kernel contains four thread blocks, each of which consists of four threads. The 1D thread blocks are arranged in a 2D grid, and their accessed data matrix is shown in Fig. 4(b). The first row of the data matrix is accessed by thread block (0,0,0), the second row is accessed by thread block (1,0,0), and so on. If the row address of the matrix aligns to a page, SM-level page coloring can perfectly place the pages accessed by an SM into its bound banks, as depicted in Fig. 4(c); here a page is equal to a matrix row. However, if a page is composed of multiple matrix rows, say two, conventional thread block dispatching, which interleaves thread blocks across SMs, will generate interwoven memory accesses, as shown in Fig. 4(d). To handle this situation, we can pack the thread blocks accessing the same set of pages into a thread batch and then dispatch the thread batch as a whole to an SM. For the example shown in Fig. 4(d), the four thread blocks can be grouped into two thread batches, each of which goes to an SM. The memory accesses to banks 0 and 1 are then successfully separated, as illustrated in Fig. 4(e).

Figure 5. Annotated code snippet of cenergy kernel in CUTCP.


Figure 6. (a) A 2D grid – 2D thread blocks; (b) The accessed data matrix; (c) The memory access footprint with thread batching.


The second type of thread-data mapping is one in which the data accessed by consecutive thread blocks are interleaved over a linear address space. Fig. 5 shows the code snippet of the cenergy kernel in the CUTCP benchmark (Stratton et al., 2012). CUTCP computes the coulombic potential at a molecular grid energygrid. A point in energygrid is indexed by xindex and yindex, which are generated from a thread's indexes. All threads form a 2D grid that is further tiled with 2D thread blocks. Fig. 6 demonstrates a simplified thread-data mapping in this 2D grid. The thread organization and the accessed data matrix are shown in Fig. 6(a) and (b), respectively. Here, we again assume one grid has four thread blocks, and each thread block has four threads. In this example, every thread block has two active dimensions (the x-axis and y-axis). Each matrix row is accessed by two thread blocks, while each thread block accesses two rows. In such a situation, consecutive thread blocks are likely to access the same set of pages. Similarly, we can pack the thread blocks sharing the same set of pages into one thread batch. Fig. 6(c) gives a thread batching example where every matrix row in Fig. 6(b) exactly forms one page. Thread blocks (0,0,0) and (1,0,0) share pages 0 and 1, while thread blocks (0,1,0) and (1,1,0) share pages 2 and 3. Consequently, we can group thread blocks (0,0,0) and (1,0,0) into thread batch 0 and thread blocks (0,1,0) and (1,1,0) into thread batch 1. By allocating pages 0 and 1 to bank 0 and pages 2 and 3 to bank 1, the memory accesses from SM 0 to bank 0 and from SM 1 to bank 1 are separated.
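Since Fig. 5 is likewise an image, the following hedged CUDA sketch (not the actual CUTCP source) shows the interleaved mapping: 2D thread blocks tiled over a 2D grid, where blocks that are adjacent along the x-dimension touch the same rows of energygrid and therefore the same pages. The parameters gridWidth and gridHeight and the omitted potential computation are illustrative assumptions.

```cuda
// A hedged sketch of the second thread-data mapping type: consecutive 2D
// blocks interleave over the rows of energygrid.
__global__ void cenergy_like(float *energygrid, int gridWidth, int gridHeight) {
  int xindex = blockIdx.x * blockDim.x + threadIdx.x;  // column in energygrid
  int yindex = blockIdx.y * blockDim.y + threadIdx.y;  // row in energygrid
  if (xindex >= gridWidth || yindex >= gridHeight) return;

  float energy = 0.0f;
  // ... accumulate coulombic potential contributions into `energy` ...

  // Blocks (0,0) and (1,0) write to the same rows [0, blockDim.y), so a page
  // spanning one or more matrix rows is shared by consecutive blocks in x.
  energygrid[(size_t)yindex * gridWidth + xindex] += energy;
}
```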

These two major thread-data mapping scenarios indicate that consecutive thread blocks may share pages. Accordingly, we introduce the thread block stride to indicate the number of consecutive thread blocks that belong to the same thread batch. In the examples in Fig. 4(c) and Fig. 6(c), the thread block stride is 1 and 2, respectively.

To find the thread block stride of a GPU kernel, we profile the kernel for a given page size at compile time, when the programmer determines the thread hierarchy and how the threads access the data matrices. At the profiling stage, the start addresses of the data matrices are set to zero. During dynamic memory allocation, the start address of a data matrix is aligned to the beginning of a page to guarantee that the thread block stride found at compile time remains valid. Fig. 7 shows the optimal thread block stride of several GPU applications, where the optimal thread block stride is the stride that suppresses the most cross-batch page sharing. Here, the page size is 4KB, which is supported by most computer systems. 89% of the kernels achieve the minimum inter-thread-batch page sharing through batch formation with a fixed thread block stride, and in another 6% of the kernels the batch formation can be realized using modulation. Some kernels in MUM and LBM cannot be fitted with a formula for batch formation.
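The following hedged host-side sketch shows how a fixed thread block stride turns into thread batches and per-SM assignments (serial dispatching is detailed in Section 3.2); the function and parameter names are illustrative, and the workload-balancing rule is a simple assumption rather than the paper's exact algorithm.

```cuda
// A hedged sketch of batch formation with a fixed thread block stride:
// batch id = block id / stride; consecutive batches go to the same SM.
#include <cstdio>
#include <vector>

std::vector<std::vector<int>> form_batches(int numBlocks, int stride, int numSMs) {
  std::vector<std::vector<int>> blocksPerSM(numSMs);
  int numBatches   = (numBlocks + stride - 1) / stride;
  int batchesPerSM = (numBatches + numSMs - 1) / numSMs;  // balance the workload
  for (int b = 0; b < numBlocks; ++b) {
    int batch = b / stride;            // blocks sharing pages fall in one batch
    int sm    = batch / batchesPerSM;  // consecutive batches go to the same SM
    blocksPerSM[sm].push_back(b);
  }
  return blocksPerSM;
}

int main() {
  // 4 thread blocks, stride 2, 2 SMs: reproduces the grouping of Fig. 4(e).
  auto assignment = form_batches(4, 2, 2);
  for (int sm = 0; sm < (int)assignment.size(); ++sm) {
    printf("SM%d:", sm);
    for (int b : assignment[sm]) printf(" block %d", b);
    printf("\n");
  }
  return 0;
}
```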

Figure 7. The distribution of thread block stride.


The static compile-time profiling is sub-optimal since it cannot proactively remove cross-batch page sharing. For example, the last thread block in a thread batch may share a page with the first thread block of the following thread batch if the thread batches are formed with a fixed thread block stride. In the next section, we introduce a simple dynamic hardware approach that supports thread batching better than static profiling.

We further analyze the GPU applications that form thread batches with a fixed thread block stride. The accumulated percentage of pages shared by different numbers of consecutive thread batches is shown in Fig. 8; the horizontal axis shows the maximal distance of the shared pages among the thread batches. Among all the accessed pages, nearly 75% on average are exclusively accessed by a single thread batch and 22% are accessed by two consecutive thread batches; these two cases dominate the page access patterns of the thread batches. More than 2% of pages are globally shared among all the thread batches in a kernel, such as program text pages.

Figure 8. The accumulated percentage of cross-batch sharing between thread batches.


3.2. Serial Thread Block Dispatching

Given that thread batching and cross-batch page sharing dominate the GPU applications, we propose serial thread block dispatching: consecutive thread blocks, which are very likely enclosed by consecutive thread batches, are emitted to the same SM. As such, most thread batches are formed implicitly by the serial thread block dispatching, and most cross-batch page sharing is constrained within an SM. Cross-batch page sharing now occurs only when some thread blocks of a thread batch are distributed to multiple SMs, which can happen only for the first and last thread batches assigned to an SM.

Traditional interleaved thread block dispatching, e.g., the GigaThread engine in NVIDIA GPUs (NVIDIA, 2009), generates and dispatches a new thread block to an SM once the SM has an idle slot. Typically, the dispatching unit only passes the id of the new thread block to the SM, and the SM constructs the whole thread block according to the received thread block id. The dispatching unit generates the thread block ids sequentially, but the ids are dispatched to SMs randomly. To implement deterministic, serial thread block dispatching, we introduce a dispatch queue in each SM. The contents of the dispatch queue, i.e., the thread block ids, are inserted before a kernel is launched. Each SM receives a similar number of thread block ids for workload balance, which can be determined at compile time. During kernel execution, thread block ids are popped from the dispatch queue and emitted to the associated SM.

Compared to traditional thread block dispatching, serial thread block dispatching avoids stalling the launch of thread blocks: an SM can always pop a thread block id from its dispatch queue once it has an idle slot. The implementation of the dispatch queue can be highly efficient since each SM only needs two extra registers to record the head and the tail of its thread block ids. The head register increments by one each time a new thread block id (the head register value itself) is popped, and dispatching ends when the head register reaches the tail thread block id stored in the second register. Thus, serial thread block dispatching incurs marginal run-time and hardware overheads.
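A minimal sketch of this two-register dispatch "queue" is given below; the struct and method names are illustrative, and the sketch models the behavior rather than the hardware implementation.

```cuda
// A hedged behavioral sketch of the per-SM dispatch queue: two registers
// (head, tail) delimit this SM's contiguous range of thread block ids.
struct DispatchQueue {
  int head;  // next thread block id to issue to this SM
  int tail;  // one past the last thread block id assigned to this SM

  // Filled before kernel launch with this SM's contiguous range of block ids.
  void init(int first_id, int last_id_exclusive) {
    head = first_id;
    tail = last_id_exclusive;
  }

  bool empty() const { return head >= tail; }

  // Popping is just an increment: called when the SM reports an idle slot.
  int pop() { return head++; }
};
```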

4. Thread Batch-aware Scheduling (TBAS)

Figure 9. (a) The thread organization, data matrix, thread batches, and memory access footprint of a kernel running on SM0; (b) The execution sequence of warps generated by CCWS; (c)–(e) The progressive improvement of TBAS.


TEMP constrains the memory accesses from an SM to the associated memory banks, offering an opportunity to improve intra-bank/row locality by scheduling the execution of threads. Accordingly, we propose TBAS, which can be explained using the example in Fig. 9.

Fig. 9(a) presents the thread organization and the data matrix of the example. In this example, the GPU has only one SM (SM0), which is associated with its own DRAM bank. Four thread batches, each of which consists of only one thread block, are formed and dispatched to SM0. Every thread batch exclusively accesses its own page, and the page layout of SM0's bank is also shown in Fig. 9(a). We assume two pages are included in one row of the bank (generally, the row size of a DRAM is a multiple of the smallest page size the OS can support). Every two threads in a thread block form a warp. Since there are four threads in one thread block, each thread block has two warps, and a total of eight warps (from the four thread blocks) are running on SM0.

Fig. 9(b) shows the execution on SM0 with a cache-conscious wavefront scheduler (CCWS) (Rogers et al., 2012). CCWS was designed to improve the L1 cache locality in GPUs. It captures the intra-warp locality and decreases L1 thrashing by limiting the number of active warps in an SM based on L1 eviction information. Typically, CCWS keeps only a subset of warps running in an SM and throttles the rest of the warps, which stay pending in the same SM, when cache thrashing is detected. Once a warp in the running set encounters a stall, it is demoted to the pending set, and another warp in the pending set is promoted to the running set. Here, we assume that a running set includes two warps. It is very likely that the two warps in a running set come from different thread batches; hence, they may compete for different rows in the bank and degrade the row locality.

A better scheduling policy can improve the row locality, as depicted in Fig. 9(c): the running set gathers the active warps of the same thread batch, as they commonly access the same page (i.e., the same row). If the thread batch in the running set does not have sufficient active warps, all the warps of this thread batch are demoted to the pending set, and a new thread batch that has sufficient active warps is promoted to the running set.

In such a design, promoting warps may harm the row locality when the rows accessed by the previously active warps and the newly promoted ones differ. Hence, as shown in Fig. 9(d), a better promotion scheme promotes the thread batch that is the successor of the demoted thread batch, e.g., promoting (1,0,0) (or (1,1,0)) after demoting (0,0,0) (or (0,1,0)). Due to the page allocation mechanism, the adjacent thread batch is most likely to access the same row in the bank.

The above sequential thread batch switching often results in a round-robin execution sequence, potentially incurring a burst of memory accesses within a short time. As illustrated in Fig. 9(d), all memory accesses are evoked in the first four scheduling cycles. Two situations may harm the scheduling efficiency: 1) a thread batch demoted because of a long-latency operation could access the same page again in the near future, but it may not be scheduled again in time; and 2) when thread batches are continuously promoted to the running set, the generated memory-access burst is coupled with the lost locality, and the prolonged queuing delay in the memory controllers may overwhelm the reply network connecting the memory controllers and the SMs (Bakhoda et al., 2010).

To overcome the above drawbacks, we assign higher promotion priority to older thread batches in the pending set. We assume the priority of the thread batches in Fig. 9(a) descends from left to right and then from top to bottom. Fig. 9(e) shows the scheduling sequence of the thread batches under the proposed promotion priority. The improvement of row locality, and especially the reduction of memory access bursts, leads to a significant reduction in average memory access latency. We name the scheduling method corresponding to the example in Fig. 9(e) TBAS.
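A hedged sketch of this policy is shown below: one thread batch occupies the running set at a time, an exhausted batch is demoted, and the oldest pending batch that still has active warps is promoted next. The data structures, the single-batch running set, and the assumption that active_warps is refreshed externally as stalls resolve are simplifications for illustration, not the hardware design.

```cuda
// A hedged behavioral sketch of the TBAS promotion policy in Fig. 9(e).
#include <deque>
#include <optional>

struct ThreadBatch {
  int id;
  int active_warps;  // warps ready to issue; refreshed externally as stalls resolve
};

struct TBASScheduler {
  std::optional<ThreadBatch> running;  // running set (one batch in this sketch)
  std::deque<ThreadBatch> pending;     // ordered by age: front = oldest

  // Returns the batch whose warps are issued in the next scheduling cycle.
  std::optional<ThreadBatch> schedule() {
    if (!running || running->active_warps == 0) {
      if (running) pending.push_back(*running);  // demote the exhausted batch
      running.reset();
      // Promote the oldest pending batch that still has active warps; older
      // adjacent batches tend to map to the same DRAM row, preserving locality.
      for (auto it = pending.begin(); it != pending.end(); ++it) {
        if (it->active_warps > 0) {
          running = *it;
          pending.erase(it);
          break;
        }
      }
    }
    return running;
  }
};
```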

Besides maintaining intra-/inter-thread-batch row locality and alleviating congestion on the reply network, TBAS also reduces the stretch of the memory access footprint by limiting the number of active thread batches in a particular time window. Such a limitation on thread-level parallelism brings an implicit positive effect on cache locality (Rogers et al., 2012), as we explain in Section 6.1.

The hardware overhead of TBAS is similar to that of CCWS except for the promotion priority arbitrator. Fortunately, the number of concurrent thread batches in an SM is usually small: an SM of a Fermi GPU, for example, supports only eight concurrent thread blocks (i.e., at most eight thread batches). Therefore, the implementation overhead of the arbitrator is negligible.

5. Experiment Methodology

Application | Abbreviation | Category | Thread organization (grid dim, block dim) | L2 MPKI
InvertedIndex | II | C1 | (1D, 1D) | 45.23
Kmeans Clustering | KM | C1 | (1D/2D, 1D) | 11.91
PageViewCount | PVC | C1 | (1D, 1D) | 5.52
PageViewRank | PVR | C1 | (1D, 1D) | 2.10
Fast Walsh Transform | FWT | C1 | (1D, 1D) | 1.01
Seven Point Stencil | STEN | C1 | (2D, 2D) | 0.85
Back Propagation | BP | C1 | (1D, 2D) | 0.75
Merge Sort | MS | C1 | (1D, 1D) | 0.53
Discrete Cosine Transform | DCT | C1 | (2D, 2D) | 0.37
Scalar Products | SP | C1 | (1D, 1D) | 0.17
Reduction | RD | C1 | (1D, 1D) | 0.04
Coulombic Potential | CUTCP | C1 | (2D, 2D) | 0.03
Neural Network | NN | C2 | (1D/2D, 1D) | 1.21
Hotspot | HOT | C2 | (2D, 2D) | 0.75
Sorting Networks | SN | C2 | (1D, 1D) | 0.27
Angular Correlation | TPACF | C3 | (1D, 1D) | 0.89
MUMmerGPU | MUM | C3 | (1D, 1D) | 5.85
Lattice-Boltzmann Method | LBM | C3 | (2D, 1D) | 5.82
CFD Solver | CFD | C3 | (1D, 1D) | 0.85
LIBOR Monte Carlo | LIB | C3 | (1D, 1D) | 0.62

Table 1. The characteristics of GPU applications: the category each application belongs to; the thread organization labeled as (grid dimension, block dimension); and the L2 cache MPKI.

Application | MPKI
mcf | 52.28
omnetpp | 34.67
xalancbmk | 27.52
lbm | 20.23
bzip2 | 4.14
h264ref | 1.95
gcc | 0.58
perlbench | 0.21

Table 2. The characteristics of CPU applications.

Workload | Type | Applications
WL0 NN-N perlbench, bzip2, FWT
WL1 NN-N gcc, h264ref, BP
WL2 NN-I gcc, bzip2, II
WL3 NN-I perlbench, h264ref, KM
WL4 IN-N omnetpp, gcc, STEN
WL5 IN-I xalancbmk, h264ref, PVC
WL6 II-N mcf, lbm, DCT
WL7 II-N omnetpp, xalancbmk, FWT
WL8 II-I mcf, xalancbmk, PVR
WL9 II-I omnetpp, lbm, II
WL10 IN-N lbm, bzip2, NN
WL11 IN-I mcf, perlbench, MUM
Table 3. The characteristics of heterogeneous workloads. Each workload type is denoted by the types of the combined applications; e.g., NN-I means two memory non-intensive CPU applications run with a memory-intensive GPU application.

5.1. Benchmark

We adopt a diverse set of GPU applications from (NVIDIA, [n. d.]b; Bakhoda et al., 2009; Che et al., 2009; Jog et al., 2013a; Stratton et al., 2012) as the benchmark for our evaluations. Most of the applications are fully simulated, except for the applications from (Jog et al., 2013a), for which only the first two billion instructions are simulated. The detailed characteristics of each application are summarized in Table 1. All GPU applications are profiled to generate the optimal thread batches before execution.

We combine eight CPU applications with the GPU applications to construct heterogeneous workloads for the evaluation. The CPU applications are from SPEC CPU 2006, as shown in Table 2. PinPoint (Luk et al., 2005) is used to extract the execution phases of all CPU applications. The CPU applications are divided into two types: memory intensive, where the L2 cache misses per kilo instructions (MPKI) is higher than 20, and memory non-intensive, where the L2 cache MPKI is lower than 20. The GPU applications are also classified into two types based on L2 cache MPKI: memory intensive (MPKI > 2) and non-intensive (MPKI ≤ 2). Although the L2 cache MPKI of most GPU applications is lower than that of the CPU applications, within an arbitrary time window a GPU application can generate two orders of magnitude more L2 cache misses than a CPU application due to its high instruction throughput (i.e., IPC). Moreover, we group the GPU applications into three categories, C1–C3, according to their sensitivity to TEMP+TBAS (explained in Section 6.1).

We permute the combinations of different types of CPU and GPU applications to create twelve heterogeneous workloads. Each workload consists of two CPU applications and one GPU application, as summarized in Table 3. We construct ten workloads (WL0–WL9 in Table 3) whose GPU applications are picked from C1. Half of the GPU applications in WL0–WL9 are memory intensive, while the rest are memory non-intensive. For the CPU applications in WL0–WL9, we have three combination types of the dual applications (i.e., NN, IN, and II). These ten heterogeneous workloads cover most cases in which the proposed schemes may behave differently. We also construct two extra workloads, WL10 and WL11, whose GPU applications come from C2 and C3, respectively.

CPU
  Number of Cores: 2
  Execution: 3 GHz, OOO, 4-issue, 256-entry ROB
  L1 Data Cache: 32KB, 4-way, 2-cycle hit, write back, 64B line
  L2 Cache: 2MB, 8-way, write back, 64B line
GPU
  Number of SMs: 8
  SM Clock: 600 MHz
  SIMD Width: 16
  L1 Data Cache: 32KB, 4-way, write through, 128B line
  Warp Size: 32
  Max Number of Threads: 1536/SM
  Max Thread Blocks: 8/SM
  Scheduler: CCWS (Rogers et al., 2012), OWL (Jog et al., 2013a), TBAS
  TLB: 64-entry L1, 8KB page walk cache, 512-entry shared L2
  L2 Cache: 128B line, 8-way associative, 2 banks, 512KB/bank, 1MB total
Shared resources
  # of Memory Channels: 2 for GDDR5, 1 for DDR3
  Memory Controller (MC): FR-FCFS (Owens et al., 2000), open-page, 64-entry request queue/MC
  Interconnection: 2D mesh
  GDDR5: 16 banks, timing from (Micron, [n. d.]b)
  DDR3: 8 banks, timing from (Inc., [n. d.]a)

Table 4. Simulation configuration.

5.2. Simulation Platform

Since a CPU-GPU CC-NUMA system has not yet been shipped by any industrial vendor, we simulate a GPU system attached to a heterogeneous GDDR5-DDR3 DRAM subsystem. Our system simulation is performed on gem5-gpu (Power et al., 2015), and its configuration is listed in Table 4.

The GPU subsystem includes 8 SMs. Each SM has similar computational capability to an SM in Fermi, with the clock frequency listed in Table 4. The memory bandwidth per shared core clock is comparable to, or even higher than, that of real high-end heterogeneous processors integrating a similar GPU (Inc, [n. d.]). As such, we ensure that our platform resembles a real product and enables fair evaluations.

The page size is set to 4KB, a typical size that is widely adopted. To avoid making the GPU TLB a bottleneck and to expose the limitation of DRAM bandwidth in heterogeneous shared-memory systems, we also optimize the GPU TLB design in our heterogeneous system, including per-SM TLBs, a highly-threaded page table walker, and a shared L2 TLB (Power et al., 2014). We choose the configuration with CCWS in (Rogers et al., 2012) as our baseline.

We estimate the GDDR5 DRAM energy consumption through a modified Micron DRAM power calculator (Micron, [n. d.]a) based on the GDDR5 datasheet (Micron, [n. d.]b); the DDR3 DRAM energy consumption is obtained directly from the Micron DRAM power calculator by feeding in the run-time statistics generated by gem5-gpu.

To evaluate the effectiveness of TEMP and TBAS, we compared the following approaches:

  • CCWS refers to the scheduler that improves the L1 cache locality in GPUs, proposed by (Rogers et al., 2012). The results of CCWS are used as the normalization basis in our evaluations.

  • OWL denotes the optimized scheduling method proposed by (Jog et al., 2013a), which improves the performance through optimizing the cache and memory accesses in GPU systems.

  • TEMP denotes the thread batch enabled memory partitioning scheme presented in Section 3.

  • TEMP+TBAS refers to the design integrating TEMP and TBAS.

  • BW-AWARE denotes the synergistic bandwidth-aware page placement policy in (Agarwal et al., 2015). It places GPU pages across the heterogeneous memory system, i.e., the GDDR5 and DDR3 DRAM, so that the bandwidth of both memories is shared across GPU pages.

  • Batching+BW refers to the scheme that combines TEMP, TBAS, and BW-AWARE.

6. Results

6.1. Evaluation Results for GPU Applications

Figure 10. The performance (IPC) of different schemes. The results are normalized to CCWS.


Figure 11. The local access ratio distribution of memory accesses of different schemes.


6.1.1. Performance

We first evaluate and analyze the performance and the local access ratio of each memory bank across the different designs for the GPU applications. Here, a local access denotes a memory access from the SM associated with the bank, while a remote access refers to an access from any other SM. According to the performance results under TEMP and the evaluated local access ratios, the GPU applications are classified into the following three categories:

  • C1:  These applications present a high local access ratio (99% on average) and significant performance improvement across all the configurations employing TEMP.

  • C2:  Similar to C1, the applications in C2 also demonstrate a high local access ratio (93%). In contrast, they present a slight performance reduction (1%) under TEMP, yet an effective performance improvement under TEMP+TBAS.

  • C3:  The applications in C3 do not have a high local access ratio due to their intrinsic thread-data mapping and memory access patterns. Their overall performance with TEMP applied is degraded compared with that of CCWS.

The performance results are shown in Fig. 10, and Fig. 11 shows the local access ratio for the GPU applications.

The overall results show that applying TEMP on top of CCWS achieves a 5.7% geometric-mean (GM) speedup, while replacing CCWS with TBAS (i.e., TEMP+TBAS) further raises the speedup to 10.3%. Based on our evaluations, OWL achieves 93.6% of the performance of CCWS across the application workloads. As shown in Fig. 11 and Fig. 12, the cache hit rate of OWL is lower than that of CCWS, and the BLP improvement achieved by OWL is limited. These results verify that considering only a small subset of the thread blocks that share pages is insufficient to achieve a remarkable performance improvement. The IPC of TEMP is 12.9% higher than that of OWL. BW-AWARE keeps the page placement ratio equal to the bandwidth ratio between GDDR5 and DDR3, which improves the utilization of the combined bandwidth of both memories. Hence, BW-AWARE gains 5.1% performance improvement over CCWS, as can be seen from Fig. 10. This performance gain is consistent with the value reported in (Agarwal et al., 2015) for a similar bandwidth ratio.

Figure 12. The statistics of the DRAM system and the congestion on the reply network.


To further evaluate the effects of these designs on the memory requests of the three categories of GPU applications, we summarize the DRAM usage statistics (BLP, RBHR, DRAM access delay) as well as the stalls on the reply network connecting the memory controllers and the SMs induced by network congestion. The results are normalized to those of CCWS and shown in Fig. 12.

When applying TEMP to C1, the BLP of C1 is significantly improved, by 58.3%, while the RBHR is increased by 17.8%. As expected, by suppressing the inter-SM interference of memory accesses, TEMP unveils the intrinsic locality and access parallelism of thread batches. In comparison, OWL improves BLP by 16.3% and RBHR by 8.6%; the opportunistic prefetching adopted by OWL boosts its RBHR. We also investigated the network congestion between the SMs and the GDDR5 DRAM partitions. The network congestion of OWL is 33.6% higher than that of CCWS, which quantitatively demonstrates that CCWS has a higher L1 cache hit rate, fewer L2 accesses, and fewer DRAM accesses compared to OWL. All the above factors together lead to a 17.3% reduction in DRAM access delay with TEMP in C1. Consequently, TEMP achieves 11.1% performance improvement over CCWS, which is 24.0% higher than OWL. For C1, the BLP under TEMP+TBAS is 9.1% smaller than that under TEMP because the number of active thread batches is intentionally limited for row locality enhancement. On the other hand, C1's RBHR under TEMP+TBAS is raised by 33.1% and the DRAM access delay is reduced by 29.9%. More importantly, a considerable reduction in network congestion (18.7%) is observed. As a result, more than 15% performance improvement is achieved by TEMP+TBAS for C1, as shown in Fig. 10.

C2 achieves a high local access ratio when TEMP is applied. However, TEMP can hardly increase the BLP of C2, since it already approaches the theoretical upper bound. For instance, some kernels in NN have only a few thread blocks, fewer even than the bank count; applying TEMP to those kernels may limit the BLP. Fortunately, TBAS enhances the row locality and reduces the network congestion, resulting in a slight speedup (2%). As shown in Fig. 10, the performance of C3 under TEMP/TEMP+TBAS is degraded/improved by 2.5%/2.3% on average. Note that it is difficult to formalize the thread-data mapping of the applications in C3; thus, applying TEMP to C3 prolongs the DRAM access delay.

Figure 13. The normalized DRAM energy consumption of different schemes.


6.1.2. Energy

The normalized DRAM energy consumption of all configurations is shown in Fig. 13. Generally, the DRAM energy savings come from two main sources: 1) the saving of activation energy, which dominates DRAM energy consumption and can be achieved by increasing RBHR; and 2) the saving of background energy, which is proportional to the reduction of execution time. Therefore, the DRAM energy reduction is tied to the improved access locality as well as the overall performance improvement. Our results show that, compared to CCWS, the DRAM energy saving of TEMP is 11.2%. TEMP+TBAS saves 20.7% more energy than CCWS because of the significantly improved RBHR. OWL saves 5.9% energy, which is less than TEMP+TBAS, as a result of its higher row activation ratio and worse performance. Batching+BW achieves the highest energy saving of 14.2%.

Figure 14. The performance of heterogeneous workloads.


6.2. Evaluation for Heterogeneous Workloads

Fig. 14 shows the performance of the CPU applications (WS-C) and the GPU application (IPC-G) in each heterogeneous workload when TEMP+TBAS is applied. The performance of the CPU applications in a workload is measured by the weighted speedup (Eyerman and Eeckhout, 2008). These results are normalized to the weighted speedup of the same CPU applications running standalone on the heterogeneous system. The IPC of a GPU application is likewise normalized to the IPC obtained by running it exclusively with TEMP+TBAS. The memory-intensive CPU and GPU applications in the workloads suffer non-trivial performance degradation due to contention for shared resources, e.g., the interconnection network and DRAM. In contrast, the performance degradation of memory non-intensive applications is much smaller. The weighted performance of the CPU applications across the twelve workloads is reduced by 11.9%; correspondingly, the IPC of the GPU applications is 9.2% lower than that obtained by TEMP+TBAS running alone.

When CPU applications run concurrently, the effectiveness of TEMP and TBAS is constrained and the performance of both sides degrades, for two reasons: 1) TBAS expects consecutive thread blocks to access their physical pages within a limited span of rows; the physical addresses of the pages accessed by the CPU applications, however, can mix with those of the pages accessed by the GPU applications, deteriorating the row locality of the GPU applications. 2) Even if TBAS successfully preserves the row locality of the GPU applications, the memory controller will likely always prioritize the intensive memory accesses from the GPU and suspend the memory accesses from the CPU.

To address the above problems, we first divide each bank into two portions, one for the CPU and one for the GPU: we reserve the rows with higher addresses in a bank for the CPU and the rows with lower addresses for the GPU, and new CPU and GPU pages are allocated from their reserved address spaces. As such, most CPU and GPU pages are physically separated within a bank, which allows TBAS to keep the row locality of GPU applications when CPU applications are running simultaneously. Second, the memory controller is set to always prioritize the memory accesses from the CPU over those from the GPU, as proposed in (Ausavarungnirun et al., 2012). Since most CPU applications are delay-sensitive, unconditionally prioritizing the memory accesses from the CPU eliminates the risk of memory access starvation on the CPU side. Combining the above two solutions, the performance loss of the CPU/GPU applications is reduced by 6.1%/3.5%, as denoted by Comb-C and Comb-G in Fig. 14. Some workloads (e.g., WL8 and WL9) that include both CPU- and GPU-intensive applications attain significant performance improvement from the integrated heterogeneous-aware thread batching.
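A minimal sketch of the row-range split within a bank is given below; the field names and the boundary rule are illustrative assumptions about how such a reservation could be expressed, not the paper's exact mechanism.

```cuda
// A hedged sketch of the per-bank row split: rows with higher addresses are
// reserved for CPU pages and rows with lower addresses for GPU pages, so TBAS
// keeps GPU row locality while CPU applications co-run.
enum class Requestor { CPU, GPU };

struct BankRowSplit {
  int rows_per_bank;      // total rows in the bank
  int cpu_reserved_rows;  // rows at the top of the bank reserved for the CPU

  // Returns true if a new page for `who` may be placed in `row` of this bank.
  bool row_allowed(int row, Requestor who) const {
    int cpu_boundary = rows_per_bank - cpu_reserved_rows;
    return (who == Requestor::CPU) ? (row >= cpu_boundary)
                                   : (row < cpu_boundary);
  }
};
```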

The solutions mentioned above are simple yet capable of preserving the effectiveness of TEMP and TBAS for GPU applications while preventing considerable performance loss for CPU applications. We believe more sophisticated techniques can further balance the throughput between the CPU and the GPU (Kayıran et al., 2014). However, such balanced-throughput designs are beyond the scope of this paper and left for future work.

7. Related Works

7.1. Memory Partitioning in Multi-core Systems

In multi-core systems, memory bank partitioning (MBP) binds a thread to one or more memory banks, so that every thread accesses its own private banks and avoids interference from other threads. Mi et al. (Mi et al., 2010) first proposed MBP and used modified bank permutation to compensate for the degraded BLP. Jeong et al. (Jeong et al., [n. d.]) used sub-ranking to overcome the BLP degradation of a single thread after applying MBP. Liu et al. (Liu et al., 2012) designed a purely software MBP based on OS page allocation; they also explored the use of MBP in a multi-threaded application, but the result was not very promising because of inter-thread data sharing. Xie et al. (Xie et al., 2014) pointed out that unbalanced memory requirements across threads are the main reason for the BLP degradation and proposed a dynamic bank partitioning approach to solve this problem. In TEMP, BLP is guaranteed by workload balancing across the SMs, while memory access fairness is guaranteed by the homogeneity of the GPU threads in a kernel. Thread batching in TEMP also alleviates the negative impact of inter-thread data sharing on system performance in multi-threaded applications.

7.2. DRAM Efficiency in GPU

Compiler-assisted data layout transformation (Yang et al., 2010; Sung et al., 2010; Xie et al., 2015) proactively prevents unbalanced accesses to DRAM components by carefully allocating the data, the register file, or the thread block index. For example, Xie et al. (Xie et al., 2015) put forward a compiler-based framework to balance register allocation and the targeted thread-level parallelism in GPU systems. However, compiler-level methods are not aware of hardware implementation details; both thread scheduling and DRAM address mapping at the hardware level may offset the optimizations made at the compiler level. The hardware-level approaches to enhancing DRAM usage efficiency in GPU or CPU-GPU systems include:

Enhanced memory schedulers:  Jeong et al. (Jeong et al., 2012) designed a QoS-aware memory scheduler for MPSoC with CPUs and GPUs. The DRAM bandwidth allocation between the CPUs and GPUs is dynamically adjusted to meet the frame rate requirement of the GPUs and maximize the overall system throughput. Ausavarungnirun et al. (Ausavarungnirun et al., 2012) proposed a staged memory scheduling framework with affordable hardware cost for heterogeneous systems. We adopt the memory scheduling policy from (Ausavarungnirun et al., 2012) to customize our proposed heterogeneous-aware thread batching.

Enhanced thread schedulers:  Jog et al. (Jog et al., 2013a) revealed that a serial thread block data layout combined with sequential thread block dispatching can cause BLP degradation for GPU applications. A scheduler was then designed to improve the BLP by prioritizing the thread blocks in consecutive SMs, and the authors also utilized prefetching to compensate for the degradation of row locality. However, if the memory of a GPU is pageable, the effect of prioritized thread scheduling becomes uncertain, because the pages of consecutive thread blocks can be non-consecutive or not concentrated in a single DRAM row. In our scheme, TEMP relies on thread batching and page coloring to improve the BLP and TBAS enhances the row locality, targeting a heterogeneous system design that supports pageable GPU memory.

8. Conclusion

Modern GPUs suffer from the mismatch between thread-level parallelism and DRAM bandwidth. To improve the DRAM usage efficiency of GPU applications, we propose an integrated architectural approach composed of the TEMP and TBAS techniques: TEMP improves memory access parallelism for massively multi-threaded GPU applications by minimizing the memory access interweaving across SMs, and TBAS maximizes the row locality by carefully prioritizing the execution of thread batches. Heterogeneous-aware thread batching is also introduced to preserve the effectiveness of thread batching when running heterogeneous workloads. Our results show that TEMP+TBAS can achieve up to 10.3% system performance improvement and 11.3% DRAM energy saving compared to the baseline employing CCWS. By using the simple existing solutions, the heterogeneous-aware thread batching can still maintain 93.9% CPU performance and 96.5% GPU performance compared to the results of exclusively running the CPU and GPU applications.

Acknowledgements.
This work is supported in part by US National Science Foundation under Grant 1725456 and Grant 1615475; Bing Li acknowledges the National Academy of Sciences (NAS), USA for awarding the NRC research fellowship.

References

  • Abdel-Majeed and Annavaram (2013) Mohammad Abdel-Majeed and Murali Annavaram. 2013. Warped Register File: A Power Efficient Register File for GPGPUs. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA) (HPCA ’13). IEEE Computer Society, Washington, DC, USA, 412–423. https://doi.org/10.1109/HPCA.2013.6522337
  • Agarwal et al. (2015) Neha Agarwal, David Nellans, Mark Stephenson, Mike O’Connor, and Stephen W Keckler. 2015. Page placement strategies for GPUs within heterogeneous memory systems. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’15). ACM, New York, NY, USA, 607–618. https://doi.org/10.1145/2694344.2694381
  • Ausavarungnirun et al. (2012) Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H. Loh, and Onur Mutlu. 2012. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA ’12). IEEE Computer Society, Washington, DC, USA, 416–427. http://dl.acm.org/citation.cfm?id=2337159.2337207
  • Bakhoda et al. (2010) Ali Bakhoda, John Kim, and Tor M Aamodt. 2010. Throughput-effective on-chip networks for manycore accelerators. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’43). IEEE Computer Society, Washington, DC, USA, 421–432. https://doi.org/10.1109/MICRO.2010.50
  • Bakhoda et al. (2009) Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. 163–174. https://doi.org/10.1109/ISPASS.2009.4919648
  • Branover et al. (2012) Alexander Branover, Denis Foley, and Maurice Steinman. 2012. Amd Fusion apu: Llano. IEEE Micro 32, 2 (March 2012), 28–37. https://doi.org/10.1109/MM.2012.2
  • Che et al. (2009) Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Austin, TX, USA, 44–54. https://doi.org/10.1109/IISWC.2009.5306797
  • Chu (2013) Hanjin Chu. 2013. AMD heterogeneous Uniform Memory Access. APU 13th developer summit.–San Jose (2013), 11–13.
  • Ebrahimi et al. (2011) Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, José A Joao, Onur Mutlu, and Yale N Patt. 2011. Parallel application memory scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). ACM, New York, NY, USA, 362–373. https://doi.org/10.1145/2155620.2155663
  • Eyerman and Eeckhout (2008) Stijn Eyerman and Lieven Eeckhout. 2008. System-level performance metrics for multiprogram workloads. IEEE micro 28, 3 (May 2008), 42–53. https://doi.org/10.1109/MM.2008.44
  • He et al. (2008) Bingsheng He, Wenbin Fang, Qiong Luo, Naga K Govindaraju, and Tuyong Wang. 2008. Mars: a MapReduce framework on graphics processors. In 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT). 260–269.
  • Inc ([n. d.]) Advanced Micro Devices Inc. [n. d.]. AMD Quad-Core A10-Series APU for Desktops. http://products.amd.com/en-us/DesktopAPUDetail.aspx?id=100/
  • Inc. ([n. d.]a) Micron Technology Inc. [n. d.]a. Micron DDR3 SDRAM Part MT41J256M8. Micron Technology Inc..
  • Inc. ([n. d.]b) The Khronos Group Inc. [n. d.]b. OpenCL. https://www.khronos.org/opencl/
  • Jeong et al. (2012) Min Kyu Jeong, Mattan Erez, Chander Sudanthi, and Nigel Paver. 2012. A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC (DAC ’12). ACM, New York, NY, USA, 850–855. https://doi.org/10.1145/2228360.2228513
  • Jeong et al. ([n. d.]) Min Kyu Jeong, Doe Hyun Yoon, Dam Sunwoo, Mike Sullivan, Ikhwan Lee, and Mattan Erez. [n. d.]. Balancing DRAM locality and parallelism in shared memory CMP systems. In IEEE International Symposium on High-Performance Comp Architecture. https://doi.org/10.1109/HPCA.2012.6168944
  • Jia et al. (2012) Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2012. Characterizing and Improving the Use of Demand-fetched Caches in GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS ’12). ACM, New York, NY, USA, 15–24. https://doi.org/10.1145/2304576.2304582
  • Jog et al. (2013a) Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013a. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’13). ACM, New York, NY, USA, 395–406. https://doi.org/10.1145/2451116.2451158
  • Jog et al. (2013b) Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013b. Orchestrated Scheduling and Prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA ’13). ACM, New York, NY, USA, 332–343. https://doi.org/10.1145/2485922.2485951
  • Kayıran et al. (2014) Onur Kayıran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T Kandemir, Gabriel H Loh, Onur Mutlu, and Chita R Das. 2014. Managing GPU Concurrency in Heterogeneous Architectures. (Dec 2014), 114–126. https://doi.org/10.1109/MICRO.2014.62
  • Kim et al. (2010) Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. 2010. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’43). IEEE Computer Society, Washington, DC, USA, 65–76. https://doi.org/10.1109/MICRO.2010.51
  • Lin et al. (2008) Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In 2008 IEEE 14th International Symposium on High Performance Computer Architecture. IEEE, 367–378. https://doi.org/10.1109/HPCA.2008.4658653
  • Liu et al. (2012) Lei Liu, Zehan Cui, Mingjie Xing, Yungang Bao, Mingyu Chen, and Chengyong Wu. 2012. A software memory partition approach for eliminating bank-level interference in multicore systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT ’12). ACM, New York, NY, USA, 367–376. https://doi.org/10.1145/2370816.2370869
  • Luk et al. (2005) Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. (2005), 190–200. https://doi.org/10.1145/1065010.1065034
  • Mao et al. (2014) Mengjie Mao, Wujie Wen, Yaojun Zhang, Yiran Chen, and Hai (Helen) Li. 2014. Exploration of GPGPU Register File Architecture Using Domain-wall-shift-write Based Racetrack Memory. In Proceedings of the 51st Annual Design Automation Conference (DAC ’14). ACM, New York, NY, USA, Article 196, 6 pages. https://doi.org/10.1145/2593069.2593137
  • Mi et al. (2010) Wei Mi, Xiaobing Feng, Jingling Xue, and Yaocang Jia. 2010. Software-hardware cooperative DRAM bank partitioning for chip multiprocessors. In Network and Parallel Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, 329–343.
  • Micron ([n. d.]a) Micron. [n. d.]a. Micron system power calculators. http://www.micron.com/products/support/power-calc/
  • Micron ([n. d.]b) Micron. [n. d.]b. Micron TN-ED-01: GDDR5 SGRAM Introduction. http://www.micron.com/products/dram/gddr5/
  • Mutlu and Moscibroda (2007) Onur Mutlu and Thomas Moscibroda. 2007. Stall-time fair memory access scheduling for chip multiprocessors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40). IEEE Computer Society, Washington, DC, USA, 146–160. https://doi.org/10.1109/MICRO.2007.40
  • Mutlu and Moscibroda (2008) Onur Mutlu and Thomas Moscibroda. 2008. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA ’08). IEEE Computer Society, Washington, DC, USA, 63–74. https://doi.org/10.1109/ISCA.2008.7
  • NVIDIA ([n. d.]a) NVIDIA. [n. d.]a. CUDA. http://www.nvidia.com/object/cuda_home_new.html/
  • NVIDIA ([n. d.]b) NVIDIA. [n. d.]b. CUDA SDK. https://developer.nvidia.com/cuda-downloads/
  • NVIDIA (2009) NVIDIA. 2009. Nvidia Fermi Architecture. http://www.nvidia.com/object/fermi-architecture.html
  • Owens et al. (2000) John D Owens, William J Dally, Scott Rixner, Peter Mattson, and Ujval J Kapasi. 2000. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA ’00). ACM, New York, NY, USA, 128–138. https://doi.org/10.1145/339647.339668
  • Power et al. (2015) Jason Power, Joel Hestness, Marc S Orr, Mark D Hill, and David A Wood. 2015. gem5-gpu: A heterogeneous cpu-gpu simulator. IEEE Computer Architecture Letters 14, 1 (Jan 2015), 34–36. https://doi.org/10.1109/LCA.2014.2299539
  • Power et al. (2014) Jason Power, Mark D Hill, and David A Wood. 2014. Supporting x86-64 Address Translation for 100s of GPU Lanes. In HPCA. 568–578. https://doi.org/10.1109/HPCA.2014.6835965
  • Rogers et al. (2012) Timothy G Rogers, Mike O’Connor, and Tor M Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 72–83. https://doi.org/10.1109/MICRO.2012.16
  • Shimpi et al. (2012) Anand Lal Shimpi et al. 2012. Inside the titan supercomputer: 299k amd x86 cores and 18.6 k nvidia gpus. AnandTech online computer hardware magazine (2012).
  • Stratton et al. (2012) John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and WMW Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing 127 (2012).
  • Sung et al. (2010) I-Jui Sung, John A Stratton, and Wen-Mei W Hwu. 2010. Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. In 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, Vienna, Austria, 513–522.
  • Suzuki et al. (2013) Noriaki Suzuki, Hyoseung Kim, Dionisio De Niz, Bjorn Andersson, Lutz Wrage, Mark Klein, and Ragunathan Rajkumar. 2013. Coordinated bank and cache coloring for temporal protection of memory accesses. In 2013 IEEE 16th International Conference on Computational Science and Engineering. IEEE, 685–692. https://doi.org/10.1109/CSE.2013.106
  • Usui et al. (2016) Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang, and Onur Mutlu. 2016. DASH: Deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators. ACM Transactions on Architecture and Code Optimization (TACO) 12, 4, Article 65 (Jan. 2016), 28 pages. https://doi.org/10.1145/2847255
  • Xie et al. (2014) Mingli Xie, Dong Tong, Kan Huang, and Xu Cheng. 2014. Improving System Throughput and Fairness Simultaneously in Shared Memory CMP Systems via Dynamic Bank Partitioning. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 344–355. https://doi.org/10.1109/HPCA.2014.6835945
  • Xie et al. (2015) Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture. ACM, 395–406. https://doi.org/10.1145/2830772.2830813
  • Yang et al. (2010) Yi Yang, Ping Xiang, Jingfei Kong, and Huiyang Zhou. 2010. A GPGPU Compiler for Memory Optimization and Parallelism Management. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’10). ACM, New York, NY, USA, 86–97. https://doi.org/10.1145/1806596.1806606
  • Yuan et al. (2009) George L Yuan, Ali Bakhoda, and Tor M Aamodt. 2009. Complexity effective memory access scheduling for many-core accelerator architectures. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 34–44. https://doi.org/10.1145/1669112.1669119