Given the continued growth in data requirements and processing intensity of a range of scientific and data-centric applications, a single GPU can no longer supply the necessary computing power to meet the needs of tomorrow’s applications [30, 57]. Limited by the device technology challenges of CMOS scaling and the associated manufacturing costs, it is becoming impractical to improve single-GPU performance by adding more computing resources on the same die.
One potential solution to this issue, which is quickly gaining popularity, is to integrate multiple GPUs into a single platform. NVIDIA recently started offering the multi-GPU DGX platform, focused on accelerating Deep Neural Network (DNN) training. However, recent studies suggest that the performance of multi-GPU systems can be heavily constrained by CPU-to-GPU and GPU-to-GPU synchronization, and limited by multi-GPU memory management overhead. The design of an effective memory management system and cross-GPU communication fabric remains an open problem that needs to be addressed to unlock the full potential of future multi-GPU systems.
Recent improvements to GPU memory systems provide support for a unified memory space, enabling cross-GPU memory access and system-level atomics. With unified memory [28, 36] and cross-GPU memory access, one GPU can access data on another without the help of the CPU. Additionally, system-level atomics allow for synchronization across GPUs during the execution of a GPU program. These features open up new possibilities for GPU applications to enjoy significantly improved levels of performance. At the same time, there are new challenges associated with multi-GPU systems, including how to handle GPU-to-GPU communication, memory management across the unified CPU and multi-GPU memory space, and application/data partitioning and mapping. These challenges have to be tackled properly to fully unlock the potential performance of multi-GPU systems, and we need to develop a better understanding of them at the microarchitectural level.
Due to the lack of appropriate simulation research tools, computer architecture researchers are handicapped when trying to study the potential of multi-GPU collaborative execution. Existing studies [9, 58] on multi-GPU systems assume a unified logical GPU programming model, which hides the complexity of the multi-GPU system and exposes only one logical GPU to the programmer. Despite its simplicity, a unified logical GPU is not necessarily the right choice for GPU programmers today. The main reason is that on today’s commercial multi-GPU systems, programmers are tasked with precisely controlling which GPU stores a piece of data and which GPU runs a program, as most state-of-the-art programming frameworks (e.g., OpenCL, HSA, and CUDA) adopt this model, and major GPU libraries (e.g., Caffe [1]) follow it. Researchers cannot use existing GPU simulation frameworks to simulate multi-GPU systems, mainly due to: (1) a lack of modularity and state encapsulation, and (2) non-scalable performance.
Due to this lack of modularity and state encapsulation, configuring a multi-GPU system, or any other major model modification to the simulator, would require shotgun-style refactoring. This would involve modifications to major portions of the simulator codebase, rendering changes to the design a major endeavor. Further, lack of modularity and state encapsulation inhibits contributions by the broader research community, given that common files will need modification by each contributor. Therefore, the research community is calling for a new generation of “modular, pluggable, hookable, and composable”  simulators that can provide a much higher level of extensibility and can support the needs of the user base.
In addition, simulating a large-scale system demands a highly performant simulator infrastructure. Using existing simulation frameworks, researchers can wait for days or weeks to produce a single simulation result. This simulation cost is further exacerbated when simulating a multi-GPU system, as simulation overhead grows faster than linearly as we add more simulated elements. The major reason is that as contention increases in key system components, including interconnects, shared caches, and memory controllers, the simulation can be several orders of magnitude slower than simulating each GPU independently. Experimental studies [34, 40] have explored creating multi-threaded simulators to accelerate simulation. However, they trade off accuracy for faster simulation. To provide scalable performance in a multi-GPU simulation framework, we need a new parallelization approach that enables system simulation to achieve both high performance and high accuracy.
In this work, we present MGSim, an open-source (https://gitlab.com/akita/gcn3), cycle-accurate GPU simulator that is specifically designed for, but not limited to, multi-GPU system simulation. MGSim executes unmodified Graphics Core Next (GCN) 3rd generation instruction set architecture (ISA) binaries. It is flexible and fully configurable, enabling users to quickly create a multi-GPU platform for simulation. MGSim ships with built-in multi-threading capability, supporting both efficient functional emulation and architectural simulation, without compromising simulation accuracy.
To accompany MGSim, we have developed MGMark, a new benchmark suite designed to support design space exploration using multi-GPU collaborative execution patterns. We define multi-GPU collaborative execution as workloads in which multiple GPUs execute a single application concurrently. We categorize multi-GPU collaborative execution patterns into five groups: i) Partitioned Data, ii) Adjacent Access, iii) Gather, iv) Scatter, and v) Irregular. These patterns span the gamut of general communication schemes, from no data sharing among the GPUs to sharing the entire address space for both read and write operations. We provide a set of workloads covering not only each multi-GPU collaborative execution pattern, but also a wide range of algorithms and GPU features.
MGSim and MGMark form a brand new framework to support the computer architecture design community. They allow the community to efficiently explore a range of multi-GPU models: execution on a discrete multi-GPU system (D-MGPU), and execution on a multi-GPU system behind a unified logical GPU interface (U-MGPU). To demonstrate this capability, we conduct a case study that runs MGMark on MGSim and compares D-MGPU and U-MGPU, using a 4-GPU configuration. We explore which GPU collaborative execution patterns perform well when targeting each multi-GPU configuration. The main purpose of this study is to demonstrate the power of this new framework, while also providing design directions for future multi-GPU systems.
The contributions of this paper include the following:
We present a set of design principles that all simulators should aspire to.
We present MGSim, a new parallel cycle-accurate multi-GPU architectural simulator that delivers both flexibility and high performance. We extensively validate MGSim against real hardware with both micro-benchmarks and full workloads.
We present MGMark, a benchmark suite that explores multi-GPU communication patterns.
We use MGSim and MGMark in a case study to analyze the impact of unifying multiple GPUs behind a logical GPU interface. We discuss future multi-GPU system design implications from this case study.
II-A Multi-GPU Systems
While today’s GPUs are quite powerful, in many emerging applications a single GPU cannot meet the required processing demands, due to: (1) limited compute capabilities, and (2) limited memory space. For example, VGGNet, a popular deep neural network framework, requires roughly 40 G-Ops (Giga operations) to process a single image through a DNN model. If an application requires a throughput of 1,000 images per second (i.e., 40 TFLOPs), we need, in theory, at least five R9 Nano GPUs to fulfill this requirement. On the other hand, training a DNN may require a multi-terabyte dataset, dwarfing the memory capacity of a single GPU. If we can increase the storage available, we have the potential of improving memory management and significantly accelerating training throughput. Multi-GPU systems are a potential solution, as they provide both more compute resources and more memory.
The industry has begun to realize the potential of multi-GPU systems [23, 32]. Most popular GPU programming frameworks, including OpenCL and CUDA, support multi-GPU programming following the model shown in Figure 1(a). These programming frameworks expose all of the GPUs to users, enabling them to select where data is stored and how kernels are mapped to devices. For example, in OpenCL, a command queue is associated with a GPU, and all the commands (e.g., memory copy, kernel launching) in the queue run on the associated GPU. In CUDA, the developer can select the GPU to use with the cudaSetDevice API, and all memory operations and kernel launches after the API call will target the designated GPU.
The computer architecture community lacks simulation frameworks that can support evaluation of this well-accepted multi-GPU programming model. Microarchitectural studies usually use a GPU programming model similar to the one shown in Figure 1(b). Researchers usually simulate only a single GPU, but configure the interconnect in the simulator to mimic a multi-GPU system.
One feature that greatly simplifies multi-GPU programming is unified memory (for CUDA) or shared virtual memory (for OpenCL). Unified memory provides the programmer with a single unified address space and avoids the need for explicit memory copies from one device to another. Recent unified memory implementations even support cross-GPU memory access, further simplifying the programming model. However, in current multi-GPU systems, cross-GPU memory access involves data transfers across a slow interconnect. Even NVLink, the most recent cross-GPU interconnect, provides an inter-GPU fabric that is an order of magnitude slower than local memory on the GPU (NVLink supports 20 GB/s, whereas HBM memory supports 256 GB/s). Therefore, bottlenecks in the cross-GPU memory system can significantly limit the benefits of multi-GPU systems. To begin to design more performant and scalable multi-GPU systems, a new breed of tools is required to guide design choices.
II-B GPU Execution Model
A typical GPU system is made up of one or more CPUs and a few GPUs (up to 8 per node in high-end systems). The GPUs are generally under the control of a CPU. More specifically, the host program that runs on the CPU sets up the data for the GPUs and launches GPU programs (kernels) on them. A vendor-specific GPU driver, running at the operating system level, receives requests from the host program, transfers data, and launches kernels on the GPU hardware.
When running on a GPU, a kernel can launch a 1-, 2-, or 3-dimensional grid of work-items. One work-item is comparable to a thread on a CPU and has its own register state. A grid can be divided into work-groups and wavefronts. On an AMD GCN3 GPU, a wavefront consists of 64 work-items that execute the same instruction concurrently. A work-group contains 1-8 wavefronts which can be synchronized using barriers.
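The grid decomposition above can be sketched with a bit of host-side arithmetic (an illustrative helper, not simulator code):

```go
package main

import "fmt"

const wavefrontSize = 64 // GCN3 wavefront width

// wavefrontsPerWorkGroup returns how many 64-work-item wavefronts are
// needed to cover a work-group of the given size.
func wavefrontsPerWorkGroup(workGroupSize int) int {
	return (workGroupSize + wavefrontSize - 1) / wavefrontSize
}

func main() {
	// A 256-work-item work-group maps to 4 wavefronts on GCN3,
	// within the 1-8 wavefronts-per-work-group range.
	fmt.Println(wavefrontsPerWorkGroup(256)) // 4
}
```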
The design of a GPU supports very high throughput for data-parallel workloads. For example, the AMD R9 Nano GPU leverages many Compute Units (CUs) to execute instructions. A single CU incorporates four Single-Instruction Multiple-Data (SIMD) units. Each SIMD unit has 32 lanes, with each lane providing a single-precision floating-point unit. Hence, a single SIMD unit can execute 32 instructions in parallel within a single clock cycle. With 64 CUs, the R9 Nano GPU executes up to 8,192 instructions per cycle. As the R9 Nano GPU runs at a 1 GHz clock rate, it can support a peak throughput of 8,192 × 1G ≈ 8.19 tera floating-point operations per second (TFLOPs).
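The peak-throughput arithmetic above can be reproduced directly (an illustrative calculation; the helper function name is ours):

```go
package main

import "fmt"

// peak returns instructions per cycle and peak TFLOPs for a GPU with
// the given number of CUs, SIMD units per CU, lanes per SIMD unit,
// and clock rate in GHz.
func peak(cus, simdsPerCU, lanes int, clockGHz float64) (int, float64) {
	ipc := cus * simdsPerCU * lanes                  // instructions per cycle
	tflops := float64(ipc) * clockGHz * 1e9 / 1e12   // ops/s in tera units
	return ipc, tflops
}

func main() {
	// R9 Nano: 64 CUs x 4 SIMD units x 32 lanes at 1 GHz.
	ipc, tflops := peak(64, 4, 32, 1.0)
	fmt.Println(ipc)                     // 8192
	fmt.Printf("%.2f TFLOPs\n", tflops)  // 8.19 TFLOPs
}
```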
II-C Parallel GPU Simulation
Architectural simulation can be much slower than running on real hardware. For example, Multi2Sim is reported to take more than a day to simulate 2 seconds of native execution. Malhotra et al. report that GPGPU-Sim is even slower — 11 days to simulate 2 seconds of native execution. These slow simulation speeds make simulating large-scale systems and large-scale workloads almost impossible in existing simulators. To successfully simulate multi-GPU systems running large-scale workloads, we need a new simulation philosophy.
To accelerate architectural simulation, researchers have explored multi-threading. In general, two types of parallelization approaches are used: 1) conservative, and 2) optimistic. Using a conservative approach, the chronological order of events is never violated, which requires global synchronization in each cycle. An optimistic approach allows events to be reordered to avoid frequent synchronization, reducing simulation time, though at some cost to the fidelity of the simulation.
We elect to adopt a conservative parallel simulation scheme because we do not want to compromise simulation accuracy. Figure 2 shows the number of events scheduled at the same time during simulation of the AES benchmark using MGSim. We see that the number of events that can be executed concurrently varies between 60 and 100, providing sufficient parallelism to keep a 4- to 8-core system busy.
III GPU Simulator Design Principles
Architectural simulators have been one of the most important tools to guide early design space exploration, performance optimization, and pre-silicon verification. Developing an accurate and extensible simulator is essential for the research community to explore a wide range of design possibilities.
In the following paragraphs, we discuss a number of design principles that simulators should follow, though are absent in many current simulators.
DP-1: Simulate a state-of-the-art machine-level ISA. Cutting-edge research explores cutting-edge features, and hence, new ISAs and new microarchitectures need to be evaluated. Existing simulators generally simulate old ISAs or intermediate representations. For example, Multi2Sim emulates the GCN1 ISA, which is four generations older than AMD’s current products. GPGPU-Sim mainly models the NVIDIA Fermi architecture, released in 2010. In addition, researchers have highlighted major issues when performing performance analysis at an intermediate-language level versus using the actual machine-code ISA [25, 26], which can result in misleading conclusions. Therefore, while any simulator will quickly become dated due to the pace of development in GPU technology, the research community needs a simulator that can simulate a new and feature-rich machine-level ISA.
DP-2: “Open to Extension, Closed to Modification.” When studying performance/power/reliability with an architectural simulator, researchers usually need to reconfigure, or more commonly, modify, the simulator to fit the needs of their intended study. Modifying the inter-dependent components in a simulator is non-trivial and may require modifying a large number of files. It tends to be more problematic when combining the modifications from different developers, as each developer may need to modify common files.
According to the “Open-Closed Principle” , one should be able to extend a simulator without modifying it. When adding more functionality to a simulator, researchers should not need to modify source files. Instead, they should write new extensions for the simulator and plug the new extensions into the existing simulator to realize new configurations. This approach can also help support the reproducibility of results, since each module can be clearly defined and reused .
DP-3: No magic. It is tempting for simulator developers to use the flexibility that software offers to sidestep the complexity of the simulated hardware design, which typically manifests in intricate queuing systems, asynchronous buffers, and low-level communication protocols.
As an example of “magic”, a GPU implementation may directly invalidate the caches by clearing all directory entries, ignoring the fact that in real hardware, this action requires a message to be sent from the command processor to each cache module. Manipulating the state of one module from another is a clear sign that the simulator is not tracking the behavior of real hardware, and this may impact simulation accuracy. When a simulator developer uses “magic”, it hurts both the accuracy of the simulator and the encapsulation and modularity of the code.
DP-4: Track both timing and data.
Following directly from the “no-magic” rule, a simulator should model the actual data flow in both the memory system and the instruction pipelines, rather than only calculating the simulated time. Execution simulation that maintains data values offers two advantages: (1) Minor mistakes in the simulator are detected as a mismatch of output values, rather than a difference in the estimated time. If the result generated by the simulator matches execution on the target hardware, we can guarantee that the modeled hardware is at least feasible. (2) A performance model or power model may be data dependent [47, 54]. Maintaining data in each module under simulation helps us support data-dependent modeling, which can improve accuracy.
DP-5: Simulate multi-threaded hardware with multi-threaded software. A GPU supports a massively parallel execution model. There are a large number of units concurrently executing independently on a GPU. Therefore, it should be possible to use multiple CPU threads to simulate GPU execution. In addition, properly applying locks in a multi-threaded program to prevent race conditions and avoid deadlocks is usually a difficult job. The design of the simulator should provide a locking scheme that both guarantees performance and avoids the hazards described above.
DP-6: No busy ticking. Busy ticking (i.e., constant checks of module states) is a common reason for low simulation performance, and should be avoided. In current simulator designs (e.g., GPGPU-Sim), modules usually need to check their internal state every cycle, even if the state does not need to be updated. This is a common problem for cycle-based simulation. Multi2Sim partially solves the problem by using a hybrid cycle-based and event-driven simulation scheme. However, some modules still need to keep retrying actions each cycle, such as a cache retrying reads while the network is busy. To achieve good simulation performance, a next-generation simulator should avoid busy ticking whenever possible.
MGSim is a highly configurable GPU simulator that is open-sourced under the terms of the MIT license. The simulator has been developed using the Go programming language. We selected Go because it provides both reasonable performance and ease of programmability. It also provides native language-level support for multi-threaded programming.
IV-A Simulator Core
The simulator core features a lightweight design composed of four parts:
1. The event system: An event marks an update of the system state that occurs at a particular time. An event-driven simulation engine maintains a queue of events and triggers events in chronological order.
2. The component system: Every entity that the simulator simulates is a component. In our case, a GPU, a compute unit, and a cache controller are examples of components. A component can only schedule events to itself and cannot decide what other components do in the future. Each component serves as an event handler that can process different types of events. The same type of event may have different behavior when handled by different components.
3. The request-connection system: Two components can communicate with each other only through connections, using requests. Connections are used to model the network-on-chip (NoC) and cross-chip interconnects.
4. The hook system: Hooks are small pieces of software that can be configured to attach to the simulator to either read simulation state, or update the simulator state. The event-driven simulation engine, all the components, and the connections are hookable. Hooks are used to perform non-critical tasks such as collecting traces, dumping debugging information, calculating performance metrics, recording stall reasons, and injecting faults (for reliability studies).
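A minimal sketch of the event system in Go, the simulator’s implementation language, may help fix ideas. The `Event` and `Engine` types below are simplified stand-ins, not MGSim’s actual API:

```go
package main

import (
	"container/heap"
	"fmt"
)

// Event marks a state update scheduled at a particular time
// (a simplified sketch; the real MGSim event interface differs).
type Event struct {
	Time   float64
	Action func()
}

// eventQueue is a min-heap ordered by event time.
type eventQueue []Event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].Time < q[j].Time }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(Event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	e := old[len(old)-1]
	*q = old[:len(old)-1]
	return e
}

// Engine maintains a queue of events and triggers them in
// chronological order.
type Engine struct{ queue eventQueue }

func (e *Engine) Schedule(evt Event) { heap.Push(&e.queue, evt) }

func (e *Engine) Run() {
	for e.queue.Len() > 0 {
		heap.Pop(&e.queue).(Event).Action()
	}
}

func main() {
	engine := &Engine{}
	// Events are scheduled out of order but fire chronologically.
	engine.Schedule(Event{Time: 2.0, Action: func() { fmt.Println("cache fill at cycle 2") }})
	engine.Schedule(Event{Time: 1.0, Action: func() { fmt.Println("memory read at cycle 1") }})
	engine.Run()
}
```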
The MGSim event engine supports parallel simulation, fulfilling DP-5. Leveraging the fact that the events that are scheduled concurrently do not depend on each other, the event engine employs multiple CPU threads to process the events that are scheduled concurrently. We embrace a conservative parallel event-driven scheme, so that we guarantee accurate results that match execution on a serial version of the simulator.
The component system and the request-connection system enforce very strict state encapsulation of components. Since we neither allow a component to schedule events for other components, nor allow a component to access another component’s state (by reading/writing field values, using getter/setter functions, or making function calls), all communication must use the connection system. This design forces the developer to explicitly declare protocols between components. The benefits of this design are three-fold.
First, a developer can implement a component without any concern for the communication protocol, letting the request-connection system worry about the implementation details of connecting components.
Second, we gain flexibility by allowing the user to freely compose two components that follow the same protocol. When a researcher wants to extend the simulator, one does not need to modify the existing simulator, but only needs to write a new component that replaces an existing one. The researcher only needs to be compliant with the protocol of the original component. When combining the efforts of two researchers, one simply needs to import the code from two sources and write a new configuration to connect the systems together. By adopting this model, we fulfill the requirement of DP-2. We encourage researchers that use MGSim to create a new git repository (open-source, ideally) that contains only the extensions to MGSim, and provides the necessary configuration code to wire the new extension to the original MGSim.
Third, we can improve simulation accuracy, as no information can “magically” flow from one component to another without being explicitly transferred through the interconnect. As a consequence, the processor cannot access data directly in DRAM, forcing all the data to flow through the cache hierarchy. We do not support emulation run-ahead during architectural simulation, as the processor does not even have the instruction bytes until the data is explicitly fetched from the memory system. Therefore, we can satisfy both DP-3 and DP-4 with this design.
The component system also contributes to addressing DP-5 by creating a clear boundary on where locks should be used. As the Handle function is the only place where a component can update its internal state (other components cannot access this state), we simply acquire a lock at the beginning of the Handle function and release it at the end.
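The locking discipline can be illustrated with a toy component (a hypothetical type; MGSim’s real components are more elaborate):

```go
package main

import (
	"fmt"
	"sync"
)

// cacheComp sketches the locking discipline: Handle is the only place
// a component's internal state changes, so one lock acquired at the
// top of Handle and released at the end is sufficient.
type cacheComp struct {
	mu       sync.Mutex
	accesses int
}

func (c *cacheComp) Handle(evt string) {
	c.mu.Lock()         // lock at the beginning of Handle
	defer c.mu.Unlock() // unlock at the end of Handle
	c.accesses++        // internal state is updated only here
}

func (c *cacheComp) Accesses() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.accesses
}

func main() {
	c := &cacheComp{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent events from the engine
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Handle("read")
		}()
	}
	wg.Wait()
	fmt.Println(c.Accesses()) // 100, with no data race
}
```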
The event-driven simulation and the connection system help avoid busy ticking (DP-6). For long-latency actions, such as a 300-cycle DRAM read, we can schedule an event in the event-driven simulation engine 300 cycles in the future and skip state updates in between. Another type of busy ticking in GPU architectures is caused by components that repeatedly retry sending data. Since a component has no information about when a connection becomes available, the component would have to retry each cycle. To avoid this type of busy ticking, we allow connections to explicitly notify connected components when the connection is available. Therefore, a component can skip updating its state while all of its outgoing connections are busy, as no progress can be made, and resume updating cycle-by-cycle after a connection becomes available.
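The notification mechanism can be sketched as follows (a hypothetical `conn` type, not MGSim’s actual connection API):

```go
package main

import "fmt"

// conn sketches the retry-free scheme: a busy connection records
// blocked senders and wakes them when it becomes available, so no
// component polls the connection every cycle.
type conn struct {
	busy    bool
	sent    []string
	waiting []func()
}

// Send delivers msg if the connection is free; otherwise it fails and
// the caller registers for a notification instead of retrying.
func (c *conn) Send(msg string) bool {
	if c.busy {
		return false
	}
	c.sent = append(c.sent, msg)
	return true
}

func (c *conn) NotifyWhenAvailable(f func()) { c.waiting = append(c.waiting, f) }

// Free marks the connection available and wakes all blocked senders.
func (c *conn) Free() {
	c.busy = false
	for _, f := range c.waiting {
		f()
	}
	c.waiting = nil
}

func main() {
	c := &conn{busy: true}
	if !c.Send("read-req") { // connection busy: register, don't retry
		c.NotifyWhenAvailable(func() { c.Send("read-req") })
	}
	c.Free() // the pending request is sent only now
	fmt.Println(c.sent)
}
```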
IV-B GPU Architecture Simulation
MGSim models a GPU, as shown in Figure 3, that runs the Graphics Core Next 3rd Generation (GCN3) ISA, fulfilling DP-1 by simulating a new ISA and microarchitecture. GCN4 and Vega (GCN5) only involve microarchitectural modifications or minor memory-access instruction extensions, and hence, can still be modeled by configuring MGSim.
The GPU architecture is mainly composed of a Command Processor (CP), Asynchronous Compute Engines (ACEs), Compute Units (CUs), caches, and memory controllers. The CP is responsible for communicating with the GPU driver and starting kernels with the help of ACEs. The ACEs dispatch wavefronts of kernels to run on the CUs.
In our model, a CU (as shown in Figure 4) incorporates a scheduler, a set of decoders, a set of execution units, and a set of storage units. The CU includes a scalar register file, vector register files, and Local Data Share (LDS) storage. A fetch arbiter and an issue arbiter decide, in a round-robin manner, which wavefront can fetch instructions and which can issue instructions, respectively. A decoder requires one cycle to decode an instruction before sending it to an execution unit (e.g., a SIMD unit). Each execution unit has a pipelined design that includes read, execute, and write stages. After an instruction completes all the stages in the pipeline, the wavefront that owns it can issue its next instruction.
The MGSim simulator includes a set of memory-component models, including a write-around cache controller, a write-back cache controller, and a memory controller. By default, the L1 caches and the L2 caches use a write-around and write-back policy, respectively. The cache controllers do not enforce coherence, as allowed by the GPU programming and memory model.
IV-C Multi-GPU Configuration
To demonstrate the configurability of the simulator, we explore the multi-GPU design space, configuring three different multi-GPU platforms — namely, a Monolithic Single GPU (M-SGPU), a Unified multi-GPU system (U-MGPU), and a Discrete multi-GPU system (D-MGPU). The M-SGPU is similar to the baseline R9 Nano GPU configuration (as in Figure 3), but provides 256 CUs, 32 L2 cache units, and 32 memory banks, making its computing power equivalent to four GPUs. Note that the M-SGPU is just a baseline design to help us analyze performance scaling of the multi-GPU systems. In reality, manufacturing such a GPU is impractical due to the limitations of current die sizes.
As shown in Figure 5(b), the D-MGPU design creates a GPU configuration that is commonly provided on current platforms: the driver connects to multiple GPUs, and the programmer can use APIs to control where the data resides and where the kernel executes. To enable unified memory and cross-GPU memory access, we introduce RDMA engines that route memory requests to other GPUs. We connect the RDMA engines via a PCIe bus, providing a bandwidth of 16 GB/s shared by all the GPUs.
To create U-MGPU, we disable the Command Processors and the ACEs of GPUs 1 to 3, leaving the CP and the ACEs of GPU 0 in charge of all the Compute Units. We also create a cross-GPU connection that connects the ACEs of the first GPU with the CUs of all other GPUs.
The DRAM banks in the multi-GPU systems are interleaved with a granularity of 4 KB. For example, the address range 0x0000-0x0FFF is stored in DRAM 0 of GPU 0, and so on. An exception is D-MGPU, where the address space is first partitioned across GPUs and then interleaved, mapping the DRAMs of the second GPU to the address range 4 GB-8 GB.
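The two interleaving schemes can be sketched as address-mapping functions. The bank count and per-GPU memory size below are assumptions for illustration:

```go
package main

import "fmt"

const (
	pageSize    = 0x1000          // 4 KB interleaving granularity
	gpuMemSize  = uint64(4) << 30 // 4 GB of DRAM per GPU (assumed)
	numGPUs     = 4
	banksPerGPU = 8 // DRAM banks per GPU (assumed)
)

// unifiedMapping interleaves 4 KB pages across the banks of all GPUs
// round-robin, as in the M-SGPU and U-MGPU configurations.
func unifiedMapping(addr uint64) (gpu, bank int) {
	page := addr / pageSize
	flat := int(page % uint64(numGPUs*banksPerGPU))
	return flat / banksPerGPU, flat % banksPerGPU
}

// discreteMapping first partitions the address space 4 GB per GPU,
// then interleaves pages across that GPU's local banks (D-MGPU).
func discreteMapping(addr uint64) (gpu, bank int) {
	gpu = int(addr / gpuMemSize)
	page := (addr % gpuMemSize) / pageSize
	return gpu, int(page % uint64(banksPerGPU))
}

func main() {
	fmt.Println(unifiedMapping(0x0000))          // 0 0: page 0 lives in DRAM 0 of GPU 0
	fmt.Println(unifiedMapping(0x1000))          // 0 1: the next page moves to the next bank
	fmt.Println(discreteMapping(uint64(4) << 30)) // 1 0: address 4 GB falls on GPU 1
}
```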
MGMark is a new benchmark suite that targets exploration of multi-GPU collaborative execution patterns with a wide range of multi-GPU workloads.
V-A Multi-GPU Collaborative Execution Patterns
Execution patterns are types of behavior that appear repeatedly in program execution. The pattern of a program is usually determined by both algorithmic constraints and implementation decisions. In this work, we consider a scenario where the data to be processed is large, so that duplicating the data on each GPU adds too much overhead, and running on a single GPU is impossible due to memory size limitations.
Studying multi-GPU collaborative execution patterns can help us cover most types of multi-GPU execution with a smaller number of benchmarks. It can also guide programmers and system designers to optimize programs and systems for specific targets. Note that the patterns introduced here are not meant to be exhaustive, nor mutually exclusive. One multi-GPU program may use more than one pattern, or may use patterns that we do not characterize in this paper.
Partitioned Data: The Partitioned Data pattern describes a type of algorithm that naturally allows both the input and output data to be partitioned on each GPU. The result is that no cross-GPU memory accesses are required. This pattern is frequently observed in streaming applications, such as AES encryption and the Black-Scholes algorithm, where the input and output have a one-to-one mapping. This pattern usually relies on a head node (a CPU or a GPU) to partition the data and distribute it to each GPU for processing. As no cross-GPU communication is required, this pattern is likely to achieve good scalability, and hence, should be used whenever possible.
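The head-node partitioning step can be sketched on the host (an illustrative helper, not part of MGMark):

```go
package main

import "fmt"

// partition splits n data elements evenly across numGPUs devices,
// giving each GPU a contiguous [start, end) range; any remainder goes
// to the leading GPUs. This models the head-node role in the
// Partitioned Data pattern.
func partition(n, numGPUs int) [][2]int {
	ranges := make([][2]int, numGPUs)
	base, rem := n/numGPUs, n%numGPUs
	start := 0
	for g := 0; g < numGPUs; g++ {
		size := base
		if g < rem {
			size++
		}
		ranges[g] = [2]int{start, start + size}
		start += size
	}
	return ranges
}

func main() {
	// 10 plaintext chunks over 4 GPUs: the first two GPUs take 3 each.
	fmt.Println(partition(10, 4)) // [[0 3] [3 6] [6 8] [8 10]]
}
```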
Adjacent Access: The Adjacent Access pattern describes workloads where GPUs need to access data from other GPUs that is closely related to their own local data. This pattern is frequently observed in signal processing, stencil algorithms, and physical simulations [38, 56], as calculating one output at a particular index needs input data from surrounding indices that may reside on a neighboring GPU. If the data that needs to be accessed from another GPU is read-only, we can maintain multiple copies of the data to avoid cross-GPU access, at the cost of using more GPU memory space. Otherwise, we can keep the data partitioned on each GPU and allow each GPU to issue cross-GPU accesses occasionally. Adjacent accesses involve a relatively small amount of cross-GPU communication, and therefore, can be a good option compared to data duplication.
Gather: This pattern describes a commonly used computing paradigm, where every GPU in the system needs to read remote data from the other GPUs, but each GPU will only write to its own local memory. The Gather pattern can be used in reduction-style computing (e.g., adding two vectors element-wise or calculating the sum of a vector), as each GPU needs to synthesize a large amount of data to create a smaller output. When the data is too large to fit in one GPU’s memory, or the data is already on each GPU, we can use a Gather operation. The Gather pattern requires the system to process cross-GPU read requests with rather low latency.
Scatter: Opposite but similar to Gather, Scatter describes a pattern where each GPU reads input data from its local memory but writes output data across the entire GPU address space. This pattern is used when the input data can be partitioned on each GPU, while the output location is non-deterministic.
Irregular: We group all other patterns under the Irregular pattern, which includes cases where any GPU needs to both read and write data from/to the entire GPU address space. This data reference pattern occurs in many sorting and graph algorithms, as the access pattern is data-dependent. The Irregular pattern presents performance challenges since it may result in frequent cross-GPU communication. The programmer should try to use other patterns before settling for an Irregular pattern. Also, whenever this pattern is used, the programmer should make every effort to keep memory accesses within a local GPU and avoid cross-GPU accesses.
We select a suite of workloads from public-domain libraries and benchmark suites, including the AMD APP SDK 3.0 (BS, MT, SC) and Hetero-Mark (AES, FIR, KM), as well as one benchmark (GD) developed from scratch. Workloads are modified with new OpenCL kernels supporting multi-GPU execution, and extended with a Go main program compatible with the simulator.
Advanced Encryption Standard (AES): AES 256-bit encryption is an encryption algorithm widely used in the security domain today. It involves a large number of bitwise operations to convert the plaintext to ciphertext, making it a compute-intensive workload. Our partitioned implementation breaks the plaintext into chunks and distributes the chunks to the GPUs. Each GPU then works on its own chunk of the data, with no need to access any remote data.
We include this benchmark to test the Partitioned Data pattern. We also use this benchmark to validate our model of sub-dword addressing, a distinct feature of the GCN3 and later AMD ISAs.
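As a minimal illustration of how the Partitioned Data pattern divides the work (the helper below is hypothetical, not the actual MGMark host code), the plaintext can be split into one contiguous chunk per GPU:

```go
package main

import "fmt"

// partition is a hypothetical helper illustrating the Partitioned Data
// pattern: the plaintext is split into one contiguous chunk per GPU, so
// each GPU encrypts its own chunk without touching remote memory.
func partition(data []byte, nGPUs int) [][]byte {
	chunk := (len(data) + nGPUs - 1) / nGPUs // round up
	parts := make([][]byte, 0, nGPUs)
	for off := 0; off < len(data); off += chunk {
		end := off + chunk
		if end > len(data) {
			end = len(data)
		}
		parts = append(parts, data[off:end])
	}
	return parts
}

func main() {
	plaintext := make([]byte, 64) // e.g., one AES block per GPU
	parts := partition(plaintext, 4)
	fmt.Println(len(parts), len(parts[0])) // prints 4 16
}
```

Because each chunk is self-contained, no cross-GPU traffic is generated during the encryption itself.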
Bitonic Sort (BS): Bitonic Sort is a sorting algorithm well suited to the GPU's massively parallel architecture. It uses a predefined order to compare pairs of values in the array to be sorted, making it highly data-parallel.
We include the Bitonic Sort algorithm to test the Irregular pattern. Although the memory access order is predefined, each GPU needs to read from, and write to, any location in the unified memory address space. It also scans a wide range of memory addresses repetitively, putting significant stress on the cache system.
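The data-independent compare order can be sketched as follows (a plain CPU-side Go version for illustration; the benchmark itself uses an OpenCL kernel). Note that the pair of indices compared at each stage and pass depends only on the array length, never on the data:

```go
package main

import "fmt"

// bitonicSort sorts a power-of-two-length slice in ascending order.
// The indices compared at each (stage k, pass j) are fixed by the
// array length alone, which is what makes the algorithm map well onto
// a massively parallel GPU.
func bitonicSort(a []int) {
	n := len(a)
	for k := 2; k <= n; k *= 2 { // stage
		for j := k / 2; j > 0; j /= 2 { // pass within the stage
			for i := 0; i < n; i++ { // each i is one work-item
				partner := i ^ j
				if partner > i {
					ascending := i&k == 0
					if (a[i] > a[partner]) == ascending {
						a[i], a[partner] = a[partner], a[i]
					}
				}
			}
		}
	}
}

func main() {
	a := []int{3, 7, 4, 8, 6, 2, 1, 5}
	bitonicSort(a)
	fmt.Println(a) // prints [1 2 3 4 5 6 7 8]
}
```

In a multi-GPU setting, the index `partner = i ^ j` can land in any GPU's partition, which is why BS exercises the Irregular pattern despite its predefined compare order.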
Finite Impulse Response Filter (FIR): FIR is a fundamental algorithm from the digital signal processing domain. In FIR, each work-item multiplies the filter kernel with a portion of the input data in an element-wise manner and sums all the results together.
We include FIR to test the Adjacent Access pattern, as the first few work-items on each GPU need to access input data that is stored on another GPU. Its large memory footprint also helps us analyze how cross-GPU memory access can have a significant performance impact.
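A scalar sketch of the FIR computation (illustrative Go, not the OpenCL kernel) makes the boundary accesses explicit: output element i reads inputs i-len(taps)+1 through i, so under a partitioned layout the first few outputs of each partition read from the previous GPU's chunk:

```go
package main

import "fmt"

// fir computes out[i] = sum over t of taps[t] * input[i-t]. When input
// is partitioned across GPUs, the indices i-t for the first
// len(taps)-1 outputs of a partition fall into the previous GPU's
// chunk — the Adjacent Access pattern.
func fir(input, taps []float64) []float64 {
	out := make([]float64, len(input))
	for i := range input {
		for t, c := range taps {
			if j := i - t; j >= 0 {
				out[i] += c * input[j]
			}
		}
	}
	return out
}

func main() {
	fmt.Println(fir([]float64{1, 1, 1, 1}, []float64{1, 1})) // prints [1 2 2 2]
}
```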
Gradient Descent (GD): Gradient descent is an important step used in optimization problems such as DNN training. Gradient descent evaluates the gradient values for a set of mathematical functions and uses the gradient values to update each function's parameters. When running on a multi-GPU system, gradient descent is usually performed in a data-parallel fashion, with each GPU processing a mini-batch of the data (i.e., the Partitioned Data pattern). After calculating the gradient on each GPU, the gradient values need to be averaged. Calculating the average inevitably involves cross-GPU communication.
We include the GD workload as it is one of the most widely used algorithms that requires the Gather pattern. Its large memory footprint is also a good test case to stress the cross-GPU interconnect.
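The Gather step at the end of each iteration can be sketched as follows (a hypothetical helper, not the actual GD kernel): computing the average requires reading every element of every remote gradient buffer, so the cross-GPU traffic is proportional to the model size:

```go
package main

import "fmt"

// averageGradients models the Gather step that ends each data-parallel
// iteration: perGPU[g] holds the gradient computed from GPU g's
// mini-batch, and every non-local buffer must be read across the
// interconnect to produce the average.
func averageGradients(perGPU [][]float64) []float64 {
	avg := make([]float64, len(perGPU[0]))
	for _, g := range perGPU { // each non-local g is a cross-GPU read
		for i, v := range g {
			avg[i] += v
		}
	}
	for i := range avg {
		avg[i] /= float64(len(perGPU))
	}
	return avg
}

func main() {
	fmt.Println(averageGradients([][]float64{{2, 4}, {4, 8}})) // prints [3 6]
}
```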
KMeans (KM): KMeans is an important clustering algorithm widely used in unsupervised machine learning applications. The GPU is responsible for calculating the distance from each input node to each of the centroids, while the CPU updates the centroid locations.
We select the KMeans benchmark to evaluate the Partitioned Data pattern. This workload differs from AES, which also follows the Partitioned Data pattern, in two respects: i) KMeans is a more memory-intensive workload, and ii) KMeans repetitively accesses the same memory locations across multiple kernels, making it more sensitive to the cache design.
Matrix Transpose (MT): Matrix Transpose is one of the building blocks common to more complex matrix operations. Work-items from one work-group first load matrix data into the Local Data Share (an addressable memory space with latency similar to the L1 caches), and then write the data back to memory at the transposed locations.
Although MT can be implemented using either the Gather pattern or the Scatter pattern, we include the Matrix Transpose benchmark to test the Scatter pattern. Each GPU is responsible for a specific number of columns in the output matrix. Since each GPU stores a few rows of both the input and output matrices, each GPU can read from local memory and write to other GPUs. We also use MT to validate the simulator on Local Data Share (LDS) operations.
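The Scatter structure of this partitioning can be sketched in plain Go (illustrative only; the real benchmark stages tiles through the LDS): GPU g reads only its local band of input rows, but its writes land in output rows owned by every GPU:

```go
package main

import "fmt"

// transposeScatter models the Scatter pattern in MT: the input matrix
// is partitioned into row bands, one per GPU. GPU g reads only its
// local band, but out[j][i] = in[i][j] scatters its writes into rows
// owned by every GPU.
func transposeScatter(in [][]float64, nGPUs int) [][]float64 {
	n := len(in)
	out := make([][]float64, n)
	for i := range out {
		out[i] = make([]float64, n)
	}
	rows := n / nGPUs // assume n is divisible by nGPUs
	for g := 0; g < nGPUs; g++ {
		for i := g * rows; i < (g+1)*rows; i++ { // local input rows only
			for j := 0; j < n; j++ {
				out[j][i] = in[i][j] // output row j may live on another GPU
			}
		}
	}
	return out
}

func main() {
	m := [][]float64{{1, 2}, {3, 4}}
	fmt.Println(transposeScatter(m, 2)) // prints [[1 3] [2 4]]
}
```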
Simple Convolution (SC): Simple convolution is a common operation in the image processing domain. It is also a fundamental step in convolutional neural networks (CNNs). SC performs a convolution operation on 2-dimensional images.
We include SC to test the Adjacent Access pattern in a 2-dimensional problem. Although the image to be convolved can be partitioned across multiple GPUs, each GPU needs to access a remote partition for the input pixels on the margins.
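The 2-D halo traffic can be estimated with a small helper (hypothetical, not part of SC): if the image is partitioned into horizontal bands, one per GPU, each GPU fetches `radius` full-width rows of margin pixels from each neighboring band:

```go
package main

import "fmt"

// haloPixels estimates remote pixel reads for a 2-D convolution when
// the image is split into horizontal bands, one per GPU: every GPU
// fetches `radius` full-width rows from each neighboring band.
func haloPixels(width, nGPUs, radius int) int {
	remoteRows := 0
	for g := 0; g < nGPUs; g++ {
		if g > 0 {
			remoteRows += radius // rows owned by the GPU above
		}
		if g < nGPUs-1 {
			remoteRows += radius // rows owned by the GPU below
		}
	}
	return remoteRows * width
}

func main() {
	// A 2048-wide image on 4 GPUs with a radius-1 filter needs 6 remote
	// rows in total (2 per internal boundary).
	fmt.Println(haloPixels(2048, 4, 1)) // prints 12288
}
```

Compared to the 1-D case in FIR, the remote data grows with the image width, which is consistent with SC generating more cross-GPU traffic than FIR.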
VI Evaluation Methodology
We start by using both micro-benchmarks and full benchmarks from MGMark to validate the accuracy of MGSim, and then evaluate the multi-GPU design performance as a case study.
VI-A Execution Platforms
| Parameter | Configuration |
|---|---|
| Number of CUs | 64 |
| Core Frequency | 1.0 GHz |
| Theoretical Compute Speed | 8.19 TFLOPs |
| L1 Vector Cache | 64 × 16KB, 4-way |
| L1 Instruction Cache | 16 × 32KB, 4-way |
| L1 Constant Cache Size | 16 × 16KB, 4-way |
| L2 Cache Size | 8 × 256KB, 16-way |
| DRAM Size | 8 × 512MB |
In order to validate MGSim against real hardware, we collect the actual GPU execution time as a golden performance reference. The validation system has 2 Intel Xeon E2560 v4 CPUs and one AMD R9 Nano GPU (details provided in Table I). The system runs the Radeon Open Compute platform (ROCm) 1.7 GPU software stack on a Linux Ubuntu 16.04.4 operating system. We lock the GPU to run at its maximum frequency to avoid the impact of Dynamic Voltage and Frequency Scaling (DVFS) on the system. All the timing results are collected using the Radeon Compute Profiler.
MGSim supports the ROCm standard, so we compile the benchmarks with AMD's ROCm compilers. We use clang-ocl to compile the MGMark workloads and use Clang (ROCm modified) to assemble the kernels of the micro-benchmarks (introduced in Section VI-B). The host programs are compiled with GCC 5.2.
We evaluate simulation speed and multi-threaded scalability on a host platform based on an Intel Core i7-4770 CPU, with 4 cores and 2 threads per core. When measuring the simulator performance, we use the environment variable GOMAXPROCS to set the number of CPU cores that the simulator can use.
VI-B Micro-benchmarks
We use a set of micro-benchmarks to help confirm that our simulator faithfully models each individual aspect of the real hardware. Each micro-benchmark is composed of a manually written GCN3 assembly kernel, a C++ host program used in native execution, and an additional host program written in Go for simulation purposes. The micro-benchmarks include:
ALU: A simple Python script is used to generate kernels with a varying number of ALU operations (v_add_f32 v3, v2, v1), followed by an s_endpgm instruction. Using the ALU micro-benchmark, we validate instruction cache policies and geometry.
L1 Access: Another Python program generates a fixed number of memory reads to the same address. All accesses, except for the first one, should be L1 cache hits, which allows us to infer the L1 cache hit latency.
DRAM Access: Global memory is repeatedly accessed using a 64-byte stride. Since all cache levels use 64-byte blocks, all accesses are expected to incur cache misses, and ultimately read from DRAM. We use this micro-benchmark to measure the DRAM latency.
L2 Access: This micro-benchmark first scans 1MB of memory, loading all of the data into the 2MB L2 cache of the R9 Nano. The L1 cache is expected to retain the last 16KB, which is equal in size to its total capacity. After this, a second scan sweeps the same 1MB of data from the beginning, using a variable number of memory accesses. All the memory accesses in the second pass should miss in the L1 cache and hit in the L2. We use this strategy to find the L2 cache latency.
VI-C MGMark Configuration
| Workload | 1 GPU | 4 GPUs | What |
|---|---|---|---|
| MT | 2048 | 4096 | Width of square matrix |
| SC | 1024 | 2048 | Width of square image |
VII Experimental Results
We first carry out a thorough experimental evaluation to validate our simulator against GPU hardware. Then we present a set of experiments to demonstrate how microarchitecture design can impact multi-GPU collaborative execution efficiency.
VII-A Simulator Validation with Micro-benchmarks
Figure 6(a) presents the execution time of the ALU micro-benchmark as we increase the number of ALU operations (see X-axis). As we can observe, the execution time demonstrates a staircase behavior, which is the result of instruction cache misses. In particular, as each cache line in the CU's L1 Instruction Cache can store 16 ALU instructions, we have 1 cache miss and 15 hits for every cache line read. From the slope of the time curve, we can conclude that the SIMD unit takes 5 cycles to execute one instruction, and from the step's height, we know that the GPU spends 300+ cycles to service a cache miss. The dashed line in the figure shows the simulator's reported execution time, demonstrating that our simulator can reproduce both the instruction cache misses and the pipeline latencies accurately.
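The staircase can be summarized with a simple two-parameter model (the function and parameter names below are ours, introduced only for illustration): total cycles equal the per-instruction issue cost times n, plus one miss penalty per 16-instruction cache line:

```go
package main

import "fmt"

// modelCycles is a two-parameter model of the ALU staircase: n ALU
// instructions cost n*issueCycles in the SIMD pipeline, plus one
// missPenalty for each 16-instruction L1I cache line fetched.
func modelCycles(n, issueCycles, missPenalty int) int {
	lines := (n + 15) / 16 // one I-cache miss per 16-instruction line
	return n*issueCycles + lines*missPenalty
}

func main() {
	// Within a cache line the curve rises by the 5-cycle slope; crossing
	// a line boundary adds the 300-cycle step observed in the staircase.
	fmt.Println(modelCycles(16, 5, 300), modelCycles(17, 5, 300)) // prints 380 685
}
```

With issueCycles = 5 and missPenalty = 300, the model reproduces both the 5-cycle slope and the 300+ cycle step each time n crosses a multiple of 16.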
Figure 6 also characterizes the behavior of the memory system of the R9 Nano GPU using the remaining three micro-benchmarks: L1 Access, DRAM Access, and L2 Access. Figure 6(b) suggests that each L1 hit takes around 150 cycles, an unusually long latency for an L1 cache. However, if we compare these results with Figure 6(c), which shows that each L2 cache hit also takes approximately 140-150 cycles, we can conclude that the L1 vector cache (L1V) is disabled by default in the ROCm platform. Therefore, we also disable the L1V cache in our simulator to match the real hardware.
As for the L2 Access micro-benchmark, since we run it on the real GPU for a large number of accesses, we use a blue dot to represent each reading in Figure 6(c). The two groups of blue dots, separated by a 0.03ms gap, are the result of an occasional DRAM refresh, which adds a small amount of time to the overall execution time. As we observe, although MGSim underestimates the execution time, the error is very small and, more importantly, we can track trends that match the real GPU.
Figure 6(d) shows the execution time when running the DRAM Access micro-benchmark. These results reveal that an L2 miss takes approximately 460 cycles to service, which is the time required to traverse the whole memory hierarchy. The combined results shown in Figure 6 demonstrate that MGSim is capable of simulating each layer of the memory hierarchy with very high accuracy.
VII-B Simulator Validation with the MGMark Suite
To validate our simulator using full system workloads, we run the complete set of benchmarks included in our MGMark suite. As we can see from Figure 7, except in two cases (AES and KM), our simulator achieves almost identical execution times. Overall, MGSim's reported performance closely matches the measured hardware runs, and the discrepancy remains small across all of the tested benchmarks.
VII-C Simulator Performance
MGSim was developed to deliver scalable simulation performance. To demonstrate this feature, we run MGSim configured with a single GPU running the MT benchmark and measure its simulation speed in kilo-instructions per second (KIPS). Although the simulators support different ISAs and model distinct components, to put this value into perspective, we run the same experiment on two other state-of-the-art simulators, Multi2Sim 5.0 and GPGPU-Sim, and measure their simulation speeds as well.
To support efficient design-space exploration in the context of multi-GPU systems, unlike contemporary GPU simulators, we designed MGSim with built-in multi-threaded execution to further accelerate simulation. Our simulations can take advantage of the multi-threaded/multi-core capabilities of contemporary CPU platforms. As shown in Figure 8, MGSim achieves good scalability when using multiple threads to run simulations. In particular, when 4 cores are used on the Intel Core i7-4770 CPU platform, MGSim achieves clear speedups in both functional emulation and architectural simulation, while preserving the same level of accuracy as in single-threaded simulation.
VII-D Evaluating Multi-GPU Configurations
So far, we have validated MGSim considering single-GPU scenarios employing both micro-benchmarks and our MGMark suite. Next, we evaluate the utility of MGSim, simulating the two multi-GPU configurations defined in Section IV-C: U-MGPU, a unified logical GPU configuration, which is widely adopted by the microarchitecture research community; and D-MGPU, a discrete multi-GPU configuration. Also, to help us analyze performance scaling, we use a baseline design that consists of a Monolithic Single GPU (M-SGPU) that combines 256 CUs on a single die, a GPU that would be impractical to fabricate using today’s technology.
Figure 9 presents the relationship between the cross-GPU traffic and overall performance. From the figure, we clearly see that the cross-GPU communication is a bottleneck in the full system, as the traffic on the interconnect is strongly correlated with the total execution time.
U-MGPU generally shows much larger slowdowns compared to D-MGPU. This is because the programmer cannot control where the data is placed or where a kernel is launched in the U-MGPU design. The lack of data-affinity scheduling produces a large amount of cross-GPU traffic, and hence, significantly reduces overall performance.
We also see that the different collaborative execution patterns play a role in overall performance. As AES and KM follow the Partitioned Data pattern, the programmer can eliminate all cross-GPU traffic, leading to high performance in D-MGPU. In FIR and SC, occasional cross-GPU accesses occur when using an Adjacent Access pattern, leading to lower performance in D-MGPU as compared to M-SGPU. SC is worse than FIR, because it needs to load more data from remote GPUs. GD and MT both need to read and/or write a relatively large amount of data from a remote GPU, suggesting Gather and Scatter are patterns that place high demands on multi-GPU communication. Finally, we see D-MGPU outperforms U-MGPU in the BS benchmark. Although BS has an Irregular access pattern, a majority of the swapping occurs between adjacent elements, making proper data partitioning still useful in improving performance.
We can draw the following design insights from the results of this case study. 1) Although unifying multiple GPUs under a single GPU interface simplifies programmability, the performance penalty is not negligible. Future research needs to explore solutions that reduce cross-GPU traffic to effectively leverage a unified-GPU system. 2) Multi-GPU systems that use unified memory and run workloads that generate cross-GPU memory accesses will require very high bandwidth between GPUs to make a multi-GPU system scalable. 3) Multi-GPU programmers need to have a clear picture of which collaborative execution pattern they are adopting in order to anticipate cross-GPU traffic. Programmers should also avoid patterns that generate excessive cross-GPU traffic. 4) As programmers are familiar with the programming model of discrete GPUs, giving control back to the programmer can be a reasonable solution for multi-GPU systems.
VIII Related Work
GPU Simulators: Ever since GPUs were introduced for high-performance general-purpose computing, researchers have developed GPU architectural simulators to support architectural exploration by the research community. GPGPU-Sim and Multi2Sim are two of a number of publicly available GPU simulators, modeling GPUs based on NVIDIA's PTX ISA and AMD's GCN1 ISA, respectively. The Gem5 AMD GPU model is a recent GPU simulator developed in parallel with MGSim, and is also capable of simulating the GCN3 ISA. While MGSim is inspired by these predecessor simulators, MGSim emphasizes strong software engineering principles, high-performance parallel simulation, and multi-GPU system modeling.
Parallel GPU simulators: To accelerate GPU simulation, parallel GPU simulators have been proposed. Barra mainly focuses on parallel functional emulation, which is very different from MGSim, since MGSim performs both emulation and timing simulation. GPUTejas is a Java-based, trace-driven, parallel architectural simulator that can achieve high performance and scalability. Rather than trace-driven, our simulator is execution-driven, in order to support the “no-magic” and “track both timing and data” design principles. The parallel simulator framework proposed by Lee et al. [34, 35] modifies GPGPU-Sim and only synchronizes when the processor accesses the memory system. Different from GPUTejas and Lee et al.'s frameworks, we achieve scalable speedup without compromising simulation accuracy. We also deliver a next-generation GPU simulator that can simulate a new ISA and multi-GPU systems.
GPU Computing Benchmarks: Because of the rising popularity of general-purpose computing on GPUs, a significant amount of effort has been put into creating benchmark suites, such as Rodinia, Parboil, and Lonestar. Rather than targeting multi-GPU systems, these benchmark suites exercise single-GPU computing capabilities. In addition, Chai and Hetero-Mark are benchmark suites that specifically focus on concurrent CPU-GPU execution. MGMark is different from existing benchmark suites, as it targets multi-GPU systems that support unified memory and cross-GPU memory access.
Multi-GPU Benchmarks: Ben-Nun et al. developed Maps-Multi and MGBench, a framework that categorizes multi-GPU memory access patterns and proposes an approach that can schedule memory placement and kernel execution efficiently. The goal of our work is to provide a workload suite to evaluate multi-GPU systems with modern features, including unified memory and cross-GPU memory access, which are not considered in Maps-Multi and MGBench. Also, our benchmark suite covers a broader range of multi-GPU execution patterns compared to MGBench, which only includes two full benchmarks.
Multi-GPU Micro-architecture Research: More recently, the research community has started to study how to efficiently accelerate computing with multi-GPU systems. As the major bottlenecks of such systems are the cross-GPU interconnect and the memory system, research has focused on optimizing memory organization. Ziabari et al. proposed a unified memory hierarchy (UMH) and NMOESI, using the large GPU DRAMs as cache units for system memory, achieving CPU-multi-GPU memory coherency. MCM-GPU considers a multi-chip module that encapsulates multiple GPUs in the same package; the authors introduced an L1.5 cache and used memory-affinity scheduling to reduce cross-GPU traffic. A NUMA-aware multi-GPU system, proposed by Milic et al., also tries to reduce traffic on the interconnect. While these studies are related to our own, we propose neither a new architecture nor a new algorithm, but instead deliver a framework to explore multi-GPU systems and the possibility of giving control of the multi-GPU system back to the programmer.
IX Conclusion
With the development of multi-GPU systems, the research community needs new tools to explore fast and scalable multi-GPU designs. In this paper, we have proposed MGSim, a new, flexible, and high-performance parallel multi-GPU simulator. We have extensively validated MGSim with both micro-benchmarks and full workloads against a real GPU. We also describe MGMark, a new benchmark suite for exploring multi-GPU execution patterns. Together, MGSim and MGMark serve as a novel framework that can be used to explore new and emerging multi-GPU systems.
In this paper, we presented a case study, comparing a discrete multi-GPU system with a unified multi-GPU system. We draw design lessons from our case study, suggesting that exposing a true multi-GPU interface to programmers is a valuable solution, but requires the programmer to have a clear picture of the underlying program pattern. We found that unifying the multi-GPU interface introduces a significant amount of cross-GPU traffic, and thus, requires a high-bandwidth interconnect as well as an efficient scheduling mechanism.
Designing a computer architecture simulator is a long-term effort. Despite the reasonable overall accuracy we can achieve, we will continue to support the simulator for the community, adding new features (e.g., supporting atomic operations) and additional workloads. We also plan to explore the multi-GPU design space more thoroughly, including different cross-GPU network topologies, network fabrics, and scaling the number of GPUs in the system.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  AMD. Amd radeon™ r9 series gaming graphics cards with high-bandwidth memory, 2015.
-  AMD. Dissecting the polaris architecture. 2016.
-  AMD. Graphics core next architecture, generation 3, reference guide. 2016.
-  AMD. Amd app sdk 3.0 getting started. 2017.
-  AMD. Vega instruction set architecture, reference guide. 2017.
-  AMD. Radeon compute profiler, 2018.
-  AMD. Rocm, a new era in open gpu computing, 2018.
-  Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. Mcm-gpu: Multi-chip-module gpus for continued performance scalability. ACM SIGARCH Computer Architecture News, 45(2):320–332, 2017.
-  The Go Authors. Effective go. 2009.
-  Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. Analyzing cuda workloads using a detailed gpu simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163–174. IEEE, 2009.
-  Tal Ben-Nun. mgbench, 2017.
-  Tal Ben-Nun, Ely Levy, Amnon Barak, and Eri Rubin. Memory access patterns: the missing piece of the multi-gpu puzzle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 19. ACM, 2015.
-  Eric A Brewer. Kubernetes and the path to cloud native. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 167–167. ACM, 2015.
-  Martin Burtscher, Rupesh Nasre, and Keshav Pingali. A quantitative study of irregular programs on gpus. In Workload Characterization (IISWC), 2012 IEEE International Symposium on, pages 141–151. IEEE, 2012.
-  Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
-  Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54. IEEE, 2009.
-  Sylvain Collange, David Defour, and David Parello. Barra, a parallel functional gpgpu simulator. 2009.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy H Campbell. A parallel implementation of k-means clustering on gpus. In PDPTA, volume 13, pages 212–312, 2008.
-  Denis Foley and John Danskin. Ultra-performance pascal gpu and nvlink interconnect. IEEE Micro, 37(2):7–17, 2017.
-  Richard M Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30–53, 1990.
-  Nitin A Gawande, Jeff A Daily, Charles Siegel, Nathan R Tallent, and Abhinav Vishnu. Scaling deep learning workloads: Nvidia dgx-1/pascal and intel knights landing. Future Generation Computer Systems, 2018.
-  Juan Gómez-Luna, Izzat El Hajj, Li-Wen Chang, Víctor García-Flores, Simon Garcia de Gonzalo, Thomas B Jablin, Antonio J Pena, and Wen-mei Hwu. Chai: collaborative heterogeneous applications for integrated-architectures. In Performance Analysis of Systems and Software (ISPASS), 2017 IEEE International Symposium on, pages 43–54. IEEE, 2017.
-  Xun Gong, Rafael Ubal, and David Kaeli. Multi2sim kepler: A detailed architectural gpu simulator. In Performance Analysis of Systems and Software (ISPASS), 2017 IEEE International Symposium on. IEEE, pages 153–154, 2017.
-  Anthony Gutierrez, Bradford M Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, et al. Lost in abstraction: Pitfalls of analyzing gpus at the intermediate language level. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 608–619. IEEE, 2018.
-  Owen Harrison and John Waldron. Aes encryption implementation and analysis on commodity graphics processing units. In International Workshop on Cryptographic Hardware and Embedded Systems, pages 209–226. Springer, 2007.
-  Wen-mei Hwu. Heterogeneous System Architecture: A new compute platform infrastructure. Morgan Kaufmann, 2015.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  Hai Jiang, Yi Chen, Zhi Qiao, Tien-Hsiung Weng, and Kuan-Ching Li. Scaling up mapreduce-based big data processing on multi-gpu systems. Cluster Computing, 18(1):369–383, 2015.
-  Rashid Kaleem, Sreepathi Pai, and Keshav Pingali. Stochastic gradient descent on gpus. In Proceedings of the 8th Workshop on General Purpose Processing using GPUs, pages 81–89. ACM, 2015.
-  David Kanter. Graphics processing requirements for enabling immersive vr. AMD White Paper, 2015.
-  Joonyoung Kim and Younsu Kim. Hbm: Memory solution for bandwidth-hungry processors. In Hot Chips 26 Symposium (HCS), 2014 IEEE, pages 1–24. IEEE, 2014.
-  Sangpil Lee and Won Woo Ro. Parallel gpu architecture simulation framework exploiting work allocation unit parallelism. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, pages 107–117. IEEE, 2013.
-  Sangpil Lee and Won Woo Ro. Parallel gpu architecture simulation framework exploiting architectural-level parallelism with timing error prediction. IEEE Transactions on Computers, (4):1253–1265, 2016.
-  Wenqiang Li, Guanghao Jin, Xuewen Cui, and Simon See. An evaluation of unified memory technology on nvidia gpus. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages 1092–1098. IEEE, 2015.
-  Yong Lim and Sydney Parker. Fir filter design over a discrete powers-of-two coefficient space. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31(3):583–591, 1983.
-  Pu Liu, Zhenyu Qi, Hang Li, Lingling Jin, Wei Wu, SX-D Tan, and Jun Yang. Fast thermal simulation for architecture level dynamic thermal management. In Computer-Aided Design, 2005. ICCAD-2005. IEEE/ACM International Conference on, pages 639–644. IEEE, 2005.
-  James D MacBeth and Larry J Merville. An empirical examination of the black-scholes call option pricing model. The Journal of Finance, 34(5):1173–1186, 1979.
-  Geetika Malhotra, Seep Goel, and Smruti R Sarangi. Gputejas: A parallel simulator for gpu architectures. In High Performance Computing (HiPC), 2014 21st International Conference on, pages 1–10. IEEE, 2014.
-  Robert C Martin. Agile software development: principles, patterns, and practices. Prentice Hall, 2002.
-  Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, and David Nellans. Beyond the socket: Numa-aware gpus. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 123–135. ACM, 2017.
-  NVIDIA. Cuda programming guide, 2010.
-  NVIDIA. Developing a linux kernel module using gpudirect rdma. 2018.
-  Open Source Initiative. The MIT License.
-  Cristiano Pereira, Harish Patil, and Brad Calder. Reproducible simulation of multi-threaded workloads for architecture design exploration. In Workload Characterization, 2008. IISWC 2008. IEEE International Symposium on, pages 173–182. IEEE, 2008.
-  Jan M Rabaey, Anantha P Chandrakasan, and Borivoje Nikolic. Digital integrated circuits, volume 2. 2002.
-  Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore gpus. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10. IEEE, 2009.
-  Andreas Schäfer and Dietmar Fey. High performance stencil code algorithms for gpgpus. In ICCS, pages 2027–2036, 2011.
-  Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun. Very deep multilingual convolutional neural networks for lvcsr. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4955–4959. IEEE, 2016.
-  John E Stone, David Gohara, and Guochun Shi. Opencl: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering, 12(3):66–73, 2010.
-  John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 127, 2012.
-  Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David Kaeli. Hetero-mark, a benchmark suite for cpu-gpu collaborative computing. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10. IEEE, 2016.
-  M. Khavari Tavana, Y. Fei, and D. R. Kaeli. Nacre: Durable, secure and energy-efficient non-volatile memory utilizing data versioning. IEEE Transactions on Emerging Topics in Computing, page 1.
-  Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2sim: a simulation framework for cpu-gpu computing. In Parallel Architectures and Compilation Techniques (PACT), 2012 21st International Conference on, pages 335–344. IEEE, 2012.
-  Renato Vacondio, Alessandro Dal Palù, and Paolo Mignosa. Gpu-enhanced finite volume shallow water solver for fast flood simulations. Environmental modelling & software, 57:60–75, 2014.
-  Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, and Gang Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
-  Amir Kavyan Ziabari, Yifan Sun, Yenai Ma, Dana Schaa, José L Abellán, Rafael Ubal, John Kim, Ajay Joshi, and David Kaeli. Umh: A hardware-based unified memory hierarchy for systems with multiple discrete gpus. ACM Transactions on Architecture and Code Optimization (TACO), 13(4):35, 2016.