Graphics processing units (GPUs) have become the system of choice for accelerating a variety of workloads, including deep learning, graph applications, data mining, and big data processing. These applications are growing continuously in size and are exhausting the compute and memory resources of single-GPU systems. Hence, the community is actively migrating towards multi-GPU (MGPU) systems to accelerate these workloads. To enable inter-GPU communication, GPU vendors have proposed a number of mechanisms (see Table I). However, achieving near-ideal speedup (w.r.t. a single GPU) when using multiple GPUs is challenging because of inefficiencies in MGPU system design and the associated programming model.
Inefficiency 1: In the existing discrete MGPU systems, each GPU has its own local main memory (MM) as shown in Figure 1(left). Each GPU in the MGPU system can access the other GPUs’ MM through low-bandwidth, high-latency links. These off-chip links have 5× to 10× lower bandwidth (BW) (for transferring data between GPUs, and between CPU and GPU) than the BW for accessing the local MM of a GPU. Thus, accessing a remote GPU’s MM increases the application execution time. Moreover, we observe non-uniform memory access (NUMA) effects when accessing a remote memory, resulting in under-utilization of GPU compute resources and therefore sub-optimal performance.
Inefficiency 2: Today’s MGPU programming model requires a programmer to manually maintain coherency by replicating data and/or accessing non-cached data from a remote memory using the expensive off-chip links. As a result, additional traffic traverses the off-chip links. In addition, the existing weak data-race-free (DRF) consistency model for GPUs requires additional effort from the programmer to avoid data races by providing explicit barriers.
As a result of these inefficiencies, we cannot leverage the full potential of MGPU systems. We provide more details about these two inefficiencies with experimental evaluation in Section 2. Researchers have proposed various solutions to address the aforementioned inefficiencies in MGPU systems. In particular, the solution whose objectives are closest to ours is by Arunkumar et al. [arunkumar2017mcm], who proposed a package-level integration of a multi-chip-module GPU (MCM-GPU) (see Figure 1(left)), where each GPU module has its own local DRAM. Here, local accesses have low latency, but remote accesses have very high latency. In parallel, other hardware and software optimizations such as L1.5$ [arunkumar2017mcm], CARVE [young2018combining], and HMG [hmg] have been proposed to address the two inefficiencies mentioned earlier.
To simplify programming, reduce the data transfer latency, and increase the memory utilization efficiency, we propose a multi-GPU system with truly shared memory (MGPU-TSM). Unlike the MCM-GPU (see Figure 1), an MGPU-TSM system allows all GPUs to directly access the entire physical main memory of the system, thus eliminating the non-uniform memory access (NUMA) effects observed in traditional MGPU systems. In addition, an MGPU-TSM does not require an L1.5$ to reduce remote access overhead. Moreover, MGPU-TSM paves the way for a low-overhead coherence protocol as well as a simpler consistency model for MGPU systems. In this work, we compare the performance of an MGPU-TSM design with state-of-the-art RDMA-based and unified memory (UM)-based MGPU designs using MGPUSim [sun2019mgpusim] to demonstrate the benefits of MGPU-TSM systems.
2 Challenges in Existing MGPU Systems
2.1 RDMA Access Cost
In this section, using the data access latency metric, we present the motivation for providing shared main memory in an MGPU system. Here, we run the commonly-used matrix multiplication kernel SGEMM, from NVIDIA’s cuBLAS library [nvidia2008cublas], on an MGPU system with V100 GPUs (compute capability of 7.0). We use two GPUs connected through NVLink 2.0 (50 GB/s bidirectional bandwidth). The conclusions of our analysis should be broadly applicable to systems with more than 2 GPUs that use GPU-GPU RDMA.
The computations in the SGEMM kernel consist of three matrices A, B, and C. In our experiment, we distribute the matrices in the memory of two GPUs (GPU0 and GPU1) and examine the performance degradation caused by different degrees of remote access (using P2P direct access as an example) when the SGEMM is executed on GPU0. We use the aL-bR format to represent a% local access and b% remote access for GPU0, where a and b are integers. We evaluate the following four matrix distributions across memory:
Matrices A, B and C are in GPU0’s memory. This leads to 100% local access for GPU0 (100L-0R).
Matrices A and B are in GPU0’s memory, and C is in GPU1’s memory (67L-33R).
Matrix A is in GPU0’s memory, and matrices B and C are in GPU1’s memory (33L-67R).
Matrices A, B and C are in GPU1’s memory. This leads to 100% remote access for GPU0 (0L-100R).
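The aL-bR labels above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `access_split` and the dictionary encoding of the placement are hypothetical, and it assumes that each of the three matrices contributes equally to GPU0's accesses, which is what the 67/33 labels suggest.

```python
def access_split(placement, local_gpu=0):
    """Return the aL-bR label for a kernel running on `local_gpu`.

    `placement` maps a matrix name to the GPU whose memory holds it.
    Assumption: every matrix contributes equally to the access count.
    """
    total = len(placement)
    local = sum(1 for gpu in placement.values() if gpu == local_gpu)
    a = round(100 * local / total)  # percentage of local accesses
    b = 100 - a                     # the remainder is remote
    return f"{a}L-{b}R"


# The four distributions evaluated in the text (0 = GPU0, 1 = GPU1).
distributions = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]
labels = [access_split(p) for p in distributions]
```
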
Figure 2 shows the runtime for the SGEMM kernel execution with different matrix sizes for the above four matrix distributions. For smaller matrix sizes, accessing remote memory is very expensive because of the fixed remote access overhead. The runtime of SGEMM for the 0L-100R distribution for a 4k×4k matrix is 27× longer than that of the 100L-0R distribution. On the other hand, the runtime of SGEMM for the 0L-100R distribution for the 32k×32k matrix is 12.2× longer than that of the 100L-0R distribution. Here, the fixed remote access overhead gets amortized. From these experiments, we can see the significant impact of remote accesses on performance, and in turn, argue that to improve the performance of applications, we need to avoid remote accesses as much as possible.
2.2 Data Sharing and Programmability
Data sharing across multiple GPUs during kernel execution leads to programming challenges, as the programmer must choose between programmability and performance. In this section, we examine the DNN training process on MGPU systems when leveraging different data-parallelism schemes. We highlight how different mechanisms trade off programmability for performance. The three stages of DNN training include forward propagation (FP), backward propagation (BP), and weight update (WU). During the FP and BP stages, different GPUs calculate their local stochastic gradient descents (SGDs) that are later used to update the values of weights used for the next iteration. (More details about the training stages can be found in [mojumder2018profiling].)
In Algorithms 1, 2 and 3, we consider three different ways a programmer can perform the WU stage, assuming a 2-GPU MGPU system. Algorithm 1 shows that when using memcpy, the programmer must maintain coherence explicitly by periodically copying data to GPU1’s memory. Thus, there is an additional copy of data, i.e., SGD (g_GPU1), in GPU0’s memory, leading to additional memory usage. Nonetheless, this mechanism can be efficient in terms of kernel runtime because P2P memcpy can run asynchronously. Algorithm 2 shows how P2P direct access with RDMA can eliminate the data copy step, but at the expense of accessing data using off-chip links. Still, the programmer must transfer the data from the CPU to the GPUs. Algorithm 3 illustrates that a shared main memory could ease programmability and eliminate explicit GPU-to-GPU or CPU-to-GPU data transfers. Note that UM and Zerocopy solutions use Algorithm 3. UM, as proposed by NVIDIA, eases programming with a software abstraction, but suffers from performance degradation due to inefficient page-fault support and expensive remote accesses [trinayan-hpca2020]. A Zerocopy solution does not use GPU memory at all; instead, the GPUs access pinned CPU memory using the off-chip (PCIe) links [negrut2014unified]. We argue that we need a solution that does not trade off programmability to gain performance. A programmer can use Algorithm 3 on our envisioned MGPU-TSM and enjoy both ease of programmability and high performance.
In the pseudocode, the right arrows point to destination variables of an operation.
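To make the contrast between the three WU styles concrete, the following Python sketch models them on toy data. It is not the paper's pseudocode: the `Gpu` class, the function names, and the use of a plain average as the update rule are all assumptions made for illustration. The counters only record where data movement happens (an explicit replica, per-element remote reads, or neither).

```python
class Gpu:
    """Toy GPU: a private memory (dict) plus counters for off-chip traffic."""

    def __init__(self, gradient):
        self.mem = {"g": list(gradient)}
        self.copied = 0  # elements moved by an explicit memcpy
        self.remote = 0  # elements read directly from a peer's memory


def average(g0, g1):
    # Assumed update rule: element-wise mean of the two local gradients.
    return [(x + y) / 2 for x, y in zip(g0, g1)]


def wu_memcpy(gpu0, gpu1):
    # Memcpy style: replicate GPU1's gradient in GPU0's memory, then work locally.
    gpu0.mem["g_gpu1_copy"] = list(gpu1.mem["g"])
    gpu0.copied += len(gpu1.mem["g"])
    return average(gpu0.mem["g"], gpu0.mem["g_gpu1_copy"])


def wu_direct(gpu0, gpu1):
    # P2P direct-access style: no replica, but every peer element crosses the link.
    gpu0.remote += len(gpu1.mem["g"])
    return average(gpu0.mem["g"], gpu1.mem["g"])


def wu_shared(shared_mem):
    # Shared-memory style: both gradients live in one memory; no transfer step.
    return average(shared_mem["g_gpu0"], shared_mem["g_gpu1"])


g0, g1 = [1.0, 2.0, 3.0], [3.0, 4.0, 5.0]
a, b = Gpu(g0), Gpu(g1)
c, d = Gpu(g0), Gpu(g1)
r_memcpy = wu_memcpy(a, b)
r_direct = wu_direct(c, d)
r_shared = wu_shared({"g_gpu0": g0, "g_gpu1": g1})
```

All three variants compute the same result; they differ only in the extra replica held by the memcpy style and the remote reads issued by the direct-access style.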
3 MGPU-TSM System
To explain and evaluate our envisioned MGPU-TSM architecture, we consider an MGPU-TSM system consisting of 4 GPUs, 1 CPU and 4 HBM stacks that provide a total of 32GB MM (we are using a 32GB capacity as an example to explain the MGPU-TSM architecture – our MGPU-TSM system works with larger memory). The specifications of the GPU, CPU, and HBM stacks are provided in Table II.
3.1 MGPU-TSM Architecture
Figure 1(right) shows the logical view of our proposed MGPU-TSM system. We leverage the current common design for compute units (CUs), where each CU has a dedicated write-through L1$. All the L1$s are connected to the L2$s using a crossbar network. For our proposed MGPU-TSM system, we make changes to the memory hierarchy, starting from L2$ down to the MM.
GPUs typically have distributed L2$ banks, where each L2$ bank serves one memory controller (MC). In our envisioned MGPU-TSM system, we have 8 L2$ banks per GPU and 4 HBM stacks that provide a total of 32 GB of MM. Thus, for each GPU, an L2 MC controls 4GB of memory. Each of the 8GB DRAMs is further distributed into 16 banks, where each bank has a 512MB capacity.
Each L2 bank, as well as each DRAM bank, is connected to a centralized switch through a dedicated 32GB/s bidirectional link. Thus, each GPU has a total of 256GB/s of bidirectional BW between the L2$ and MM. With 4 GPUs, the total BW is 1TB/s. This also implies that each memory access requires a two-hop communication, from L2$ to the Switch, then from the Switch to MM, and vice versa. Recently, NVIDIA introduced NVSwitch [ishii2018nvswitch], providing 18 ports and 928GB/s of bidirectional BW, supporting RDMA connectivity across multiple GPUs. Hence, our assumed 32-port switch with 1TB/s aggregate bidirectional BW is realistic.
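The bandwidth and port figures in this paragraph follow from simple arithmetic, sketched below in Python. The constant and function names are hypothetical; the numbers (8 L2$ banks per GPU, 32GB/s per link, 4 GPUs) come from the text, and only the L2-side ports of the switch are counted.

```python
L2_BANKS_PER_GPU = 8   # L2$ banks per GPU, as stated in the text
LINK_BW_GBPS = 32      # GB/s, bidirectional, per dedicated link


def l2_side_ports(num_gpus):
    """Switch ports needed to give every L2$ bank its own link."""
    return num_gpus * L2_BANKS_PER_GPU


def per_gpu_bw_gbps():
    """Bidirectional L2$-to-MM bandwidth available to one GPU."""
    return L2_BANKS_PER_GPU * LINK_BW_GBPS


def aggregate_bw_gbps(num_gpus):
    """Total bidirectional bandwidth across all GPUs."""
    return num_gpus * per_gpu_bw_gbps()


ports = l2_side_ports(4)       # 32 ports
per_gpu = per_gpu_bw_gbps()    # 256 GB/s
total = aggregate_bw_gbps(4)   # 1024 GB/s, i.e. about 1TB/s
```
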
The key advantage of our TSM lies in physically-unified MM, providing uniform memory access (UMA) across the system. This physically-unified design completely removes the need for remote accesses. In addition, having a centralized location for data access by multiple GPUs provides the opportunity to coalesce data accesses at the MM level and makes it easier to provide support for coherency given the lower overhead in communication. Moreover, having more memory banks helps improve the throughput by an efficient allocation of data, i.e., allocating consecutive pages to neighboring DRAM banks in a round-robin manner.
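The round-robin allocation mentioned above can be sketched as a page-to-bank mapping. This is a minimal illustration rather than the actual address mapping of the design: the function names are hypothetical, the 4KB page size is the one used in the evaluation, and the bank count assumes 4 HBM stacks with 16 banks of 512MB each.

```python
PAGE_SIZE = 4 * 1024              # 4KB pages (page size used in the evaluation)
HBM_STACKS = 4
BANKS_PER_STACK = 16
BANK_CAPACITY = 512 * 1024 ** 2   # 512MB per DRAM bank
NUM_BANKS = HBM_STACKS * BANKS_PER_STACK  # 64 banks in total


def bank_of(addr):
    """Consecutive pages map to neighboring banks in round-robin order."""
    page = addr // PAGE_SIZE
    return page % NUM_BANKS


def offset_in_bank(addr):
    """Byte offset of `addr` inside the bank selected by bank_of()."""
    page = addr // PAGE_SIZE
    return (page // NUM_BANKS) * PAGE_SIZE + addr % PAGE_SIZE


total_capacity = NUM_BANKS * BANK_CAPACITY  # 32GB
```

With this mapping, a streaming access pattern over consecutive pages touches all 64 banks before returning to the first one, which is the property that spreads load across banks.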
| CPU | Ryzen 9 3950X | 7 | 144 | 105 |
Table note: Determined using technology scaling rules.
3.2 Preliminary Evaluation
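| Component | Configuration | per GPU | Component | Configuration | per GPU |
|---|---|---|---|---|---|
| L1 Vector $ | | | | | |
| L1 Scalar $ | 16KB 4-way | 8 | L1I$ | 32KB 4-way | 8 |
| L2$ | 256KB 16-way | 8 | DRAM | 512MB HBM | 16 |
| L1 TLB | 1 set, 32-way | 48 | L2 TLB | 32 sets, 16-way | 1 |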
In this section, we discuss the potential performance benefits of an MGPU-TSM system over the existing MGPU system configurations, i.e., MGPU systems that use RDMA P2P direct access (referred to as RDMA), and the MGPU system that uses unified memory (referred to as UM). Table III shows the configuration for each GPU in our evaluation, where we allocate memory by interleaving the pages across all the memory modules in the MGPU system. For a fair comparison, we use the same GPU specifications i.e. CU count, L1$ and L2$ sizes and number of total DRAM banks (16 for each GPU) for RDMA, UM and TSM configurations. We use a page size of 4KB. For the RDMA configuration, we use PCIe 4.0 links to provide 32GB/s bidirectional BW for remote accesses. UM provides a unified view of the total memory to the programmer by virtually combining the CPU and GPU memories. UM uses a first touch policy for page placement. To evaluate our design we use the MGPUSim simulator [sun2019mgpusim], which is designed specifically to support MGPU simulation. We use 12 standard benchmarks from the Hetero-Mark [heteromark], PolyBench [pouchet2012polybench], SHOC [shoc], and DNNMark [dong2017dnnmark] benchmark suites for our preliminary evaluation.
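The two page-placement schemes named in this paragraph (interleaving across memory modules and UM's first-touch policy) can be compared on a toy access trace. The sketch below is a simplified Python model with hypothetical function names; it ignores page migration and caching and only counts how many accesses land on a memory module other than the requesting GPU's own.

```python
def first_touch_placement(accesses):
    """Place each page in the memory of the first GPU that touches it.

    `accesses` is a sequence of (gpu_id, page) pairs in program order.
    """
    owner = {}
    for gpu, page in accesses:
        owner.setdefault(page, gpu)  # only the first touch decides
    return owner


def interleaved_placement(pages, num_modules):
    """Spread pages across memory modules in round-robin order."""
    return {page: page % num_modules for page in pages}


def remote_fraction(accesses, owner):
    """Fraction of accesses served by a module other than the requester's."""
    remote = sum(1 for gpu, page in accesses if owner[page] != gpu)
    return remote / len(accesses)


# Hypothetical trace: GPU0 touches pages 0 and 1 first, GPU1 reuses them.
trace = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
ft_owner = first_touch_placement(trace)
il_owner = interleaved_placement({page for _, page in trace}, num_modules=2)
ft_remote = remote_fraction(trace, ft_owner)
il_remote = remote_fraction(trace, il_owner)
```

In a physically unified memory, the notion of a remote module disappears, so neither policy incurs this cost.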
Figure 3 shows a comparison of TSM, RDMA and UM. TSM is, on average, 3.9× and 8.2× faster than RDMA and UM, respectively. TSM is faster than RDMA because RDMA requires data copy operations between the CPU and GPUs. During kernel execution, all GPUs are required to use RDMA to access data residing in the other GPUs’ memories. UM suffers from an expensive page fault service mechanism and page migration through the off-chip links.
4 MGPU-TSM System Design Challenges
Our preliminary comparison of TSM with RDMA and UM shows that TSM is quite promising, but it also comes with several challenges. Here we discuss these challenges and our future research direction to address those challenges.
4.1 Data Sharing Within and Across GPUs
In the MGPU-TSM system, different CUs within and across GPUs can access the same memory location. Hence, to maintain correctness, we need a low-overhead, scalable cache coherency protocol and memory consistency model, such as HALCONE [mojumder2020halcone]. Traditional snooping-based or directory-based coherency protocols, such as MESI and MOESI, can lead to large inter-GPU and intra-GPU communication latencies [singh2013cache]. Timestamp-based coherence [tabbakh2018g], which allows auto-invalidation of cache blocks and reduces the traffic overhead, can be suitable for an MGPU-TSM system. A wide range of consistency models, including sequential consistency, weak consistency, and release consistency, have been proposed for single-GPU systems. We need to design consistency models for an MGPU-TSM system that runs thousands of threads.
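The auto-invalidation idea behind timestamp-based coherence can be illustrated with a toy lease-based cache in Python. This is a deliberately simplified sketch, not HALCONE or the protocol of [tabbakh2018g]: the `LeaseCache` class, its fixed lease length, and the explicit `now` argument are all assumptions. Each cached line carries an expiry time and is treated as invalid once that time passes, so no invalidation message is ever sent between caches.

```python
class LeaseCache:
    """Toy private cache whose lines expire on their own after a lease."""

    def __init__(self, memory, lease):
        self.memory = memory  # shared backing store: addr -> value
        self.lease = lease    # logical time units a cached copy stays valid
        self.lines = {}       # addr -> (value, expiry_time)
        self.misses = 0

    def read(self, addr, now):
        line = self.lines.get(addr)
        if line is not None and now < line[1]:
            return line[0]    # lease still valid: cache hit
        self.misses += 1      # line absent or self-invalidated
        value = self.memory[addr]
        self.lines[addr] = (value, now + self.lease)
        return value

    def write(self, addr, value, now):
        # Write-through: update shared memory and refresh the local copy.
        self.memory[addr] = value
        self.lines[addr] = (value, now + self.lease)


shared = {0x10: 1}
c0 = LeaseCache(shared, lease=10)
c1 = LeaseCache(shared, lease=10)
first = c0.read(0x10, now=0)          # miss, lease valid until time 10
c1.write(0x10, 2, now=1)              # no invalidation message sent to c0
within_lease = c0.read(0x10, now=5)   # hit on the old copy
after_lease = c0.read(0x10, now=10)   # lease expired, refetched from memory
```

The read at time 5 still returns the old value. Real timestamp-based protocols add write-side timestamp rules so that such a read is logically ordered before the write; the toy omits those rules and shows only the self-expiry mechanism.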
4.2 L2-to-MM Network
The L2-to-MM network plays a critical role in the overall performance of an MGPU-TSM system. In our example system, we used direct links between the L2$ and the Switch and between the Switch and MM. As we scale the number of GPUs, the radix of the Switch grows proportionally. A high-radix switch leads to lower performance, and at the same time, the resulting area and power become problematic. In our future work, we will explore different high-BW, low-latency networks that scale well with GPU count.
4.3 CPU-GPU Memory Accesses
CPUs are typically latency-sensitive, while GPUs are BW-sensitive. Since the MGPU-TSM system provides the same physical memory to both CPUs and GPUs, it is imperative to design a network protocol that allows low-latency data access to the CPU and high-BW data access to the GPUs.
4.4 Integration Technology
To design a scaled-up MGPU-TSM system, we envision using 2.5D integration technology with multiple interposers. Each interposer will have multiple GPU chiplets, a CPU chiplet, and multiple HBM stacks. For intra-interposer communication, we can use electrical links, while for long-distance inter-interposer communication, we can use photonic links. To design such a multi-interposer system, we need to develop a cross-layer design automation technique that jointly optimizes the system architecture, circuit design, and physical design.
5 Conclusion
In this work, we showed that the performance of MGPU systems is limited by expensive remote data accesses through off-chip links. At the same time, programming MGPU systems is difficult due to a lack of hardware support for coherency. To address these issues, we proposed an MGPU-TSM architecture that eliminates remote data accesses, improves memory utilization, and reduces programmer burden. We also highlighted the major challenges that must be overcome to make MGPU-TSM viable.