MGPU-TSM: A Multi-GPU System with Truly Shared Memory

The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU and demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count because of the overhead of data movement across GPUs. Moreover, the lack of hardware support for coherence exacerbates the problem, because a programmer must either replicate data across GPUs or fetch remote data over high-overhead off-chip links. To address these problems, we propose a multi-GPU system with truly shared memory (MGPU-TSM), where the main memory is physically shared across all the GPUs. MGPU-TSM eliminates remote accesses and avoids data replication, thereby simplifying the memory hierarchy. Our preliminary analysis shows that MGPU-TSM with 4 GPUs performs, on average, 3.9× better than the current best-performing multi-GPU configuration for standard application benchmarks.


1 Introduction

Method | Definition | MM Access Latency | MM Access Bandwidth | Data Duplication | Improves Programmability | Improves GPU Mem. Usage
P2P Memcpy | Data copy from one GPU MM to another GPU MM | High | Low | Yes | ✗ | ✗
P2P Direct | Data is accessed directly from the remote GPU memory and cached in the requesting GPU's L1$ | High | Low | Partial | – | –
Zerocopy | Data is directly accessed from CPU memory by all GPUs without copying the data into GPU memory or GPU cache | Extremely high | Low | No | – | ✓✓
Unified Memory | Data is either transferred or accessed directly from the current owner, based on how the runtime decides to serve a page fault | Extremely high | Low | No | ✓✓ | –
MGPU-TSM | All CPUs and GPUs can access the physically shared main memory seamlessly using a low-latency network | Low | High | No | ✓✓ | ✓✓

TABLE I: Comparison of the communication mechanisms available in existing MGPUs vs. the communication scheme in MGPU-TSM. Programmability and memory usage of each mechanism are compared w.r.t. P2P Memcpy; latency and BW are compared w.r.t. local MM access latency and BW. '✗', '✓', and '✓✓' indicate 'no', 'fair', and 'good', respectively.

Graphics processing units (GPUs) have become the system of choice for accelerating a variety of workloads, including deep learning, graph applications, data mining, and big data processing. These applications continue to grow in size and are exhausting the compute and memory resources of single-GPU systems. Hence, the community is actively migrating to multi-GPU (MGPU) systems to accelerate these workloads. To enable inter-GPU communication, GPU vendors have proposed a number of mechanisms (see Table I). However, achieving near-ideal speedup (w.r.t. a single GPU) when using multiple GPUs is challenging because of inefficiencies in MGPU system design and the associated programming model.

Inefficiency 1: In existing discrete MGPU systems, each GPU has its own local main memory (MM), as shown in Figure 1(left). Each GPU can access the other GPUs' MM through low-bandwidth, high-latency off-chip links. These links provide 5× to 10× lower bandwidth (BW) for transferring data between GPUs, and between the CPU and a GPU, than the BW available for accessing a GPU's local MM. Thus, accessing a remote GPU's MM increases application execution time. Moreover, we observe non-uniform memory access (NUMA) effects when accessing remote memory, resulting in under-utilization of GPU compute resources and, therefore, sub-optimal performance.

Inefficiency 2: Today's MGPU programming model requires a programmer to manually maintain coherence by replicating data and/or accessing non-cached data from a remote memory over the expensive off-chip links. As a result, additional traffic traverses the off-chip links. In addition, the weak data-race-free (DRF) consistency model used by GPUs requires additional effort from the programmer, who must avoid data races by inserting explicit barriers.

As a result of these inefficiencies, we cannot leverage the full potential of MGPU systems. We provide more details about these two inefficiencies, with experimental evaluation, in Section 2. Researchers have proposed various solutions to address them. In particular, the work with objectives closest to ours is by Arunkumar et al. [arunkumar2017mcm], who proposed a package-level integration of multi-chip-module GPUs (MCM-GPU) (see Figure 1(left)), where each GPU module has its own local DRAM. Here, local accesses have low latency, but remote accesses have very high latency. In parallel, other hardware and software optimizations, such as L1.5$ [arunkumar2017mcm], CARVE [young2018combining], and HMG [hmg], have been proposed to address the two inefficiencies mentioned above.

To simplify programming, reduce data transfer latency, and increase memory utilization efficiency, we propose a multi-GPU system with truly shared memory (MGPU-TSM). Unlike the MCM-GPU (see Figure 1), an MGPU-TSM system allows all GPUs to directly access the entire physical main memory of the system, thus eliminating the NUMA effects observed in traditional MGPU systems. In addition, an MGPU-TSM does not require an L1.5$ to reduce remote access overhead. Moreover, MGPU-TSM paves the way for a low-overhead coherence protocol as well as a simpler consistency model for MGPU systems. In this work, we compare the performance of an MGPU-TSM design with state-of-the-art RDMA- and unified memory (UM)-based MGPU designs using MGPUSim [sun2019mgpusim] to demonstrate the benefits of MGPU-TSM systems.

Fig. 1: MCM-GPU system (left) and Proposed MGPU-TSM system (right).

2 Challenges in Existing MGPU Systems

2.1 RDMA Access Cost

In this section, we use data access latency to motivate providing shared main memory in an MGPU system. We run the commonly-used matrix multiplication kernel SGEMM, from NVIDIA's cuBLAS library [nvidia2008cublas], on an MGPU system with V100 GPUs (compute capability 7.0). We use two GPUs connected through NVLink 2.0 (50 GB/s bidirectional bandwidth). The conclusions of our analysis should be broadly applicable to systems with more than 2 GPUs that use GPU-GPU RDMA.

The SGEMM kernel operates on three matrices: A, B, and C. In our experiment, we distribute the matrices across the memories of two GPUs (GPU0 and GPU1) and examine the performance degradation caused by different degrees of remote access (using P2P direct access as an example) when SGEMM executes on GPU0. We use the aL-bR format to denote a% local and b% remote accesses for GPU0, where a and b are integers. We evaluate the following four matrix distributions across memory (a minimal setup sketch follows the list):

  1. Matrices A, B and C are in GPU0’s memory. This leads to 100% local access for GPU0 (100L-0R).

  2. Matrices A and B are in GPU0’s memory, and C is in GPU1’s memory (67L-33R).

  3. Matrix A is in GPU0’s memory, and matrices B and C are in GPU1’s memory (33L-67R).

  4. Matrices A, B and C are in GPU1’s memory. This leads to 100% remote access for GPU0 (0L-100R).
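As a concrete illustration of these placements, the sketch below sets up the 67L-33R case (A and B on GPU0, C on GPU1) with CUDA peer-to-peer direct access; it is our illustration, not the exact benchmark code behind Figure 2, and matrix initialization, error checking, and timing are omitted.

```cpp
// Minimal sketch of the 67L-33R placement: A and B in GPU0's memory,
// C in GPU1's memory, SGEMM launched from GPU0 via P2P direct access.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;                              // 4k x 4k matrices
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *dA, *dB, *dC;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);                // GPU0 may dereference GPU1 pointers
    cudaMalloc(&dA, bytes);                          // A: local to GPU0
    cudaMalloc(&dB, bytes);                          // B: local to GPU0

    cudaSetDevice(1);
    cudaMalloc(&dC, bytes);                          // C: remote for GPU0

    cudaSetDevice(0);                                // kernel runs on GPU0
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Every access to C crosses the inter-GPU link, which is what the
    // 33R component of this configuration exercises.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB);
    cudaSetDevice(1); cudaFree(dC);
    return 0;
}
```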

Figure 2 shows the runtime of the SGEMM kernel for different matrix sizes under the above four matrix distributions. For smaller matrices, accessing remote memory is very expensive because of the fixed remote access overhead: the runtime of SGEMM for the 0L-100R distribution on a 4k×4k matrix is 27× longer than that of the 100L-0R distribution. For the 32k×32k matrix, the 0L-100R runtime is 12.2× longer than the 100L-0R runtime, as the fixed remote access overhead is amortized. These experiments show the significant impact of remote accesses on performance and, in turn, argue that to improve application performance we need to avoid remote accesses as much as possible.

Fig. 2: Runtime of SGEMM kernel from cuBLAS library for different matrix sizes. Each bar corresponds to a different distribution of local and remote memory accesses.

2.2 Data Sharing and Programmability

Data sharing across multiple GPUs during kernel execution leads to programming challenges, as the programmer must choose between programmability and performance. In this section, we examine the DNN training process on MGPU systems when leveraging different data-parallelism schemes, and highlight how different mechanisms trade off programmability for performance. The three stages of DNN training are forward propagation (FP), backward propagation (BP), and weight update (WU). During the FP and BP stages, different GPUs compute their local stochastic gradient descent (SGD) updates, which are later used to update the weights for the next iteration.

More details about the training stages can be found in [mojumder2018profiling].

In Algorithms 1, 2, and 3, we consider three different ways a programmer can perform the WU stage, assuming a 2-GPU system. Algorithm 1 shows that when using memcpy, the programmer must maintain coherence explicitly by periodically copying data to GPU1's memory. Thus, there is an additional copy of the data, i.e., the SGD gradients (gGPU1), in GPU0's memory, leading to additional memory usage. Nonetheless, this mechanism can be efficient in terms of kernel runtime because P2P memcpy can run asynchronously. Algorithm 2 shows how P2P direct access with RDMA can eliminate the data copy step, but at the expense of accessing data over off-chip links. Still, the programmer must transfer the data from the CPU to the GPUs. Algorithm 3 illustrates that a shared main memory could ease programmability and eliminate explicit GPU-to-GPU or CPU-to-GPU data transfers (a CUDA sketch of the copy step appears after the pseudocode). Note that UM and Zerocopy solutions follow Algorithm 3. UM, as proposed by NVIDIA, eases programming with a software abstraction, but suffers from performance degradation due to inefficient page-fault support and expensive remote accesses [trinayan-hpca2020]. A Zerocopy solution does not use GPU memory at all; the GPUs access pinned CPU memory over the off-chip (PCIe) links [negrut2014unified]. We argue for a solution that does not trade off programmability for performance. A programmer can use Algorithm 3 on our envisioned MGPU-TSM and enjoy both ease of programming and high performance.

Initialization: weights in CPU;
Copy weights from CPU to GPU0 → wGPU0;
Copy weights from CPU to GPU1 → wGPU1;
FP+BP on GPU0 using wGPU0 → gGPU0;
FP+BP on GPU1 using wGPU1 → gGPU1;
Copy gGPU1 from GPU1 to GPU0 → gGPU0Copy;
WU on GPU0 using (gGPU0, gGPU0Copy) → wGPU0;
Copy wGPU0 from GPU0 to GPU1 → wGPU1;
Algorithm 1: Using Memcpy

Initialization: weights in CPU;
Copy weights from CPU to GPU0 → wGPU;
FP+BP on GPU0 using wGPU → gGPU0;
FP+BP on GPU1 using wGPU → gGPU1;
WU on GPU0 using (gGPU0, gGPU1) → wGPU;
Algorithm 2: Using P2P direct access

Initialization: weights in CPU;
FP+BP on GPU0 using weights → g0;
FP+BP on GPU1 using weights → g1;
WU on GPU0 using (g0, g1) → weights;
Algorithm 3: Using shared main memory

In the pseudocode, the right arrows (→) point to the destination variable of each operation.
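To make the contrast between Algorithm 1 and Algorithm 3 concrete, the following minimal CUDA sketch shows only the gradient-exchange step; the buffer names mirror the pseudocode, the gradient size is a placeholder, and the FP/BP and WU kernels are elided since only the data movement differs.

```cpp
// Sketch: gradient exchange in Algorithm 1 (P2P memcpy) vs. Algorithm 3
// (shared main memory). Buffer sizes and names are placeholders.
#include <cuda_runtime.h>
#include <cstddef>

const size_t kGradBytes = 64 * 1024 * 1024;  // placeholder gradient size

// Algorithm 1: GPU0 keeps an extra buffer (gGPU0Copy) for GPU1's gradients,
// and the programmer copies them over explicitly after every FP+BP pass.
void exchangeWithMemcpy(float* gGPU0, float* gGPU1, float* gGPU0Copy) {
    cudaMemcpyPeer(gGPU0Copy, /*dstDevice=*/0,
                   gGPU1,     /*srcDevice=*/1, kGradBytes);
    // ... launch WU kernel on GPU0 over (gGPU0, gGPU0Copy) ...
}

// Algorithm 3: with a truly shared MM, both gradient buffers are directly
// addressable by GPU0, so the copy and the duplicate buffer disappear.
// (On today's hardware the closest approximation is managed/unified memory.)
void exchangeWithSharedMM(float* g0, float* g1) {
    // ... launch WU kernel on GPU0 over (g0, g1); no explicit transfer ...
    (void)g0; (void)g1;
}
```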

3 MGPU-TSM System

To explain and evaluate our envisioned MGPU-TSM architecture, we consider an MGPU-TSM system consisting of 4 GPUs, 1 CPU, and 4 HBM stacks that provide a total of 32GB of MM (we use 32GB as an example to explain the architecture; MGPU-TSM also works with larger capacities). The specifications of the GPU, CPU, and HBM stacks are provided in Table II.

3.1 MGPU-TSM Architecture

Figure 1(right) shows the logical view of our proposed MGPU-TSM system. We leverage the current common design for compute units (CUs), where each CU has a dedicated write-through L1$. All the L1$s are connected to the L2$s using a crossbar network. For our proposed MGPU-TSM system, we make changes to the memory hierarchy, starting from L2$ down to the MM.

GPUs typically have distributed L2$ banks, where each L2$ bank serves one memory controller (MC). In our envisioned MGPU-TSM system, we have 8 L2$ banks per GPU and 4 HBM stacks that provide a total of 32GB of MM. Thus, for each GPU, each L2 MC controls 4GB of memory. Each 8GB HBM stack is further divided into 16 DRAM banks of 512MB each.

Each L2 bank, as well as each DRAM bank, is connected to a centralized switch through a dedicated 32GB/s bidirectional link. Thus, each GPU has a total of 256GB/s of bidirectional BW between the L2$ and MM. With 4 GPUs, the total BW is 1TB/s. This also implies that each memory access requires a two-hop communication, from L2$ to the Switch, then from the Switch to MM, and vice versa. Recently, NVIDIA introduced NVSwitch [ishii2018nvswitch], providing 18 ports and 928GB/s of bidirectional BW, supporting RDMA connectivity across multiple GPUs. Hence, our assumed 32-port switch with 1TB/s aggregate bidirectional BW is realistic.

The key advantage of our TSM lies in its physically-unified MM, which provides uniform memory access (UMA) across the system. This physically-unified design completely removes the need for remote accesses. In addition, having a centralized location for data accessed by multiple GPUs provides the opportunity to coalesce accesses at the MM level and makes it easier to support coherency, given the lower communication overhead. Moreover, having more memory banks helps improve throughput through efficient data placement, i.e., allocating consecutive pages to neighboring DRAM banks in a round-robin manner (sketched below).
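A minimal sketch of this round-robin placement, under the assumption that the 64 DRAM banks (4 stacks of 16) form one flat pool, is shown below; the mapping function is our illustration rather than a committed address-mapping design.

```cpp
// Sketch: round-robin (page-interleaved) mapping of physical pages to DRAM banks.
// The 4KB page size matches Section 3.2; treating all 64 banks
// (4 HBM stacks x 16 banks) as one flat pool is a simplifying assumption.
#include <cstdint>

constexpr uint64_t kPageSize = 4 * 1024;   // 4KB pages
constexpr uint32_t kNumBanks = 64;         // 4 stacks x 16 banks each

// Consecutive pages land in neighboring banks, so a streaming access
// pattern is spread across all banks (and hence all HBM channels).
uint32_t bankOf(uint64_t physAddr) {
    const uint64_t pageNum = physAddr / kPageSize;
    return static_cast<uint32_t>(pageNum % kNumBanks);
}
```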

Component | Name | Tech. Node (nm) | Area (mm²) | Power (W)
GPU | RX 5700 | 7 | 151 | 180
CPU | Ryzen 9 3950X | 7 | 144 | 105
Memory | HBM 2.0 | 14 | 92 | 21.4

  • Determined using technology scaling rules.

TABLE II: Specification of MGPU-TSM components.

3.2 Preliminary Evaluation

Component | Configuration | Count per GPU
CU | 1.0 GHz | 32
L1 Vector $ | 16KB, 4-way | 32
L1 Scalar $ | 16KB, 4-way | 8
L1I$ | 32KB, 4-way | 8
L2$ | 256KB, 16-way | 8
DRAM | 512MB HBM | 16
L1 TLB | 1 set, 32-way | 48
L2 TLB | 32 sets, 16-way | 1

TABLE III: GPU architecture.

In this section, we discuss the potential performance benefits of an MGPU-TSM system over existing MGPU configurations, i.e., MGPU systems that use RDMA P2P direct access (referred to as RDMA) and MGPU systems that use unified memory (referred to as UM). Table III shows the configuration of each GPU in our evaluation, where we allocate memory by interleaving pages across all the memory modules in the MGPU system. For a fair comparison, we use the same GPU specification, i.e., CU count, L1$ and L2$ sizes, and total number of DRAM banks (16 per GPU), for the RDMA, UM, and TSM configurations. We use a page size of 4KB. For the RDMA configuration, we use PCIe 4.0 links that provide 32GB/s of bidirectional BW for remote accesses. UM provides a unified view of the total memory to the programmer by virtually combining the CPU and GPU memories, and uses a first-touch policy for page placement (both placement policies are sketched below). To evaluate our design, we use the MGPUSim simulator [sun2019mgpusim], which is designed specifically to support MGPU simulation. We use 12 standard benchmarks from the Hetero-Mark [heteromark], PolyBench [pouchet2012polybench], SHOC [shoc], and DNNMark [dong2017dnnmark] benchmark suites for our preliminary evaluation.
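The sketch below contrasts the two placement policies as simulator-style pseudologic; the function and variable names are our own and do not correspond to MGPUSim's API.

```cpp
// Sketch: the two page-placement policies used in our evaluation, written as
// simulator-style pseudologic; this is not MGPUSim's actual API.
#include <cstdint>
#include <unordered_map>

constexpr uint32_t kNumModules = 4;                 // one memory module per GPU (assumed)
std::unordered_map<uint64_t, uint32_t> pageHome;    // page number -> memory module

// Interleaved placement (RDMA and TSM runs): the page number alone decides.
uint32_t placeInterleaved(uint64_t pageNum) {
    return static_cast<uint32_t>(pageNum % kNumModules);
}

// First-touch placement (UM runs): the page is homed at the memory of the
// GPU that faults on it first; later faults may trigger migration instead.
uint32_t placeFirstTouch(uint64_t pageNum, uint32_t faultingGpu) {
    auto it = pageHome.find(pageNum);
    if (it == pageHome.end()) {
        pageHome[pageNum] = faultingGpu;            // first access wins
        return faultingGpu;
    }
    return it->second;
}
```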

Fig. 3: Speedup of the proposed TSM and of UM w.r.t. RDMA.

Figure 3 compares TSM, RDMA, and UM. TSM is, on average, 3.9× and 8.2× faster than RDMA and UM, respectively. TSM is faster than RDMA because RDMA requires data copy operations between the CPU and the GPUs and, during kernel execution, every GPU must use RDMA to access data residing in the other GPUs' memories. UM suffers from an expensive page-fault service mechanism and page migration over the off-chip links.

4 MGPU-TSM System Design Challenges

Our preliminary comparison of TSM with RDMA and UM shows that TSM is quite promising, but it also comes with several challenges. Here, we discuss these challenges and our future research directions for addressing them.

4.1 Data Sharing Within and Across GPUs

In the MGPU-TSM system, different CUs within and across GPUs can access the same memory location. Hence, we need a low-overhead, scalable cache coherence and memory consistency model, such as HALCONE [mojumder2020halcone], to maintain correctness. Traditional snooping-based or directory-based coherence protocols, such as MESI and MOESI, can lead to large inter-GPU and intra-GPU communication latencies [singh2013cache]. Timestamp-based coherence [tabbakh2018g], which allows auto-invalidation of cache blocks and reduces traffic overhead, can be suitable for an MGPU-TSM system (see the sketch below). A wide range of consistency models, including sequential consistency, weak consistency, and release consistency, has been proposed for single-GPU systems. We need consistency models that scale to the thousands of concurrent threads in an MGPU-TSM system.
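The sketch below shows the auto-invalidation check at the heart of such timestamp-based schemes; the lease-based structure and field names are our illustration and are not taken from [tabbakh2018g] or HALCONE.

```cpp
// Sketch: lease/timestamp-based auto-invalidation of a cache block.
#include <cstdint>

struct CacheBlock {
    uint64_t tag    = 0;
    uint64_t expiry = 0;        // logical time after which this copy is stale
    bool     valid  = false;
};

// A read hit requires a matching tag AND an unexpired lease; an expired block
// is treated as a miss and refetched from the shared MM, so no invalidation
// messages ever cross the interconnect.
bool isReadable(const CacheBlock& blk, uint64_t tag, uint64_t now) {
    return blk.valid && blk.tag == tag && now <= blk.expiry;
}

void fillBlock(CacheBlock& blk, uint64_t tag, uint64_t now, uint64_t lease) {
    blk.tag    = tag;
    blk.expiry = now + lease;   // readers must refetch after this point
    blk.valid  = true;
}
```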

4.2 L2-to-MM Network

The L2-to-MM network plays a critical role in the overall performance of an MGPU-TSM system. In our example system, we use direct links between the L2$ banks and the switch, and between the switch and MM. As we scale the number of GPUs, the radix of the switch grows proportionally. A high-radix switch leads to lower performance, and its area and power become problematic. In future work, we will explore high-BW, low-latency networks that scale well with GPU count.

4.3 CPU-GPU Memory Accesses

CPUs are typically latency-sensitive, while GPUs are BW-sensitive. Since the MGPU-TSM system provides the same physical memory to both CPUs and GPUs, it is imperative to design a network protocol that allows low-latency data access to the CPU and high-BW data access to the GPUs.

4.4 Integration Technology

To design a scaled-up MGPU-TSM system, we envision using 2.5D integration technology with multiple interposers. Each interposer will have multiple GPU chiplets, a CPU chiplet, and multiple HBM stacks. For intra-interposer communication, we can use electrical links, while for long-distance inter-interposer communication, we can use photonic links. To design such a multi-interposer system, we need to develop a cross-layer design automation technique that jointly optimizes the system architecture, circuit design, and physical design.

5 Conclusion

In this work, we showed that the performance of MGPU systems is limited due to expensive remote data access through off-chip links. At the same time, programming MGPU systems is difficult due to a lack of hardware support for coherency. To address these issues, we propose an MGPU-TSM architecture that eliminates remote data access, improves memory utilization, and reduces programmer burden. We also highlight the major challenges we need to overcome to make MGPU-TSM viable.

References