Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

12/13/2020
by Chenhao Xie, et al.

Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task because of highly irregular memory references and workload imbalance across the GPUs. This is particularly true of the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information must be exchanged and shared among GPUs, which calls for efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked by fast interconnects such as NVLink and NVSwitch. Instead, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space (PGAS) programming model, to enable efficient fine-grained communication and drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model that further improves GPU utilization. With these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems show that our design achieves on average a 3.53x (up to 9.86x) speedup on a DGX-1 system and a 3.66x (up to 9.64x) speedup on a DGX-2 system with four GPUs over the Unified-Memory design. Comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
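
To make the dependency structure of SpTRSV concrete, the following is a minimal serial sketch in Python, not the multi-GPU design described above. It assumes a lower-triangular matrix stored in CSR form with a nonzero diagonal; the function names `sptrsv_lower_csr` and `dependency_levels` and the 4x4 example matrix are hypothetical and chosen only for illustration. The second function groups rows into levels: rows in the same level do not depend on each other, while each row must wait for every earlier row it references, which is the kind of dependency information that has to be shared once rows are spread across GPUs.

```python
def sptrsv_lower_csr(row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution, with L lower triangular in CSR."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                # x[j] was produced by an earlier row: a data dependency.
                acc -= vals[k] * x[j]
            elif j == i:
                diag = vals[k]
        if diag is None or diag == 0:
            raise ValueError(f"row {i} has no nonzero diagonal entry")
        x[i] = acc / diag
    return x


def dependency_levels(row_ptr, col_idx):
    """Return, for each row, the length of its longest dependency chain."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


# Hypothetical example matrix (illustration only):
#     [2 0 0 0]
# L = [0 4 0 0]
#     [1 0 1 0]
#     [0 3 2 5]
ROW_PTR = [0, 1, 2, 4, 7]
COL_IDX = [0, 1, 0, 2, 1, 2, 3]
VALS = [2.0, 4.0, 1.0, 1.0, 3.0, 2.0, 5.0]
B = [2.0, 8.0, 3.0, 15.0]

X = sptrsv_lower_csr(ROW_PTR, COL_IDX, VALS, B)
LEVELS = dependency_levels(ROW_PTR, COL_IDX)
print(X)       # [1.0, 2.0, 2.0, 1.0]
print(LEVELS)  # [0, 0, 1, 2]
```

In this example rows 0 and 1 can be solved concurrently, row 2 must wait for row 0, and row 3 must wait for rows 1 and 2. The uneven level sizes also hint at why workload imbalance arises when such a solve is parallelized.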

research
09/15/2022

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Sparse linear algebra kernels play a critical role in numerous applicati...
research
06/13/2021

G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression

Text analytics directly on compression (TADOC) has proven to be a promis...
research
03/11/2019

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

High performance multi-GPU computing becomes an inevitable trend due to ...
research
06/26/2021

GSmart: An Efficient SPARQL Query Engine Using Sparse Matrix Algebra – Full Version

Efficient execution of SPARQL queries over large RDF datasets is a topic...
research
04/11/2020

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose compu...
research
01/29/2022

Efficient, Out-of-Memory Sparse MTTKRP on Massively Parallel Architectures

Tensor decomposition (TD) is an important method for extracting latent i...
research
04/17/2021

Ripple : Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout

GPUs are now used for a wide range of problems within HPC. However, maki...
