Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

05/01/2023
by Shixun Wu, et al.

General Matrix Multiplication (GEMM) is a crucial algorithm for applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential to the performance of these systems. While researchers often strive for faster performance on large compute platforms, the increased scale of these systems raises concerns about hardware and software reliability. In this paper, we present a design for high-performance GEMM with algorithm-based fault tolerance (ABFT) for use on GPUs. We describe fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, proprietary cuBLAS GEMM. We present a kernel fusion strategy that overlaps the checksum-related memory accesses with the original GEMM computation, mitigating the latency added by fault tolerance. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation of both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Experimental results demonstrate that our baseline GEMM achieves performance comparable or superior to the closed-source cuBLAS. The fault-tolerant GEMM incurs only minimal overhead (8.89% on average) relative to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by our code generator achieve speedups of 160%∼183.5% (fault-tolerant) and 148.55%∼165.12% (non-fault-tolerant), outperforming cuBLAS by up to 41.40%.
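The ABFT approach referenced here builds on the classical checksum idea for matrix multiplication: encode A with an appended column-checksum row and B with an appended row-checksum column, and the product then carries checksums of itself that can be verified online. The NumPy sketch below illustrates only that algebra under this classical encoding; the function name abft_gemm, the tolerance atol, and the single-error-correction policy are illustrative assumptions, not the paper's fused GPU implementation.

```python
import numpy as np

def abft_gemm(A, B, atol=1e-6):
    """Checksum-protected GEMM sketch: returns C = A @ B, detecting and
    correcting a single corrupted element of C via checksum residuals."""
    m, n = A.shape[0], B.shape[1]
    # Encode once: column-checksum row appended to A, row-checksum column to B.
    Ac = np.vstack([A, A.sum(axis=0, keepdims=True)])   # (m+1) x k
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # k x (n+1)
    # A single multiplication yields the result and both checksum borders.
    Cf = Ac @ Br                                        # (m+1) x (n+1)
    C = Cf[:m, :n].copy()
    # Verify: stored checksums minus sums recomputed from C should be ~zero.
    dcol = Cf[m, :n] - C.sum(axis=0)    # one residual per column
    drow = Cf[:m, n] - C.sum(axis=1)    # one residual per row
    bad_j = np.flatnonzero(np.abs(dcol) > atol)
    bad_i = np.flatnonzero(np.abs(drow) > atol)
    if bad_i.size == 1 and bad_j.size == 1:
        # One nonzero row residual and one nonzero column residual locate a
        # single fault at (i, j); adding the residual restores the true value.
        i, j = bad_i[0], bad_j[0]
        C[i, j] += dcol[j]              # drow[i] == dcol[j] for a single error
    elif bad_i.size or bad_j.size:
        # Multiple (or checksum-border) faults: detection only; recompute.
        raise RuntimeError("uncorrectable fault(s) detected; recompute tile")
    return C
```

A corrupted element of C perturbs exactly one row residual and one column residual by the same amount, which is what makes single-error location and correction possible. In the paper's GPU setting the analogous residual checks are verified inside the fused kernel at the thread, warp, and threadblock levels rather than as a separate pass; this sketch only shows the underlying algebra.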


Related research

- FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs (05/03/2023)
- Winograd Convolution: A Perspective from Fault Tolerance (02/17/2022)
- Fault-Tolerant Strassen-Like Matrix Multiplication (10/10/2022)
- FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance (04/02/2021)
- Fault-Tolerant Collaborative Inference through the Edge-PRUNE Framework (06/16/2022)
- CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM (08/24/2020)
- FT-LADS: Fault-Tolerant Object-Logging based Big Data Transfer System using Layout-Aware Data Scheduling (05/16/2018)
