Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs

08/24/2018
by Jianyu Huang, et al.

Conventional GPU implementations of Strassen's algorithm (Strassen) typically rely on existing high-performance matrix multiplication (GEMM) routines, trading space for time. As a result, such approaches achieve practical speedup only for relatively large, "squarish" matrices because of the extra memory overhead, and their use is limited by the considerable workspace they require. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We also develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance of GEMM and Strassen. Overall, our 1-level Strassen can achieve up to 1.11x speedup over cublasSgemm on an NVIDIA Tesla V100 GPU, with a crossover point as small as 1,536. With additional workspace, our 2-level Strassen can achieve 1.19x speedup with a crossover point at 7,680.
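To make the idea of a composable, workspace-free Strassen primitive concrete, the following is a minimal CPU-side sketch in C++. It is not the paper's CUTLASS kernel. It assumes square, row-major matrices with an even dimension, and the names `fused_primitive`, `strassen_1level`, `naive_gemm`, and `max_abs_diff` are hypothetical. Each primitive forms the operand sums on the fly, keeps each product element in a local variable, and accumulates it straight into one or two quadrants of C, so no temporary matrices are allocated. The seven calls follow the standard one-level Strassen decomposition.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical fused primitive (illustrative only):
//   M = (X + d*Y) * (V + e*W);  Ca += ga*M;  Cb += gb*M
// X, Y, V, W, Ca, Cb point at h-by-h quadrants inside row-major matrices
// with leading dimension ld. Y, W, or Cb may be nullptr when the term is absent.
// M is never stored: each element lives in the local variable m.
void fused_primitive(int h, int ld,
                     const float* X, float d, const float* Y,
                     const float* V, float e, const float* W,
                     float ga, float* Ca, float gb, float* Cb) {
    for (int i = 0; i < h; ++i) {
        for (int j = 0; j < h; ++j) {
            float m = 0.0f;
            for (int p = 0; p < h; ++p) {
                // Operand sums are formed while loading, not in a workspace.
                float a = X[i * ld + p] + (Y ? d * Y[i * ld + p] : 0.0f);
                float b = V[p * ld + j] + (W ? e * W[p * ld + j] : 0.0f);
                m += a * b;
            }
            Ca[i * ld + j] += ga * m;
            if (Cb) Cb[i * ld + j] += gb * m;
        }
    }
}

// One-level Strassen computing C += A * B for n-by-n row-major matrices, n even.
void strassen_1level(int n, const std::vector<float>& A,
                     const std::vector<float>& B, std::vector<float>& C) {
    assert(n % 2 == 0);
    assert(A.size() == static_cast<std::size_t>(n) * n);
    assert(B.size() == A.size() && C.size() == A.size());
    const int h = n / 2;
    const float *A00 = A.data(), *A01 = A00 + h, *A10 = A00 + h * n, *A11 = A10 + h;
    const float *B00 = B.data(), *B01 = B00 + h, *B10 = B00 + h * n, *B11 = B10 + h;
    float *C00 = C.data(), *C01 = C00 + h, *C10 = C00 + h * n, *C11 = C10 + h;

    // Seven products instead of eight; each updates at most two quadrants of C.
    fused_primitive(h, n, A00,  1, A11,     B00,  1, B11,     1, C00,  1, C11);
    fused_primitive(h, n, A10,  1, A11,     B00,  0, nullptr, 1, C10, -1, C11);
    fused_primitive(h, n, A00,  0, nullptr, B01, -1, B11,     1, C01,  1, C11);
    fused_primitive(h, n, A11,  0, nullptr, B10, -1, B00,     1, C00,  1, C10);
    fused_primitive(h, n, A00,  1, A01,     B11,  0, nullptr, 1, C01, -1, C00);
    fused_primitive(h, n, A10, -1, A00,     B00,  1, B01,     1, C11,  0, nullptr);
    fused_primitive(h, n, A01, -1, A11,     B10,  1, B11,     1, C00,  0, nullptr);
}

// Reference C += A * B, used only to check the sketch.
void naive_gemm(int n, const std::vector<float>& A,
                const std::vector<float>& B, std::vector<float>& C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < n; ++p)
                C[i * n + j] += A[i * n + p] * B[p * n + j];
}

// Largest absolute element-wise difference between two equally sized vectors.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    return worst;
}
```

Calling `strassen_1level(n, A, B, C)` and `naive_gemm(n, A, B, R)` on the same inputs should leave `C` and `R` equal up to rounding. One level replaces eight half-size products with seven, so the ideal arithmetic saving is 8/7, about 1.14x, and 64/49, about 1.31x, for two levels. The reported 1.11x and 1.19x sit below those bounds, as expected once the extra additions and memory traffic are counted.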

