Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs

by   Jianyu Huang, et al.

Conventional GPU implementations of Strassen's algorithm (Strassen) typically rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, "squarish" matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We also develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for GEMM and Strassen. Overall, our 1-level Strassen can achieve up to 1.11x speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19x speedup with a crossover point at 7,680.



There are no comments yet.


page 2

page 4


Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although the matrix multiplication plays a vital role in computational l...

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Multiplication of a sparse matrix to a dense matrix (SpDM) is widely use...

Implementing Strassen's Algorithm with BLIS

We dispel with "street wisdom" regarding the practical implementation of...

Bandwidth-Optimal Random Shuffling for GPUs

Linear-time algorithms that are traditionally used to shuffle data on CP...

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

Stencil computation is one of the most widely-used compute patterns in h...

Efficient Synchronization Primitives for GPUs

In this paper, we revisit the design of synchronization primitives---spe...

Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory

High fidelity scientific simulations modeling physical phenomena typical...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.