## 1 Introduction

### 1.1 Tall & Skinny Matrix Multiplications (TSMM)

The general matrix-matrix multiplication (GEMM) is such an essential linear algebra operation used in many numerical algorithms that hardware vendors usually supply an implementation that is perfectly optimized for their hardware. In the case of Nvidia, this is part of CUBLAS ([cublas]). However, since these implementations focus on mostly square matrices, they often perform poorly for matrices with unusual shapes.

In this paper, we cover the operation $C = A^T B$, with the matrices $A$ and $B$ being tall & skinny, i.e., much taller than they are wide. If $A$ and $B$ are of size $K \times M$ and $K \times N$, the shared dimension $K$ is long (of the order of $10^7$), whereas the dimensions $M$ and $N$ are short, which we define here as the range $1 \le M, N \le 64$. $A$ and $B$ are both stored in row-major order. We are interested in a highly efficient implementation of this operation on the Nvidia Volta GPGPU.

### 1.2 Application

Row-major tall & skinny matrices result from combining several vectors into block vectors.

*Block vector algorithms* are linear algebra algorithms that compute on multiple vectors simultaneously for improved performance. For instance, by combining multiple consecutive *sparse matrix-vector* (SpMV) multiplications into a single *sparse matrix multiple-vector* (SpMMV) multiplication, the matrix entries are loaded only once and reused for all vectors, which reduces the overall memory traffic and consequently increases the performance of this memory-bound operation. This was first shown analytically in [Gropp99] and is used in many applications such as the one described in [blockjada].
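The traffic saving can be illustrated with a minimal CSR-based SpMMV sketch (hypothetical layout: the `nb` vectors are stored interleaved, row-major); each matrix entry is loaded once and reused for all `nb` vectors:

```cpp
#include <cassert>
#include <vector>

// Sparse matrix (CSR) times a block of nb vectors stored interleaved:
// x[i*nb + v] is element i of vector v. Each matrix entry val[j] is
// loaded once and reused for all nb right-hand-side vectors.
void spmmv(const std::vector<int>& rowPtr, const std::vector<int>& col,
           const std::vector<double>& val, const std::vector<double>& x,
           std::vector<double>& y, int nRows, int nb) {
    for (int i = 0; i < nRows; ++i) {
        for (int v = 0; v < nb; ++v) y[i * nb + v] = 0.0;
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
            double a = val[j];            // loaded once ...
            int c = col[j];
            for (int v = 0; v < nb; ++v)  // ... used for all nb vectors
                y[i * nb + v] += a * x[c * nb + v];
        }
    }
}
```

With `nb` separate SpMV calls, `val[j]` and `col[j]` would be read `nb` times instead of once.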

The simultaneous computation on multiple vectors can also yield numerical advantages. This has been shown for block vector versions of the Lanczos algorithm [blocklanzcos], the biconjugate gradient algorithm [blockcg], and the Jacobi-Davidson method [blockjada], each of which uses block vectors to compute multiple eigenvectors simultaneously. Many such algorithms require multiplications of block vectors. For example, the tall & skinny matrix-matrix multiplication $A^T B$ occurs in classical Gram-Schmidt orthogonalization of a set of vectors, represented by $B$, against an orthogonal basis $A$.

### 1.3 Roofline Model

We use the roofline model to obtain an upper limit for the performance of this operation. Under the assumption that the three matrices are transferred just once from memory and that $2MNK$ floating-point operations are performed, the arithmetic intensity for $M = N$ and $M, N \ll K$ is

$$I = \frac{2MNK}{8\,(MK + NK + MN)}\,\frac{\mathrm{flop}}{\mathrm{byte}} \approx \frac{M}{8}\,\frac{\mathrm{flop}}{\mathrm{byte}}. \qquad (1)$$

In this symmetric case, the arithmetic intensity grows linearly with $M$. This paper shows measurements only for the symmetric case, although the nonsymmetric case is not fundamentally different: the intensity is proportional to the harmonic mean of both dimensions and is consequently dominated by the smaller one. With the derived intensity, the model predicts

$$P = \min\left(P_\mathrm{peak},\; I \cdot b_s\right)$$

as the “perfect” performance yardstick.

Usually the GEMM is considered a classic example of a compute-bound problem with high arithmetic intensity. However, at $M = N = 1$ the arithmetic intensity is just $1/8\,\mathrm{flop/byte}$, which is far below the roofline knee of modern compute devices; the operation is therefore strongly memory bound. This is not surprising, since a matrix multiplication with $M = N = 1$ is the same as a scalar product. At the other end of the considered spectrum, at $M = N = 64$, the arithmetic intensity is $8\,\mathrm{flop/byte}$, which is close to the inverse machine balance of a V100 GPU (see below). The performance character of the operation thus changes from extremely memory bound at $M = N = 1$ to simultaneously memory and compute bound at $M = N = 64$. An implementation with perfect performance consequently needs to fully utilize the memory bandwidth at all sizes and reach peak floating-point performance for the large sizes. These very different performance characteristics make it hard to write an implementation that fits both ends of the spectrum; different optimizations are required for each case.
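As a plausibility check, the intensity formula (1) and the roofline prediction can be evaluated with a few lines (the helper names are ours; the V100 figures of roughly 7.07 Tflop/s and 880 Gbyte/s are taken from Section 1.6):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Arithmetic intensity of C = A^T B in double precision (8 bytes/element),
// assuming each matrix is moved over the memory interface exactly once:
// 2*M*N*K flops over 8*(M*K + N*K + M*N) bytes.
double intensity(double M, double N, double K) {
    return 2.0 * M * N * K / (8.0 * (M * K + N * K + M * N));
}

// Roofline prediction: P = min(P_peak, I * b_s), in flop/s.
double roofline(double I, double peakFlops, double bandwidth) {
    return std::min(peakFlops, I * bandwidth);
}
```

For $M = N = 1$ this evaluates to about $0.125$ flop/byte (110 Gflop/s at 880 Gbyte/s), and for $M = N = 64$ to just under $8$ flop/byte, matching the $M/8$ approximation.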

With the limit given by the roofline model, it is possible to judge the quality of an implementation as the percentage of the roofline prediction it achieves. This is plotted in Figure 2 for CUBLAS; the graph shows a substantial potential performance headroom.

### 1.4 Contribution

This paper shows the necessary implementation techniques to achieve near-perfect performance for double precision tall & skinny matrix-matrix multiplications on an Nvidia V100 GPGPU.

Two different parallel reduction schemes are implemented and analyzed as to their suitability for matrices with lower row counts.
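As a CPU-side illustration only (not the paper's GPU kernels), the following sketch contrasts two ways of combining per-thread partial results after a parallel reduction over the long dimension: a serial combination and a pairwise tree:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Each of T threads reduces a slice of x into a partial sum (the parallel
// part); the T partials are then combined either serially or tree-wise.
// CPU illustration of the two combination schemes, not the paper's kernels.
double partial_sums_reduce(const std::vector<double>& x, int T, bool tree) {
    std::vector<double> partial(T, 0.0);
    std::vector<std::thread> workers;
    size_t chunk = (x.size() + T - 1) / T;
    for (int t = 0; t < T; ++t)
        workers.emplace_back([&, t] {
            size_t lo = t * chunk, hi = std::min(x.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i) partial[t] += x[i];
        });
    for (auto& w : workers) w.join();
    if (!tree)  // serial combination of the T partial results
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    for (int stride = 1; stride < T; stride *= 2)  // pairwise tree
        for (int t = 0; t + stride < T; t += 2 * stride)
            partial[t] += partial[t + stride];
    return partial[0];
}
```

The tree variant needs only $\log_2 T$ combination steps, which matters when the number of rows per thread is small and the combination phase is no longer negligible.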

A code generator is implemented that generates code for specific matrix sizes and tunes many configuration options specifically to each size. This makes it possible to exploit regularity where the size parameters allow it, while still incurring the least possible overhead where they do not.
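A toy version of this idea, greatly simplified and purely illustrative, emits a fully unrolled accumulation body for compile-time-known $M$ and $N$; a real generator would additionally tune tiling, the reduction scheme, and other options:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Emit a fully unrolled C = A^T B accumulation body for fixed M and N,
// with one named accumulator per C entry (illustrative sketch only).
std::string generateUnrolledBody(int M, int N) {
    std::ostringstream os;
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            os << "sum" << m << "_" << n << " += A[k*" << M << "+" << m
               << "] * B[k*" << N << "+" << n << "];\n";
    return os.str();
}
```

Because $M$ and $N$ are literals in the generated code, the compiler can keep all $M \times N$ accumulators in registers and needs no index arithmetic in the inner loop.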

### 1.5 Related Work

*CUBLAS* is Nvidia's BLAS implementation. The GEMM function interface in BLAS expects column-major matrices. Treating the row-major matrices as transposed column-major matrices and executing $C^T = B^T A$ is an equivalent operation.
*CUTLASS* is a collection of primitives for multiplications especially of small matrices, which can be composed in different ways to form products of larger matrices. One of these is the `splitK` kernel, which additionally parallelizes the inner summation of the matrix multiplication. Square matrix multiplications usually do not do this, but it is exactly what is required here for sufficient parallelism. An adapted version of the "`06_splitK_gemm`" example is used for benchmarking.

### 1.6 Hardware

In this work we use Nvidia’s V100-PCIe-16GB GPGPU (Volta architecture) with CUDA 10.0. The hardware data was collected with our own CUDA microbenchmarks, available at [cuda-benches] together with more detailed data.

*Memory Bandwidth.*
The STREAM benchmarks ([McCalpin1995]) all contain a write stream, whereas the TSMM does not. We therefore use a thread-local sum reduction to estimate the achievable memory bandwidth (see Table 1). The maximum is above 880 Gbyte/s, but this is only reached with sufficient parallelism, either through high occupancy or through instruction-level parallelism (ILP) in the form of multiple read streams, achieved here by unrolling.

*Floating Point Throughput.*
The V100 can execute one 32-wide double-precision (DP) floating-point multiply-add (FMA) instruction per cycle on each of its 80 streaming multiprocessors (SMs) and runs at a clock speed of 1.38 GHz, for a DP peak of $2 \times 32 \times 80 \times 1.38\,\mathrm{GHz} \approx 7.07\,\mathrm{Tflop/s}$.
One SM quadrant can process one 32-lane-wide instruction every four cycles at a latency of eight cycles. Full throughput can thus already be achieved with a single warp per quadrant, provided the instructions are independent.

*L1 Cache.*

The L1 cache plays an instrumental role in achieving the theoretically possible arithmetic intensity. The per-SM private L1 cache can transfer a full 128-byte cache line per cycle. A 32-wide, unit-stride DP load (256 bytes) therefore takes two cycles, and this number grows by one cycle for each additional 128-byte cache line that is touched. This amounts to a rate of one unit-stride load for every two FMA instructions.
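This cost can be sketched as counting the distinct 128-byte lines touched by the 32 lanes of a load (a simplification of the behavior described above; the function and its name are ours, and a 128-byte-aligned base address is assumed):

```cpp
#include <cassert>
#include <set>

// Simplified L1 cost model: a 32-lane DP load costs one cycle per distinct
// 128-byte cache line that the 32 lane addresses touch; unit stride
// (8 bytes) touches 2 lines, so such a load takes 2 cycles.
int loadCycles(long baseByte, long strideBytes) {
    std::set<long> lines;
    for (int lane = 0; lane < 32; ++lane)
        lines.insert((baseByte + lane * strideBytes) / 128);
    return (int)lines.size();
}
```

In this model, larger strides touch more lines per load and proportionally lower the achievable load rate relative to the FMA rate.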

## 2 Implementation Strategies

The arithmetic intensity (1) was derived under the assumption of perfect caching, i.e., that $A$ and $B$ are transferred from memory just once. The fastest way to reuse values is to keep them in a register and have the thread that owns the register perform all required operations on the data. Data used by multiple threads is preferably shared via the L1 cache for threads in the same thread block, or via the L2 cache otherwise. This works well only if some spatial and temporal access locality is in place.

### 2.1 Thread Mapping Options

The parallelization scheme, i.e., the way in which work is mapped to GPU threads, plays an important role for data flow in the memory hierarchy. The canonical formulation of an MMM is a three-level loop nest.
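In C-like form, such a loop nest can be sketched as follows (an illustration with the long $k$ loop outermost, not the paper's original listing):

```cpp
#include <cassert>
#include <vector>

// Canonical three-level loop nest for C = A^T B (row-major). C must be
// zero-initialized. On a GPU, only the k loop over the long dimension K
// offers enough parallelism, which turns the updates of C into a reduction.
void tsmm_loopnest(const double* A, const double* B, double* C,
                   int K, int M, int N) {
    for (int k = 0; k < K; ++k)
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n)
                C[m * N + n] += A[k * M + m] * B[k * N + n];
}
```

Which of the three loops is distributed across threads, and how the resulting reduction over $k$ is organized, is exactly the design space discussed in the following sections.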
