Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking

by   Zhixiang Gu, et al.
Indiana University

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. It is well known that SpGEMM is a memory-bound operation, and its peak performance is expected to be bound by the memory bandwidth. Yet, existing algorithms fail to saturate the memory bandwidth, resulting in suboptimal performance under the Roofline model. In this paper we characterize existing SpGEMM algorithms based on their memory access patterns and develop practical lower and upper bounds for SpGEMM performance. We then develop an SpGEMM algorithm based on outer product matrix multiplication. The newly developed algorithm called PB-SpGEMM saturates memory bandwidth by using the propagation blocking technique and by performing in-cache sorting and merging. For many practical matrices, PB-SpGEMM runs 20 on modern multicore processors. Most importantly, PB-SpGEMM attains performance predicted by the Roofline model, and its performance remains stable with respect to matrix size and sparsity.



page 6

page 9


Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in ...

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although the matrix multiplication plays a vital role in computational l...

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures : Algorithms and Experiments

Architectures with multiple classes of memory media are becoming a commo...

Accelerating CPU-based Sparse General Matrix Multiplication with Binary Row Merging

Sparse general matrix multiplication (SpGEMM) is a fundamental building ...

Parallel Algorithms for Adding a Collection of Sparse Matrices

We develop a family of parallel algorithms for the SpKAdd operation that...

ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs

Linear algebra operations have been widely used in big data analytics an...

The MOMMS Family of Matrix Multiplication Algorithms

As the ratio between the rate of computation and rate with which data ca...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Sparse matrix-matrix multiplication (SpGEMM) is a widely-used kernel in many graph analytics, scientific computing, and machine learning algorithms. In graph analytics, SpGEMM is used in betweenness centrality [10], clustering coefficients, triangle counting [3], multi-source breadth-first search [16], colored intersection search [20], and cycle detection [31] algorithms. In scientific computing, SpGEMM is used in algebraic multigrid [6] and linear solvers. Many machine learning tasks like dimensionality reduction (e.g., NMF, PCA [14]

), spectral clustering

[19], and Markov clustering (MCL) [4]) rely on an efficient SpGEMM algorithm as well. Additionally, SpGEMM algorithms are applied to evaluate the chained product of sparse Jacobians [17] and optimize join operations on modern relational databases [1].

In most data analytics applications, SpGEMM has very low arithmetic intensity (AI) measured by the ratio of total floating-point operations to total data movement. For example, when multiplying two Erdős-Rényi random matrices with nonzeros per column (matrices with

nonzeros uniformly distributed in each column) , an algorithm has an AI of just

flops/byte (see Sec. II-C). At this arithmetic intensity, SpGEMM is a memory-bound operation, and SpGEMM’s peak performance has an upper bound of , where

is the memory bandwidth. Assuming 50GB/s bandwidth available on a multicore processor, the estimated peak performance can be as high as

GFLOPS (billions of floating point operations per second). However, state-of-the art parallel algorithms based on heap and hash merging [25] attain no more than 500 MFLOPS on a socket of an Intel Skylate processor. No prior work clearly explained these observed performances of SpGEMM algorithms as no standard performance model has been developed to understand SpGEMM’s performance.

In this paper, we rely on the Roofline model [30] and develop lower and upper performance bounds for SpGEMM algorithms. In developing practical lower bounds of AI, we considered the fact that an SpGEMM may read data more than once. Even with a tight lower bound on AI, an algorithm can attain peak performance (that is ) only if it (a) saturates the memory bandwidth, (b) does not have high latency cost, and (c) makes use of full cache lines of data. We show that these requirements are not satisfied by current column-by-column algorithms, resulting into lower than attainable peak FLOPS.

Here, we develop a new algorithm based on the outer product of matrices. The goal of this algorithm is to eliminate irregular data accesses, increase bandwidth utilization, and attain the performance predicted by the Roofline model. Our algorithm is built upon the expansion-sort-compress paradigm [8, 11]. Given two matrices, this algorithm performs outer products to generate intermediate tuples of row index, column index, and multiplies values. These tuples are then sorted, and duplicate row and column indices are merged to get the final product. To ensure efficient bandwidth utilization, we store intermediate tuples into partially-sorted bins so that each bin can be sorted and merged independently. This technique of binning intermediate data for better bandwidth utilization is called propagation blocking (PB) [7]

. Prior work has used propagation blocking to improve the bandwidth utilization as well as the overall performance of sparse matrix-vector multiplications 

[7, 27]. Here we use propagation blocking to regularize data movements in SpGEMM. Hence, our algorithm is named as PB-SpGEMM.

All three phases of PB-SpGEMM (expand, sort, and compress or merge) stream data from memory. Hence, every phase has no significant latency overhead, utilizes full cache lines, and attains a bandwidth close the STREAM benchmark. Therefore, given the bandwidth of a system and input matrices, PB-SpGEMM’s performance matches the prediction of the Roofline model.

Similar to any expansion-sort-compress algorithm, PB-SpGEMM has to store intermediate tuples, where is the number of multiplications needed by SpGEMM. This can lead to significant data movements when the compression factor (the ratio of to nonzeros in the output) is greater than four. However, most practical SpGEMM operations have small compression factors. For example, when squaring matrices from the SuiteSparse Matrix Collection, more than of SpGEMMs have a compression factor less than 3 and about have a compression factor less than 6 [22]. Hence, for most practical scenarios, PB-SpGEMM performs predictably better than existing heap and hash algorithms. For multiplications with compression factor greater than four, PB-SpGEMM’s performance is still predictable, but it can run slower than the alternatives.

We summarize the key contributions of this paper below:

  1. We develop a Roofline performance model for SpGEMM algorithm. This model can show the limitations of existing column-by-column SpGEMM algorithms.

  2. We develop an outer-product based SpGEMM algorithm called PB-SpGEMM. We use propagation blocking, in-cache sorting and merging for better bandwidth utilization.

  3. Different phases of PB-SpGEMM utilize bandwidth close to the STREAM benchmark. Given the bandwidth of a system and input matrices, PB-SpGEMM’s performance matches the prediction from our Roofline model.

  4. For SpGEMMs with compression factor less than four, PB-SpGEMM is 30% to 50% faster than previous state-of-the-art algorithms for multicore processors.

Our implementation of PB-SpGEMM is publicly available at Bitbucket: .

Ii A Performance Model for SpGEMM

Ii-a Notations

Given two sparse matrices and , SpGEMM computes another potentially sparse matrix . denotes th column of , and denotes th entry in the th column. In our analysis, we consider -by- matrices for simplicity. We use Erdős-Rényi (ER) random matrices throughout the paper. An ER matrix with nonzeros per column has nonzeros uniformly distributed in each column. Among many representations for sparse matrices, we consider three standard data structures in this paper: Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), and Coordinate format (COO). See Langr and Tvrdik  [21] for a comprehensive discussion on sparse matrix storage formats.

Given a matrix , denotes the number of nonzeros in , and denotes average nonzeros in a row or column. In computing , denotes the number of multiplications needed. Throughout the paper, floating point operations only denote multiplications. The compression factor (cf) denotes the ratio of to nonzeros in the output matrix: . Since at least one multiplication is needed for every output nonzero, .

Ii-B Classes of SpGEMM algorithms categorized by data access patterns

To compute , most algorithms also manipulate an expanded matrix that contains unmerged entries. The final output can be obtained by merging entries in with the same row and column indices. Therefore, if we primarily focus on data accesses, SpGEMM algorithms have two distinct phases: (a) input matrices are accessed to form , (b) duplicate entries in are merged to form . Here, merging means adding multiplied values with the same row and column indices. For input accesses, we have only two options: (1) access and column-by-column (or equivalently row-by-row) [25, 29, 13, 2, 11], and (2) access column-by-column and access row-by-row for outer product [9, 28]. For output formations, we have many options. Prior work used the expand-sort-compress strategy [11, 28, 22] or used accumulators based on heap [2], hash table [25], and a dense vector called SPA [29, 15]. Table I summarizes prior work based on their data access patterns. Next, we briefly summarize four major classes of algorithms and discuss their data access patterns.

max width= Input Access Column wise Outer Product Output Heap/Hash/SPA [25, 29, 13, 2] [9] Formation ESC [11, 22] [28], this

TABLE I: Classification of SpGEMM algorithms based on access patterns of input and output matrices. This paper falls into the lower right cell.
Fig. 1: An illustration of column SpGEMM algorithm. To generate , the algorithm selects 2nd, 4th and 7th columns of (shown in red) based on the nonzeros in (shown in blue). The selected columns are multiplied by corresponding entries in , forming entries in (shown in yellow). Duplicate entries in are merged to generate (shown in green). The merging strategy differs in different column SpGEMM algorithm
Fig. 2: An illustration of outer product SpGEMM algorithm. Each multiplication between a column from A and a corresponding row from B generate a sub 1-rank matrix, those sub 1-rank matrices added up yield the final result C
No of Accesses Streamed Access Cache Line Utilization
Column SpGEMM (Heap/Hash/SPA) (when )
ESC (column-wise) (when )
ESC (outer product)

* Column SpGEMM generates one column of at a time. ** Using blocking techniques discussed in this paper.

TABLE II: Data access patterns in different SpGEMM algorithms when multiplying two ER matrices with nonzeros per column. If an algorithm does not stream data, it has high latency cost.

Column SpGEMM algorithms based on the heap/hash/SPA accumulator. Some algorithms materialize only one column of , merge duplicate entries in that column and generate the corresponding column of . Since column-by-column and row-by-row algorithms have similar computational patterns, we only discuss column SpGEMM algorithms in this paper. An illustration of the column SpGEMM algorithm is shown in Fig. 1, where is generated by merging a subset of columns in determined by nonzeros in . Most column-by-column algorithms are based on Gustavson’s algorithm [18], and they differ from one another based on how they merge entries in to obtain . Prior work have used heap [2], hash table [26], or a dense vector called SPA [15] for merging columns. A common characteristic of all column-by-column algorithms is that they read from and write to one column at a time. However, columns of may be read irregularly several times based on the nonzero pattern of .

We explain the access pattern of when multiplying two ER matrices with nonzeros per column. In the worst case, each column of will be read times from memory because columns of are accessed randomly without any spatial locality. Thus, we read times over the execution of the algorithm. Second, if we store matrices in the CSC format, column SpGEMM algorithms have good spatial locality for , and , but not for . Hence, we pay heavy latency costs for irregular accesses of ’s columns. Third, if a column of has very few entries (e.g., when is less than 8), we read a column of in a cache line, but the whole cache line is not used. Hence, column SpGEMM waste memory bandwidth for very sparse matrtices. The first row of Table II summarizes the data access patterns in column SpGEMM algorithms. Similarly, a row-by-row algorithm (when matrices are stored in the CSR format) has good spatial locality for , and , but not for .

The Expand-Sort-Compress (ESC) algorithms. Algorithms based on the ESC technique generate full before merging duplicate entries. The original ESC algorithm [11] developed for GPUs generates using 111Earlier work [11] actually used a row-by-row algorithm with CSR matrices, which is equivalent to column SpGEMM with CSC matrices. similar to the first step of column SpGEMM algorithm shown in Fig. 1. After the entire is constructed, tuples in are sorted and merged to generate the final output. Since sorting can be efficiently performed on GPUs, ESC SpGEMM can performed better than other algorithms on GPUs [11, 22]. The column ESC algorithm has access patterns for , , and similar to the column SpGEMM algorithm. Additionally, needs to access twice (one write after multiplication and one read before merging). The second row of Table II summarizes the data access patterns in the column ESC algorithms.

Previous work [11] tried to eliminate reading and writing from memory by partitioning by rows and multiplying each partition with . Thus, this approach generates one partition of at a time, which can fit in cache when a large number of partitions is used. However, the partitioned ESC algorithm will need to read several times as well as reading multiple times for each partition. Hence, the effectiveness of partitioning depends on the number of partitions and nonzero structures of input matrices.

Outer product algorithms. Fig. 2 shows an illustration of the outer product algorithm. In this formulation, is multiplied with to form a rank-1 outer product matrix. The rank-1 matrices can be merged by using a heap or using ESC. A heap can be used to merge the outer product of and with the current output after every iteration [9]. However, this algorithm is too expensive as it requires merging operations. Hence, we do not elaborate this algorithm further.

The rank-1 matrices from outer product can also be merged using the ESC strategy. In this case, input matrices are streamed only once. Hence, outer product can fully utilize cache lines when reading inputs. To sort and merge unmerged tuples in , we need to read them from memory. This can significantly increase the memory traffic. Nonetheless, with an efficient blocking technique discussed in this paper, we can stream whenever it is accessed. Hence, we can utilize full cache lines and saturate memory bandwidth to offset more data accesses. The last row of Table II summarizes the data access patterns in the outer-product-based ESC algorithm.

Ii-C Arithmetic Intensity (AI) of SpGEMM.

AI is the ratio of total floating-point operations to total data movement (bytes). To compute , one must read and from memory, and write to memory222We ignore that may have to first read from memory to cache before writing.. Assume that on average, we need bytes to store a nonzero. Then,


If we use 4 bytes for indices and 8 bytes for values, then is 16 bytes (assuming that matrices are stored in the COO format). Here, cf is a property of input matrices and it varies from 1 to 8 for most practical sparse matrices [22]. Even in the best scenario when we read and write matrices just once, the arithmetic intensity of SpGEMM is very low: often less than one. Let’s consider multiplying two ER matrices and , since cf for ER matrix is close to 1 in expectation [5], according to our model, AI will be around flops/byte. At this AI, SpGEMM’s performance is completely bound by memory bandwidth333On modern processors, SpGEMM can be computation bound only if , it is unrealistic for sparse matrices..

Peak performance of an SpGEMM algorithm. Suppose, a smart algorithm achieves the best AI of . Let be the memory bandwidth (BW) of the system measured by the STREAM benchmark. Then, the performance measured by FLOPS (floating point operations per second) follows this inequality:


Hence, the peak FLOPS for a given problem on a given architecture can be at most assuming that SpGEMM is bandwidth bound. For example, on an Intel Skylake processor with 50GB/s memory bandwidth, the peak performance for multiplying ER matrices can be at most GFLOPS as shown in Fig. 3. However, state-of-the-art column SpGEMM algorithms achieve less than 20% of this peak performance as discussed in several recent papers [25].

As discussed before, the primary reasons behind the suboptimal performance of SpGEMM algorithms are: (1) algorithms read/write data multiple times, (2) algorithms access data at random memory locations, (3) algorithms may not fully utilize cache lines. The first problem is inherent to SpGEMM and cannot be completely overcome when input matrices are unstructured. The irregular data access can under utilize bandwidth, impeding the performance of SpGEMM when the compression factor is small. In this paper, we address this irregular access problem and develop an algorithm where all steps of SpGEMM utilize full memory bandwidth. Nevertheless, Equation 1 seems an upper bound that no existing algorithm can attain. Next, we consider a more practical bound for AI.

A more practical bound on AI for SpGEMM. In column SpGEMM algorithm, we read the first input matrix several times depending on the nonzero pattern of . To obtain a lower bound for column SpGEMM, we assume that the accesses of have no temporal or spatial locality, and all accesses of incur memory traffic. Hence, in the worst case, the amount of data read from is bytes. Hence AI for column SpGEMM can be approximated as follows:

AI(col) (3)

By contrast, the outer product based algorithm based on the ESC strategy generates all unmerged tuples in , writes all nnz() = tuples in memory, reads them again for sorting and merging. Hence, in the worst case, ESC-based algorithms performs additional memory red-write operations, giving us the following AI:

AI(outer) (4)

For ER matrices with , Eq. 4 gives us an arithmetic intensity of 1/80 (assuming bytes). Fig. 3 shows these lower bounds on AI with the corresponding attainable performance. We will experimentally show that the newly-developed outer-product-based algorithm can attain the peak performance based on Eq. 4.

Fig. 3: Using Roofline model to estimate performance when multiplying two Erdős-Rényi matrices on a single-socket Intel Skylake processor. Memory bandwidth is 50GB/s as measured with STREAM benchmark

Iii The PB-SpGEMM Algorithm

Fig. 4: An example of PB-SpGEMM multiplying two 44 matrices with two bins, the first bin blocks propagation of tuples with rowid 0 and 1, the second blocks 2 and 3
Fig. 5: Using thread-private local bins to utilize cache lines. If is the total number of global bins, each thread also creates number of small local bins to store tuples as they are generated. When a local bin is full, tuples inside will be flushed to the corresponding global bin. This example is using two local bins per thread and two global bins

Iii-a Overview of the PB-SpGEMM algorithm

Algorithm 1 provides a high-level description of an SpGEMM algorithm based on the expand-sort-compress scheme. In the symbolic step, we estimate for the current multiplication and allocate memory for unmerged tuples in . Then, multiplied tuples are formed and stored in , which is then sorted and merged to form .

Our algorithm follows the exact same principle, but uses outer products and propagation blocking for efficient bandwidth utilization. Fig. 4 explains the propagation blocking idea based on two matrices and two bins. After we expand tuples, we partially order them in two bins, where the first bin stores rowid 0 and 1, and the second bin stores rowid 2 and 3. If these bins fit in L2 cache, sorting and merging can be be performed efficiently in cache by different threads. After generating a tuple, if we directly write it to its designated global bin, we may not fully utilize the cache line. Hence, each thread also maintains small local bins that are filled in cache before flushing to global bins in memory. The use of local bins is illustrated in Fig. 5. Hence, our algorithm has several tunable parameters, including (a) nbins: number of global or local bins, and (b) Lbinwidth: the width of local bins. We experimentally select these parameters as will be discussed in the experimental section.

Algorithm 2 described the PB-SpGEMM algorithm. To facilitate outer product operations, input matrices and are passed as CSC and CSR formats, respectively. Here, we store in CSR format, but it can be easily stored in CSC without any overhead. Finally, the expanded matrix is stored in the COO format. Similar to Algorithm 1, PB-SpGEMM has four phases (a) symbolic (line 1) (b) Expand (lines 5-14), (c) Sort (line 16), and (d) Compress (line 17). However, Algorithm 2 differs from previous ESC algorithms in two crucial ways: (1) we use outer product to stream data from input matrices, (2) we use propagation blocking to organize the expanded matrix into bins so that all phases of the algorithm saturate the memory bandwidth. We additionally perform a post-processing step (line 9) to convert the output to CSR format. Next, we discuss these phases in detail.

Input: ,
1 Symbolic (A, B) Create space for ;
2 Expand (, ) Create unmerged tuples ;
3 Sort () perform radix sort using (rowid, colid) as keys;
4 Compress () merge duplicated tuples ;
Algorithm 1 ESC-SpGEMM algorithm
Input: in CSC, in CSR
Output: in CSR
1 GBin expanded tuples () partitioned into bins;
2 nthreads Number of threads ;
3 Lbinwidth 512 Local bin width, default set to 512 bytes ;
4 LBin Create a 3D array of size nthreads nbins Lbinwidth LBin[] is local bin, private for thread ;
5 for  to in parallel do
6       The thread executing the th iteration ;
7       for (rowid, i, aval)  do
8             for (i, colid, bval)  do
9                   binid rowid % nbins ;
10                   if size(LBin[][binid]) = Lbinwidth then
11                         MemCopy (GBin[binid], LBin[][binid]) ;
12                         LBin[][binid] ;
14                  mval aval bval ;
15                   Append (LBin[][binid], (rowid, colid, mval));
19for all thread in parallel do
20       for binid 1 to  do
21             if size(LBin[][binid]) 0 then
22                   MemCopy (GBin[binid], LBin[][binid]) ;
26for binid to in parallel do
27       GBin[binid] Sort (GBin[binid]) perform radix sort in the th bin using (rowid, colid) as keys;
28       GBin[binid] Compress (GBin[binid]) merge duplicated tuples by two-pointer method;
Algorithm 2 PB-SpGEMM algorithm

Iii-B Symbolic Phase

In the symbolic phase, we estimate the memory requirement for , estimate number of bins and allocate space for global bins (Gbin). Algorithm 3 describes the symbolic step. We compute flops for the current multiplication using an outer-product style computation. The loop in Line 2 accesses column by column and row by row. If there is a nonzero entry in , it must be multiplied by all nonzeros in the th row of . Hence, line 5 adds to the count. After we compute flops, we compute the number bins (line 6) so that each global bin fits in the L2 cache in the sorting and merging phase. We then allocate memory for global bins (line 7). Algorithm 3 needs only time and attains higher memory bandwidth by streaming just row and column pointer arrays of and , respectively.

Note that Algorithm 3 is much simpler than symbolic steps used in column SpGEMM algorithms where we need to estimate . An outer product algorithm can be also developed without a symbolic phase. For example, a linked-list can be used to dynamically append expanded tuples [28]. However, the dynamic memory allocations in parallel sections could result in poor performance.

Input: A in CSC, B in CSR
Output: Gbin, nbins
1 flops 0;
2 for  to in parallel do
3       - ;
4       - ;
5       flops
6nbins flops / L2_CACHE_SIZE ;
Gbin MemAlloc(flops) allocate shared array to store tuples, no initialization needed
Algorithm 3 Symbolic Phase (Static Schedule)

Iii-C Expand

Lines 5-14 in Algorithm 2 describes the expand phase of PB-SpGEMM. In the expand phase, a thread reads and and performs their outer product. The binid is computed by rowid of the multiplied tuple (rowid comes from the rowid in ). Once a local bin, this thread will flush the tuples inside to the corresponding global bin (line 10-12). Then, the newly-formed tuple is appended to its designated local bin in line 14. After the multiplication, there still could be some tuples left in the local bins because bins were not full. Lines 15-18 send the partially full local bins to global bins.

With the local binning, we always write tuples in multiple of cache lines. We make sure the number of local bins and the size of local bins are small. typically, we create 1024 bins and 512 bytes per local bin so that all local bins for a thread easily fit in the cache.

Iii-D Sorting

After the multiplication and propagation blocking phase, the expanded matrix is stored in the COO format, partitioned into several bins. Then, we sort to bring tuples with the same (rowid, colid) pair close to one another for merging in the next phase. As shown in Algorithm 2, sorting can be performed independently in each bin because bins do not share tuples with the same rowid. Hence, a thread can sort tuples in a bin sequentially, while other threads sort other bins in parallel.

The sorting algorithm uses the (rowid, colid) pairs as keys and the multiplied values as payloads. For this purpose, we use an in-place radix sort (similar to American flag sort [24]) that groups the keys by individual bytes sharing the same significant byte position. In the worst case, this in-place radix sort needs passes over the data, where is the number of bytes needed to store a key. Hence, Radix sort can be faster than comparison-based sorting when keys are stored in fewer bytes.

Preparing integer keys for Radix sort. In our algorithm, we use 4-byte integers for row and column indices. We concatenate the rowid and colid to form a combined 8-byte integer key for radix sort. With 8-byte keys, radix sort may need 8 passes over the data to sort tuples, which can incur significant data transfers. We can reduce the key space by using the fact that bins are already grouped by consecutive row indices. For example, if the input is a 1M 1M matrix and we create 1K bins to block the propagation, the rowids of tuples within the same bin will in adjacent 1k, then we only need 10 bits to represent the remainder rather than a full 32-bit integer, and we still have 32-10=22 bits to store colid. Furthermore, if we assume that matrices have at most 1M rows and columns, we can use 20 bits for colid and 20-10=10 bits for rowid (assuming 1K bins). Hence, in most practical cases, we can potentially squeeze keys into 4-byte integers, needing four passes over the data for sorting.

Phase Comp. Complexity Bandwidth cost Latency In-cache operations Parallelism
Expand reading bytes Negligible manipulating local bins cols of and
writing bytes rows of per thread
Sort reading bytes Negligible shuffling bytes bins per thread
Compress writing bytes Negligible read/write bytes bins per thread

TABLE III: Computational complexity and data access patterns in different phases of PB-SpGEMM.

In-cache sorting. Since we sort containing tuples, four passes over the data directly from memory can be the performance bottleneck. Fortunately, bins can help in this case if tuples in a bin fit in L3 or L2 cache. In many practical problems, we can indeed fit a bin into L2 cache. For example, consider an ER matrix with 1M rows, 1M columns and 4M nonzeros. When computing , we generate 16M tuples in expectation. If 1K bins are used, each bin will contain 16K tuples. If we use 4 byte keys (as described in the previous paragraph) and 8 byte payloads, we need 192KB memory to store all tuples in a bin. This can easily fit in L2 cache of most modern processors. Multiplications with high compression ratios and matrices with denser rows can create problems for some bins as tuples may not fit in L2 cache. In these cases, we either use more bins or create bins with variable ranges of rows. Hence, our algorithm reads a bin from memory and perform radix sort on data stored in cache. The sorted data can then be compressed while it is still in the cache. Hence, the sorting phase reads bytes.

Iii-E Compression

After we sort each bin, tuples with the same (rowid, colid) pair are stored in adjacent locations. Then, in the compression phase, we sum numeric values from tuples with the same (rowid, colid). As shown in Algorithm 2, compression can be performed independently in each bin because bins do not share tuples with the same rowid. Hence, a thread can compress tuples in a bin sequentially, while other threads compress other bins in parallel.

As tuples are already sorted within a bin, compression is done by scanning the tuples in the sorted order. We implement this using two in-place pointers, which only walk the array once. The first pointer walks through the array, the second pointer maintains the location to be merged. Every time when points to a new location it checks with , if the keys of the two tuples are the same, simply add the numeric value of the first tuple to the second, if not, we move to the next location and copy the tuple2 there, keep doing this until the reach the end of the array.

Table II summarizes the computational complexity, bandwidth and latency costs, in-cache operations and parallelism schemes for all phases of PB-SpGEMM.

Iv Experiment Setup

Iv-a Software

We evaluate the performance of PB-SpGEMM against some state-of-the-art column SpGEMM algorithms, namely HeapSpGEMM, HashSpGEMM and HashVecSpGEMM. All of these algorithms have memory access patterns of a typical column SpGEMM algorithm discussed earlier.

  • [leftmargin=*]

  • HeapSpGEMM is a column SpGEMM algorithm that uses heaps to merge columns [2]. To multiply two ER matrices with average nonzeros per column, HeapSpGEMM complexity is , where the term comes from manipulating heaps. Hence, HeapSpGEMM can be efficient for matrices with small , but can be expensive for relatively dense matrices. Each column of can be formed in parallel using thread-private heaps.

  • HashSpGEMM uses a hash table to merge columns[26, 25]. Its complexity is for ER matrices assuming that hash tables have limited collisions. Each column of can be formed in parallel using thread-private hash tables.

  • HashVecSpGEMM is a variant of hash algorithm, which utilizes vector registers for hash probing [25]. HashVecSpGEMM may preform better when the collision in the hash table is high.

Previous work [25] has shown that the optimized heap and hash algorithms largely outperform Intel MKL and Kokkos-kernel. Considering this, we will not include those algorithms and software in our evaluation. All implementations are compiled by GCC-8.2.0, with flags ”-fopenmp” ”-O3” ”-m64” ”-march=native” enabled.

Skylake-SP POWER9
CPU Model Intel Xeon Platinum 8160 IBM POWER9
Architecture x86_64 ppc64le
#Sockets 2 2
#Cores/socket 24 20
Clock 2.1GHz 3.8GHz
L2 cache 1024KB/core 512KB/two cores
L3 cache 33792KB/socket 10240KB/two cores
Memory Size 250GB 1TB
Bandwidth 100GB/s 250GB/s
TABLE IV: The configuration of hardware used for evaluation
Copy Scale Add Triad
single socket 47.40 46.85 54.00 57.04
dual socket 97.73 87.43 107.00 108.42
TABLE V: Stream benchmark result of the evaluation platform

Iv-B Hardware

In our experiments, we use a dual-socket Intel Skylake system and an IBM POWER9 system as described in Table IV. Since most of our experiments are conducted on the Skylake processor, we examine its memory system carefully using the STREAM benchmark [23]. Table V shows the sustainable memory bandwidth for Copy, Scale, Add and Triad benchmarks on single and dual sockets of the Skylake system. Hence, we expect that PB-SpGEMM should attain GB/s on a single socket and GB/s on two sockets of Skylake. While PB-SpGEMM indeed attains GB/s in every phase, its dual socket performance falls short of STREAM benchmark. Hence, we primarily focus the single-socket performance because memory bandwidth is harder to predict in Non-Uniform Memory Access (NUMA) domains. We also show some results with both sockets and explain dual socket performance in Sec. V-D. When experimenting with a single socket, we set ”OMP_PLACES” to ”cores”, ”OMP_PROC_BIND” to ”close”, and restrict the memory allocation to NUMA node 0 by using ”numactl –membind=0”.

Iv-C Dataset

We closely follow a recent paper [25] to select matrices to test SpGEMM algorithms. We use the R-MAT recursive matrix generator to generate synthetic matrices. RMAT has four parameters and . ER matrices are generated with a=b=c=d=0.25, and Graph-500 matrices are generated with a=0.57, b=c=0.19, d=0.05. Here, we refer to the latter graphs as RMAT. For random graphs, edge factor denotes the average number of nonzeros in a row or column. A matrix of scale has rows and columns. We also use 12 real-world matrices from SuiteSparse Matrix Collection  [12] as shown in Table VI. In our experiments, we randomly generate two random matrices and multiply them or multiply a real matrix with itself. These computations cover some representative applications such as Markov clustering [4] and triangle counting [3]. Due to space limitation, we did not explore other application scenarios such as multiplying a square matrix by a tall-and-skinny matrix as needed in betweenness centrality algorithms.

max width=
n(A) nnz(A) d(A) flops nnz(C) cf
2cubes_sphere 101.5K 1.6M 16.23 27.5M 9.0M 3.06 amazon0505 410.2K 3.4M 8.18 31.9M 16.1M 1.98 cage12 130.2K 2.0M 15.61 34.6M 15.2M 2.14 cant 62.5K 4.0M 64.17 269.5M 17.4M 15.45 hood 220.5K 9.9M 44.87 562.0M 34.2M 16.41 m133_b3 200.2K 800.8K 4.00 3.2M 3.2M 1.01 majorbasis 160.0K 1.8M 10.94 19.2M 8.2M 2.33 mc2depi 525.8K 2.1M 3.99 8.4M 5.2M 1.6 offshore 259.8K 4.2M 16.33 71.3M 69.8M 3.05 patents_main 240.5K 560.9K 2.33 2.6M 2.3M 1.14 scircuit 171.0K 958.9K 5.61 8.7M 5.2M 1.66 web-Google 916.4K 5.1M 5.57 60.7M 29.7M 2.04

TABLE VI: A list of real-world graphs used in our evaluation
(a) Local bin width
(b) The number of bins
Fig. 6: Impact of bin width and the number of bins, data based on ER matrices with scale 20 and edge factor 4
(a) Performance scaling with matrix density
(b) Sustained bandwidth
Fig. 7: Performance and bandwidth evaluation with ER matrices on a single socket (24 cores) from Skylake
Fig. 8: Performance evaluation with ER matrices on a single socket (20 cores) from Power9
(a) Performance scaling with matrix density
(b) Sustainable bandwidth
Fig. 9: Performance and bandwidth evaluation with RMAT matrices on a single socket (24 cores) from Skylake
Fig. 10: Performance evaluation with RMAT matrices on a single socket (20 cores) from Power9

V Results

V-a Select parameters of PB-SpGEMM

Fig. 11: Performance evaluation with real matrices (matrix squaring) on a single socket of the Skylake server. From left to right, we sort matrices in the ascending order of the compression factor

In the expand phase, local bins are used to improve data locality, the propagation of tuples will be blocked into small bins, and they are moved to global shared memory. A local bin should be large enough to have good utilization of a cache line so that it can be send to global bins without wasting memory bandwidth. Furthermore, the total size of local bins per thread should be smaller than the L2 cache 444For system that has multiple threads per core, it should be counted by core.. Hence, number of bins and local bin width are two parameters upon which the performance of PB-SpGEMM depends. Fig. 5(a) shows our experiment where we vary the width of local bins to observe its impact on the performance of the expand phase. Smaller local bins do not utilize full cache lines, resulting in reduced sustained bandwidth. Based on this experiment, we used 512 bytes for every thread-private bin in all experiments in the paper.

The number of bins (), on the other hand, is a tradeoff between the expand and sort phases. Recall that sorting is performed independently in each bin. Hence, increasing the number of bins ensures that tuples in a global bin fit in the cache. However, increasing the number of bins also reduces the average bin size, which may reduce the bandwidth utilization of the expand phase. Fig. 5(b) shows the impact of over the expand and sorting phases. With more bins, radix sort can be entirely performed in L2 cache. Hence, in-cache sorting bandwidth can be as high as 200 GB/s. Hence, the number of bins is determined by L2/L3 cache size and total number of flop, and for most practical matrices, we use 1K or 2K bins.

V-B Overall performance of PB-SpGEMM with respect to the state-of-the-art

At first we discuss the overall performance of PB-SpGEMM with respect to state-of-the-art column SpGEMM algorithms. Here, we consider both ER, RMAT, and real matrices as discussed in Sec IV-C.

Performance with ER random matrices. Fig. 6(a) compares the performance of PB-SpGEMM, HeapSpGEMM, HashSpGEMM, and HashVec-SpGEMM with ER matrices of various scales and edge factors on a single socket of the Skylake server. Here, we multiply two ER matrices with the same scale and edge factor. For a given scale, the performance of PB-SpGEMM is stable (between 700 and 800 GFLOPS) and is better than column SpGEMM algorithms for all edge factors considered. This performance of PB-SpGEMM can be explained by Fig. 6(b) that reports the sustained bandwidth of PB-SpGEMM. We observed that the sustained bandwidth on a single socket is between 40 and 50 GB/s, which are close to the STREAM benchmark. Since RE matrices have , the lower bound of AI is flops/byte according to Eq. 4. Hence, the performance of PB-SpGEMM should be at least MFLOPS when the sustained bandwidth is 40 GB/s and at least MFLOPS when the sustained bandwidth is 50 GB/s. Fig. 6(a) confirms that PB-SpGEMM’s performance remains close to this lower bound estimate. By comparison, hash and heap algorithms have lower performance primarily because of their irregular memory accesses. However, as we increase the edge factor, the performance of column SpGEMM may increase because of increased utilization of cache lines.

Similar performance is observed on the Power9 system as shown in Fig. 8. As observed on Skylake, PB-SpGEMM performs better than column SpGEMM algorithms and its performance remains relatively stable for various matrix size and sparsity.

Performance with RMAT random matrices. Fig. 8(a) compares the performance of SpGEMM algorithms with RMAT matrices of various scales and edge factors on a single socket of the Skylake server. Here, we multiply two RMAT matrices with the same scale and edge factor. As with the ER matrices, the performance of PB-SpGEMM remains between 700 and 900 GFLOPS and is generally better than column SpGEMM algorithms. However, Fig. 8(b)

shows that PB-SpGEMM attains lower sustained bandwidth between 30 and 40 GB/s. The reason behind this lower than STREAM bandwidth is the skewed degree distributions in RMAT matrices, resulted in variable-size bins. Load imbalance in different bins makes the expansion phase less bandwidth efficient. We will discuss more in the scalability results.

Similar performance is observed on the Power9 system as shown in Fig. 10. As observed on Skylake, PB-SpGEMM performs better than column SpGEMM algorithms and its performance remains relatively stable for various matrix size and sparsity.

Fig. 12: Scalability on ER (left) and R-MAT matrices (right). Both are scale 16 matrices with edge factor of 16.
Fig. 13: Scalability breakdown on ER (left) and R-MAT matrices (right). Both are scale 16 matrices with edge factor 16.

Performance with real matrices. Fig. 11 shows the performance of SpGEMM algorithms when squaring real matrices on a single socket of the Skylake server. As before, the sustained bandwidth of PB-SpGEMM is between 47 and 55 GB/s and its performance is relatively stable. In Fig. 11, we sort matrices in the ascending order of the compression factor (from left to right). PB-SpGEMM is generally faster than its peers.

V-C Scalability

Fig. 12 shows the strong scaling of SpGEMM algorithms from 1 to 24 threads within a socket of the Skylake processor. We observe that PB-SpGEMM runs faster that column SpGEMM on all concurrencies. All SpGEMM algorithms scale well within a socket. On 24 cores, PB-SpGEMM attains about speedup for ER matrices and speedup for RMAT matrices. For RMAT matrices, PB-SpGEMM does not scale well on high thread counts because of the load imbalance caused by highly skewed nonzero and distributions. We tried to eliminate the load imbalance by variable length bins, but this can lead to lower sustained bandwidth as was observed in Fig. 8(b).

Fig. 14: Multi-socket performance of SpGEMM algorithms on Skylake system

V-D Dual-socket Performance

Thus far, we have only considered the performance of SpGEMM algorithms on a single socket of Skylake and Power9 processors. Fig. 14 shows the performance of SpGEMM algorithms on dual socket Skylake processor. PB-SpGEMM still runs faster for ER matrices, but runs slightly slower than heap algorithm for RMAT matrices. This lower-than-expected performance of PB-SpGEMM on NUMA systems is due to inter-socket communication contentions. If a bin is allocated on socket-1 in the expand phase and sorted by a thread from socket-2, the performance of PB-SpGEMM depends on cross-socket memory bandwidth. We checked cross-socket memory bandwidth empirically by placing data in one socket and accessing from another socket in a STREAM copy kernel. Table VII shows the local and remote access bandwidth and latency. Memory latency was measured by Intel Memory Latency Checker. We observe that cross-socket access is much slower than local access. Hence, PB-SpGEMM’s performance relies saturating the memory bandwidth, it is affected by lower cross-socket bandwidth. Note that column SpGEMM algorithms are not significantly affected by cross-socket bandwidth because they generate one column at a time, where the active column usually stays in cache.

In the Master’s thesis of the first author, we tried to improve the dual socket performance by partitioning into two matrices and multiply each part with independently in two sockets. This partitioned PB-SpGEMM partially mitigates the cross-socket bandwidth problem, but it does not perform uniformly well for all matrices due to the additional cost of reading more than once. We did not cite the thesis as per the double blind policy.

NUMA socket 0 NUMA socket 1
NUMA socket 0 50.26GB/s and 88.1ns 33.36GB/s and 147.4ns
NUMA socket 1 34.06GB/s and 146.7ns 50.12GB/s and 88.3ns
TABLE VII: NUMA local and cross-socket memory bandwidth and latency on Skylake

Vi Conclusions

With the rise of sparse and irregular data, SpGEMM has emerged as an important operation in many scientific domains. Over the past decade, the state-of-the-art of parallel SpGEMM algorithms has progressed significantly. However, understanding the performance of SpGEMM algorithms remains a challenge without an established performance model. Relying on the fact that SpGEMM is a bandwidth-bound operation, we used the Roofline model to develop bounds for SpGEMM algorithms based on column-by-column merging and the expand-sort-compress strategy. We conclude the paper with the following key findings:

  1. [leftmargin=*]

  2. We can estimate the arithmetic intensity (AI) of an SpGEMM algorithm based on the compression factor of the multiplication and number of bytes needed to store each nonzero.

  3. The attainable performance of an algorithm is , where is the memory bandwidth. This peak performance can only be attained if the algorithm saturates the bandwidth. We showed that existing column SpGEMM algorithms do not attain peak performance according to the Roofline model because of irregular data accesses and underutilization of cache lines.

  4. We develop a new algorithm based on outer product of matrices. This algorithm called PB-SpGEMM uses propagation blocking to group multiplied tuples into bins and then sort and merge tuples independently in each bin.

  5. PB-SpGEMM approximately saturates the memory bandwidth in all of its three phases and attains performance as predicted by the Roofline model.

  6. On a single socket, PB-SpGEMM performs better than the best column SpGEMM algorithms for multiplications with compression factors less than four.

  7. For multiplications with compression factors greater than four, HashSpGEMM is the best performer.


  • [1] R. R. Amossen and R. Pagh (2009) Faster join-projects and sparse matrix multiplications. In Proceedings of the 12th International Conference on Database Theory, Cited by: §I.
  • [2] A. Azad, G. Ballard, A. Buluç, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, and S. Williams (2016) Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing 38 (6), pp. C624–C651. Cited by: §II-B, §II-B, TABLE I, 1st item.
  • [3] A. Azad, A. Buluç, and J. Gilbert (2015) Parallel triangle counting and enumeration using matrix algebra. In IPDPSW, Cited by: §I, §IV-C.
  • [4] A. Azad, G. A. Pavlopoulos, C. A. Ouzounis, N. C. Kyrpides, and A. Buluç (2018) HipMCL: A high-performance parallel implementation of the markov clustering algorithm for large-scale networks. Nucleic acids research. Cited by: §I, §IV-C.
  • [5] G. Ballard, A. Buluc, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz, and S. Toledo (2013) Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, pp. 222–231. Cited by: §II-C.
  • [6] G. Ballard, C. Siefert, and J. Hu (2016) Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM Journal on Scientific Computing 38 (3), pp. C203–C231. Cited by: §I.
  • [7] S. Beamer, K. Asanović, and D. Patterson (2017) Reducing pagerank communication via propagation blocking. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 820–831. Cited by: §I.
  • [8] N. Bell, S. Dalton, and L. N. Olson (2012) Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM Journal on Scientific Computing 34 (4), pp. C123–C152. Cited by: §I.
  • [9] A. Buluc and J. R. Gilbert (2008) On the representation and multiplication of hypersparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–11. Cited by: §II-B, §II-B, TABLE I.
  • [10] A. Buluç and J. Gilbert (2008-04) On the representation and multiplication of hypersparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–11. External Links: Document Cited by: §I.
  • [11] S. Dalton, L. Olson, and N. Bell (2015) Optimizing sparse matrix—matrix multiplication for the GPU. ACM Transactions on Mathematical Software (TOMS) 41 (4), pp. 25. Cited by: §I, §II-B, §II-B, §II-B, TABLE I, footnote 1.
  • [12] T. Davis The University of Florida Sparse Matrix Collection. External Links: Link Cited by: §IV-C.
  • [13] M. Deveci, C. Trott, and S. Rajamanickam (2017) Performance-portable sparse matrix-matrix multiplication for many-core architectures. In IPDPSW, pp. 693–702. Cited by: §II-B, TABLE I.
  • [14] X. Feng, Y. Xie, M. Song, W. Yu, and J. Tang (2018) Fast randomized pca for sparse data. In ACML, Cited by: §I.
  • [15] J. R. Gilbert, C. Moler, and R. Schreiber (1992) Sparse matrices in MATLAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications 13 (1), pp. 333–356. Cited by: §II-B, §II-B.
  • [16] J. R. Gilbert, S. Reinhardt, and V. B. Shah (2007) High performance graph algorithms from parallel sparse matrices. In PARA, pp. 260–269. Cited by: §I.
  • [17] A. Griewank and U. Naumann (2003-03) Accumulating jacobians as chained sparse matrix products. Math. Program. 95, pp. 555–571. External Links: Document Cited by: §I.
  • [18] F. G. Gustavson (1978) Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM TOMS 4 (3), pp. 250–269. Cited by: §II-B.
  • [19] Y. Jin and J. JáJá (2016) A high performance implementation of spectral clustering on cpu-gpu platforms. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 825–834. Cited by: §I.
  • [20] H. Kaplan, M. Sharir, and E. Verbin (2006) Colored intersection searching via sparse rectangular matrix multiplication. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pp. 52–60. External Links: ISBN 1-59593-340-9, Link, Document Cited by: §I.
  • [21] D. Langr and P. Tvrdik (2015) Evaluation criteria for sparse matrix storage formats. IEEE Transactions on parallel and distributed systems 27 (2), pp. 428–440. Cited by: §II-A.
  • [22] J. Liu, X. He, W. Liu, and G. Tan (2019) Register-aware optimizations for parallel sparse matrix–matrix multiplication. International Journal of Parallel Programming 47 (3), pp. 403–417. Cited by: §I, §II-B, §II-B, §II-C, TABLE I.
  • [23] J. D. McCalpin (1991-2007) STREAM: sustainable memory bandwidth in high performance computers. Technical report University of Virginia. Cited by: §IV-B.
  • [24] P. M. McIlroy, K. Bostic, and M. D. McIlroy (1993) Engineering radix sort. Computing systems 6 (1), pp. 5–27. Cited by: §III-D.
  • [25] Y. Nagasaka, S. Matsuoka, A. Azad, and A. Buluç (2019) Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. Parallel Computing 90, pp. 102545. Cited by: §I, §II-B, §II-C, TABLE I, 2nd item, 3rd item, §IV-A, §IV-C.
  • [26] Y. Nagasaka, A. Nukada, and S. Matsuoka (2017) High-performance and memory-saving sparse general matrix-matrix multiplication for NVIDIA Pascal GPU. In ICPP, pp. 101–110. Cited by: §II-B, 2nd item.
  • [27] M. M. Ozdal (2019) Improving efficiency of parallel vertex-centric algorithms for irregular graphs. IEEE Transactions on Parallel and Distributed Systems 30 (10), pp. 2265–2282. Cited by: §I.
  • [28] S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti, H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski (2018) OuterSPACE: an outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Cited by: §II-B, TABLE I, §III-B.
  • [29] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson, S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey (2015) Parallel efficient sparse matrix-matrix multiplication on multicore platforms. In ISC, pp. 48–57. Cited by: §II-B, TABLE I.
  • [30] S. Williams, A. Waterman, and D. Patterson (2009) Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52 (4), pp. 65–76. Cited by: §I.
  • [31] R. Yuster and U. Zwick (2004) Detecting short directed cycles using rectangular matrix multiplication and dynamic programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Cited by: §I.