SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations

10/23/2019
by   Konstantinos Kanellopoulos, et al.

Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruction overhead and expensive pointer-chasing operations to discover the positions of the non-zero elements. In this paper, we identify the discovery of the positions (i.e., indexing) of non-zero elements as a key bottleneck in sparse matrix-based workloads, which greatly reduces the benefits of compression. We propose SMASH, a hardware-software cooperative mechanism that enables highly-efficient indexing and storage of sparse matrices. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit sparsity in data. To this end, we devise a novel software encoding based on a hierarchy of bitmaps. This encoding can be used to efficiently compress any sparse matrix, regardless of the extent and structure of sparsity. At the same time, the bitmap encoding can be directly interpreted by the hardware. We design a lightweight hardware unit, the Bitmap Management Unit (BMU), that buffers and scans the bitmap hierarchy to perform highly-efficient indexing of sparse matrices. SMASH exposes an expressive and rich ISA to communicate with the BMU, which enables its use in accelerating any sparse matrix computation. We demonstrate the benefits of SMASH on four use cases that include sparse matrix kernels and graph analytics applications.


1. Introduction

Sparse linear algebra operations are widely used in modern applications like recommender systems (linden2003amazon; recommenderFB2019; recommendfb2), neural networks (Benjamin2014spatially; liu2015sparse), graph analytics (Brin1998the; besta2017slimsell), and high-performance computing (solversGPU; Falgout2006an; dongarra1996sparse; falgout2002hypre; henon2002pastix). The matrices involved in these operations are very large in size and highly sparse, i.e., the vast majority of the elements are zeros. For example, the matrices that represent Facebook’s and YouTube’s social network connectivity contain 0.0003% (SmithFacebook) and 2.31% (leskovec2016snap) non-zero elements, respectively. These highly sparse matrices lead to significant inefficiencies in both storage and computation. First, they require an unnecessarily large amount of storage space, which is largely occupied by zeros. Second, computation on highly sparse matrices involves a large number of unnecessary operations (such as additions and multiplications) on zero elements. The traditional solution to these inefficiencies is to compress the matrix and store only the non-zero elements, and then operate only on the non-zero values.

Prior works take two major approaches to designing such compression schemes. The first approach is to devise general compression formats or encodings  (datacompression2006; im1999optimizing; formatSurvey2016; csr52015; variableblock2005; blockoptim1999; Sengupta2007scan; bccoo2014), such as CSR (csr52015), COO (Sengupta2007scan), BCSR (im1999optimizing), and CSR5 (csr52015). Such formats essentially store the non-zero elements and their positions within the matrix using additional data structures and different encodings. Such encodings are general in applicability and are highly-efficient in storage, with high compression ratios. However, this approach involves repositioning and packing the non-zero values in the matrix, which leads to significant computation overheads that diminish the overall benefit. Determining the positions of the non-zero elements in the compressed encoding (i.e., indexing) requires a series of pointer-chasing operations in memory that, as we demonstrate, are highly inefficient in modern processors and memory hierarchies.

The second approach taken by prior work is to leverage a certain known structure in a given type of sparse matrix to avoid the cost of discovering the non-zero regions of the sparse matrix  (uniquekourtis2008; csxkourtis2011; smatautotune2013; csb2009; pattern2009; cocktail2012; csradaptgpus2014). For example, the DIA format (pattern2009) is highly effective in matrices where the non-zero values are concentrated along the diagonals of the matrix. Specializing the compression scheme to patterns in the sparsity can be efficient in both computation and storage but it is highly specific to certain types of matrices and inapplicable when the structure and extent of sparsity are not known a priori.

Our goal in this work is to enable efficient and general sparse matrix computation with a technique that satisfies three major requirements: 1) high computation and storage efficiency by storing and operating on only non-zero elements; 2) minimal overheads from the compression scheme (e.g., efficient discovery of non-zero elements); and 3) generality and applicability to any sparse matrix, regardless of its structure or the extent of its sparsity.

Our key idea is a new hardware-software co-design where we enable the hardware to recognize and exploit the compression encoding used in software for any sparse matrix. This allows us to add hardware support for highly-efficient storage and retrieval of non-zero values in sparse matrices, avoiding the overheads of software indexing (requirement 2). Our software encoding is designed to maintain low storage requirements (requirement 1) and be generally applicable to any sparse matrix without any assumption of structure or extent of sparsity (requirement 3).

We propose SMASH (Sparse MAtrix Software/Hardware), a general hardware-software cooperative mechanism that efficiently compresses and operates on sparse matrices. The key construct behind SMASH is the use of efficiently encoded hierarchical bitmaps to express sparsity, where each bit represents a region of non-zero values. These bitmaps are recognized by both hardware and software. On the software side, sparse matrices of any form are flexibly encoded using our hierarchical bitmap representation. SMASH adapts to each matrix’s sparsity characteristics by supporting multiple compression granularities throughout the bitmap hierarchy. On the hardware side, the bitmap representation allows us to use a lightweight hardware unit, the Bitmap Management Unit (BMU), to perform highly-efficient scans of the bitmap hierarchy. The BMU hardware enables low-cost indexing (that avoids expensive pointer-chasing lookups) and efficient sparse matrix computation.

To enable wide applicability, SMASH exposes five new instructions that enable the software to communicate with the Bitmap Management Unit. The new instructions enable efficient lookup of non-zero matrix regions, and are sufficiently rich to express a wide variety of operations on any type of (sparse) matrix.

We evaluate SMASH on four use cases: two sparse matrix kernels, Sparse Matrix Vector Multiplication (SpMV) and Sparse Matrix-Matrix Multiplication (SpMM), as well as two graph analytics applications, PageRank and Betweenness Centrality (BC). For our experiments, we use a collection of sparse matrices with varying structure and sparsity characteristics (florida). We compare SMASH to two state-of-the-art compression formats, CSR (csr52015) and BCSR (im1999optimizing). We find that SMASH improves average performance by 41.5% for SpMV and SpMM across 15 matrices, and by 20% for PageRank and BC, compared to a state-of-the-art CSR implementation (tacolib). We also compare the software-only version of SMASH against two state-of-the-art sparse matrix frameworks (intel-mkl; tacolib) on a real system. We find that even with no hardware support, SMASH’s bitmap encoding outperforms a state-of-the-art CSR implementation (tacolib).

In this paper, we make the following key contributions:


  • We show that discovering the positions of non-zero elements (indexing) is a key bottleneck in sparse matrix computation. We demonstrate that efficient indexing can boost the performance of sparse matrix operations significantly.

  • We introduce SMASH, a hardware-software cooperative mechanism that enables the hardware to recognize and exploit the compression encoding used in software. SMASH consists of 1) a novel software encoding that uses a hierarchy of bitmaps to efficiently compress sparse matrices and 2) hardware support to enable highly-efficient indexing of sparse matrices that are compressed using SMASH’s software encoding.

  • We show how SMASH can efficiently compress sparse matrices with diverse structure and sparsity characteristics using the hierarchical bitmap encoding. We design and demonstrate the effectiveness of the Bitmap Management Unit (BMU) that efficiently buffers and scans the bitmap hierarchy in hardware to identify non-zero regions in the matrix. We introduce an expressive ISA that enables the flexible use of SMASH in a wide variety of sparse matrix operations.

  • We evaluate SMASH on important sparse matrix kernels and graph analytics applications using a collection of matrices with diverse structure and sparsity. We find that SMASH provides significant performance improvements compared to state-of-the-art CSR implementations while incurring a very modest area overhead in a modern out-of-order CPU.

2. Motivation

Sparse matrix operations are widely used in a variety of applications including sparse linear algebra (sparseEIGEN; sparseLU), graph processing (Brin1998the; besta2017slimsell), convolutional neural networks (CNNs) (liu2015sparse), and machine learning (ML) (dnn2018; linden2003amazon; recommenderFB2019; recommendfb2). These applications involve matrices with very high sparsity, i.e., a large fraction of zero elements. Using a compression scheme is a straightforward approach to avoid unnecessarily 1) storing zero elements and 2) performing computations on them. To this end, a variety of sparse matrix representation formats (e.g., (datacompression2006; im1999optimizing; formatSurvey2016; csr52015; variableblock2005; blockoptim1999; Sengupta2007scan; bccoo2014)) have been proposed to compress the sparse matrix. The most widely used state-of-the-art format is Compressed Sparse Row (CSR) (csr52015). In this section, we describe the CSR format and demonstrate its inefficiency.

2.1. Compressed Storage Formats

The Compressed Sparse Row (CSR) format (csr52015) is widely used in many libraries that involve sparse matrix operations (intel-mkl; openblas1; openblas2; sparsex2018; tacolib). It consists of three one-dimensional arrays: row_ptr, col_ind, and values. The row_ptr array is used to store (and determine) the number of non-zero elements per row; the col_ind array indicates the column indices of the non-zero elements; and the values array stores the values of only the non-zero elements. Discovering the position of a non-zero element in row i requires streaming through col_ind from col_ind[row_ptr[i]] up to col_ind[row_ptr[i+1]] to discover its column index within the row. Figure 1 illustrates an example of a matrix compressed using the CSR format. In this example, in order to discover the non-zero elements of row 1 (i.e., the second row from the top) of A, we search col_ind starting from index row_ptr[1] == 1 up to index row_ptr[2] == 3.

Figure 1. Compressed Sparse Row format for a matrix with 6 non-zero elements. We count the number of non-zero elements of row i by computing (row_ptr[i+1] - row_ptr[i]). col_ind holds the column index of each non-zero element.

A variant of CSR is Compressed Sparse Column (CSC). CSC stores the elements in column-major order instead of row-major. The col_ptr array is used to store (and determine) the number of non-zero elements per column, the row_ind array holds the row indices of the non-zero elements, and the values array stores the values of the non-zero elements themselves.
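To make the indexing concrete, the snippet below walks row 1 exactly as in the Figure 1 example (row_ptr[1] == 1, row_ptr[2] == 3). The specific column indices and values are illustrative placeholders, since Figure 1's contents are not reproduced here.

#include <stdio.h>

/* A 4x4 matrix with 6 non-zeros in CSR form. The arrays are chosen
   to be consistent with the Figure 1 walk-through (row 1 spans
   col_ind[1] to col_ind[2]); the actual contents are hypothetical. */
int    row_ptr[] = {0, 1, 3, 5, 6};
int    col_ind[] = {2, 0, 3, 1, 2, 3};
double values[]  = {5.0, 1.0, 7.0, 2.0, 4.0, 9.0};

int main(void) {
    int row = 1;
    /* Stream through col_ind between row_ptr[row] and row_ptr[row+1]
       to discover the column index of each non-zero in the row. */
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
        printf("A[%d][%d] = %.1f\n", row, col_ind[j], values[j]);
    return 0;
}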

CSR significantly reduces the amount of memory needed to store a sparse matrix, especially when the matrix is large and its sparsity is high. However, CSR and schemes with CSR-like structures (csr52015; Sengupta2007scan) have one major requirement: in order to determine where the non-zero elements are located in the original matrix, the corresponding indices need to be retrieved from the row_ptr and col_ind data structures. Accessing these data structures adds many additional instructions and requires a series of indirect, data-dependent memory accesses. These overheads reduce the benefits of avoiding the computation on zero elements. Hence, even though CSR-like formats reduce storage requirements and avoid needless computation, discovering the positions of the non-zero elements is still an unsolved challenge that causes performance and efficiency overheads.

2.1.1. Sparse Matrix Vector Multiplication (SpMV)

We consider the SpMV kernel y = A·x, where A is a sparse matrix, x is a dense vector, and y is the output vector. The naive 2D implementation of the SpMV kernel involves performing computations on every element of the two-dimensional matrix A and incurs high computational and storage overheads. Code Listing 2.1.1 presents the CSR-based implementation. In this case, the algorithm iterates over only the non-zero elements and avoids unnecessary zero-element computations. However, it introduces a pointer-chasing operation when col_ind is loaded and then used as an index to load the appropriate element of the vector (x[col_ind[j]]). Only after this complex indexing operation can we perform the multiplication with the corresponding non-zero element in matrix A (values[j]), as shown in Line 3 of Code Listing 2.1.1.

1  for (i = 0; i < N; i++)
2      for (j = row_ptr[i]; j < row_ptr[i+1]; j++)
3          y[i] += values[j] * x[col_ind[j]];

Code Listing 2.1.1: CSR-based SpMV implementation. The column index of each element is needed to perform the multiplication with the x vector.

2.1.2. Sparse Matrix Matrix Multiplication (SpMM)

SpMM is traditionally performed using inner product multiplication (outerspace2018). This results in a series of dot product operations between each row of the first matrix (A) and each column of the second matrix (B) to produce the final elements of the result matrix (C). The naive (dense) SpMM implementation is prohibitively expensive due to the high number of unnecessary computation operations on zero elements. A CSR-based implementation of SpMM, shown in Code Listing 2.1.2, avoids such unnecessary computations. In SpMM, matrix A is compressed using CSR and matrix B using CSC. SpMM iterates over the rows of A and the columns of B (Lines 1-2 in Code Listing 2.1.2). For each non-zero element in each row of A, we need to search through col_ind of matrix A and row_ind of matrix B to discover which elements should be multiplied during each dot product. This process is called index matching (Lines 4-6 in Code Listing 2.1.2). Figure 2 presents an example of index matching. We need to match the positions of the non-zero elements of matrix A at row i and the non-zero elements of matrix B at column j to perform the dot product correctly. Given that index matching is performed for every dot product operation, a CSR-based SpMM implementation requires a large number of position-finding operations for non-zero elements, and thus the indexing mechanism plays a critical role in performance and efficiency.

1  for each row i of A
2      for each column j of B
3          for each non-zero element in row i of A
4              k1 = search_in_col_ind_of_A()
5              k2 = search_for_row_ind_of_B()
6              if (k1 == k2)
7                  y[i][j] += a_val[k1] * b_val[k2]

Code Listing 2.1.2: CSR-based inner-product SpMM implementation. Index matching (Lines 4-6) is needed to perform the multiplication of A's rows and B's columns (Line 7).

Figure 2. Index matching in SpMM. We search col_ind of matrix A and row_ind of matrix B to find indices that match. Only the indices of the first two elements of A’s Row 0 and B’s Column 0 match. For each of the remaining index positions in A’s Row 0 or B’s Column 0, at least one of the two matrices holds a zero element, so the indices do not match.

2.2. Limitations of Existing Compressed Storage Formats

Compressed storage formats, such as CSR (csr52015), BCSR (im1999optimizing), and CSR5 (csr52015), are effective at reducing the storage area and avoiding redundant computations on the zero elements of the matrix. However, as described above, they require additional computation and indirect memory accesses to find the indices, i.e., the row and column positions, of non-zero elements. This increases the computation burden and memory traffic, and hence lowers the potential gains from the compression scheme.

To understand the impact of this indexing overhead on sparse matrix processing, we conduct an experiment where we compare a state-of-the-art CSR implementation to an idealized version in which accessing the positions of non-zero elements does not incur any additional computation or memory access. Figure 3 shows the speedup and executed instruction count of this idealized CSR over the regular state-of-the-art CSR. As shown in the figure, eliminating the indexing part of the CSR format significantly improves performance for Sparse Matrix Addition, SpMV, and SpMM. These performance benefits come from the reduced number of executed instructions (by 49%, 42%, and 65%, respectively) and from eliminating the expensive pointer-chasing lookups in memory.

Figure 3. Speedup and normalized number of executed instructions of an ideal indexing scheme over the baseline CSR, averaged across 15 sparse matrices for Sparse Matrix Addition, SpMV, and SpMM (see Section 6 for methodology).

2.3. Other Approaches to Sparse Matrix Compression

Other approaches (uniquekourtis2008; csxkourtis2011; smatautotune2013; csb2009; pattern2009; cocktail2012; csradaptgpus2014) aim to maximize the efficiency of sparse matrix computations by specializing to particular matrix types, thus trading off generality. These approaches assume and leverage a specific matrix structure or known pattern. Saad et al. (saad2003iterative) assume a diagonal matrix and build the compression scheme around the assumption that all non-zero elements lie only along the matrix diagonal. Kourtis et al. (uniquekourtis2008) design their compression scheme assuming matrices with few unique non-zero values. As a result, these approaches are not applicable to a wide range of important classes of applications, such as CNNs and graph processing algorithms, where such assumptions do not hold.

3. SMASH: Design Overview

We introduce SMASH, a hardware-software cooperative mechanism that 1) enables highly-compressed storage and avoids unnecessary computation; 2) significantly reduces the overheads imposed by the compression scheme itself (i.e., enables efficient discovery of the positions of non-zero elements); and 3) can be used generally across a diverse set of sparse matrices and sparse matrix operations.

The key idea of SMASH is to enable the hardware to recognize and exploit the compression encoding used in software. We devise a new construct, recognized by both the hardware and software, to compress sparse matrices efficiently: a hierarchy of bitmaps. Each bitmap in the hierarchy efficiently encodes sparsity by denoting the presence/absence of non-zero values in a matrix region using a single bit. The size of the region varies with the level of the bitmap in the hierarchy and can be adjusted by software. This representation does not assume any structure in the matrix and can be used to efficiently compress a diverse set of sparse matrices. At the same time, as we demonstrate, it enables designing hardware that can exploit the known sparsity in data and hence perform highly-efficient indexing of sparse matrices.

3.1. Design Challenges

Designing SMASH involves addressing two major challenges:

Challenge 1: Efficiently Encoding Bitmaps. Representing each element in a matrix with a single bit to denote sparsity is highly-inefficient in terms of storage and computation. Hence, we need a more efficient bitmap representation that is effective regardless of the matrix sparsity and the location/distribution of the non-zero values. At the same time, hardware should be able to flexibly interpret and leverage this bitmap encoding.

Challenge 2: Flexibility and Expressiveness. To express a diverse set of sparse matrix operations in any application, we need a rich cross-layer interface between the application and the underlying hardware that 1) allows the software to flexibly manipulate and index sparse matrices encoded using our hierarchical bitmap encoding and 2) enables hardware to easily interpret the sparse matrix operations in the application and effectively accelerate those operations.

3.2. SMASH: Key Components

Hierarchy of Bitmaps. To address Challenge 1, we represent the positions of the non-zero values in any sparse matrix with a hierarchy of bitmaps. Our system is designed to support a certain maximum number of levels of the hierarchy. Each level of the hierarchy encodes the presence of non-zero values with a configurable compression ratio. This compression ratio is determined by the software based on the sparsity and distribution of non-zero values in a given matrix. With this representation, a zero matrix would require only one bit and one level of the bitmap encoding.

In Figure 4, we show an example with a 3-level bitmap hierarchy. Each bit in Bitmap-2 encodes the presence of non-zero values in two consecutive regions in Bitmap-1 (hence, the compression ratio at this level is two). Bitmap-1, on the other hand, encodes the presence of non-zero values in four consecutive regions in Bitmap-0 (hence, the compression ratio at this level is four). The selection of compression ratio at any level is a tradeoff between computation efficiency and storage efficiency. With a higher compression ratio, we store fewer bits to encode the presence/absence of non-zero elements but may perform unnecessary computations on zero elements. With a lower compression ratio, on the other hand, we can compute only on non-zero elements, but this would incur a higher storage overhead.

Figure 4. (a) A three-level bitmap hierarchy with different compression ratios between the levels. The NZA (Non-Zero Values Array) stores the non-zero values of the matrix. (b) We store in memory only the non-zero blocks of the bitmaps and the NZA.

The non-zero elements of the matrix are kept in a data structure called the Non-Zero Values Array (NZA). The granularity at which they are stored in memory depends on the compression ratio of the lowest level of the bitmap hierarchy (Bitmap-0). As we show in the example bitmap hierarchy of Figure 4, Bitmap-0 encodes the 4-element non-zero blocks of the NZA using a single bit (i.e., the compression ratio is 4:1).

Our hierarchical bitmap compression mechanism enables efficient representation of matrices with varying degrees of sparsity and varying distributions of non-zero elements by adjusting the bitmap representation granularity (i.e., compression ratios at all levels of the bitmap hierarchy). It can also be flexibly and efficiently interpreted by hardware, as we describe next.
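As a concrete illustration, a software-side layout of this encoding might look as follows. This is a minimal sketch under our own naming (smash_matrix, ratio, bits, nza); the paper does not prescribe a concrete data layout.

#include <stdint.h>
#include <stddef.h>

/* Sketch of SMASH's hierarchical bitmap encoding. Bitmap-(i) uses
   one bit per ratio[i] consecutive regions of the level below;
   Bitmap-0 uses one bit per NZA block of ratio[0] matrix elements. */
typedef struct {
    int       levels;     /* number of bitmap levels (e.g., 3)       */
    int       ratio[4];   /* ratio[i]: compression ratio of Bitmap-i */
    uint64_t *bits[4];    /* bits[i]: packed words of Bitmap-i       */
    double   *nza;        /* Non-Zero Values Array, block-contiguous */
    size_t    rows, cols; /* dimensions of the original matrix       */
} smash_matrix;

/* Test whether bit `idx` of Bitmap-`lvl` is set, i.e., whether the
   corresponding region contains at least one non-zero element. */
static inline int bitmap_test(const smash_matrix *m, int lvl, size_t idx) {
    return (m->bits[lvl][idx / 64] >> (idx % 64)) & 1;
}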

Bitmap Management Unit (BMU). We design a new hardware unit that buffers and efficiently scans the bitmap hierarchy to quickly find the non-zero elements. The BMU is responsible for efficiently and quickly calculating the indices of the non-zero regions in the matrix using the bitmap hierarchy. To this end, it caches and efficiently operates on the bitmaps in the hierarchy. Section 4.2 describes the BMU in detail.

Cross-layer interface. To flexibly accelerate a diverse range of operations on any sparse matrix, we expose to the software a rich set of primitives that 1) are general, 2) can control the BMU to quickly find the locations of non-zero elements, and 3) enable the processing core to skip unnecessary computations. These primitives are implemented as five new ISA instructions that are designed to 1) communicate the parameters of a sparse matrix and its bitmap hierarchy to the BMU, 2) load the bitmaps into the BMU, 3) scan the bitmaps to determine the location of the non-zero elements in the matrix, and 4) communicate the <row, column> positions of the non-zero elements in the original sparse matrix back to the application. These instructions are sufficiently expressive to be used for a diverse set of sparse matrix computations, such as Sparse Matrix Vector Multiplication, Sparse Matrix Matrix Multiplication, Sparse Matrix Addition, and more (sparseLU; sparseEIGEN), thereby addressing Challenge 2 (Section 3.1). Section 4.3 describes the SMASH ISA primitives in detail. Section 5 shows examples of how the primitives can be used in software applications.

4. SMASH: Detailed Design

In this section, we provide a detailed description of the different components of SMASH and their operation.

4.1. Software Compression (Hierarchical Bitmap Compression)

The hierarchy of bitmaps is the key construct of SMASH that enables 1) highly-compressed matrix storage and 2) efficient discovery of the positions of the non-zero regions of the matrix. The Non-Zero Values Array (NZA) holds the actual values of the sparse matrix. Every set bit of the last-level Bitmap-0 corresponds to one non-zero block in the NZA. The size of each non-zero block in the NZA (that is encoded by a single set bit in Bitmap-0) depends on the compression ratio used for Bitmap-0.

There are two major factors that impact the effectiveness of our hierarchical bitmap compression scheme: 1) the selected compression ratio at each level of the bitmap hierarchy (Section 4.1.1) and 2) the distribution of non-zero elements in the matrix (Section 4.1.2).

4.1.1. Impact of compression ratio.

Figure 5 demonstrates the impact of choosing different compression ratios for Bitmap-0 using a simplified example. We show two cases where the bitmap encodes the same matrix using two different compression ratios between Bitmap-0 and the NZA. In case 1, the bitmap uses a single bit to encode 8 consecutive elements of the matrix, i.e., the compression ratio is 8:1. The bitmap requires 2 bits to encode the entire matrix and the NZA holds one 8-element non-zero block (consisting of 2 non-zero and 6 zero elements). In case 2, the bitmap uses a single bit to encode 4 consecutive elements of the matrix, i.e., the compression ratio is 4:1. In this case, the bitmap requires 4 bits to encode the entire matrix and the NZA holds one 4-element block (consisting of 2 non-zero and 2 zero elements).

Figure 5. Two bitmaps that compress the matrix with two different compression ratios. Bitmap 1 encodes blocks of 8 elements with a single bit, while Bitmap 2 encodes blocks of 4 elements with a single bit.

As we demonstrate with this example, a higher compression ratio reduces the size of the bitmap and thus scanning it becomes more efficient. However, with a higher compression ratio, the NZA unnecessarily stores more zero elements. Since zeroes within the block cannot be identified a priori, the processor unnecessarily performs computation on them.

Hence, the compression ratio forms a tradeoff between 1) smaller bitmaps that can be quickly scanned and 2) more zero elements in the NZA storage and unnecessary computations on them. This major tradeoff also applies to the higher levels of the bitmap hierarchy.

4.1.2. Impact of distribution of non-zero elements in the sparse matrix

The distribution of non-zero elements across the matrix also affects the size of the NZA. On the one hand, if the non-zero elements of the matrix are closely clustered, the number of non-zero blocks of the matrix decreases and the NZA stores fewer blocks. On the other hand, if the non-zero elements of the matrix are distributed more uniformly across the matrix, the number of non-zero blocks of the matrix increases and the NZA may need to hold more non-zero blocks (that also contain more zeros).

4.1.3. Conversion to the hierarchical bitmap format

If the input data to any user application is already stored using another compression format (such as CSR), the application needs to convert the sparse matrices to the hierarchical bitmap encoding and the NZA used in SMASH. This is done using three steps. First, the application identifies all the non-zero blocks in the matrix using the indexing mechanism required by the original format. The size of the block depends on the assumed size of the non-zero blocks in the NZA. Second, the application creates the NZA by appending these non-zero blocks contiguously in memory. Third, the application creates the bitmap hierarchy, starting with Bitmap-0. To create Bitmap-0, the application determines the locations of the non-zero blocks of the matrix (where the block size is equal to the Bitmap-0 compression ratio). For each one of these locations, the application sets the corresponding bit of Bitmap-0 to 1. Next, the application creates the higher levels of the bitmap hierarchy: each level i is created based on the compression ratio of Bitmap-i and the corresponding set bits of Bitmap-(i-1). Assuming the chosen compression ratio of level i is 8:1, we set each bit in Bitmap-i if there are one or more set bits in the corresponding 8 elements of Bitmap-(i-1).

We note that this conversion process from any format to our hierarchical bitmap encoding can be automated in software, and, if needed, accelerated with hardware support.
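The sketch below follows these three steps for a dense, row-major input, collapsing the first two steps into one pass. All names are ours, not the paper's; the bitmap and NZA output arrays are assumed to be zero-initialized and sized by the caller, and the element count is assumed to be a multiple of the block size.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Steps 1-2: scan the matrix block by block; append each non-zero
   block to the NZA and set its Bitmap-0 bit. `block` is the Bitmap-0
   compression ratio (elements per bit). */
void build_bitmap0(const double *a, size_t n_elems, int block,
                   uint64_t *bmap0, double *nza, size_t *nza_blocks) {
    *nza_blocks = 0;
    for (size_t b = 0; b < n_elems / block; b++) {
        for (int e = 0; e < block; e++) {
            if (a[b * block + e] != 0.0) {
                bmap0[b / 64] |= 1ull << (b % 64);
                memcpy(&nza[*nza_blocks * block], &a[b * block],
                       block * sizeof(double)); /* append block to NZA */
                (*nza_blocks)++;
                break;
            }
        }
    }
}

/* Step 3 (higher levels): set a bit in Bitmap-(i) if any of the
   `ratio` corresponding bits of Bitmap-(i-1) is set. */
void build_level(const uint64_t *lower, size_t lower_bits, int ratio,
                 uint64_t *upper) {
    for (size_t i = 0; i < lower_bits; i++)
        if ((lower[i / 64] >> (i % 64)) & 1)
            upper[(i / ratio) / 64] |= 1ull << ((i / ratio) % 64);
}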

4.2. Hardware Indexing (Bitmap Management Unit)

The Bitmap Management Unit (BMU) provides the key functionality of scanning the bitmap hierarchy to quickly identify the non-zero elements of the matrix in a highly-efficient manner. It recognizes the bitmap encoding used in software based on the parameters of the bitmap (and sparse matrix) provided to it via the SMASH software-hardware interface (Section 4.3).

4.2.1. BMU components.

Figure 6 demonstrates the structure of the BMU. It consists of four key components: 1) the SRAM buffers that hold the bitmaps when they are being scanned, 2) the hardware logic that scans the buffers to find the non-zero blocks, 3) programmable registers that hold configuration parameters, such as the matrix dimensions and the compression ratios of the bitmap hierarchy, and that effectively orchestrate the operation of the hardware logic, and 4) two registers to store the row and column indices of the non-zero elements determined by the BMU. The BMU supports multiple groups of the components presented in Figure 6, to enable the indexing of multiple sparse matrices, where each group is dedicated to buffering and scanning a single sparse matrix.

The SRAM buffers hold the bitmaps one block at a time. In our implementation, each buffer is 256 bytes in size. The compression ratio (the number of bytes encoded by a single bit) at each level of the bitmap, including between Bitmap-0 and the NZA, must be less than or equal to the bitmap buffer size. This is to avoid loading the buffers multiple times from memory to scan a single block, which would be expensive and inefficient. For example, with a 256-byte SRAM buffer size, the maximum compression ratio supported in the BMU is 256 × 8 = 2048:1.

4.2.2. BMU operation.

As depicted in Figure 6, identifying the location of a non-zero block in the matrix involves three steps. First, the hardware logic scans the bitmap buffers to find the set bits and discover the positions of the non-zero blocks (1). Second, the hardware logic reads the matrix dimensions and the compression ratios from the programmable registers to calculate the row and column indices of the current non-zero block (2). Third, it updates the output registers (3) that hold the row (4) and column (5) indices of the non-zero block. These registers can be repeatedly read by the CPU to iteratively find the locations of the non-zero elements. To find the next non-zero element, the hardware logic looks for the next set bit in the bitmap block or loads the next bitmap block from the memory hierarchy to continue the search for non-zero blocks. These operations are controlled by software with five new ISA instructions (6), which we describe in Section 4.3.

Figure 6. The Bitmap Management Unit consists of four key components: 1) SRAM buffers to store portions of the bitmaps that are being operated on, 2) hardware logic that scans the buffers, 3) registers to hold the matrix and bitmap parameters, and 4) output registers to store the row (4) and column (5) indices of the non-zero blocks. The BMU communicates with the CPU using the SMASH ISA (6).

4.2.3. Efficient indexing with the BMU

The BMU iteratively communicates to the CPU the row and column indices of the non-zero elements in the sparse matrix. Since we use multiple levels of indirection with the hierarchical bitmap, finding each non-zero element involves traversing the bitmap hierarchy in a depth-first manner. Every time a set bit is encountered at any bitmap level, we save that bit’s index within the bitmap and then traverse the lower-level bitmap associated with that set bit. The BMU traverses the hierarchy in this manner (saving the index of the set bit at each level) until it reaches Bitmap-0. Any set bit in Bitmap-0 directly maps to a block of elements in the sparse matrix that has at least one non-zero element. Using the saved indices of the set bits at each level of the hierarchy, as well as the corresponding compression ratios, the BMU calculates the index of the non-zero block in the original sparse matrix.

The final index of the non-zero block in Bitmap-0 is computed using the hierarchy of bitmaps in the following way:

index = \sum_{i=0}^{N} \left( idx_i \times \prod_{j=1}^{i} CR_j \right)

where CR_i is the compression ratio of Bitmap-i, idx_i is the index of the set bit encountered while scanning Bitmap-i, N is the highest bitmap level, and the empty product (for i = 0) equals 1. Assuming a row-major linearization of the matrix, the row and column indices of the non-zero block in the original sparse matrix are calculated as row = (index × CR_0) / cols and col = (index × CR_0) mod cols, where cols is the number of columns of the matrix. These indices are stored in the two output registers dedicated to the row index and the column index in the BMU. They are retrieved by the CPU using new ISA instructions (described below). The application iterates through the non-zero blocks of the NZA to compute on the non-zero values. For each consecutive non-zero block, the application reads the row and column indices from the BMU registers to identify the <row, column> location of the non-zero block in the original sparse matrix.
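For concreteness, the BMU's arithmetic can be written as a short sketch. This is our own rendering, not the paper's hardware design, assuming idx[i] holds the saved set-bit index for Bitmap-i, cr[i] its compression ratio, and a row-major matrix layout.

#include <stddef.h>

/* Compute the <row, column> position of a non-zero block from the
   set-bit indices saved during the depth-first bitmap traversal.
   idx[levels-1] is the top-level index; cr[i] is the compression
   ratio of Bitmap-i. */
void nonzero_block_position(const size_t *idx, const size_t *cr,
                            int levels, size_t cols,
                            size_t *row, size_t *col) {
    /* Nested form of the summation above:
       index = (...(idx[N]*cr[N] + idx[N-1])*cr[N-1] + ...) + idx[0] */
    size_t index = idx[levels - 1];
    for (int i = levels - 1; i > 0; i--)
        index = index * cr[i] + idx[i - 1];
    /* Each Bitmap-0 bit covers cr[0] consecutive matrix elements. */
    size_t elem = index * cr[0];
    *row = elem / cols;
    *col = elem % cols;
}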

4.3. SMASH ISA: Software/Hardware Interface

We introduce five new instructions in the ISA to control the functionality provided by the BMU. Table 1 shows the instructions. These instructions are designed to i) communicate parameters of the matrix and the bitmap hierarchy to the BMU (MATINFO, BMAPINFO), ii) load the bitmaps into the BMU buffers (RDBMAP), iii) iteratively scan the bitmaps to determine the location of the next non-zero element in the matrix (PBMAP), and iv) communicate the row and column indices (in the original sparse matrix) of the non-zero element back to the application (RDIND).

MATINFO: This instruction communicates the dimensions of the matrices to the BMU. It has three source operands: row, col, and grp. row represents the number of rows and col the number of columns of the matrix. grp is used to select the group of the BMU. If the user application involves two sparse data structures, MATINFO needs to be executed twice (once for each matrix) before performing other operations with the BMU.

BMAPINFO: This instruction communicates the compression ratio of each bitmap to the BMU. It has three source operands: comp specifies the compression ratio, lvl selects the bitmap level in the hierarchy, and grp selects the group of the BMU.

ISA instruction | Functionality
matinfo row,col,grp | Loads the matrix dimensions into the registers of the BMU.
bmapinfo comp,lvl,grp | Loads the compression ratio comp of the bitmap at level lvl.
rdbmap [mem],buf,grp | Loads the bitmap block starting from address [mem] into SRAM buffer buf.
pbmap grp | Signals the BMU to scan the SRAM buffers and find the row and column indices of the next non-zero block.
rdind rd1,rd2,grp | Loads into rd1 and rd2 the row and column indices of the current non-zero block, stored in the row-index and column-index output registers of the BMU.

Table 1. SMASH instructions

RDBMAP: This instruction loads the bitmap into the bitmap buffers in the BMU. It has three source operands: a [mem] location, a buf selector, and a grp selector. It loads a bitmap block starting from the address pointed to by [mem] into the buffer buf of the group grp.

PBMAP: This instruction signals the hardware logic of the BMU to scan the bitmap buffers and find the index of the next non-zero block. The hardware logic updates the row index and column index output registers with the position of the non-zero block. This instruction has one source operand: grp selects the group of the BMU.

RDIND: This instruction communicates to the CPU the row and column indices of the current non-zero block that was identified by the BMU after the execution of PBMAP. This essentially involves reading the row index and column index output registers of the BMU. RDIND has two destination registers: rd1 and rd2, and a grp selector. RDIND loads the row index and column index into rd1 and rd2, respectively. grp is used to select the group of the BMU.
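The paper specifies this interface at the ISA level only. Purely as an illustration, if the five instructions were exposed to C programs as compiler intrinsics, they might look as follows; the intrinsic names, the types, and the helper function are hypothetical, not part of SMASH.

#include <stdint.h>

/* One hypothetical intrinsic per SMASH instruction (Table 1). Each
   call would compile down to the corresponding instruction. */
void smash_matinfo(uint32_t rows, uint32_t cols, uint32_t grp);
void smash_bmapinfo(uint32_t comp, uint32_t lvl, uint32_t grp);
void smash_rdbmap(const void *mem, uint32_t buf, uint32_t grp);
void smash_pbmap(uint32_t grp);
void smash_rdind(uint32_t *row, uint32_t *col, uint32_t grp);

/* Example: fetch the <row, column> position of the next non-zero
   block of the matrix tracked by BMU group 0. */
static void next_nonzero_block(uint32_t *row, uint32_t *col) {
    smash_pbmap(0);           /* scan for the next non-zero block */
    smash_rdind(row, col, 0); /* read its row and column indices  */
}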

4.4. An Alternative: Software-only SMASH

The hierarchical bitmap encoding of SMASH can be used entirely in software, as a pure software compression mechanism without any hardware support. In this case, a sparse matrix is represented using the hierarchy of bitmaps but the indexing is still performed entirely in software, i.e., the BMU and the ISA are not used to accelerate the indexing. If used entirely in software, one 64-byte block of the bitmap needs to be loaded using four memory load instructions. A count-leading-zeros bitwise instruction (e.g., LZCNT in x86) is needed to find the most-significant set bit. For every set bit that is found, one bitwise AND is needed to mask the set bit and then search for the next one. This adds more computation to find the set bits compared to the mechanism we describe in Section 4.2. In contrast to Software-only SMASH, the BMU loads the whole block at once and does not need AND operations to find set bits. As we show in Section 7.2, Software-only SMASH cannot leverage the full benefits of the hierarchical bitmap encoding and incurs additional computational overhead. Even so, as we also show in Section 7.2, Software-only SMASH still outperforms CSR on average, because it uses fewer instructions overall.
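For one 64-bit word of a bitmap block, the software-only scan described above can be sketched as follows. The function name and callback are ours; we use the GCC/Clang builtin __builtin_clzll as the count-leading-zeros instruction.

#include <stdint.h>

/* Visit every set bit of `word`, most-significant first. `base` is
   the global bit offset of this word within the bitmap; bit
   positions are counted from the MSB. */
void scan_word(uint64_t word, uint64_t base, void (*visit)(uint64_t)) {
    while (word != 0) {
        int lead = __builtin_clzll(word); /* zeros before the MSB set bit */
        visit(base + (uint64_t)lead);     /* report its bit position      */
        word &= ~(1ull << (63 - lead));   /* mask it off; search the next */
    }
}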

5. SMASH Example Use Cases

We describe in detail how the SpMV and SpMM operations are performed using SMASH. We assume a 3-level bitmap hierarchy for SpMV and, for simplicity of explanation, a 1-level bitmap hierarchy for SpMM. Our SpMM example involves two sparse matrices.

5.1. Example Use Case 1: SpMV

Figure 7 and Algorithm 1 describe the execution flow for SpMV using SMASH. SpMV involves only one sparse matrix and a dense vector, x. Therefore, it utilizes only one group of the BMU’s components. MATINFO is used at the beginning of the algorithm to communicate the dimensions of the matrix to the BMU (1) (line 2 in Algorithm 1). BMAPINFO is executed once for each bitmap to communicate the compression ratios (2) (lines 3-5). RDBMAP is executed three times at the beginning to load the bitmap hierarchy into the bitmap buffers (3) (lines 6-8). For each non-zero block, PBMAP searches the SRAM buffers to find the row and column indices of the next non-zero block of the NZA (4) (line 11), and RDIND returns these indices to the application (line 12). The processor loads the block from the NZA and multiplies it with the corresponding block of the x vector at the column index returned by RDIND (5) (lines 15-16).

Figure 7. SpMV flow of execution. One matrix is compressed using a 3-Level Bitmap Hierarchy.
 1  # Operation: A * x = C
 2  matinfo rows,columns,0       # Load dimensions into the BMU
 3  bmapinfo comp2,2,0
 4  bmapinfo comp1,1,0           # Load compression ratios into the BMU
 5  bmapinfo comp0,0,0
 6  rdbmap [bitmap2],2,0
 7  rdbmap [bitmap1],1,0         # Load the 3 bitmaps into the BMU buffers
 8  rdbmap [bitmap0],0,0
 9  ctrNZ = 0                    # Initialize counter of NZ blocks
10  for all non-zero blocks of the sparse matrix
11      pbmap 0                  # Scan the bitmaps
12      rdind rowInd,colInd,0    # Read index of the NZ block
13      ctrElmt = 0              # Initialize counter of elements
14      for all elements of block (rowInd,colInd)
15          NZA_ind = ctrNZ*comp0 + ctrElmt
16          C[rowInd+ctrElmt] += NZA[NZA_ind] * x[colInd+ctrElmt]
17          ctrElmt += 1         # Point to the next element
18      ctrNZ += 1               # Point to the next NZ block

Algorithm 1: SpMV using SMASH

5.2. Example Use Case 2: SpMM

Figure 8 and Algorithm 2 describe the execution flow for SpMM using SMASH. We describe SpMM in the case where a 1-level bitmap hierarchy is used for each of the two sparse matrices. In the initialization phase of the algorithm, we need to execute MATINFO twice, once for each sparse matrix (1) (lines 2-3 in Algorithm 2). We also need to execute BMAPINFO twice, once for each bitmap (2) (lines 4-5). The program iterates over the rows of matrix A and the columns of matrix B. For each row of matrix A, we load the bitmap at the correct offset using RDBMAP as follows: [mem]=bitmapA+rowOffset, where buf=0 and grp=0 (3) (line 7). For each column of matrix B, we load the bitmap at the correct offset into the second group of buffers using RDBMAP as follows: [mem]=bitmapB+colOffset, where buf=0 and grp=1 (line 9). Index matching requires the use of the PBMAP instructions (4) (lines 10-11) to search the bitmaps of both matrices and determine the matching indices of NZA_A’s and NZA_B’s blocks. RDIND instructions (5) (lines 12-13) are executed to load the indices of the non-zero blocks into registers so that the column index of A can be compared to the row index of B (line 14). If the indices match, the algorithm performs the inner product between the corresponding blocks of NZA_A and NZA_B (line 15). If the row index of A is greater than the current row i, or if the column index of B is greater than the current column j (line 16), the algorithm skips the remaining inner-product computation (lines 10-15) between the current row of A and the current column of B. This implies the absence of any more non-zero elements in the current row of A (or column of B), and thus unnecessary computation on zero elements is avoided.

The index matching phase of the algorithm requires calculating the indices of each non-zero element in both A and B for each inner product (lines 10-16). A format like CSR would incur extra computations and expensive indirect memory accesses to perform this step (i.e., the index matching). With SMASH, we leverage the BMU to accelerate the index matching. We demonstrate the benefits of our scheme in Section 7.


5.2.1. Generality of SMASH for other use cases

In this work, we evaluate SMASH using two sparse matrix kernels, SpMV and SpMM, that are central to many important applications (e.g., (van2008graph; liu2015sparse; penn2006efficient)). However, SMASH is generally applicable to any sparse matrix computation. Sparse matrix computations operate only on the non-zero elements of the matrix and hence fundamentally require 1) identifying the indices (row and column) of the non-zero elements and 2) retrieving those non-zero elements from memory. With the proposed ISA instructions, we can efficiently determine the location (in the matrix) of the non-zero blocks of the NZA, regardless of 1) the computation that will be performed on the non-zero elements and 2) the structure and the extent of sparsity in the matrix. Other examples of widely used operations on sparse matrices that SMASH can accelerate include Sparse LU Decomposition (sparseLU), Sparse Eigenvalue Calculation (sparseEIGEN; sparse-EIGEN2), and Sparse Iterative Solvers (saad2003iterative).

Figure 8. SpMM flow of execution. Two matrices are compressed, each using a single-level Bitmap Hierarchy.
 1  # Operation: A * B = C
 2  matinfo rowsA,colsA,0        # Load dimensions into the BMU
 3  matinfo rowsB,colsB,1
 4  bmapinfo comp0_A,0,0         # Load compression ratios into the BMU
 5  bmapinfo comp0_B,0,1
 6  for i in [0 .. rowsA)        # Iterate over rows of A
 7      rdbmap [bitmapA+rowOffset],0,0     # Load bitmap_A
 8      for j in [0 .. colsB)    # Iterate over columns of B
 9          rdbmap [bitmapB+colOffset],0,1 # Load bitmap_B
10          do: pbmap 0          # Find next non-zero block
11              pbmap 1          # in both matrices
12              rdind rowIndA,colIndA,0    # Read NZ index in A
13              rdind rowIndB,colIndB,1    # Read NZ index in B
14              if (colIndA == rowIndB)    # Index matching
15                  C[rowIndA][colIndB] += inner_pr(NZA_A,NZA_B)
16          while (rowIndA < i) && (colIndB < j)

Algorithm 2: SpMM using SMASH

6. Experimental Setup

We model and evaluate SMASH using the zsim simulator  (sanchez2013zsim). Table 2 provides the system configuration we evaluate. We simulate each workload until completion. We use an Intel Xeon system (intelxeon) to perform experiments on real hardware. Table 5 provides the configuration of the real system.

CPU: 3.6 GHz, Westmere-like (westmere) OOO, 4-wide issue; 128-entry ROB; 32-entry LQ and SQ
L1 Data + Inst. Cache: 32 KB, 8-way, 2-cycle; 64 B line; LRU policy; MSHR size: 10; stride prefetcher
L2 Cache: 256 KB, 8-way, 8-cycle; 64 B line; LRU policy; MSHR size: 20; stride prefetcher
L3 Cache: 1 MB, 16-way, 20-cycle; 64 B line; LRU policy; MSHR size: 64; stride prefetcher
DRAM: 1 channel; 16 banks; open-row policy; 4 GB DDR4
Table 2. Simulated system configuration
Name # Rows Non-Zero Elements Sparsity (%)
M1: descriptor_xingo6u 20,738 73,916 0.01
M2: g7jac060sc 17,730 183,325 0.06
M3: Trefethen_20000 20,000 554,466 0.14
M4: IG5-16 18,846 588,326 0.17
M5: TSOPF_RS_b162_c3 15,374 610,299 0.26
M6: ns3Da 20,414 1,679,599 0.40
M7: tsyl201 20,685 2,454,957 0.57
M8: pkustk07 16,860 2,418,804 0.85
M9: ramage02 16,830 2,866,352 1.01
M10: pattern1 19,242 9,323,432 2.52
M11: gupta3 16,783 9,323,427 3.31
M12: nd3k 9,000 3,279,690 4.05
M13: human_gene1 22,283 24,669,643 4.97
M14: exdata_1 6,001 2,269,500 6.30
M15: human_gene2 14,340 18,068,388 8.79
Table 3. Evaluated sparse matrices
Graph Vertices Edges
G1: com-Youtube 1.1M 2.9M
G2: com-DBLP 317K 1M
G3: roadNet-CA 1.9M 2.7M
G4: amazon0601 403K 3.3M
Table 4. Input graphs

CPU: Intel Xeon Gold 5118, 2.30 GHz, 14nm (intelxeon)
L1: 384 KB, 8-way
L2: 12 MB, 16-way
L3: 16.5 MB, 11-way
Main memory: DDR4-2400
Table 5. Real system configuration

Workloads: Sparse Matrix Kernels. We evaluate SMASH using two sparse matrix kernels, Sparse Matrix Vector Multiplication and Sparse Matrix Matrix Multiplication. We use the TACO library’s (tacolib) respective implementations of these kernels as our baseline. For input datasets, we use a diverse set of 15 sparse matrices from the SuiteSparse Matrix Collection (florida). The matrices have different sparsities and distributions of non-zero elements. The term sparsity refers to the fraction of non-zero elements in the matrix over the total number of elements. Table 3 presents these matrices, sorted in ascending order of their sparsity. We use the terms M1-M15 to refer to these matrices in Section 7. We open source SMASH’s software implementations of these sparse matrix kernels (smash).

Workloads: Graph Processing. We implement PageRank and Betweenness Centrality from the Ligra Benchmark Suite  (pushpull) as SpMV computation and evaluate the performance of SMASH over the default CSR-based versions of these two algorithms. PageRank (pagerank1999) was first used by Google to rank website pages. Specifically, it takes as input a graph and computes the rank of each vertex, which represents the relative importance of each node (e.g., webpage). PageRank iteratively uses SpMV to calculate the ranks of nodes in the graph (prspmv2010).
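To make the SpMV connection concrete, the sketch below shows one PageRank iteration in the standard power-iteration form over a CSR matrix. This is a generic textbook formulation, not Ligra's implementation; all names are illustrative and the adjacency matrix is assumed to be column-normalized.

/* One PageRank iteration as SpMV: next = (1-d)/n + d * (A * rank). */
void pagerank_step(int n, const int *row_ptr, const int *col_ind,
                   const double *values, const double *rank,
                   double *next, double damping) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        /* SpMV inner loop over the non-zeros of row i. */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += values[j] * rank[col_ind[j]];
        next[i] = (1.0 - damping) / n + damping * sum;
    }
}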

Betweenness Centrality (bc1977) is a measure of the significance of each vertex based on the number of shortest paths that pass through it. Betweenness Centrality iteratively uses SpMV to perform breadth-first searches in the graph (pushpull). We evaluate these two workloads using a set of four graph inputs from the SuiteSparse Matrix Collection (florida), shown in Table 4. We use the terms G1-G4 to refer to these graph inputs in Section 7.

7. Evaluation Results

We evaluate five different mechanisms: (i) TACO-CSR: the CSR-based implementation from the TACO library (tacolib); (ii) TACO-BCSR: the BCSR-based implementation from the TACO library (tacolib); (iii) MKL-CSR: CSR-based SpMV and SpMM implementations from Intel MKL (intel-mkl); (iv) Software-only SMASH: SMASH’s hierarchical bitmap encoding implemented purely in software without the BMU; and (v) SMASH: our complete proposed scheme, with the hierarchical bitmap encoding and the BMU. The matrices we evaluate vary in sparsity; hence, for each matrix, we use different compression ratios in the bitmap hierarchy of SMASH. We denote the bitmap configuration of each matrix (and graph) by its matrix id and the compression ratios of each level in the bitmap hierarchy. We evaluate four different use cases: SpMV, SpMM (Section 7.2), PageRank, and Betweenness Centrality (Section 7.3).

7.1. Software-only Approaches

We first compare the performance of three state-of-the-art sparse matrix formats (TACO-CSR (tacolib), TACO-BCSR (tacolib), and MKL (intel-mkl)) and the hierarchical bitmap encoding used in SMASH (i.e., Software-only SMASH) on our real Intel Xeon system (Table 5). Our goals with this experiment are to 1) compare existing state-of-the-art software solutions to identify a baseline for our simulation experiments and 2) evaluate the performance of SMASH’s hierarchical bitmap encoding without any hardware support. The TACO and MKL formats employ a range of software optimizations that are orthogonal to the matrix format itself and the indexing mechanism. To ensure a fair comparison, our implementation of Software-only SMASH includes all the software optimizations used by the TACO compiler, but uses SMASH’s hierarchical bitmap encoding instead of CSR.

Figure 9 depicts the speedup of TACO-BCSR, MKL, and Software-only SMASH, normalized to TACO-CSR, for SpMV and SpMM. We make two observations based on results averaged across the 15 matrices we evaluate. First, MKL outperforms TACO-CSR in both SpMV and SpMM, by 15% and 25%, respectively. MKL also outperforms TACO-BCSR, but by a smaller margin: 3% in SpMV and 4% in SpMM. While MKL uses the same CSR format as the TACO compiler, it also employs a range of proprietary software optimizations that lead to better performance than the TACO implementations. We compare only to the TACO implementations in the simulation experiments, as the additional optimizations in the closed-source MKL library cannot be added to SMASH (or to any other technique) to enable a fair and insightful comparison.

Figure 9. Performance of software-only approaches on a real Intel Xeon system (normalized to TACO-CSR).

Second, we observe that Software-only SMASH outperforms TACO-CSR by 5% in SpMV and 10% in SpMM, but is outperformed by both TACO-BCSR and MKL. Software-only SMASH incurs a performance overhead because 1) indexing the hierarchical bitmap entirely in software requires more instructions than indexing the CSR format and 2) unlike CSR, which eliminates all zeros, SMASH’s hierarchical bitmap encoding may require unnecessary computations on zero values, depending on the choice of compression ratios in the bitmap hierarchy. However, despite these overheads, the hierarchical bitmap encoding avoids the expensive indexing and pointer-chasing operations in CSR and outperforms TACO-CSR.

7.2. Sparse Matrix Kernels

7.2.1. Performance Results

Figure 10 shows the speedup of TACO-CSR, TACO-BCSR, Software-only SMASH, and SMASH, normalized to TACO-CSR, for the SpMV kernel. Figure 11 shows the number of executed instructions in each mechanism, normalized to TACO-CSR. We make three observations.

Figure 10. Speedup (normalized to TACO-CSR) for SpMV.
Figure 11. Number of executed instructions (normalized to TACO-CSR) for SpMV.

First, SMASH significantly outperforms all other mechanisms: it is 38% faster than TACO-CSR and 32% faster than TACO-BCSR, on average. SMASH’s speedup is mainly due to executing fewer indexing instructions (47% fewer than TACO-CSR and 30% fewer than TACO-BCSR, on average) and avoiding expensive pointer-chasing operations in memory. Second, SMASH is highly effective regardless of the sparsity of the matrix. For the matrices with the highest sparsity in our evaluation, TACO-BCSR is inefficient because it encodes data in blocks, which leads to unnecessary computation on zero elements. SMASH avoids such overhead by leveraging the configurability of our hierarchical bitmap encoding to adapt the compression ratio to the matrix sparsity. Third, we observe that the effectiveness of Software-only SMASH depends on the sparsity of the matrix. Software-only SMASH incurs a higher performance overhead when the sparsity is higher because it performs more unnecessary computation on zero elements and executes more instructions searching the bitmaps for non-zero bits. In these cases, TACO-CSR outperforms Software-only SMASH. When the matrix is denser, the benefits of avoiding pointer-chasing and indirect indexing outweigh the additional computation and search operations on zero elements. Thus, Software-only SMASH outperforms both TACO-CSR and TACO-BCSR for less sparse matrices. This strong impact of sparsity on overall performance is not seen as strongly in SMASH because the hardware support makes the indexing highly efficient even when the matrix is extremely sparse. As a result, SMASH outperforms all other approaches regardless of the matrix sparsity.

Figures 12 and 13 depict the speedup and the number of executed instructions, respectively, for the SpMM kernel.

Figure 12. Speedup (normalized to TACO-CSR) for SpMM.
Figure 13. Number of executed instructions (normalized to TACO-CSR) for SpMM.

We observe the same trends and make the same observations for SpMM as we do for SpMV, but with an important difference. SpMM requires twice the number of indexing operations as SpMV per dot product. As a result, the performance benefits of SMASH are even higher over TACO-CSR for SpMM than for SpMV: SMASH outperforms TACO-CSR by 44% and TACO-BCSR by 30%, on average.

We conclude that SMASH is a highly effective mechanism to accelerate sparse matrix computation, regardless of the sparsity of the matrix, and it significantly outperforms state-of-the-art compression schemes for all matrices we evaluate.

7.2.2. Sensitivity to Compression Ratio

As discussed in Section 4.1.1, the compression ratio between Bitmap-0 and the NZA defines an important tradeoff between 1) smaller bitmaps and 2) more zero elements in the NZA that lead to unnecessary computation on zero elements. In Figures 14 and 15, we quantitatively evaluate this tradeoff for SpMV and SpMM respectively, by showing the speedups SMASH provides with different compression ratios. The speedups are normalized to a configuration that uses a 2:1 compression ratio between Bitmap-0 and the NZA (i.e., each bit in Bitmap-0 encodes two elements in the NZA).

We make two observations. First, increasing the compression ratio from 2:1 to 8:1 degrades performance by 4% on average (up to 13%) for SpMV and by 5% on average (up to 15%) for SpMM. This is a direct result of more unnecessary computation on zero elements in the NZA, which cannot be skipped at the higher compression ratio. (Note that we assume there is enough memory to store matrices with any bitmap compression ratio.) Second, matrices with higher density can in some cases benefit from a higher compression ratio: performance increases by 18% and 40% for two of the denser matrices in our evaluation when going from a compression ratio of 2:1 to 8:1. These matrices exhibit a more clustered distribution of non-zero values (i.e., the non-zero elements are close to each other). As a result, even though the compression ratio is higher, the number of zeros in the NZA (and hence the unnecessary computation on zero elements) does not increase in proportion. Instead, SMASH benefits from scanning smaller bitmaps during indexing at the higher compression ratio.

Figure 14. Sensitivity of SMASH speedup to the compression ratio between Bitmap-0 and the NZA for SpMV.
Figure 15. Sensitivity of SMASH speedup to the compression ratio between Bitmap-0 and the NZA for SpMM.

We conclude that the compression ratio between Bitmap-0 and the NZA plays an important role in the effectiveness of SMASH. Our evaluations indicate that a compression ratio of 2:1 is most effective on average and should be used when the structure of sparsity is unknown. However, a matrix with a known clustered distribution of non-zero elements may benefit significantly from a higher compression ratio.
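In practice, this tradeoff can be estimated before choosing a ratio. The hypothetical helper below is a minimal sketch under an assumed model (the name and model are our own: the matrix is linearized row-major and each Bitmap-0 bit covers `ratio` consecutive elements); it counts the NZA elements, including padded zeros, that a given compression ratio would produce.

```c
/* Estimate the NZA size (elements stored, including padded zeros) for
 * a given Bitmap-0 compression ratio. Hypothetical model: the matrix
 * is linearized row-major, each Bitmap-0 bit covers `ratio` consecutive
 * elements, and every covered block with at least one non-zero is kept.
 * nz_linear_pos must be sorted in ascending order. */
long nza_elements(const long *nz_linear_pos, long nnz, long ratio)
{
    long blocks = 0, prev_block = -1;
    for (long k = 0; k < nnz; k++) {
        long b = nz_linear_pos[k] / ratio;
        if (b != prev_block) { blocks++; prev_block = b; }
    }
    return blocks * ratio;   /* padded zeros = return value - nnz */
}
```

Comparing the counts at 2:1 and 8:1 approximates the extra computation on zero elements that a higher ratio introduces for a given matrix.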

7.2.3. Locality of Sparsity

In Figures 16 and 17, we illustrate the impact of the distribution of non-zero elements in a sparse matrix on the effectiveness of SMASH. We define a new metric, locality of sparsity, which is the ratio of the average number of non-zero elements per NZA block to the size of each NZA block (expressed as a percentage). A matrix with 100% locality of sparsity has no zero elements in any NZA block (i.e., all the non-zero elements are clustered at the granularity of the NZA block size). A matrix with 12.5% locality of sparsity, on the other hand, has on average one non-zero element per NZA block, assuming each NZA block holds 8 elements.
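Under the block model sketched in Section 7.2.2 (again our assumption, not the paper's exact implementation), the metric reduces to a one-line computation:

```c
/* Locality of sparsity (%): average non-zeros per NZA block divided by
 * the block size, which reduces to nnz / (nza_blocks * block_size).
 * 100% means every NZA block is full; with 8-element blocks, 12.5%
 * means an average of one non-zero per block. */
double locality_of_sparsity(long nnz, long nza_blocks, long block_size)
{
    return 100.0 * (double)nnz / ((double)nza_blocks * (double)block_size);
}
```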

Figures 16 and 17 compare the speedup of SMASH as the locality of sparsity is varied for three matrices chosen to have widely different sparsities (0.06%, 0.85%, and 4.97%). The results are normalized to the performance of SMASH at a locality of sparsity of 12.5%.

Figure 16. Sensitivity to locality of sparsity in SpMV.
Figure 17. Sensitivity to locality of sparsity in SpMM.

We make two observations. First, the speedup of SMASH increases with the locality of sparsity (by up to 25% for SpMV when going from 12.5% to 100% locality of sparsity). This is because the NZA blocks contain fewer zeros when locality is higher, which leads to fewer unnecessary computations on zero elements and faster scans of the bitmaps during indexing. Second, the performance impact of locality of sparsity depends on the number of non-zero elements in the matrix, i.e., its sparsity. The benefits of locality diminish when the matrix is more sparse. This is because indexing dominates the overall computation time when the fraction of non-zero elements is very low, and very little time is spent computing on the non-zero values. As a result, reducing the amount of unnecessary computation on zero elements in the NZA blocks does not provide a significant performance benefit.

7.3. Graph Applications

Figure 18 compares the default CSR-based and the SMASH-based implementations of the PageRank and Betweenness Centrality applications from the Ligra benchmark suite (ligra) in terms of performance and the number of executed instructions.

Figure 18. Speedup and normalized number of executed instructions for PageRank and Betweenness Centrality using SMASH (normalized to the CSR-based implementations).

We observe that the SMASH-based implementations outperform the CSR-based implementations by 27% for PageRank and 31% for Betweenness Centrality. Similar to SpMV and SpMM, SMASH's speedups in graph workloads come from executing fewer instructions to index the sparse matrices and avoiding expensive pointer-chasing operations in memory. However, SMASH's benefits are lower in graph workloads than in the SpMV and SpMM kernels because sparse matrix indexing forms a smaller component of the overall execution time in PageRank and Betweenness Centrality. We conclude that SMASH is effective in graph applications.

7.4. Storage Efficiency

A key goal of a sparse matrix compression scheme is to store the matrix efficiently, avoiding zero elements and minimizing the amount of required metadata. In Figure 19, we compare the storage efficiency of SMASH and CSR by measuring the ratio of the size of the original uncompressed matrix to the total size of all the data structures required to encode the matrix with SMASH or the CSR format (which we call the total compression ratio).

Figure 19. Total compression ratios of SMASH and CSR. Y-axis is in log scale.

CSR stores only the non-zero elements and uses one integer per non-zero element to record its position in the matrix, regardless of the sparsity or the locality of sparsity of the matrix. SMASH, on the other hand, encodes each NZA block using a single bit (in Figure 19, we assume each block holds two elements). If the non-zero elements are not contiguously located, the NZA may hold zero elements in its blocks.

As a result, when the matrices are highly sparse, CSR only stores the few non-zero elements and their positions in the matrix. In contrast, SMASH may store a large number of zero elements in the NZA and many zero bits in the bitmap hierarchy to encode the positions of the few non-zero elements. Hence, as we observe in Figure 19, CSR has a higher total compression ratio than SMASH for the most highly sparse matrices in our evaluation.

However, as the matrices become denser, SMASH provides either a compression ratio similar to CSR's or a much higher one. The reason is twofold. First, at higher matrix densities, the non-zero elements are more likely to be located close to each other (i.e., the locality of sparsity is higher), so SMASH stores fewer unnecessary zero elements in the NZA. Second, at higher matrix densities, the cost of storing one additional integer index per non-zero element in the CSR format exceeds the cost of encoding the zero and non-zero regions of the sparse matrix using bitmaps. Hence, in Figure 19, we observe that SMASH generally has a significantly higher (up to 2.48x) total compression ratio than CSR for matrices with higher densities. For some high-density matrices, SMASH does not provide a much higher compression ratio than CSR; these matrices have low locality of sparsity, which causes SMASH to store more zero elements in the NZA and more zero bits in the bitmap hierarchy.
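The two cost models can be written down directly. The sketch below estimates each format's footprint under assumed sizes (8-byte values, 4-byte CSR indices) and simplifies the bitmap hierarchy to a single dense Bitmap-0 level; the real SMASH hierarchy adds further, much smaller bitmaps on top.

```c
/* Rough storage-footprint comparison in bytes (our simplified model:
 * 8-byte values, 4-byte CSR indices, single dense bitmap level). */
long csr_bytes(long m, long nnz)
{
    return nnz * 8        /* non-zero values */
         + nnz * 4        /* column indices: one integer per non-zero */
         + (m + 1) * 4;   /* row pointers */
}

long smash_bytes_approx(long rows, long cols, long nza_elems, long ratio)
{
    long bitmap0_bits = (rows * cols) / ratio;  /* one bit per block */
    return nza_elems * 8 + bitmap0_bits / 8;
}
```

In this model, CSR pays a fixed 4 bytes of metadata per non-zero, while SMASH's metadata scales with the matrix dimensions and its NZA cost scales with the locality of sparsity, matching the trends in Figure 19.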

We conclude that SMASH provides high compression ratios when encoding sparse matrices, enabling highly-efficient storage of sparse matrices in memory. The hierarchical bitmap structure is highly effective in exploiting any locality of sparsity in the matrix to provide even higher compression ratios.

7.5. Format Conversion Overhead

In cases where the sparse matrix is already stored in another format (such as CSR), it first needs to be converted to the hierarchical bitmap encoding to leverage the indexing benefits of SMASH. Figure 20 depicts the total time spent on such conversion operations relative to the computation itself for SpMV, SpMM, and PageRank, assuming that the sparse matrix must be stored in the CSR format but processed using SMASH. In SpMM and PageRank, the total time spent on conversion from CSR to SMASH and vice versa is only a small fraction of the overall computation (10% in SpMM and 0.5% in PageRank), and hence imposes only a small overhead. SpMV, on the other hand, is a short-running kernel, and the conversion time therefore dominates its execution (55% of the overall execution time).

Figure 20. Breakdown of end-to-end execution time, assuming the matrix has to be stored in CSR format but processed using SMASH.

We conclude that the conversion overhead is negligible compared to the demonstrated benefits of SMASH for long-running applications and for cases where SpMV and SpMM are invoked multiple times (e.g., in neural networks and graph applications). However, for short-running kernels that are invoked only once, the benefits of SMASH may not justify the overhead of converting from another format, assuming that other format is necessary for storage.
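For reference, a minimal single-level conversion sketch under the same assumed model as in Section 7.2.2 (row-major linearization, one bit per NZA block; the actual SMASH conversion builds the full bitmap hierarchy):

```c
#include <string.h>

/* Single-level CSR -> (bitmap, NZA) conversion sketch (our
 * simplification; SMASH builds a multi-level bitmap hierarchy).
 * Each bit of `bitmap` covers `ratio` consecutive row-major elements;
 * set blocks are appended, in order, to the NZA. The caller provides
 * `bitmap` zeroed ((rows*cols/ratio + 7)/8 bytes) and `nza` sized for
 * the worst case (nnz * ratio doubles). Column indices within each CSR
 * row are assumed sorted. Returns the number of NZA elements written. */
long csr_to_bitmap(long rows, long cols, long ratio,
                   const int *row_ptr, const int *col_idx,
                   const double *vals,
                   unsigned char *bitmap, double *nza)
{
    long nza_len = 0, prev_block = -1;
    for (long i = 0; i < rows; i++) {
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            long pos   = i * cols + col_idx[j];
            long block = pos / ratio;
            if (block != prev_block) {              /* open a new block */
                bitmap[block / 8] |= 1u << (block % 8);
                memset(&nza[nza_len], 0, ratio * sizeof(double));
                nza_len += ratio;
                prev_block = block;
            }
            nza[nza_len - ratio + (pos % ratio)] = vals[j];
        }
    }
    return nza_len;
}
```

The conversion is a single streaming pass over the CSR arrays, which is why its cost is amortized quickly once the kernel is invoked more than a few times.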

7.6. SMASH Area Overhead

The major area overhead in SMASH comes from the buffers and registers in the BMU. Assuming a BMU with 4 groups of 3 bitmap buffers, where each buffer is 256 bytes, the BMU requires 3 KB of additional SRAM for the bitmap buffers and 140 bytes for the registers. We evaluate the area overhead of the BMU using CACTI 6.5 (CACTI). Compared to a modern Intel Xeon E5-2698 CPU core with a 32 KiB L1, 256 KiB L2, and 2.5 MiB L3 cache, the BMU incurs an area overhead of at most 0.076%.

8. Related Work

To our knowledge, this is the first work to propose a hardware-software cooperative compression scheme and indexing mechanism that 1) enables highly-compressed storage of sparse matrices, 2) provides highly-efficient hardware-supported indexing of sparse matrices to accelerate sparse matrix kernels, and 3) is general and applicable to any sparse matrix operation and matrices with any sparsity and structure.

Sparse matrix operations have been extensively studied in the past. Prior work in this area falls into three categories: 1) sparse matrix compression formats; 2) software optimizations for sparse matrix kernels; and 3) hardware accelerators and hardware-software cooperative mechanisms for sparse matrix kernels. In this section, we briefly discuss these prior works and contrast them with SMASH.

Sparse Matrix Compression Formats. Prior works propose a range of compression formats for sparse matrices (datacompression2006; im1999optimizing; formatSurvey2016; csr52015; variableblock2005; blockoptim1999; Sengupta2007scan; bccoo2014), including state-of-the-art formats that are widely used today, such as CSR (csr52015), CSR5 (csr52015), and BCSR (im1999optimizing). These formats are highly storage-efficient and can be applied to any sparse matrix. However, as we demonstrate in this paper, they incur significant performance overheads as a result of inefficient indexing. We quantitatively compare SMASH to CSR (formatSurvey2016) and BCSR (formatSurvey2016) in Section 7 and demonstrate that SMASH's hierarchical bitmap compression, along with its hardware support for indexing, significantly outperforms these formats by enabling highly-efficient storage and indexing of sparse matrices.

Several sparse matrix compression formats (uniquekourtis2008; csxkourtis2011; smatautotune2013; csb2009; pattern2009; cocktail2012; csradaptgpus2014) leverage specific characteristics of the sparse matrix to enable more efficient storage and indexing. For example, CSX (csxkourtis2011) first processes each matrix to detect patterns such as dense diagonals or repeated values, and then encodes them using compression formats tailored to the detected pattern. CSR-DU (uniquekourtis2008) and CSR-VI (uniquekourtis2008) leverage repetition or similarity among non-zero values to compress the sparse matrix more efficiently. These works have three major shortcomings. First, they are applicable only to matrices that possess specific patterns of sparsity. SMASH, in contrast, can be used to compress any sparse matrix and accelerate any operation on it. Second, these approaches require expensive preprocessing of matrices to identify patterns in sparsity. Third, similar to the general compression formats, they incur significant performance overheads due to inefficient indexing.

Software Optimizations for Sparse Matrix Kernels. There is a large body of prior work on software optimizations that accelerate sparse matrix computation by making sparse matrix kernels more memory- or cache-efficient (caching1998; AkbudakA2017exploiting; Nishtala2007when; rbandwidth2011) and more amenable to parallelization (kng2013; Buluc2008challenges; intel-mkl; Merrill2016). Prior work also includes general software frameworks, such as MKL (intel-mkl) and TACO (tacolib), that leverage these optimizations to produce more efficient code for CPUs (intel-mkl; tacolib; mergespmv2016; rbandwidth2011; henon2002pastix; bulucc2012parallel; manycoreathena2017; white1997improving; toledo1997improving; openblas1; williams2007optimization; Mellor2004optimizing; Im2004sparsity; Nishtala2007when; sparsex2018), GPUs (gpuframe2014; prspmv2010; spmmv2018; bccoo2014; gpusm2015; solversGPU; gpuopt2015), or heterogeneous CPU-GPU systems (gpucpu2015; gpufram22012). SMASH is orthogonal to these software optimizations and can be seamlessly integrated into software frameworks that employ them to further improve the performance and energy efficiency of sparse matrix computation.

Hardware Accelerators and Hardware-Software Cooperative Mechanisms for Sparse Matrix Kernels. Prior works propose a range of hardware accelerators (hardanalytics2016; outerspace2018; camspm2017; Mishra2017fine; Nurvitadhi2015asparse; cambricon1; cambricon2) and FPGA designs (Yu2013design; spmvfpgacolumnmajor2014; spmvfpga2015; bwfpga2014) for sparse matrix computation. Several of these works also leverage emerging memory technologies, such as memristors (memristor2016) and 3D-stacked memories (Zhu2013accelerating), to accelerate sparse matrix kernels. These prior approaches either 1) are applicable only to certain applications, such as SpMV (spmvfpga2015) or sparse neural networks (cambricon1; cambricon2), or 2) assume hardware dedicated to a specific type of sparse matrix kernel. SMASH, in contrast, applies generally across a diverse set of sparse matrix operations and does not require hardware dedicated to any particular matrix or kernel type. The hierarchy of bitmaps in SMASH can be used purely in software, without hardware support, in any system, and the hardware unit proposed in SMASH can be added with low overhead to general-purpose processors such as CPUs and GPUs.

Prior works propose a range of hardware-software cooperative mechanisms (tesseract2015; pei2015; hotpads; tsai2; hats2018; xmem; localgpu; pageover2015; tetris2017; graphP; graphR; googleworkloads2018; impica2016; graphq2019) that accelerate memory-bound operations and can be applied to sparse matrix computations. These approaches are designed to be generally applicable to a very wide range of applications. SMASH, in contrast, specifically addresses the challenges of designing a hardware-software cooperative mechanism for sparse matrix computation. Hence, SMASH is largely orthogonal to these mechanisms and can be integrated with them to further accelerate sparse matrix computation.

9. Conclusion

We introduce SMASH, a general hardware-software cooperative mechanism that accelerates sparse matrix operations and enables highly-efficient indexing and storage of sparse matrices in memory. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit data sparsity. To this end, we develop a flexible and efficient sparse matrix encoding, based on a hierarchy of bitmaps, that is recognized by both hardware and software. This encoding enables efficient compression of any sparse matrix, regardless of the structure of its sparsity. We develop a hardware mechanism that can directly interpret sparse matrices encoded with hierarchical bitmaps and accelerate computation on those matrices. The bitmap representation, along with the hardware support, greatly reduces the performance overheads of the expensive indexing operations that make state-of-the-art sparse matrix formats inefficient in computation. The expressive SMASH ISA provides programmability of the hardware support and generality across a wide variety of sparse matrix kernels. Our evaluation over a diverse set of 15 matrices and four graphs demonstrates that SMASH significantly improves the performance of SpMV, SpMM, and two graph algorithms (PageRank and Betweenness Centrality), compared to the state-of-the-art CSR format. We believe that the new ideas introduced in SMASH are applicable beyond CPUs and can be a good fit for GPUs, hardware accelerators, and processing in/near memory engines.

Acknowledgments

We thank the anonymous reviewers for their feedback. We thank the SAFARI Research Group members for feedback and the stimulating intellectual environment they provide. We acknowledge the generous gifts provided by our industrial partners: Alibaba, Facebook, Google, Huawei, Intel, Microsoft, and VMware. This research was supported in part by the Semiconductor Research Corporation. Christina Giannoula is funded for her postgraduate studies by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI).

References