Differentiating-based Vectorization for Sparse Kernels

11/24/2021 ∙ by Zachary Cetinic, et al. ∙ UNIVERSITY OF TORONTO 0

Sparse computations frequently appear in scientific simulations and the performance of these simulations rely heavily on the optimization of the sparse codes. The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize regular regions of computations in the sparse code. They also reorganize data and computations, at a cost, to increase the number of regular regions. In this work, we propose a novel polyhedral model, called the partially strided codelets (PSC), that enables the vectorization of computation regions with irregular data access patterns. PSCs also improve data locality in sparse computation. Our DDF inspector-executor framework efficiently mines the memory accesses in the sparse computation, using an access function differentiation approach, to find PSC codelets. It generates vectorized code for the sparse matrix multiplication kernel (SpMV), a kernel with parallel outer loops, and for kernels with carried dependence, specifically the sparse triangular solver (SpTRSV). We demonstrate the performance of the DDF-generated code on a set of 60 large and small matrices (0.05-130M nonzeros). DDF outperforms the highly specialized library MKL with an average speedup of 1.93 and 4.5X for SpMV and SpTRSV, respectively. For the same matrices, DDF outperforms the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel by up to 11X and the work by Augustine et. al [2] for the SpMV kernel by up to 12X.



There are no comments yet.


page 1

page 8

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Sparse matrix computations are important kernels used in a large class of scientific and machine learning applications. The performance of sparse kernels is noticeably improved if the code is

vectorized to exploit single instruction multiple data (SIMD) capabilities of the underlying architecture. Vectorization potentially increases opportunities to optimize for locality, further increasing the performance of the sparse code. SIMD instructions can efficiently vectorize groups of operations that access consecutive data, i.e. have a regular access pattern. A computation is regular when its operations access memory addresses with a constant distance from each other, i.e. strided access, and otherwise irregular. However, because a sparse matrix is typically stored in a compact representation [bulucc2009parallel], the sparse matrix code accesses are often irregular, making vectorization challenging. Not vectorizing irregular segments in the sparse matrix calculation can potentially reduce locality and the performance of the sparse matrix kernel.

The most common approach to improve vectorization of irregular regions in sparse matrix codes is to reorganize data or computations so the region becomes regular and amenable to vectorization. These work typically optimize sparse kernels with no loop-carried dependencies, for example [intel-alt, Kreutzer2013, Liu2013, Liu2015, Tang2015, Chen2016, Xie2018] improve the performance of sparse matrix-vector multiplication (SpMV) with reorganization. CSR5 [liu2015csr5] and CVR [xie2018cvr] reorganize data to create contiguous accesses and reorganize computations to increase the number of independent operations for vectorization. However, their computation reorganization typically increases the number of floating-point operations in the kernel, reducing the benefits of vectorization. Sparse storage formats such as ELL and DIA [saad2003iterative]

are used to pad the sparse matrix with additional nonzeros. Padding increases regular regions in the matrix and hence increases the number of consecutive vector loads. In sparse matrices with very irregular patterns, the amount of padding will increase, leading to more floating-point operations. Reorganization approaches are often library-based and thus are specialized for a specific set of matrix patterns or the access patterns appearing in a specific matrix. They also need to be manually ported to new architectures. For example, work in

[Yesil2020] primarily optimizes matrices from power-law graph computations, CVR is for sparse graphs, and ELL is most efficient for block-structured matrices such as factors of a direct solver [vazquez2010improving].

Compilers, such as those based on the polyhedral model, generate vectorizable and portable code for affine codes such as dense matrix computations [Boulet1998, Chen2012, Kershaw1978, Puschel2005, Spampinato2014, Tiwari2009]. However, compilers are limited in vectorizing sparse codes because of the existing indirection in the memory access patterns of sparse computations, caused by data compaction. This indirection introduces additional challenges to compilers when optimizing sparse kernels with loop carried dependencies, e.g. the sparse triangular solver. Recently the polyhedral method was extended to optimize sparse matrix computations via inspector-executor approaches [strout2012set, Venkat2015, strout2018sparse]. The inspectors execute at runtime to resolve indirection and find data dependencies between operations and this information is used to transform the original source code into optimized code called the executor.

Fig. 1: Various generated codes for the given sparsity pattern (shown in the matrix) used to compute the Sparse Matrix-Vector multiplication (SpMV) kernel. a) Shows the code that computes SpMV using the compressed sparse row (CSR) storage format. b) Shows the DDF generated code. c) Shows the code generated from the Sympiler framework [Cheshmi2017]. d) Shows the code generated using the work by Augustine et al.[Augustine2019]

Amongst inspector-executor frameworks, Augustine et al. [Augustine2019], which we refer to as the PIC framework, and Sympiler  [Cheshmi2017] inspect the sparsity pattern of input matrices to generate vectorized code for sparse matrix computations. The work in [Augustine2019] proposes an inspector to mine the memory trace of the SpMV kernel to find regular regions in the computation for vectorization. The remaining regions are unrolled. Unrolling irregular regions limits the scalability of this work because of increasing code size for large matrices and for matrices with high irregularity; their framework can primarily optimize small sparse matrices (with less than 0.5 million nonzeros). Also, they do not support sparse kernels with loop-carried dependencies. Sympiler uses an inspector to resolve data dependencies in sparse kernels such as SPTRV. Its inspector finds iterations that operate on rows with similar patterns, i.e. regular regions, and applies code transformations that primarily optimize for tiling and vectorization. Sympiler adds additional nonzeros to some of the rows with non-consecutive memory accesses and thus pads the input matrix to form row-blocks. Row-blocks are regular regions that can be easily vectorized. To limit the overheads from padding, Sympiler only creates row-blocks when profitable, thus limiting the size of row blocks it creates. It is thus not able to efficiently optimize sparse matrices with few rows of similar sparsity patterns.

The standard approach of prior work in vectorizing sparse kernels is to either find regions with regular computations or create more regular regions in the computations with for example data reorganization. The focus of this work is to expose vectorization opportunities in computation regions with irregular computation patterns. We present a novel polyhedral model which we call the partially strided codelet that increases opportunities for vectorization of sparse matrix codes and improves data locality. PSCs represent computation regions that have non-consecutive, i.e. unstrided, memory accesses but can still benefit from vectorization because a subset of their accesses is strided. We also present a novel inspector-executor framework called DDF (Differentiating Data access Functions) that efficiently inspects the memory access patterns in sparse matrix codes to mine for PSCs using an access function differentiation approach and to generate optimized vectorized code for the matrix computation. DDF’s inspector supports sparse kernels with and without loop-carried dependencies as it is able to look for data dependencies in the computation region.

The contributions of the work are:

  • A novel polyhedral model, called the partially strided codelet, that enables the vectorization of computation regions with non-consecutive memory accesses in sparse codes while improving temporal and spatial locality.

  • A memory access differentiation approach, that uses the first order partial difference (FOPD) of access functions to distinguish strided access from non-strided accesses in the sparse matrix computation. FOPD’s are used to mine for PSCs in spare matrix codes.

  • The DDF inspector-executor framework that mines the data accesses of sparse codes with or without dependencies to find PSCs. The inspector executes in parallel with low overhead and generates a compact code that can execute in parallel.

  • We demonstrate the performance of DDF for SpTRSV and SpMV kernels. The speedup of the DDF-specialized code for SpTRSV is on average 1.79 and 4.5 faster than that of Sympiler and MKL respectively. For SpMV, DDF-generated code is on average 1.42 and 1.93 faster than that of CSR5 and MKL respectively. For small problems that the work from [Augustine2019] supports, DDF is 10.57 faster than their approach.

Ii Motivation

In this section, we use the matrix in Figure 1 to explain the approach used in inspector-executor frameworks, i.e. Sympiler and the work in  [Augustine2019] (which we call PIC), to optimize different computation regions in the SpMV kernel and compare to the approach proposed in this work. The example uses the compressed sparse row (CSR) format to store matrix and then computes where is a vector. The region with the green border is a regular computation region in the SpMV computation and the red and orange colors show regions with non-strided accesses.

All tools can efficiently vectorize the green region since all of its accesses are consecutive and thus strided. BLAS routines are typically used to optimize such regions. They enable vectorization, and in part, also improve locality by reusing the entries in the input vector (e.g. x[6], x[7]) and the output vector (e.g. y[3], y[4]).

For the non-strided segments, Sympiler uses a padding strategy, when profitable, to create regular regions that can be easily vectorized. However, Sympiler will not pad the red and yellow regions, since they are too irregular, and padding will introduce a large number of additional operations. Instead, Sympiler will fall back to the original non-affine code and miss any additional vectorization opportunities. PIC will generate fully unrolled code for the non-strided regions, as shown in Figure 1d. While this reduces the number of instructions, it increases code size and as a result, increases the number of instruction cache misses, especially for large matrices, limiting PIC’s scalability.

For irregular regions, DDF uses partially strided codelets to improve locality and to provide a better vectorization efficiency. The SpMV code for the red region has non-strided accesses in x, while its accesses to Ax and y are strided. DDF generates the PSC codelet bordered with red in Figure 1b. The code stores the non-strided addresses in s to be reused across all three iterations of instead of loading indices for every operation. This also enables the reuse of index Ap[i] in Ax so it does not have to be loaded per each outermost iteration. Similarly, for the yellow region, accesses to both Ax and x are non-strided, but vectorizing computations of the innermost loop improves spatial locality due to contiguous accesses in Ax.

Iii Partially Strided Codelet

This section introduces partially strided codelets and discusses their classification as well as efficiency in optimizing computation regions with irregular access patterns. We also present a novel cost model and a differentiation-based PSC detection strategy, both of which are used in the DDF framework to efficiently find PSCs in sparse matrix computations.

Fig. 2: Polyhedral representation of the SpMV kernel shown in Figure 1a

Iii-a Definitions

Polyhedral model and data access functions. A loop nest that contains a set of statements is represented with a polyhedral model through an integer polyhedron sets and relations . A statement is made of a data space described with which is a disjoint set containing . An integer polyhedral set is a collection of inequalities that create bounds for each dimension inside . For each a data access function is used to describe how the data space is accessed by the iteration space of . In other words, a data access function maps an iteration space to a data space, i.e. . The SpMV code in Figure 1a has one statement. That statement has three data spaces y, Ax, and x as well as three data access functions. Figure 2 shows the polyhedron sets for and also the access functions corresponding to each data space.

Fig. 3: a) Shows the set of SpMV operations from the red outlined region of the matrix in Figure 1. b) Shows the iteration space and access functions of a codelet that maps to the data-space seen in section of the figure.

Codelet. A polyhedral model that has a convex integer polyhedron with no flow dependencies, i.e. read after write dependency between access functions, is a codelet. A codelet only has one statement and the operation in that statement is a SIMD-supported operation. A set of SpMV operations is shown in Figure 3a and its polyhedral representation including iteration space and data access functions are shown in Figure 3b. All operations in this model are independent and have only one statement (y[*] += Ax[*] * x[*]), with a multiply-add operation which is supported by Intel and AMD SIMD units. These properties satisfy the criteria for being a codelet.

Strided data access function. A function that can be expressed with a linear combination of induction variables in is a strided access function. The data access function and in Figure 3b are both strided and can be expressed as a linear combination of .

Iii-B Partially Strided Codelet Classification

An efficient way to vectorize a codelet is to find operations with strided accesses across different iterations. This enables the vectorization of more operations and potentially increases data reuse between different iterations. However, current approaches are primarily limited to vectorizing codelets that all of their access functions are strided. An efficient BLAS [blackford2002updated] implementation is used for this purpose and thus, we call these codelets types BLAS codelets. In this work we define a novel set of codelets called Partially Strided Codelets that have at most strided access functions and at least one non-strided access function. These codelets can benefit from vectorization because they have one or more strided access functions.

PSCs are classified into different types based on the number of access functions that are strided. For example in SpMV and SpTRSV that have codelets with three access functions,

, two types of PSCs can be defined. The PSC type I codelet is used when two of access functions are strided, and the PSC type II codelet is used when only one access function is strided. Figure 1b shows two types of PSCs, outlined in yellow and red, and also the BLAS codelet, outlined in green.

Iii-C PSC Efficiency and Cost Model

Fig. 4: Shows best performing codelet type on a wide variety (150K total points) of computational regions found in SpMV/SpTRSV.

Depending on the number of operations and strided accesses in a computation region, either PSC or BLAS codelets can be profitable. To illustrate this, we extracted over 150,000 computation regions, with a different number of operations and strided accesses, from the SpMV and SpTRSV kernels on 60 matrices in the Suitesparse repository [davis2011university]. Each region is optimized using one or several codelets of the same type and the best performing codelet type for that region is inserted as a point in Figure 4. The red, green, and blue colors are used for PSC I, PSC II, or BLAS codelets respectively. The figure shows that some regions perform best with PSCs and some with BLAS codelets, e.g. PSC I provides the fastest execution time over all codelets for regions with a large percentage of non-strided accesses.

The performance of codelets differs based on the number of memory accesses that occur from running the codelets. Per Section III-A, an access function is described as where is an integer and is an -dimensional array. If is a strided access function, an integer for each needs to be loaded from memory before the operation in the codelets executes. Each integer, , is the coefficient in the linear combination of . However, for a non-strided access function, the number of memory accesses in the codelet is equal to the size of for , which is larger than one. Thus, compared to BLAS codelets, PSCs access more memory addresses. But this does not imply that PSC codelets are inefficient because the number of codelets used to cover a computational region also affects the total memory accesses. For example, the computation region in Figure 3b is a PSC I with one non-strided access function where and thus requires loading 4 memory addresses. If BLAS is selected for the same region, a minimum of 3 codelets will need to be used leading to 9 loads.

Fig. 5: Correlation between predicted cost of codelets and average execution time of each codelet. Averages were obtained using 1,000,000 executions per codelet.

To efficiently determine the types of codelets for a computation region, we first define a cost model to compute the cost of a single codelet. The cost model uses the number of memory accesses of each memory access function, i.e. and the number of operations in a codelet(

) to estimate the execution time of the codelet. The cost of a codelet

is defined as:


If a computation region is described with codelet types then the overall cost of the codelets for that region is . Figure 5 shows the correlation between the codelet cost model and the execution time of codelets. We use a set of codelets with different sizes from SpMV and SpTRSV on different computation regions of the sparse matrices in Suitesparse [Davis2011]. The X-axis shows the cost of each codelet, and the Y-axis shows their corresponding execution time in seconds. As shown, the cost model predicts the performance of the codelets with a correlation of .

Iii-D Differentiation Based PSC Detection

In this section, we explain how the first order partial difference (FOPD) of the access functions in a computation region can be used to detect a codelet type.

First Order Partial Differentiation (FOPD). Given the data access function with iteration space of , the first order partial difference of with respect to is computed as . FOPD shows if accesses to a data space are strided with respect to the induction variable . Figure 6 illustrates the process of computing the FOPD for the access function given the computation region shown in Figure 3. For example, the FOPD of evaluated at is .

Fig. 6: Illustrates the process of taking the derivative of the access function with respect to and .

FOPD of access functions are used to distinguish types of codelets by finding strided access functions. Given a codelet with three access functions and the iteration space of , an access function is strided if its FOPDs with respect to are equal in the entire iteration space. In other words, is strided if the elements in are equal to each other and similarly for elements in . With the strided definition above, all codelet types can be defined per definitions in Section III-B. For example, the first three accesses in function in Figure 6 are to consecutive locations (0, 1, 2) in Ax. The FOPD of the first two accesses, wrt. , is FOPD and for the second and third accesses is FOPD. Similarly the FOPD of the accesses for are FOPD and FOPD. Since FOPD and FOPD are equal, and FOPD and FOPD are equal, the function in the iteration space is strided. Because function is also strided and is not strided, the codelet is categorized as a PSC type I.

1#include "DDF.h"
2int main(){
3 Kernel SpMV;
4 SpMV.generate_c("spmv.h", Arch::x86);
5 return 1;
Listing 1: The input code to DDF.
1#include "DDF.h"
2#include "spmv.h"
3int main(){
4  Matrix A("input.mtx");
5  Vector x("x.mtx"), y("y.mtx");
6 /// ------ Inspector ------ ///
7 vector<codelet *> clist=DDF_Inspector(A.pattern(), SpMV);
8 /// ------ Executor ------ ///
9 SpMV_Vectorized(clist, A, x, y);
10 return 1;
Listing 2: A sample driver code to call the DDF-generated code.

Iv DDF: Differentiating Data Access Functions to Mine PSC Codelets

DDF is an inspector-executor framework that finds partially strided codelets from an input code and the matrix sparsity pattern using an inspector and generates a vectorized code as its executor. The DDF input specification is shown in Listing 1. It takes the input kernel, SpMV or SpTRSV, and the target architecture type and generates a driver code shown in Listing 2 and a vectorized code in spmv.h. The vectorized code is generated when generate_c is called in line 4 of Listing 1. In lines 4–5 of the Listing 2, the input matrix and vectors and are loaded. DDF’s inspector then creates a list of codelets in line 7, and the vectorized code executes in line 9.

Iv-a The DDF Inspector

DDF’s inspector uses Algorithm 1 to generate efficient code for the input sparse kernel. The algorithm first creates groups of independent iterations and then for the computation regions in each group finds a best codelet combination to generate.

Inputs and output

The inputs to the DDF inspector algorithm are a kernel code , i.e., SpMV or SpTRSV, and the pattern of the input matrix . The algorithm generates a list of codelets to be used in the vectorized code. In DDF, the kernel code is represented with an abstract syntax tree (AST) that represents loops and operations. The pattern is stored in a compressed sparse row (CSR) storage.

Input : 
Output : 
/* 1) Grouping independent memory accesses */
1 ;
2 ;
3 ;
/* 2) Codelet Creation */
4 for  do
5        FOPDs ;
6        ;
7        for  do
8               FOPDs);
9               FOPDs);
10               FOPDs);
11               ;
12               ;
Algorithm 1 DDF Inspector

Iv-A1 Grouping independent memory accesses

The first step of the inspector algorithm computes memory access functions of the input code and then generates a graph and uses it to create groups of operations that are independent to support parallelism. Function in line 1 of Algorithm 1 uses kernel and goes over to compute access functions and the iteration space of the input kernel. Then in line 1 uses access functions to compute a graph . The vertices in this graph represent iterations of the outermost loop and its edges are the flow dependencies across iterations of . In kernels with loop carried dependencies, is passed to the LBC algorithm [Cheshmi2019] to create independent parallel workloads. For sparse kernels with parallel loops, does not have any edges, thus instead, the function groups iterations based on the number of operations. Finally, iterations of the outermost loop are grouped and the groups are stored in using in line 3.

Iv-A2 Codelet creation

For every group of iterations , the algorithm generates the FOPD of the access functions in line 5. Function in line 1 groups operations across consecutive iterations and creates a number of computation regions stored in . To reduce the overhead of mining, the algorithm limits the search window in line 6 to consecutive iterations. Our experiments show that this does not lead to slowdowns because consecutive iterations in sparse codes typically operate on consecutive rows of a matrix, and thus their vectorization can potentially improve spatial locality.

Fig. 7: Shows generic codes used to evaluate different codelet types; all codelets are parameterized by m, n which bound the outer and inner dimensions respectively. Each type also uses f,g and h which are special structs containing offset information used to process each codelet.

Lines 8-12 in the algorithm inspect each computation region and finds the best codelet combination for that region using three different strategies, i.e. , , and . Since the strategies mine for codelets, they take as input the FOPDs computed from line 5 along with . The BLAS-first strategy prioritizes finding BLAS codelets and then mines for PSCs. The PSC I-first strategy mines for PSC type I codelets first, and the PSC II-strategy looks for PSC type II codelets and then other codelets. To reduce the execution time of each strategy, we store the list of codelets for each mined region and look up this list when executing the other strategies. The costs of codelets from each approach are stored in constants , , and and then in line 11 the combination with the lowest cost is chosen. Line 12 appends the most efficient codelet combination found for the current computation region to the output list of codelets in the Algorithm, i.e. .

Iv-B DDF Executor

The DDF driver runs the executor code, as shown in line 9 of listing 2. To efficiently use the instruction cache and to eliminate the need for recomputing a new executor code per matrix pattern, DDF generates a parametric code. The parametric code for the three types of codelets, BLAS, PSC I, and PSC II are shown in Figure 7. As shown each generic code of a codelet class takes the data access functions as input and can vectorize any codelet of the same class. These generic codelets are called inside a switch statement in the executor code, shown in Listing 3. In the executor, each mined codelet in the codelet list is mapped to a generic codelet using this switch statement. A change in the input sparsity pattern results in a different codelet list, however, the same parametric code is still used. For matrices of different size, the codelets list size changes, however, the same executor code in Listing 3 can be used. Thus, the code size is invariant to the size of matrix.

1#include "Codeletsx86.h"
3void vectorized_code(vector<Codlet*> clsit, CSR* A, double *x, double *y){
4 for(int i = 0; i < clist.partitions(); i++){
5#pragma omp parallel
6  for(int j = 0; j < clist[i].size(); j++){
7   switch(clist[i][j].type){
8    case BLAS:
9     BLAS_codelet();
10     break;
11    case PSCI:
12     PSCI_codelet();
13     break;
14    case PSCII:
15     PSCII_codelet();
16     break;
Listing 3: The vectorized code for SpMV generated by DDF in spmv.h.

V Results

We evaluate the performance of DDF using two kernels, sparse triangular solver (SpTRSV) and sparse matrix-vector multiplication (SpMV). DDF is compared to two other inspector-executor approaches, i.e. Sympiler and PIC [Augustine2019]. We also compare DDF to libraries CSR5 [liu2015csr5], MKL [mkl_lib], and our implementation of SpMV using the ELLPACK [saad2003iterative] storage format (referred to as ELL in the figures). Sympiler does not support SpMV, and SpTRV is not supported by the CSR5, PIC, and ELL implementations. Hence those tools are omitted from the respective figures.

The set of symmetric positive definite (SPD) matrices is used for evaluation. These matrices are selected from the SPD symmetric matrices in the SuiteSparse repository [davis2011university] and are diverse in both size and sparsity pattern. For the SpTRSV evaluations, we only use the lower triangular half of the SPD symmetric matrices. Matrices with ID 0=30 are the thirty largest matrices in the repository, Matrix IDs 31-49 are L-factors [davis2005cholmod] of the largest matrices, and 10 small matrices (ID = 50-60) with 50-350K nonzeros are also included. L-factors matrices are included to compare the performance of DDF to Sympiler because Sympiler is best suited to optimize these matrices. Small matrices are included to compare with PIC as it does not scale to larger problems. The test-bed architectures are an Intel(R) Xeon(R) Gold 5115 CPU (2.8GHz, 14080K L3 Cache) with 20 cores and 64GB of main memory (Intel) and an AMD Ryzen 3900x CPU (3.8GHz, 64MB L3 Cache) with 12 cores and 32GB of main memory (AMD). The performance of kernels is reported on both architectures but the analysis is only shown for the Intel machine. All generated code, implementations of different approaches, and library drivers are compiled with GCC v.7.2.0 compiler and with the -O3 flag. All benchmarks are executed 5 times and the median value of all runs is reported. To report the performance in GFLOP/s, we compute the theoretical floating-point operations for each kernel and matrix and divide it by execution times of each tool.

The sequential implementation of SpMV CSR and SpTRSV CSR are used as a baseline. Since the code for PIC is not publicly available, we created an in-house inspector-executor implementation of their approach with feedback from the authors of [Augustine2019]. PIC was originally developed for a single thread, however, we extended its support to be parallel and report the best performance between the two implementations in all figures. A timeout of 4 hours was used for all runs (including the inspector and the executor time). Thus, PIC does not have data points in respective figures for large matrices. Also, our implementation of ELL which is based on [saad2003iterative], has missing points in some figures because the number of non-zero fill-ins exceeds the main memory of our test-bed architecture.

V-a SpMV Performance

The performance of the SpMV kernel for DDF, MKL, CSR5, ELLPACK, baseline, and PIC is shown in Figure 8. As shown, DDF is faster than all other tools for over 88% of the matrices tested. DDF’s code is on average 1.93x, 1.42x, 10.57x, and 7.19x faster than in order MKL, CSR5, ELLPACK, and PIC respectively on the Intel architecture. DDF’s code is on average 1.22x, 1.29x, 8.54x, and 10.93x faster than in order MKL, CSR5, ELLPACK, and PIC respectively on the AMD architecture.

Fig. 8: Top: Shows DDF speedups for the SpMV kernel over the single threaded baseline SpMV CSR algorithm ran on an Intel architecture. Speedups also reported for parallel MKL, parallel CSR5, parallel ELLPACK and single threaded PIC [Augustine2019] over the single threaded baseline SpMV CSR algorithm. Bottom: Shows DDF, MKL, CSR5, ELLPACK and PIC speedups for the SpMV kernel using an AMD architecture.
Fig. 9: Shows percentage of points for all matrices processed by BLAS, PSC type 1 or PSC type 2 codelets for SpMV and SpTRSV.

To demonstrate the effect of partially strided codelets on the performance of DDF, in Figure 9 we use a stacked bar for DDF. The stacks show the GFlop/s for the baseline code (input to DDF), for DDF when it only mines for BLAS codelets (DDF-baseline+BLAS and refer to DDF BLAS_only), and from running the entire DDF algorithm, i.e. Algorithm 1, that mines for BLAS and PSC (shown with DDF or DDF-baseline+BLAS+PSC in the figure). As shown, DDF is on average 10 faster than DDF BLAS_only which demonstrates the importance of using PSC codelets. Figure 9 shows the percentage of operations that are vectorized in the generated code from DDF with PSC I, PSC II, and BLAS codelets, obtained by averaging over all matrices in the benchmark. As shown, over 80% of the operations in SpMV are vectorized with PSC codelets.

The PSC codelets improve data locality in DDF’s generated code. Figure 10 shows the relation between DDF’s performance and the performance of the parallel (OpenMP) version of the baseline code. Average memory access latency [hennessy_quantitative] is used as a measure for locality and is computed by gathering the number of misses and accesses to L1, L2, and LLC caches using the PAPI [terpstra2010collecting] performance counters. Figure 10 shows the coefficient of determination or R is 0.83 which indicates a good correlation between speedup and the memory access latency.

Fig. 10: Shows the correlation between speedup over the baseline while profiling the code with PAPI  [terpstra2010collecting] and the relative memory cycle for matrices used in the test set for SpMV. Relative memory cycle is the average memory cycle of the baseline over the average memory cycle of the DDF framework codes.

To further explain why DDF is faster than other tools, we conduct a few experiments and report the resulting average over all matrices in this paragraph. For MKL, we compute its average memory access latency, which is 4.36x slower than that of DDF, contributing to the worse performance compared to DDF. To compare to CSR5, we count the number of instructions. CSR5 executes 1.73x more instructions compared to DDF. This is potentially due to the overhead of the segmented sum calculations used in their approach to improving vectorization and load balance. DDF is on average 10.57x faster than ELL, because the percentage of fill-in in the ELL implementation is 800% relative to the non-zero elements in the matrix, significantly increasing the number of operations in their method. To conclude, the SpMV code of DDF is faster than existing implementations because it improves data locality and/or reduces the number of instructions via vectorization.

V-B SpTRSV Performance

Figure 11 compares the performance of SpTRSV using DDF, MKL, and Sympiler. The stacked bar of DDF shows the additional performance gained from mining partially strided codelets; the trend is similar to that in Figure 8 for SpMV. On average DDF is faster than MKL, Sympiler, and the baseline 4.5, 1.79, and 3.62 respectively for the intel architecture. On the AMD architecture, DDF is faster than MKL and Sympiler 2.66, 1.38 respectively.

As shown in Figure 11, partially strided codelets are the main contributors to the overall performance of the DDF’s SpTRSV code. On average, DDF is 3.9 faster compared to when only BLAS codelets are generated. Similar to SpMV, PSCs contribute to optimizing 78% of the operations over all matrices (Figure 6b). However, the number of PSC II codelets has increased from 18% in SpMV to 53% in SpTRSV. The number of computational regions with more than one strided access function is small, due to the existing dependencies in the SpTRSV kernel, thus, more PSC type II codelets are generated. Similar to the SpMV kernel, the mined PSCs in DDF improve locality. The correlation coefficient between the speedup and relative memory cycle is 0.67 which is consistent with the trend in SpMV.

While MKL provides an efficient and vectorized implementation for single-threaded SPTRV executions, it’s not optimized to execute on parallel processors; the performance of MKL’s parallel code is similar to its serial implementation. Sympiler performs well for matrices that contain row-blocks or can be padded with up to 30% nonzeros to create row-blocks. These row-blocks are converted to BLAS calls and thus improve locality. L-factors are generated from a Cholesky factorization of the original matrices in the SuiteSparse collection. Since L-factors are created with a supernodal factorization process [chen2008algorithm], the resulting matrices are well-structured and have many row-blocks. As shown, Sympiler’s performance on L-factor matrices (IDs 31-50) is close to that of DDF’s with an average speedup of 1.18, while for all other matrices DDF’s performs on average time 2.1 better than Sympiler.

Fig. 11: Top: Shows breakdowns for DDF speedups, when enabling BLAS and PSC codelet mining, MKL and Sympiler speedups over the single threaded baseline SpTRSV CSR algorithm on Intel archtecture. Bottom: Shows DDF, MKL and Sympiler speedups on AMD architecture.

V-C The Inspector Overhead

We compare the inspection time of DDF to that of other inspector-executors/code-generation frameworks, i.e. Sympiler and PIC. We compute the number of executor runs (NER) that amortize the cost of the inspector using
. The baseline time is obtained by running sequential implementation of the kernel. The PIC inspector would timeout for over 83.3% of matrices because of its large inspection overhead and code compilation time. For small matrices that PIC would execute, more than one million executor runs are needed to amortize the cost of the inspection. DDF’s inspection time is on average 0.5 seconds with an average NER of 15 for SpMV. The inspection time of DDF and Sympiler are similar with an average NER of less than 100 for both tools for SPTRV. The largest matrices in our benchmark are inspected in less than 7 seconds with DDF. Sparse kernels such as SpMV and SpTRSV are typically used in iterative solvers, for example, to compute a residual in each iteration or to apply a preconditioner per iteration. Even with preconditioning, these solvers typically converge to a solution after tens of thousands of iterations [benzi2000robust, kershaw1978incomplete, papadrakakis1993accuracy] and hence inspector-executor frameworks such as DDF and Sympiler lead to noticeable speedups as their inspection time overheads are amortized after a few initial iterations of the solver.

Vi Related Work

Numerous hand-optimized libraries [intel-alt, eigenweb] and implementations [Li2003, Hutchinson1995, davis2005cholmod, davis2004algorithm] exist that optimize the performance of sparse matrix computations for different parallel architectures and also optimize vectorization on a single core. Libraries such as MKL and Eigen as well as implementations in [Yesil2020, li2013gpu, williams2007optimization, kamin2014optimization, merrill2016merge] optimize the performance of SpMV on shared memory architectures and improve SIMD vectorizability. A number of library implementations such [saad2003iterative, vuduc2003automatic, vuduc2005fast, li2020vbsf, liu2015csr5, tang2015optimizing, xie2018cvr] reorganize data and computation to increase opportunities for vectorization. A class of these libraries implement and optimize sparse kernels based on available storage formats; for example [kreutzer2014unified] optimizes SpMV based on ELLPACK. Other libraries work best for matrices arising from specific applications such as [Yesil2020], which optimizes SpMV for large matrices from graph analytics, or KLU [davis2010algorithm] which works best for circuit simulation problems.

Domain-specific compilers use domain information to enable the application of code transformations and optimizations such as vectorization. For example domain-specific compilers such as [jones1993partial, quillere2000generation, Spampinato2014] optimize signal processing applications, stencil computations are optimized in [jones1993partial, quillere2000generation, spampinato2014basic], and matrix assembly and mesh analysis is optimized in [alnaes2014unified, luporini2017algorithm]. While these compilers provide highly-optimized code for the applications that they accelerate, they do not generate specialized code for the input pattern and the computations do not have indirect accesses.

Inspector-Executor approaches inspect the irregular access patterns of sparse matrix computations at run-time to enable the automatic optimization of sparse codes [Shin2010, Tang2011, VanDerSpek2011, vasilache2006polyhedral, Venkat2015, agrawal1995interprocedural, das1995index, Rodriguez, saltz1990run]. The index array accesses of the sparse code is analyzed using an inspector and the information is used at run-time to execute the code efficiently. The sparse polyhedral framework [lamielle2010enabling, strout2012set, venkat2016automating] uses uninterpreted function symbols to express regular and irregular segments of sparse codes. As a result it is able to automatically generate inspector executors at compile time that can resolve data dependencies in sparse computations. These approaches do not generate code that is specialized for the sparsity of the input matrix. Sympiler [Cheshmi2017] and ParSy [Cheshmi2019] are amongst the inspector executor frameworks that inspect the matrix sparsity pattern and as a result generate vectorized and parallel code specialized for the input sparsity. Their optimizations for tiling and vactorization are based on detecting row-blocks that primarily exist in matrices obtained from numerical factorizations.

Augustine et al. [Augustine2019] proposed an approach based on the Trace Reconstruction Engine [Rodriguez2016, Ketterlin2008] where polyhedral representations are built by inspecting the sequence of addresses being accessed in the sparse matrix vector multiplication. A followup to this work, proposes to use program guided optimization for better vectorization [Selva2019]. These approaches lead to generating code that is specialized for the sparsity pattern of the input matrix and improves SIMD vectorization in SpMV. However, their work can only support small matrices (below 0.5M nonzeros) because of inspector overheads and because the size of the generated code increases with the matrix size. Sparse computations with loop-carried dependencies such as the sparse triangular solver are also not supported. DDF inspects memory address accesses to find SIMD vectorizable code for sparse triangular solve and SpMV and supports both small and large matrices.

Vii Conclusion

In this work, we present partially strided codelets that enable the vectorization of computation regions with unstrided memory accesses in sparse matrix codes. We demonstrate how these codelets increase opportunities for vectorization in sparse codes and also improve data locality in their computation. A novel inspector-executor framework called DDF is proposed. DDF uses an efficient inspector to mine for PSCs with a memory access differentiation approach and as a result, generates highly efficient code for sparse kernels. The performance of the DDF-generated code is compared to state-of-the-art library implementations and other inspector-executor frameworks for the sparse matrix-vector multiply and the sparse triangular solve kernels. We also demonstrate that the inspection overhead of DDF is negligible.

Viii Acknowledgments

This work was supported in part by NSERC Discovery Grants (RGPIN-06516, DGECR00303), the Canada Research Chairs program, and U.S. NSF awards NSF CCF-1814888, NSF CCF-1657175; used the Extreme Science and Engineering Discovery Environment (XSEDE) [Towns et al. 2014] which is supported by NSF grant number ACI-1548562; and was enabled in part by Compute Canada and Scinet111www.computecanada.ca