Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

11/15/2017 ∙ by Athena Elafrou, et al. ∙ National Technical University of Athens

This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices leads to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this end, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup is required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.


I Introduction

Sparse matrix-vector multiplication (SpMV) is a fundamental building block of iterative methods for the solution of large sparse linear systems and the approximation of eigenvalues of large sparse matrices arising in applications from the scientific computing, machine learning and graph analytics domains. It has been repeatedly reported to achieve only a small fraction of the peak performance of current computing systems due to a number of inherent performance limitations that arise from the algorithmic nature of the kernel, the employed matrix storage format and the sparsity pattern of the matrix. In particular, SpMV is characterized by a very low flop:byte ratio, since the kernel performs $O(\mathrm{nnz})$ operations on $O(\mathrm{nnz} + N)$ amount of data (we define $\mathrm{nnz}$ as the number of nonzero elements and $N$ as the number of rows/columns in the matrix; for simplicity, we assume a square matrix), indirect memory references as a result of storing the matrix in a compressed format, and irregular memory accesses to the right-hand side vector due to sparsity.

SpMV is typically a memory bandwidth bound kernel for the majority of sparse matrices on multicore platforms. Its bandwidth utilization is strongly dependent on the sparsity pattern of the matrix and the underlying computing platform. Consequently, most optimization efforts proposed in the literature over the past years have focused on reducing traffic between caches and main memory, primarily by compressing the memory footprint of the matrix [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. With the advent of many-core architectures, including GPGPUs and the Intel Xeon Phi processors, the performance landscape has become more diverse. For instance, the larger number of cores available in such architectures makes SpMV performance more sensitive to the workload distribution among threads. As the performance of SpMV is becoming increasingly dependent on both the input problem and the underlying computing platform, there is no one-size-fits-all solution to attain high performance. Thus, we advocate the significance of matrix- and architecture-aware runtime specialization. The value of cherry-picking optimizations for SpMV is further emphasized by the fact that blindly applying optimizations can actually hinder performance. As an example, Fig. 1 shows the effect of different optimizations on an Intel Xeon Phi processor. Even though each optimization achieves significant gains for some matrices, it may result in nonnegligible slowdowns for others. Optimizing SpMV usually involves a preprocessing step to analyze the sparse matrix structure and may include format conversion, parameter tuning, etc. While the overhead of this step may be amortized for applications that reuse the same matrix over many invocations of SpMV, it can outweigh any performance benefit when a smaller number of iterations is required for solver convergence, which is often the case in preconditioned solvers [13], or when the structure of the matrix changes frequently, e.g. in graph applications. This is why many recent efforts have focused on designing more lightweight optimizations [14, 15].

Fig. 1: Speedup (slowdown) of different software optimizations applied to the CSR SpMV kernel on Intel Xeon Phi (codename Knights Corner).

In this paper, we propose an adaptive SpMV optimizer that seeks to provide performance stability, i.e., improved performance for all sparse matrices, and low overhead in order to be beneficial to problems that require a small number of iterations to converge. We especially focus on designing an optimizer that is more lightweight than a trivial one that runs all optimizations and selects the best for the target application.

In order to provide performance stability, our optimizer relies on SpMV bottleneck analysis of the input matrix on the target platform. We formulate bottleneck detection as a classification problem (Section III-A) and first develop a profile-guided classifier (Section III-C), that relies on micro-benchmarks to classify the matrix. As the online profiling phase has a nonnegligible cost, we go one step beyond, and propose a classifier that relies only on matrix properties that can be cheaply computed (Section III-D). This feature-guided classifier is pre-trained during an offline stage with the use of machine learning techniques and performs feature extraction on-the-fly. The runtime overhead of this classifier is very low, making it extremely lightweight (Section IV-D).

Once the performance bottlenecks of a matrix have been detected, we apply suitable optimizations to tackle them. We employ a simple and easy-to-implement set of optimizations (Section III-E), based on the general-purpose Compressed Sparse Row (CSR) sparse matrix storage format. Our approach, however, is compatible with a plethora of other optimizations found in the literature. Again here, the idea is to resort to optimizations that require short preprocessing times, further contributing to our objective for a lightweight scheme.

We evaluate our optimizer using both classifiers on two generations of Intel’s Xeon Phi many-core processor and an Intel Broadwell processor. The performance analysis in Section III demonstrates the diversity of SpMV bottlenecks among different matrices, while our experimental results in Section IV show the diversity of SpMV bottlenecks among different computing platforms and the effectiveness of the proposed approach in distinguishing and appropriately optimizing the majority of matrices, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV implementations available in the latest Intel MKL library, which is highly optimized for x86-based platforms.

II Background

In order to avoid the extra computation and storage overheads imposed by the large majority of zero elements found in a sparse matrix, it has been the norm to store the nonzero elements contiguously in memory and employ auxiliary data structures for their proper traversal. The most widely-used general-purpose sparse matrix storage format, namely the Compressed Sparse Row (CSR) format [16], uses a row pointer array to index the start of each row within the array of nonzero elements, and a column index array to store the column of each nonzero element. The SpMV kernel using this format is given in Fig. 2.

1:procedure Spmv(A::in, x::in, y::out)
2:A: matrix in CSR format
3:x: input vector
4:y: output vector
5:      for i ← 0 to N − 1 do
6:            for j ← rowptr[i] to rowptr[i+1] − 1 do
7:                  y[i] ← y[i] + values[j] · x[colind[j]]
8:            end for
9:      end for
10:end procedure
Fig. 2: SpMV implementation using the CSR format.
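For concreteness, the following C sketch renders the same kernel. The container type csr_t and the OpenMP static schedule are illustrative (the baseline used later in the paper partitions rows by nonzero count rather than by row count); it should be compiled with OpenMP enabled (e.g., -qopenmp with ICC).

```c
/* Illustrative CSR container: N rows, nnz nonzero elements. */
typedef struct {
    int     N;       /* number of rows                    */
    int    *rowptr;  /* size N+1: start of each row       */
    int    *colind;  /* size nnz: column of each nonzero  */
    double *values;  /* size nnz: nonzero values          */
} csr_t;

/* y = A * x for a matrix A stored in the CSR format. */
void spmv_csr(const csr_t *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->N; i++) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            sum += A->values[j] * x[A->colind[j]];
        y[i] = sum;
    }
}
```

Later sketches in this paper reuse this illustrative csr_t type.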

SpMV is characterized by a low operational intensity. It performs $2\cdot\mathrm{nnz}$ floating-point operations on $O(\mathrm{nnz})$ amount of data, leading to a flop:byte ratio of less than 1. According to the Roofline performance model [17], computational kernels with low operational intensities tend to be memory bound. Furthermore, the auxiliary indexing structures required to access the nonzero elements, namely rowptr and colind in case of CSR, introduce additional load operations, traffic for the memory subsystem, and cache interference [2]. Access to the input vector is irregular and depends on the sparsity pattern of the matrix. This fact complicates the process of exploiting any spatial or temporal reuse. Many sparse matrices contain a large number of short rows that may degrade performance due to the small trip count of the inner loop [18]. Finally, many sparse matrices are characterized by rows with highly uneven lengths or regions with different sparsity patterns, which may result in workload imbalance [19].
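As a back-of-the-envelope illustration (assuming 8-byte values and 4-byte column indices, and counting only the compulsory traffic for the nonzeros themselves), the ratio is roughly:

```latex
\frac{\text{flops}}{\text{bytes}}
  \approx \frac{2\,\mathrm{nnz}}{(8+4)\,\mathrm{nnz}}
  = \frac{1}{6} \approx 0.17 \ll 1 ,
```

which is consistent with the observation above that the flop:byte ratio stays below 1.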

III Optimization Tuning Methodology

Optimization selection can be solved in various ways. For instance, one could simply take an empirical approach: measure how different optimizations work for a particular matrix on the target machine and then apply the most efficient optimization on future runs of the same matrix. This is the most accurate method; however, it incurs a substantial overhead to sweep over all candidate optimizations, some of which require significant preprocessing time, including format conversion and parameter autotuning. In order to reduce this overhead, one could employ a “blind” machine learning approach. By defining optimizations as classes, one could train a classifier with machine learning techniques to select a proper optimization. However, in this case, the classifier would need to be rebuilt whenever an optimization is added or replaced and, furthermore, no intuition would be provided into when and why an optimization works.

We propose a solution that is lightweight, intuitive and enables a modular framework design. We formulate optimization selection as a multiclass, multilabel classification problem, where classes represent performance bottlenecks. For every class, we implement an optimization that is designed to address the corresponding bottleneck. Given an input matrix, its bottlenecks are identified through classification and the corresponding optimizations are jointly applied. Since most optimizations proposed in the literature are designed with a specific bottleneck in mind, by detecting bottlenecks instead of optimizations, our classifier explains why an optimization actually improved performance for a particular matrix. Furthermore, by decoupling bottleneck identification from the application of optimizations, one can build a classifier once and optimizations can be henceforth added or replaced in a plug-and-play fashion.

III-A Formulation as a Classification Problem

Based on the analysis presented in Section II and the extensive research that has been realized over the past few decades in SpMV performance analysis and optimization, we define the following classes, each representing a potential SpMV performance bottleneck:

  • MB: This class includes matrices that achieve memory bandwidth utilization values close to the peak (which may differ depending on the working set size) and are, therefore, Memory Bandwidth bound. Such matrices usually exhibit a regular sparsity structure.

  • ML: This class refers to matrices that are Memory Latency bound. Such matrices exhibit poor locality in the accesses to the right-hand side vector due to a highly irregular sparsity pattern, which cannot be detected by hardware prefetching mechanisms available in current architectures.

  • IMB: This class includes matrices with highly uneven row lengths or regions with completely different sparsity patterns that exhibit thread IMBalance due respectively to an uneven workload distribution or computational unevenness.

  • CMP: This class includes matrices that present CoMPutational bottlenecks. Such matrices either fit in the system’s cache hierarchy and have an operational intensity closer to the corresponding dense kernel, pushing them closer to the ridge point of the Roofline model [17], or have the majority of nonzero elements concentrated in a few dense rows, e.g. power law matrices.

The above classes are quite generic and cover a wide variety of matrices with different structural characteristics.

III-B Per-Class Performance Bounds

In this section, we perform a thorough bound and bottleneck analysis for the SpMV kernel. Inspired by the “bounding techniques” described in [20], for every class defined in Section III-A, we derive an upper bound on performance that represents the maximum performance that can be achieved by completely eliminating the corresponding bottleneck. The idea here is that by comparing the actual performance with the per-class upper bounds, we are able to estimate which bottlenecks are important to address and what is the potential performance gain. In the following analysis, $P$ stands for performance, $T$ for memory traffic, $B$ for bandwidth, $S$ for size and $t$ for execution time.

  • MB: For matrices of this class the available memory bandwidth is saturated. Thus, an upper performance bound can be derived by assuming maximum sustainable main memory bandwidth as

    $P_{MB} = \dfrac{2\cdot\mathrm{nnz}}{(T_{mtx}^{f} + T_{vec})\,/\,B_{max}}$,

    where $B_{max}$ is the maximum sustainable memory bandwidth of the system (adjusted upwards for matrices that fit in the system’s cache hierarchy), while $T_{mtx}^{f}$ and $T_{vec}$ are the minimum memory traffic that can be generated by the matrix stored in format $f$ and the vectors, respectively. Note that compulsory misses set the minimum memory traffic and, hence, $T_{mtx}^{f} = S_{mtx}^{f}$ and $T_{vec} = S_{vec}$.

  • ML: For this class, we can derive an upper bound $P_{ML}$ by running a modified SpMV kernel where irregular accesses to the right-hand side vector are converted to regular accesses. In case of the CSR format, we can achieve this by setting all entries of the colind array to the row index of the corresponding element (a sketch of this micro-benchmark follows the list below).

  • IMB: An upper bound for this class can be estimated using the median execution time among all threads as

    $P_{IMB} = \dfrac{2\cdot\mathrm{nnz}}{\mathrm{median}(t_1, \ldots, t_p)}$,

    where $t_k$ is the execution time of thread $k$ and $p$ the number of threads. We use the median instead of the mean, as we require reduced importance to be attached to outliers.

  • CMP: For matrices of this class, we can define a very loose upper bound $P_{CMP}$ by running a modified SpMV kernel where we completely eliminate indirect memory references, resulting in unit-stride accesses only. In case of the CSR format, we no longer use colind to index vector $x$, but always access $x[i]$, where $i$ is the current row index.
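A minimal sketch of how the ML micro-benchmark can be set up, reusing the illustrative csr_t type from the Section II sketch (the paper's actual benchmark code may differ):

```c
/* Build a "regularized" column-index array for the ML upper bound:
 * every nonzero of row i now reads x[i], so the accesses to the input
 * vector become regular while the instruction mix and the traffic for
 * the values/colind arrays stay unchanged. */
void regularize_colind(const csr_t *A, int *colind_reg)
{
    for (int i = 0; i < A->N; i++)
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            colind_reg[j] = i;
}
```

Timing the CSR kernel with colind_reg substituted for colind yields $P_{ML}$; replacing the indirect access x[colind[j]] with the unit-stride access x[i] altogether gives the corresponding $P_{CMP}$ micro-benchmark.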

A generic upper bound on SpMV performance, independent of the format used to store the sparse matrix, can be defined as follows:

$P_{max} = \dfrac{2\cdot\mathrm{nnz}}{(T_{mtx}^{min} + T_{vec})\,/\,B_{max}}$,

where $T_{vec}$ and $B_{max}$ are defined as before. In order to estimate $T_{mtx}^{min}$ we assume we can only compress the indexing information of a sparse matrix (not the values). In this case, a minimum can be determined by completely eliminating the indexing structures, i.e., $T_{mtx}^{min} = S_{values}$, the size of the nonzero values alone.
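For intuition, plugging hypothetical numbers into this bound (say $\mathrm{nnz} = 10^{7}$, $N = 10^{6}$, 8-byte values, the input vector read once and the output vector written once, and $B_{max} = 100$ GB/s; these figures are purely illustrative, not measurements from our platforms) gives:

```latex
T_{\mathrm{mtx}}^{\min} = 8\,\mathrm{nnz} = 80\ \mathrm{MB},\qquad
T_{\mathrm{vec}} \approx 16\,N = 16\ \mathrm{MB},
\qquad
P_{\max} = \frac{2\cdot 10^{7}\ \mathrm{flops}}
                {(96\times 10^{6}\ \mathrm{B})\,/\,(100\times 10^{9}\ \mathrm{B/s})}
        \approx 20.8\ \mathrm{Gflop/s}.
```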

We apply the bound and bottleneck analysis on an Intel Xeon Phi co-processor using the CSR format as our baseline SpMV implementation. We refer the reader to Section IV-A for details on the experimental setup. The performance of the baseline kernel is denoted by $P_{base}$. Among the previously defined performance bounds, only $P_{ML}$ and $P_{CMP}$ require a micro-benchmark to be run on-the-fly. $P_{MB}$ and $P_{max}$ only require $B_{max}$ to be measured on the target system, while $P_{IMB}$ can be deduced from the baseline run. Fig. 3 shows the performance of the baseline along with the per-class and generic upper bounds for a representative set of matrices from the University of Florida Sparse Matrix Collection [21]. The relation of each per-class upper bound to the baseline differs among matrices, indicating a potential diversity in the bottlenecks of each matrix.

III-C Profile-Guided Classifier

By comparing the per-class upper bounds to the baseline performance presented in the previous section, we are able to derive heuristics that can be used to determine the class(es) of each matrix. Examining Fig. 3 in more detail, we distinguish the following cases:

Fig. 3: SpMV performance using the CSR format and per-class upper bounds on Intel Xeon Phi (codename Knights Corner).
  • $P_{ML} \approx P_{base}$ (consph, rajat30, circuit5M, etc.): Performance did not improve by eliminating irregular memory accesses, hence SpMV does not suffer from excessive cache misses. Note that these matrices incurred a slowdown with software prefetching in Fig. 1.

  • $P_{IMB} \approx P_{base}$ (citationCiteseer, web-Google, nd24k, etc.): This implies a balanced execution time among all threads.

  • $P_{base} \approx P_{MB}$ (consph, nd24k, boneS10, etc.): In this case, SpMV has saturated the available main memory bandwidth and, thus, the matrix is a candidate of the MB class.

  • $P_{ML} > P_{base}$ and/or $P_{IMB} > P_{base}$ (poisson3Db, parabolic_fem, offshore): In this case CSR SpMV benefited from eliminating irregularity and/or balancing thread execution time, indicating it may belong to the ML and/or IMB classes.

  • $P_{CMP} \approx P_{base}$ (ASIC_680k, rajat30, degme): This suggests that SpMV using the CSR format might be limited by computational bottlenecks, based on the following analysis:

    $t_{CMP} \approx t_{base}$   (1)

    Since the memory footprint of the matrix in the computation of $P_{CMP}$ is smaller (we no longer use the colind array) and we have eliminated irregularity as well, we can assume that $T_{CMP} \leq T_{base}$ and $B_{CMP} \geq B_{base}$. Thus, in order for (1) to hold, the execution time must be dominated by computation rather than memory transfers. Thus, the matrix is not memory bound (neither latency nor bandwidth). If additionally $P_{IMB} > P_{base}$, the execution is imbalanced and the threads causing the imbalance are compute-limited. This condition holds for matrices that have the majority of nonzero elements concentrated in a few dense rows, e.g., ASIC_680k, rajat30, degme, FullChip and circuit5M. Note that these matrices improve with vectorization in Fig. 1. If, instead, the execution is balanced ($P_{IMB} \approx P_{base}$), the matrix may be suffering from loop overheads in the presence of very short rows, e.g., webbase-1M, flickr, circuit5M.

  • $P_{base} > P_{MB}$: Based on a theoretical analysis similar to the previous one, we conclude that this condition may occur when SpMV’s working set fits in the system’s cache hierarchy and the system’s bandwidth vastly improves for such working sets. This case does not appear in our matrix suite on KNC; it does, however, on other platforms.

Our analysis clearly exposes a diversity in performance bottlenecks on Intel Xeon Phi. Many matrices suffer from thread imbalance and excessive cache misses, while a subset of those seem to have competing bottlenecks, e.g., parabolic_fem, offshore, webbase-1M, etc. This diversity justifies an SpMV optimizer that is matrix-adaptive. Towards this direction, we design a rule-based classification algorithm based on the previous observations. Since we collect data for the performance bounds during an online profiling phase, we will henceforth refer to this classifier as the profile-guided classifier.

The algorithm is depicted in Fig. 4. The classification is performed by comparing per-class performance bounds and by evaluating their impact on the baseline kernel. The values of the two threshold parameters $\theta_1$ and $\theta_2$, which form the hyperparameters of our classifier, have been tuned using grid search, which simply performs an exhaustive search through the specified hyperparameter space for a combination of values that maximizes some performance metric. We choose to maximize the average performance gain of the corresponding optimizations on a large set of matrices (see Section III-E). Notice that it is possible for a matrix not to be classified. This reflects cases for which it is not worth applying any of the optimizations in our pool. Also note that this classifier implicitly deduces SpMV performance characteristics on the underlying computing platform through hyperparameter tuning and profiling, making it architecture-adaptive.

1:procedure Classify($P_{base}$, $P_{MB}$, $P_{ML}$, $P_{IMB}$, $P_{CMP}$, $\theta_1$, $\theta_2$)
2:      classes ← ∅
3:      if $P_{ML} > (1 + \theta_1)\cdot P_{base}$ then
4:            classes ← classes ∪ {ML}
5:      end if
6:      if $P_{IMB} > (1 + \theta_1)\cdot P_{base}$ then
7:            classes ← classes ∪ {IMB}
8:      end if
9:      if $P_{base} \geq \theta_2\cdot P_{MB}$ and classes = ∅ then
10:            classes ← classes ∪ {MB}
11:      end if
12:      if $P_{CMP} \approx P_{base}$ or $P_{base} > P_{MB}$ then
13:            classes ← classes ∪ {CMP}
14:      end if
15:      return classes
16:end procedure
Fig. 4: Profile-guided classifier. The threshold parameters $\theta_1$ and $\theta_2$ were optimized through exhaustive grid search.

III-D Feature-Guided Classifier

We go one step further and design a classifier that relies on structural features of the sparse matrix to perform the classification. The intuition behind this approach is that one can devise features that represent comprehensive characteristics of a sparse matrix that are closely related to SpMV performance. For example, the number of nonzero elements per row can reveal workload imbalance during SpMV execution. In particular, if a matrix contains very uneven row lengths and a row-partitioning scheme is being employed for workload distribution, then threads that are assigned long rows will take longer to execute, resulting in thread imbalance. We leverage supervised learning techniques to build this classifier. The advantage of this approach is that, given a set of labeled matrices, the classification rules can be automatically deduced. We will henceforth refer to this classifier as the feature-guided classifier.

We experiment with a Decision Tree classifier and adjust it to perform multilabel classification in order to detect all bottlenecks. We also add a dummy class to reflect matrices not worth optimizing with any of our optimizations (see Section III-E). Learning is performed using an optimized version of the CART algorithm, which has a runtime cost of $O(F \cdot S \log S)$ for the construction of the tree, where $F$ is the number of features used and $S$ is the number of samples used to build the classifier. The query time for this classifier is $O(\log S)$. We generate the classifier using the scikit-learn machine learning toolkit [22].

III-D1 Feature Extraction

This classifier uses real-valued features to perform the classification. Table I shows all the features we experimented with, along with the time complexity of their extraction from the input matrix. $N$ denotes the number of rows in the matrix and $\mathrm{nnz}$ the number of nonzero elements. We define $\mathrm{nnz}_i$ as the number of nonzero elements of row $i$, $\mathrm{bw}_i$ as the column distance between the first and last nonzero element of row $i$, $\mathrm{ngroups}_i$ as the number of groups formed by consecutive elements in row $i$ and, finally, $\mathrm{nmiss}_i$ as the number of nonzero elements in row $i$ that can generate cache misses. We naively say that an element will generate a cache miss when its distance from the previous element in the same row exceeds the number of elements that fit in a cache line of the system.

TABLE I: Sparse matrix features used for classification, listing each feature’s definition and the time complexity of its extraction. Besides statistics over the per-row quantities defined above, the features include a binary indicator of whether the SpMV working set exceeds (0) or fits in (1) the LLC.
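A single O(nnz) pass suffices to gather the per-row quantities defined above. The sketch below assumes the illustrative csr_t type from the Section II sketch and ascending column indices within each row; array names are illustrative:

```c
/* Number of 8-byte elements per 64-byte cache line. */
#define CACHE_LINE_ELEMS 8

/* One pass over the CSR structure computing, per row: nonzero count,
 * column span, number of groups of consecutive nonzeros, and nonzeros
 * whose gap from the previous one exceeds a cache line (potential misses). */
void extract_row_features(const csr_t *A,
                          int *nnz_row, int *span_row,
                          int *ngroups_row, int *nmiss_row)
{
    for (int i = 0; i < A->N; i++) {
        int start = A->rowptr[i], end = A->rowptr[i + 1];
        nnz_row[i]     = end - start;
        span_row[i]    = (end > start)
                       ? A->colind[end - 1] - A->colind[start] : 0;
        ngroups_row[i] = (end > start) ? 1 : 0;
        nmiss_row[i]   = 0;
        for (int j = start + 1; j < end; j++) {
            int gap = A->colind[j] - A->colind[j - 1];
            if (gap > 1)                ngroups_row[i]++;  /* new group      */
            if (gap > CACHE_LINE_ELEMS) nmiss_row[i]++;    /* potential miss */
        }
    }
}
```

Matrix-level features (e.g., averages, maxima or dispersion of these per-row values) are then cheap reductions over the resulting arrays.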

III-D2 Training Data Selection

We train our classifier with a data set consisting of sparse matrices from the University of Florida Sparse Matrix Collection [21]. We have selected a matrix suite consisting of 210 matrices from a wide variety of application domains, to avoid being biased towards a specific sparsity pattern.

III-D3 Training Data Labeling

Since the classes of a matrix cannot be determined in a straightforward manner, we use our profile-guided classifier for this purpose. An issue that arises by this choice is that the validity of the labels and, consequently, the accuracy of the trained classifier, depends on the behavior of the profile-guided classifier.

III-E Optimization Pool

Table II gives a mapping of classes to optimizations. We have incorporated optimizations that require minimal preprocessing time, further contributing to our objective for a lightweight scheme. Note that in case multiple bottlenecks are detected, we jointly apply the corresponding optimizations.

For matrices that are memory bandwidth limited, we employ a rather simple compression scheme, just for illustrative purposes. We use delta indexing on the column indices of the nonzero elements of the matrix, a technique which was originally applied to SpMV by Pooch and Nieder [23]. We use 8- or 16-bit deltas wherever possible, but never both, in order to limit the branching overhead during SpMV computation. We employ vectorization as well, which helps attain a higher bandwidth utilization value on modern processors with wide SIMD units.
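The sketch below illustrates the idea under simplifying assumptions (a single delta width for the whole matrix, and the first nonzero of each row keeping its absolute 32-bit column index in a separate array); it is not our exact encoding, and it reuses the illustrative csr_t type from Section II:

```c
#include <stdint.h>

/* Pick a single delta width (8, 16 or 32 bits) by scanning intra-row gaps. */
int choose_delta_width(const csr_t *A)
{
    int max_gap = 0;
    for (int i = 0; i < A->N; i++)
        for (int j = A->rowptr[i] + 1; j < A->rowptr[i + 1]; j++) {
            int gap = A->colind[j] - A->colind[j - 1];
            if (gap > max_gap) max_gap = gap;
        }
    if (max_gap <= UINT8_MAX)  return 8;
    if (max_gap <= UINT16_MAX) return 16;
    return 32;
}

/* Encode intra-row gaps as 8-bit deltas (the 16-bit case is analogous);
 * delta[j] is meaningful only when j is not the first nonzero of its row. */
void encode_deltas_u8(const csr_t *A, uint8_t *delta)
{
    for (int i = 0; i < A->N; i++)
        for (int j = A->rowptr[i] + 1; j < A->rowptr[i + 1]; j++)
            delta[j] = (uint8_t)(A->colind[j] - A->colind[j - 1]);
}
```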

For the ML class of matrices, we employ software prefetching, since the source of excessive cache misses in SpMV is a result of indirect memory addressing to the right-hand side vector, which cannot be efficiently tackled by current hardware prefetching mechanisms. A single prefetch instruction was inserted in the inner loop of SpMV, with a fixed prefetch distance equal to the number of elements that fit in a single cache line of the hardware platform. Data are prefetched into the L1 cache.
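A minimal sketch of this variant, again assuming the illustrative csr_t type from Section II, a 64-byte cache line, and a colind array padded with PF_DIST extra entries so that the lookahead never reads out of bounds:

```c
#include <xmmintrin.h>   /* _mm_prefetch */

#define PF_DIST 8        /* lookahead in nonzeros: one cache line of doubles */

void spmv_csr_prefetch(const csr_t *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->N; i++) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++) {
            /* Hint the x element needed PF_DIST iterations ahead into L1. */
            _mm_prefetch((const char *)&x[A->colind[j + PF_DIST]], _MM_HINT_T0);
            sum += A->values[j] * x[A->colind[j]];
        }
        y[i] = sum;
    }
}
```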

For matrices that suffer from thread imbalance, we include multiple optimizations, as thread imbalance can be a result of either workload imbalance (number of nonzero elements per thread) or computational unevenness. We select the final optimization based on structural features of the matrix. The first subcategory includes matrices with highly uneven row lengths, which we detect by comparing row-length features from Table I. For these matrices, we basically decompose the matrix into two parts stored in a modified CSR format, depicted in Fig. 5. SpMV is performed in two steps, described in Fig. 6. First, SpMV is performed as usual by skipping the long rows, and then the long rows are computed with a different assignment of computation to threads. Every long row is computed by all threads and a reduction of partial results follows. For the second subcategory, which we detect with a dedicated feature from Table I, we employ the auto scheduling policy available in the OpenMP runtime system [24]. When the auto schedule is specified, the decision regarding scheduling is delegated to the compiler, which has the freedom to choose any possible mapping of iterations (rows in this case) to threads.
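For the second subcategory the change is confined to the scheduling clause of the row loop; a minimal sketch, again using the illustrative csr_t type from Section II:

```c
void spmv_csr_auto(const csr_t *A, const double *x, double *y)
{
    /* schedule(auto) delegates the mapping of rows to threads
     * to the compiler/runtime, as described in the text. */
    #pragma omp parallel for schedule(auto)
    for (int i = 0; i < A->N; i++) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            sum += A->values[j] * x[A->colind[j]];
        y[i] = sum;
    }
}
```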

Finally, for matrices that present computational bottlenecks, we combine inner loop unrolling with vectorization.
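A minimal sketch of such a kernel, relying on compiler-driven vectorization and unrolling rather than explicit intrinsics (our actual implementation may use intrinsics or compiler-specific unroll pragmas); it again assumes the illustrative csr_t type from Section II:

```c
void spmv_csr_simd(const csr_t *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->N; i++) {
        double sum = 0.0;
        /* Ask the compiler to vectorize (and unroll) the gather-multiply-add. */
        #pragma omp simd reduction(+:sum)
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            sum += A->values[j] * x[A->colind[j]];
        y[i] = sum;
    }
}
```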

Class Optimization
MB column index compression through delta encoding [23] + vectorization
ML software prefetching on vector x
IMB matrix decomposition, auto scheduling (OpenMP)
CMP inner loop unrolling + vectorization
TABLE II: Mapping of matrix classes to optimizations
Fig. 5: Matrix decomposition optimization for matrices with very long rows.
1:procedure Spmv(A::in, x::in, y::out)
2:      for i ← 0 to N − 1 do ▷ short rows; long rows are skipped
3:            for j ← rowptr[i] to rowptr[i+1] − 1 do
4:                  y[i] ← y[i] + values[j] · x[colind[j]]
5:            end for
6:      end for
7:      for k ← 0 to L − 1 do ▷ L: number of long rows
8:            i ← long_rows[k]
9:            for each thread t in parallel do
10:                  sum[t] ← 0
11:                  for j in thread t’s chunk of row i do
12:                        sum[t] ← sum[t] + values_long[j] · x[colind_long[j]]
13:                  end for
14:            end for
15:            y[i] ← reduction of sum[0 .. nthreads−1]
16:      end for
17:end procedure
Fig. 6: Decomposed SpMV implementation for matrices with highly uneven row lengths.

IV Experimental Evaluation

Our experimental evaluation focuses on three aspects: matrix classification accuracy, the performance benefit of applying our methodology for optimizing SpMV, as well as the runtime cost of the optimization workflow.

IV-A Experimental Setup and Methodology

Our experiments were performed on a first generation Intel Xeon Phi co-processor (KNC), a second generation Intel Xeon Phi bootable processor (KNL) configured in the Flat mode, in which the high-bandwidth MCDRAM memory (HBM) is directly addressable, and an Intel Broadwell processor (Broadwell). On KNL, we allocate the entire application on HBM using the numactl tool. Table III lists the technical specifications of each platform. We disabled the turbo boost technology to avoid fluctuations in the measurements.

We used the OpenMP parallel programming interface and compiled our software using Intel ICC-17.0.1 and the Intel OpenMP library. The benchmarks used for the profile-guided classifier as well as the baseline SpMV implementation use the CSR format and are compiled with the -O3 -qopt-prefetch=0 -qopt-threads-per-core=4 flags. To simulate the typical SpMV computation in scientific applications, we used double precision floating-point values for the nonzero elements. Unless stated otherwise, the baseline and optimized implementations employ a static one-dimensional row partitioning scheme, where each partition has approximately equal number of nonzero elements and is assigned to a single thread. For comparison purposes we used Intel MKL 2017.0. Finally, we enforced the thread affinity policy by setting OMP_PLACES=threads and OMP_PROC_BIND=close using all the cores in each system.

Whenever we report performance rates for SpMV, e.g. Gflop/s, they have been summarized over 5 benchmark runs using the harmonic mean. The rate of each individual run is the rate of the arithmetic means of the absolute counts (floating-point operations and seconds) of 128 SpMV operations, that is, we perform warm cache measurements.
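In other words, taking $2\,\mathrm{nnz}$ flops per SpMV invocation and letting $t_{r,k}$ be the time of the $k$-th SpMV in run $r$, the reported rate is:

```latex
P_r = \frac{128 \cdot 2\,\mathrm{nnz}}{\sum_{k=1}^{128} t_{r,k}},
\qquad
P_{\mathrm{reported}} = \frac{5}{\sum_{r=1}^{5} 1/P_{r}} .
```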

Codename Knights Corner (KNC) Knights Landing (KNL) Broadwell
Model Intel Xeon Phi 3120P Intel Xeon Phi 7250 Intel Xeon E5-2699 v4
Microarchitecture Intel Many Integrated Core Intel Many Integrated Core Intel Broadwell
Clock frequency 1.10 GHz 1.40 GHz 2.20 GHz
L1 cache (D/I) 32/32 KB 32/32 KB 32/32 KB
L2 cache 30 MB 34 MB 256 KB (per core)
L3 cache - - 55 MB
Cores/Threads 57/228 68/272 22/44
Sockets 1 1 1
STREAM triad main/llc [25] (GB/s) 128/140 395/570 60/200
TABLE III: Technical characteristics of the experimental platforms

IV-B Feature-Guided Classifier Accuracy

First, we evaluate our feature-guided classifier in terms of accuracy, assuming the labels generated by the profile-guided classifier are correct. We estimate how accurately our models perform using Leave-One-Out cross validation. According to this methodology, for a training set of $n$ matrices ($n = 210$ here), $n$ experiments are performed. For each experiment, $n - 1$ matrices are used for training and one for testing. The computed performance metric per experiment is the Exact Match Ratio, which is defined as the percentage of samples for which the predicted set of classes is fully correct, i.e., it exactly matches the set of labels produced by our profile-guided classifier. However, since we apply at least one optimization per matrix we can tolerate predictions that are partially correct; thus, we also define a performance metric, denoted by Partial Match Ratio, that considers a prediction to be correct if it contains at least one correct class. The final accuracy score reported by cross validation is the average of the values computed in the loop of experiments.
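Formally, if $Y_i$ is the label set produced by the profile-guided classifier for matrix $i$ and $\hat{Y}_i$ the predicted set, the two metrics over $n$ test matrices are:

```latex
\mathrm{Exact} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[\hat{Y}_i = Y_i\right],
\qquad
\mathrm{Partial} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[\hat{Y}_i \cap Y_i \neq \emptyset\right].
```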

Table IV reports the most accurate feature-guided classifiers on KNC. Due to space limitations, we do not present the classifiers of the other platforms; however, they employ a similar set of features and achieve similar accuracies. We report classifiers with increasing time complexity in the feature extraction phase along with their achieved accuracy score based on both the Exact and Partial Match Ratio performance metrics. The selection of features for the classifiers has been a result of exhaustive search.

Features Complexity Accuracy Exact (%) Accuracy Partial (%)
feature set 1 O(N) 80 95
feature set 2 O(nnz) 84 100
TABLE IV: Feature-guided Decision Tree classifiers on KNC

IV-C Performance of Selected Optimizations

Contrary to standard classification problems, our key measure of success is not classifier accuracy—rather, we aim to maximize the average performance improvement of our optimizer over the baseline implementation, as well as the CSR mkl_dcsrmv() kernel available in the Intel MKL library. In this way, we designate the value of optimization tuning for SpMV. We also evaluate the recently introduced Inspector-Executor CSR mkl_sparse_d_mv() kernel of Intel MKL.

Fig. 7 presents SpMV performance achieved by MKL CSR, MKL Inspector-Executor CSR, our baseline CSR implementation, our profiling- and feature-guided classifiers, as well as the oracle—the perfect optimizer that always selects the best optimization available—on all experimental platforms for a representative set of matrices. The classes of each matrix according to the profile-guided classifier are also reported. We note that we used the feature-guided classifier with the highest accuracy reported in Table IV.

(a) KNC
(b) KNL
(c) Broadwell
Fig. 7: SpMV performance landscape on each experimental platform. MKL refers to the CSR SpMV kernel available in the Intel MKL library, MKL Inspector-Executor refers to the autotuned CSR SpMV kernel available in the Intel MKL library, baseline refers to our baseline CSR SpMV (see text in Section IV-A), feat to the feature-guided optimizer, prof to the profile-guided optimizer, and oracle to the perfect optimizer. Note that MKL Inspector-Executor is not available on KNC.

On KNC and KNL, our initial observation concerns the diversity of performance bottlenecks detected by our profile-guided classifier. This was expected based on our bound and bottleneck analysis in Section III-B, and is mainly due to the architectural characteristics of the Xeon Phi processors. Both KNC and KNL contain a large number of cores, a fact that favors workload imbalance, and a very expensive (an order of magnitude higher compared to multi-cores) cache miss latency, which further exposes irregularity in the accesses to the right-hand side vector. Thus, there are many matrices that fall out of the standard MB class. Our classifiers manage to successfully capture this trend and select proper optimizations due to their matrix- and architecture-awareness. Notice that the largest speedups for our classifiers are achieved by combining optimizations from different classes.

Specifically on KNC, rajat30 and flickr are the only matrices where our classifiers do not detect the optimal optimization. In case of rajat30, which is characterized by a high concentration of nonzero elements in a few dense rows, our classifiers do not recognize it as a candidate of the ML class and, consequently, do not apply software prefetching which offers the additional performance boost. By exploring further, we discovered that the benchmark that exposes irregularity for the profile-guided classifier (see Section III-B) can actually detect the irregularity in this matrix by looking at it in partitions, instead of looking at it as a whole. We intend to extend our classification approach to incorporate this idea in future work. In case of flickr, the combination of auto scheduling, prefetching and vectorization did not pay off, with the best performance being achieved by simply applying prefetching. In total, the profile- and feature-guided classifiers both achieve impressive average speedups over MKL CSR.

On KNL, the profile- and feature-guided classifiers achieve significant average speedups over MKL CSR. The MKL Inspector-Executor also significantly improves over MKL CSR on average, but our classifiers manage to further improve performance in most cases. The largest speedups over the Inspector-Executor occur for matrices with imbalanced execution, that is either due to highly uneven row lengths ({IMB, CMP} set of classes), e.g., ASIC_680k, rajat30, degme, or computational unevenness ({ML, IMB} set of classes), e.g., parabolic_fem, thermal2. Also notice how some matrices present different or additional bottlenecks compared to KNC. For instance, human_gene1 is latency bound (ML class) on KNC, while on KNL it is bandwidth bound (MB class).

Finally, on Broadwell the profile- and feature-guided classifiers achieve significant average speedups over MKL CSR. Speedups occur mostly for matrices that entirely or partially fit in the system’s cache hierarchy and include the ML or IMB classes. The MKL Inspector-Executor also achieves an average speedup over MKL CSR, but our classifiers manage to further improve performance in most cases.

IV-D Runtime Overhead

The work-flow of the proposed optimization process comprises two basic steps: bottleneck detection through classification and generation of optimized code based on the classification results. Concerning the first step, our profile-guided classifier runs a number of micro-benchmarks on the input matrix and applies the empirically-tuned classification algorithm (see Section III-C), while the feature-guided classifier extracts matrix properties and applies a pre-trained classifier (see Section III-D). Runtime code generation is performed Just-In-Time (JIT). Depending on the selected optimizations, e.g. compression, an intermediate preprocessing step may be required.

In order to provide a meaningful illustration of how lightweight the proposed optimizers are, we are going to evaluate them in the context of iterative methods for the solution of large sparse linear systems, e.g., variations of the Conjugate Gradient (CG) and the Generalized Minimal Residual (GMRES) methods. Such solvers repeatedly call SpMV and usually require hundreds to thousands of iterations to converge, making it possible to amortize the overhead of the optimization process, e.g., optimization selection, format conversion, runtime code generation, parameter tuning etc. However, it is quite common in real-life applications to run preconditioned versions of these methods to accelerate convergence. In this case, the number of iterations may be significantly smaller, ranging from dozens to hundreds, thus, limiting the online overhead that can be tolerated [13].

In Table V we present the minimum number of iterations required for our optimizers, as well as the MKL CSR Inspector-Executor and two trivial optimizers—one that runs all single optimizations (total of 5 in our case) and one that also includes combinations of 2 (total of 15 in our case) and selects the best for each matrix—to be beneficial over CSR MKL in the context of an iterative solver. Specifically, if $t_{total}$ and $t'_{total}$ are the total execution times of the solver using the baseline and an optimized SpMV implementation respectively, and $t_{ovh}$ the optimizer overhead, then the number of solver iterations required to amortize that cost can be estimated as follows:

$t_{total} = k\,(t_{spmv} + t_{other})$, $\qquad t'_{total} = t_{ovh} + k\,(t'_{spmv} + t_{other})$,

where $k$ is the number of iterations required for the solver to converge, and $t_{spmv}$ and $t'_{spmv}$ the execution time of SpMV before and after optimization. For simplicity, we have assumed that the performance of other computations ($t_{other}$) in the solver is not affected by changes in the SpMV implementation. By setting $t'_{total} \leq t_{total}$ we get:

$k \geq \dfrac{t_{ovh}}{t_{spmv} - t'_{spmv}}$.

In Table V, we report the worst and best observed cases among our representative matrix suite, as well as the average over all matrices. We run 64 SpMV iterations to get valid timing measurements. Note that the trivial optimizers incur the preprocessing overheads of every optimization, while our optimizers only incur the overheads of the selected optimizations based on their predictions. As expected, both optimizers that leverage the profiling- and feature-guided classifiers have a significant advantage over the trivial optimizers. The same holds for the MKL Inspector-Executor. However, our feature-guided optimizer requires by far the smallest number of iterations to provide a speedup over CSR MKL, making it the most lightweight approach.

Optimizer Best case Average Worst case
trivial-single 455 910 8016
trivial-combined 1992 3782 37111
profile-guided 145 267 3145
feature-guided 27 60 567
MKL Inspector-Executor 28 336 1229
TABLE V: Minimum number of solver iterations required to amortize the autotuning runtime overhead of different optimizers on KNL

V Related Work

Different sparse matrices have different sparsity patterns, and different architectures have different strengths and weaknesses. In order to achieve the best SpMV performance for the target sparse matrix on the target platform, an autotuning approach has long been considered to be beneficial. The first autotuning approaches attempted to tune parameters of specific sparse matrix storage formats. Towards this direction, the Optimized Sparse Kernel Interface (OSKI) library [26] was developed as a collection of high performance sparse matrix operation primitives on single core processors. It relies on the SPARSITY framework [27] to tune the SpMV kernel, by applying multiple optimizations, including register blocking and cache blocking. Autotuning has also been used to find the best block and slice sizes of the input sparse matrix on modern CMPs and GPUs [28].

There have been some research efforts closer to our work. The clSpMV framework [29] is the first framework that analyzes the input sparse matrix at runtime and recommends the best representation of the given sparse matrix, but it is restricted to GPU platforms. In the same direction, the authors in [30] present an analytical and profiling-based performance modeling approach to predict the execution time of SpMV on GPUs using different sparse matrix storage formats, in order to select the most efficient format for the target matrix. For each format under consideration, they establish a relationship between the number of nonzero elements per row in the matrix and the execution time of SpMV using that format, thus encapsulating to some degree the structure of the matrix in their methodology. Similarly, in [31], the authors propose a probabilistic model to estimate the execution time of SpMV on GPUs for different sparse matrix formats. They define a probability mass function to analyze the sparsity pattern of the target matrix and use it to estimate the compression efficiency of every format they examine. Combined with the hardware parameters of the GPU, they predict the performance of SpMV for every format. Since compression efficiency is the determinant factor in this approach, it is mainly targeted at memory bandwidth bound matrices. Closer to our approach is the SMAT autotuning framework [32]. This framework selects the most efficient format for the target matrix using feature parameters of the sparse matrix. It treats the format selection process as a classification problem, with each format under consideration representing a class, and leverages a data mining approach to generate a decision tree to perform the classification, based on the extracted feature parameters of the matrix. The distinguishing advantage of our optimization selection methodology over the aforementioned approaches is that it decouples the decision making from specific optimizations, by predicting the major performance bottleneck of SpMV instead of SpMV execution time using a specific optimization. Thus, in contrast to the above frameworks, where incorporating a new optimization requires either retraining a model or defining a new one, our decision-making approach allows an autotuning framework to be easily extended, simply by assigning the new optimization to one of the classes.

VI Conclusions

In this paper, we show how a matrix- and architecture-adaptive optimizer can provide significant speedups for SpMV. The proposed optimizer treats SpMV optimization as a classification problem. It classifies the input matrix based on its performance bottlenecks and applies suitable optimizations to tackle them. We describe two classifiers: a) a profile-guided classifier that relies on online profiling to perform the decision making, and b) a classifier trained offline with machine learning techniques that only extracts structural properties of the matrix on-the-fly. All aspects of the optimization process focus on minimizing online overheads in order to be applicable in real-life scenarios. Experimental evaluation on a representative matrix suite on one multi- and two many-core platforms shows that our methodology is very promising, achieving significant speedups over the Intel MKL library with an online overhead that can be easily amortized.

Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work was supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159. The opinions expressed and arguments employed herein do not necessarily reflect the official views of these funding bodies.

References

  • [1] S. Toledo, “Improving the memory-system performance of sparse-matrix vector multiplication,” IBM Journal of research and development, vol. 41, no. 6, pp. 711–725, 1997.
  • [2] A. Pinar and M. T. Heath, “Improving performance of sparse matrix-vector multiplication,” in Proceedings of the 1999 ACM/IEEE conference on Supercomputing.   ACM, 1999, p. 30.
  • [3] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, and F. F. Rivera, “Improving the locality of the sparse matrix-vector product on shared memory multiprocessors,” in Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on.   IEEE, 2004, pp. 66–71.
  • [4] R. W. Vuduc and H.-J. Moon, “Fast sparse matrix-vector multiplication by exploiting variable block structure,” in High Performance Computing and Communications.   Springer, 2005, pp. 807–816.
  • [5] J. Willcock and A. Lumsdaine, “Accelerating sparse matrix computations via data compression,” in Proceedings of the 20th annual international conference on Supercomputing.   ACM, 2006, pp. 307–316.
  • [6] K. Kourtis, G. Goumas, and N. Koziris, “Optimizing sparse matrix-vector multiplication using index and value compression,” in Proceedings of the 5th conference on Computing frontiers.   ACM, 2008, pp. 87–96.
  • [7] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix–vector multiplication on emerging multicore platforms,” Parallel Computing, vol. 35, no. 3, pp. 178–194, 2009.
  • [8] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures.   ACM, 2009, pp. 233–244.
  • [9] M. Belgin, G. Back, and C. J. Ribbens, “Pattern-based sparse matrix representation for memory-efficient smvm kernels,” in Proceedings of the 23rd international conference on Supercomputing.   ACM, 2009, pp. 100–109.
  • [10] K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris, “Csx: an extended compression format for spmv on shared memory systems,” in ACM SIGPLAN Notices, vol. 46, no. 8.   ACM, 2011, pp. 247–256.
  • [11] A. Buluç, S. Williams, L. Oliker, and J. Demmel, “Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication,” in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International.   IEEE, 2011, pp. 721–733.
  • [12] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, “A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide simd units,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
  • [13] R. Li and Y. Saad, “Gpu-accelerated preconditioned iterative linear solvers,” The Journal of Supercomputing, vol. 63, no. 2, pp. 443–466, 2013.
  • [14] J. L. Greathouse and M. Daga, “Efficient sparse matrix-vector multiplication on gpus using the csr storage format,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14.   Piscataway, NJ, USA: IEEE Press, 2014, pp. 769–780. [Online]. Available: https://doi.org/10.1109/SC.2014.68
  • [15] W. Liu and B. Vinter, “Csr5: An efficient storage format for cross-platform sparse matrix-vector multiplication,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ser. ICS ’15.   New York, NY, USA: ACM, 2015, pp. 339–350. [Online]. Available: http://doi.acm.org/10.1145/2751205.2751209
  • [16] Y. Saad, Numerical methods for large eigenvalue problems.   SIAM, 1992, vol. 158.
  • [17] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
  • [18] J. Mellor-Crummey and J. Garvin, “Optimizing sparse matrix–vector product computations using unroll and jam,” International Journal of High Performance Computing Applications, vol. 18, no. 2, pp. 225–236, 2004.
  • [19] W. T. Tang, R. Zhao, M. Lu, Y. Liang, H. P. Huynh, X. Li, and R. S. M. Goh, “Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on intel xeon phi,” in Code Generation and Optimization (CGO), 2015 IEEE/ACM International Symposium on.   IEEE, 2015, pp. 136–145.
  • [20] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative system performance: computer system analysis using queueing network models.   Prentice-Hall, Inc., 1984.
  • [21] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.
  • [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [23] U. W. Pooch and A. Nieder, “A survey of indexing techniques for sparse matrices,” ACM Computing Surveys (CSUR), vol. 5, no. 2, pp. 109–133, 1973.
  • [24] L. Dagum and R. Menon, “Openmp: an industry standard api for shared-memory programming,” Computational Science & Engineering, IEEE, vol. 5, no. 1, pp. 46–55, 1998.
  • [25] J. D. McCalpin, “Stream: Sustainable memory bandwidth in high performance computers,” 1995.
  • [26] R. Vuduc, J. W. Demmel, and K. A. Yelick, “Oski: A library of automatically tuned sparse matrix kernels,” in Journal of Physics: Conference Series, vol. 16, no. 1.   IOP Publishing, 2005, p. 521.
  • [27] E.-J. Im, K. Yelick, and R. Vuduc, “Sparsity: Optimization framework for sparse matrix kernels,” International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 135–158, 2004.
  • [28] J. W. Choi, A. Singh, and R. W. Vuduc, “Model-driven autotuning of sparse matrix-vector multiply on gpus,” in ACM Sigplan Notices, vol. 45, no. 5.   ACM, 2010, pp. 115–126.
  • [29] B.-Y. Su and K. Keutzer, “clspmv: A cross-platform opencl spmv framework on gpus,” in Proceedings of the 26th ACM international conference on Supercomputing.   ACM, 2012, pp. 353–364.
  • [30] P. Guo, L. Wang, and P. Chen, “A performance modeling and optimization analysis tool for sparse matrix-vector multiplication on gpus,” Parallel and Distributed Systems, IEEE Transactions on, vol. 25, no. 5, pp. 1112–1123, 2014.
  • [31] K. Li, W. Yang, and K. Li, “Performance analysis and optimization for spmv on gpu using probabilistic modeling,” Parallel and Distributed Systems, IEEE Transactions on, vol. 26, no. 1, pp. 196–205, 2015.
  • [32] J. Li, G. Tan, M. Chen, and N. Sun, “Smat: an input adaptive auto-tuner for sparse matrix-vector multiplication,” in ACM SIGPLAN Notices, vol. 48, no. 6.   ACM, 2013, pp. 117–126.