I Introduction
Sparse matrix-vector multiplication (SpMV) is a fundamental building block of iterative methods for the solution of large sparse linear systems and the approximation of eigenvalues of large sparse matrices arising in applications from the scientific computing, machine learning and graph analytics domains. It has been repeatedly reported to achieve only a small fraction of the peak performance of current computing systems due to a number of inherent performance limitations that arise from the algorithmic nature of the kernel, the employed matrix storage format and the sparsity pattern of the matrix. In particular, SpMV is characterized by a very low flop:byte ratio, since the kernel performs
O(nnz) operations on O(nnz) amount of data^1 (^1We define nnz as the number of nonzero elements and n as the number of rows/columns in the matrix. For simplicity, we assume a square matrix.), indirect memory references as a result of storing the matrix in a compressed format, and irregular memory accesses to the right-hand side vector due to sparsity. SpMV is typically a memory bandwidth bound kernel for the majority of sparse matrices on multicore platforms. Its bandwidth utilization is strongly dependent on the sparsity pattern of the matrix and the underlying computing platform. Consequently, most optimization efforts proposed in the literature over the past years have focused on reducing traffic between caches and main memory, primarily by compressing the memory footprint of the matrix [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. With the advent of manycore architectures, including GPGPUs and the Intel Xeon Phi processors, the performance landscape has become more diverse. For instance, the larger number of cores available in such architectures makes SpMV performance more sensitive to the workload distribution among threads. As the performance of SpMV is becoming increasingly dependent on both the input problem and the underlying computing platform, there is no one-size-fits-all solution to attain high performance. Thus, we advocate the significance of matrix- and architecture-aware runtime specialization. The value of cherry-picking optimizations for SpMV is further emphasized by the fact that blindly applying optimizations can actually hinder performance. As an example, Fig. 1 shows the effect of different optimizations on an Intel Xeon Phi processor. Even though each optimization achieves significant gains for some matrices, it may result in non-negligible slowdowns for others. Optimizing SpMV usually involves a preprocessing step to analyze the sparse matrix structure and may include format conversion, parameter tuning, etc.
While the overhead of this step may be amortized for applications that reuse the same matrix over many invocations of SpMV, it can outweigh any performance benefit when a smaller number of iterations is required for solver convergence, which is often the case in preconditioned solvers [13], or when the structure of the matrix changes frequently, e.g. in graph applications. This is why many recent efforts have focused on designing more lightweight optimizations [14, 15].
In this paper, we propose an adaptive SpMV optimizer that seeks to provide performance stability, i.e., improved performance for all sparse matrices, and low overhead in order to be beneficial to problems that require a small number of iterations to converge. We especially focus on designing an optimizer that is more lightweight than a trivial one that runs all optimizations and selects the best for the target application.
In order to provide performance stability, our optimizer relies on SpMV bottleneck analysis of the input matrix on the target platform. We formulate bottleneck detection as a classification problem (Section III-A) and first develop a profile-guided classifier (Section III-C) that relies on microbenchmarks to classify the matrix. As the online profiling phase has a non-negligible cost, we go one step beyond and propose a classifier that relies only on matrix properties that can be cheaply computed (Section III-D). This feature-guided classifier is pretrained during an offline stage with the use of machine learning techniques and performs feature extraction on-the-fly. The runtime overhead of this classifier is very low, making it extremely lightweight (Section IV-D).
Once the performance bottlenecks of a matrix have been detected, we apply suitable optimizations to tackle them. We employ a simple and easy-to-implement set of optimizations (Section III-E), based on the general-purpose Compressed Sparse Row (CSR) sparse matrix storage format. Our approach, however, is compatible with a plethora of other optimizations found in the literature. Again here, the idea is to resort to optimizations that require short preprocessing times, further contributing to our objective for a lightweight scheme.
We evaluate our optimizer using both classifiers on two generations of Intel’s Xeon Phi manycore processor and an Intel Broadwell processor. The performance analysis in Section III demonstrates the diversity of SpMV bottlenecks among different matrices, while our experimental results in Section IV show the diversity of SpMV bottlenecks among different computing platforms and the effectiveness of the proposed approach in distinguishing and appropriately optimizing the majority of matrices, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV implementations available in the latest Intel MKL library, which is highly optimized for x86-based platforms.
II Background
In order to avoid the extra computation and storage overheads imposed by the large majority of zero elements found in a sparse matrix, it has been the norm to store the nonzero elements contiguously in memory and employ auxiliary data structures for their proper traversal. The most widely used general-purpose sparse matrix storage format, namely the Compressed Sparse Row (CSR) format [16], uses a row pointer array to index the start of each row within the array of nonzero elements, and a column index array to store the column of each nonzero element. The SpMV kernel using this format is given in Fig. 2.
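The kernel of Fig. 2 can be sketched in a few lines; the following is a plain-Python rendition for illustration (the evaluated implementations are compiled C/OpenMP code):

```python
def csr_spmv(rowptr, colind, values, x):
    """Compute y = A*x for a matrix A stored in CSR format."""
    n_rows = len(rowptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[rowptr[i]:rowptr[i+1]].
        for j in range(rowptr[i], rowptr[i + 1]):
            acc += values[j] * x[colind[j]]  # indirect access to x
        y[i] = acc
    return y

# 2x2 example: A = [[1, 2], [0, 3]]
rowptr, colind, values = [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0]
print(csr_spmv(rowptr, colind, values, [1.0, 1.0]))  # -> [3.0, 3.0]
```

Note how `x[colind[j]]` is the irregular, indirectly addressed access discussed throughout this section.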
SpMV is characterized by a low operational intensity. It performs O(nnz) floating-point operations on O(nnz) amount of data, leading to a flop:byte ratio of less than 1. According to the Roofline performance model [17], computational kernels with low operational intensities tend to be memory bound. Furthermore, the auxiliary indexing structures required to access the nonzero elements, namely rowptr and colind in the case of CSR, introduce additional load operations, traffic for the memory subsystem, and cache interference [2]. Access to the input vector is irregular and depends on the sparsity pattern of the matrix. This fact complicates the process of exploiting any spatial or temporal reuse. Many sparse matrices contain a large number of short rows that may degrade performance due to the small trip count of the inner loop [18]. Finally, many sparse matrices are characterized by rows with highly uneven lengths or regions with different sparsity patterns, which may result in workload imbalance [19].
III Optimization Tuning Methodology
Optimization selection can be solved in various ways. For instance, one could simply take an empirical approach: measure how different optimizations work for a particular matrix on the target machine and then apply the most efficient optimization on future runs of the same matrix. This is the most accurate method; however, it incurs a substantial overhead to sweep over all candidate optimizations, some of which require significant preprocessing time, including format conversion and parameter auto-tuning. In order to reduce this overhead, one could employ a “blind” machine learning approach. By defining optimizations as classes, one could train a classifier with machine learning techniques to select a proper optimization. However, in this case, the classifier would need to be rebuilt whenever an optimization is added or replaced and, furthermore, no intuition would be provided as to when and why an optimization works.
We propose a solution that is lightweight, intuitive and enables a modular framework design. We formulate optimization selection as a multi-class, multi-label classification problem, where classes represent performance bottlenecks. For every class, we implement an optimization that is designed to address the corresponding bottleneck. Given an input matrix, its bottlenecks are identified through classification and the corresponding optimizations are jointly applied. Since most optimizations proposed in the literature are designed with a specific bottleneck in mind, by detecting bottlenecks instead of optimizations, our classifier explains why an optimization actually improved performance for a particular matrix. Furthermore, by decoupling bottleneck identification from the application of optimizations, one can build a classifier once and optimizations can be henceforth added or replaced in a plug-and-play fashion.
III-A Formulation as a Classification Problem
Based on the analysis presented in Section II and the extensive research that has been realized over the past few decades in SpMV performance analysis and optimization, we define the following classes, each representing a potential SpMV performance bottleneck:

- MB: This class includes matrices that achieve memory bandwidth utilization values close to the peak (which may differ depending on the working set size) and are, therefore, Memory Bandwidth bound. Such matrices usually exhibit a regular sparsity structure.
- ML: This class refers to matrices that are Memory Latency bound. Such matrices exhibit poor locality in the accesses to the right-hand side vector due to a highly irregular sparsity pattern, which cannot be detected by hardware prefetching mechanisms available in current architectures.
- IMB: This class includes matrices with highly uneven row lengths or regions with completely different sparsity patterns that exhibit thread IMBalance due, respectively, to an uneven workload distribution or computational unevenness.
- CMP: This class includes matrices that present CoMPutational bottlenecks. Such matrices either fit in the system’s cache hierarchy and have an operational intensity closer to the corresponding dense kernel, pushing them closer to the ridge point of the Roofline model [17], or have the majority of nonzero elements concentrated in a few dense rows, e.g. power-law matrices.
The above classes are quite generic and cover a wide variety of matrices with different structural characteristics.
III-B Per-Class Performance Bounds
In this section, we perform a thorough bound and bottleneck analysis for the SpMV kernel. Inspired by the “bounding techniques” described in [20], for every class defined in Section III-A, we derive an upper bound on performance that represents the maximum performance that can be achieved by completely eliminating the corresponding bottleneck. The idea here is that by comparing the actual performance with the per-class upper bounds, we are able to estimate which bottlenecks are important to address and what the potential performance gain is. In the following analysis, P stands for performance, T for memory traffic, B for bandwidth, S for size and t for execution time.
- MB: For matrices of this class the available memory bandwidth is saturated. Thus, an upper performance bound can be derived by assuming maximum sustainable main memory bandwidth as

P_MB = 2·nnz / t_MB, with t_MB = (T_f^min + T_v^min) / B_max,

where B_max is the maximum sustainable memory bandwidth of the system^2 (^2We adjust the bandwidth upwards for matrices that fit in the system’s cache hierarchy.), while T_f^min and T_v^min are the minimum memory traffic that can be generated by the matrix stored in format f and by the vectors, respectively. Note that compulsory misses set the minimum memory traffic and, hence, T_f^min = S_f and T_v^min = S_v, the corresponding data sizes.
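As a concrete, hypothetical instance of this bound for CSR (assuming 4-byte indices, 8-byte values and 2·nnz flops, as in the analysis above; the function name and example figures are illustrative only):

```python
def mb_bound_gflops(nnz, n, b_max_gbs, idx_bytes=4, val_bytes=8):
    """Bandwidth-bound upper estimate for CSR SpMV performance.

    Compulsory traffic: values + column indices + row pointer for
    the matrix, plus the two vectors (read x, write y).
    """
    t_matrix = nnz * (val_bytes + idx_bytes) + (n + 1) * idx_bytes
    t_vectors = 2 * n * val_bytes
    flops = 2 * nnz                      # one multiply + one add per nonzero
    seconds = (t_matrix + t_vectors) / (b_max_gbs * 1e9)
    return flops / seconds / 1e9         # Gflop/s

# e.g. a matrix with 10M nonzeros and 1M rows on a 100 GB/s machine
print(round(mb_bound_gflops(10_000_000, 1_000_000, 100.0), 2))
```

The bound approaches 2/(val_bytes + idx_bytes) flops per byte times B_max as nnz grows, which is why index compression (Section III-E) raises it.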

- ML: For this class, we can derive an upper bound P_ML by running a modified SpMV kernel where irregular accesses to the right-hand side vector are converted to regular accesses. In the case of the CSR format, we can achieve this by setting all entries of the colind array to the row index of the corresponding element.
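The colind rewrite described above can be sketched as follows; timing the unmodified kernel on the rewritten arrays yields the latency-free estimate (this is a timing probe, not a correct SpMV, since the numerical result changes):

```python
def regularize_colind(rowptr, colind):
    """Return a copy of colind where every entry of row i is set to i,
    so every access to the right-hand side vector becomes regular.
    The kernel keeps the same instruction mix but loses the
    irregular, cache-missing loads."""
    reg = list(colind)
    for i in range(len(rowptr) - 1):
        for j in range(rowptr[i], rowptr[i + 1]):
            reg[j] = i
    return reg
```

Running the CSR kernel on `(rowptr, regularize_colind(rowptr, colind), values)` and timing it gives the P_ML measurement.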

- IMB: An upper bound for this class can be estimated using the median execution time among all threads as

P_IMB = 2·nnz / median_i(t_i),

where t_i is the execution time of thread i. We use the median instead of the mean, as we want reduced importance to be attached to outliers.

- CMP: For matrices of this class, we can define a very loose upper bound P_CMP by running a modified SpMV kernel where we completely eliminate indirect memory references, resulting in unit-stride accesses only. In the case of the CSR format, we no longer use colind to index vector x, but always access x[i], where i is the current row index.
A generic upper bound on SpMV performance, independent of the format used to store the sparse matrix, can be defined as follows:

P_gen = 2·nnz / t_gen, with t_gen = (T_f^min + T_v^min) / B_max,

where T_v^min and B_max are defined as before. In order to estimate T_f^min we assume we can only compress the indexing information of a sparse matrix (not the values). In this case, a minimum can be determined by completely eliminating the indexing structures, i.e., T_f^min = S_values.
We apply the bound and bottleneck analysis on an Intel Xeon Phi coprocessor using the CSR format as our baseline SpMV implementation. You may refer to Section IV-A for details on the experimental setup. The performance of the baseline kernel is denoted by P_base. Among the previously defined performance bounds, only P_ML and P_CMP require a microbenchmark to be run on-the-fly. P_MB and P_gen only require B_max to be measured on the target system, while P_IMB can be deduced from the baseline run. Fig. 3 shows the performance of the baseline along with the per-class and generic upper bounds for a representative set of matrices from the University of Florida Sparse Matrix Collection [21]. The relation of each per-class upper bound to the baseline differs among matrices, indicating a potential diversity in the bottlenecks of each matrix.
III-C Profile-Guided Classifier
By comparing the per-class upper bounds to the baseline performance presented in the previous section, we are able to derive heuristics that can be used to determine the class(es) of each matrix. Examining Fig. 3 in more detail, we distinguish the following cases:
- P_ML ≈ P_base (consph, rajat30, circuit5M, etc.): Performance did not improve by eliminating irregular memory accesses, hence SpMV does not suffer from excessive cache misses. Note that these matrices incurred a slowdown with software prefetching in Fig. 1.
- P_IMB ≈ P_base (citationCiteseer, webGoogle, nd24k, etc.): This implies a balanced execution time among all threads.
- P_base ≈ P_MB (consph, nd24k, boneS10, etc.): In this case, SpMV has saturated the available main memory bandwidth and, thus, the matrix is a candidate of the MB class.
- P_ML > P_base and/or P_IMB > P_base (poisson3Db, parabolic_fem, offshore): In this case CSR SpMV benefited from eliminating irregularity and/or balancing thread execution time, indicating it may belong to the ML and/or IMB classes.
- P_CMP > P_ML (1) (ASIC_680k, rajat30, degme): This suggests that SpMV using the CSR format might be limited by computational bottlenecks, based on the following analysis: since the memory footprint of the matrix in the computation of P_CMP is smaller (we no longer use the colind array) and we have eliminated irregularity as well, the modified kernel moves less data and with regular accesses only. Thus, in order for (1) to hold, computation itself must be the limiting factor, and the matrix is not memory bound (neither latency nor bandwidth). If additionally P_IMB > P_base, the execution is imbalanced and the threads causing the imbalance are compute-limited. This condition holds for matrices that have the majority of nonzero elements concentrated in a few dense rows, e.g., ASIC_680k, rajat30, degme, FullChip and circuit5M. Note that these matrices improve with vectorization in Fig. 1. Otherwise, the matrix may be suffering from loop overheads in the presence of very short rows, e.g., webbase1M, flickr, circuit5M.

- P_base > P_MB: Based on a theoretical analysis similar to the previous one, we conclude that this condition may occur when SpMV’s working set fits in the system’s cache hierarchy and the system’s effective bandwidth vastly improves for such working sets. This case does not appear in our matrix suite on KNC; it does, however, on other platforms.
Our analysis clearly exposes a diversity in performance bottlenecks on Intel Xeon Phi. Many matrices suffer from thread imbalance and excessive cache misses, while a subset of those seem to have competing bottlenecks, e.g., parabolic_fem, offshore, webbase1M, etc. This diversity justifies an SpMV optimizer that is matrix-adaptive. Towards this direction, we design a rule-based classification algorithm based on the previous observations. Since we collect data for the performance bounds during an online profiling phase, we will henceforth refer to this classifier as the profile-guided classifier.
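A rule-based classifier along these lines can be sketched as follows. This is a toy sketch: the thresholds alpha and beta are hypothetical stand-ins for the tuned hyperparameters, and the rules condense the cases listed above:

```python
def classify(p_base, p_mb, p_ml, p_imb, p_cmp, alpha=1.1, beta=0.9):
    """Toy rule-based bottleneck classifier. A bound "matters" when
    eliminating the corresponding bottleneck would pay off by a
    factor of at least alpha; beta controls how close the baseline
    must be to the bandwidth bound. Returns a set of class labels."""
    classes = set()
    if p_base >= beta * p_mb:          # close to the bandwidth bound
        classes.add("MB")
    if p_ml >= alpha * p_base:         # regularizing accesses pays off
        classes.add("ML")
    if p_imb >= alpha * p_base:        # balancing threads pays off
        classes.add("IMB")
    if p_cmp >= alpha * p_ml:          # compute limits remain even after
        classes.add("CMP")             # removing all irregularity
    return classes
```

An empty result corresponds to a matrix that is not worth optimizing with any of the optimizations in the pool.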
The algorithm is depicted in Fig. 4. The classification is performed by comparing per-class performance bounds and by evaluating their impact on the baseline kernel. The threshold values that form the hyperparameters of our classifier have been tuned using grid search, which simply performs an exhaustive search through the specified hyperparameter space for a combination of values that maximizes some performance metric. We choose to maximize the average performance gain of the corresponding optimizations on a large set of matrices (see Section III-E). Notice that it is possible for a matrix not to be classified. This reflects cases for which it is not worth applying any of the optimizations in our pool. Also note that this classifier implicitly deduces SpMV performance characteristics on the underlying computing platform through hyperparameter tuning and profiling, making it architecture-adaptive.
III-D Feature-Guided Classifier
We go one step further and design a classifier that relies on structural features of the sparse matrix to perform the classification. The intuition behind this approach is that one can devise features that represent comprehensive characteristics of a sparse matrix that are closely related to SpMV performance. For example, the number of nonzero elements per row can reveal workload imbalance during SpMV execution. In particular, if a matrix contains very uneven row lengths and a row-partitioning scheme is being employed for workload distribution, then threads that are assigned long rows will take longer to execute, resulting in thread imbalance. We leverage supervised learning techniques to build this classifier. The advantage of this approach is that, given a set of labeled matrices, the classification rules can be automatically deduced. We will henceforth refer to this classifier as the feature-guided classifier.
We experiment with a Decision Tree classifier and adjust it to perform multi-label classification in order to detect all bottlenecks. We also add a dummy class to reflect matrices not worth optimizing with any of our optimizations (see Section III-E). Learning is performed using an optimized version of the CART algorithm, which has a runtime cost of O(f · s log s) for the construction of the tree, where f is the number of features used and s is the number of samples used to build the classifier. The query time for this classifier is O(log s). We generate the classifier using the scikit-learn machine learning toolkit [22].
III-D1 Feature Extraction
This classifier uses real-valued features to perform the classification. Table I shows all the features we experimented with, along with the time complexity of their extraction from the input matrix. N denotes the number of rows in the matrix and NNZ the number of nonzero elements. We define nnz_i as the number of nonzero elements of row i, bw_i as the column distance between the first and last nonzero element of row i, ngroups_i as the number of groups formed by consecutive elements in row i and, finally, miss_i as the number of nonzero elements in row i that can generate cache misses. We naively say that an element will generate a cache miss when its distance from the previous element in the same row exceeds the number of elements that fit in a cache line of the system.
Feature | Definition | Complexity
matrix size | 0: exceeds or 1: fits in LLC | O(1)
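Several of the per-row quantities defined above can be extracted in a single O(NNZ) pass over the CSR arrays. A minimal sketch (function and parameter names are illustrative; it assumes column indices are sorted within each row, as is standard for CSR):

```python
def row_features(rowptr, colind, line_elems=8):
    """Per-row structural features: nnz_i, bw_i and the naive
    cache-miss count miss_i. line_elems is the number of doubles
    per cache line (8 for a 64-byte line)."""
    nnz_per_row, bw, miss = [], [], []
    for i in range(len(rowptr) - 1):
        cols = colind[rowptr[i]:rowptr[i + 1]]
        nnz_per_row.append(len(cols))
        bw.append(cols[-1] - cols[0] if cols else 0)
        # an access "misses" when the gap from the previous nonzero
        # in the same row exceeds one cache line
        miss.append(sum(1 for a, b in zip(cols, cols[1:])
                        if b - a > line_elems))
    return nnz_per_row, bw, miss
```

Statistics over these lists (max, average, standard deviation, etc.) then serve as the scalar inputs of the classifier.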
III-D2 Training Data Selection
We train our classifier with a data set of sparse matrices from the University of Florida Sparse Matrix Collection [21]: a suite of 210 matrices drawn from a wide variety of application domains, selected to avoid being biased towards a specific sparsity pattern.
III-D3 Training Data Labeling
Since the classes of a matrix cannot be determined in a straightforward manner, we use our profile-guided classifier for this purpose. An issue that arises from this choice is that the validity of the labels and, consequently, the accuracy of the trained classifier, depends on the behavior of the profile-guided classifier.
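Putting the pieces of this subsection together, the training pipeline can be sketched with scikit-learn. The feature vectors and label sets below are toy placeholders; in the paper the labels come from the profile-guided classifier over the 210-matrix suite:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature rows (e.g. a row-length statistic, a bandwidth
# statistic, a fits-in-LLC flag) and bottleneck label sets per matrix.
X = [[5.0, 1.0, 0], [200.0, 150.0, 1], [5.0, 90.0, 0], [40.0, 2.0, 1]]
labels = [{"MB"}, {"IMB", "CMP"}, {"ML"}, {"MB"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)            # one binary column per class

# A decision tree fit on a 2-D indicator matrix performs
# multi-label classification out of the box.
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)
pred = mlb.inverse_transform(clf.predict([[6.0, 1.5, 0]]))
print(pred)
```

`inverse_transform` maps the predicted indicator row back to a tuple of class names, which the optimizer then maps to optimizations.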
III-E Optimization Pool
Table II gives a mapping of classes to optimizations. We have incorporated optimizations that require minimal preprocessing time, further contributing to our objective for a lightweight scheme. Note that in case multiple bottlenecks are detected, we jointly apply the corresponding optimizations.
For matrices that are memory bandwidth limited, we employ a rather simple compression scheme, just for illustrative purposes. We use delta indexing on the column indices of the nonzero elements of the matrix, a technique which was originally applied to SpMV by Pooch and Nieder [23]. We use 8- or 16-bit deltas wherever possible, but never both, in order to limit the branching overhead during SpMV computation. We employ vectorization as well, which helps attain a higher bandwidth utilization value on modern processors with wide SIMD units.
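A minimal sketch of this delta-indexing idea (names hypothetical; the scheme shown keeps each row's first column index absolute and encodes the rest as gaps, then picks one uniform delta width for the whole matrix, mirroring the "8- or 16-bit, but never both" policy):

```python
def delta_encode_row(cols):
    """Delta-encode one CSR row: first column index stays absolute,
    subsequent indices become gaps from the previous nonzero."""
    return cols[:1] + [b - a for a, b in zip(cols, cols[1:])]

def pick_delta_width(rowptr, colind):
    """Choose a single delta width (1, 2 or 4 bytes) for the whole
    matrix; mixing widths would add branches to the inner loop."""
    max_gap = 0
    for i in range(len(rowptr) - 1):
        cols = colind[rowptr[i]:rowptr[i + 1]]
        for gap in delta_encode_row(cols)[1:]:
            max_gap = max(max_gap, gap)
    return 1 if max_gap < 2**8 else 2 if max_gap < 2**16 else 4
```

With 4-byte indices replaced by 1- or 2-byte deltas, the matrix traffic term of the bandwidth bound in Section III-B shrinks accordingly.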
For the ML class of matrices, we employ software prefetching, since excessive cache misses in SpMV are a result of indirect memory addressing to the right-hand side vector, which cannot be efficiently tackled by current hardware prefetching mechanisms. A single prefetch instruction was inserted in the inner loop of SpMV, with a fixed prefetch distance equal to the number of elements that fit in a single cache line of the hardware platform. Data are prefetched into the L1 cache.
For matrices that suffer from thread imbalance, we include multiple optimizations, as thread imbalance can be a result of either workload imbalance (number of nonzero elements per thread) or computational unevenness. We select the final optimization based on structural features of the matrix. The first subcategory includes matrices with highly uneven row lengths, which we detect by comparing row-length features from Table I. For these matrices, we basically decompose the matrix into two parts stored in a modified CSR format, depicted in Fig. 5. SpMV is performed in two steps, described in Fig. 6. First, SpMV is performed as usual by skipping the long rows, and then the long rows are computed with a different assignment of computation to threads. Every such row is computed by all threads and a reduction of partial results follows. For the second subcategory, which we detect with a structural feature from Table I, we employ the auto scheduling policy available in the OpenMP runtime system [24]. When the auto schedule is specified, the decision regarding scheduling is delegated to the compiler, which has the freedom to choose any possible mapping of iterations (rows in this case) to threads.
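The decomposition step can be sketched as follows, a simplified take on the two-part scheme of Figs. 5-6 (names and the threshold parameter are illustrative):

```python
def split_long_rows(rowptr, colind, values, threshold):
    """Split a CSR matrix into a short-rows part (same row count,
    long rows left empty) and a list of extracted long rows.
    The short part keeps the usual row-parallel SpMV; each long
    row is later processed by all threads with a final reduction."""
    s_ptr, s_col, s_val, long_rows = [0], [], [], []
    for i in range(len(rowptr) - 1):
        beg, end = rowptr[i], rowptr[i + 1]
        if end - beg > threshold:
            long_rows.append((i, colind[beg:end], values[beg:end]))
            # row i stays empty in the short part
        else:
            s_col += colind[beg:end]
            s_val += values[beg:end]
        s_ptr.append(len(s_col))
    return (s_ptr, s_col, s_val), long_rows
```

Keeping the long rows out of the row-parallel sweep removes the few dominant rows that would otherwise serialize one thread.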
Finally, for matrices that present computational bottlenecks, we combine inner loop unrolling with vectorization.
Class | Optimization
MB | column index compression through delta encoding [23] + vectorization
ML | software prefetching on vector x
IMB | matrix decomposition, auto scheduling (OpenMP)
CMP | inner loop unrolling + vectorization
IV Experimental Evaluation
Our experimental evaluation focuses on three aspects: matrix classification accuracy, the performance benefit of applying our methodology for optimizing SpMV, as well as the runtime cost of the optimization workflow.
IV-A Experimental Setup and Methodology
Our experiments were performed on a first-generation Intel Xeon Phi coprocessor (KNC), a second-generation Intel Xeon Phi bootable processor (KNL) configured in the Flat mode, in which the high-bandwidth MCDRAM memory (HBM) is directly addressable, and an Intel Broadwell processor (Broadwell). On KNL, we allocate the entire application on HBM using the numactl tool. Table III lists the technical specifications of each platform. We disabled the turbo boost technology to avoid fluctuations in the measurements.
We used the OpenMP parallel programming interface and compiled our software using Intel ICC 17.0.1 and the Intel OpenMP library. The benchmarks used for the profile-guided classifier as well as the baseline SpMV implementation use the CSR format and are compiled with the -O3 -qopt-prefetch=0 -qopt-threads-per-core=4 flags. To simulate the typical SpMV computation in scientific applications, we used double-precision floating-point values for the nonzero elements. Unless stated otherwise, the baseline and optimized implementations employ a static one-dimensional row-partitioning scheme, where each partition has an approximately equal number of nonzero elements and is assigned to a single thread. For comparison purposes we used Intel MKL 2017.0. Finally, we enforced the thread affinity policy by setting OMP_PLACES=threads and OMP_PROC_BIND=close, using all the cores in each system.
Whenever we report performance rates for SpMV, e.g. Gflop/s, they have been summarized over 5 benchmark runs using the harmonic mean. The rate of each individual run is the ratio of the arithmetic means of the absolute counts (floating-point operations and seconds) over 128 SpMV operations; that is, we perform warm-cache measurements.
Codename | Knights Corner (KNC) | Knights Landing (KNL) | Broadwell
Model | Intel Xeon Phi 3120P | Intel Xeon Phi 7250 | Intel Xeon E5-2699 v4
Microarchitecture | Intel Many Integrated Core | Intel Many Integrated Core | Intel Broadwell
Clock frequency | 1.10 GHz | 1.40 GHz | 2.20 GHz
L1 cache (D/I) | 32/32 KB | 32/32 KB | 32/32 KB
L2 cache | 30 MB | 34 MB | 256 KB
L3 cache | – | – | 55 MB
Cores/Threads | 57/228 | 68/272 | 22/44
Sockets | 1 | 1 | 1
STREAM triad main/LLC [25] | 128/140 GB/s | 395/570 GB/s | 60/200 GB/s
IV-B Feature-Guided Classifier Accuracy
First, we evaluate our feature-guided classifier in terms of accuracy, assuming the labels generated by the profile-guided classifier are correct. We estimate how accurately our models perform using Leave-One-Out cross validation. According to this methodology, for a training set of n matrices (n = 210 in our case), n experiments are performed. For each experiment, n − 1 matrices are used for training and one for testing. The computed performance metric per experiment is the Exact Match Ratio, which is defined as the percentage of samples for which the predicted set of classes is fully correct, i.e. it exactly matches the set of labels produced by our profile-guided classifier. However, since we apply at least one optimization per matrix, we can tolerate predictions that are partially correct; thus, we also define a performance metric, denoted by Partial Match Ratio, that considers a prediction to be correct if it contains at least one correct class. The final accuracy score reported by cross validation is the average of the values computed over the n experiments.
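The two metrics are straightforward to compute over per-sample label sets; a minimal sketch (names hypothetical):

```python
def match_ratios(predicted, actual):
    """Exact and Partial Match Ratios over per-sample label *sets*.
    Exact: predicted set equals the label set; Partial: the two
    sets share at least one class."""
    exact = sum(p == a for p, a in zip(predicted, actual))
    partial = sum(bool(p & a) for p, a in zip(predicted, actual))
    n = len(actual)
    return exact / n, partial / n
```

For example, predicting {ML, IMB} against a true label of {ML} counts as a partial match but not an exact one.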
Table IV reports the most accurate feature-guided classifiers on KNC. Due to space limitations, we do not present the classifiers for the other platforms; however, they employ a similar set of features and achieve similar accuracies. We report classifiers with increasing time complexity in the feature extraction phase, along with their accuracy scores based on both the Exact and Partial Match Ratio performance metrics. The selection of features for the classifiers has been the result of an exhaustive search.
Features | Complexity | Accuracy Exact (%) | Accuracy Partial (%)
– | O(N) | 80 | 95
– | O(·) | 84 | 100
IV-C Performance of Selected Optimizations
Contrary to standard classification problems, our key measure of success is not classifier accuracy; rather, we aim to maximize the average performance improvement of our optimizer over the baseline implementation, as well as over the CSR mkl_dcsrmv() kernel available in the Intel MKL library. In this way, we demonstrate the value of optimization tuning for SpMV. We also evaluate the recently introduced Inspector-Executor CSR mkl_sparse_d_mv() kernel of Intel MKL.
Fig. 7 presents the SpMV performance achieved by MKL CSR, MKL Inspector-Executor CSR, our baseline CSR implementation, our profile- and feature-guided classifiers, as well as the oracle (the perfect optimizer that always selects the best optimization available) on all experimental platforms for a representative set of matrices. The classes of each matrix according to the profile-guided classifier are also reported. We note that we used the feature-guided classifier with the highest accuracy reported in Table IV.
On KNC and KNL, our initial observation concerns the diversity of performance bottlenecks detected by our profile-guided classifier. This was expected based on our bound and bottleneck analysis in Section III-B, and is mainly due to the architectural characteristics of the Xeon Phi processors. Both KNC and KNL contain a large number of cores, a fact that favors workload imbalance, and a very expensive cache miss latency (an order of magnitude higher compared to multicores), which further exposes irregularity in the accesses to the right-hand side vector. Thus, there are many matrices that fall outside the standard MB class. Our classifiers manage to successfully capture this trend and select proper optimizations due to their matrix- and architecture-awareness. Notice that the largest speedups for our classifiers are achieved by combining optimizations from different classes.
Specifically on KNC, rajat30 and flickr are the only matrices where our classifiers do not detect the optimal optimization. In the case of rajat30, which is characterized by a high concentration of nonzero elements in a few dense rows, our classifiers do not recognize it as a candidate of the ML class and, consequently, do not apply software prefetching, which offers the additional performance boost. By exploring further, we discovered that the benchmark that exposes irregularity for the profile-guided classifier (see Section III-B) can actually detect the irregularity in this matrix by looking at it in partitions, instead of looking at it as a whole. We intend to extend our classification approach to incorporate this idea in future work. In the case of flickr, the combination of auto scheduling, prefetching and vectorization did not pay off, with the best performance being achieved by simply applying prefetching. In total, the profile- and feature-guided classifiers both achieve impressive average speedups over MKL CSR.
On KNL, the profile- and feature-guided classifiers again achieve significant average speedups over MKL CSR. The MKL Inspector-Executor also significantly improves over MKL CSR on average, but our classifiers manage to further improve performance in most cases. The largest speedups over the Inspector-Executor occur for matrices with imbalanced execution, either due to highly uneven row lengths ({IMB, CMP} set of classes), e.g., ASIC_680k, rajat30, degme, or due to computational unevenness ({ML, IMB} set of classes), e.g., parabolic_fem, thermal2. Also notice how some matrices present different or additional bottlenecks compared to KNC. For instance, human_gene1 is latency bound (ML class) on KNC, while on KNL it is bandwidth bound (MB class).
Finally, on Broadwell, the profile- and feature-guided classifiers achieve significant average speedups over MKL CSR. Speedups occur mostly for matrices that entirely or partially fit in the system’s cache hierarchy and include the ML or IMB classes. The MKL Inspector-Executor also achieves an average improvement over MKL CSR, but our classifiers manage to further improve performance in most cases.
IV-D Runtime Overhead
The workflow of the proposed optimization process comprises two basic steps: bottleneck detection through classification and generation of optimized code based on the classification results. Concerning the first step, our profile-guided classifier runs a number of microbenchmarks on the input matrix and applies the empirically-tuned classification algorithm (see Section III-C), while the feature-guided classifier extracts matrix properties and applies a pretrained classifier (see Section III-D). Runtime code generation is performed Just-In-Time (JIT). Depending on the selected optimizations, e.g. compression, an intermediate preprocessing step may be required.
To provide a meaningful illustration of how lightweight the proposed optimizers are, we evaluate them in the context of iterative methods for the solution of large sparse linear systems, e.g., variations of the Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES) methods. Such solvers call SpMV repeatedly and usually require hundreds to thousands of iterations to converge, making it possible to amortize the overhead of the optimization process, e.g., optimization selection, format conversion, runtime code generation, parameter tuning, etc. However, it is quite common in real-life applications to run preconditioned versions of these methods to accelerate convergence. In this case, the number of iterations may be significantly smaller, ranging from dozens to hundreds, thus limiting the online overhead that can be tolerated [13].
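To make the amortization argument concrete, here is a textbook (unpreconditioned) CG loop, not the paper's code: SpMV is invoked exactly once per iteration, so any per-call SpMV saving is multiplied by the iteration count.

```python
import numpy as np

def cg(spmv, b, tol=1e-8, max_iter=1000):
    """Textbook Conjugate Gradient for an SPD system A x = b,
    where `spmv` computes A @ v and is called once per iteration."""
    x = np.zeros_like(b)
    r = b - spmv(x)            # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for it in range(1, max_iter + 1):
        Ap = spmv(p)           # the SpMV call an optimizer would specialize
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return x, it
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter
```

In exact arithmetic CG converges in at most n iterations for an n-by-n SPD matrix; in practice the hundreds to thousands of iterations typical of large systems are what make an up-front tuning overhead worthwhile.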
In Table V we present the minimum number of iterations required for our optimizers, as well as for the MKL CSR Inspector-Executor and two trivial optimizers (one that runs all single optimizations, a total of 5 in our case, and one that also includes their combinations of 2, a total of 15 in our case, and selects the best for each matrix), to be beneficial over MKL CSR in the context of an iterative solver. Specifically, if T_base and T_opt are the total execution times of the solver using a baseline and an optimized SpMV implementation respectively, and O the optimizer overhead, then:

T_base = N * (t_spmv + t_other),    T_opt = O + N * (t'_spmv + t_other)

where N is the number of iterations required for the solver to converge, t_spmv and t'_spmv are the execution times of SpMV before and after optimization, and t_other the execution time of the other computations in the solver, which for simplicity we assume is not affected by changes in the SpMV implementation. By setting T_opt = T_base we get the minimum number of solver iterations required to amortize the optimization cost:

N_min = O / (t_spmv - t'_spmv)
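This break-even point can be computed directly; the function and argument names below are ours, for illustration:

```python
import math

def min_iterations_to_amortize(overhead, t_spmv_base, t_spmv_opt):
    """Smallest solver iteration count N for which the optimized run wins,
    i.e. overhead + N * t_spmv_opt <= N * t_spmv_base."""
    gain = t_spmv_base - t_spmv_opt
    if gain <= 0:
        raise ValueError("optimization must make SpMV faster to ever pay off")
    return math.ceil(overhead / gain)

# e.g., a 1 s tuning overhead and a 10 ms -> 8 ms per-call SpMV improvement
# are amortized after ceil(1 / 0.002) = 500 iterations.
```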
In Table V, we report the worst and best observed cases among our representative matrix suite, as well as the average over all matrices. We run 64 SpMV iterations to obtain valid timing measurements. Note that the trivial optimizers incur the preprocessing overheads of every optimization, while our optimizers only incur the overheads of the optimizations selected based on their predictions. As expected, both optimizers that leverage the profile- and feature-guided classifiers have a significant advantage over the trivial optimizers. The same holds for the MKL Inspector-Executor. However, our feature-guided optimizer requires by far the smallest number of iterations to provide a speedup over MKL CSR, making it the most lightweight approach.
TABLE V: Minimum number of solver iterations required to amortize the optimization overhead over MKL CSR (best, average, and worst case across the matrix suite).

Optimizer               Best   Average   Worst
trivial-single           455       910    8016
trivial-combined        1992      3782   37111
profile-guided           145       267    3145
feature-guided            27        60     567
MKL Inspector-Executor    28       336    1229
V Related Work
Different sparse matrices have different sparsity patterns, and different architectures have different strengths and weaknesses. To achieve the best SpMV performance for the target sparse matrix on the target platform, autotuning has long been considered beneficial. The first autotuning approaches attempted to tune parameters of specific sparse matrix storage formats. In this direction, the Optimized Sparse Kernel Interface (OSKI) library [26] was developed as a collection of high-performance sparse matrix operation primitives for single-core processors. It relies on the SPARSITY framework [27] to tune the SpMV kernel by applying multiple optimizations, including register blocking and cache blocking. Autotuning has also been used to find the best block and slice sizes of the input sparse matrix on modern CMPs and GPUs [28].
There have been some research efforts closer to our work. The clSpMV framework [29] is the first framework that analyzes the input sparse matrix at runtime and recommends the best representation for it, but it is restricted to GPU platforms. In the same direction, the authors in [30] present analytical and profiling-based performance modeling to predict the execution time of SpMV on GPUs using different sparse matrix storage formats, in order to select the most efficient format for the target matrix. For each format under consideration, they establish a relationship between the number of nonzero elements per row in the matrix and the execution time of SpMV using that format, thus encapsulating to some degree the structure of the matrix in their methodology. Similarly, in [31], the authors propose a probabilistic model to estimate the execution time of SpMV on GPUs for different sparse matrix formats. They define a probability mass function to analyze the sparsity pattern of the target matrix and use it to estimate the compression efficiency of every format they examine. Combined with the hardware parameters of the GPU, they predict the performance of SpMV for every format. Since compression efficiency is the determinant factor in this approach, it is mainly targeted at memory bandwidth bound matrices. Closer to our approach is the SMAT autotuning framework [32]. This framework selects the most efficient format for the target matrix using feature parameters of the sparse matrix. It treats format selection as a classification problem, with each format under consideration representing a class, and leverages a data mining approach to generate a decision tree that performs the classification based on the extracted feature parameters of the matrix. The distinguishing advantage of our optimization selection methodology over the aforementioned approaches is that it decouples the decision making from specific optimizations, by predicting the major performance bottleneck of SpMV instead of the SpMV execution time under a specific optimization. Thus, in contrast to the above frameworks, where incorporating a new optimization requires either retraining a model or defining a new one, our decision-making approach allows an autotuning framework to be easily extended, simply by assigning the new optimization to one of the classes.

VI Conclusions
In this paper, we show how a matrix- and architecture-adaptive optimizer can provide significant speedups for SpMV. The proposed optimizer treats SpMV optimization as a classification problem. It classifies the input matrix based on its performance bottlenecks and applies suitable optimizations to tackle them. We describe two classifiers: a) a profile-guided classifier that relies on online profiling to perform the decision making, and b) a classifier trained offline with machine learning techniques that only extracts structural properties of the matrix on-the-fly. All aspects of the optimization process focus on minimizing online overheads in order to be applicable in real-life scenarios. Experimental evaluation on a representative matrix suite on one multicore and two manycore platforms shows that our methodology is very promising, achieving significant speedups over the Intel MKL library with an online overhead that can easily be amortized.
Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work was supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159. The opinions expressed and arguments employed herein do not necessarily reflect the official views of these funding bodies.
References
 [1] S. Toledo, “Improving the memory-system performance of sparse-matrix vector multiplication,” IBM Journal of Research and Development, vol. 41, no. 6, pp. 711–725, 1997.
 [2] A. Pinar and M. T. Heath, “Improving performance of sparse matrix-vector multiplication,” in Proceedings of the 1999 ACM/IEEE Conference on Supercomputing. ACM, 1999, p. 30.
 [3] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, and F. F. Rivera, “Improving the locality of the sparse matrix-vector product on shared memory multiprocessors,” in Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on. IEEE, 2004, pp. 66–71.
 [4] R. W. Vuduc and H.-J. Moon, “Fast sparse matrix-vector multiplication by exploiting variable block structure,” in High Performance Computing and Communications. Springer, 2005, pp. 807–816.
 [5] J. Willcock and A. Lumsdaine, “Accelerating sparse matrix computations via data compression,” in Proceedings of the 20th Annual International Conference on Supercomputing. ACM, 2006, pp. 307–316.
 [6] K. Kourtis, G. Goumas, and N. Koziris, “Optimizing sparse matrix-vector multiplication using index and value compression,” in Proceedings of the 5th Conference on Computing Frontiers. ACM, 2008, pp. 87–96.
 [7] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix–vector multiplication on emerging multicore platforms,” Parallel Computing, vol. 35, no. 3, pp. 178–194, 2009.
 [8] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” in Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures. ACM, 2009, pp. 233–244.
 [9] M. Belgin, G. Back, and C. J. Ribbens, “Pattern-based sparse matrix representation for memory-efficient SMVM kernels,” in Proceedings of the 23rd International Conference on Supercomputing. ACM, 2009, pp. 100–109.
 [10] K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris, “CSX: An extended compression format for SpMV on shared memory systems,” in ACM SIGPLAN Notices, vol. 46, no. 8. ACM, 2011, pp. 247–256.
 [11] A. Buluç, S. Williams, L. Oliker, and J. Demmel, “Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication,” in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International. IEEE, 2011, pp. 721–733.
 [12] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, “A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
 [13] R. Li and Y. Saad, “GPU-accelerated preconditioned iterative linear solvers,” The Journal of Supercomputing, vol. 63, no. 2, pp. 443–466, 2013.
 [14] J. L. Greathouse and M. Daga, “Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 769–780. [Online]. Available: https://doi.org/10.1109/SC.2014.68
 [15] W. Liu and B. Vinter, “CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ser. ICS ’15. New York, NY, USA: ACM, 2015, pp. 339–350. [Online]. Available: http://doi.acm.org/10.1145/2751205.2751209
 [16] Y. Saad, Numerical methods for large eigenvalue problems. SIAM, 1992, vol. 158.
 [17] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
 [18] J. MellorCrummey and J. Garvin, “Optimizing sparse matrix–vector product computations using unroll and jam,” International Journal of High Performance Computing Applications, vol. 18, no. 2, pp. 225–236, 2004.
 [19] W. T. Tang, R. Zhao, M. Lu, Y. Liang, H. P. Huynh, X. Li, and R. S. M. Goh, “Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi,” in Code Generation and Optimization (CGO), 2015 IEEE/ACM International Symposium on. IEEE, 2015, pp. 136–145.
 [20] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., 1984.
 [21] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.
 [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [23] U. W. Pooch and A. Nieder, “A survey of indexing techniques for sparse matrices,” ACM Computing Surveys (CSUR), vol. 5, no. 2, pp. 109–133, 1973.
 [24] L. Dagum and R. Menon, “OpenMP: An industry standard API for shared-memory programming,” Computational Science & Engineering, IEEE, vol. 5, no. 1, pp. 46–55, 1998.
 [25] J. D. McCalpin, “STREAM: Sustainable memory bandwidth in high performance computers,” 1995.
 [26] R. Vuduc, J. W. Demmel, and K. A. Yelick, “OSKI: A library of automatically tuned sparse matrix kernels,” in Journal of Physics: Conference Series, vol. 16, no. 1. IOP Publishing, 2005, p. 521.
 [27] E.-J. Im, K. Yelick, and R. Vuduc, “SPARSITY: Optimization framework for sparse matrix kernels,” International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 135–158, 2004.
 [28] J. W. Choi, A. Singh, and R. W. Vuduc, “Model-driven autotuning of sparse matrix-vector multiply on GPUs,” in ACM SIGPLAN Notices, vol. 45, no. 5. ACM, 2010, pp. 115–126.
 [29] B.-Y. Su and K. Keutzer, “clSpMV: A cross-platform OpenCL SpMV framework on GPUs,” in Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 2012, pp. 353–364.
 [30] P. Guo, L. Wang, and P. Chen, “A performance modeling and optimization analysis tool for sparse matrix-vector multiplication on GPUs,” Parallel and Distributed Systems, IEEE Transactions on, vol. 25, no. 5, pp. 1112–1123, 2014.
 [31] K. Li, W. Yang, and K. Li, “Performance analysis and optimization for SpMV on GPU using probabilistic modeling,” Parallel and Distributed Systems, IEEE Transactions on, vol. 26, no. 1, pp. 196–205, 2015.
 [32] J. Li, G. Tan, M. Chen, and N. Sun, “SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication,” in ACM SIGPLAN Notices, vol. 48, no. 6. ACM, 2013, pp. 117–126.