General, Hybrid and Optimized Sparse Toolkit (Bitbucket mirror)
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.READ FULL TEXT VIEW PDF
General, Hybrid and Optimized Sparse Toolkit (Bitbucket mirror)
Many applications in science and engineering are based on sparse linear algebra. The solution of large eigenvalue problems or extremely sparse systems of linear equations is a central part of many numerical algorithms from quantum physics to fluid dynamics to structural mechanics. The solvers are typically composed of iterative subspace methods, including advanced preconditioners. At the lowest level, the multiplication of large sparse matrices with dense vectors (spMVM) is frequently one of the most time-consuming building blocks. Thus, the efficient implementation of this operation is of very high importance.
The spMVM kernel is usually memory-bound for realistic problems on all modern computer architectures, since its code balance (ratio of main memory data accesses to executed floating-point operations) is quite large compared to typical machine balance values (ratio of maximum memory bandwidth to arithmetic peak performance). Additional complications arise because the sparsity pattern of the matrix, i.e., the position of the non-zero entries, can have considerable impact on spMVM performance due to indirect access to the right hand side (RHS) vector; this makes it difficult to understand or even predict performance via simplistic bandwidth-based modeling. And finally, the sparse matrix storage format has a considerable performance impact and the optimal choice is known to be very sensitive to the underlying hardware. Consequently there is nowadays a large variety of sparse matrix storage formats to choose from. Some are more suitable for cache-based standard microprocessors (like CRS), while others yield better performance on vector computers (like “Jagged Diagonals Storage” [JDS]) or on graphics processing units (like ELLPACK and its variants).
Emerging coprocessors/accelerators like the Intel Xeon Phi or Nvidia Tesla GPGPUs are of special interest for executing spMVM because of their large memory bandwidth combined with a very high level of on-chip parallelism. These new compute devices are an integral part of several supercomputers already today. One may speculate that their proliferation will further increase, making (strongly) heterogeneous compute node architectures the standard building block of future cluster systems. Thus, sustainable and modern high-performance parallel software should be able to utilize both the computing power of accelerators as well as standard CPUs in the same system. As of today, in such a setting one is forced to deal with multiple sparse storage formats within the same application code. Hence, it is of broad interest to establish a single storage format that results in good performance for all architectures. Since even the current standard microprocessors feature SIMD execution or related techniques, such a format would have to support SIMD parallelism in an optimal way.
It has to be stressed that standard x86-based server microprocessors are usually so bandwidth-starved even with scalar code (i.e., they have a low machine balance) that a strongly memory-bound loop kernel such as spMVM does not benefit very much from SIMD vectorization, unless the working set is small enough to fit into a cache. However, SIMD does make a difference on designs with many very slow cores (such as the Intel Xeon Phi), and certainly on the massively threaded GPGPUs. Moreover it can be shown that efficient (i.e., SIMD-vectorized) single-core code can yield substantial energy savings on standard multicore processors by reaching the bandwidth saturation point with fewer cores [1, 2].
The high relevance of the spMVM operation for many application areas drives continuous, intense research on efficient spMVM implementations on all kinds of potential compute devices. This is why we here only briefly review relevant work in the context of establishing a single matrix data format for processor architectures used in modern supercomputers.
On cache-based CPUs the CRS format, as presented by Barrett et al. , still sets the standard if no regular matrix substructures can be exploited. Further work, especially on auto-tuning the performance of spMVM kernels on multi-core CPUs, has been done by Williams et al. . A detailed study of CRS performance characteristics on CPU architectures has been presented, e.g., by Goumas et al. .
A first comprehensive analysis of spMVM performance for GPGPUs can be found in Bell and Garland , who adopted the ELLPACK sparse matrix format which had been used on classic vector computers by Kincaid et al.  long before. Further research on this topic towards auto-tuning has been conducted by Choi et al. . These efforts have inspired a lot of subsequent work on more efficient data formats for sparse matrices on GPGPUs [9, 10, 11, 12]. A common finding in those publications is that ELLPACK-like matrix formats (such as ELLPACK, ELLPACK-R, ELLR-T, Sliced ELLR-T, pJDS) deliver the best performance for spMVM on GPGPUs.
The recent appearance of the Intel Xeon Phi coprocessor has spawned intense research activity around the efficient implementation of numerical kernels, including spMVM, on this architecture. First work from Saule et al.  is based on the (vectorized) CRS format and showed that this format is in general not suited for Intel Xeon Phi.
Liu et al.  have recently published a sparse matrix data format for the Intel Xeon Phi that is very similar to ours. They have obtained much better performance than with CRS, but the performance analysis of the format was focused on the Xeon Phi architecture, and it was not applied to GPGPUs and standard microprocessors.
Still, portability of these hardware-specific formats across different platforms remains an open issue. Recently an spMVM framework based on OpenCL has been introduced , which allows for code portability but does not provide a unified and efficient spMVM data format across compute devices of different hardware architecture. Such a format is highly desirable in order to address data distribution issues arising with dynamic load balancing and fault tolerance on heterogeneous systems: Re-distributing matrix data becomes a lot easier when all devices use the same storage format. In addition, a unified data format simplifies the definition and implementation of interfaces of numerical multi-architecture libraries.
This work demonstrates the feasibility of a single storage format for sparse matrices, which we call “SELL--.” It builds on Sliced ELLPACK  and delivers competitive performance on a variety of processor designs that can be found in modern heterogeneous compute clusters. Note that Sliced ELLPACK has only been used on GPGPUs up to now.
We examine the CRS and SELL-- formats specifically in terms of their suitability for SIMD vectorization on current x86 processors with “Advanced Vector Extensions” (AVX) and on the Intel “Many Integrated Core” (MIC) architecture. SELL-- shows best performance if the “chunk structure” of the format is chosen in accordance with the relevant SIMD width , i.e., the width of a SIMD register on x86 and Intel MIC. On GPGPUs this is the number of threads per warp. Sorting rows by the number of non-zero entries within a limited “sorting scope” of rows reduces the overhead of the scheme and improves performance on all architectures if is not too large.
In contrast to previous work on Sliced ELLPACK , our analysis is complemented by a thorough performance modeling approach which allows us to understand the influence of the two parameters and on the performance and their interaction with basic matrix properties such as the “chunk occupancy” (related to zero fill-in) and the number of non-zero entries per row.
Our analysis extends the work of Liu et al.  as it demonstrates that a single SIMD-optimized data format is appropriate for all current HPC architectures.
Using the matrices from the “Williams Group” in the University of Florida matrix collection, we finally demonstrate that a single data format and a SIMD-vectorized or CUDA-parallelized spMVM kernel with fixed values for and shows best or highly competitive performance on a standard multi-core processor (Intel Xeon Sandy Bridge), the Intel Xeon Phi accelerator, and the Nvidia Kepler K20 GPGPU for most matrix types.
This paper is strictly limited to the single-chip case, and we use OpenMP threading only. MPI and hybrid MPI+X parallelization (where X is a threading or accelerator programming model) is left for future work. Our spMVM formats also assume “general” matrices, i.e., we do not exploit special substructures that would enable optimizations such as blocking or unrolling. Adding those on top of the SELL-- matrix format implementation will be a challenge in itself.
For the performance evaluation three modern multi- and many-core architectures have been chosen in order to cover different architectural concepts which are of importance for current and future compute devices:
The Intel Xeon Sandy Bridge EP system (“Intel SNB”) stands for the class of classic cache-based x86 multi-core processors with a moderate number of powerful cores, moderate SIMD acceleration and still rather high clock frequency.
Trading core complexity and clock speed for core count and wide SIMD parallelism, the Intel Xeon Phi (“Intel Phi”) accelerator marks the transition from traditional multi-core technology to massively parallel, threaded architectures.
The Nvidia Kepler (“Nvidia K20”) architecture finally represents the class of GPGPU accelerators with their extreme level of thread parallelism, reduced core and execution complexity, and a different memory subsystem design.
Relevant specific key features of these compute devices are summarized in Table 1 and are briefly described below.
The Intel SNB is a single socket of a standard two-socket “Intel Xeon E5-2680” server. It is based on Intel’s Sandy Bridge-EP microarchitecture and supports the AVX instruction set, which works on 256-bit wide SIMD registers. It has eight cores with up to two hardware (SMT) threads each, and is equipped with 32 KiB of L1 and 256 KiB of L2 cache per core. The shared L3 cache has a total size of 20 MiB. Note that we restrict our measurements to a single socket to avoid potential interference from the ccNUMA characteristics of multi-socket shared memory systems.
The “Intel Xeon Phi 5110P” is based on Intel’s MIC architecture. It is a PCIe-based accelerator card comprising 60 rather simple cores (based on the P54C design, which was launched in 1994) with four hardware (SMT) threads each. The hardware threading is intended to compensate the deficiencies of the in-order core architecture. Each core is extended by a 512-bit wide SIMD unit, which can perform up to eight double-precision (or 16 single-precision) floating-point fused multiply-add operations in a single instruction. The Intel Xeon Phi has a shared but segmented L2 cache of 30 MiB with each segment of 512 KiB being attached to a single core. The L2 cache design has several known shortcomings; e.g., if the same cache line is used by multiple cores then each of them will hold a separate copy in its local L2 segment. This may reduce the effective L2 cache size for shared-memory parallel codes to 512 KiB in the worst case. The coprocessor is equipped with 8 GiB of global GDDR5 memory. ECC memory protection is available and was turned on for all measurements in this work. Using the SIMD units through code vectorization is essential to achieve reasonable performance on this architecture.
The “Nvidia Tesla K20c” is based on the “Kepler” architecture. It has 13 “Streaming Multiprocessors” (SMX), each with 192 single-precision CUDA cores, for a total of 2496 CUDA cores. Each double-precision unit is shared among three CUDA cores, for a total of 64 double-precision units per SMX. Within each multiprocessor, most of the hardware units are driven in a so-called “single instruction multiple threads” (SIMT) manner: A group of 32 threads, called “warp,” executes the same instruction at a time. The card has 1280 KiB of L2 cache and 5 GiB of global GDDR5 memory (with ECC memory protection, which was turned on for all measurements). Best memory performance is achieved if all threads of a warp access consecutive elements of an array at the same time (“load coalescing”).
Code compilation was done with the Intel C Compiler (ICC) 13.1.0 for the Intel machines, and with the CUDA Toolkit 5.0 for the Nvidia GPGPU. The Likwid tools111http://code.google.com/p/likwid were used for hardware performance counter measurements (e.g., memory bandwidth and energy) on the Intel SNB, and for controlling thread affinity on the Intel SNB and Phi.
A practical range for the achievable main memory bandwidth on Intel architectures is typically set by two “corner case” microbenchmarks (see Table 1): The “copy” benchmark represents the unfavorable case, while a read-only bandwidth benchmark (see 1) sets an upper limit.
Note that “non-temporal stores,” e.g., stores that bypass the cache and avoid the otherwise mandatory cache line write-allocate transfers on every write miss, were not used for the copy benchmark on either Intel architecture. Instead, the measured bandwidth available to the loop kernel was multiplied by to get the actual transfer rate over the memory interface. This mimics the data transfer properties for “skinny” sparse matrices with very few non-zero elements per row.
The slow read-only performance on the Nvidia K20 (see Table 1) reflects the difficulties with performing reduction operations on this architecture, even if they only happen within shared memory; the global reduction is even omitted in our case.
We conduct the detailed performance analysis of various storage formats based on the four matrices RM07R, kkt_power, Hamrle3, and ML_Geer from The University of Florida Sparse Matrix Collection222http://www.cise.ufl.edu/research/sparse/matrices. Their descriptions can be found in Appendix A. These specific matrices were chosen because they represent corner cases of matrix characteristics, which crucially influence the efficiency of the data layout.
The broad applicability of our insights is then validated against the matrices of the “Williams group” from The University of Florida Sparse Matrix Collection, available for download from Nvidia333http://www.nvidia.com/content/NV_Research/matrices.zip. These matrices have already been used in previous research [4, 6, 8] for analyzing the spMVM on GPGPUs, and thus provide a good basis for comparison.
Note that the four corner cases together with the 14 matrices from the “Williams group“ in our opinion constitute a sufficiently large set of test matrices in order to show the general applicability of our insights.
Basic properties such as the dimension , the number of non-zeros , the average number of non-zeros per row , and the density (fraction of non-zeros) of all considered matrices can be found in Table 2. The parameter will be introduced in subsection 3.4.
). There are strong differences between these formats in terms of the storage order of the non-zero entries, the use of padding (ELLPACK), and the row reordering (JDS), due to the peculiarities of each hardware platform. These differences make it tedious to use heterogeneous systems efficiently. In the following we identify a unified low-overhead storage format, which is designed to be efficient on the three classes of compute devices considered in this work. Guided by the equivalence of SIMD and SIMT (warp) execution, we analyze the overhead and benefit of SIMD vectorization strategies for the CRS format and for an improved variant of the ELLPACK scheme.
The CRS data format is a cache-friendly layout ensuring consecutive data access to the matrix elements and the column indices. The C version of a CRS spMVM kernel is given in 2.
The non-zero matrix entries are stored row by row in the array
val, and their original column indices are put in
col. The starting offsets of all rows are available in the array
rpt. A sketch of the CRS storage scheme for the matrix in Figure a is shown in Figure b.
Efficient SIMD vectorization requires consecutive data access for optimal performance. Thus the inner loop in 2, which runs over all non-zero entries of each row, is the target for vectorization. Applying four-way “modulo unrolling” to this loop, we can formulate the CRS spMVM kernel in a SIMD-friendly way, tailored for the AVX register width of four elements as used by the Intel SNB processor (see 3).
The compiler can often do this by itself and
vectorize the bulk loop such that the body is
executed in a SIMD-parallel manner, e.g.,
assigned to a single AVX register and
loaded with a single instruction to another register. The initial loop
peeling to satisfy alignment constraints is omitted for brevity.
The same strategy is chosen by the Intel compiler for the vectorization of the basic CRS code on the Intel Phi with an appropriate choice of unrolling factor (eight instead of four; see ).
Vectorized execution of the CRS spMVM may be inefficient, especially for matrices with few non-zeros per row (). -independent overheads of the vectorized code, like the “horizontal” add operation (line 3 in 3) or the scalar remainder loop (lines 3–3) may then eat up the performance gained by vectorization. Note that these particular costs grow with the SIMD width.
Especially on the Intel Phi, handling of scalar overheads like a remainder loop may be expensive: Even though almost all SIMD operations can be masked to “emulate” scalar execution, there is an additional penalty for setting up the mask and executing a separate instance of the loop body. The worst case occurs when the row length is in the order of (or even smaller than) the SIMD width. In this case, the amount of non-vectorized and/or inefficiently pipelined work introduces a significant overhead. For instance, on the Intel Phi a total of 16 single-precision (or eight double-precision) values can be processed with a 512-bit SIMD instruction at a time. A good utilization of the SIMD lanes thus demands an even larger than on the Intel SNB. Additionally, alignment constraints may require some loop peeling, which further reduces the SIMD-vectorized fraction. In summary, on a wide-SIMD architecture the average number of non-zeros per row needs to be substantially larger than the SIMD width of the compute device.
The rather large SIMD width of the GPGPU architecture together with the cost of reduction overhead even within a warp (see also the discussion in subsection 2.2) immediately rules out the CRS format if SIMD/SIMT execution is performed along the inner loop. Parallelizing the outer loop eliminates this problem but destroys load coalescing, since threads within a warp operate on different rows and access elements concurrently that are not consecutive in memory. Hence, CRS is a bad choice on GPGPUs in any case .
The ELLPACK format addresses the problems of CRS for GPGPUs. A Column-wise data layout and padding all rows to the same length (see Figure c) allows row-wise thread parallelization and coalesced memory access to the matrix data. Of course, the number of rows must also be padded to a multiple of the warp size (not shown in Figure c). Since ELLPACK may incur substantial overhead for padding (see white boxes in Figure c), Monakov et al.  have proposed a variant called “Sliced ELLPACK,” which substantially reduces this overhead and increases data locality between successive column computations within a warp.
The main idea is to cut the ELLPACK data layout into equally-sized “chunks” of rows with rows per chunk. Rows are zero-padded to match the length of the longest row within their chunk, reducing the padding overhead substantially as compared to ELLPACK. Then all elements within a padded chunk are stored consecutively in column-major order, and all chunks are consecutive in memory (see Figure a). Unless all rows in the chunk are of equal length, there may still be a substantial penalty in terms of data storage, which will be discussed in subsection 3.4. In addition, the number of matrix rows must be padded to a multiple of . We call this format “SELL--,” since it is parametrized by the chunk size and a “sorting scope” , which will be explained in subsection 3.4. For the remainder of this section we assume no sorting ().
4 shows the C version of the spMVM for SELL-4-,
unrolled to match the SIMD width of an AVX-capable
processor. For reference, a version of the SELL-4- spMVM with AVX intrinsics
is shown in 6.
The matrix entries and their column indices are stored in
col. In addition, the starting offset
of each chunk is stored in
cs and the width of each chunk,
i.e., the length of the longest row in the chunk, is stored in
cl[i] is equal to (
In contrast to the CRS spMVM, the SELL-- kernel has a vectorizable inner loop without a reduction operation, so the “horizontal add” is not required. Also the remainder loop handling is obsolete since is padded to a multiple of . A direct comparison with CRS shows that the inner loop unrolling in the SELL-- kernel corresponds to outer loop unrolling in CRS but ensures cache locality and eases alignment and coalescing constraints, since the matrix data accessed in the inner loop iterations is consecutive in memory. The SELL-- kernel can be vectorized by the compiler or through the use of C intrinsics on Intel systems. Thus, SELL-- is a promising candidate for delivering high efficiency on a variety of compute devices.
The optimal choice of needs to take into account both the padding
overhead of the SELL-- format and hardware-specific restrictions.
SELL--1 is identical to ELLPACK (
not strictly needed in this case), and has maximum padding overhead
as discussed, e.g., in . The other extreme case SELL-1-1 is
equivalent to CRS, and there is no padding at all.
Hence, it is crucial to choose as small as possible but still
compatible with architectural requirements. On the Nvidia K20, the
reasonable (minimum) choice is , i.e., each chunk is executed by one warp.
According to the equivalence of SIMD and SIMT execution, a
first choice for on CPUs would be the SIMD register width in units of
the matrix value data size (e.g., as shown in 4
for an AVX-capable processor in double precision in plain C and in
6 with compiler intrinsics). On the Intel Phi one would
naively set ; however, all vectorized data accesses need to be
512-bit aligned. This leads to the hardware-specific constraint that
has to be a multiple of 64 bytes on the Intel Phi, where
is the length of the -th chunk.
In our case (double-precision matrix, four-byte integer index)
this condition is fulfilled with
cl[i], so we choose for Intel Phi.
Note that on a heterogeneous system, different minimal values of may apply to each component architecture . An obvious solution to this issue in order to obtain a consistent format is to choose the global chunk height . See subsection 5.1 for a discussion of a unified data format for all architectures.
In order to quantify the overhead incurred by the zero-padding in the SELL-- format we define the “chunk occupancy” . It is the fraction of “useful” matrix data entries, i.e., the ratio between the number of non-zero matrix elements and the elements stored in the SELL-- format:
Here, is the number of chunks for the matrix,
cl[i] is defined as above:
The values for all test matrices can be found in Table 2. The meaning of will be explained below.
The minimal value for (worst case scenario) indicates a matrix structure for which the SELL-- data transfer overhead is at a maximum. Such a matrix has a single (fully populated) row with non-zeros in each chunk, and only a single non-zero in all other rows of the same chunk. In this case, elements have to be loaded per chunk while only elements are actually non-zero:
In contrast to this, a constant row length within each chunk (the row length does not have to be constant globally) leads to the best case scenario with , since no zero-padding elements have to be transferred.
A small can be increased by sorting the matrix rows by row length in descending order, so that rows of equal length end up close to each other. Obviously, the overhead becomes minimal when sorting the matrix rows globally, as shown in Figure c. In this case, and the matrix format is identical to pJDS , which can be considered as a zero-padded version of the JDS format (see Figure d) with appropriate . However, when sorting matrix rows globally there is a chance that the access pattern to the RHS vector changes substantially and spatial or temporal locality arising from the physical problem is destroyed. This may lead to an increase in code balance (more data transfers are needed per flop) and, as the kernel performance is already limited by data transfers, to a performance drop. See section 4 below for a quantification of such effects using suitable performance models.
A way to ameliorate this problem is to not sort the matrix rows globally but only within chunks of consecutive rows. Typically, this “sorting scope” is chosen to be a multiple of ; if is a divisor of , there is no effect on . Here we restrict our analysis to powers of two for and (it is certainly not ruled out that a specific, non-power of two choice of might be advantageous for a specific matrix). The effect of local sorting is shown in Figure b for and . The “optimal” , i.e., for which the RHS access is still “good” but which leads to a sufficiently large , is usually not known a priori. Only for very regular matrices can this problem be solved exactly: For the worst-case matrix with as given in Equation 6, results in a perfect . In this case, there is one chunk with length and chunks of length one within the scope of rows.
At this point it has to be noted that when sorting the matrix rows, the column indices need to be permuted accordingly in most of the application cases. This has two major reasons: First, in iterative solvers the algorithm usually switches after each iteration between the “input” and “output” vectors of the previous spMVM operation. Thus these schemes often work in the permuted indices space. Second, possible “matrix bandwidth escalation” of the non-zero pattern due to row re-ordering may be averted by permuting the column indices. The matrix bandwidth is the maximum distance of non-zero entries from the main diagonal.
Sorting the matrix rows is part of the preprocessing step and has to be done only once. Assuming that a large number of spMVM operations will be executed with the sorted matrix, the relative overhead of the sorting itself can usually be neglected. Furthermore, since we only sort inside a certain limited scope, the cost of sorting a single scope is small and parallelization across different scopes is straightforward.
On GPGPUs, the SELL-- format enables a specific optimization: Due to the fact that there is one dedicated thread per matrix row, it is easy to avoid loading zero matrix elements by letting each thread run only until the actual row length is reached. This makes the data format equivalent to the “Sliced ELLR-T” format as introduced by Dziekonski et al.  (with the number of threads running per row set to ). However, there is still a penalty for low- matrices on GPGPUs as the resources occupied by the threads of a warp are only available after the longest-running thread of this warp has finished.
Beyond the issues of vectorization and excess data transfers,
indirect access to the RHS vector
x may further impede
the performance of the spMVM kernel for reasons unconnected
to a specific data storage format.
First, performance will drop significantly if the elements of
x accessed in consecutive inner spMVM loop iterations are not
close enough to each other, so that inner cache levels or “load
coalescing” cannot be used and main memory data access becomes
irregular. A rather general approach to address this problem is to
reduce the matrix bandwidth by applying a bandwidth reduction
algorithm, such as “Reverse Cuthill McKee” (RCM)
. Such transformations are
outside the scope of this work, but the impact of non-consecutive
accesses can be explored in more detail using performance
models (see section 4).
Second, moving the elements of the RHS vector
x into a SIMD
register may come along with a large instruction overhead. Up to now,
x86-based CPU instruction sets (including AVX) do not provide a
“gather” instruction, and loading the elements of
x has to
be done with scalar loads. Thus, filling an AVX register with four
x requires at least five instructions in total (one
to load the four consecutively stored indices and four to fill the
vector registers with the values of
x; in practice, the number
is even larger since the individual slots of a SIMD register can
usually not be freely addressed). However, Intel’s MIC architecture
does provide a “gather” instruction. It can fetch multiple data
items residing in the same cache line from memory to a vector register
even if the addresses are not consecutive. This potentially enhances
performance for loading the elements of
x compared to scalar
loads. However, the actual benefit depends on the locality of the
matrix entries in a single row. In the worst case, i.e., if all
gathered items reside in different cache lines, one gather instruction
per load is required and the whole operation is basically scalar
again. Note that the adverse effects of instruction overhead will only be
visible if no other resource such as main memory bandwidth (the most
promising candidate for spMVM) already limits the attainable
performance. This is true on any architecture.
For large data sets, the spMVM is strongly memory-bound. The spMVM
kernels in Listings 2, 3, and
4 are characterized mainly by data streaming (arrays
col) with partially indirect access (RHS
x). Assuming no latency effects and infinitely fast
caches, it is possible to establish roofline-type performance models
where accounts for reading the matrix entries and column indices, is the traffic incurred by reading the RHS vector (including excess traffic due to insufficient spatial and/or temporal locality), and is the data volume for updating one LHS element. Assuming double precision matrix and vector data and four-byte integer indices we have and . The efficiency of the RHS data access is quantified by the parameter in . Hence, we get:
The value of is governed by a subtle interplay between the matrix structure and the memory hierarchy on the compute device: If there is no cache, i.e., if each load to the RHS vector goes to memory, we have . A cache may reduce the balance by some amount, to get . In the ideal situation when (with the average number of non-zero elements per column and for square matrices), each RHS element has to be loaded only once from main memory per spMVM444This corresponds to the case in .. The worst possible scenario occurs when the cache is organized in cache lines of length elements, and each access to the RHS causes a cache miss. In this case we have , with or on current processors. As already discussed in subsection 3.5, the locality of the RHS vector access and, consequently, the value of can be improved by applying matrix bandwidth reduction algorithms. Note also that, depending on the algorithm and the problem size, the RHS vector may reside in cache for multiple subsequent spMVM kernel invocations, although the matrix must still be fetched from memory. In this special case we have .
The CRS-based roofline model Equation 8 must be modified for the SELL-- data format. As discussed in subsection 3.4, additional data is loaded and processed if the row lengths vary inside a chunk. The reciprocal of the chunk occupancy quantifies the format-inherent average data traffic per non-zero matrix element. Note the excess traffic for only arises for the matrix value and column index and not for the RHS element. This is because all padded column indices are set to zero; thus, only the 0-th RHS element is accessed for all padded elements and the corresponding relatively high access frequency will ensure that this element stays in cache. The code balance for SELL-- is then
The roofline model can now be used to predict the maximum achievable spMVM performance:
Here, is the achievable memory bandwidth as determined by a suitable microbenchmark, e.g., one of the two benchmarks discussed in subsection 2.1. Using in Equation 10 we obtain the analogous expression for CRS format.
As a special case we focus on the scenario, which has been described above. On the Intel SNB processor, whose LLC of 20 MiB can (in theory) hold at least a single vector of all matrix sizes in Table 2, this is usually a valid assumption. Then the performance model reads:
For square matrices with a sufficiently large number of non-zeros per row () one finally arrives at the best attainable performance level for spMVM operations considered in this work:
Note that these estimates are based on some optimistic assumptions which may sometimes not hold in reality: Main memory bandwidth is the only performance limiting factor, data access in cache is infinitely fast, the cache replacement strategy is optimal, and no latency effects occur. Nevertheless,Equation 12 provides an upper bound for spMVM performance on all compute devices if the matrix data comes from main memory.
In general, when RHS accesses cannot be neglected, the code balance depends on , which can only be predicted in very simple cases. However, can be determined by measuring the used memory bandwidth or data volume of the spMVM kernel and setting the code balance equal to the ratio between the measured transferred data volume and the number of executed “useful” flops, . Note that this is only possible if the code is limited by memory bandwidth. For SELL-- we then obtain
which can be solved for :
The corresponding CRS values can again be retrieved by setting .
In our experiments all matrices and vectors are of
type and all indices are four-byte integers. The execution time
of a series of spMVM operations has been measured to account for
possible caching effects and to reduce the impact of finite timer
accuracy. The performance analysis always uses a full compute device
The OpenMP scheduling for the Intel architectures has been set following a simple heuristic based on the matrix memory footprint and its coefficient of variation (
, standard deviation divided by mean) regarding row lengths:
If the matrix fits in the LLC or we use
STATIC” scheduling, else we use “
Especially on the Intel Phi the OpenMP scheduling
may have significant impact on the performance and needs to be chosen
The clock frequency of the Intel SNB was fixed to 2.7 GHz, and has been chosen in accordance with the AVX register width. Eight OpenMP threads were used and Simultaneous Multi-Threading (SMT) was disabled.
For the Intel Phi a chunk height of has been selected to ensure both SIMD vectorization and alignment constraints following the discussion in subsection 3.3. Best performance was generally achieved on all 60 cores with three threads per core (Saule et al.  came to the same conclusion). The large unrolling factor of at least for the SELL-- kernel makes the loop body rather bulky and hard to efficiently vectorize by the compiler. Thus, the SELL-- kernel for Intel Phi has been implemented using MIC compiler intrinsics as shown in 5, ensuring efficient vectorization.
On the Nvidia K20, was set according to the hardware-specific warp size for optimal load coalescing and data alignment. For execution of the CUDA code, a CUDA block size of has been used unless otherwise noted. A single thread was assigned to each row.
The performance of the various data layouts across all hardware platforms was investigated by classifying the test matrices into three groups.Figure a shows a survey of the matrices for which the complete memory footprint of the spMVM data ( bytes) is larger than any LLC on all compute devices (i.e., ). In contrast to these strongly memory bound cases, we present in Figure b all matrices which can in theory run completely out of the LLC at least on the Intel Phi, which has the largest LLC of all. The last group as shown in Figure c is constituted by matrices with specific characteristics where SELL-- reveals its shortcomings.
The baseline performance has been obtained with the vendor-supplied sparse linear algebra libraries using the “standard” data format. Concretely, we used Intel MKL 11.0 with CRS for Intel Sandy Bridge/Xeon Phi (COO for the non-square matrix rail4284 as MKL’s CRS cannot handle non-square matrices) and Nvidia cuSPARSE 5.0 with HYB (cf. ) on the Nvidia K20.
In all groups, the “SELL-32-” bar shows the performance of the unified data layout for each architecture. The sorting scope has been chosen to the value in the range () where the average relative performance with respect to the baseline over all three architectures has its maximum.
“SELL--” results show the performance obtained with the hardware-specific (as described above) and the optimal in the same range as described above for each matrix and architecture.
The four corner case matrices (cf. subsection 2.2) have been chosen such that we can test the SELL-- scheme for the limits of small/large number of non-zeros per row () and high/low chunk occupancy () of the original (unsorted) matrix (see Table 2 for exact values). They are shown on the far left in Figure a. On Intel SNB the “native” CRS layout no longer has any advantage compared to the SIMD-vectorized SELL-- format. On the Intel Phi, SELL-- is far superior to CRS for all matrices. The advantage becomes most pronounced for the “low-” matrix Hamrle3, where CRS is extremely slow on this wide-SIMD architecture due to the problems discussed in subsection 3.2. SELL-- outperforms HYB in cases where is rather low. There are two possible reasons for that, depending on the threshold chosen by cuSPARSE for automatic partitioning of the HYB matrix: Either (for a low threshold) the overhead induced by the irregular (COO) part of the HYB matrix gets too large or (for a high threshold) the overhead from zero fill-in in the regular (ELL) part of the HYB matrix gets too large.
In general, SELL-- with optimal sorting attains best performance on all architectures, with highest impact (as compared to no sorting) for the matrices with worst chunk occupancy, e.g RM07R and kkt_power.
In case of the larger set of memory-bound test matrices from the “Williams group” (right part of Figure a) the SELL-- format also provides best performance for all matrices and architectures if an optimal sorting scope is used. For data sets that completely fit into the LLC of Intel platforms (Figure b), the SELL-- format substantially outperforms CRS. The much lower instruction overhead of vectorized SELL-- vs. (vectorized) CRS boosts performance on the Intel Phi by to for the matrices shown in Figure b. On the Intel SNB, SELL-- shows similar benefits for the matrix mac_econ_fwd500 which can be held in its LLC.
For the memory-bound cases in Figure a a rather constant, high performance level can be achieved for all “large-” matrices (; see section 4), e.g., RM07R () and ML_Geer (). This is in good agreement with our performance model: The maximum performance for this scenario on all compute devices can be estimated by setting in Equation 12. Choosing the best memory bandwidth measurement from Table 1 for each architecture, we find (Intel SNB), (Intel Phi), and (Nvidia K20). For most matrices with the Nvidia K20 and Intel Phi achieve around of this theoretical limit and the Intel SNB gets more than . For “low-” matrices, e.g., Hamrle3 and kkt_power, performance drops by a factor of roughly two or more for all architectures. This is also in line with the performance model, which will be discussed below in subsection 5.3. Of course, these performance models do not hold for cache bound matrices, as can be seen in Figure b.
The Intel Phi delivers a disappointing performance on kkt_power as compared to the other architectures for all data layouts. A high coefficient of variation (cf. Equation 15) of 1.05 implies that performance for this matrix suffers from load imbalance.
Worst accelerator performance is found for rail4284, where the small number of rows () does not provide enough parallelism for both architectures. Even for the simple copy benchmark (see subsection 2.1) the Nvidia K20 requires more than threads to hide memory and execution unit latencies. In these problematic cases, issuing multiple threads per row () on the GPGPU increases parallelism and thus performance. This has been demonstrated by the related “Sliced ELLR-T” format  and can be easily implemented in SELL-- as well. Using , the performance on Nvidia K20 can be increased by a factor of nine (from 0.5 GF/sto 4.5 GF/s). A reason for this still very low performance is the large coefficient of variation of 1.6 for this matrix.
Another poor-performing matrix on accelerators is webbase-1M. Its low in combination with a very small value indicate a highly irregular access pattern. In addition, this matrix has an extremely large coefficient of variation of 8.16 which signifies a large likelihood of load imbalance. The large is due to the fact that this matrix contains a single row which is fully occupied with non-zero elements while the average row length is merely 3.11. Also Choi et al.  found very low performance levels on previous Nvidia GPGPU generations for this matrix and omitted it in their further discussion. Due to the smaller number of threads the impact of load imbalance is much smaller on Intel SNB. Hence, the performance on this architecture meets the expectation from the performance model as described in subsection 5.3.
A small row count is also the main reason for the low performance of Intel
Phi and Nvidia K20 in the dense2 case (a dense
matrix). Note that with a sufficiently large (i.e., ) dense
matrix (as used in ),
much higher performance can be reached on all architectures.
Specifically, on Intel SNB we reach , on Intel Phi we get
STATIC” scheduling), and on Nvidia K20 we see
( with ).
Load imbalance and a small are also the reasons for the comparatively low accelerator performance for the scircuit matrix despite the small matrix size. In this case, 94 % of rows have less than 10 entries but the longest row has 353 entries. This leads to a dominance of the HYB data format over SELL-- for the Nvidia K20.
To conclude the performance discussion we are presenting Figure 4 which represents the relative performance benefit of the SELL-- over the baseline for all non-pathological test matrices.
The performance characteristics of spMVM with SELL-- are the result of a subtle interplay between the sorting scope , the RHS re-use factor , and the chunk occupancy . Figure 5 shows the impact of varying on , , and the performance using the kkt_power matrix on the Intel SNB. In principle the LLC of this processor is large enough to hold the RHS vector in this case and we should expect to be constant. However, the RHS data set of MiB would take up 80% of the LLC; competition with matrix and LHS data causes extra evictions in this case, and the RHS must be loaded more than once. Moreover, the memory bandwidth drawn by the spMVM (measured using hardware performance counters) is always subject to some fluctuations/inaccuracies. Thus determining via Equation 14 provides a qualitative rather than an exact quantitative picture.
Without sorting (), SELL-- is slower than the CRS due to the low . Additionally, is roughly equal to . This meets our expectations, because the RHS re-use factor should not change between the two matrix formats qualitatively. Due to (AVX), is constant for . Also, neither the SELL-- performance nor change within this range, which shows that sorting with a small scope does not disturb the RHS access for this particular matrix; such a behavior cannot be expected in the general case, however. When going to larger sorting scopes, we can observe that converges to one, as expected. At the same time, SELL-- performance increases and exceeds the CRS performance at . Simultaneously, stays on the same level until it increases sharply starting at . Hence, sorting the matrix rows with scopes smaller than this value does not negatively affect the RHS access pattern. Beyond this “threshold” the increase of is accompanied by a drop in SELL-- performance.
Finally, we validate that the SIMD-vectorized (GPGPU-friendly) SELL-- format is able to attain high performance on Intel SNB for all memory-bound matrices used in our work. According to the discussion in section 4, Equation 11 with is an upper performance limit on Intel SNB, where the basic model assumption holds that the complete RHS vector can stay in cache during a single spMVM. Using the achievable bandwidth numbers from Table 1, a maximum and minimum expected performance range as a function of is given in Figure 6 along with the SELL-4-opt performance numbers for all square matrices. We find very good agreement between the measurements and the model for all memory-bound cases. In particular, Figure 6 demonstrates that the low performance for webbase-1M () and Hamrle3 (), representing the two left-most stars in Figure 6, is caused by their short rows and can not be improved substantially by a different data format. The matrices that exceed the performance model have a memory footprint which easily fits into the LLC of the Intel SNB (scircuit and mac_econ_fwd500) or are close to it (qcd5_4 and rma10).
We have motivated the need for a unified storage format for general sparse matrices on modern compute devices. SELL--, which is “Sliced ELLPACK” combined with SIMD vectorization, was identified as the ideal candidate. Although originally designed for GPGPUs, SELL-- is well suited for all modern, threaded architectures with SIMD/SIMT execution such as the Intel Xeon Sandy Bridge, Intel Xeon Phi, and Nvidia Kepler. Moreover, SELL-- is applicable to to a wide range of matrix types. This is a major step towards performance portability of spMVM kernels, and enables the possibility of running spMVM-based algorithms on heterogeneous compute systems with the advantage of storing the matrix in a single format. For most matrices investigated there is no significant loss of performance compared to hardware-specific formats. For Intel Xeon Phi, SELL-- outperforms CRS on most matrices significantly, and may set a new standard sparse matrix data format on this architecture. By construction, SELL-- is ready to exploit future architectures which are expected to deliver performance mainly through wide SIMD/SIMT execution. Thus, SELL-- allows the straightforward, efficient use of hybrid programming models like OpenACC, OpenCL, or offload programming with the Xeon Phi, and is expected to be easily portable to future computer architectures.
In future work we will address the challenge of increasing accelerator performance of SELL-- for matrices with small row count or many non-zeros per row by adopting the well-known GPGPU strategy of running multiple threads per row. A next logical step would be the implementation of an MPI-enabled spMVM based on SELL-- for use on hybrid compute clusters. Additionally, the (automatic) selection of tuning parameters like the sorting scope or the number of threads covering a chunk should be considered. Another question which has to be answered is whether and to what extent SELL-- is suited for other numerical kernels besides spMVM.
We are indebted to Intel Germany and Nvidia for providing test systems for benchmarking. This work was supported (in part) by the German Research Foundation (DFG) through the Priority Programme 1648 “Software for Exascale Computing” (SPPEXA) under project ESSEX, by the Competence Network for Scientific High Performance Computing in Bavaria (KONWIHR) under project HQS@HPC, and by the U.S. Department of Energy (DOE).
We conduct the detailed performance analysis of various storage formats based on four matrices from The University of Florida Sparse Matrix Collection555http://www.cise.ufl.edu/research/sparse/matrices. The following descriptions are taken from the same source:
This matrix emerges from a CFD finite-volume discretization and represents a 3D viscous case with “frozen” turbulence.
This matrix arises from a non-linear optimization (Karush-Kuhn-Tucker) for finding the optimal power flow.
This matrix originates from a very large electrical network simulation.
This matrix has been obtained to find the deformed configuration of an axial-symmetric porous medium subject to a pore-pressure drawdown through a meshless Petrov-Galerkin discretization.