), but it is becoming a potentially faster implementation for sparse deep learning. However, it requires very high model sparsity to achieve speedups over the original dense implementations .
Dense matrix multiplication, i.e., general-purpose matrix multiplication (GEMM), has been well studied on GPUs to achieve high efficiency . However, the multiplication of a sparse matrix by a dense matrix (SpDM), in which the sparse matrix is stored in a memory-saving format like compressed row storage (CRS) , is understudied and easily loses efficiency on modern GPUs. For example, multiplying a sparse matrix with a sparsity of (i.e., of the elements are zeros) by a dense matrix in single precision requires with cuSPARSE on an Nvidia Tesla P100 GPU, while the corresponding dense algorithm in cuBLAS requires only . (Both cuSPARSE and cuBLAS are from the CUDA-8.0 library.) In other words, though the sparse matrix reduces the number of multiply-accumulate operations (MACs) by (a zero element times any number produces zero, which contributes nothing to the final result, so such operations can be avoided), the highly optimized cuBLAS is about faster than cuSPARSE in the above example. For a much higher sparsity of , cuSPARSE can be about faster than cuBLAS at a matrix dimension of on the P100 GPU. This high sparsity requirement makes SpDM difficult to deploy as an efficient implementation of matrix multiplication, owing to the inefficient design of the SpDM algorithm in cuSPARSE. In practical problems, on one hand, if the sparsity is not high enough, SpDM can be very inefficient, while the dense form yields results faster provided there is enough memory; on the other hand, if the sparsity is very high, the dense form not only runs inefficiently but also wastes memory. From our empirical studies of cuSPARSE and cuBLAS, the sparse algorithm of cuSPARSE requires the matrix sparsity to be larger than to outperform its dense counterpart in cuBLAS.
One of our key observations of cuSPARSE is that its SpDM APIs perform many slow memory accesses that easily leave the computational resources (i.e., cores) stalled. To this end, we set out to design an efficient SpDM algorithm that better utilizes the GPU computational resources.
Only a small number of research works focus on high-performance SpDM algorithms for modern GPUs. The most relevant works are ,  and . On one hand, Ortega et al.  optimize the GPU memory access pattern (i.e., coalesced memory access) to achieve higher efficiency. On the other hand, besides optimizing for coalesced memory access, Yang et al.  apply the principles of row split  and merge path
from sparse matrix-dense vector multiplication (SpMV) to design more efficient SpDM algorithms on GPUs. Jiang et al.  mainly re-order the row data, and Parger et al.  propose a parameter-tuning technique to optimize SpDM performance. However, in
, the authors design their algorithms mainly for cases where the dense matrices are tall and skinny, and a heuristic is required to choose between the merge-based and row-split schemes for better performance. In this paper, we not only exploit GPU algorithm optimization principles (e.g., coalesced memory access), but also revisit the popular roofline performance model on GPUs to analyze how to increase operational intensity, and we then propose an efficient SpDM algorithm. Our contributions are summarized as follows:
We design an efficient SpDM algorithm called GCOOSpDM on GPUs with several optimization techniques, including coalesced memory access, bank-conflict-free use of the shared memory, and a high computation-to-memory ratio.
We evaluate the proposed algorithm on a large number of sparse matrices, including the public dataset  and randomly generated matrices, and the experimental results show that GCOOSpDM is 1.5-8 times faster than cuSPARSE on a large proportion of matrices on Nvidia GPUs.
We conduct an instruction-level analysis of the GCOOSpDM and cuSPARSE kernels, and the profiled results confirm that our proposed algorithm performs far fewer slow-memory accesses (DRAM and L2 cache) than cuSPARSE.
Compared to cuSPARSE, GCOOSpDM decreases the sparsity required to outperform the dense implementation of cuBLAS from to .
The rest of the paper is organized as follows. Section II introduces the preliminaries related to SpDM and GEMM. We present our proposed algorithm for efficient SpDM in Section III. The experimental evaluation and analysis are presented in Section IV. Section V introduces the related work, and finally we conclude the paper in Section VI.
A multiplication of two matrices and produces an result matrix , i.e.,
To simplify the analysis of the algorithms, we assume that the dimensions of A and B are both . The sparsity of matrix A is defined as the ratio of the number of zero elements over the total number of elements.
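As a concrete check of this definition, a minimal NumPy sketch (illustrative only; the helper name `sparsity` is ours, not from the paper):

```python
import numpy as np

def sparsity(A):
    """Sparsity = ratio of zero elements over the total number of elements."""
    return 1.0 - np.count_nonzero(A) / A.size

A = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 0.0]])
print(sparsity(A))  # 7 zeros out of 9 elements -> 0.777...
```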
II-A The roofline model
The roofline model  is commonly used in performance modeling of multi-core/many-core architectures such as GPUs . The operational intensity (operations per byte of DRAM traffic) is defined to predict the performance of kernels. In the model, the GPU throughput is bounded above once the operational intensity reaches some threshold, which indicates the program is computation-bound. If the operational intensity is smaller than the threshold, the GPU throughput is a linear function of it, which indicates the program is memory-bound. Using cuBLAS GEMM as an example, in Fig. 1 we compare the experimental throughput of dense matrix multiplication with the theoretical throughput from the roofline model on two different Nvidia GPUs, GTX980 and Titan X.
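The roofline bound itself is easy to sketch: the attainable throughput is the minimum of the compute peak and the memory bandwidth times the operational intensity. The peak and bandwidth numbers below are illustrative assumptions, not the measured values behind Fig. 1:

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Attainable throughput under the roofline model: the memory-bound
    region grows linearly with operational intensity (FLOPs per byte),
    then saturates at the compute peak."""
    return min(peak_gflops, bandwidth_gb_s * intensity)

# Hypothetical GPU numbers for illustration: 6 TFLOPS peak, 336 GB/s.
peak, bw = 6000.0, 336.0
ridge = peak / bw  # intensity threshold separating the two regimes
print(roofline_gflops(1.0, peak, bw))    # memory-bound region
print(roofline_gflops(100.0, peak, bw))  # compute-bound region
```

Below the ridge point a kernel is memory-bound, which is exactly the regime SpDM kernels usually sit in.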
Though GEMM in cuBLAS achieves nearly optimal throughput for matrix multiplication, directly applying GEMM to sparse matrices results in many useless calculations due to the large number of zeros. The irregular distribution of non-zero elements in sparse matrices makes data access from global memory to registers the bottleneck of matrix multiplication. In other words, each read of data from the sparse matrix supports only a limited number of computational operations. Therefore, algorithms for SpDM are generally memory-bound, and for such problems one should design the algorithm to increase the operational intensity in order to achieve higher efficiency.
II-B GPU memory hierarchy
From the roofline model, one should improve memory access efficiency to fully utilize the computational power of GPUs. There are several types of memory in the GPU memory hierarchy. Ordered from fastest to slowest access, they are the registers, the shared memory (or L1 cache), the L2 cache and the global memory . The shared memory and the global memory are the two kinds that can be flexibly manipulated by programming. In general, data that is used repeatedly should be placed in the shared memory or registers for better utilization of the GPU cores.
II-C COO: The coordinate storage format
Assume that the matrix is row-major. The coordinate storage format (COO)  is a simple storage scheme for sparse matrices. COO uses an array to store the values of all non-zero elements, and the coordinate information of each non-zero element is stored sequentially in a row-index array and a column-index array respectively. Take a real-valued sparse matrix as an example:
the COO format of A is represented by
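As a minimal sketch of the COO construction (the array names `values`, `rows` and `cols` are our placeholders for the paper's elided symbols):

```python
import numpy as np

def dense_to_coo(A):
    """Row-major scan: store each non-zero value with its coordinates."""
    values, rows, cols = [], [], []
    for r in range(A.shape[0]):
        for c in range(A.shape[1]):
            if A[r, c] != 0:
                values.append(float(A[r, c]))
                rows.append(r)
                cols.append(c)
    return values, rows, cols

A = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0],
              [5.0, 0.0, 6.0, 0.0]])
print(dense_to_coo(A))
```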
III Efficient Algorithm Design
In this section, we describe the design of our proposed efficient SpDM algorithm on GPUs, including the customized storage format for sparse matrices and the conversion from the dense form. Based on the above analysis of SpDM operations on GPUs, we first design a new sparse format called grouped COO (GCOO), which is convenient for coalesced memory access and helps increase the operational intensity . Then we propose an efficient SpDM algorithm based on GCOO.
III-A GCOO: Grouped COO storage format
A format similar to GCOO is the sliced COO (SCOO) format proposed in , with which the authors achieved higher throughput for sparse matrix-vector multiplication (SpMV) on both CPUs and GPUs. In this paper, we adapt the idea of SCOO to propose GCOO for matrix multiplication. The sparse matrix is partitioned into groups according to the number of columns , and each group is stored in the COO format, hence the name GCOO. For an matrix stored in the GCOO format, there are groups, and each group contains columns except the last one, which has columns. If is divisible by , then the last group also has columns. In GCOO, each group is stored in the COO format, and the COOs of all groups are concatenated into one array. Let group be stored in the COO format with , and , where . The stored values of GCOO are , and , generated by concatenating , and respectively.
An example of grouping in matrix A is shown in Fig. 2. Matrix A is divided into 2 groups. Group is represented by , and ; and group is represented by , and . Finally, the two groups are concatenated into one array, with an extra index array indicating which positions correspond to which groups. Therefore, the final stored format of GCOO is as follows:
where is an auxiliary array storing the group indexes. Note that , and in GCOO are not the same as those of COO, since each single group in GCOO is itself in a COO format. In order to access each group's elements easily, we use an extra auxiliary array, , to store the number of non-zero elements in each group. In the above example, the values of should be:
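The grouping can be sketched as follows. We store each column group in COO order — here column-major within a group, so that non-zeros sharing a column sit next to each other, matching the reuse scan described later; the paper's exact intra-group ordering is elided. The names `dense_to_gcoo` and `group_nnz` are ours:

```python
import numpy as np

def dense_to_gcoo(A, g):
    """Partition columns into groups of width g; store each group in COO
    (column-major within the group) and concatenate the groups.
    group_nnz[i] counts the non-zeros of group i."""
    n_rows, n_cols = A.shape
    values, rows, cols, group_nnz = [], [], [], []
    for start in range(0, n_cols, g):
        end = min(start + g, n_cols)
        nnz = 0
        for c in range(start, end):      # column-major within the group
            for r in range(n_rows):
                if A[r, c] != 0:
                    values.append(float(A[r, c]))
                    rows.append(r)
                    cols.append(c)
                    nnz += 1
        group_nnz.append(nnz)
    return values, rows, cols, group_nnz

A = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0],
              [5.0, 0.0, 6.0, 0.0]])
print(dense_to_gcoo(A, 2))
```

The prefix sums of `group_nnz` give the start offset of each group inside the concatenated arrays.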
In practice, GCOO uses slightly more memory than COO and CSR, but it offers a higher probability of convenient data reuse. A comparison of the memory consumption for storing an matrix with a sparsity of (note that ) is shown in Table I.
The main advantage of GCOO is that it helps reuse data fetched from the slow memories (e.g., the global memory and the L2 cache). Specifically, if two or more consecutive non-zero elements in one group lie in the same column, then the fetched element of the dense matrix B can be reused from a register instead of being read from the slow memory again.
III-B Matrix conversion to GCOO
In cases where the input matrices A and B are stored in dense form, there is extra overhead in format conversion before the SpDM algorithm can be applied. For example, cuSPARSE provides the API "cusparseSdense2csr" to convert a dense matrix to the CSR format so that one can invoke its SpDM APIs. For our proposed GCOO, we likewise need an efficient scheme to convert a dense matrix to GCOO. We use two steps to do so.
Step 1: Count the number of non-zero elements. To convert a matrix from dense to sparse form, one should first count its number of non-zero elements () in order to allocate memory according to . For GCOO, the matrix is pre-grouped by the pre-defined , so it is straightforward to count the non-zero elements of different groups in parallel, from which the array can also be derived. Therefore, in this step, , and can be calculated by scanning the original dense matrix.
Step 2: Store the non-zero elements into , and . First, the memory for , and is allocated according to ; then we read the non-zero elements along with their coordinates and write them into , and in parallel, at the offsets given by .
The pseudocode of the matrix conversion on the GPU from the dense form to GCOO is shown in Algorithm 1.
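Algorithm 1 is not reproduced here; the two steps can be sketched sequentially as follows (the per-group loops are where a GPU would parallelize, and the function and array names are our assumptions):

```python
import numpy as np

def gcoo_convert_two_pass(A, g):
    """Two-pass dense->GCOO conversion mirroring the text:
    pass 1 counts non-zeros per column group (so memory can be allocated),
    pass 2 writes values/rows/cols at each group's offset
    (column-major within a group)."""
    n_rows, n_cols = A.shape
    n_groups = (n_cols + g - 1) // g
    # Step 1: count non-zeros in each column group (parallelizable per group).
    group_nnz = np.zeros(n_groups, dtype=np.int64)
    for c in range(n_cols):
        group_nnz[c // g] += np.count_nonzero(A[:, c])
    nnz = int(group_nnz.sum())
    # Allocate once, based on the counts.
    values = np.empty(nnz)
    rows = np.empty(nnz, dtype=np.int64)
    cols = np.empty(nnz, dtype=np.int64)
    # Step 2: write each group's elements starting at its prefix-sum offset.
    offsets = np.concatenate(([0], np.cumsum(group_nnz)[:-1]))
    for grp in range(n_groups):
        k = int(offsets[grp])
        for c in range(grp * g, min((grp + 1) * g, n_cols)):
            for r in range(n_rows):
                if A[r, c] != 0:
                    values[k] = A[r, c]
                    rows[k] = r
                    cols[k] = c
                    k += 1
    return values, rows, cols, group_nnz

A = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0],
              [5.0, 0.0, 6.0, 0.0]])
print(gcoo_convert_two_pass(A, 2))
```

Separating the counting pass from the writing pass is what makes the GPU version possible: the allocation size and every group's write offset are known before any element is stored.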
III-C GCOOSpDM: an efficient SpDM algorithm
In the proposed algorithm GCOOSpDM, we focus on four factors that have a major impact on performance. 1) Data partition for the CUDA execution context . 2) Coalesced global memory access to the sparse matrix A and the two dense matrices B and C. 3) When exploiting the faster shared memory on Nvidia GPUs, we guarantee that shared-memory accesses are free of bank conflicts. 4) After accessing a single element of the dense matrix B, we strive to calculate more results for C, i.e., to achieve a higher operational intensity and thus higher GFLOPS.
Data partition of matrices. In the CUDA context, a thread is the smallest execution unit of instructions. A group of threads forms a thread block, which is executed on a streaming multiprocessor (SM) of the GPU. Multiple thread blocks form a grid, and several thread blocks are executed in parallel on different SMs at a time. Let denote the size of a thread block. In our algorithm, each thread block calculates elements of C separately, so a resulting matrix requires thread blocks. All threads in a thread block share one group of sparse data of A, but each thread reads contiguous columns of B to perform the multiply-and-add operations for the corresponding contiguous columns of C. An example of data partition for and is shown in Fig. 3. The grid has 6 thread blocks; each thread block contains threads and calculates 8 elements of C, and each thread calculates elements of C.
Coalesced memory access. Three matrices, the sparse matrix A in the GCOO format and the two dense matrices B and C, need to interact with the global memory. Irregular global memory access results in performance degradation on modern GPUs, so we should read the input matrices (A and B) and write the output matrix C in a coalesced way.
First, consider the sparse matrix A stored in the GCOO format. Since each group in the GCOO of A is assigned to one thread block, we only need to consider the block-level access of one group, i.e., a COO format covering columns. The number of floating-point operations is determined by the number of non-zero elements of A, so we scan the COO to find the corresponding columns of B. Due to sparsity, the COO group does not contain many elements, which means we can load it into the shared memory so that all threads can read the data quickly. Therefore, the threads in one thread block read the elements of the COO from the global memory into the shared memory in a coalesced way. After A has been put into the shared memory, there is no need to re-read its elements from the global memory.
Second, access to the dense matrix B should be organized carefully. The matrix B only needs to be accessed after a of the COO has been read from the shared memory; since every thread reads the same , the corresponding column of B is the same while the rows differ, which keeps all the threads busy and the work balanced. Thus the threads of the current block read respectively. In order to support coalesced memory reads of B, the row elements should lie in contiguous memory. This is easy to achieve: we can simply transpose B or store B as a column-major matrix so that the above elements are contiguous in memory.
Finally, for the result matrix C, each thread should write the matrix only once, with its final result, to achieve higher throughput. As discussed above, thread reads of A and multiplies it with the elements indexed by in B, so the write position in C should be . As a result, C should also be column-major or transposed for coalesced memory writes.
Bank-conflict-free access of the shared memory. The shared memory in our algorithm is used only for the sparse matrix A in the COO format (within one thread block). The kernel allocates a fixed size of shared memory, and the threads in one thread block read non-zero elements of A at a time. Since all the threads in one thread block need to read all elements of A to calculate the corresponding columns of C, all threads read the same element of A at each step. Therefore, the data in the shared memory is accessed by all threads in a broadcast manner , which cannot cause any bank conflict, and a broadcast access of the shared memory takes only a very small number of clock cycles to fetch the data.
High computation-to-memory ratio. Achieving a high operational intensity is essential for high throughput. Regarding the multiplication and accumulation of each thread, each thread reads the shared memory of A to get (denoted by ), and then multiplies it by (denoted by ) of B. In this scenario, we have two opportunities to perform more calculations with and , since both have already been loaded into registers. The first is to find another element of B to multiply with ; however, the other elements that could be multiplied with have been assigned to other blocks, so this opportunity cannot be exploited. The second is to find a next element of A that has the same column as the previous one but a different row, i.e., . Therefore, we can search for the next (since A has been loaded into the shared memory, the cost of searching is low) to reuse . If such an exists, then we obtain additional multiply-accumulate operations without an extra global memory (or L2 cache) access, which results in a higher operational intensity. For a uniformly distributed sparse matrix with a sparsity of , there could be non-zero elements in the same column.
According to the above four criteria, we summarize the GCOOSpDM algorithm in the following three steps.
Step 1. Each thread block iteratively reads the COO values into the shared memory so that all threads in the block can read the COO values for their rows. We know exactly which columns need to be calculated in the current thread block.
Step 2. The thread scans the COO items from the shared memory; each item contains , and . According to , the thread reads the element of B and then performs the multiplication , whose result is accumulated into the local variable , i.e., .
Step 3. Since the current group of data is stored in the COO format, for the current element its next element should have the same index if that column has more than one non-zero element. So we continue scanning the shared memory to check whether there are elements with the same , in which case we can reuse the element of .
The visualization of the algorithm executed under the CUDA programming model is shown in Fig. 3. At the grid level, there are 6 thread blocks, and each thread block calculates the results of a sub-matrix of size from rows of B and columns (i.e., one group in GCOO) of A. At the thread-block level, the GCOO data of the sparse matrix is loaded once into the faster memory (the shared memory), which is shared among all the threads in the thread block. At the thread level, each thread independently takes charge of computing elements of C; that is, the thread scans the shared memory to read , and , then reads the values in column of B, multiplies each by separately, and accumulates each result into column of C. The algorithm of GCOOSpDM is shown in Algorithm 2.
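The three steps can be emulated sequentially in Python. This is a sketch of the data flow, not the CUDA kernel: one iteration of the `j` loop plays the role of one thread, and the inner scan reuses the fetched element `b` for consecutive same-column entries, which is the reuse described in Step 3. The GCOO here is built inline with column-major ordering inside each group (our assumption):

```python
import numpy as np

def gcoo_spdm(values, rows, cols, group_nnz, B, n):
    """Sequential emulation of GCOOSpDM: for each group, each emulated
    'thread' j scans the group's COO entries; a fetched element
    b = B[c, j] is reused for every consecutive entry sharing column c."""
    C = np.zeros((n, B.shape[1]))
    start = 0
    for nnz in group_nnz:
        for j in range(B.shape[1]):      # one emulated thread per column of C
            k = start
            while k < start + nnz:
                c = cols[k]
                b = B[c, j]              # single fetch from the dense matrix
                # Consume every consecutive entry with the same column index,
                # reusing b instead of re-fetching it.
                while k < start + nnz and cols[k] == c:
                    C[rows[k], j] += values[k] * b
                    k += 1
        start += nnz
    return C

# Build a small random sparse A and its GCOO (group width g = 2,
# column-major inside each group).
rng = np.random.default_rng(0)
A = rng.random((6, 6)) * (rng.random((6, 6)) < 0.3)
B = rng.random((6, 4))
g, values, rows, cols, group_nnz = 2, [], [], [], []
for s in range(0, 6, g):
    nnz = 0
    for c in range(s, s + g):
        for r in range(6):
            if A[r, c] != 0:
                values.append(A[r, c]); rows.append(r); cols.append(c)
                nnz += 1
    group_nnz.append(nnz)
C = gcoo_spdm(values, rows, cols, group_nnz, B, 6)
print(np.allclose(C, A @ B))  # True
```

On a GPU, the `j` loop runs across threads, the group's COO sits in shared memory, and `b` stays in a register; the sequential version only shows that the reuse scan still computes the full product.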
In Algorithm 2, we first (lines 1-10) initialize some local variables, including the thread-level indexes of the output and the COO for the current thread block. Then we iteratively scan a block of COO in the for-loop at line 11; at each iteration, a thread block of COO values is loaded into the shared memory (lines 12-15). After that, each COO value in the shared memory is read by all the threads in the thread block, and the corresponding value in B is also read to calculate the result (lines 21-26). Instead of proceeding immediately, we keep the value of in a register and scan the shared COO to check whether we can reuse , so that fewer memory operations are required (lines 28-36). In this way, we achieve a higher operational intensity, i.e., is reused for more floating-point calculations. At the end, the local results of each thread are written back to C, which is stored in the global memory, at the corresponding indexes (lines 38-39). Note that the reads of both matrix A and matrix B from the global memory are coalesced, and the writing of results to matrix C is also coalesced. In terms of shared-memory access, the data is broadcast to all the threads in a warp within a small number of cycles.
IV Evaluation and Analysis
To show the effectiveness of our proposed algorithm, we conduct a variety of experiments across three Nvidia GPU cards (i.e., GTX 980, GTX Titan X Pascal and Tesla P100) using two kinds of data. The first is the public sparse matrix dataset , which contains matrices with different patterns, and the second is randomly generated matrices whose zero-valued elements follow a uniform distribution. (Code for GCOOSpDM and the performance evaluation scripts can be found at https://github.com/hclhkbu/gcoospdm, and the raw data of our experimental results at https://github.com/hclhkbu/gcoospdm/tree/master/results.) The characteristics of the tested GPUs are shown in Table II. The installed software is CUDA-8.0.
| SMs × cores per SM | 16 × 128 | 28 × 128 | 56 × 64 |
| Memory Bandwidth (GB/s) | 224 | 433 | 732 |
IV-A Results on public sparse matrices
We use the public sparse matrices in . Since we only consider square matrices, we select all the square matrices in the dataset to evaluate the performance of GCOOSpDM and cuSPARSE. The chosen dataset contains 2694 matrices, whose sparsity is in the range of and whose dimensions are in the range of . The performance comparison between GCOOSpDM and cuSPARSE is shown in Fig. 4, where denotes the execution time of . We first compare the overall performance of our algorithm with cuSPARSE on the 2694 matrices, and we then choose 14 types of matrices from a variety of applications to compare the performance of the algorithms.
Overall performance. Among the 2694 tested matrices, GCOOSpDM outperforms cuSPARSE on about of the matrices on the P100 GPU, and on more than of the matrices on both GTX980 and TitanX. The average speedups are , and on GTX980, TitanX and P100 respectively, and the maximum speedups of GCOOSpDM are , and on the three GPUs. By contrast, on the matrices where cuSPARSE is better than GCOOSpDM on the P100 GPU, cuSPARSE outperforms GCOOSpDM by only about on average. On the GTX 980 and Titan X GPUs, cuSPARSE outperforms GCOOSpDM by about on about of the matrices. cuSPARSE performs better on the P100 GPU than on the GTX 980 and TitanX GPUs mainly because the P100 has a much higher memory bandwidth than the other two GPUs, as shown in Table II.
| human_gene1 | 22283 | 2.49e-02 | Undirected Weighted Graph |
14 types of matrices. It can be seen that GCOOSpDM does not always outperform cuSPARSE. To understand the main reasons, we select 14 types of matrices that have different structures and non-zero patterns from a range of areas and analyze their performance differences. The details of the selected matrices are shown in Table III. To normalize the algorithm performance, we use effective GFLOPS to measure the algorithms, as defined in the following equation
The performance comparison is shown in Fig. 5. On three matrices ("nemeth11", "plbuckle" and "fpga_dcop_01"), GCOOSpDM is worse than cuSPARSE due to the non-zero distribution of those matrices: their non-zero elements are located mainly on the diagonal, so there is little opportunity to reuse the pre-fetched value of (i.e., the check at line 30 terminates the scan immediately and no further calculations are performed with the current ), yet the algorithm still spends extra overhead searching A.
IV-B Random sparse matrices
We randomly generate square matrices whose dimensions are in the range of with a step size of . For each matrix size, we generate elements with sparsity in two ranges (i.e., at a step of and at a step of ). In total, there are matrices with uniformly distributed non-zero elements for evaluation.
Overall performance. The performance comparison between GCOOSpDM and cuSPARSE on the randomly generated matrices is shown in Fig. 6. Our GCOOSpDM algorithm outperforms cuSPARSE on , and of the matrices on the GTX980, TitanX and P100 GPUs respectively, with average speedups of , and and maximum speedups of , and . The cases where cuSPARSE beats GCOOSpDM occupy only a very small proportion (less than ), and the average performance ratio there is only around , indicating very close performance in those few cases.
Time vs. sparsity. Having shown the efficiency of GCOOSpDM over a large range of matrices and sparsity, we further study how the performance relates to the sparsity . We take two matrices of medium () and large () dimensions to show the relationship between performance and sparsity, with the sparsity kept in the range . Here we also include the time cost of the dense algorithm from cuBLAS, so that we can understand at what sparsity GCOOSpDM outperforms cuBLAS. The results for these two matrix sizes on the GTX980, TitanX and P100 GPUs are shown in Figs. 10, 11 and 12, respectively. On one hand, cuBLAS has a constant time cost as the sparsity increases, since the dense algorithm does not take zero values into account. On the other hand, the sparse algorithms of cuSPARSE and GCOOSpDM show a roughly linear speedup as the sparsity increases. For the two given matrix dimensions, GCOOSpDM outperforms cuSPARSE at all sparsity levels. When the sparsity exceeds some threshold, the sparse algorithm gains an advantage over the dense one; however, cuSPARSE requires the sparsity to be as high as to outperform cuBLAS, while our proposed GCOOSpDM outperforms cuBLAS at sparsity larger than . In summary, for sparsity larger than , the GCOOSpDM algorithm is a better choice for matrix multiplication on current GPUs than cuSPARSE and cuBLAS.
Performance vs. matrix size. To further show the sensitivity of the algorithm to the matrix size, we plot the throughput (GFLOPS) over a range of matrix dimensions (i.e., ) at two sparsity levels, and . The experimental results are shown in Figs. 10, 11 and 12 for the three GPUs. On all three tested GPUs, GCOOSpDM outperforms cuSPARSE for all values of and both sparsity levels. For small matrices (e.g., ), cuBLAS still outperforms GCOOSpDM, since calculating small matrices takes only a small number of cycles while GCOOSpDM incurs extra overhead for memory allocation and matrix conversion. At sparsity of and , GCOOSpDM achieves performance similar to (or slightly better than) cuBLAS. At a sparsity of , cuSPARSE achieves performance close to cuBLAS, while GCOOSpDM outperforms cuBLAS by up to times.
IV-C Breakdown of time costs
In this subsection, assuming that A and B are both given in dense form, with A of high sparsity, we present the time costs of matrix conversion and kernel calculation needed to complete the matrix multiplication using the sparse algorithm. The overheads fall into three categories: memory allocation for the sparse matrix storage, matrix conversion from the dense form to the sparse form, and the SpDM kernel calculation. We combine the first two categories into an extra overhead (EO), and treat the third as the real time cost of kernel calculation (KC). The metrics EO and KC are used to compare GCOOSpDM and cuSPARSE. Instead of using all three GPUs, we choose only the TitanX GPU as our analysis platform, since the three GPUs should have similar time distributions. As in the previous subsection, we use two matrix sizes (i.e., and ) with a sparsity of for comparison. The results are shown in Fig. 13. It can be seen that EO occupies only a small proportion of the total time, and GCOOSpDM and cuSPARSE have very close EO; the dominant part is the execution time of the kernel that calculates the matrix multiplication.
IV-D Instruction analysis
In this subsection, we compare the instruction distributions of cuSPARSE and GCOOSpDM and explore how the matrix dimension and the sparsity affect them. The instruction distribution consists of the runtime statistics of kernel instructions executed on the real GPU hardware. Not only does it help reveal the major performance bottleneck of a GPU kernel, but it also helps determine some quantitative relationships between instructions and kernel performance.
We use nvprof (http://docs.nvidia.com/cuda/profiler-users-guide) to collect the runtime instructions of different types, including single-precision floating-point operations, DRAM memory accesses, L2 cache accesses, shared memory accesses and L1/texture memory accesses. We use the TitanX GPU as our testbed in the profiling experiments; the other two GPU platforms, GTX980 and P100, can be analyzed with the same methodology.
We conduct two sets of random sparse matrix experiments on cuSPARSE and GCOOSpDM respectively. First, we fix the matrix sparsity at and scale the matrix dimension from to ; this setting explores how affects the instructions of the two algorithms. Second, we fix the matrix dimension at and scale the matrix sparsity from to ; this setting explores how affects the instructions. Furthermore, we can also observe the difference between the instruction distributions of cuSPARSE and GCOOSpDM under the same experimental setting. The results are shown in Fig. 14, in which denotes the number of DRAM memory access transactions, the number of L2 cache access transactions, the number of shared memory access transactions, and the number of L1/texture memory access transactions. We find that the DRAM memory access transactions of both algorithms account for only a small percentage of the total memory access transactions. Recall that the DRAM memory has the highest access latency and lowest throughput in the GPU memory hierarchy; avoiding very frequent DRAM memory accesses helps decrease the data-fetch overhead of kernel execution. Both cuSPARSE and GCOOSpDM have well-organized data access patterns that utilize the L2 cache and the on-chip cache (including the shared memory and L1/texture cache). However, the dominant memory access instructions of the two algorithms differ: accounts for the great majority in cuSPARSE, while , and take approximately equal shares in GCOOSpDM. GCOOSpDM makes higher use of the on-chip cache of GPUs than cuSPARSE, so it generally outperforms cuSPARSE on randomly generated sparse matrices, which confirms the experimental results in Fig. 6.
We then focus on how and influence the numbers of those major memory access instructions. The top two figures in Fig. 14 show the effects of on cuSPARSE and GCOOSpDM respectively, while the bottom two show the effects of . We observe that of cuSPARSE and , and of GCOOSpDM all show quadratically increasing trends with respect to . This is reasonable, since the output matrix C has elements, each of which requires a nearly equal workload of one vector dot product. However, the effects of differ: of cuSPARSE shows a nearly quadratic decreasing trend with respect to , while , and of GCOOSpDM show nearly linear decreasing trends. These observations are also reflected in the performance as a function of and , as illustrated in Fig. 15. On one hand, as the matrix size increases, the performance of both cuSPARSE and GCOOSpDM shows similar quadratically increasing trends, which matches the behavior of their dominant memory instructions. On the other hand, as the sparsity increases, the performance of cuSPARSE shows an approximately quadratic decreasing trend, while that of GCOOSpDM shows a linear decreasing trend; these again mirror the effects of on the dominant memory instructions of the two algorithms.
V Related work
Multiplication of a sparse matrix by a dense vector (SpMV) on GPUs has been well studied (e.g., ). Although SpDM can be implemented as multiple SpMVs, the performance can be poor due to the large number of kernel invocations when the matrix dimension is large. Nevertheless, some SpMV optimization principles can be applied to SpDM; for example, Yang et al.  use row split  and merge path  to design SpDM algorithms particularly for tall-skinny matrices.
Regarding SpDM algorithm analysis, Greiner et al.  propose an I/O model to interpret the lower bound of efficient serial algorithms. Cache-oblivious dense and sparse matrix algorithms are presented by Bader et al. for multi-core CPUs . Performance benchmarks  have been conducted to evaluate the efficiency of different sparse matrix formats for SpDM. Koanantakool et al.  introduce communication-avoiding SpDM algorithms applied in distributed memory systems. Recent work designs row reordering techniques to achieve better data temporal locality  and dynamic parameter tuning  to improve SpDM performance on GPUs.
VI Conclusion and Future Work
Sparse-dense matrix-matrix multiplication is commonly used in many scientific computing areas, while designing such algorithms on modern GPUs is non-trivial due to the irregular structure of the sparse matrix. In this paper, we propose an efficient sparse matrix-dense matrix multiplication algorithm on GPUs, called GCOOSpDM. The main optimization techniques used in our algorithm are coalesced global memory access, proper usage of shared memory, and reuse of data fetched from the slow global memory. The experimental results show that our proposed algorithm outperforms the vendor library cuSPARSE by several times on both the public sparse dataset and randomly generated matrices on three recent Nvidia GPUs (i.e., GTX 980, Titan X Pascal, and Tesla P100). We also analyze the performance improvement in depth at the instruction level to understand why GCOOSpDM performs better than cuSPARSE. The key observation of the instruction-level analysis is that the reduced number of global memory accesses contributes significantly to the performance gain.
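As a point of reference for the storage format, a plain COO-based SpDM can be sketched in a few lines of NumPy. This is only an illustrative serial reference, not the paper's GCOOSpDM GPU kernel (which adds grouping of columns, coalesced global memory access, and shared-memory staging on top of COO); the function name `coo_spdm` is ours.

```python
import numpy as np

def coo_spdm(rows, cols, vals, B, m):
    """Multiply a sparse m x k matrix, given as COO triplets
    (rows, cols, vals), by a dense k x n matrix B."""
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for r, c, v in zip(rows, cols, vals):
        # Each non-zero a_rc contributes v * B[c, :] to row r of C,
        # so zero elements never generate any MACs.
        C[r, :] += v * B[c, :]
    return C

# Tiny example: a 3x3 sparse matrix with 3 non-zeros.
rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
vals = np.array([1.0, 2.0, 3.0])
B = np.arange(9, dtype=np.float64).reshape(3, 3)
C = coo_spdm(rows, cols, vals, B, m=3)
```

The loop touches each non-zero exactly once and streams a full row of B per non-zero; the GPU version's challenge is to do this with coalesced accesses and cached reuse of B, which is where the shared-memory and grouping optimizations described above come in.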
It is difficult for a single algorithm to fit all matrix structures, sparsities, and types of GPUs. Auto-tuning plays an important role in finding efficient configurations or implementations for different cases. In future work, we would like to apply an auto-tuning scheme to set proper parameters for our GCOOSpDM algorithm, and to extend the GCOO storage format to the multiplication of two sparse matrices.
-  (2016) Performance, design, and autotuning of batched GEMM for GPUs. In International Conference on High Performance Computing, pp. 21–38. Cited by: §I.
-  (2008) Cache oblivious dense and sparse matrix multiplication based on Peano curves. In Proceedings of the PARA, Vol. 8. Cited by: §V.
-  (2009) Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the conference on high performance computing networking, storage and analysis, pp. 18. Cited by: §I, §II-C, §V.
-  (2009) Practical random linear network coding on GPUs. In International Conference on Research in Networking, pp. 573–585. Cited by: §I.
-  (2012) The sliced COO format for sparse matrix-vector multiplication on CUDA-enabled GPUs. Procedia Computer Science 9, pp. 57–66. Cited by: §III-A.
-  (2011) The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38 (1), pp. 1. Cited by: §IV-A, §IV.
Compressed row storage.
Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Eds. Philadelphia: SIAM. Cited by: §I.
-  (2013) Performance evaluation of algorithms for sparse-dense matrix product. In Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. 1. Cited by: §V.
-  (2014) Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 769–780. Cited by: §V.
-  (2010) The I/O complexity of sparse matrix dense matrix multiplication. In Latin American Symposium on Theoretical Informatics, pp. 143–156. Cited by: §V.
-  (2017) Auto-tuning strategies for parallelizing sparse matrix-vector (SpMV) multiplication on multi-and many-core processors. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pp. 713–722. Cited by: §V.
-  (2020) A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 376–388. Cited by: §I, §V.
-  (2011) Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model. Computer Physics Communications 182 (6), pp. 1201–1207. Cited by: §II-A.
-  (2016) Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In Parallel and Distributed Processing Symposium, 2016 IEEE International, pp. 842–853. Cited by: §V.
A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling. Journal of Parallel and Distributed Computing 107, pp. 37–56. Cited by: §II-A.
-  (2012) Autotuning GEMM kernels for the fermi GPU. IEEE Transactions on Parallel and Distributed Systems 23 (11), pp. 2045–2057. Cited by: §I.
-  (2013) Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–10. Cited by: §I.
-  (2015) . In , pp. 806–814. Cited by: §I.
-  (2018) G-CRS: GPU accelerated cauchy reed-solomon coding. IEEE Transactions on Parallel and Distributed Systems 29 (7), pp. 1484–1498. Cited by: §I.
-  (2011) Multi-level optimization of matrix multiplication for GPU-equipped systems. Procedia Computer Science 4, pp. 342–351. Cited by: §I.
-  (2017) Dissecting GPU memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems 28 (1), pp. 72–86. Cited by: §II-B.
-  (2014) Benchmarking the memory hierarchy of modern GPUs. In IFIP International Conference on Network and Parallel Computing, pp. 144–156. Cited by: §II-B.
-  (2016) Merge-based parallel sparse matrix-vector multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 58. Cited by: §I, §V.
Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119. Cited by: §I.
-  (2010) An improved MAGMA GEMM for Fermi graphics processing units. The International Journal of High Performance Computing Applications 24 (4), pp. 511–515. Cited by: §I.
-  (2010) Programming guide. Cited by: §III-C, §III-C.
-  (2013) FastSpMM: an efficient library for sparse matrix matrix product on GPUs. The Computer Journal 57 (7), pp. 968–979. Cited by: §I.
-  (2020) spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 362–375. Cited by: §I, §V.
-  (2017) Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724. Cited by: §I.
-  (2019) A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 2238–2247. Cited by: §I.
-  (2017) meProp: sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3299–3308. Cited by: §I.
-  (2001) All-pairs shortest paths computation in the BSP model. In International Colloquium on Automata, Languages, and Programming, pp. 178–189. Cited by: §I.
-  (2010) A matrix approach to tomographic reconstruction and its implementation on GPUs. Journal of Structural Biology 170 (1), pp. 146–151. Cited by: §I.
-  (2008) Benchmarking GPUs to tune dense linear algebra. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pp. 1–11. Cited by: §I, §II-B.
Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027. Cited by: §I.
-  (2009) Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52 (4), pp. 65–76. Cited by: §I, §II-A.
Demystifying tensor cores to optimize half-precision matrix multiply. In 2020 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020, Rio de Janeiro, Brazil, May 20-24, 2019, Cited by: §I.
-  (2018) Design principles for sparse matrix multiplication on the GPU. In International European Conference on Parallel and Distributed Computing (Euro-Par), Cited by: §I, §V.
-  (2017) Understanding the GPU microarchitecture to achieve bare-metal performance tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 31–43. Cited by: §I, §II-A.