I Introduction
Convolutional neural networks (CNNs) play a crucial role in machine learning and have facilitated the development of medical image classification[1, 2], natural language processing[3, 4], recommender systems[5],
etc. Improving the execution speed of CNNs is of great significance for promoting the development of machine learning. The convolution operation dominates the total execution time; its importance and huge computational cost therefore demonstrate that it needs to be optimized for high performance. The convolution operation in a convolutional neural network refers to the process in which the convolution kernel samples the feature map. In the sampling process, the convolution kernel carries out a weighted summation over each sampling area, and the entire feature map is sampled according to the stride size. The convolution operation requires a large number of multiplications and additions, so optimizing the convolution process is a vital task. From the hardware perspective, designing an FPGA or ASIC architecture suitable for convolution computation is a feasible acceleration strategy
[6]. Furthermore, Tensor Cores have been incorporated into the new Volta GPU architecture, which delivers several times the convolution performance of the previous architecture
[7]. From the software perspective, a series of algorithm-level optimization techniques have been developed: the im2col-based method converts the convolution operation into matrix multiplication[11], the FFT-based method transforms the convolution into multiplication via a frequency-domain transformation[9], and the Winograd-based method uses minimal filtering algorithms to reduce the number of multiplications and increase the calculation speed[10]. All of these software acceleration methods have been integrated into the CuDNN library by NVIDIA, which is currently the best GPU deep learning acceleration library
[8]. In addition, methods that optimize batching and tiling to accommodate small-scale convolution matrix multiplication[12], as well as parallel optimization methods adapted to GPUs, have been used to speed up the convolution operation[13]. Traditional optimizations of the convolution operation are usually designed for algorithmic generality and do not exploit the characteristics and development direction of deep neural networks. Since network pruning and the RELU activation function are common operations in deep neural networks, they result in a large number of zero values in the network. For the feature maps entering the convolution operation, previous studies have found that the sparsity of the feature map can reach 0.7 after multiple epochs of a deep network. The calculation of these zero values, however, is redundant for the convolution result
[14]. In other words, if we can skip the zero-value calculations in the convolution operation, we can reduce the multiplications and additions by 50%. For this reason, many efforts have focused on reducing the calculation of zero values in neural networks[18]: SqueezeFlow employs a PT-OS-sparse dataflow that removes ineffective computations while maintaining the regularity of CNN computations[15]; SCNN uses a novel dataflow that maintains sparse weights and activations in a compressed encoding, eliminating unnecessary data transfers and reducing storage requirements[16]; and [17] proposed a new FPGA hardware architecture that uses algorithmically predetermined structured sparsity to significantly reduce memory and computational requirements. These efforts have greatly promoted the acceleration of deep neural networks. However, some of this work does not apply to all network models, and some of the newly designed architectures cannot be quickly integrated into existing chips, so their substantial contribution to the development of deep neural networks is limited. Besides, some algorithms for compressing convolutions on the CPU have been proposed[14, 18, 19, 31, 32], but since GPU computing is universal for general deep neural networks, CPU-only approaches have certain limitations. Given the development trend of deep neural networks, the pursuit of precision leads to dozens of training iterations, so large sparsity in the weights and feature maps is inevitable, and convolution algorithms designed around sparsity can therefore be widely applied. To address the above circumstances, we propose a novel method for the convolution operation, which is based on GPUs and implements multithreaded computing. Our contributions can be summarized as follows:

A novel storage format for high-sparsity feature maps, which makes non-zero values contiguous on GPUs.

A convolution algorithm based on block compression and shared memory, which calculates the convolution by skipping zeros and reduces the amount of computation.

A dataset with feature maps from multiple common network models that can be used for research on convolution optimization.

An evaluation of the proposed algorithm against the most closely related works on GPUs. The results show that our method can achieve up to a 3.5X speedup over CuDNN.
The rest of this paper is organized as follows: Section II presents the background of convolution optimization. Section III introduces the motivation of our convolution method. Section IV describes the details of the proposed method. Section V introduces the new dataset of feature maps from multiple common network models and evaluates the proposed method. Section VI discusses related work. Section VII concludes this paper.
II Background
In this section, we introduce elementary knowledge about GPUs and the CUDA platform. Moreover, we also elaborate on some work on convolution operations on GPUs.
II-A Graphic Processing Units and CUDA platform
As computing demands continue to increase, data is required to be calculated in a limited amount of time, so the demand for parallel computing is also increasing. In turn, the performance requirements of the processor have also been greatly improved. In order to meet the needs of parallel computing, Graphic Processing Units have begun to be used in a large number of data processing industries.
A GPU is an array of Streaming Multiprocessors (SMs), and each SM contains multiple Streaming Processors (SPs)[22]. Therefore, data can be processed in parallel by multiple SPs. Each SM can access a register file that works at the same speed as the SPs[23], so accessing its storage units requires little waiting time. Further, in the Fermi architecture, each SM contains an L1 cache, and all SMs share an L2 cache. Each SM also has a shared memory that is similar to the cache in a CPU[24], but its data replacement is wholly controlled by the programmer; there is no hardware replacement logic.
The SPs inside an SM can access the shared memory, and this access is shared only within the SM. The L1 cache in each SM shares a 64K memory segment with the shared memory[25]. It is also worth noting that when more than one thread in an SM accesses the shared memory, a synchronization mechanism is needed to avoid race conditions and thread divergence[26].
Constant memory is a virtually addressed form of global memory and a type of read-only memory that is usually limited to 64K. Besides, constant memory supports broadcasting a single value to every thread in a warp. Global memory is the storage structure used for data communication between the GPU and the CPU; the CPU can write to global memory and access it through the PCIE bus. However, global memory access is slow, so staging global memory data through shared memory is a fundamental optimization method[27, 26].
Compute Unified Device Architecture (CUDA) is a computing platform designed by NVIDIA in conjunction with the GPU architecture. Using the CUDA platform to develop parallel tasks makes GPU programming more accessible and promotes the development of parallel applications. CUDA organizes threads into thread blocks, each of which shares a shared memory, and multiple thread blocks form a grid. Besides, due to the GPU's characteristics, memory is usually accessed in units of a warp, generally 32 threads[26, 27].
II-B Convolution operation
GPUs are very mature for matrix computations. The hardware architecture of the SP array can be used to perform matrix calculations so that the speed required for a specific computation can be satisfied.
In recent years, convolution has commonly been converted to matrix multiplication. As shown in Fig. 1, the values in each convolution window are expanded into one row of a matrix, and the convolution kernel is expanded into a vector. In doing so, the convolution calculation is converted to a matrix multiplication, and each tile of the matrix multiplication corresponds to one value of the convolution result, so the calculation speed is substantially increased[28, 29].
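As a concrete illustration of this expansion, here is a minimal single-channel im2col sketch (the function name and shapes are ours for illustration, not the CuDNN implementation):

```python
import numpy as np

def im2col(fmap, K, stride=1):
    """Expand each K x K convolution window of a square feature map
    into one row of a matrix."""
    N = fmap.shape[0]
    out = (N - K) // stride + 1
    rows = []
    for i in range(out):
        for j in range(out):
            window = fmap[i*stride:i*stride+K, j*stride:j*stride+K]
            rows.append(window.ravel())
    return np.array(rows)

# Convolution then reduces to a matrix-vector product with the
# flattened kernel: result = im2col(fmap, K, stride) @ kernel.ravel()
```

Each output element of the product corresponds to one convolution result, which is why the whole operation maps onto GEMM.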
In recent years, there have been many optimizations for GPUs for convolution in neural networks, which we will discuss in Section VI.
III Motivation
Convolutional neural networks are an important machine learning tool that extracts characteristic information from input data through multiple convolution layers. Convolution is a complex task, however, which requires sampling the entire feature map: the convolution kernel performs multiplication and addition operations in each sampling area across the whole feature map, so the number of additions and multiplications is enormous. For a feature map of size N × N and a kernel of size K × K, a convolution with stride S requires ((N-K)/S+1)^2 * K^2 multiplications and ((N-K)/S+1)^2 * (K^2-1) additions per channel.
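The sliding-window process and its operation counts can be sketched as follows (a single-channel illustration with names of our choosing, not code from the paper):

```python
import numpy as np

def conv2d_direct(fmap, kernel, stride=1):
    """Direct convolution: slide the kernel over the feature map and
    take a weighted sum over each sampling area."""
    N, K = fmap.shape[0], kernel.shape[0]
    out = (N - K) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = fmap[i*stride:i*stride+K, j*stride:j*stride+K]
            result[i, j] = np.sum(window * kernel)
    return result

def conv_op_counts(N, K, S):
    """Multiplications and additions for an N x N map, K x K kernel,
    stride S, single channel."""
    out = (N - K) // S + 1
    return out * out * K * K, out * out * (K * K - 1)
```

For example, a 5 × 5 map with a 3 × 3 kernel and stride 1 already needs 81 multiplications, which grows quickly with the feature map size.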
Besides, the feature maps of the deep convolution layers in a deep neural network become smaller; their size is usually small, e.g., 14 × 14 or 7 × 7. Since GPUs generally compute convolution by converting it into matrix multiplication, the number of threads is small due to the limited size of the feature map, and GEMM also has limited effect on small matrix multiplications.
As mentioned in Section I, the sparsity of deep feature maps is relatively large due to RELU activation and pruning operations. As Fig. 2 shows, the sparsity of the feature map entering a convolution layer can exceed 0.7 in a deep network. When converted to matrix multiplication, the resulting matrix is even more sparse. The usual calculation method computes these zero values, which is redundant for the final convolution result. In other words, useless calculations account for more than 70%, so we need to avoid this part of the calculation. To address the above challenges, we reduce the multiplication and addition operations in the convolution by sparse compression. To limit the time spent on compression, we replace the traditional matrix conversion with a compression operation whose cost is comparable to the im2col method used by Caffe [20]. Since the size of deep feature maps is limited, and thus the number of threads is limited, the conversion is 40% faster. After compression, the convolution is computed as a sparse matrix-vector multiplication (SpMV)[21], which avoids the redundant zero-value calculations. The compression process increases memory accesses; we limit this cost through blocking, rational allocation of shared memory, and a new storage format. The details are described in Section IV.
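The sparsity driving this design is easy to measure; a minimal sketch of how RELU activation produces the zero values being skipped (helper names are ours):

```python
import numpy as np

def sparsity(fmap):
    """Fraction of zero entries in a feature map."""
    return float(np.mean(fmap == 0))

def relu(x):
    """RELU zeroes all negative activations, increasing sparsity."""
    return np.maximum(x, 0)
```

On deep-layer activations, this fraction is the quantity reported above as exceeding 0.7.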
IV Proposed Convolution Method
In this section, we introduce our convolution method for convolutional neural networks in two subsections. The first subsection describes an innovative feature map storage format suitable for GPU computing, designed around the characteristics of the convolution calculation. In the second subsection, we describe our calculation algorithm in detail.
IV-A Novel storage format for GPU computing
Converting the feature map used for the convolution operation into a matrix that can be used for sparse matrix multiplication requires a format conversion of the original feature map. However, traditional sparse matrix storage formats are not suitable for the data layout of convolution operations; for example, the work in [19] converts to the CSR format, which requires many accesses to global memory. Therefore, we designed a sparse storage format for feature maps tailored to GPU computing.
As shown in Fig. 3, the feature map is vertically divided into three large convolution blocks according to the stride size and the height of the convolution kernel, and one convolution block corresponds to one row of the final convolution result. Within each convolution block, following the horizontal movement of the convolution kernel, it is divided into three convolution windows, and each convolution window corresponds to one convolution result.
We now describe the novel storage format. As shown in Fig. 3, we store the three non-zero values of the first window into a value array and store the corresponding values of the convolution kernel into a kernel array. The non-zero values of all remaining convolution windows are stored in turn, along with the corresponding convolution kernel values. Besides, the number of non-zero values in each convolution window is stored in a pointer array. In other words, the value array stores the non-zero values, the kernel array stores the kernel values, and the pointer array stores the numbers of non-zero values.
In contrast to the CSR storage format, we avoid storing index values of useless data and instead store the convolution kernel values that are useful for the convolution result. In this way, the storage format stores only the non-zero values of the feature map, which reduces accesses to global memory by 50%. Besides, the number of calculations required by each thread is specified by the value in the pointer array. If a thread has no non-zero value, -1 is stored as a marker.
The conversion from the feature map to the sparse format increases the number of memory accesses, so the conversion time theoretically increases. Therefore, we use blocking and shared memory in the algorithm to reduce the time consumption. Blocking is primarily used to improve data locality by enhancing data reuse on GPUs. Each block corresponds to a thread block on the GPU; resources within the block are shared, and each thread in a thread block corresponds to a convolution window. The feature map in global memory is loaded into the sparse matrix in the shared memory of the block. This improves access speed, and since the non-zero values are contiguous, it increases locality and facilitates the subsequent calculation.
As shown in Fig. 4, to reduce memory access time in the calculation, we load the non-zero values of the feature map into shared memory. The feature map is stored in global memory as a single array for continuous access by adjacent threads. However, this advantage is limited to the case where the convolution stride is 1.
For a single feature map, one thread block is allocated per convolution block, with one thread per convolution window. In the algorithm implementation, each thread examines the data at its relative positions in its convolution window; if a value is non-zero, it is stored at the next free position in the value array. Simultaneously, the convolution kernel value at that position is stored in the corresponding position in the kernel array, and the count is stored in the pointer array.
As shown in Algorithm 1, the conversion of an input feature map to the storage format described above is given from the view of a single thread. The value and kernel arrays are declared in shared memory, and the whole process transforms the feature map read from global memory into the new format in shared memory. The fourth line of code sets the starting address that the thread needs to access. The fifth line begins to determine whether a value is non-zero; if it is, it is stored in the corresponding value and kernel vectors. After the loop ends, the position where the non-zero values end is stored in the pointer vector, and -1 is stored if there is no non-zero value.
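A sequential Python analogue of this conversion may help fix ideas; the paper's version runs one GPU thread per window with the arrays in shared memory, and the array names and the -1 empty-window flag here are our illustrative conventions:

```python
import numpy as np

def compress_windows(fmap, kernel, stride=1):
    """Per convolution window, keep only the non-zero feature values
    and the kernel values at the same positions; record the count per
    window, with -1 marking an all-zero window."""
    N, K = fmap.shape[0], kernel.shape[0]
    out = (N - K) // stride + 1
    vals, kvals, ptr = [], [], []
    for i in range(out):
        for j in range(out):
            window = fmap[i*stride:i*stride+K, j*stride:j*stride+K]
            count = 0
            for r in range(K):
                for c in range(K):
                    if window[r, c] != 0:
                        vals.append(float(window[r, c]))   # feature value
                        kvals.append(float(kernel[r, c]))  # paired kernel value
                        count += 1
            ptr.append(count if count > 0 else -1)
    return vals, kvals, ptr
```

Because kernel values are stored alongside the features, the later multiply-add needs no index lookups at all.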
IV-B Blocking and sparsity convolution algorithm
In this subsection, we describe the sparse convolution algorithm based on the above storage format and use Fig. 3 as an example to analyze the reduction in multiplications and additions.
In the previous subsection, we obtained a compressed feature matrix containing the feature data, a matrix containing the kernel data, and a vector containing the number of non-zero values per window. As shown in Fig. 5, according to the counts in the pointer vector, we can treat the value and kernel arrays as two matrices and multiply them; that is, we eliminate the redundant zero-value calculations and obtain the lossless convolution result.
According to the tiling method, a row of the compressed feature matrix (A) and the corresponding column of the kernel matrix (B) are multiplied in one thread to obtain one value of the convolution result matrix (C). This approach is suitable for the computational characteristics of GPUs.
A thread block contains one thread per convolution window, and each thread accesses the compressed matrices in shared memory and computes its data according to the matrix multiplication method to obtain one value of the convolution result. A thread block thus produces one row of the final convolution result, and after parallel computation by multiple thread blocks, the entire convolution result is obtained.
As shown in Fig. 3 and Fig. 5, for one convolution block the conventional algorithm requires 24 additions and 27 multiplications, while our algorithm requires only 7 additions and 10 multiplications, saving about 63% of the multiplications and 71% of the additions compared with the traditional algorithm. Therefore, the sparse convolution method greatly reduces the amount of calculation on GPUs, improves the calculation speed, saves time, and improves the performance of the convolutional neural network.
As shown in Algorithm 2, the pseudocode of a single thread computing one convolution result is described. The first line accesses the corresponding counter in shared memory to determine whether any non-zero value was stored. If no non-zero value was stored (when ptr = -1), no convolution is needed and the result is immediately 0. If non-zero values were stored, the thread accesses the values in the value and kernel vectors in shared memory in turn and performs multiply-add operations, finally obtaining the convolution result.
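The per-thread consumption of these arrays can be sketched sequentially; one loop iteration below corresponds to one GPU thread in the paper's algorithm, and the names and the -1 flag are our illustrative conventions:

```python
import numpy as np

def sparse_conv(vals, kvals, ptr, out):
    """Each window contributes the dot product of its stored feature
    and kernel values; ptr == -1 means the window was all zeros, so
    its result is 0 with no multiplications at all."""
    result, pos = [], 0
    for count in ptr:
        if count == -1:
            result.append(0.0)          # skip: all-zero window
        else:
            acc = 0.0
            for _ in range(count):      # only non-zero terms
                acc += vals[pos] * kvals[pos]
                pos += 1
            result.append(acc)
    return np.array(result).reshape(out, out)
```

The multiply-add count per window equals the stored count, which is how the zero-value work is avoided.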
IV-C Algorithm extension analysis
In this paper, we describe in detail the implementation of a sparse block compression algorithm for a single feature map. In practice, however, calculating a single feature map at a time is insufficient, because one layer contains multiple feature maps that require convolution. The feature maps to be calculated can therefore be processed together, which increases the number of GPU threads and the parallelism. Our method extends naturally to this setting: since the amount of calculation in a single thread is reduced, the calculation speed also improves substantially in the multithreaded case. In the next section, we illustrate the experimental effect of our new algorithm on various models and compare it with CuDNN. Our algorithm code is open source and is located at https://git@github.com:milk2we/bell.git.
V Experiments and dataset
Network  Layer  Size  Sparsity  New  CuDNN  Speedup
LeNet  Conv2  11×11  0.95  1.096  1.691  1.54
AlexNetC  Conv3  6×6  0.9  0.804  1.64  2.04
AlexNetI  Conv4  5×5  0.9  4.435  15.103  3.40
GoogLeNet  Inception4a.1  14×14  0.9  2.034  4.938  2.42
GoogLeNet  Inception4a.2  14×14  0.9  1.017  2.469  2.42
GoogLeNet  Inception4e.3  14×14  0.9  3.602  12.880  3.57
GoogLeNet  Inception5a.1  7×7  0.95  2.733  6.579  2.40
GoogLeNet  Inception5a.2  7×7  0.9  1.831  4.122  2.25
GoogLeNet  Inception5b.3  7×7  0.95  4.421  15.168  3.43
GoogLeNet  Inception4a.7  7×7  0.95  1.576  3.284  2.08
This section is organized as follows. First, we introduce a new dataset that we provide for convolution optimization. Then we introduce our experimental environment, followed by the convolution speed comparison experiment and the memory consumption comparison experiment.
V-A New dataset for convolution optimization
We provide a new dataset for convolution optimization, which is a collection of all input feature maps of the convolutional layers. The data are obtained by passing a picture of a cat through the entire network architecture, ensuring the authenticity of the data. Using this dataset to optimize convolution calculations is therefore more convenient and faster than the previous approach of modifying an already integrated network framework, and it improves work efficiency.
The current dataset contains all the feature maps of VGG19 that require convolution calculations. It also contains a file list of the convolution layer file names and a size file for all feature maps. This dataset is open source and available at https://git@github.com:milk2we/feature_map_dataset.git. In future work, we will continue to add more network models.
V-B Experimental GPU environment
OS  CPU  GPU  Memory
ubuntu18.04  Intel(R) Xeon(R) CPU E5 v3 @ 2.40GHz  GeForce GTX 1080 Ti  128G
CUDA  CuDNN
Version  10.0  7.6.1
As shown in Table II, the operating system is ubuntu18.04, the CPU is an Intel(R) Xeon(R) CPU E5 v3 @ 2.40GHz, the GPU is a GeForce GTX 1080 Ti, and the memory is 128G. Besides, the running environment is CUDA 10.0 with the corresponding CuDNN version 7.6.1.
V-C Convolution calculation speed comparison experiment
In this part, we conducted speed comparison tests. First, we recorded single-layer speed comparisons between our method and CuDNN on single-layer feature maps from several network models. Then we used the dataset to carry out a full VGG19 convolution speed comparison and analyzed the results.
As shown in Table I, the convolution of some layers is up to 3.5X faster than with CuDNN. For feature maps in deep networks, the size (e.g., 7×7) is very small compared with the initial input feature map, and traditional GEMM is not suitable for matrix multiplications with such small feature maps. Therefore, our algorithm performs further compression of small feature maps according to their large sparsity, which reduces the amount of calculation per thread and thereby improves the calculation speed.
As shown in Fig. 6, in the VGG19 model our algorithm maintains very good speed as the network deepens. Compared with CuDNN it can achieve a 2X speedup, and it is also considerably faster than the other algorithms. In all, compared with CuDNN, our algorithm achieves up to a 2.9X speedup on individual layers, and the entire network runs 2.3X faster.
In addition, we propose a quantized value that combines sparsity and feature map size: the larger the sparsity and the smaller the size, the larger the value, and deeper layers of a network have larger values. As shown in Fig. 8, the speedup of our algorithm is proportional to this value. In other words, our algorithm is suitable for convolution calculations of deep networks with small feature maps and large sparsity.
As shown in Fig. 7(a) and Fig. 7(b), we performed experiments with strides of 2 and 3 using the VGG19 feature map dataset. Our method achieves 1.8X and 1.75X speedups over CuDNN, respectively. The experimental results therefore show that our algorithm has a clear advantage in calculation speed and is suitable for convolutions of deep networks with large sparsity and small feature maps.
V-D Convolution calculation memory consumption comparison experiment
Because compressed storage is used, the memory consumed in the calculation is also reduced in the deep convolution layers of a convolutional neural network. As shown in Fig. 9, our algorithm saves 35% of memory consumption compared with CuDNN and 17% compared with im2col. However, for some convolutional layers with low sparsity, the memory usage is relatively high, and there are certain limitations on the use of shared memory. Therefore, designing a more efficient storage format is the focus of our future work.
VI Related works
The convolution operation is important for many deep neural networks in a broad range of domains[30]. Many research works therefore focus on its optimization at the algorithm level and the architecture level, and some of these optimization techniques have been integrated into CuDNN.
im2col+GEMM As mentioned in Section II, the im2col algorithm computes convolution with GEMM by expanding convolution windows into matrix rows, and it has been adopted in a variety of settings, such as earlier versions of CuDNN and the open-source framework Caffe. This method achieves a good acceleration effect through GEMM, but because of GEMM's limitations on small matrices, the algorithm has reached a bottleneck[28, 29].
FFT The Fast Fourier Transform (FFT) is a computational tool commonly used for signal analysis, such as digital signal processing. The FFT is a fast method for computing the discrete Fourier transform of a series of data samples (a time series). It exploits the relationship between the time domain and the frequency domain to convert time-domain correlation calculations into frequency-domain calculations, and in particular converts the time-domain convolution into a frequency-domain matrix multiplication[9].
Winograd The Winograd algorithm is based on Winograd's minimal filtering algorithm, which was initially proposed to reduce the number of multiplications in matrix computations and dramatically reduces the time complexity of matrix multiplication. This approach is superior for small kernels and small batches because it achieves minimal arithmetic complexity for convolutions with small input kernels. The use of small tiles also reduces the size of the algorithm's workspace, making the algorithm more efficient[10].
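As a small concrete instance of this idea, the 1-D Winograd transform F(2,3) computes two outputs of a 3-tap convolution with four multiplications instead of six (a textbook sketch following Lavin and Gray's formulation, not the CuDNN kernel):

```python
import numpy as np

def winograd_f23(d, g):
    """F(2,3): compute y0 = d0*g0 + d1*g1 + d2*g2 and
    y1 = d1*g0 + d2*g1 + d3*g2 using only 4 multiplications
    (m1..m4) instead of the naive 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])
```

The filter-side transforms (the sums of g) can be precomputed once per kernel, so in a CNN the per-tile cost is just the four multiplications.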
The above software optimization methods have all been absorbed into CuDNN[8], which makes CuDNN more advantageous for convolution calculation.
VII Conclusion
Convolution operations have a wide range of applications in machine learning, such as image recognition. However, due to the inherent nature of the convolution operation, its computational efficiency on the GPU is not ideal. Existing optimization methods calculate a large number of zero values in the input feature map, which is redundant for the final convolution result. Therefore, we skip these zero values and design a new storage format that reduces the number of global memory accesses when the feature map is transferred to GPU shared memory. In addition, the locality principle of the data is exploited, which further improves performance. Since the amount of calculation per thread is reduced, the calculation time of a single thread is greatly improved, and the effect on small feature maps is obvious. The final experimental results show that on the VGG19 model our method is 2.3X faster than CuDNN, and the deep convolutional layers of some models achieve a 3.5X speedup.
References
 [1] Han Z, Wei B, Zheng Y, et al. "Breast cancer multi-classification from histopathological images with structured deep learning model". Scientific Reports, 2017, 7(1): 4172.
 [2] Sui X, Zheng Y, Wei B, et al. "Choroid segmentation from optical coherence tomography with graph-edge weights learned from deep convolutional neural networks." Neurocomputing 237 (2017): 332-341.
 [3] He, Yonghao, et al. "Cross-modal retrieval via deep and bidirectional representation learning." IEEE Transactions on Multimedia 18.7 (2016): 1363-1377.

 [4] He Y, Xiang S, Kang C, et al. "Disan: Directional self-attention network for RNN/CNN-free language understanding." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
 [5] Zheng, Lei, Vahid Noroozi, and Philip S. Yu. ”Joint deep modeling of users and items using reviews for recommendation.” Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
 [6] Zhang C, Li P, Sun G, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks." Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015: 161-170.
 [7] NVIDIA. 2018. CUDA Documentation. http://docs.nvidia.com/cuda/cublas/index.html. (2018)
 [8] Chetlur S, Woolley C, Vandermersch P, et al. cudnn: Efficient primitives for deep learning[J]. arXiv preprint arXiv:1410.0759, 2014.
 [9] Mathieu, Michael, Mikael Henaff, and Yann LeCun. ”Fast training of convolutional networks through ffts.” arXiv preprint arXiv:1312.5851 (2013).

 [10] Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
 [11] Jia, Yangqing. Learning semantic image representations at a large scale. Diss. UC Berkeley, 2014.
 [12] Li, Xiuhong, et al. ”A coordinated tiling and batching framework for efficient GEMM on GPUs.” Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019.
 [13] Zhang, Chen, et al. "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018).
 [14] Shi S, Chu X. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units[J]. arXiv preprint arXiv:1704.07724, 2017.
 [15] Li, Jiajun, et al. ”SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules.” IEEE Transactions on Computers (2019).
 [16] Parashar, Angshuman, et al. "SCNN: An accelerator for compressed-sparse convolutional neural networks." 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017.
 [17] Dey, Sourya, et al. ”Accelerating training of deep neural networks via sparse edge processing.” International Conference on Artificial Neural Networks. Springer, Cham, 2017.
 [18] Liu, Baoyuan, et al. ”Sparse convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
 [19] Fan S, Yu H, Lu D, et al. CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network[J]. IEEE Access, 2019.
 [20] Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014: 675-678.
 [21] Xu W, Zhang H, Jiao S, et al. Optimizing sparse matrix vector multiplication using cache blocking method on Fermi GPU[C]//2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. IEEE, 2012: 231-235.
 [22] Lindholm E, Nickolls J, Oberman S, et al. NVIDIA Tesla: A unified graphics and computing architecture[J]. IEEE Micro, 2008, 28(2): 39-55.
 [23] Hsieh K, Ebrahimi E, Kim G, et al. Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems[C]//ACM SIGARCH Computer Architecture News. IEEE Press, 2016, 44(3): 204-216.
 [24] Hong S, Kim H. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness[C]//ACM SIGARCH Computer Architecture News. ACM, 2009, 37(3): 152-163.
 [25] Nickolls J, Dally W J. The GPU computing era[J]. IEEE Micro, 2010, 30(2): 56-69.
 [26] Kirk D. NVIDIA CUDA software and GPU parallel computing architecture[C]//ISMM. 2007, 7: 103-104.

 [27] Zhou Y, Tan Y. GPU-based parallel particle swarm optimization[C]//2009 IEEE Congress on Evolutionary Computation. IEEE, 2009: 1493-1500.
 [28] Lee C L, Chao C T, Lee J K, et al. Accelerate DNN performance with sparse matrix compression in Halide[C]//Proceedings of the 48th International Conference on Parallel Processing: Workshops. ACM, 2019: 1-4.
 [29] Rovder S, Cano J, O'Boyle M. Optimising convolutional neural networks inference on low-powered GPUs[J]. 2019.
 [30] Wan X, Zhang F, Chu Q, et al. High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs[C]//BMC Bioinformatics. BioMed Central, 2012, 13(10): S4.
 [31] Li C, Yang Y, Feng M, et al. Optimizing memory efficiency for deep convolutional neural networks on GPUs[C]//SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2016: 633-644.
 [32] Advances in Neural Networks - ISNN 2017: 14th International Symposium, ISNN 2017, Sapporo, Hakodate, and Muroran, Hokkaido, Japan, June 21-26, 2017, Proceedings[M]. Springer, 2017.