
Blocking and sparsity for optimization of convolution calculation algorithm on GPUs

by   Weizhi Xu, et al.
NetEase, Inc

Convolutional neural networks (CNNs) play a paramount role in machine learning and have made significant contributions in areas such as medical image classification, natural language processing, and recommender systems. Their success depends on achieving excellent performance with fast execution time, and the convolution operation dominates the total operation time of a CNN. In this paper, we propose a novel convolution method for Graphics Processing Units (GPUs) that reduces the convolution operation time and runs approximately 2X faster than the state-of-the-art convolution algorithm. Our work is based on the observation that the sparsity of the input feature map of a convolution operation is relatively large, and that the zero values of the feature map are redundant for the convolution result. Therefore, we skip the zero-value calculations and improve the speed by compressing the feature map. Besides, the feature maps in the deep layers of a network are small, so the number of threads is limited. With a limited number of threads, it is necessary to reduce the amount of calculation to increase the calculation speed. Our algorithm works well for the convolution of deep-layer feature maps with large sparsity and small size. Our contributions can be summarized as follows: 1) A novel storage format for high-sparsity feature maps. 2) A novel convolution algorithm based on block compression and shared memory. 3) A feature map data-set for convolution algorithm optimization. 4) A single-layer convolution comparison with CuDNN on different models, in which the best case achieves a 3.5X speedup. We also implemented the algorithm on the VGG-19 model, where it achieves a 1.3X∼2.9X speedup in the deep convolution layers and a 2.3X speedup for the entire network.



I Introduction

The convolution neural network (CNN) plays a crucial role in machine learning and has facilitated the development of medical image classification[1, 2], natural language processing[3, 4], recommender systems[5], etc. Improving the execution speed of CNNs is of great significance for promoting the development of machine learning. The convolution operation dominates the total execution time, so its importance and its huge computation cost mean that it must be optimized for high performance. The convolution operation in a CNN is the process in which the convolution kernel samples the feature map. During sampling, the kernel computes a weighted sum over the sampling area, and the entire feature map is sampled according to the stride size. This process requires a large number of multiplications and additions, so optimizing it is a vital task. From the hardware perspective, designing an FPGA or ASIC architecture suited to convolution computation is a feasible acceleration strategy[6]. Furthermore, Tensor Cores have been incorporated into the Volta GPU architecture, which delivers several times the convolution performance of the previous architecture[7]. From the software perspective, a series of algorithm-level optimization techniques have been developed: the im2col-based method converts the convolution operation into matrix multiplication[11], the FFT-based method transforms convolution into multiplication in the frequency domain[9], and the Winograd-based method uses ancient Chinese mathematical methods to reduce the number of multiplications and increase the calculation speed[10]. These software acceleration methods have been integrated into the CuDNN library by NVIDIA, which is currently the best GPU deep learning acceleration library[8]. In addition, batching and tiling have been optimized to accommodate small-scale matrix multiplication in convolution[12], and parallel optimization methods have been adapted to GPUs to speed up the convolution operation[13].

Traditional optimization methods for the convolution operation are usually designed for generality and do not take into account the characteristics and development direction of deep neural networks. Network pruning and the ReLU activation function are both common in deep neural networks, and both produce a large number of zero values in the network. For the feature maps that enter the convolution operation, previous studies found that the proportion of zero values can reach 0.7 after multiple training epochs of a deep network. The calculation of these zero values, however, is redundant for the convolution result[14]. In other words, if we skip the zero-value calculations in the convolution operation, we reduce the multiplications and additions by 50%. For this reason, many efforts have focused on reducing the calculation of zero values in neural networks[18]. SqueezeFlow employs a PT-OS-sparse dataflow that removes the ineffective computations while maintaining the regularity of CNN computations[15]. SCNN uses a novel dataflow that keeps sparse weights and activations in a compressed encoding, eliminating unnecessary data transfers and reducing storage requirements[16]. [17] proposed a new FPGA hardware architecture that uses algorithmically pre-determined structured sparsity to significantly reduce memory and computational requirements. These efforts have greatly promoted the acceleration of deep neural networks. However, some of this work does not apply to all network models, and work that designs new architectures cannot be quickly integrated into existing chips, so its practical contribution to the development of deep neural networks is limited. Besides, some algorithms for compressed convolution on the CPU have been proposed[14, 18, 19, 31, 32], but GPUs are the general computing platform for deep neural networks, so these CPU algorithms have certain limitations. As for the development trend, the pursuit of precision in deep neural networks leads to dozens of training iterations, so large sparsity in the weights and in the feature maps is inevitable, and convolution algorithms that exploit sparsity and can be widely applied are feasible.

To address these problems, we propose a novel method for the convolution operation that uses multi-threaded computing on GPUs. Our contributions can be summarized as follows:

  • We design a novel storage format for high-sparsity feature maps, which makes the non-zero values contiguous on GPUs.

  • We propose a convolution algorithm based on block compression and shared memory, which computes the convolution while skipping zeros and so reduces the amount of computation.

  • We provide a data-set of feature maps from multiple common network models that can be used for research on optimizing convolution computation.

  • We evaluate the proposed algorithm and compare it with the most closely related work on GPUs. The results show that our method achieves up to a 3.5X speedup over CuDNN.

The rest of this paper is organized as follows. Section II presents the background on optimizing the convolution operation. Section III introduces the motivation for our convolution method. Section IV describes the details of the proposed method. Section V introduces the new data-set of feature maps from multiple common network models and evaluates the proposed method. Section VI discusses related work. Section VII concludes the paper.

II Background

In this section, we introduce basic knowledge about GPUs and the CUDA platform. We also describe some work on the convolution operation on GPUs.

II-A Graphics Processing Units and CUDA platform

As computing demands continue to increase, more data must be processed in a limited amount of time, so the demand for parallel computing is also increasing. In turn, the performance requirements on processors have risen sharply. To meet the needs of parallel computing, Graphics Processing Units have come into use in a large number of data processing industries.

A GPU is an array of Streaming Multiprocessors (SMs), and each SM contains multiple Streaming Processors (SPs)[22]. Therefore, data can be processed in parallel by multiple SPs. Each SM can access a register file that works at the same speed as the SPs[23], so accessing it requires little waiting time. Further, in the Fermi architecture, each SM contains an L1 cache and all SMs share an L2 cache. Each SM also has a shared memory that is similar to the cache in a CPU[24], but its data replacement is wholly controlled by the programmer, with no hardware replacement logic.

The SPs inside an SM can access the shared memory, and it is shared only within that SM. The L1 cache in each SM shares a 64 KB memory segment with the shared memory[25]. It is also worth noting that when more than one thread in the SM accesses the shared memory, a synchronization mechanism is needed to avoid conflicts between threads[26].

Constant memory is a form of virtual addressing of global memory; it is read-only and typically limited to 64 KB. It also supports broadcasting a single value to every thread in a warp. Global memory is the storage used for data communication between the GPU and the CPU. The CPU can write to and read from global memory through the PCI-E bus. However, access to global memory is slow. Staging global memory data in shared memory is therefore a fundamental optimization method[27, 26].

Compute Unified Device Architecture (CUDA) is a computing platform designed by NVIDIA for its GPU architecture. Developing parallel tasks on the CUDA platform makes GPU programming more accessible and promotes the development of parallel applications. CUDA organizes threads into a grid. A grid consists of multiple thread blocks, and the threads within each block share a shared memory. Besides, because of the GPU's characteristics, memory is usually accessed warp by warp, and a warp generally consists of 32 threads[26, 27].

II-B Convolution operation

GPUs are well suited to matrix computations. The hardware architecture of the SP array can carry out matrix calculations in parallel, so the speed required for a specific calculation can be met.

Fig. 1: Block1 of Convolution turns into matrix multiplication.

In recent years, convolution calculations have also been converted to matrix multiplications. As shown in Fig. 1, the values in each convolution window are expanded into one row of a matrix, and the convolution kernel is expanded into a vector. The convolution is thereby converted to a matrix multiplication, and each value computed by a tile of the matrix multiplication corresponds to one value of the convolution result, so the calculation speed is substantially increased[28, 29].
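The window-to-row expansion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a square feature map, a square kernel, no padding, and the function names `im2col` and `conv_via_matmul` are chosen here for the example rather than taken from any library.

```python
def im2col(feature, k, stride=1):
    """Expand every k x k convolution window of a square feature map into one row."""
    n = len(feature)
    out = (n - k) // stride + 1          # output size per dimension (no padding)
    rows = []
    for r in range(0, out * stride, stride):
        for c in range(0, out * stride, stride):
            rows.append([feature[r + i][c + j] for i in range(k) for j in range(k)])
    return rows, out


def conv_via_matmul(feature, kernel, stride=1):
    """CNN-style convolution computed as (window matrix) x (kernel vector)."""
    k = len(kernel)
    rows, out = im2col(feature, k, stride)
    kvec = [kernel[i][j] for i in range(k) for j in range(k)]
    # Each row-times-vector product yields exactly one output value.
    flat = [sum(a * b for a, b in zip(row, kvec)) for row in rows]
    return [flat[r * out:(r + 1) * out] for r in range(out)]


feature = [[1, 0, 2, 0],
           [0, 3, 0, 1],
           [4, 0, 0, 0],
           [0, 0, 5, 6]]
kernel = [[1, 2],
          [3, 4]]
print(conv_via_matmul(feature, kernel))
```

Note that every zero in the feature map is copied into the expanded matrix and multiplied like any other value, which is the redundancy the later sections set out to remove.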

Many GPU optimizations for convolution in neural networks have been proposed in recent years; we discuss them in Section VI.

III Motivation

Convolutional neural networks are an important machine learning tool that extracts the characteristic information of input data through multi-layer convolution. Convolution is a costly task, however, because it requires sampling the entire feature map. At every sampling position the convolution kernel performs multiplications and additions over the sample area, so the total number of additions and multiplications is enormous. For a given feature map size, kernel size, and stride, the number of operations is on the order of the number of sampling positions times the number of kernel elements.
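As a worked form of this count, the standard operation count for a dense convolution can be written as follows. The symbols are assumptions introduced here for illustration (the original symbols are not available): a square N×N feature map, a square K×K kernel, stride S, and no padding, with O the number of outputs per dimension.

```latex
O = \left\lfloor \frac{N - K}{S} \right\rfloor + 1, \qquad
\text{multiplications} = O^{2} K^{2}, \qquad
\text{additions} = O^{2}\,(K^{2} - 1)
```

For example, with N = 5, K = 3, and S = 1 we get O = 3, so the full convolution costs 81 multiplications and 72 additions. One row of three outputs then costs 27 multiplications and 24 additions, which matches the per-block counts quoted for the conventional algorithm in Section IV-B.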

Besides, the feature maps of the deep convolution layers in a deep neural network become smaller, and the feature map is usually very small. Since GPUs generally compute convolution by converting it into matrix multiplication, the small feature map limits the number of threads, and GEMM is also less effective on small matrices.

Fig. 2: The sparsity of VGG-19

As mentioned in Section I, the sparsity of the deep feature maps is relatively large because of ReLU activation and pruning operations. As Fig. 2 shows, the sparsity of the feature map entering a convolution layer can exceed 0.7 in the deep layers of the network. When the feature map is converted for matrix multiplication, the matrix that is actually computed is even sparser. The usual calculation method still computes these zero values, which is redundant for the final convolution result. In other words, useless calculations account for more than 70% of the work, so we need to avoid this part of the calculation.

To address the above challenges, we reduce the multiplication and addition operations in the convolution by means of sparse compression. To reduce the time spent on compression, we turn the traditional matrix conversion operation into a compression operation, whose running time is close to that of the im2col method used by Caffe[20]. Because the deep feature maps are small and the number of threads is limited, this gives a 40% speed improvement. After compression, the convolution is computed as a sparse matrix-vector multiplication (SpMV)[21], which removes the calculation of redundant zero values.

The compression process increases the number of memory accesses. We limit this cost by blocking, by careful allocation of shared memory, and by designing a new storage format. The details are described in Section IV.

IV Proposed Convolution method

In this section, we introduce our convolution method for convolution neural networks. The first subsection describes a new feature map storage format suited to GPU computing and designed around the characteristics of the convolution calculation. The second subsection describes our calculation algorithm in detail.

IV-A Novel storage format for GPUs computing

Converting the feature map used for the convolution operation into a matrix that can be used for sparse matrix multiplication requires a format conversion of the original feature map. However, traditional sparse matrix storage formats are not suitable for storing the data of convolution operations. For example, the work in [19] converts the feature map to CSR format, which requires many accesses to global memory. Therefore, we designed a sparse storage format for feature maps for GPU computing.

Fig. 3: Novel storage format for the feature map of block 1. The feature map size is 5×5, the kernel size is 3×3, and the stride is 1.

As shown in Fig. 3, the feature map is divided vertically into three large convolution blocks according to the stride and the height of the convolution kernel, and each convolution block corresponds to one row of the final convolution result. Within each convolution block, the horizontal movement of the convolution kernel defines three convolution windows, and each convolution window corresponds to one value of the convolution result.

We use three vectors to describe the novel storage format. As shown in Fig. 3, we store the three non-zero values of one convolution window in the feature-value vector and store the corresponding values of the convolution kernel in the kernel-value vector. The non-zero values of all remaining convolution windows are stored in turn, along with the corresponding kernel values. Besides, the number of non-zero values in each convolution window is stored in the ptr vector. In other words, the first vector stores the non-zero values, the second stores the kernel values, and ptr stores the number of non-zero values.

In contrast to the CSR storage format, we do not store index values that the convolution does not need, and we store the kernel values that contribute to the convolution result instead. In this way, the storage format holds only the non-zero values of the feature map, which reduces accesses to global memory by 50%. Besides, the number of calculations each thread must perform is given by the value in ptr. If a thread has no non-zero value, -1 is stored as a marker.

Converting the feature map to the sparse storage format increases the number of memory accesses, so in theory the conversion time increases. Therefore, we use blocking and shared memory in the algorithm to reduce the time consumption. Blocking is primarily used to improve data locality by increasing data reuse on the GPU. Each convolution block corresponds to a thread block on the GPU. The resources in the block are shared, and each thread in the thread block corresponds to one convolution window. The feature map in global memory is loaded into the sparse storage matrix in the shared memory of the block. This improves the access speed, and because the non-zero values are contiguous, it increases locality and prepares the data for the subsequent calculation.

Fig. 4: Thread access condition with a stride size of 1

As shown in Fig. 4, to reduce memory access time during the calculation, we load the non-zero values of the feature map into shared memory. The feature map is stored in global memory as a single array so that adjacent threads access contiguous addresses. However, this advantage holds only when the convolution stride is 1.

For a single feature map, one thread block is allocated per row of the convolution result, and each block contains one thread per convolution window. In the implementation, each thread examines every element of its convolution window, addressed by its relative position in the window. If the element is non-zero, it is stored at the next free position of the feature-value vector. At the same time, the kernel value at the same relative position is stored at the corresponding position of the kernel-value vector, and the count of stored values is recorded in ptr.

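Algorithm 1 Storage Format Algorithm
2:  for each row of the convolution window do
3:     for each column of the convolution window do
4:        set the starting address that this thread needs to access
5:        if the value at that address is non-zero then
6-8:         store the value and the matching kernel value in the shared-memory vectors
9:        end if
10:     end for
11:  end for
12:  if no non-zero value was stored then
13:     store -1 in ptr
14:  else
15:     store the position where the non-zero values end in ptr
16:  end if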

Algorithm 1 describes the single-thread procedure for converting an input feature map to the storage format described above. The feature-value and kernel-value vectors are declared in shared memory, so the whole procedure transforms the feature map read from global memory into the new format stored in shared memory. The fourth line sets the starting address that this thread needs to access. The fifth line tests whether the value is non-zero; if it is, the value and the matching kernel value are stored in the corresponding vectors. After the loop ends, the position where the non-zero values end is stored in the ptr vector, or -1 if there is no non-zero value.
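A minimal sequential sketch of this conversion is given below. It is not the authors' CUDA kernel: the loop over `t` stands in for the threads of one thread block, Python lists stand in for shared memory, and the names `f_data`, `k_data`, and `compress_block` as well as the layout (a fixed slot of K×K entries per window, with ptr holding the count of stored values) are assumptions made for the example.

```python
def compress_block(feature, kernel, block_row, stride=1):
    """Compress one convolution block (one output row) into three arrays.

    Returns (f_data, k_data, ptr): non-zero feature values, the kernel values
    at the same relative positions, and per-window counts (-1 if none).
    """
    n, k = len(feature), len(kernel)
    windows = (n - k) // stride + 1      # one "thread" per convolution window
    cap = k * k                          # assumed slot size per window
    f_data = [0] * (windows * cap)       # stands in for shared memory
    k_data = [0] * (windows * cap)
    ptr = [-1] * windows
    top = block_row * stride             # first feature-map row of this block
    for t in range(windows):             # t plays the role of the thread index
        left = t * stride
        count = 0
        for i in range(k):
            for j in range(k):
                v = feature[top + i][left + j]
                if v != 0:               # keep only non-zero values
                    f_data[t * cap + count] = v
                    k_data[t * cap + count] = kernel[i][j]
                    count += 1
        ptr[t] = count if count > 0 else -1
    return f_data, k_data, ptr


feature = [[1, 0, 0, 0, 2],
           [0, 0, 3, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 4, 0, 0, 0]]
kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
f_data, k_data, ptr = compress_block(feature, kernel, block_row=0)
print(ptr)
```

The feature map here is made up for the example and is not the data of Fig. 3. Giving every window its own slot lets each thread write without coordinating with its neighbours, at the cost of some unused entries.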

IV-B Blocking and sparsity convolution algorithm

In this subsection, we describe the sparse convolution algorithm based on the above storage format and use Fig. 3 as an example to analyze the reduction in the number of multiplications and additions.

In the previous subsection, we obtained a compressed feature matrix containing the feature data, a matrix containing the kernel data, and a vector containing the amount of data. As shown in Fig. 5, we can treat the first two as matrices whose row lengths are given by the values in that vector and multiply them. This removes the redundant zero-value calculations and still yields the exact convolution result.

Fig. 5: Matrix multiplication tiling method. Entries of the same color are computed by the same thread, and the different colors together form one thread block.

According to the tiling method, one thread multiplies a row of the compressed feature matrix (A) by the corresponding column of the kernel matrix (B) to obtain one value of the convolution result matrix (C). This approach suits the computational characteristics of GPUs.

A thread block contains one thread per convolution window. Each thread in the thread block accesses the compressed matrix in shared memory and computes its data by matrix multiplication to obtain one value of the convolution result. A thread block produces one row of the final convolution result. After multiple thread blocks compute in parallel, the entire convolution result is obtained.

As shown in Fig. 3 and Fig. 5, in one convolution block the conventional algorithm requires 24 additions and 27 multiplications, whereas our algorithm requires only 7 additions and 10 multiplications. This saves about 63% of the multiplications (17 of 27) and about 71% of the additions (17 of 24) compared with the traditional algorithm. Therefore, sparse convolution greatly reduces the amount of calculation on the GPU, improves the calculation speed, and improves the performance of the convolutional neural network.

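Algorithm 2 Convolution Algorithm
1:  if ptr = -1 then
2:     set the convolution result to 0
3:  else
4:     for each stored non-zero value do
5:        multiply it by its kernel value and add the product to the result
6:     end for
7:  end if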

Algorithm 2 gives the single-thread pseudo-code for one convolution result. The first line accesses the corresponding ptr vector in shared memory to determine whether any non-zero value was stored during the storage procedure. If none was stored (ptr = -1), no convolution is needed and the convolution value is 0. If non-zero values were stored, the thread reads the feature values and kernel values from shared memory in turn and performs multiply-add operations to obtain the convolution result.
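The per-thread multiply-add can be sketched sequentially as follows. As before, this is an illustration rather than the authors' kernel: the loop over `ptr` stands in for the threads of one thread block, and the names `f_data`, `k_data`, `cap` (slot size per window), and `sparse_conv_row` are assumptions made for the example, with ptr holding a count or -1.

```python
def sparse_conv_row(f_data, k_data, ptr, cap):
    """Compute one row of the convolution result from the compressed arrays.

    Returns the output row and the number of multiplications performed.
    """
    row, mults = [], 0
    for t, count in enumerate(ptr):      # t plays the role of the thread index
        if count == -1:                  # nothing stored: the result is 0
            row.append(0)
            continue
        base = t * cap                   # start of this thread's slot
        acc = 0
        for p in range(base, base + count):
            acc += f_data[p] * k_data[p] # multiply-add on non-zero values only
            mults += 1
        row.append(acc)
    return row, mults


# Three 2x2 windows (cap = 4 entries per slot); the last window is all zeros.
f_data = [1, 3, 0, 0,  3, 0, 0, 0,  0, 0, 0, 0]
k_data = [1, 4, 0, 0,  3, 0, 0, 0,  0, 0, 0, 0]
ptr = [2, 1, -1]
row, mults = sparse_conv_row(f_data, k_data, ptr, cap=4)
print(row, mults)
```

In this small made-up example the row is obtained with 3 multiplications, where a dense computation over the same three 2×2 windows would use 12.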

IV-C Algorithm extension analysis

In this paper, we describe in detail the implementation of a sparse block compression algorithm for a single feature map. In practice, however, computing one feature map at a time is not sufficient, because a layer contains multiple feature maps that all require convolution. The feature maps to be computed can therefore be processed together, which increases the number of active GPU threads and the parallelism. Our method extends to this setting as well. Since the amount of calculation in a single thread is reduced, the calculation speed also improves considerably in the multi-threaded case. In the next section, we illustrate the experimental results of our new algorithm on various models and compare them with CuDNN. Our algorithm code is open source.

V Experiments and data-set

Network     Layer           Size    Sparsity  New (time)  CuDNN (time)  Speedup
LeNet       Conv2           11×11   0.95      1.096       1.691         1.54
AlexNetC    Conv3           6×6     0.9       0.804       1.64          2.04
AlexNetI    Conv4           5×5     0.9       4.435       15.103        3.40
GoogLeNet   Inception4a.1   14×14   0.9       2.034       4.938         2.42
GoogLeNet   Inception4a.2   14×14   0.9       1.017       2.469         2.42
GoogLeNet   Inception4e.3   14×14   0.9       3.602       12.880        3.57
GoogLeNet   Inception5a.1   7×7     0.95      2.733       6.579         2.40
GoogLeNet   Inception5a.2   7×7     0.9       1.831       4.122         2.25
GoogLeNet   Inception5b.3   7×7     0.95      4.421       15.168        3.43
GoogLeNet   Inception4a.7   7×7     0.95      1.576       3.284         2.08
TABLE I: Single layer acceleration effect table
Fig. 6: VGG-19 convolution calculation time change graph

This section is organized as follows. First, we introduce a new data-set that we provide for convolution optimization. Then we describe our experimental environment, followed by the speed comparison experiment, the memory consumption comparison experiment, and the power consumption comparison experiment.

V-A New data-set for convolution optimization

We provide this new data-set for convolution optimization. It is a collection of all the input feature maps of the convolutional layers, obtained by passing a picture of a cat through the entire network architecture, which ensures that the data is realistic. Optimizing the convolution calculation with this data-set is more convenient and faster than the previous approach of modifying an already integrated network framework, and it improves work efficiency.

The current data-set contains all the feature maps of VGG-19 that require convolution calculations. It also contains a file list with the file names of the convolution layers and a size file for all feature maps. This data-set is open source. In future work, we will continue to add more network models.

V-B Experiment running GPU environment

Operating system   ubuntu18.04
CPU                Intel(R) Xeon(R) CPU E5 v3 @ 2.40GHz
GPU                GeForce GTX 1080T
Memory             128G
CUDA version       10.0
CuDNN version      7.6.1
TABLE II: Experiment running GPU environment

As shown in Table II, the operating system is Ubuntu 18.04 and the CPU is an Intel(R) Xeon(R) CPU E5 v3 @ 2.40GHz. The GPU is a GeForce GTX 1080T, and the memory is 128G. The CUDA environment is CUDA-10.0 with the corresponding CuDNN-7.6.1.

V-C Convolution calculation speed comparison experiment

In this part, we conduct speed comparison tests. First, we compare the single-layer speed of our algorithm against CuDNN on single-layer feature maps from several network models. Then we use the data-set to carry out the VGG-19 convolution speed comparison and analyze the results.

(a) VGG-19 convolution calculation time with a stride of 2.
(b) VGG-19 convolution calculation time with a stride of 3.
Fig. 7: VGG-19 convolution calculation time for different strides.

As shown in Table I, some layers reach up to a 3.5X speedup over CuDNN. The feature maps in the deep layers of a network are very small compared with the initial input feature map, and traditional GEMM is not well suited to the small matrices they produce. Our algorithm therefore compresses the small feature maps further by exploiting their large sparsity, which reduces the amount of calculation per thread and thereby improves the calculation speed.
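The speedup column of Table I can be checked against the two timing columns with a short script. The timings below are copied from the table (their unit is not stated in the text), and the helper names `speedup` and `best_layer` are introduced here only for this check. The reading that speedup is the CuDNN time divided by the new algorithm's time is an assumption, although it agrees with every row to within 0.01.

```python
# (network, layer, new_time, cudnn_time, reported_speedup), copied from Table I
TABLE_I = [
    ("LeNet", "Conv2", 1.096, 1.691, 1.54),
    ("AlexNetC", "Conv3", 0.804, 1.64, 2.04),
    ("AlexNetI", "Conv4", 4.435, 15.103, 3.40),
    ("GoogLeNet", "Inception4a.1", 2.034, 4.938, 2.42),
    ("GoogLeNet", "Inception4a.2", 1.017, 2.469, 2.42),
    ("GoogLeNet", "Inception4e.3", 3.602, 12.880, 3.57),
    ("GoogLeNet", "Inception5a.1", 2.733, 6.579, 2.40),
    ("GoogLeNet", "Inception5a.2", 1.831, 4.122, 2.25),
    ("GoogLeNet", "Inception5b.3", 4.421, 15.168, 3.43),
    ("GoogLeNet", "Inception4a.7", 1.576, 3.284, 2.08),
]


def speedup(new_time, cudnn_time):
    """Speedup of the new algorithm, taken as baseline time / new time."""
    return cudnn_time / new_time


def best_layer(rows):
    """Return the table row with the largest computed speedup."""
    return max(rows, key=lambda r: speedup(r[2], r[3]))


for net, layer, new_t, cudnn_t, reported in TABLE_I:
    print(f"{net:10s} {layer:14s} computed {speedup(new_t, cudnn_t):.2f}  reported {reported:.2f}")
```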

As shown in Fig. 6, in the VGG-19 model our algorithm maintains very good speed as the network deepens. It achieves about a 2X speedup over the CuDNN algorithm and is also much faster than the other algorithms. Overall, compared with CuDNN, our algorithm achieves up to a 2.9X speedup on individual layers, and the running time of the entire network achieves a 2.3X speedup.

Fig. 8: Change of the sparsity-to-size ratio and the speedup on VGG-19

In addition, we propose a quantitative measure based on the ratio of sparsity to size. The larger the sparsity and the smaller the size, the larger this value, and the deeper the layer is in the network. As shown in Fig. 8, the experiments show that the speedup of our algorithm is proportional to this value. In other words, our algorithm is suitable for convolution calculations in deep networks with small feature maps and large sparsity.

As shown in Fig. 7(a) and Fig. 7(b), we performed experiments with strides of 2 and 3 using the VGG-19 feature map data-set. Our method achieves 1.8X and 1.75X speedups over CuDNN, respectively. The experimental results therefore show that our algorithm has a clear advantage in calculation speed and is suitable for the convolution of deep networks with large sparsity and small feature maps.

V-D Convolution calculation memory consumption comparison experiment

Fig. 9: Deep network memory consumption

Because compressed storage is used, the memory consumed during the calculation is also reduced in the deep convolution layers of the convolution neural network. As shown in Fig. 9, our algorithm uses 35% less memory than CuDNN and 17% less than im2col. However, for some convolutional layers with low sparsity, the memory usage is relatively high, and the use of shared memory imposes certain limitations. Therefore, designing a more efficient storage format is the focus of our future work.

VI Related Works

The convolution operation is important for many deep neural networks in a broad range of domains[30]. Many research works therefore focus on optimizing it at the algorithm level and the architecture level. Some of these optimization techniques have been integrated into CuDNN.

im2col+GEMM As mentioned in Section II, the im2col algorithm expands each convolution window into a row and then computes the convolution with GEMM. It is used in a variety of settings, such as earlier versions of CuDNN and the open source framework Caffe. This method obtains a good acceleration effect from GEMM, but because GEMM is limited on small matrices, the algorithm has reached a bottleneck[28, 29].


Fast Fourier Transform (FFT) is a computational tool commonly used for signal analysis, for example in digital signal processing. It is a fast method for calculating the discrete Fourier transform of a series of data samples (called a time series). The FFT-based method uses the relationship between the time domain and the frequency domain to convert the time-domain convolution into a frequency-domain multiplication[9].


Winograd The Winograd algorithm is based on Winograd's algorithm, which was initially proposed for matrix multiplication and dramatically reduces its time complexity. The algorithm excels with small kernels and small batch sizes because it computes convolution with minimal arithmetic complexity over small tiles of the input. The use of small tiles also reduces the size of the algorithm's workspace, making the algorithm more efficient[10].

The above software optimization methods have all been absorbed into CuDNN[8], which strengthens CuDNN for convolution calculation.

VII Conclusion

Convolution operations have a wide range of applications in machine learning, such as image recognition. However, due to the inherent nature of the convolution operation, its performance on the GPU is not ideal. Existing optimization methods compute a large number of zero values in the input feature map, which is redundant for the final convolution result. Therefore, we skip these zero values and design a new storage format that reduces the number of accesses to global memory when the feature map is transferred to the GPU-side shared memory. In addition, we exploit data locality, which further improves performance. Since the amount of calculation in a single thread is reduced, the calculation time of a single thread is greatly shortened, and the effect on small feature maps is pronounced. The final experimental results show that the VGG-19 model runs 2.3X faster than with CuDNN. In addition, the deep convolutional layers of some models reach a 3.5X speedup.


  • [1] Han Z, Wei B, Zheng Y, et al. ”Breast cancer multi-classification from histopathological images with structured deep learning model”. Scientific reports, 2017, 7(1): 4172.
  • [2] Sui X, Zheng Y, Wei B, et al. ”Choroid segmentation from optical coherence tomography with graph-edge weights learned from deep convolutional neural networks.” Neurocomputing 237 (2017): 332-341.
  • [3] He, Yonghao, et al. ”Cross-modal retrieval via deep and bidirectional representation learning.” IEEE Transactions on Multimedia 18.7 (2016): 1363-1377.
  • [4] He Y, Xiang S, Kang C, et al. ”Disan: Directional self-attention network for rnn/cnn-free language understanding.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
  • [5] Zheng, Lei, Vahid Noroozi, and Philip S. Yu. ”Joint deep modeling of users and items using reviews for recommendation.” Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
  • [6] Zhang C, Li P, Sun G, et al. ”Optimizing fpga-based accelerator design for deep convolutional neural networks” Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015: 161-170.
  • [7] NVIDIA. 2018. CUDA Documentation. cublas/index.html. (2018)
  • [8] Chetlur S, Woolley C, Vandermersch P, et al. cudnn: Efficient primitives for deep learning[J]. arXiv preprint arXiv:1410.0759, 2014.
  • [9] Mathieu, Michael, Mikael Henaff, and Yann LeCun. ”Fast training of convolutional networks through ffts.” arXiv preprint arXiv:1312.5851 (2013).
  • [10] Lavin, Andrew, and Scott Gray. ”Fast algorithms for convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
  • [11] Jia, Yangqing. Learning semantic image representations at a large scale. Diss. UC Berkeley, 2014.
  • [12] Li, Xiuhong, et al. ”A coordinated tiling and batching framework for efficient GEMM on GPUs.” Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019.
  • [13] Zhang, Chen, et al. ”Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018).
  • [14] Shi S, Chu X. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units[J]. arXiv preprint arXiv:1704.07724, 2017.
  • [15] Li, Jiajun, et al. ”SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules.” IEEE Transactions on Computers (2019).
  • [16] Parashar, Angshuman, et al. ”Scnn: An accelerator for compressed-sparse convolutional neural networks.” 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017.
  • [17] Dey, Sourya, et al. ”Accelerating training of deep neural networks via sparse edge processing.” International Conference on Artificial Neural Networks. Springer, Cham, 2017.
  • [18] Liu, Baoyuan, et al. ”Sparse convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
  • [19] Fan S, Yu H, Lu D, et al. CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network[J]. IEEE Access, 2019.
  • [20] Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014: 675-678.
  • [21] Xu W, Zhang H, Jiao S, et al. Optimizing sparse matrix vector multiplication using cache blocking method on Fermi GPU[C]//2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. IEEE, 2012: 231-235.
  • [22] Lindholm E, Nickolls J, Oberman S, et al. NVIDIA Tesla: A unified graphics and computing architecture[J]. IEEE micro, 2008, 28(2): 39-55.
  • [23] Hsieh K, Ebrahimi E, Kim G, et al. Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems[C]//ACM SIGARCH Computer Architecture News. IEEE Press, 2016, 44(3): 204-216.
  • [24] Hong S, Kim H. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness[C]//ACM SIGARCH Computer Architecture News. ACM, 2009, 37(3): 152-163.
  • [25] Nickolls J, Dally W J. The GPU computing era[J]. IEEE micro, 2010, 30(2): 56-69.
  • [26] Kirk D. NVIDIA CUDA software and GPU parallel computing architecture[C]//ISMM. 2007, 7: 103-104.
  • [27] Zhou Y, Tan Y. GPU-based parallel particle swarm optimization[C]//2009 IEEE Congress on Evolutionary Computation. IEEE, 2009: 1493-1500.
  • [28] Lee C L, Chao C T, Lee J K, et al. Accelerate DNN Performance with Sparse Matrix Compression in Halide[C]//Proceedings of the 48th International Conference on Parallel Processing: Workshops. ACM, 2019: 14.
  • [29] Rovder S, Cano J, O'Boyle M. Optimising Convolutional Neural Networks Inference on Low-Powered GPUs[J]. 2019.
  • [30] Wan X, Zhang F, Chu Q, et al. High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs[C]//BMC bioinformatics. BioMed Central, 2012, 13(10): S4.
  • [31] Li C, Yang Y, Feng M, et al. Optimizing memory efficiency for deep convolutional neural networks on GPUs[C]//SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2016: 633-644.
  • [32] Advances in Neural Networks-ISNN 2017: 14th International Symposium, ISNN 2017, Sapporo, Hakodate, and Muroran, Hokkaido, Japan, June 21-26, 2017, Proceedings[M]. Springer, 2017.