1 Introduction
Deep convolutional neural networks (DCNNs) have substantially advanced diverse intelligent applications such as image classification [1] and object detection [2]. While sophisticated neural networks are effective at continuously improving the accuracy of intelligent tasks, their computational complexity and storage requirements also increase drastically, which is an obstacle to their applicability. Real-time applications, such as video surveillance, have strict constraints on processing time, and embedded applications, such as virtual reality, are limited in memory usage. As such, accelerating and compressing neural networks has become an inevitable research trend. On the one hand, fast convolution algorithms for neural networks, such as the FFT (fast Fourier transform) [3] and Winograd's minimal filtering algorithm [4], reduce the number of multiplications in convolution by exploiting the correspondence between convolution and scalar multiplication. On the other hand, several approaches focus on compressing neural networks with quantization techniques [5][6], which represent the original floating-point values by low-precision (e.g., 8-bit integer) codes. The quantized values can be processed with low-precision arithmetic, which has the potential for speedup [7]. Unfortunately, these two kinds of techniques cannot be directly combined, because the data transformations of fast convolution algorithms disturb the quantized values, which eliminates the gain of low-precision quantization. In this paper, we address this problem by proposing Lance (Low-precision quAntized wiNograd Convolution for neural nEtworks), which applies quantization methods in the Winograd domain.

2 Preliminary and Formalization
2.1 Convolution Computation
Convolution is a crucial operation in deep convolutional neural networks: it extracts features from input images by applying several filters and collects the results into output images (also known as feature maps). We denote the input image as X with C channels, the convolution filters as W (K filters of size r x r with C channels each), and the output image as Y with K channels. A typical convolution layer of a deep neural network (stride and padding are omitted) can be expressed as:

Y_{k,i,j} = \sum_{c=1}^{C} \sum_{u=1}^{r} \sum_{v=1}^{r} X_{c,i+u,j+v} W_{k,c,u,v}    (1)

where X_{c,i+u,j+v} is the element at row i+u and column j+v in channel c of the input image, W_{k,c,u,v} is the element at row u and column v in channel c of the k-th filter, and Y_{k,i,j} is the element at row i and column j in output channel k. For an entire image/filter pair, the equation can be expressed as:

Y_k = \sum_{c=1}^{C} X_c \star W_{k,c}    (2)

where \star represents the 2D correlation (refer to [4]).
2.2 Low-precision Quantization
Low-precision quantization techniques represent the values of the original data by low-precision quantized codes [8]. In general, convolution with quantization has three steps: first, convert the values of images and filters to quantized values using a quantization function Q; then, perform the low-precision computation with the quantized values; finally, convert the quantized results back to output feature maps using a dequantization function Q^{-1}. Thus, the quantized convolution can be formalized as:

Y_k = Q^{-1}\Big( \sum_{c=1}^{C} Q(X_c) \,\hat{\star}\, Q(W_{k,c}) \Big)    (3)

where \hat{\star} represents the quantized 2D correlation, which can be calculated with low-precision arithmetic.
2.3 Winograd Convolution
Similar to [4], we use F(m x m, r x r) to denote the Winograd convolution that produces an m x m output with an r x r filter. F(m x m, r x r) requires (m + r - 1)^2 multiplications [9], which equals the number of input elements, whereas the standard convolution requires m^2 r^2 multiplications. For the 2D Winograd convolution F(m x m, r x r), the basic output block is an m x m patch and the basic input block is an (m + r - 1) x (m + r - 1) patch extracted from the input image. An input image is divided into several such patches (with stride and padding if necessary), and the corresponding output patches are merged into an output feature map. Let the input patch be d, the filter be g, and the output patch be Y; the Winograd algorithm can then be written as:

Y = A^T \big[ (G g G^T) \odot (B^T d B) \big] A    (4)

where \odot represents the Hadamard (element-wise) product. B^T, G, and A^T are the transformation matrices. G g G^T and B^T d B are the transformed filter and the transformed input patch in the Winograd domain, respectively. By applying A^T and A, the output patch Y is obtained. The detailed algorithm can be found in [4].
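Equation (4) can be checked numerically. The following sketch implements F(2x2, 3x3) in NumPy with the standard transformation matrices from [4] and verifies it against direct 2D correlation; the function names are illustrative, not part of Lance.

```python
import numpy as np

# Transformation matrices for F(2x2, 3x3), as given in Lavin & Gray [4].
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """Compute a 2x2 output patch from a 4x4 input patch d and 3x3 filter g."""
    U = G @ g @ G.T        # transformed filter, 4x4
    V = BT @ d @ BT.T      # transformed input patch, 4x4
    M = U * V              # Hadamard product: 16 multiplications
    return AT @ M @ AT.T   # inverse transform to the 2x2 output patch

def direct_corr(d, g):
    """Reference: direct 2D correlation (valid mode, stride 1)."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_2x2_3x3(d, g), direct_corr(d, g))
```

Note that the Winograd path uses 16 multiplications in the Hadamard product, versus 36 for the direct 2x2 output with a 3x3 filter.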
3 Proposed Approach
3.1 A Motivating Example
Take F(2 x 2, 3 x 3) as an example (i.e., Winograd convolution with a 4 x 4 input patch d, a 3 x 3 filter g, and a 2 x 2 output patch Y), which is widely used in modern neural networks. Figure 1(a) shows the brute-force approach of combining the Winograd convolution with quantization techniques. In quantized neural networks, the input and filters of a convolution layer are quantized to low-precision codes, shown as the green matrix and the gray matrix. However, the transformation operations of the Winograd convolution disturb the quantized codes, which means that the values of the transformed matrices are indeterminate, i.e., they cannot be represented by the existing quantized codes in the Winograd domain (the red matrix and the blue matrix). For demonstration purposes, we use a naive 2-bit linear quantization method. Let the original full-precision input patch be d. By applying the quantization function Q, we obtain the quantized input:

q = Q(d)    (5)

The transformed input matrix is then:

V = B^T q B    (6)

As can be seen, each value of the transformed matrix V combines several quantized codes with signs, so in general it cannot be represented by the above-mentioned 2-bit low-precision codes. Moreover, if we instead use a full-precision data type to store the transformed result, the Hadamard product can no longer exploit the potential of low-precision computation.
Remarks. The brute-force example shows that the values of quantized input images and filters are disturbed by the Winograd transformations and thus cannot be used directly in the Winograd domain. What we need is a method that combines the Winograd algorithm with quantization techniques while retaining the advantages of both.
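The disturbance can be demonstrated concretely. In the sketch below, the 2-bit code set {-2, -1, 0, 1} and the quantized patch values are hypothetical choices for illustration; B^T is the standard input-transform matrix of F(2x2, 3x3), whose entries are 0 and ±1.

```python
import numpy as np

# Input-transform matrix B^T of F(2x2, 3x3); entries are 0 and +/-1.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]])

codes = np.array([-2, -1, 0, 1])   # hypothetical 2-bit code set
q = np.array([[ 1,  1, -2, -2],    # a hypothetical quantized 4x4 input patch
              [ 1,  1, -2, -2],
              [-2, -2,  1,  1],
              [-2, -2,  1,  1]])
V = BT @ q @ BT.T                  # input transform into the Winograd domain
# Each entry of V combines up to four codes with signs, so the value
# range grows well beyond what the 2-bit code set can represent.
assert V.max() > codes.max() and V.min() < codes.min()
```

For this patch, V contains values such as 6 and -6, far outside the 2-bit range: the brute-force combination destroys the low-precision representation.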
3.2 Low-Precision Quantized Winograd Convolution
As shown in Figure 1(b), we propose Lance, which applies quantization techniques in the Winograd domain to exploit the potential of low-precision computation. In our algorithm, we use a uniform linear quantization function Q to quantize a full-precision value x to a low-precision code q:

q = Q(x) = \mathrm{round}\Big( \frac{x}{\max(|T|)} \cdot (2^{k-1} - 1) \Big)    (7)

where k indicates the bit-width of the low-precision data type and T is the matrix that the value x belongs to, such as the transformed input patch matrix or the transformed filter matrix. The quantized code can be recovered to an approximate full-precision value by a dequantization function Q^{-1}:

x' = Q^{-1}(q) = q \cdot \frac{\max(|T|)}{2^{k-1} - 1}    (8)
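A minimal sketch of this quantization pair, assuming a symmetric per-tensor scheme as in Eqs. (7)–(8) (the clipping behavior is our assumption; the paper's exact scheme may differ):

```python
import numpy as np

def quantize(t, bits):
    """Uniform linear quantization of tensor t to `bits`-bit integer codes.
    Returns the codes and the scale needed for dequantization.
    Assumption: symmetric per-tensor scaling by max(|t|), cf. Eq. (7)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    codes = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int32)
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate full-precision values, cf. Eq. (8)."""
    return codes.astype(np.float64) * scale

x = np.array([-1.0, -0.3, 0.0, 0.4, 1.0])
codes, s = quantize(x, bits=8)
x_hat = dequantize(codes, s)
# The round-trip error is bounded by half a quantization step.
assert np.max(np.abs(x - x_hat)) <= s / 2 + 1e-12
```

The per-tensor maximum keeps the scheme linear, so integer arithmetic on the codes corresponds (up to a single scale factor) to arithmetic on the original values, which is what the Winograd-domain Hadamard product relies on.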
Overall, the quantized Winograd convolution can be formalized as:

Y = A^T \Big[ Q^{-1}\big( Q(G g G^T) \odot Q(B^T d B) \big) \Big] A    (9)

As such, the Hadamard product with quantized operands Q(G g G^T) and Q(B^T d B) can be calculated with low-precision arithmetic, which addresses the problem of the brute-force approach.
Algorithm 1 describes our quantized Winograd algorithm for a convolution layer. The inputs of Lance are an image and a set of filters, and the output is the feature maps, initialized to zero. Here, we assume the batch size of the neural network is N = 1; the approach for a larger batch size, N > 1, remains the same. The transformation matrices B^T, G, and A^T are generated based on the Chinese Remainder Theorem for the given sizes of the input and filters. First, the numbers of filters, patches, and channels are obtained from the input data, and the input image is divided into patches. Each input patch and each filter are transformed using B^T and G, and the transformed values are quantized to low-precision codes. The Hadamard product of the quantized transformed data is then calculated with low-precision arithmetic. The output patch is obtained by transforming the sum of the Hadamard-product results over channels using A^T. Finally, the output feature map is obtained by merging the output patches.
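The steps above can be sketched end to end in NumPy. This is an illustrative, simplified reading of Algorithm 1 (per-tensor scales, F(2x2, 3x3), N = 1, stride 1, no padding), not the paper's GPU implementation; `lance_layer` and `quantize` are names of our own choosing.

```python
import numpy as np

# F(2x2, 3x3) transformation matrices (Lavin & Gray [4]).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def quantize(t, bits):
    """Per-tensor uniform linear quantization; returns codes and scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(t)), 1e-12) / qmax
    return np.round(t / scale).astype(np.int64), scale

def lance_layer(image, filters, bits=8):
    """Quantized F(2x2, 3x3) Winograd convolution for one image (N = 1).
    image: (C, H, W); filters: (K, C, 3, 3); H - 2 and W - 2 assumed even."""
    C, H, W = image.shape
    K = filters.shape[0]
    out = np.zeros((K, H - 2, W - 2))
    # Transform all filters once, then quantize them in the Winograd domain.
    U = np.einsum('ij,kcjl,ml->kcim', G, filters, G)       # (K, C, 4, 4)
    Uq, su = quantize(U, bits)
    for ty in range(0, H - 3, 2):
        for tx in range(0, W - 3, 2):
            d = image[:, ty:ty + 4, tx:tx + 4]             # (C, 4, 4) patch
            V = np.einsum('ij,cjl,ml->cim', BT, d, BT)     # transformed input
            Vq, sv = quantize(V, bits)
            # Low-precision Hadamard product, accumulated over channels.
            M = (Uq * Vq[None]).sum(axis=1)                # (K, 4, 4) integer
            # Dequantize, then apply the output transform A^T ... A.
            Y = np.einsum('ij,kjl,ml->kim', AT, M * (su * sv), AT)
            out[:, ty:ty + 2, tx:tx + 2] = Y
    return out
```

With a sufficiently large bit-width the result matches direct convolution closely; at 8 bits it approximates it, which is exactly the accuracy/performance trade-off the experiments in Section 4 measure.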
3.3 Implementation on Graphics Processing Units
We describe an efficient implementation of Lance on graphics processing units (GPUs), which are commonly used to accelerate intelligent applications in recent years.
Data Layout. We use NHWC, a common format in deep learning applications, as the data layout of Lance, where N denotes the batch size, H the height of the input images, W the width of the input images, and C the number of channels. The same pixel position in different channels is stored contiguously, so these values can be processed simultaneously, and their parallelizability is not affected by the bit-width of the data type. Therefore, the NHWC format is well suited to low-precision data types.
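The layout property can be illustrated with NumPy strides; the tensor shapes here are arbitrary toy values.

```python
import numpy as np

# A toy activation tensor in NCHW, converted to NHWC.
x_nchw = np.arange(2 * 3 * 4 * 4, dtype=np.int8).reshape(2, 3, 4, 4)
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))  # (N, H, W, C)
# In NHWC, the C values of one pixel sit next to each other in memory:
# the channel axis has the smallest stride (one element), so all channels
# of a pixel can be fetched in one contiguous access, regardless of the
# element bit-width.
assert x_nhwc.strides[-1] == x_nhwc.itemsize
```

This contiguity is what lets low-precision elements of the same pixel be packed into wide loads, which matters more as the bit-width shrinks.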
Data Transformation and Hadamard Product. For the data transformation, each thread computes the transformed data of one patch in a single channel, and the quantization and dequantization of the transformed patches are parallelized likewise. The Hadamard product is computed as a low-precision general matrix multiplication (GEMM) [4], which can be performed by efficient specialized instructions on GPUs. We implement the low-precision GEMM by leveraging the WMMA [10] APIs of CUDA C++.
3.4 Embedding into Convolutional Neural Networks
Figure 2 depicts how Lance is applied in neural networks. The standard convolution is replaced with our low-precision Winograd convolution, and the corresponding transformation and quantization operations are inserted. We train the modified model with simulated quantization [7]. For inference, the transformation and quantization of the weights in Lance are computed offline only once, which decreases the runtime overhead.
4 Experiments
We evaluate the performance of Lance on representative image classification datasets of different scales, including SVHN [11], CIFAR [12], and ImageNet 2012 [13]. The main traditional convolution layers of the deep neural networks are replaced with our quantized Winograd convolution layers, and the training hyperparameters are kept the same for a given network architecture. The inference experiments are performed on a recent NVIDIA GPU, the Tesla T4.
SVHN  CIFAR-10
ConvNet-S  ConvNet  VGG-nagadomi
WI  ACC  WI  ACC  WI  ACC
32-32  0.9572  32-32  0.8912  32-32  0.9042
8-8  0.11%  8-8  1.04%  8-8  0.08%
4-4  0.30%  4-4  2.43%  4-4  3.48%
Table 1 illustrates the result of ConvNet-S [14] on the SVHN dataset and the results of ConvNet [14] and VGG-nagadomi [15] on the CIFAR-10 dataset. The accuracy of ConvNet-S on SVHN is slightly increased, possibly because the low-precision computation reduces the overfitting of the neural network. As can be seen, the accuracy loss of ConvNet, as well as VGG-nagadomi, on CIFAR-10 is trivial with 8-bit quantization.
ConvNet
  BNN  BWN  TWN  LANCE (Ours)  FULL
WI  1-1  1-32  2-32  4-4  4-32  8-8  32-32
ACC  0.454  0.862  0.871  0.8669  0.8919  0.9016  0.8912
VGG-nagadomi
  BNN  BWN  TWN  LANCE (Ours)  FULL
WI  1-1  1-32  2-32  4-4  4-32  8-8  32-32
ACC  0.734  0.874  0.887  0.8694  0.8908  0.9034  0.9042
Table 2 shows the accuracies under different quantization methods, including BNN [16], BWN [17], and TWN [18]. As can be seen, our 8-8 quantization outperforms the other methods, even those that use full-precision inputs with low-precision weights, and the computation on 8-bit data can be accelerated by low-precision computing on GPUs.
Figure 3 illustrates the results of ConvPool-CNN [19] on the CIFAR-100 dataset. The experimental results are colored according to the accuracy loss. In this experiment, we test the quantized Winograd convolution with different bit-widths of inputs and weights. As illustrated, the accuracy loss of the neural networks decreases as the bit-width increases.
WI  8-8  7-7  6-6  5-5  4-4
TOP-1 ACC  0.15%  0.04%  0.21%  0.45%  3.00%
TOP-5 ACC  0.04%  0.03%  0.15%  0.55%  2.26%
We also test a variation of ResNet-18 [20, 21] on ImageNet, a very large-scale dataset. As shown in Table 3, the top-1 accuracy varies by less than 0.2% and the top-5 accuracy by less than 0.1% with 8-8 quantization.
Figure 4 depicts the speedup of our method. Our 8-bit quantized Winograd convolution improves the performance by up to 2.40x over the full-precision Winograd convolution and 4.39x over the cuDNN implicit-GEMM convolution. The speedup increases with more filters and larger input sizes. In general, the time spent on the linear quantization operations is far less than that of the product computation. We note that the performance of some layers is not improved, because their sizes are very small and the overhead of quantization cannot be neglected.
Discussion. The experimental results confirm the efficiency of Lance. With 8-8 linear quantization, the performance of neural networks is significantly improved with trivial accuracy loss on datasets of different scales. Using non-linear quantization methods may improve the results further, which remains as future work.
5 Related Work
Efficient implementations of the Winograd convolution have been designed for different devices, such as mobile and edge devices [22, 23]. Specialized hardware for the Winograd convolution has also been proposed [24, 25, 26]. Moreover, several studies focus on increasing the sparsity of the Winograd convolution using neural network pruning methods [21, 27, 28], which is complementary to our work.
6 Conclusion
In this paper, we proposed Lance, an efficient quantized Winograd convolution algorithm for graphics processing units. The experimental results show that Lance fully exploits the potential of low-precision computation by embedding quantization techniques, achieving significant speedup with trivial accuracy loss.
References
 [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
 [3] Michael Mathieu, Mikael Henaff, and Yann LeCun, “Fast training of convolutional networks through ffts,” arXiv preprint arXiv:1312.5851, pp. 1–9, 2013.
 [4] Andrew Lavin and Scott Gray, “Fast algorithms for convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
 [5] Song Han, Huizi Mao, and William J Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, pp. 1–14, 2015.
 [6] Yunhui Guo, “A survey on methods and theories of quantized neural networks,” arXiv preprint arXiv:1808.04752, pp. 1–17, 2018.
 [7] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, “Quantization and training of neural networks for efficient integerarithmeticonly inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
 [8] Jian Cheng, Peisong Wang, Gang Li, Qinghao Hu, and Hanqing Lu, “Recent advances in efficient computation of deep convolutional neural networks,” Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 64–77, 2018.
 [9] Shmuel Winograd, Arithmetic complexity of computations, vol. 33, Siam, 1980.
 [10] NVIDIA, “NVIDIA Turing GPU architecture,” 2018.
 [11] Yuval Netzer, Wang Tao, Adam Coates, Alessandro Bissacco, and Andrew Y Ng, “Reading digits in natural images with unsupervised feature learning,” Nips Workshop on Deep Learning & Unsupervised Feature Learning, pp. 1–9, 2011.
 [12] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Computer Science Department, University of Toronto, Tech. Rep, vol. 1, pp. 1–60, 01 2009.
 [13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [14] Yuxin Wu et al., “Tensorpack,” https://github.com/tensorpack/, 2016.
 [15] Nagadomi, “Code for Kaggle CIFAR-10 competition, 5th place,” https://github.com/nagadomi/kaggle-cifar10-torch7, 2014.
 [16] Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 1–16.
 [17] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
 [18] Fengfu Li, Bo Zhang, and Bin Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, pp. 1–5, 2016.
 [19] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, pp. 1–14, 2014.
 [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [21] Xingyu Liu, Jeff Pool, Song Han, and William J. Dally, “Efficient sparse-winograd convolutional neural networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings, 2018, pp. 1–10.
 [22] Athanasios Xygkis, Lazaros Papadopoulos, David Moloney, Dimitrios Soudris, and Sofiane Yous, “Efficient winogradbased convolution kernel implementation on edge devices,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, pp. 1–6.
 [23] Partha Maji, Andrew Mundy, Ganesh Dasika, Jesse Beu, Matthew Mattina, and Robert Mullins, “Efficient Winograd or Cook-Toom convolution kernel implementation on widely used mobile cpus,” in EMC2 Workshop at HPCA 2019, HPCA.EMC2, 2019, pp. 1–5.
 [24] Chen Yang, Yi-Zhou Wang, Xiao-Li Wang, and Li Geng, “A reconfigurable accelerator based on fast winograd algorithm for convolutional neural network in internet of things,” in 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). IEEE, pp. 1–3.
 [25] Liqiang Lu and Yun Liang, “Spwa: an efficient sparse winograd convolutional neural networks accelerator on fpgas,” in 2018 55th ACM/ESDA/IEEE Design Automation Conference, DAC. IEEE, 2018, pp. 1–6.
 [26] Haonan Wang, Wenjian Liu, Tianyi Xu, Jun Lin, and Zhongfeng Wang, “A lowlatency sparsewinograd accelerator for convolutional neural networks,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 1448–1452.
 [27] Jiecao Yu, Jongsoo Park, and Maxim Naumov, “Spatialwinograd pruning enabling sparse winograd convolution,” arXiv preprint arXiv:1901.02132, pp. 1–12, 2019.
 [28] Yoojin Choi, Mostafa ElKhamy, and Jungwon Lee, “Jointly sparse convolutional neural networks in dual spatialwinograd domains,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2792–2796.