Introduction
Deep Convolutional Neural Networks (CNNs) have shown excellent performance on many machine learning tasks but have long been plagued by their huge amount of computation. Recently, CNNs have grown increasingly large to achieve better accuracy, which further increases the amount of computation: networks such as AlexNet, VGG-16, and SENet-154 each contain billions of multiplications. This massive amount of computation blocks the application of CNNs. Therefore, reducing the multiplications in CNNs is essential and meaningful.
Lavin lavin2016fast applied Winograd’s minimal filtering algorithm to reduce the number of multiplications in convolutions by half. Unfortunately, Winograd’s minimal filtering algorithm is only effective on small kernels with stride 1. When the kernels are larger than 3x3, the transformation matrices of Winograd’s minimal filtering algorithm introduce many more decimal multiplications, causing precision and efficiency problems. Another problem is that Winograd’s minimal filtering algorithm cannot be used on convolutions with stride larger than 1. Since convolutions with kernel size larger than 3x3 or stride larger than 1 are frequently used in CNNs, these two restrictions severely limit the application of Winograd’s minimal filtering algorithm.
To tackle the above drawbacks, we propose the Decomposable Winograd Method (DWM), which extends Winograd’s minimal filtering algorithm to the cases of large kernels and large strides. First, for kernels larger than 3x3, DWM decomposes the original kernel into several kernels no larger than 3x3, on which we can apply Winograd’s minimal filtering algorithm separately. Thus, in the large-kernel situation, DWM can still reduce the number of multiplications by 50% and keep the numerical accuracy the same as the original convolution. Second, for strides larger than 1, DWM splits the kernels into several parts and applies Winograd’s minimal filtering algorithm to each. E.g., with stride 2, DWM is equivalent to splitting the Winograd polynomials into odd and even ones and computing them respectively. Therefore, DWM breaks through the kernel-size and stride restrictions of Winograd’s minimal filtering algorithm. DWM has the advantages of both computation and numerical accuracy: it efficiently reduces the multiplications of regular convolutions. As shown in Figure 1(a), with the kernel size increasing from 3 to 11, the speedup of Winograd’s minimal filtering algorithm decreases seriously, while the speedup of DWM stays around 2x. Furthermore, DWM keeps the final result exactly equivalent to the result of the regular convolution, which makes it suitable for actual products. As shown in Figure 1(b), with the kernel size increasing from 3 to 11, the accuracy error of DWM stays small while the error of Winograd’s minimal filtering algorithm increases quickly. These advantages of DWM enable the fast exploration of larger kernel sizes and larger stride values in CNNs for high performance and accuracy, and even the potential for new CNNs. Experiments show that DWM achieves about 2x acceleration while keeping the numerical error under 1e-07, which is close to the numerical accuracy of FP32 convolution.
The contributions of this paper are threefold:
We identify the limitations of Winograd’s minimal filtering algorithm: it suffers from significantly increased FLOPs and numerical accuracy problems for kernel sizes larger than 3x3, and it fails on convolutions with stride larger than 1.
We propose a novel method, DWM, which decomposes kernels with large size or large stride into several small kernels with stride 1, to which the Winograd method can further be applied, so that DWM can reduce the number of multiplications while keeping the numerical accuracy.
We evaluate the proposed DWM method on convolutions with kernel size varying from 3 to 11 and stride from 1 to 2. Experimental results show that DWM is able to support all kinds of convolutions with a speedup of about 2x, without affecting the numerical accuracy.
Related Works
So far, with the success of convolutional neural networks, many researchers have focused on accelerating convolution using linear algebra properties. Cong and Xiao cong2014minimizing save 47% of the multiplications by utilizing linear algebra at the sub-matrix block level. Mathieu et al. madisetti1997digital first proposed using the Fast Fourier Transform (FFT) to reduce the computation of convolution operations. After that, the FFT algorithm was refined by Vasilache vasilache2014fast; their two GPU implementations, called cuFFT and fbfft, outperformed the convolution library made by NVIDIA at that time. Lavin lavin2016fast exploited several element-level linear algebra methods to reduce the number of multiplications, including Winograd’s minimal filtering algorithm and FFT. Nowadays, cuDNN, a state-of-the-art deep learning library, includes both the Winograd algorithm and the FFT algorithm in its convolution implementation.
Some researchers have made efforts to overcome the defects of the Winograd algorithm. Some of them tried to overcome the incompatibility between the Winograd algorithm and model sparsification. Li li2017enabling proposed a method to sparsify native Winograd coefficients and obtained sparsity levels beyond 90% with only 0.1% accuracy loss. Liu liu2018efficient moved the ReLU operation into the Winograd transformation and then pruned the weights in the Winograd domain; this approach reduced the number of multiplications with an accuracy loss of less than 0.1%. Choi choi2018compression proposed a pruning principle to keep the data sparse after the Winograd transformation. Some researchers attempted to solve the numerical accuracy and kernel-size problems. Barabasz barabasz2019winograd investigated a wider range of Winograd algorithms for DNNs, which significantly improves floating-point accuracy in many cases; in FP16, this approach gives up to 6.5 times better image recognition accuracy in one important case. Vincent vincent2017improving decreased the numerical error of the large-tile Winograd algorithm by selecting the polynomial points. Meng and Brothers meng2019efficient extended the Winograd algorithm to larger tiles by introducing complex numbers during the Winograd transformation. Other researchers focused on hardware implementations of the Winograd algorithm. Due to the low memory ceiling of GPU hardware, the Winograd algorithm can be used to speed up CPU convolution operations, achieving 5- to 25-fold improvements in throughput compared to previous state-of-the-art implementations. Besides, for mobile CPU acceleration, the Winograd algorithm achieves up to 60% performance improvements on full networks compared to im2col/im2row-based optimization techniques.
Different from the research mentioned above, our method focuses on the Winograd algorithm itself rather than on combining the Winograd algorithm with other methods such as sparsification. By extending the Winograd algorithm to a much wider range of situations, we efficiently reduce the number of multiplications while keeping the calculation’s numerical accuracy stably high.
Preliminary on the Winograd Algorithm
The Winograd Algorithm
As an equivalent problem to multi-dimensional FIR filtering, convolution can be implemented more efficiently using the Winograd minimal filtering algorithm. Denoting the problem of computing m outputs with an r-tap FIR filter as F(m, r), the corresponding Winograd convolution algorithm requires only m + r − 1 multiplications, while the direct method requires m × r.
The original Winograd algorithm is derived from the relationship between polynomial multiplication and 1-D convolution using the Chinese Remainder Theorem (CRT). For fixed m and r, the whole algorithm contains three fixed transformation matrices: A, G, and B.
Considering the original 1-D situation, the r-element filter g and the (m + r − 1)-element input signal d can be represented as polynomials g(x) and d(x):

g(x) = Σ_{i=0}^{r−1} g_i x^i,   d(x) = Σ_{i=0}^{m+r−2} d_i x^i    (1)
then the result of the convolution can be obtained by calculating the coefficients of the polynomial multiplication

s(x) = g(x) · d(x)    (2)
Applying the CRT, we can get three transformation matrices A, G, and B, and the process of doing convolution can be formulated as follows:

Y = A^T [(G g) ⊙ (B^T d)]    (3)
where Y denotes the convolution output and ⊙ denotes element-wise multiplication. For 2-D convolutions, we can nest F(m, r) with itself, and then get

Y = A^T [(G g G^T) ⊙ (B^T d B)] A    (4)
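As a concrete illustration, the 1-D case F(2, 3) computes two outputs of a 3-tap filter with four multiplications instead of six. The minimal numpy sketch below (illustrative only; the function names are ours, and the transformation matrices are the standard F(2, 3) matrices from the literature) checks the formula Y = A^T [(G g) ⊙ (B^T d)] against direct convolution:

```python
import numpy as np

# Standard transformation matrices of F(2, 3).
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter: 4 element-wise multiplications."""
    return AT @ ((G @ g) * (BT @ d))

def direct_f23(d, g):
    """Direct computation of the same two outputs: 6 multiplications."""
    return np.array([np.dot(g, d[0:3]), np.dot(g, d[1:4])])

rng = np.random.default_rng(0)
d, g = rng.standard_normal(4), rng.standard_normal(3)
assert np.allclose(winograd_f23(d, g), direct_f23(d, g))
```

The element-wise product in the middle is where the four multiplications happen; everything else is additions and trivial scalings.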
From equation (4), we can derive the gradient of the neuron (denoted as ∇d) and the gradient of the weight (denoted as ∇g) of the Winograd algorithm using the chain rule:

∇d = B [(G g G^T) ⊙ (A ∇Y A^T)] B^T,   ∇g = G^T [(B^T d B) ⊙ (A ∇Y A^T)] G    (5)

where ∇Y is the gradient passed from the next layer.
Large Kernel Size
The benefit of the Winograd algorithm comes from the simplicity of its transformation matrices. For example, applying the Winograd algorithm F(2, 3), the transformation matrices are as follows:

B^T = [[1, 0, −1, 0], [0, 1, 1, 0], [0, −1, 1, 0], [0, 1, 0, −1]]
G = [[1, 0, 0], [1/2, 1/2, 1/2], [1/2, −1/2, 1/2], [0, 0, 1]]
A^T = [[1, 1, 1, 0], [0, 1, −1, −1]]
However, for larger kernels the transformation matrices are no longer composed of only 0, ±1, and ±1/2; they fill with many different decimal values. The huge number of decimals in the transformation matrices makes the Winograd transformation not only computationally expensive but also less accurate.
Another problem of the Winograd algorithm is that it cannot be applied to convolutions with stride larger than 1, for it is derived from polynomial multiplication, which implies stride 1. Therefore, although the Winograd algorithm can implement convolutions much more efficiently, it is usually used only on 3x3, stride-1 convolutions.
The Decomposable Winograd Method
In this section, we propose a series of techniques called the Decomposable Winograd Method (DWM) to apply the Winograd algorithm mentioned above to more general cases such as larger kernels and strides larger than 1.
Large Kernel Size
As denoted before, g represents the 1-D convolution filter and r represents its size. Observing the derivation process above, we find that when r becomes large, g(x) in equation (1) can be split into several small polynomials, to which we can apply the original Winograd algorithm:

g(x) = g^(1)(x) + x^3 g^(2)(x) + x^6 g^(3)(x) + …, where each g^(i)(x) has degree less than 3.
Then from s(x) = g(x) · d(x) we can get

s(x) = g^(1)(x) d(x) + x^3 g^(2)(x) d(x) + x^6 g^(3)(x) d(x) + …
We can apply F(m, 3) or F(m, 2) to each g^(i)(x) d(x) separately, with half of the multiplications reduced. As for 2-D convolution, we can split the large kernel into small parts and then apply the Winograd algorithm to each part separately. The whole process is illustrated by Figure 3, which shows that we can process a common large-kernel convolution in five steps:
Splitting. Split the convolution kernel into several parts whose kernel sizes are at most 3x3, and then prepare the input signal by slicing off the redundant edges. This method is shown in Figure 1(a).
Transformation. Apply the corresponding Winograd transformations G and B^T (in F(m, 3) or F(m, 2)) to each part.
Calculation. Do element-wise multiplication and channel-wise summation.
Detransformation. Do the inverse transformation to change the intermediate results back to the spatial domain.
Aggregation. Sum the calculation results of each part, which gives the final result equivalent to the original convolution.
For example, when r = 5, we can split g(x) into two parts g^(1)(x) = g_2 x^2 + g_1 x + g_0 and g^(2)(x) = g_4 x + g_3. Then we get:

s(x) = g^(1)(x) d(x) + x^3 g^(2)(x) d(x)
For g^(1)(x), we can apply F(m, 3) to it, and for g^(2)(x), we can apply F(m, 2). When it comes to a 2-D convolution case, we can split the 5x5 kernel into 4 parts: 3x3, 3x2, 2x3, and 2x2, just as the method shown in Figure 1(a). This method’s advantage is that using F(m, 3) and F(m, 2) instead of larger ones not only reduces the multiplications of the Winograd transformation efficiently but also keeps the Winograd algorithm’s accuracy, because of the small number of multiplications introduced by the transformation matrices. We will further illustrate this advantage in the experiment part.
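The kernel decomposition above can be checked numerically. A minimal numpy sketch (illustrative only; `corr` is a hypothetical helper implementing direct "valid" correlation, i.e., the convolution used in CNNs) splits a 5-tap 1-D kernel into a 3-tap and a 2-tap part and confirms that the aggregated result equals the original convolution:

```python
import numpy as np

def corr(d, g):
    """Direct 'valid' correlation: y[i] = sum_j g[j] * d[i + j]."""
    r = len(g)
    return np.array([np.dot(g, d[i:i + r]) for i in range(len(d) - r + 1)])

rng = np.random.default_rng(0)
d = rng.standard_normal(16)   # 1-D input signal
g = rng.standard_normal(5)    # 5-tap kernel, larger than 3

# DWM splitting: g(x) = g^(1)(x) + x^3 * g^(2)(x)
full = corr(d, g)                        # the original convolution
part1 = corr(d, g[:3])[:len(full)]       # 3-tap part on the original signal
part2 = corr(d[3:], g[3:])[:len(full)]   # 2-tap part on the shifted signal
assert np.allclose(full, part1 + part2)  # aggregation recovers the result
```

In the real method each part would go through F(m, 3) or F(m, 2) rather than direct correlation; the point here is only that the splitting plus aggregation is exactly equivalent.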
Large Stride
Normal polynomial multiplication implies stride-1 convolution, but we can surmount this barrier by grouping the polynomials into several parts. Denoting the convolution stride as s, the 1-D convolution kernel g with r elements can be split into s parts:

g^(j)(x) = Σ_i g_{si+j} x^i,   j = 0, 1, …, s − 1

The input signal d with n elements can be split into s parts d^(j)(x) similarly.
Then we get s stride-1 convolutions, each of which can be represented by a polynomial multiplication g^(j)(x) · d^(j)(x), and the stride-s output is the sum of their results. By applying the Winograd algorithm to them, we successfully reduce the multiplications of 1-D convolutions with stride larger than 1. For 2-D convolutions, we nest the 1-D convolution method. This process also contains five steps: splitting, transformation, calculation, detransformation, and aggregation, similar to Figure 3. The splitting method is illustrated by Figure 1(b).
For instance, when dealing with stride-2 convolutions, we can group the convolution kernel and the input signal by the parity of their degrees, and then get

g(x) = g_e(x^2) + x · g_o(x^2),   d(x) = d_e(x^2) + x · d_o(x^2)

so that the stride-2 output is the sum of the two stride-1 convolutions of g_e with d_e and g_o with d_o.
In a 2-D case, a stride-2 convolution on an activation can be split into four stride-1 convolutions on the even/even, even/odd, odd/even, and odd/odd sub-tensors of the kernel and the input. Details are shown in Figure 1(b).
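The parity-based splitting can likewise be verified in 1-D. The sketch below (an illustration with a hypothetical `corr` helper, not the paper’s implementation) checks that a stride-2 convolution equals the sum of two stride-1 convolutions on the even and odd sub-signals:

```python
import numpy as np

def corr(d, g, stride=1):
    """Direct 'valid' correlation with a configurable stride."""
    r = len(g)
    return np.array([np.dot(g, d[i:i + r])
                     for i in range(0, len(d) - r + 1, stride)])

rng = np.random.default_rng(1)
d = rng.standard_normal(17)
g = rng.standard_normal(4)   # even-length kernel keeps the parts equal-sized

strided = corr(d, g, stride=2)   # the original stride-2 convolution
even = corr(d[0::2], g[0::2])    # even-degree terms, now stride 1
odd = corr(d[1::2], g[1::2])     # odd-degree terms, now stride 1
n = len(strided)
assert np.allclose(strided, even[:n] + odd[:n])
```

Each of the two stride-1 parts is now a candidate for the Winograd algorithm, which is the point of the decomposition.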
Occasionally, we need to do a stride 2 convolution with large kernel size, then we can combine the two techniques mentioned above. By applying these two techniques, we can optimize all kinds of convolutions with the Winograd algorithm, reducing the number of multiplications by half.
Using equation (5), we can apply DWM in the process of calculating gradients, and this also reduces the number of multiplications by half. From the SGD algorithm, we can derive that

W'_{t+1} = W'_t − η ∇W'_t

where W' denotes the weight in Winograd form and t denotes the iteration. We should notice that updating the weight in the Winograd form is not equivalent to transforming the weight updated in the spatial domain, so we keep the weight W in the normal form (or ’spatial domain’) instead of the Winograd form (or ’frequency domain’) during the training process, in order to make DWM an exact substitute for normal convolution.
(Table 1 columns: Kernel Size | H/W | Channel | Filters | FP32 | FP16 | Winograd FP32 | Winograd FP16 | DWM FP32 | DWM FP16)
Considering DWM, i.e., the two techniques mentioned above, we can derive the corresponding back-propagation rules. Denoting the output as Y, the two techniques have the same form in the aggregation step:

Y = Σ_i Y^(i)
From the derivative rules we know that for part i:

∇Y^(i) = ∇Y
This means that we do not need to store each part’s output Y^(i) for the backward pass. Furthermore, it is easy to derive that in the splitting step, the backward pass acts as follows:

∇d = Σ_i ∇d^(i)
where i denotes different parts produced by DWM.
Comparison and Discussion
We are not the first to try to solve the large-kernel problem that appears when applying the Winograd algorithm. Lu et al. tried to implement the Winograd algorithm on FPGAs, and they handled large kernels by padding the kernel. However, the padded elements are filled with non-zero values after the Winograd transformation, which means that padding brings many extra calculations. With DWM, we find a way to precisely separate the convolution operations without padding. This method avoids any redundant floating-point multiplications during the Winograd transformation, and thus achieves the best acceleration without any loss of numerical accuracy.
Furthermore, with the rise of neural architecture search (NAS), convolutions with larger kernel sizes have become more popular than ever. Networks like ProxylessNAS show that different computation environments breed different neural architectures. For example, when we implement our neural networks on GPUs rather than on mobile CPUs, large kernel sizes may be a more suitable choice than small ones because of the computational parallelism of GPUs. From beginning to end, the architecture of neural networks follows the most popular hardware, not the other way round. Hence, we believe that reducing the computation overhead of neural networks is worthwhile in any case.
Setup
All the results were tested on an NVIDIA V100 GPU. We have implemented DWM on TensorFlow and PyTorch. On both platforms, the implemented DWM performs like a plug-and-play operator, which makes it convenient to use during inference and training. We tested the numerical accuracy of the algorithms by doing convolution on a single layer with standard-normal random numbers, and then calculating the mean squared error (MSE) against FP64 convolution results. The numerical accuracy on a single layer was tested on both platforms, and the results were the same. For all single-layer tests, the batch size is set to 256 and the layers use ’same’ padding. We measured the accuracy and FLOPs of networks based on PyTorch. The traditional model architectures are obtained from TorchVision, and the ProxylessNAS analysis is based on the official PyTorch implementation. The networks’ accuracy was measured on ImageNet 2012.
Numerical Accuracy of Single Layer
We estimated the mean squared error (MSE) between several methods and the FP64 results by doing a forward convolution. The input signal and convolution weights are random numbers generated from the standard normal distribution using NumPy with seed 11. We assumed two application situations: larger feature maps with fewer channels, and smaller feature maps with more channels, which is consistent with real networks. As shown in Table 1: (1) DWM has better numerical accuracy than the traditional Winograd algorithm in almost all situations. The traditional Winograd algorithm clearly faces a serious numerical accuracy problem as the kernel size grows. When the kernel size is large, the error of the FP32 Winograd algorithm approaches FP16’s, which may cause accuracy problems. By contrast, DWM’s numerical error stays at a low level, close to the FP32 result, meaning that it can be applied to all kinds of convolution operations without any problem. (2) When using FP16, the traditional Winograd algorithm may overflow. This may be caused by the intermediate results of the Winograd transformation: when the kernel size becomes large, the transformation matrices fill with large numbers, and these numbers may make the results of the matrix multiplications too large to be represented in FP16. This also shows the advantage of using DWM instead of the traditional Winograd algorithm. Furthermore, the FP16 DWM is implemented without any optimization, which means its numerical accuracy can still be improved. (3) In some situations, DWM FP32 is even more accurate than direct FP32 convolution. This is reasonable because DWM uses fewer multiplications, which may give it better numerical accuracy.
FLOPs Estimation on Single Layer
We calculated the FLOPs of convolutions with different kernel sizes and two kinds of stride. Due to the decimals in the transformation matrices of the traditional Winograd algorithm, the FLOPs caused by the transformation cannot be ignored; thus we only exclude the multiplications that can easily be implemented by shifting. As shown in Table 3: (1) DWM costs less computation than the traditional Winograd algorithm in all situations. As the kernel size becomes larger, the FLOPs of the traditional Winograd algorithm increase heavily, and most FLOPs concentrate on the Winograd transformation because the transformation becomes a non-sparse matrix multiplication. On the contrary, the speedup of DWM stays steady thanks to its simple transformation matrices. (2) DWM speeds up stride-2 convolutions by about 2x, which cannot be achieved by the traditional method. Not surprisingly, due to the splitting method of DWM, the speedup of stride-2 convolutions still holds stably. These advantages lead to a stable speedup on almost all kinds of convolutions.
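As a rough illustration of these counts, the sketch below estimates element-wise multiplications per 2x2 output tile. The splitting into chunks of 3 and 2 and the cost model (m + a − 1 multiplications per 1-D factor of extent a, transformation cost ignored) are simplifying assumptions of ours, so the numbers are indicative rather than the exact Table 3 values:

```python
from itertools import product

def split_sizes(r):
    """Decompose kernel extent r (r >= 2) into chunks of size 3 or 2."""
    q, rem = divmod(r, 3)
    if rem == 0:
        return [3] * q
    if rem == 2:
        return [3] * q + [2]
    return [3] * (q - 1) + [2, 2]   # rem == 1: avoid a trivial 1-wide part

def mults_per_tile(r, m=2):
    """Element-wise multiplications per m x m output tile: direct vs. DWM."""
    direct = (m ** 2) * r * r
    dwm = sum((m + a - 1) * (m + b - 1)
              for a, b in product(split_sizes(r), repeat=2))
    return direct, dwm

for r in range(3, 12):
    direct, dwm = mults_per_tile(r)
    print(f"{r}x{r}: direct={direct:4d}  DWM={dwm:4d}  speedup={direct / dwm:.2f}x")
```

Under these assumptions, the estimated speedup stays around 2x for every kernel size from 3 to 11, matching the stable trend described above.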
We also tested the actual runtime of convolution operations based on naive implementations of the different kinds of convolutions. The results were collected with nvprof, an NVIDIA profiling tool. According to Table 4: (1) DWM outperforms the original Winograd algorithm, especially in large-kernel situations. (2) DWM is faster than cuDNN in some situations. Actually, when the size of the feature map increases to around 100, cuDNN performs better than DWM (not presented due to space limits). The instability of cuDNN may be caused by its other accelerating algorithms. Hence, both DWM and cuDNN have their advantages. Furthermore, the naive DWM implementation can still be optimized by kernel fusion, software pipelining, and other optimization techniques, unlike the simple traditional 3x3 Winograd algorithm. Since this paper mainly focuses on speeding up Winograd from the algorithm aspect, the implementation optimization of DWM is left as future work.
Total Analysis on Networks
Finally, we analyzed several representative networks. This analysis includes a comparison of the total FLOPs of each network and the top-1 accuracy of inference on the ImageNet 2012 dataset. The top-1 accuracy of the two accelerating algorithms is tested using FP16. As shown in Table 2: (1) the top-1 accuracies are very close. This result is not surprising, because Table 1 shows that only large kernel sizes make DWM and the traditional Winograd algorithm differ; thus, networks that consist mostly of small-kernel convolutions get similar inference results. (2) When there are more large kernels in the network, such as in AlexNet, GoogLeNet, and Inception-v3, DWM performs better.
However, in some cases such as ResNet-152, most of the computation is produced by small kernels. Worse, in some modern NAS architectures, the calculation is concentrated on small kernels because of separable convolutions. Over 90% of the calculation in convolutional neural networks is caused by convolution operations; hence, most of these architectures are designed or searched based on separable convolutions to reduce the amount of calculation. Although separable convolutions can cut the FLOPs down effectively, they reduce the representational power of the original neural networks. When computing power grows and more fast convolution algorithms are invented, FLOPs will no longer be the main consideration in architecture design.
Conclusion
In this paper, we propose a novel method, DWM, to extend Winograd’s minimal filtering algorithm to wide and general convolutions. Winograd’s minimal filtering algorithm has been widely used to reduce the number of multiplications for faster processing. However, it suffers from significantly increased FLOPs and numerical accuracy problems for kernel sizes larger than 3x3, and it fails on convolutions with stride larger than 1, so it is only effective on convolutions with kernel size 3x3 and stride 1. To solve these problems, we propose DWM to break through the limitations of the original Winograd’s minimal filtering algorithm on convolutions with large kernels and large strides. DWM decomposes kernels with large size or large stride into several small kernels with stride 1, to which the Winograd method can further be applied, so that DWM can reduce the number of multiplications while keeping the numerical accuracy. Experimental results show that the proposed DWM is able to support all kinds of convolutions with a speedup of about 2x, without affecting the numerical accuracy. These good properties of DWM enable the fast exploration of larger kernel sizes and larger stride values in CNNs for high performance and accuracy, and even the potential for new CNNs.
Acknowledgments
This work is partially supported by the National Key Research and Development Program of China (under Grant 2017YFA0700902), the NSF of China (under Grants 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, 61732007, 61732020 and 61906179), Beijing Natural Science Foundation (JQ18013), the 973 Program of China (under Grant 2015CB358800), National Science and Technology Major Project (2018ZX01031102), the Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences (KFJ-HGZX-013), Key Research Projects in Frontier Science of Chinese Academy of Sciences (QYZDB-SSW-JSC001), Strategic Priority Research Program of Chinese Academy of Sciences (XDB32050200, XDC01020000), and Standardization Research Project of Chinese Academy of Sciences (BZ201800001).
- Abadi, M. et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- Budden, D. et al. (2017) Deep tensor convolution on multicores. In Proceedings of the 34th International Conference on Machine Learning, pp. 615–624.
- Cai, H.; Zhu, L.; Han, S. (2018) ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
- Chetlur, S. et al. (2014) cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
- He, K.; Zhang, X.; Ren, S.; Sun, J. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Hu, J.; Shen, L.; Sun, G. (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- Huang, G. et al. (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- Lavin, A.; Gray, S. (2016) Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4021.
- Lu, L. et al. (2017) Evaluating fast algorithms for convolutional neural networks on FPGAs. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 101–108.
- Maji, P. et al. (2019) Efficient Winograd or Cook-Toom convolution kernel implementation on widely used mobile CPUs. arXiv preprint arXiv:1903.01521.
- Paszke, A. et al. (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- Russakovsky, O. et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), pp. 211–252.
- Simonyan, K.; Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Szegedy, C. et al. (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- Szegedy, C. et al. (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Winograd, S. (1980) Arithmetic complexity of computations. Vol. 33, SIAM.