Convolutional neural networks (ConvNets) have emerged as a widespread machine learning method for many application domains, soon after the demonstration of their superior performance for classification and localization tasks in the ImageNet (Deng et al., 2009) competition in 2012 (Krizhevsky et al., 2012). Since convolutional layers are computationally intensive and dominate the total execution time of modern deep ConvNets (Szegedy et al., 2015; Krizhevsky et al., 2012; Montufar et al., 2014; Simonyan and Zisserman, 2014), many efforts have been made to improve the performance of the convolutional primitives for CPUs (Zlateski et al., 2016a; Vanhoucke et al., 2011; Budden et al., 2016; fal, 2016; Zlateski and Seung, 2017), GPUs (Chetlur et al., 2014; Mathieu et al., 2013; Vasilache et al., 2014; neo, 2018), or both (Zlateski et al., 2016b).
An important class of optimizations reduces the number of calculations required for a convolution. Initially, several efforts used FFT–based convolutions to reduce the required computations for GPUs (Mathieu et al., 2013; Vasilache et al., 2014) and CPUs (Zlateski et al., 2016a, b). Recently, Lavin et al. (Lavin and Gray, 2016) proposed a Winograd–based method for performing convolutions, and demonstrated that it can save more floating point operations than FFT, especially for small 2D kernels (e.g. 3 × 3). This prompted a large number of implementations to employ Winograd–based convolutions. For example, Nervana (neo, 2018) and Nvidia's cuDNN (Chetlur et al., 2014) implemented Winograd–based convolution for GPUs. FALCON (fal, 2016), LIBXSMM (lib, 2018) and Intel MKL-DNN (mkl, 2016) provided CPU implementations of Winograd–based convolutions. Budden et al. (Budden et al., 2016) extended the 2D Winograd–based convolution to arbitrary dimensions and kernel sizes. A highly optimized implementation for modern many–core CPUs was presented by (Jia et al., 2018). However, the community has not put similar optimization effort into FFT–based implementations.
This paper presents results of comparing Winograd–based with FFT–based convolutions on modern CPUs. We have extended the highly optimized Winograd–based implementation for Intel Xeon Phi processors (Jia et al., 2018) to support arbitrary modern multi– and many–core CPUs (that support the AVX2 or the AVX512 instruction set). Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One is based on the regular FFT algorithm (Regular–FFT). The other is also FFT–based but uses Gauss' multiplication method (Gauss–FFT). All implementations are open–sourced at (del, 2018).
We have compared all three implementations with two popular ConvNets (VGG and AlexNet) on 10 different systems with modern CPUs. Our results show that, contrary to the popular belief, the Regular–FFT or Gauss–FFT implementations are generally faster than the Winograd implementation, and in some cases substantially faster.
To understand the experimental results, we have used the Roofline performance model (Williams et al., 2008) to analyze each algorithm and its implementation in detail by carefully analyzing its computational phases. This analysis provided two key explanations. First, the Winograd–based approach, in most cases, requires fewer floating point operations than the FFT–based approach because it works with real numbers instead of complex numbers. However, due to its numerical instability, the Winograd method can use only very small transform sizes (Lavin and Gray, 2016; Vincent et al., 2017; Budden et al., 2016). FFT–based convolutions do not suffer from such instabilities, allowing for arbitrarily large tile sizes. Large tile sizes allow the FFT–based approach to eliminate a large number of redundant or unnecessary computations. Thus, in certain scenarios, the FFT–based method requires fewer operations than the Winograd–based one.
Second, our analysis considers not only the number of floating point operations, but also the total amount of data movements (to and from memory) and their costs, the arithmetic intensity (operations per moved byte), the processor’s speed, as well as the memory bandwidth. We also analyze the effects of cache sizes, which determine the arithmetic intensity of both methods, and indirectly affect the running times.
Large arithmetic intensity allows for better utilization of hardware systems whose compute–to–memory ratios are increasing over time, because processor speeds and memory bandwidths evolve unevenly: compute speeds typically improve much faster than memory bandwidths (Wulf and McKee, 1995). (For instance, the Intel Knights Landing processor (Jeffers et al., 2016) has a compute–to–memory ratio of roughly 11, whereas the latest Skylake Xeon and SkylakeX families of processors have reached ratios in the range between 20 and almost 40.) This benefits the FFT–based method more, as its arithmetic intensity is generally higher due to computing with complex numbers.
Our analysis suggests that whether the Winograd–based or an FFT–based approach is faster depends on the specific convolutional layer and the particular system it is executed on. However, on average, the FFT–based approaches outperform the Winograd–based one with commonly used neural networks, with the margin increasing as the system's compute–to–memory ratio increases.
The findings in this paper challenge the current belief that Winograd–based convolutions are better in nearly all practical cases.
2.1. Winograd– and FFT– Based Convolutions
Winograd – Based Convolution
As recently illustrated by Lavin et al. (Lavin and Gray, 2016), “valid” convolution of discrete signals, the main operation used in modern convolutional neural networks, can be performed using Winograd’s minimal filtering algorithm (Winograd, 1980) via
y = A^T [ (G g) ⊙ (B^T x) ],   (1)

where x and g are 1D signals (the image and the filter); ⊙ represents element–wise multiplication; and A, G and B are special matrices, derived from Vandermonde matrices for homogeneous coordinate polynomials (Vincent et al., 2017). By convention, Winograd convolutions have the matrices A, G and B in real–space. In "valid" convolution the filter slides across every "valid" location in the filtered image, such that the filter is fully contained inside the image. When the size of the filter g is r, y will have a length of m = |x| − r + 1, and we refer to the method above as Winograd convolution F(m, r).
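To make Eqn. 1 concrete, the minimal filtering algorithm F(2, 3) can be sketched in a few lines of NumPy, using the standard F(2, 3) matrices from Lavin and Gray (2016); the direct "valid" convolution below serves only as a reference:

```python
import numpy as np

# Standard Winograd F(2, 3) matrices (Lavin and Gray, 2016): 2 outputs
# of a "valid" convolution with a 3-tap filter are computed from a
# 4-element input tile using 4 element-wise multiplications instead of 6.
BT = np.array([[1.,  0., -1.,  0.],
               [0.,  1.,  1.,  0.],
               [0., -1.,  1.,  0.],
               [0.,  1.,  0., -1.]])
G = np.array([[1.,   0.,  0.],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.,   0.,  1.]])
AT = np.array([[1., 1.,  1.,  0.],
               [0., 1., -1., -1.]])

def winograd_f23(x, g):
    """y = A^T [(G g) . (B^T x)] for a 4-element tile x and 3-tap filter g."""
    return AT @ ((G @ g) * (BT @ x))

def valid_conv(x, g):
    """Direct "valid" convolution as used in ConvNets (no filter flip)."""
    r = len(g)
    return np.array([x[i:i + r] @ g for i in range(len(x) - r + 1)])
```

Both functions produce identical results on any 4-element tile, while the Winograd version performs only four element–wise multiplications.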
FFT – Based Convolution
The convolution theorem states that a convolution can be performed using Fourier transforms via

x ∗ g = F⁻¹( F(x) · F(g) ).   (2)
Here, F and F⁻¹ are the Fourier and inverse Fourier transforms. In the discrete case, x and g need to have the same number of elements, which can be accomplished by padding zeros to the shorter signal.
The discrete Fourier transform (DFT) results in a circular (also known as cyclic) convolution. The result of the "valid" convolution can be extracted from the last m elements of the circular convolution.
"Valid" convolution using discrete FFTs can also be regarded as a special case of Eq. 1, where the matrices A, G and B are in complex space and are derived from Vandermonde matrices with the polynomial points being the roots of unity (Vincent et al., 2017). B^T and G perform implicitly zero–padded (to a size of m + r − 1) DFT transforms of x and g, and A^T computes the last m elements of the inverse DFT transform. Using the FFT algorithm allows for efficient computation of the matrix–vector products with the matrices A^T, G and B^T. We refer to this special case as Regular–FFT F(m, r).
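As a sketch of this procedure with NumPy's FFT routines (the function name and the use of real-to-complex transforms are our choices for illustration, not part of any particular library), a ConvNet-style "valid" convolution of x with g can be computed via a circular convolution of length n = len(x):

```python
import numpy as np

def fft_valid_conv(x, g):
    """ConvNet-style "valid" convolution (correlation) of x with filter g,
    computed as a circular convolution of length n = len(x)."""
    n, r = len(x), len(g)
    # Implicitly zero-pad the (reversed) filter to length n; the product
    # of the DFTs then corresponds to a circular convolution of length n.
    xf = np.fft.rfft(x, n)
    gf = np.fft.rfft(g[::-1], n)
    circ = np.fft.irfft(xf * gf, n)
    # The last m = n - r + 1 elements are unaffected by the circular
    # wrap-around and equal the "valid" part of the linear convolution.
    return circ[r - 1:]
```

The filter is reversed because ConvNets compute correlation; the mathematical convolution theorem applies to the flipped filter.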
Both Winograd and FFT convolutions can easily be extended to an arbitrary number of dimensions (Budden et al., 2016). An N–dimensional convolution is performed via

y = [ (g ×_1 G ⋯ ×_N G) ⊙ (x ×_1 B^T ⋯ ×_N B^T) ] ×_1 A^T ⋯ ×_N A^T,   (3)

where ×_n represents tensor–matrix mode-n multiplication as defined in (Kolda and Bader, 2009; Budden et al., 2016). For the 2D case, g ×_1 G ×_2 G = G g G^T and x ×_1 B^T ×_2 B^T = B^T x B, so the formula above reduces to

y = A^T [ (G g G^T) ⊙ (B^T x B) ] A,   (4)

which is consistent with Lavin et al. (Lavin and Gray, 2016).
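The 2D reduction can be checked with a small NumPy sketch of F(2 × 2, 3 × 3), nesting the standard 1D F(2, 3) matrices along both dimensions; the direct 2D "valid" convolution below is only a reference:

```python
import numpy as np

# Standard Winograd F(2, 3) matrices (Lavin and Gray, 2016).
BT = np.array([[1.,  0., -1.,  0.],
               [0.,  1.,  1.,  0.],
               [0., -1.,  1.,  0.],
               [0.,  1.,  0., -1.]])
G = np.array([[1.,   0.,  0.],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.,   0.,  1.]])
AT = np.array([[1., 1.,  1.,  0.],
               [0., 1., -1., -1.]])

def winograd_f2x2_3x3(x, g):
    """Y = A^T [(G g G^T) . (B^T x B)] A : a 2x2 output from a 4x4 tile."""
    return AT @ ((G @ g @ G.T) * (BT @ x @ BT.T)) @ AT.T

def valid_conv2d(x, g):
    """Direct 2D "valid" convolution (no filter flip), as a reference."""
    r0, r1 = g.shape
    h, w = x.shape
    return np.array([[np.sum(x[i:i + r0, j:j + r1] * g)
                      for j in range(w - r1 + 1)]
                     for i in range(h - r0 + 1)])
```

The nested form uses 16 element–wise multiplications per 2 × 2 output tile, versus 36 for the direct computation.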
2.2. Winograd– and FFT– Based Convolution Layers
A convolutional layer transforms an input tuple of C images into an output tuple of C′ images. A batch of B inputs yielding B outputs is processed at a time via

y_{b,i} = Σ_{j=1}^{C} x_{b,j} ∗ k_{i,j},  b = 1 … B,  i = 1 … C′,   (5)

where C and C′ denote the number of input/output images (also called channels or feature maps), and x_{b,j} and y_{b,i} are the (arbitrary–dimensional) input and output images of the b–th batch. In total, B · C · C′ convolutions are performed.
Using Winograd or Regular–FFT convolution, the output images are computed (shown here for the 2D case) via

y_{b,i} = A^T [ Σ_{j=1}^{C} (G k_{i,j} G^T) ⊙ (B^T x_{b,j} B) ] A.   (6)
Note that we can have matrices A, G and B of different sizes for each dimension.
Eqn. 6 assumes particular sizes of the image tiles (t_n = m_n + r_n − 1) and the kernels (r_n) along the n–th dimension. For larger image sizes, the convolution is performed using the overlap–add method (OLA) (Rabiner and Gold, 1975). With OLA, the input images are divided into tiles with sizes of t_n, and an overlap of r_n − 1 along the n–th dimension. Considering tiles at the same location from all the input images, tiles of size m_n of the output images are computed using the formula above.
The main savings in computation in both the Winograd and FFT methods come from the fact that both the kernel transforms and the image (tile) transforms can be precomputed and reused many times. The computation is dominated by computing the dot products: accumulations of the element–wise products inside the square brackets in Eqn. 6. Computing all the dot products in Eqn. 6 is a problem equivalent to matrix multiplications, with real matrices in the case of Winograd and complex matrices in the case of Regular–FFT convolution.
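This equivalence can be sketched with NumPy on made-up channel and tile counts: accumulating the element–wise products over input channels is, per transform element, one (C′ × C) by (C × T) matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)
C, Cp, T, E = 4, 6, 8, 16  # in-channels, out-channels, tiles, elements/tile

img = rng.standard_normal((C, T, E))   # transformed image tiles
ker = rng.standard_normal((Cp, C, E))  # transformed kernels

# Direct accumulation of element-wise products over input channels ...
out = np.zeros((Cp, T, E))
for cp in range(Cp):
    for t in range(T):
        for c in range(C):
            out[cp, t] += ker[cp, c] * img[c, t]

# ... equals, for every transform element e, the matrix multiplication
# (Cp x C) @ (C x T) of the kernel and image matrices at that element.
out_mm = np.einsum('pce,cte->pte', ker, img)
```

In an optimized implementation, these E independent matrix multiplications are what the element–wise stage executes with JIT-generated routines.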
2.3. Gauss’ Multiplication of Complex Numbers
In the Regular FFT convolution, the computation is dominated by complex matrix multiplications, where each complex number pair multiplication requires 4 real multiplications, when computed directly, and 3 real multiplications when using Gauss’ multiplication algorithm (MacLaren, 1970; Lavin and Gray, 2016).
Using Gauss' multiplication algorithm, the product of two complex numbers a + bi and c + di is computed by first computing k₁ = c(a + b), k₂ = a(d − c), and k₃ = b(c + d). The real part of the result is then equal to k₁ − k₃ and the imaginary part to k₁ + k₂. Similarly, an element–wise product of two complex tensors can be performed using three element–wise products of real–valued tensors.
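A minimal sketch of this identity (the function name is ours):

```python
def gauss_mul(a, b, c, d):
    """(a + bi)(c + di) using 3 real multiplications instead of 4."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2  # (real part, imaginary part)
```

Applied element–wise to tensors, the three multiplications become three real–valued tensor products, which is what enables the savings described below.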
For the Regular–FFT convolutional layer, element–wise products of the complex tensors representing the image (tile) transforms and the kernel transforms are performed, and each tensor is reused many times (Eqn. 6). After performing a transform of an image tile, which yields a complex tensor a + bi, a real–valued tensor a + b can be computed and stored alongside a and b. Similarly, for a transformed kernel c + di, the tensors d − c and c + d can be computed during the kernel transforms and stored alongside c (d does not have to be stored). Each element–wise product of complex tensors can then be replaced with three independent element–wise products of real–valued tensors.
The resulting three real tensors are implicitly converted back to a single complex tensor during the computation of the inverse transform.
Computing all the dot products in Eqn. 6 is then performed using three real–valued matrix multiplications instead of a single complex matrix multiplication, reducing the number of required operations by 25%.
We refer to the FFT method using Gauss' multiplication as Gauss–FFT F(m, r).
Both Winograd and FFT approaches perform computations in four distinct stages: input transform, kernel transform, element–wise computation, and inverse transform. The first two stages convert the images/kernels from a spatial domain into Winograd or FFT domain. The third stage is equivalent to matrix multiplications. The inverse transform stage converts the results back to the spatial domain.
We extended the publicly available Winograd implementation from (wCo, 2018; Jia et al., 2018), which was highly optimized for many–core CPUs, to support arbitrary AVX2 and AVX512 multi–core CPUs. We used it as a base for our FFT implementations. We reused most of the code (90%) in order to leverage the same optimization methods.
In order to achieve high utilization of the hardware, both Winograd and FFT methods use the identical optimizations as proposed in (Jia et al., 2018), including software prefetching, memory and cache blocking, using aligned vector data access and interleaving memory access with computation.
We adopted the data layout proposed in (Jia et al., 2018; Jeffers et al., 2016; Zlateski and Seung, 2017) for the input images, kernels and output, where blocks of 16 images are interleaved in memory for easy vectorization. In (Jia et al., 2018), 16 was used due to the size of the AVX512 vector register; we keep it at 16 regardless of the vector register size, as 16 32–bit floats match the cache–line width, facilitating efficient utilization of the memory subsystem.
For data hand–off between the four stages of the algorithm, streaming stores to main memory were used, since the data will not be reused in the near future. This saves memory bandwidth and avoids cache pollution.
Overall, all three implementations achieved high utilization of the available hardware, as discussed in the following sections.
To perform the transforms of the input images and kernels, as well as of the output images, the implementation of (Jia et al., 2018) provides C++ codelets that transform multiple tiles at the same time. The codelets are created by generating Winograd transforms using Wincnn (win, 2018), after which a computation graph is created and optimized, yielding codelets that utilize AVX512 instructions to transform 16 tiles at a time.
For the FFT–based implementations, these codelets were replaced by C++ codelets generated using "genfft", supplied with FFTW (Frigo and Johnson, 1998). "Genfft" was modified so that it can generate codelets that perform implicitly zero–padded FFT transforms, as well as codelets computing only a subset of the elements of the inverse transform. Multidimensional (forward) transforms were implemented by combining codelets performing implicitly zero–padded real–to–complex transforms along one dimension with codelets performing complex–to–complex transforms along the other dimensions. Backward transforms combined complex–to–complex codelets with complex–to–real ones. Unlike existing FFT–based convolution implementations, which limit transform sizes to ones with only small prime factors (Zlateski et al., 2016b) or to powers of two (Mathieu et al., 2013; Appleyard, 2018), our implementations support arbitrary sizes.
For the element–wise stage, where matrix–matrix multiplications are performed, the implementation of (Jia et al., 2018) provides JIT routines for real–matrix multiplications optimized for AVX512 instruction set. Following the same principles of (Jia et al., 2018) we implemented JIT real–matrix multiplication routines optimized for the AVX2 instruction set, as well as complex–number multiplication routines for both AVX512 and AVX2 instruction sets, which are required for the Regular–FFT method.
Parallelization Through Static Scheduling
Each of the stages of our algorithm is parallelized using static scheduling originally proposed in (Zlateski and Seung, 2017), using the generalized implementation provided by (Jia et al., 2018). To achieve optimal performance, each core is assigned roughly the same amount of computation. The work is then executed using a single fork–join routine.
4. Performance Comparisons
Table 1. Characteristics of the benchmarked systems.

| Processor | Cores | GFLOPS | SIMD width (bits) | L2 cache/core | Bandwidth (GB/s) | CMR |
|---|---|---|---|---|---|---|
| Xeon Phi 7210 | 64 | 4506 | 512 | 512 KB | 409.6 | 11 |
| Xeon Gold 6148 | 20 | 3072 | 512 | 1 MB | 128 | 24 |
| Xeon Platinum 8124M | 18 | 3456 | 512 | 1 MB | 115.2 | 30 |
| Xeon Phi 7210 | 48 | 4506 | 512 | 512 KB | 102.4 | 33 |
| Xeon Phi 7210 | 64 | 4506 | 512 | 512 KB | 102.4 | 39.11 |
To compare the running times of the FFT and Winograd methods, we benchmarked our FFT–based implementations and the improved Winograd implementation of Jia et al. (Jia et al., 2018) on the two most popular ConvNets, AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014), which are frequently used for benchmarking (ima, 2018). We benchmarked the time required for the forward propagation of each distinct layer of the two networks.
Additional publicly available implementations were included for reference: the Winograd implementations provided by LIBXSMM (lib, 2018) and MKL-DNN (mkl, 2016), and the direct convolution provided by MKL-DNN (mkl, 2016). To the best of our knowledge, no other implementations of FFT–based methods for CPUs, besides our own, are currently available.
Both the Winograd and FFT methods can work with arbitrary transform sizes. Generally, larger transform sizes decrease the number of required operations. However, the numerical inaccuracy of Winograd convolutions increases exponentially with the transform (tile) size (Pan, 2016; Jia et al., 2018; Lavin and Gray, 2016). In practice, there is an upper bound on the transform size for which Winograd produces satisfactory results. All major vendors, such as FALCON, MKL-DNN, LIBXSMM, and cuDNN (fal, 2016; lib, 2018; mkl, 2016; Appleyard, 2018), implement the Winograd algorithm with transform sizes of at most 6 × 6. (With these transform sizes, the average numerical error of the Winograd method on the benchmarked layers was similar to that of direct convolution; increasing the transform size further increased the error, as expected (Jia et al., 2018). The numerical error of the FFT–based method remained low regardless of the tile size.) For these tile sizes, the numerical error of the computation is comparable to that of an implementation that computes the convolution directly.
We follow the same convention, and consider Winograd convolutions only with tile sizes of at most 6 along each dimension. However, we allow the FFT methods to use arbitrarily large tile sizes, as they do not suffer from such numerical instability.
Running Time Experiments
The benchmarks were performed on a total of 10 systems shown in Tbl. 1.
In Fig. 1 we show detailed results on one of the systems. Note that both LIBXSMM's and MKL-DNN's Winograd implementations support only kernel sizes of 3 × 3, and thus cannot use the Winograd method for the second layer of AlexNet.
The Winograd–based method was fastest on only 3 out of 12 layers, whereas an FFT–based method was fastest on 6 layers; on the remaining 3 layers, they had roughly the same performance. More importantly, when the FFT methods outperformed, they did so by a larger margin, and on the layers that require more computation time. This suggests that the FFT methods can allow for significant savings in overall computation time when compared to Winograd. For instance, the time spent on all convolutional layers in AlexNet using the Winograd method would consume 58.79 milliseconds, whereas the Regular–FFT method requires only 31.96 milliseconds; that is a speedup of 1.84×.
Additionally, in Fig. 2 we show the normalized running times on all other AVX512 systems (neither LIBXSMM nor MKL-DNN supports the AVX2 instruction set). The results for each individual layer are scaled by the slowest implementation, as we are only interested in the relative performance. Detailed results are available in the Appendix. In all cases, our two FFT–based implementations as well as the modified Winograd implementation of (Jia et al., 2018) outperformed other publicly available implementations. Thus, for the rest of the paper, we will only focus on these three implementations.
FFT transform sizes
An important observation was that the optimal transform sizes for the FFT method were not always powers of two; they were 27 for VGG1.2, 25 for VGG2.1 and VGG2.2, 21 for VGG3.1 and VGG3.2, 16 for VGG4.1 and VGG4.2, and 9 for VGG5.1/5.2. For AlexNet, the optimal sizes were 31 for the second layer, and 15 for all other layers. This is counterintuitive, as the common belief is that optimal FFT transforms should use sizes whose only factors are small primes (Frigo and Johnson, 1998; Zlateski et al., 2016b) or sizes equal to powers of two, as suggested by the two GPU FFT–based implementations, fbfft (Mathieu et al., 2013; Vasilache et al., 2014) and cuDNN (Chetlur et al., 2014).
While an FFT–based method outperformed the Winograd method more often than not, the relative performance varied among different layers and systems. In the rest of the paper we focus on a theoretical analysis of all the methods. We would like to understand why our findings are not aligned with the popular belief that the Winograd method should be strictly faster.
5. Performance Analysis
Our experimental observations suggested that some of the stages in both the Winograd– and FFT–based approaches have relatively low utilization of the system's available FLOPS. In most, but not all, cases these were the transform stages, which perform a relatively small amount of compute but access a relatively large amount of data, suggesting that their running time is bound by the memory bandwidth, and not by the computational capabilities of the CPU.
For this reason, we used the Roofline (Williams et al., 2008) performance model to analyze Winograd– and FFT– based convolutions.
Roofline Performance Model
The Roofline model is an ISA–oblivious, throughput–oriented performance model used to estimate the performance of an application running on a processing unit; it is often used to predict performance on CPUs, GPUs, TPUs (tensor processing units), etc.
It is suitable for analyzing particular methods on systems where the memory bandwidth often becomes the constraining resource. It estimates the performance bound of an algorithm as a function of its arithmetic intensity (AI), defined as the ratio of the total floating–point operations (FPO) to the total data movement (DM) in bytes (AI = FPO/DM) between different levels of the memory hierarchy. For each level of the memory hierarchy, the performance ceiling (attainable FLOPS) is determined by:
Attainable FLOPS = min( Peak FLOPS, AI × MB ),   (7)

where Peak FLOPS is defined as the maximal number of floating–point operations per second of a given system, and MB as the system's peak memory bandwidth. When plotted, the performance ceiling line resembles a roofline.
Here, we are interested in the DM between the highest level of on–chip, core–exclusive cache (typically L2 for CPUs) and off–chip, shared memory. (The term off–chip refers to the main memory of the system, typically large, but with much lower throughput and higher latency than the on–chip caches. The HBW MCDRAM of the Knights Landing processor would be considered off–chip.) In systems where the L2 is shared among a small number of cores (e.g. the Intel Xeon Phi series shares an L2 cache between two cores), the L2 is assumed to be equally divided for exclusive usage among the cores.
DM then accounts for all regular and streaming stores to main memory, regular loads from main memory, and prefetches from main memory into either L1 or L2.
Our analysis does not consider the presence of higher–level, shared caches, as on modern processors the sizes of such caches are very limited, and cannot fit the working set of typical convolutional layers.
Additional performance ceilings for lower levels of cache are not necessary, since in both the Winograd–based and FFT–based convolutional algorithms the computation is bounded either by transfers to and from main memory or by the processor's peak FLOPS, as shown in the following sections.
As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn. 7), its running time can be estimated by:

Time = FPO / Attainable FLOPS = max( FPO / Peak FLOPS, DM / MB ).   (8)

Here, CMR is the system's compute–to–memory ratio, defined as the ratio of its Peak FLOPS to MB, the memory bandwidth. FPO represents the total number of floating point operations required, and DM the total amount of data movement in bytes, as defined above. The running time is compute bound when AI ≥ CMR, in which case it equals FPO / Peak FLOPS; otherwise, the running time is memory bound and equals DM / MB.
Estimating running time
For the Winograd– and FFT–based convolutional layers, where the computation is composed of several sequential stages, each with a unique arithmetic intensity (AI_s), the running time is calculated by accumulating the running times of the stages s:

Time = Σ_s max( FPO_s / Peak FLOPS, DM_s / MB ).   (9)
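Eqns. 8 and 9 amount to a few lines of code; the sketch below uses hypothetical FPO/DM figures and is only meant to illustrate how per-stage estimates combine:

```python
def roofline_time(fpo, dm, peak_flops, mem_bw):
    """Eqn. 8: a stage is compute bound when AI = fpo/dm >= CMR =
    peak_flops/mem_bw; the estimate is the larger of the two times."""
    return max(fpo / peak_flops, dm / mem_bw)

def layer_time(stages, peak_flops, mem_bw):
    """Eqn. 9: total time is the sum of the per-stage estimates.
    `stages` is a list of (FPO, DM-in-bytes) pairs, one per stage."""
    return sum(roofline_time(fpo, dm, peak_flops, mem_bw)
               for fpo, dm in stages)
```

For a hypothetical 1 TFLOPS, 100 GB/s system (CMR = 10), a stage with AI = 10 sits exactly at the ridge point of the roofline, while a stage with AI = 1 is purely memory bound.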
5.1. Relative Performances
We are interested in the relative performance between the Winograd and FFT methods. We define the speedup of an algorithm A over an algorithm B as the ratio of their running times:

Speedup(A, B) = Time_B / Time_A.   (10)

A speedup greater than one indicates that algorithm A is faster, and smaller than one that algorithm B is faster.
Eqn. 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely achieved in practice. However, Eqn. 10 remains valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware's capabilities.
In Eqn. 10, the values of AI will differ between the Winograd– and FFT–based approaches, and will also depend on the available cache size (Jia et al., 2018). Therefore, when comparing the performance of two algorithms on a given system, the relative performance will depend on the CMR ratio and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.
A detailed analysis of how the values of AI, DM, and FPO are obtained is presented in Appendix A.
5.2. The Accuracy of the Theoretical Estimates
In Fig. 3, we plot the estimated theoretical speedup of the two FFT–based methods over the Winograd–based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system’s CMR. The color of the line represents the amount of available L2 cache. The lines are drawn in a semi–translucent fashion, as they overlap in many places.
The empirical results from the benchmarks described in Section 4 are overlaid; each cross–hair represents the measurement of the relative performance on a single system. The horizontal coordinate of each cross–hair represents the system's compute–to–memory (CMR) ratio (see Tbl. 1), and the color represents its L2 cache size.
Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular–FFT vs Winograd and 0.1 for Gauss–FFT vs Winograd.
We have also measured the system utilization of each stage of the three methods. While the utilization varied across the benchmarked layers and systems, on average, the compute–bound stages attained 75% of the theoretical peak FLOPS, and the memory–bound stages achieved slightly more than 85% of the theoretical memory bandwidth. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.
Optimality of Tile Transforms
The generated transform codelets use heuristics to minimize the number of operations for performing the transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages depends solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and of the Winograd transforms 2.38, much lower than the CMRs of modern CPUs: the Xeon Phi Knights Landing processor has a CMR of 11 (due to its on–chip fast MCDRAM), and the Xeon Server processor family has CMRs in the range of 20–40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were used.
This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. With such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), reducing both the number of required operations and the amount of required data movement.
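This effect can be sketched with a small helper (the function name is ours). For the 112–wide VGG2.x layers with 3–tap kernels along each dimension, the reported optimal tile size of 25 incurs less padding overhead than the power–of–two size 32:

```python
import math

def tiling_overhead(n, r, t):
    """OLA tiling of an n-element image dimension with an r-tap kernel
    and tiles of size t (each producing m = t - r + 1 outputs).
    Returns the number of tiles and the padded-input-to-input ratio."""
    m = t - r + 1                       # outputs produced per tile
    tiles = math.ceil((n - r + 1) / m)  # tiles needed to cover all outputs
    covered = tiles * m + r - 1         # input elements the tiles span
    return tiles, covered / n
```

For n = 112 and r = 3, a tile size of 25 needs five tiles covering 117 input elements (about 4.5% padding), while 32 needs four tiles covering 122 elements (about 8.9% padding).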
This paper presents experimental and analytical findings that FFT–based convolutions are, on average, faster than Winograd–based convolutions on modern CPUs.
In contrast to the popular belief that the Winograd–based method is more efficient, our experimental results of a highly optimized Winograd–based implementation and two similarly optimized FFT–based implementations on modern CPUs show that the FFT–based convolutions are, more often than not, faster than Winograd ones for the most commonly used ConvNet layers.
Our analysis using the Roofline model shows that whether the Winograd– or FFT–based approach is faster depends on both the layer and target hardware. However, with the tendency of increasing compute–to–memory ratio of systems with modern CPUs, the FFT–based methods tend to be faster than Winograd.
While we primarily focused on modern multi– and many–core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can also be applied to GPUs. Future work might include implementing and benchmarking efficient GPU–based implementations and validating the performance analysis based on the Roofline model.
- fal (2016) 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library/. (2016).
- mkl (2016) 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn. (2016).
- wCo (2018) 2018. N-Dimensional Winograd–based convolution framework. https://bitbucket.org/poozh/ond-winograd. (2018).
- ima (2018) Accessed: 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks. (Accessed: 08-15-2018).
- neo (2018) Accessed: 08-15-2018. Intel® Nervana reference deep learning framework. https://github.com/NervanaSystems/neon. (Accessed: 08-15-2018).
- lib (2018) Accessed: 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm. (Accessed: 08-15-2018).
- win (2018) Accessed: 08-15-2018. Wincnn. https://github.com/andravin/wincnn. (Accessed: 08-15-2018).
- del (2018) Accessed: 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master/. (Accessed: 08-15-2018).
- Appleyard (2018) Jeremy Appleyard. Accessed: 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/. (Accessed: 08-15-2018).
- Budden et al. (2016) David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
- Chetlur et al. (2014) Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
- Frigo and Johnson (1998) Matteo Frigo and Steven G Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, Vol. 3. IEEE, 1381–1384.
- Gannon et al. (1988) Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, UK, 229–254. http://dl.acm.org/citation.cfm?id=647970.761024
- Heinecke et al. (2016) Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 981–991.
- Heinecke et al. (2015) Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
- Jeffers et al. (2016) James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
- Jia et al. (2018) Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
- Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
- Lavin and Gray (2016) Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
- Li et al. (2015) Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
- MacLaren (1970) M Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.
- Mathieu et al. (2013) Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851 (2013).
- Montufar et al. (2014) Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in neural information processing systems. 2924–2932.
- Pan (2016) Victor Y Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.
- Rabiner and Gold (1975) Lawrence R Rabiner and Bernard Gold. 1975. Theory and Application of Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
- Vanhoucke et al. (2011) Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
- Vasilache et al. (2014) Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
- Vincent et al. (2017) Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
- Williams et al. (2008) Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.
- Winograd (1980) Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. Siam.
- Wulf and McKee (1995) Wm A Wulf and Sally A McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news 23, 1 (1995), 20–24.
- Zlateski et al. (2016a) Aleksandar Zlateski, Kisuk Lee, and H Sebastian Seung. 2016a. ZNN–A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801–811.
- Zlateski et al. (2016b) Aleksandar Zlateski, Kisuk Lee, and H Sebastian Seung. 2016b. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 854–865.
- Zlateski and Seung (2017) Aleksandar Zlateski and H Sebastian Seung. 2017. Compile-time optimized and statically scheduled ND convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.
Appendix A Detailed Analysis
For simplicity, we assume 2D and isotropic images and kernels, as well as computation performed with 32–bit floating point numbers (4 bytes per number). The extension to non–isotropic and N–dimensional images and kernels, as well as to lower–precision floats, is straightforward. The benchmarked implementations, described in Section 3, support an arbitrary number of dimensions, as well as non–isotropic kernels.
We follow the same notation as in Section 2, and for an arbitrary layer, use Winograd convolution Winograd(m, r), Regular–FFT convolution Regular–FFT(m, r) and Gauss–FFT convolution Gauss–FFT(m, r). Here, m can take an arbitrary value, whereas r has to equal the size of the kernels in the layer.
We proceed to present the details of all three methods and estimate the total number of operations and data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.
A.1. Input Transform Stage
In Winograd and both FFT approaches, each of the B·C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has size t × t with t = m + r − 1, and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions.
Each image of size n × n will thus be divided into ⌈n/m⌉² tiles. If necessary, the image will be implicitly zero–padded. This yields a total of B·C·⌈n/m⌉² image tiles.
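The tiling arithmetic above can be sketched as follows (`tile_counts` is a hypothetical helper, assuming the t = m + r − 1 and ⌈n/m⌉² relations stated in the text):

```python
import math

def tile_counts(n, m, r, batches, channels):
    """Tiling parameters for an n-by-n image, m-by-m output tiles,
    and r-by-r kernels: tile size t = m + r - 1, overlap r - 1."""
    t = m + r - 1                           # input tile size
    overlap = r - 1                         # pixels shared by adjacent tiles
    tiles_per_image = math.ceil(n / m) ** 2
    total_tiles = batches * channels * tiles_per_image
    return t, overlap, tiles_per_image, total_tiles

# Example: a 56x56 image with m = 4 and r = 3 (so t = 6), one batch, 64 channels
print(tile_counts(56, 4, 3, 1, 64))   # -> (6, 2, 196, 12544)
```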
Number of required operations
Each tile I_{b,c,i} is transformed via Ĩ_{b,c,i} = Bᵀ I_{b,c,i} B, which is a composition of two matrix–matrix multiplications. Here, I_{b,c,i} is the i–th tile of the c–th channel in batch b, and Ĩ_{b,c,i} is its transform. Both Bᵀ and I_{b,c,i} have size t × t, requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). Operations are on real numbers for the Winograd approach, and on complex numbers for the FFT approaches.
The transforms can, however, be performed with far fewer operations. In the Winograd method, the matrices are sparse, and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require much fewer operations (O(t²·log t) instead of O(t³)). Here, the constant hidden in the O–notation can vary significantly based on the size of the transform performed (Frigo and Johnson, 1998). It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.
Instead of using closed–form expressions that give bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd, we used the Winograd matrix generator (win, 2018), as proposed in (Jia et al., 2018; Budden et al., 2016), and further reduced the number of operations using the simple optimizer provided by (Jia et al., 2018). For the FFT methods, we used the FFT codelet generator “genfft” from the FFTW library (Frigo and Johnson, 1998). We denote the number of operations required for transforming an image tile as f_W(t) for the Winograd, f_F(t) for the Regular–FFT, and f_G(t) for the Gauss–FFT method. The pre–computed lookup tables are included in our open–sourced repository.
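The two-matrix-multiplication structure of the input transform can be illustrated with the standard F(2×2, 3×3) input-transform matrix Bᵀ from Lavin and Gray (2016); this is a sketch for one tile, not the optimized, sparsity-exploiting code described above:

```python
import numpy as np

# Input-transform matrix B^T for F(2x2, 3x3), where t = m + r - 1 = 4
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

tile = np.arange(16, dtype=np.float32).reshape(4, 4)  # one t-by-t input tile

# Composition of two t-by-t matrix-matrix multiplications
transformed = Bt @ tile @ Bt.T

t = 4
naive_flops = 4 * t**3   # 2t^3 per matrix-matrix multiplication
print(transformed.shape, naive_flops)   # (4, 4) 256
```

The sparsity of Bᵀ (many zeros and ±1 entries) is what lets the optimized generators replace the nominal 4t³ operations with a much smaller count.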
The total number of operations required for performing all the transforms for all three methods is given in Tbl. 2.
The tile sizes are relatively small, much smaller than available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.
The total data that has to be moved from main memory to cache is thus 4·B·C·n² bytes – reading each of the B·C images of size n × n only once. Each of the transformed tiles is stored back to main memory. The size of each transformed tile will depend on the method.
In Winograd convolution, transformed tiles are real tensors, and require a single 32–bit float per element, for a total of 4t² bytes per tile.
The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms will be complex and conjugate–symmetric (along one of the dimensions). Only t·⌈(t+1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8·t·⌈(t+1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12·t·⌈(t+1)/2⌉ bytes per tile.
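The stored-coefficient count t·⌈(t+1)/2⌉ matches the output shape of a real-to-complex 2D FFT, which can be checked with NumPy (an illustration only; the benchmarked implementations use FFTW codelets):

```python
import numpy as np

for t in (4, 6, 7, 8):
    tile = np.zeros((t, t), dtype=np.float32)
    spec = np.fft.rfft2(tile)               # real-to-complex 2D FFT
    stored = spec.size                      # complex numbers actually stored
    # t // 2 + 1 equals ceil((t + 1) / 2) for every integer t
    assert spec.shape == (t, t // 2 + 1)
    # bytes per tile: 8 per complex (Regular-FFT), 12 per complex (Gauss-FFT)
    print(t, stored, 8 * stored, 12 * stored)
```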
The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
A.2. Kernel Transform Stage
In all methods, each of the C·C′ kernels (with size r × r), where C′ is the number of the layer's output channels, is transformed via K̃_{c′,c} = G K_{c′,c} Gᵀ. The matrix G has size t × r.
Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: g_W(t) for the Winograd and g_F(t) for the Regular–FFT method. For the Gauss–FFT method, g_G(t) = g_F(t) + 2·t·⌈(t+1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.
All three methods need to fetch all kernel data from memory, reading a total of 4·C·C′·r² bytes, and store the transformed kernels back to main memory. The transformed kernels have size t × t, and are stored in the same fashion as the transformed input tiles, requiring a total of 4·C·C′·t², 8·C·C′·t·⌈(t+1)/2⌉, and 12·C·C′·t·⌈(t+1)/2⌉ bytes for the Winograd, Regular–FFT and Gauss–FFT methods, respectively.
The total number of operations, data movements, and AIs for transforming all kernels of a particular layer, using any of the three methods, are given in Tbl. 2.
A.3. Element–wise Products
Having all transformed input and kernel tiles, the pre–transformed output tiles are computed via

Õ_{b,c′,i} = Σ_{c=1}^{C} Ĩ_{b,c,i} ⊙ K̃_{c′,c}

Here, all of the pre–transformed output tiles (Õ_{b,c′,i}), transformed input tiles Ĩ_{b,c,i}, and transformed kernels K̃_{c′,c} have the same size t × t, and ⊙ denotes element–wise multiplication. Computing an element of the pre–transformed output tile at a location (x, y) depends only on elements of the transformed input tiles and transformed kernels at the same location, and is computed via:

Õ_{b,c′,i}(x, y) = Σ_{c=1}^{C} Ĩ_{b,c,i}(x, y) · K̃_{c′,c}(x, y)

Note that the equation above can be interpreted as multiplication of a matrix of size C′ × C with a matrix of size C × B·⌈n/m⌉², resulting in a matrix of size C′ × B·⌈n/m⌉². For layers of modern ConvNets, the values of B·⌈n/m⌉² are much larger than C and C′, which results in multiplication of tall and skinny matrices (Jia et al., 2018; Lavin and Gray, 2016). Such matrix multiplications have to be performed for each element location (x, y).
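The per-location reductions over channels can be phrased as one batched tall-and-skinny matrix multiplication; a NumPy sketch with made-up layer sizes (all shapes here are illustrative):

```python
import numpy as np

B, T, C, Cp, L = 2, 196, 32, 64, 40   # made-up sizes; L = element locations
I_t = np.random.rand(L, C, B * T).astype(np.float32)   # transformed inputs
K_t = np.random.rand(L, Cp, C).astype(np.float32)      # transformed kernels

# For each location x: a (C' x C) @ (C x B*T) tall-and-skinny product
O_t = np.einsum('xoc,xcn->xon', K_t, I_t)
assert O_t.shape == (L, Cp, B * T)

# Equivalent explicit loop over locations (what the einsum batches)
ref = np.stack([K_t[x] @ I_t[x] for x in range(L)])
assert np.allclose(O_t, ref, atol=1e-3)
```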
A.3.1. Operations Required
In the Winograd method, t² real matrix multiplications are performed, requiring a total of 2·t²·C·C′·B·T operations for the whole layer, where T = ⌈n/m⌉² is the number of tiles per image.
For the FFT based methods, complex matrices are multiplied. However, only t·⌈(t+1)/2⌉ locations of the pre–transformed output tiles need to be computed. As both the transformed input tiles and transformed kernels are conjugate–symmetric, the pre–transformed output tiles will also be conjugate–symmetric.
In the Regular–FFT method, four real multiply–adds are required for performing one complex multiply–add. This gives us a total of 8·t·⌈(t+1)/2⌉·C·C′·B·T FLOPs required for the whole layer. Gauss–FFT reduces each complex matrix multiplication to 3 real matrix multiplications of matrices of the same size, totaling 6·t·⌈(t+1)/2⌉·C·C′·B·T FLOPs required for the whole layer.
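The 3-multiplication count comes from Gauss's trick for multiplying two complex numbers, which trades one real multiplication for extra additions (a standalone sketch):

```python
def gauss_mult(ar, ai, br, bi):
    """Multiply (ar + i*ai) * (br + i*bi) using 3 real multiplications
    instead of the usual 4 (Gauss's trick)."""
    k1 = ar * (br + bi)
    k2 = bi * (ar + ai)
    k3 = br * (ai - ar)
    return k1 - k2, k1 + k3   # (real part, imaginary part)

# (1 + 2i) * (3 + 4i) = -5 + 10i
print(gauss_mult(1.0, 2.0, 3.0, 4.0))   # -> (-5.0, 10.0)
```

Applied element-wise, this is exactly why Gauss–FFT replaces each complex matrix multiplication with three real ones, at the cost of the extra pre-computed sums (the "three reals per complex number" storage noted earlier).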
A.3.2. Data Movement
In the Winograd method, t² real matrix multiplications are performed (for the rest of the section we will omit the location subscript (x, y), for clarity). In the Regular–FFT method, t·⌈(t+1)/2⌉ complex matrix multiplications are performed, with the same sizes. The Gauss–FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus, a total of 3·t·⌈(t+1)/2⌉ real matrix multiplications are performed.
For modern ConvNets, the matrices Ĩ, K̃ and Õ may not fit in cache, in which case standard cache–blocking approaches are employed (Gannon et al., 1988; Heinecke et al., 2015; Heinecke et al., 2016; Jia et al., 2018; Li et al., 2015), which might require some data to be moved to and/or from the main memory multiple times.
In the standard cache–blocking technique, Ĩ (of size C × W, where W = B·⌈n/m⌉²) is subdivided into equally sized sub–matrices of size c × w.
To minimize transfers to and from main memory, a c × w sub–matrix of Ĩ is kept in cache. A small sub–matrix of K̃, with size ρ × c, is fetched from memory, multiplied with the sub–matrix of Ĩ stored in–cache, and accumulated to the appropriate ρ × w sub–matrix of Õ. Here ρ is a small number, required for efficient in–cache computation (Gannon et al., 1988; Heinecke et al., 2015; Heinecke et al., 2016; Jia et al., 2018; Li et al., 2015).
This requires transferring a total of c·w + ρ·c numbers from main memory, and ρ·w numbers of Õ from and then back to the main memory – a total of c·w + ρ·c + 2·ρ·w numbers. In the special case when c = C, only c·w + ρ·c + ρ·w numbers need to be moved, as each sub–matrix multiplication produces the final result (no accumulation is necessary).
A total of (C/c)·(C′/ρ)·(W/w) such sub–matrix multiplications are performed, and require (C/c)·(C′/ρ)·(W/w)·(c·w + ρ·c + α·ρ·w) numbers to be moved, with α being 1 when c = C and 2 when c < C.
In the Winograd method, t² real matrix multiplications (4 bytes per number) are performed, thus requiring a total of 4·t²·(C/c)·(C′/ρ)·(W/w)·(c·w + ρ·c + α·ρ·w) bytes to be moved.
In the Regular–FFT method, t·⌈(t+1)/2⌉ complex matrix multiplications (8 bytes per number) are performed, requiring a total of 8·t·⌈(t+1)/2⌉·(C/c)·(C′/ρ)·(W/w)·(c·w + ρ·c + α·ρ·w) bytes to be moved.
The Gauss–FFT replaces each of the complex matrix multiplications with 3 real matrix multiplications; a total of 12·t·⌈(t+1)/2⌉·(C/c)·(C′/ρ)·(W/w)·(c·w + ρ·c + α·ρ·w) bytes need to be transferred.
The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and w, while allowing for in–cache computation.
As the values of C, C′, W and ρ are constant, the optimal values of c and w can be chosen by solving the following optimization problem:
minimize over c, w:  (C/c)·(C′/ρ)·(W/w)·(c·w + ρ·c + α·ρ·w)
subject to:  C mod c = 0  (C is divisible by c)
             W mod w = 0  (W is divisible by w)
             4·γ·c·w ≤ S/2  (a sub–matrix of Ĩ fits in half of a cache of size S bytes)
Here, α equals 1 when c = C and 2 when c < C; γ is 1 for real valued matrices, and 2 for complex valued ones. Half the cache is allowed for sub–matrices of Ĩ. This is typical practice, to ensure enough space for sub–matrices of K̃ and Õ.
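Since c and w range over the divisors of C and W, the problem can be solved by brute-force enumeration. A sketch under the data-movement model above (`best_blocking`, its parameters, and the cache-size-in-floats convention are illustrative; the constant C′/ρ factor is dropped from the objective, which does not change the minimizer):

```python
def best_blocking(C, W, rho, cache_floats, gamma):
    """Pick block sizes (c, w) minimizing the modeled data movement
    (C/c)*(W/w)*(c*w + rho*c + alpha*rho*w), subject to divisibility
    and the half-cache constraint gamma*c*w <= cache_floats/2.
    gamma is 1 for real matrices, 2 for complex ones."""
    divisors = lambda n: [d for d in range(1, n + 1) if n % d == 0]
    best = None
    for c in divisors(C):
        for w in divisors(W):
            if gamma * c * w > cache_floats // 2:
                continue                      # sub-matrix would not fit
            alpha = 1 if c == C else 2        # no accumulation when c == C
            moved = (C // c) * (W // w) * (c * w + rho * c + alpha * rho * w)
            if best is None or moved < best[0]:
                best = (moved, c, w)
    return best

# Toy example: C = 4, W = 8, rho = 2, a cache of 16 floats, real matrices
print(best_blocking(4, 8, 2, 16, 1))   # -> (80, 4, 2)
```

For realistic layer sizes the divisor lists are short, so this exhaustive search is cheap compared to the convolution itself.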
The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of c and w. This results in AIs equal to those of real matrix multiplications for the Winograd and Gauss–FFT methods, and of complex matrix multiplications for the Regular–FFT method (Gannon et al., 1988; Heinecke et al., 2015; Heinecke et al., 2016; Jia et al., 2018).
In Fig. 4, we show the arithmetic intensities for layers with different numbers of channels as a function of cache size. The AIs of both complex matrix multiplication (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than those of the Winograd or Gauss–FFT convolutions, when the cache size is a limiting resource.
A.4. Output Transform
In the output transform stage, each of the pre–transformed output tiles Õ_{b,c′,i} is transformed from the Winograd/FFT domain back to the spatial domain via O_{b,c′,i} = Aᵀ Õ_{b,c′,i} A. The matrix A has size t × m, resulting in O_{b,c′,i} having a size of m × m. Here, O_{b,c′,i} is the i–th (non–overlapping) tile of the c′–th image in batch b of the final result of the convolutional layer.
Performing the inverse transforms is very similar to performing the input transforms, except that: 1) only a subset of m × m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.
Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements, and AIs for transforming the output tiles of a particular layer, using any of the three methods, are given in Tbl. 2.
Appendix B Model Lookup Tables