neonCl-underconstruction
experimental port of nervana neon kernels in OpenCL
view repo
Deep convolutional neural networks take GPU days of compute time to train on large data sets. Pedestrian detection for self driving cars requires very low latency. Image recognition for mobile phones is constrained by limited processing resources. The success of convolutional neural networks in these situations is limited by how fast we can compute them. Conventional FFT based convolution is fast for large filters, but state of the art convolutional neural networks use small, 3x3 filters. We introduce a new class of fast algorithms for convolutional neural networks using Winograd's minimal filtering algorithms. The algorithms compute minimal complexity convolution over small tiles, which makes them fast with small filters and small batch sizes. We benchmark a GPU implementation of our algorithm with the VGG network and show state of the art throughput at batch sizes from 1 to 64.
READ FULL TEXT VIEW PDF
This paper considers a convolutional neural network transformation that
...
read it
The article describes a system for image recognition using deep convolut...
read it
Convolutional Neural Networks (CNNs) have gained widespread popularity i...
read it
We present flattened convolutional neural networks that are designed for...
read it
Though recent advanced convolutional neural networks (CNNs) have been
im...
read it
Photonic waveguides are prime candidates for integrated and parallel pho...
read it
When a Convolutional Neural Network is used for on-the-fly evaluation of...
read it
experimental port of nervana neon kernels in OpenCL
a winograd based kernel for convolutions in deep learning framework
Deep convolutional neural networks (convnets) achieve state of the art results on image recognition problems [11][7]. The networks take several days of GPU time to train and require significant compute resources during classification as well. Larger data sets and models lead to better accuracy but also increase computation time. Therefore progress in deep neural networks is limited by how fast the networks can be computed.
Likewise the application of convnets to low latency inference problems, such as pedestrian detection in self driving car video imagery, is limited by how fast a small set of images, possibly a single image, can be classified.
Distributed training of convnets can be achieved by partitioning each batch of examples across the nodes of a cluster and accumulating weight updates across the nodes. Large batch sizes adversely affect convergence of the network, so the minimum batch size that can be computed efficiently places an upper limit on cluster size [8, 6].
State of the art convnet architectures for image recognition use deep networks of convolutional layers, because they achieve better accuracy with fewer weights than shallow networks with larger filters [11, 7].
Therefore there is a strong need for fast convnet algorithms for small batch sizes and small filters. However conventional convnet libraries require large batch sizes and large filters for fast operation.
This paper introduces a new class of fast algorithms for convolutional neural networks based on the minimal filtering algorithms pioneered by Winograd [13]. The algorithms can reduce the arithmetic complexity of a convnet layer by up to a factor of 4 compared to direct convolution. Almost all of the arithmetic is performed by dense matrix multiplies of sufficient dimensions to be computed efficiently, even when the batch size is very small. The memory requirements are also light compared to the conventional FFT convolution algorithm. These factors make practical implementations possible. Our implementation for NVIDIA Maxwell GPUs achieves state of the art throughput for all batch sizes measured, from 1 to 64, while using at most 16MB of workspace memory.
The FFT and convolution theorem have been used to reduce the arithmetic complexity of convnet layers, first by Mathieu et al. [10], then refined by Visalache et al. [12], and then implemented in the NVIDIA cuDNN library [1].
The Strassen algorithm for fast matrix multiplication was used by Cong and Xiao [3] to reduce the number of convolutions in a convnet layer, thereby reducing its total arithmetic complexity. The authors also suggested that more techniques from arithmetic complexity theory might be applicable to convnets.
Various approaches have been attempted to reduce the complexity of convnets by quantizing or otherwise approximating the convolutional layer. We consider these approaches as orthogonal and complementary to those that exploit algebraic structure, and therefore declare them outside the scope of this paper.
A convnet layer correlates a bank of filters with channels and size against a minibatch of images with channels and size . We denote filter elements as and image elements as .
The computation of a single covnnet layer output is given by the formula:
(1) |
and we can write the output of an entire image/filter pair as
(2) |
where denotes 2D correlation.
It has been known since at least 1980 that the minimal filtering algorithm for computing outputs with an -tap FIR filter, which we call , requires
(3) |
multiplications [13, p. 39]. Also, we can nest minimal 1D algorithms and to form minimal 2D algorithms for computing outputs with an filter, which we call . These require
(4) |
multiplications [14]. We can continue to nest 1D algorithms to form algorithms for multi-dimensional FIR filters.
It is interesting to note that in 1D, 2D, and multi-dimensions, the minimal algorithm requires a number of multiplications equal to the number of inputs. In other words, to compute we must access an interval of data values, and to compute we must access a tile of data values. Therefore the minimal filtering algorithm requires one multiplication per input.
The standard algorithm for uses multiplications. Winograd [13, p. 43] documented the following minimal algorithm:
(5) |
where
This algorithm uses just 4 multiplications and is therefore minimal by the formula . It also uses 4 additions involving the data, 3 additions and 2 multiplications by a constant involving the filter (the sum can be computed just once), and 4 additions to reduce the products to the final result.
Fast filtering algorithms can be written in matrix form as:
(6) |
where indicates element-wise multiplication. For , the matrices are:
(7) |
A minimal 1D algorithm is nested with itself to obtain a minimal 2D algorithm, like so:
(8) |
where now is an filter and is an image tile. The nesting technique can be generalized for non-square filters and outputs, , by nesting an algorithm for with an algorithm for .
uses multiplications, whereas the standard algorithm uses . This is an arithmetic complexity reduction of . The data transform uses 32 additions, the filter transform uses 28 floating point instructions, and the inverse transform uses 24 additions.
Algorithms for can be used to compute convnet layers with kernels. Each image channel is divided into tiles of size , with elements of overlap between neighboring tiles, yielding tiles per channel, . is then computed for each tile and filter combination in each channel, and the results are summed over all channels.
Substituting and , we have:
(9) |
Labeling tile coordinates as , we rewrite the convnet layer formula (2) for a single image , filter , and tile coordinate as:
(10) |
Thus we can reduce over channels in transform space, and only then apply the inverse transform to the sum. This amortizes the cost of the inverse transform over the number of channels.
We examine the sum
(11) |
and simplify the notation by collapsing the image/tile coordinates down to a single dimension, . We also label each component of the element-wise multiplication separately, as , yielding:
(12) |
This equation is just a matrix multiplication, so we can write:
(13) |
Matrix multiply has efficient implementations on CPU, GPU, and FPGA platforms, owing to its high computational intensity. Thus we have arrived at the practical implementation for the fast algorithm listed in Algorithm 1.
Winograd documented a technique for generating the minimal filtering algorithm for any choice of and . The construction uses the Chinese remainder theorem to produce a minimal algorithm for linear convolution, which is equivalent to polynomial multiplication, then transposes the linear convolution algorithm to yield a minimal filtering algorithm. The reader is referred to Winograd’s seminal book [13], or Blahut’s book [2] for a modern treatment of the subject. We provide derivations of the specific algorithms used in this paper in the supplementary material.
Training a network using stochastic gradient descent requires computation of the gradients with respect to the inputs and weights. For a convnet layer, the gradient with respect to the inputs is a convolution of the next layer’s backpropagated error, of dimension
, with a flipped version of the layer’s filters. Therefore it can be computed using the same algorithm that is used for forward propagation.The gradient with respect to the weights is a convolution of the layer inputs with the backpropagated errors, producing outputs per filter and channel. Therefore we need to compute the convolution , which is impractical because is much too large for our fast algorithms. Instead we decompose this convolution into a direct sum of smaller convolutions, for example . Here the algorithm’s tiles are overlapped by 2 pixels in each dimension, and the outputs are summed over all tiles to form .
The transforms for are given by:
(14) |
With = 16 multiplies versus direct convolution’s multiplies, it achieves the same arithmetic complexity reduction as the corresponding forward propagation algorithm.
A minimal algorithm for has the form:
(15) |
The data transform uses 13 floating point instructions, the filter transform uses 8, and the inverse transform uses 10.
Applying the nesting formula yields a minimal algorithm for that uses multiplies, while the standard algorithm uses . This is an arithmetic complexity reduction of .
The 2D data transform uses floating point instructions, the filter transform uses , and the inverse transform uses .
The number of additions and constant multiplications required by the minimal Winograd transforms increases quadratically with the tile size [9, p. 211]. Thus for large tiles, the complexity of the transforms will overwhelm any savings in the number of multiplications.
The magnitude of the transform matrix elements also increases with increasing tile size. This effectively reduces the numeric accuracy of the computation, so that for large tiles, the transforms cannot be computed accurately [13, p. 28].
The Fast Fourier Transform (FFT) can be used to produce a tiled convolution algorithm that has the same form as Algorithm
1. The main difference is that the transform matrices are replaced with FFT and inverse FFT, and point-wise multiplication of complex FFT components yields cyclic convolution. Only components of the cyclic convolution are valid, the rest must be discarded, and the tiles must be overlapped by and in order to recompute the discarded outputs. This technique is referred to as overlap and save [2, p. 195].The similarity of overlap and save to our approach makes for an easy comparison. With FFT based convolution, the multiply stage still uses 1 multiply per input, but now the operands are complex numbers. Direct multiplication of complex numbers requires 4 real multiplications. Thankfully, a couple of tricks reduce the complexity further.
The Fourier transform of a real signal has Hermitian symmetry, which reduces the number of unique products in each by almost half. FFT based convnet implementations have exploited this property [10, 12]. Specifically, the discrete Fourier transform of a array of real values can be represented with an array of complex values. Furthermore , so the products of the missing values can be reconstructed simply by taking the complex conjugate of the computed values. Thus the multiply stage of the FFT convnet algorithm with tile size requires complex multiplications, or complex multiplies per input.
Using the standard algorithm for multiplying complex numbers, this equals real multiplies per input.
Another technique, which to our knowledge has not been used in convnets, is to use a fast algorithm to multiply complex numbers with 3 real multiplications [13]:
(16) |
where
(17) |
An FFT based convnet algorithm can incorporate this by modifying the FFT transforms of the filter and data to output the the real valued matrices and instead of the complex valued matrices and . This adds 2 floating point instructions per output to the filter transform, and 1 to the data transform. It also increases the memory footprint of each matrix by half.
Then we can calculate using 3 calls to a standard real matrix multiply function (e.g. SGEMM):
(18) |
The accumulation of temporary matrix is performed using regular SGEMM with and , at the cost of adding 2 floating point instructions per output. We can think of these instructions as adding to the inverse transform cost. The temporary matrix increases memory use by half, so that the total workspace size is approximately twice that of FFT based convolution with direct CGEMM.
Combining Hermitian symmetry with fast CGEMM gives us a multiplication stage with real multiplies per input. Recall that the multiply stage of the Winograd algorithms is always 1 real multiply per input. Thus even with fast CGEMM, FFT base convolution must use a significantly larger tile size in order to rival the arithmetic complexity of the Winograd algorithms.
For the FFT transform itself, we consider the split-radix FFT algorithm, which is the minimal practical FFT algorithm when is a power of 2 [9, p. 150]. We assume the 2D FFT transform is constructed using row-column composition, and borrow the complexity figures from the DSP Handbook [9, pp. 173,175] for Table 1.
Tile | Winograd | FFT | ||||
---|---|---|---|---|---|---|
3 | 9.00 | - | - | - | ||
4 | 4.00 | 2.00 | 1.75 | 1.50 | ||
5 | 2.78 | 3.60 | 2.24 | 2.24 | ||
6 | 2.25 | 4.33 | 2.00 | 2.78 | ||
8 | 1.78 | 6.50 | 2.23 | 4.38 | 4.44 | 2.42 |
16 | 2.94 | 4.23 | ||||
32 | 2.42 | 6.24 | ||||
64 | 2.20 | 8.30 | ||||
128 | 2.10 | 10.37 | ||||
256 | 2.05 | 12.42 |
Tile | FFT with Fast CGEMM | |||
---|---|---|---|---|
8 | 3.33 | 3.77 | 4.30 | 4.30 |
16 | 2.20 | 6.23 | 6.82 | 6.82 |
32 | 1.81 | 8.94 | 9.57 | 9.57 |
64 | 1.65 | 11.72 | 12.36 | 12.36 |
128 | 1.57 | 14.48 | 15.14 | 15.14 |
256 | 1.54 | 17.22 | 17.88 | 17.88 |
In our model of fast convnets, the arithmetic complexity of the multiplication stage is:
(19) |
When , the formula equals the arithmetic complexity of direct convolution. Therefore direct convolution is the minimal algorithm for
Although our analysis employs minimal convolutions, the convnet layer itself is still not minimal because it performs more convolutions than are strictly necessary. We could reduce the number of convolutions by employing Strassen recursions as in [3], but each recursion reduces all 3 dimensions of our matrices by half while providing only an reduction in arithmetic complexity. The matrix multiplications cannot be computed efficiently if or is too small. Fast convolution alone provides a or larger arithmetic complexity reduction while shrinking only the largest dimension of the matrix, . Still, for layers with large , , and , it may be worthwhile to perform Strassen recursions in addition to fast convolution. We leave this as an area for further research.
In order to simplify the equations, we will henceforth assume that and have no remainders. We also assume square filters and blocks, and .
The multiplication complexity can be rewritten as:
(20) |
where and
The total arithmetic complexities of the data, filter, and inverse transforms can be written as:
(21) |
where , , and are the number of floating point instructions used by the corresponding transforms for single tiles.
Dividing the complexity of each transform by yields its relative complexity:
(22) |
We call , , and the normalized arithmetic complexities of the data, filter, and inverse transforms, respectively. is the number of tiles per channel.
Adding the terms for each stage gives the total arithmetic complexity of the convnet layer:
(23) |
In order to achieve a large speedup, the multiplication complexity must be as small as possible, and the transform complexities , , and must each be small compared with , , and , respectively.
For direct convolution, and . Therefore the maximum speedup of a fast algorithm versus direct convolution is .
We list the normalized transform complexity for different tile sizes and algorithms in Tables 1 and 2. Due to its similarity to our approach, FFT based convolution complexity can also be measured with Equation 23.
FFT based convnet layers with direct CGEMM must use tile size at least to equal the multiplication stage complexity of Winograd and its tile, but then it incurs much greater transform overhead. Also a tile will waste computation on many unwanted pixels for images with sizes that are not close to a multiple of . Even for moderate size layers, a moderate to large minibatch must be used, or there will be too few tiles to compute the CGEMM efficiently. Finally, the memory used by a single transformed filter channel is units, which is a large expansion of the unit filter. The 6x6 tile of expands the same filter to units.
FFT based convnet layers with fast CGEMM can be much more competitive with Winograd algorithms. They have multiplication stage parity with tile size , and reasonable transform complexity. Also tile size generates a reasonably large number of tiles with large convnet layers or moderate batch size.
Even with fast CGEMM, the larger tile size compared to Winograd means FFT based convnet implementations must have a large memory workspace to hold transformed data. A decent amount of transformed data must be held in order to amortize transform cost and to generate matrices with large enough dimensions so that the multiply stage is efficient. This is problematic for current GPUs, which have a limited amount of on chip memory. CPUs have large caches and might therefore compute FFT based convolution more efficiently.
We implemented for NVIDIA Maxwell GPUs and tested on the NVIDIA Titan X model.
The small tile size and light weight transforms of make possible a fused implementation of the algorithm stages, where the the data and filter transform, 16 batched matrix multiplies (GEMMs), and inverse transform are all computed in the same block. Another resource limit is the instruction cache, which can only fit about 720 instructions. Our main loop is larger than this, but aligning the start of the loop with the 128 byte instruction cache-line boundary helps mitigate the cost of a cache miss.
The 16 batched GEMMs compute
outputs, which enables us to fit the workspace in the registers and shared memory of a single block and still have 2 active blocks per SM for latency hiding. Zero padding is implicit through use of predicates. If the predicate deselects a global image load, the zero value is loaded with a dual issued I2I instruction.
Image data is stored in CHWN order to facilitate contiguous and aligned memory loads, significantly reducing over-fetch. We employ a “super blocking” strategy to load 32 tiles of size from a configurable number of images, rows, and columns. For , we load tiles from 32 separate images. For , we load a super block of tiles per image. This strategy facilitates efficient loads with small batch sizes, as the dimensions of the input data are contiguous in memory. Furthermore, the 2 pixel overlap between adjacent tiles causes high L1 cache hit rates when using several tiles in a super block.
We also employ L2 cache blocking to increase the re-use of overlapping blocks. Since the number of image tiles is typically much larger than the number of filters, our block mapping iterates over a group of up to 128 filters in the inner loop, and then iterates over all image tiles in the second loop. All channels of the filter group fit in L2 cache, so each filter will only be loaded once from DDR memory, and each image tile will be loaded times as we iterate over the filter groups. This strategy reduces DDR memory bandwidth by almost half.
We implemented a version of our kernel that loads fp16 data, which decreases global memory bandwidth. We also implemented a variant that we call “FX” that runs a filter transform kernel first and stores the result in a workspace buffer. The convolution kernel loads transformed filter values from the workspace as needed. The size of the workspace is only units of memory, which equals just 16MB when and data is fp32.
We ran accuracy and speed experiments with VGG Network E [11]. This is a deep network that uses filters exclusively in the convolution layers, which are summarized in Table 3.
Layer | Depth | K | GFLOPs | |
---|---|---|---|---|
conv 1.1 | 1 | |||
conv 1.2 | 1 | |||
conv 2.1 | 1 | |||
conv 2.2 | 1 | |||
conv 3.1 | 1 | |||
conv 3.2 | 3 | |||
conv 4.1 | 1 | |||
conv 4.2 | 3 | |||
conv 5 | 4 | |||
Total |
We tested the accuracy of our fast algorithms with both single precision (fp32) and half precision (fp16) data and filters. In all tests we used fp32 arithmetic instructions. We used random data and filters from the uniform distribution
and measured absolute element error. Ground truth was computed by direct convolution using a double precision accumulator for reductions.We measured the speed of our GPU implementation of and compared with cuDNN v3 [1] on a superclocked NVIDIA Titan X GPU. We disabled boost clock and observed a maximum clock rate of 1126MHz. The GPU has 3072 cores, yielding a device peak throughput of TFLOPS.
Speed for a given layer was calculated by dividing the number of GFLOPs of computation required by direct convolution, as tabulated in 3, by the run time in milliseconds to yield Effective TFLOPS. The reduction of arithmetic complexity allows fast algorithms to have Effective TFLOPS that can exceed device peak throughput.
Total GFLOPs and run time were calculated by weighting the GFLOPs and run time for each layer by its depth, and total throughput was calculated as the ratio of the two.
Table 4 shows the numeric accuracy of the different convolution layer algorithms tested with single precision (fp32) and half precision (fp16) input data and filters.
is actually slightly more accurate than direct convolution. Its simple transforms do not lose much precision, and its multiplication stage performs a reduction over channels, rather than the filter elements reduced by direct convolution. has a larger error, but it is still more accurate than direct convolution with fp16 data.
All tested algorithms are equally accurate with fp16 data. Here accuracy is limited by the precision of the inputs. Because direct convolution is accurate enough for training and inference with low precision data [4, 5], we conclude that is too.
Table 5 and Table 6 show the total throughput for VGG Network E layers for cuDNN and our implementation for fp32 and fp16 data for different batch sizes.
For fp32 data, is X at and X as fast at . The throughput at is TFLOPS. For fp16 data, extends its lead over cuDNN, recording TFLOPS throughput for . performance is still very good at TFLOPS.
Figure 1 shows throughput by layer. Hatch marks indicate the layers where cuDNN used the FFT algorithm, otherwise direct convolution was used. For , hatch marks indicate that the external filter transform (FX) was used, otherwise the fused transform was faster.
cuDNN appears to erroneously select its FFT algorithm for intermediate values of despite the fact that it performs very poorly, under
TFLOPS. While this is probably just a bug, it is revealing. Low performance at moderate values of
suggests that the FFT convolution implementation either uses large tiles, or possibly just a single tile per image, as in [12], which leads to inefficient multiplication stages unless is large. At large , cuDNN FFT performs much better, but stays well under TFLOPS.performs better than cuDNN at every layer and batch size, except layer conv1.1, which contributes less than of the total network computation.
In general, we found that the FX variant of our implementation performed best unless the number of filters and channels was very large. Computing the filter transform is heavily memory bound, therefore transforming a larger filter bank decreases computational efficiency.
The worst performance occurs for the layers when . In this case the superblock runs over the image boundary and computes unwanted pixels. Throughput on this layer configuration is still over TFLOPS, where cuDNN performance is just TFLOPS.
cuDNN FFT uses a global memory workspace up to GB in our experiments. By contrast, our fused implementation does not use any global workspace, and the FX variant uses no more than MB.
performance shows new capabilities for high throughput and small batch size with state of the art convolutional neural networks. We expect performance to increase again when is implemented.
Layer | fp32 | fp16 | ||
---|---|---|---|---|
Direct | F(2x2,3x3) | F(4x4,3x3) | ||
1.2 | 4.01E-05 | 1.53E-05 | 2.84E-04 | 1.14E-02 |
2.2 | 8.01E-05 | 2.86E-05 | 5.41E-04 | 1.45E-02 |
3.2 | 1.53E-04 | 5.34E-05 | 9.06E-04 | 1.99E-02 |
4.2 | 3.20E-04 | 5.34E-05 | 1.04E-03 | 3.17E-02 |
5 | 3.43E-04 | 4.20E-05 | 1.08E-03 | 2.61E-02 |
N | cuDNN | F(2x2,3x3) | Speedup | ||
---|---|---|---|---|---|
msec | TFLOPS | msec | TFLOPS | ||
1 | 12.52 | 3.12 | 5.55 | 7.03 | 2.26X |
2 | 20.36 | 3.83 | 9.89 | 7.89 | 2.06X |
4 | 104.70 | 1.49 | 17.72 | 8.81 | 5.91X |
8 | 241.21 | 1.29 | 33.11 | 9.43 | 7.28X |
16 | 203.09 | 3.07 | 65.79 | 9.49 | 3.09X |
32 | 237.05 | 5.27 | 132.36 | 9.43 | 1.79X |
64 | 394.05 | 6.34 | 266.48 | 9.37 | 1.48X |
N | cuDNN | F(2x2,3x3) | Speedup | ||
---|---|---|---|---|---|
msec | TFLOPS | msec | TFLOPS | ||
1 | 14.58 | 2.68 | 5.53 | 7.06 | 2.64X |
2 | 20.94 | 3.73 | 9.83 | 7.94 | 2.13X |
4 | 104.19 | 1.50 | 17.50 | 8.92 | 5.95X |
8 | 241.87 | 1.29 | 32.61 | 9.57 | 7.42X |
16 | 204.01 | 3.06 | 62.93 | 9.92 | 3.24X |
32 | 236.13 | 5.29 | 123.12 | 10.14 | 1.92X |
64 | 395.93 | 6.31 | 242.98 | 10.28 | 1.63X |
Artificial Neural Networks and Machine Learning–ICANN 2014
, pages 281–290. Springer, 2014.
Comments
There are no comments yet.