Caffe for Sparse Convolutional Neural Network
Sparse methods and the use of Winograd convolutions are two orthogonal approaches, each of which significantly accelerates convolution computations in modern CNNs. Sparse Winograd merges these two and thus has the potential to offer a combined performance benefit. Nevertheless, training convolution layers so that the resulting Winograd kernels are sparse has not hitherto been very successful. By introducing a Winograd layer in place of a standard convolution layer, we can learn and prune Winograd coefficients "natively" and obtain sparsity levels beyond 90% with only 0.1% accuracy loss on AlexNet with the ImageNet dataset. Furthermore, we present a sparse Winograd convolution algorithm and implementation that exploits the sparsity, achieving up to 31.7 effective TFLOP/s in 32-bit precision on a latest Intel Xeon CPU, which corresponds to a 5.4× speedup over a state-of-the-art dense convolution implementation.
Convolutional neural networks (CNNs) have achieved undisputed success in many practical applications. These deep neural networks typically contain multiple layers, many (though not all) of which perform the namesake computation of convolution. A convolution layer is an architecture whose connection between an input and output tensor is via a number of convolution kernels, and the basic arithmetic operations are those of multiply-accumulate. Because over 90% of the computation during inference and training of recent CNN designs is in convolutions (Krizhevsky et al., 2012; Szegedy et al., 2015), different strategies have been devised to speed up this core operation. Sparse methods are one such strategy. Here, many of the convolution kernel coefficients are made zero by some pruning or compression technique (LeCun et al., 1989; Lebedev & Lempitsky, 2016; Liu et al., 2015). Sparsity-exploiting convolution implementations are then employed in the actual inference process. (Park et al., 2017) reported a threefold speedup on the convolution layers when an optimized sparse convolution implementation is applied to well-pruned kernels.
Orthogonal to sparse methods, transform methods such as FFT (Mathieu et al., 2013; Vasilache et al., 2015) or Winograd transformation (Winograd, 1980; Lavin & Gray, 2016) have proved to be successful as well. For the typical small convolution sizes (e.g., 3×3) that arise in CNNs, the Winograd-kernel approach is more effective than FFT and has demonstrated more than twofold speedup over well-implemented spatial convolution approaches.
These recent advances raise the question of why not apply sparse methods to the Winograd transform. The potential gain of this combination is obvious. To realize this potential, however, one must (1) be able to significantly prune away Winograd coefficients of a CNN with minimal impact on the CNN’s accuracy, and (2) develop implementations that exploit the pruned Winograd parameters well enough to yield meaningful inference speedups. It turns out that pruning Winograd parameters is challenging for reasons we will explain shortly. The Winograd sparsity reported so far is only moderate, which may explain why there is no published effort on optimized sparse Winograd convolution implementations to boost inference speed. This paper reports advances on both fronts, illustrated by pruning Winograd kernels to more than 90% sparsity while containing accuracy loss to within 0.1%, as well as actual speedups of up to 5.4× and 2.1× for our sparse Winograd convolution compared to dense direct and dense Winograd convolution, respectively.
Pruning the original convolution coefficients (the “spatial” domain) does not in general result in sparse Winograd kernels. More fundamentally, the linear transform that maps spatial to Winograd parameters is non-invertible as there are more Winograd than spatial coefficients. Thus any training and pruning method that needs to make use of both sets of parameters in conjunction will cause inconsistency, which in turn leads to major accuracy loss when achieving acceptable sparsity.
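Both claims can be checked numerically on a small case. The sketch below is only an illustration, not the authors' code: it assumes the F(2×2, 3×3) kernel transform matrix from (Lavin & Gray, 2016), a choice the text does not fix at this point, and the helper name `to_winograd` is hypothetical.

```python
import numpy as np

# Kernel transform matrix G for F(2x2, 3x3) from Lavin & Gray (2016).
# Assumed here for illustration; the text does not fix a tile size yet.
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])

def to_winograd(w):
    """Map a 3x3 spatial kernel to its 4x4 Winograd kernel, G w G^T."""
    return G @ w @ G.T

# A heavily pruned spatial kernel: one non-zero out of 9 coefficients.
w = np.zeros((3, 3))
w[0, 0] = 1.0
u = to_winograd(w)

spatial_nonzeros = np.count_nonzero(w)
winograd_nonzeros = np.count_nonzero(u)

# The same map as a 16x9 matrix acting on the row-major flattened kernel.
M = np.kron(G, G)
rank = np.linalg.matrix_rank(M)

print(f"spatial: {spatial_nonzeros}/9 non-zero, Winograd: {winograd_nonzeros}/16 non-zero")
print(f"map shape {M.shape}, rank {rank}")
```

In this example one non-zero spatial coefficient spreads to nine non-zero Winograd coefficients, so 8 zeros out of 9 become only 7 zeros out of 16. The map has 16 rows but rank 9, so most 4×4 Winograd kernels have no spatial preimage.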
We tackle this challenge and show that pruning Winograd parameters becomes successful when we replace the convolution layer in question by a Winograd layer, eliminating the need to use both the spatial and Winograd parameters in conjunction. Our effective pruning method then paves the road to our highly optimized sparse Winograd convolution to materialize the sparsity for high speed inference. Figure 1 illustrates the architecture of a Winograd layer using a simple 1D convolution of a 2-length spatial kernel with 4-length input. The corresponding Winograd kernel is 3-length. Replacing a convolution layer by a Winograd layer has several advantages. First, from a conceptual point of view, this new layer architecture reflects directly the relationship between the key parameters of the computation (which are the Winograd parameters) and the overall neural network. Second, as alluded to earlier, we have more Winograd than spatial parameters. Thus a Winograd layer in fact has a larger capacity as we are no longer restricted to use only those Winograd parameters that actually correspond to a set of spatial convolution parameters. Last but most important, steering of the training process such as pruning becomes possible and straightforward. We emphasize that this is not the case if one employs the spatial convolution layer: the non-invertible mapping between convolution and Winograd parameters is a major obstacle. While this obstacle may be overcome via approximation for small networks (LeNet) and datasets (MNIST) as reported in (Liu & Turakhia, 2016), our experiments (Section 6.1) show that approximation does not work for larger networks (AlexNet) and datasets (ImageNet).
The main contributions of this paper are as follows:
We design and implement a highly optimized sparse Winograd convolution computation for fast inference by exploiting high sparsity obtained by our pruning methods (Section 5).
We demonstrate the effectiveness of training and pruning with the Winograd-layer architecture (Section 6) and fast inference with our sparse Winograd convolution. In particular, we prune more than 90% of the Winograd parameters of AlexNet (Krizhevsky et al., 2012) while maintaining its original accuracy. This leads to up to 2.1× speedup over an ideal dense Winograd convolution, one of the fastest convolution algorithms to date (Lavin & Gray, 2016).
(Mathieu et al., 2013) appear to be the first to detail the use of FFT as a compute kernel in Deep Learning. This kernel executes the “spatial-domain” convolution by element-wise multiplication in the “frequency-domain”. More recently, (Lavin & Gray, 2016) shows convincingly that Winograd transforms outperform FFTs in the common convolution use cases in Deep Learning. These works illustrate the use of transform methods (FFT or Winograd) as computation kernels to speed up a normal inference or training process.
Recently, there have been a few research attempts to further reduce the compute requirements and memory footprint of Winograd convolution by pruning. However, they have had limited success in pruning and have not shown actual speedups in inference. (Liu & Turakhia, 2016) attempts to prune parameters in the transformed domain. For Winograd parameters, the overall network’s architecture is that of the original CNN, but the forward pass is performed using the Winograd kernel with some of the parameters set to zero. A backward pass is performed that updates the original “spatial-domain” parameters. We believe there is an inconsistency in this model caused by the fact that the mapping between spatial-domain and Winograd convolution parameters is non-invertible. In general, there are no spatial-domain convolution parameters that correspond to the modified (masked off) Winograd parameters that are used to compute all the forward-pass intermediate quantities, while the backward pass computes the gradient using these intermediate quantities in conjunction with the spatial-convolution-kernel parameters. While (Liu & Turakhia, 2016) achieves reasonable results with LeNet (LeCun et al., 1998a) on the MNIST dataset (LeCun et al., 1998b), our results on larger networks and datasets such as AlexNet (Krizhevsky et al., 2012) on ImageNet (Deng et al., 2009), shown in Section 6.1, demonstrate that significant accuracy loss and/or low sparsity are inevitable for such approaches. Indeed, direct pruning of the parameters in our proposed Winograd layers is how we finally overcome this accuracy problem.
(Liu et al., 2017), a concurrent work to ours, similarly addresses the non-invertibility issue by directly pruning in the Winograd domain. Interestingly, they also move the ReLU operation into the Winograd domain to obtain sparsity in activations as well. However, their sparsity is lower than ours (75% vs. 90+%) and is evaluated with a smaller dataset (CIFAR-10 vs. ImageNet). Moreover, their work does not show actual inference speedups from sparsity in the Winograd domain. Our work provides a highly optimized sparse Winograd convolution design and implementation for fast inference.
In a conceptual sense, (Rippel et al., 2015) relates closely to our current work. The authors advocate representing the convolution kernels in the frequency domain and detail the gradient calculation with respect to the frequency parameters. The network architecture, however, remains in the original form as outlined in Section 4 of that paper: when convolution is to be computed, the frequency parameters must first be transformed back to the spatial domain, in which a regular convolution is performed.
Since pruning in the spatial domain does not provide a high enough sparsity in Winograd parameters to benefit from general sparse representations such as compressed sparse row (CSR), a hardware feature for zero-skipping has been proposed to take advantage of the low sparsity in Winograd parameters (Park et al., 2016). Our paper shows that, by directly pruning in the Winograd domain, we can obtain high enough sparsity to speed up inference without such specialized hardware features.
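Consider a typical convolution layer where the input tensors with $C$ channels of feature maps, each of dimension $H_i \times W_i$, are transformed into $K$ output channels via a simple unit-stride, unit-dilation linear convolution with kernels of size $r \times s$:
$$I \in \mathbb{R}^{C \times H_i \times W_i} \rightarrow O \in \mathbb{R}^{K \times H_o \times W_o} \quad \text{via} \quad O(k,:,:) = \sum_{c=1}^{C} W(k,c,:,:) \star I(c,:,:), \tag{1}$$
where $H_o = H_i - r + 1$ and $W_o = W_i - s + 1$ and $\star$ stands for 2D linear convolution.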
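The computation of Equation 1 can be performed using the Winograd transform, which has a lower arithmetic complexity than Equation 1 suggests. The details of this computation we present now are crucial to our definition of Winograd layers and their training. A convolution $W(k,c,:,:) \star I(c,:,:)$ with $I(c,:,:)$ of size $H_i \times W_i$ can be broken down into many convolutions each involving smaller tiles of $I$. We illustrate this “overlapping” method by the following one-dimensional example, using self-explanatory Matlab-like array index notation.
$$W(0{:}2) \star I(0{:}5) \rightarrow O(0{:}3) \quad \Longrightarrow \quad \begin{bmatrix} W(0{:}2) \star I(0{:}3) \\ W(0{:}2) \star I(2{:}5) \end{bmatrix} \rightarrow \begin{bmatrix} O(0{:}1) \\ O(2{:}3) \end{bmatrix}$$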
Note that $I$ is broken up into two tiles with some duplicating elements while $O$ is partitioned (without duplication) into two tiles. More generally, given convolution kernels of size $r \times s$ and (small) sizes $m$, $n$ that divide $H_o$ and $W_o$, respectively (this assumption simplifies the presentation and can be easily eliminated by, for example, zero padding), we reshape the input and output tensors $I$, $O$ into $\tilde{I}$, $\tilde{O}$:
$$I \in \mathbb{R}^{C \times H_i \times W_i} \rightarrow \tilde{I} \in \mathbb{R}^{C \times T \times p \times q}, \qquad O \in \mathbb{R}^{K \times H_o \times W_o} \rightarrow \tilde{O} \in \mathbb{R}^{K \times T \times m \times n}.$$
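The value $T$ is the number of resulting tiles of the reshaping, $T = H_o W_o / (mn)$. The input tile size is $p \times q$, with $p = m + r - 1$ and $q = n + s - 1$, while the output tile size is $m \times n$. We express the reshaping by two index mapping functions $\phi$ and $\psi$:
$$\tilde{I}(c,t,i,j) = I(c, \phi(t,i,j)), \qquad O(k,i,j) = \tilde{O}(k, \psi(i,j)),$$
where $\phi$ is many-to-one and maps a 3-tuple to a 2-tuple while $\psi$ is invertible and maps a 2-tuple to a 3-tuple. Using the overlapped form, we have
$$\tilde{O}(k,t,:,:) = \sum_{c=1}^{C} W(k,c,:,:) \star \tilde{I}(c,t,:,:). \tag{2}$$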
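The overlapped tiling can be illustrated in one dimension. The following is a minimal pure-Python sketch rather than the paper's implementation; the function names `conv1d_valid` and `conv1d_tiled` and the sample values are hypothetical. It splits the input into overlapping tiles of length m + r - 1, convolves each tile separately, and concatenates the non-overlapping output tiles.

```python
def conv1d_valid(w, x):
    """Unit-stride sliding dot product without padding (CNN-style convolution)."""
    r = len(w)
    return [sum(w[j] * x[i + j] for j in range(r)) for i in range(len(x) - r + 1)]

def conv1d_tiled(w, x, m):
    """Same result, computed on overlapping input tiles that each yield m outputs."""
    r = len(w)
    p = m + r - 1                      # input tile length
    n_out = len(x) - r + 1
    assert n_out % m == 0, "m must divide the output length (or pad the input)"
    out = []
    for t in range(n_out // m):
        tile = x[t * m : t * m + p]    # neighbouring tiles share r - 1 elements
        out.extend(conv1d_valid(w, tile))
    return out

w = [1, 2, 3]               # kernel of length r = 3
x = [1, 0, -1, 2, 4, 3]     # input of length 6, so 4 outputs
print(conv1d_valid(w, x))
print(conv1d_tiled(w, x, m=2))
```

With m = 2 the two input tiles are `x[0:4]` and `x[2:6]`; they share two elements, while the two output tiles do not overlap.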
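Computing Equation 2 with straightforward convolution takes $mnrs$ multiplications in each of the summands. In contrast, Winograd’s method can possibly need only as few as $pq$ multiplications via:
$$\tilde{O}(k,t,:,:) = A_1^T \left[ \sum_{c=1}^{C} \left( G_1 W(k,c,:,:) G_2^T \right) \odot \left( B_1^T \tilde{I}(c,t,:,:) B_2 \right) \right] A_2. \tag{3}$$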
The six matrices $A_j$, $G_j$, $B_j$, $j = 1, 2$ (of consistent dimensions) are independent of $W$ and $\tilde{I}$, and $\odot$ is element-wise multiplication. In many instances the matrices are so simple that applying them requires no multiplications. For example, when $r = s = 3$ and $m = n = 2$, $A_j$ and $B_j$ are simple matrices whose entries are $0$ and $\pm 1$, and the entries of $G_j$ are $0$, $\pm 1$, and $\pm 1/2$ (see (Lavin & Gray, 2016) for example).
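As a sanity check of the Winograd formula on a single tile and a single channel, the sketch below uses the F(2×2, 3×3) matrices published in (Lavin & Gray, 2016), taking the two matrices of each pair to be equal. The function names `direct_tile` and `winograd_tile` are hypothetical, and the code is an illustration rather than an optimized implementation.

```python
import numpy as np

# F(2x2, 3x3) transform matrices from Lavin & Gray (2016).
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)   # input transform, B^T
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)        # kernel transform
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)  # output transform, A^T

def direct_tile(w, d):
    """Sliding-window convolution of one 4x4 input tile with one 3x3 kernel."""
    r, s = w.shape
    m, n = d.shape[0] - r + 1, d.shape[1] - s + 1
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            out[i, j] = np.sum(w * d[i:i + r, j:j + s])
    return out

def winograd_tile(w, d):
    """A^T [(G w G^T) * (B^T d B)] A, with * the element-wise product."""
    return AT @ ((G @ w @ G.T) * (BT @ d @ BT.T)) @ AT.T

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
d = rng.standard_normal((4, 4))
y_direct = direct_tile(w, d)
y_winograd = winograd_tile(w, d)
print(np.allclose(y_direct, y_winograd))
```

The element-wise product here uses 16 multiplications for a 2×2 output tile, compared with 2·2·3·3 = 36 for the sliding-window version.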
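Motivated by this, we define a Winograd layer to be a topology specified by a tensor $W_F \in \mathbb{R}^{K \times C \times p \times q}$ that computes $O$ from $I$ via
$$\tilde{O}(k,t,:,:) = A_1^T \left[ \sum_{c=1}^{C} W_F(k,c,:,:) \odot \left( B_1^T \tilde{I}(c,t,:,:) B_2 \right) \right] A_2. \tag{4}$$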
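A forward pass of such a layer can be sketched as follows. This is a minimal, unoptimized NumPy illustration under stated assumptions: 3×3 kernels, 2×2 output tiles, and the F(2×2, 3×3) matrices from (Lavin & Gray, 2016). The names `winograd_layer_forward` and `direct_conv` are hypothetical and do not come from the authors' Caffe branch.

```python
import numpy as np

# F(2x2, 3x3) transform matrices from Lavin & Gray (2016), assumed for this sketch.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)
M_TILE, P_TILE, R = 2, 4, 3   # output tile size, input tile size, kernel size

def winograd_layer_forward(x, w_f):
    """x: (C, H, W) input; w_f: (K, C, 4, 4) Winograd-domain parameters."""
    C, H, W = x.shape
    K = w_f.shape[0]
    Ho, Wo = H - R + 1, W - R + 1
    assert Ho % M_TILE == 0 and Wo % M_TILE == 0, "pad the input otherwise"
    out = np.zeros((K, Ho, Wo))
    for ti in range(0, Ho, M_TILE):
        for tj in range(0, Wo, M_TILE):
            tile = x[:, ti:ti + P_TILE, tj:tj + P_TILE]    # overlapping input tile
            tile_f = BT @ tile @ BT.T                      # input transform per channel
            z = np.einsum("kcij,cij->kij", w_f, tile_f)    # element-wise product, summed over c
            out[:, ti:ti + M_TILE, tj:tj + M_TILE] = AT @ z @ AT.T
    return out

def direct_conv(x, w):
    """Reference sliding-window convolution with spatial kernels w: (K, C, 3, 3)."""
    K, C, r, s = w.shape
    Ho, Wo = x.shape[1] - r + 1, x.shape[2] - s + 1
    out = np.zeros((K, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[:, i, j] = np.einsum("kcab,cab->k", w, x[:, i:i + r, j:j + s])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 8))
w = rng.standard_normal((3, 2, 3, 3))
w_f = G @ w @ G.T                     # Winograd parameters derived from spatial ones
matches_direct = np.allclose(winograd_layer_forward(x, w_f), direct_conv(x, w))

w_f_free = rng.standard_normal(w_f.shape)   # arbitrary Winograd parameters are also valid
out_free = winograd_layer_forward(x, w_f_free)
print(matches_direct, out_free.shape)
```

When `w_f` is derived from spatial kernels the layer reproduces an ordinary convolution, but the layer accepts any 4×4 parameter block per channel pair, which is the larger capacity referred to in the introduction.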
To incorporate a Winograd layer within a standard CNN framework (e.g., Caffe) so as to allow training and inference, it suffices to be able to compute the forward and backward passes. The forward pass is straightforward as it simply follows Equation 4, for which we note that (Lavin & Gray, 2016) details an optimized implementation. For the backward pass, we need to compute the partial derivatives of the scalar loss function $\mathcal{L}$ w.r.t. each of the variables $I(c,i,j)$ and $W_F(k,c,i,j)$ in terms of the known partial derivatives of $\mathcal{L}$ w.r.t. $O(k,i,j)$, or $\partial \mathcal{L} / \partial O$ in short. We present the derivations here as they pertain to the Winograd layers and are thus unavailable elsewhere. (Note that the Winograd-kernel approach (Lavin & Gray, 2016) does not consider Winograd as a layer architecture with full-fledged backward propagation. They use Winograd solely to accelerate the convolution operations in both forward and backward propagation while all parameters reside in the spatial domain.)
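First, we use this key form of chain rule: Suppose the partial derivatives of a scalar function $\mathcal{L}$ w.r.t. an array of variables $y_{ij}$, $Y \in \mathbb{R}^{\mu \times \nu}$, are known. Moreover, the variables $Y$ are in fact dependent variables of an array of variables $x_{ij}$, $X \in \mathbb{R}^{\mu' \times \nu'}$, via $Y = U^T X V$ where $U$, $V$ are constant matrices of commensurate dimensions. The partial derivatives of $\mathcal{L}$ w.r.t. $X$ are then given by
$$\frac{\partial \mathcal{L}}{\partial X} = U \, \frac{\partial \mathcal{L}}{\partial Y} \, V^T.$$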
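This chain-rule form can be verified numerically. The sketch below assumes the relation Y = UᵀXV with constant U and V, for which the gradient with respect to X is U (∂L/∂Y) Vᵀ, and compares that expression against central finite differences for a toy quadratic loss. The loss and all variable names are hypothetical choices for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))   # constant matrices of commensurate dimensions
V = rng.standard_normal((5, 2))
X = rng.standard_normal((4, 5))

def loss(x):
    """Toy scalar loss L = 0.5 * sum(Y^2), with Y = U^T x V."""
    y = U.T @ x @ V
    return 0.5 * float(np.sum(y ** 2))

dL_dY = U.T @ X @ V               # for this loss, dL/dY equals Y itself
dL_dX = U @ dL_dY @ V.T           # chain rule: dL/dX = U (dL/dY) V^T

# Central finite differences, one entry of X at a time.
h = 1e-5
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        step = np.zeros_like(X)
        step[i, j] = h
        numeric[i, j] = (loss(X + step) - loss(X - step)) / (2 * h)

max_err = float(np.max(np.abs(numeric - dL_dX)))
print(max_err)
```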
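Denote the intermediate variables in Equation 4 by $\tilde{I}_F$ and $Z$: $\tilde{I}_F(c,t,:,:) = B_1^T \tilde{I}(c,t,:,:) B_2$ and $Z(k,t,:,:) = \sum_{c=1}^{C} W_F(k,c,:,:) \odot \tilde{I}_F(c,t,:,:)$ for all applicable indices $c, k, t$. Note in particular that $Z(:,:,i,j) = W_F(:,:,i,j) \, \tilde{I}_F(:,:,i,j)$, that is, the 2-dimensional slice of $Z$ with any fixed $i, j$ is a matrix product. We can then compute $\partial \mathcal{L} / \partial W_F$ using the Chain Rule via
$$\frac{\partial \mathcal{L}}{\partial Z(k,t,:,:)} = A_1 \, \frac{\partial \mathcal{L}}{\partial \tilde{O}(k,t,:,:)} \, A_2^T, \qquad \frac{\partial \mathcal{L}}{\partial W_F(:,:,i,j)} = \frac{\partial \mathcal{L}}{\partial Z(:,:,i,j)} \left( \tilde{I}_F(:,:,i,j) \right)^T, \tag{5}$$
noting that $\partial \mathcal{L} / \partial \tilde{O}$ is $\partial \mathcal{L} / \partial O$ with a simple index mapping $\psi^{-1}$.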
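Similarly, we use the Chain Rule to obtain $\partial \mathcal{L} / \partial I$, using $\partial \mathcal{L} / \partial Z$ computed above:
$$\frac{\partial \mathcal{L}}{\partial \tilde{I}_F(:,:,i,j)} = \left( W_F(:,:,i,j) \right)^T \frac{\partial \mathcal{L}}{\partial Z(:,:,i,j)}, \qquad \frac{\partial \mathcal{L}}{\partial \tilde{I}(c,t,:,:)} = B_1 \, \frac{\partial \mathcal{L}}{\partial \tilde{I}_F(c,t,:,:)} \, B_2^T,$$
$$\frac{\partial \mathcal{L}}{\partial I(c,i,j)} = \sum_{(t,i',j') \,:\, (i,j) = \phi(t,i',j')} \frac{\partial \mathcal{L}}{\partial \tilde{I}(c,t,i',j')}. \tag{6}$$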
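The backward pass can be sketched and checked against finite differences. The code below is an illustration under several assumptions: it uses the F(2×2, 3×3) matrices from (Lavin & Gray, 2016), it operates on already-tiled inputs of shape (C, T, 4, 4), and it therefore omits the final step that sums gradients of overlapping tile elements back into the untiled input. The function names and the toy linear loss are hypothetical.

```python
import numpy as np

# F(2x2, 3x3) matrices from Lavin & Gray (2016); both matrices of each pair are equal.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def forward(w_f, x_tiles):
    """w_f: (K, C, 4, 4); x_tiles: (C, T, 4, 4). Returns (K, T, 2, 2) outputs and x_f."""
    x_f = BT @ x_tiles @ BT.T                          # B^T I B
    z = np.einsum("kcij,ctij->ktij", w_f, x_f)         # per-(i, j) matrix product over c
    return AT @ z @ AT.T, x_f                          # A^T Z A

def backward(w_f, x_f, d_out):
    """Gradients w.r.t. w_f and the tiled input, given d_out = dL/d(output tiles)."""
    d_z = AT.T @ d_out @ AT                            # A (dL/dO) A^T
    d_w_f = np.einsum("ktij,ctij->kcij", d_z, x_f)     # (dL/dZ) (I_F)^T per (i, j)
    d_x_f = np.einsum("kcij,ktij->ctij", w_f, d_z)     # (W_F)^T (dL/dZ) per (i, j)
    d_x_tiles = BT.T @ d_x_f @ BT                      # B (dL/dI_F) B^T
    return d_w_f, d_x_tiles

rng = np.random.default_rng(0)
K, C, T = 3, 2, 5
w_f = rng.standard_normal((K, C, 4, 4))
x_tiles = rng.standard_normal((C, T, 4, 4))
r_out = rng.standard_normal((K, T, 2, 2))   # coefficients of a toy linear loss

def loss(w, x):
    out, _ = forward(w, x)
    return float(np.sum(r_out * out))       # so dL/d(out) is exactly r_out

_, x_f = forward(w_f, x_tiles)
d_w_f, d_x_tiles = backward(w_f, x_f, r_out)

def numeric_partial(f, a, idx, h=1e-5):
    """Central finite difference of f with respect to the single entry a[idx]."""
    hi, lo = a.copy(), a.copy()
    hi[idx] += h
    lo[idx] -= h
    return (f(hi) - f(lo)) / (2 * h)

num_w = numeric_partial(lambda w: loss(w, x_tiles), w_f, (1, 0, 2, 3))
num_x = numeric_partial(lambda x: loss(w_f, x), x_tiles, (1, 4, 0, 2))
print(abs(num_w - d_w_f[1, 0, 2, 3]), abs(num_x - d_x_tiles[1, 4, 0, 2]))
```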
Consider an $L$-layer network with a mixture of layers such as convolution, fully connected, and pooling. Let $\theta$ be the parameters in the usual spatial domain, and $\theta_F$ be the parameters when the convolution layers are replaced by Winograd layers. Let $\mathcal{L}(\theta)$ and $\mathcal{L}_F(\theta_F)$ be the respective loss functions. Training is typically done by minimizing an energy function that is the sum of the loss and a regularization penalty. We aim to arrive at a $\theta_F$ such that many of its elements, including those within the Winograd layers, are zero. Pre-training, pruning, and fine-tuning (re-training) are the three major steps to achieve this goal.
In pre-training, we try to obtain a parameter $\theta$ that makes the network accurate but is also amenable to pruning in the next stage. Because many open-source networks provide already-trained weights (in the spatial domain), the pre-training method we adopt here starts with these weights. We apply the standard SGD algorithm on the energy function of the form $E(\theta) = \mathcal{L}(\theta) + \lambda R(\mathcal{G}(\theta))$. Here $R$ is a regularization function applied on $\mathcal{G}(\theta)$, where $\mathcal{G}$ converts “on-the-fly” the parameters related to the convolution layers to the Winograd coefficients using the matrices $G_1$, $G_2$ shown in Equation 3, in order to encourage sparsity in the Winograd domain. We note that common frameworks such as Caffe allow the incorporation of various regularization penalties on different layers of a network in a straightforward manner. At the end of the SGD iterations, the obtained spatial parameter $\theta$ is mapped into the Winograd domain by the same function $\mathcal{G}$: $\theta_F = \mathcal{G}(\theta)$.
Pruning is done directly in the Winograd domain, which is enabled by the Winograd layer. Regularization and gradient-based thresholding are the two techniques used for pruning. Regularization is a common and useful technique to attenuate over-fitting and induce sparsity. The energy function is
$$E_F(\theta_F) = \mathcal{L}_F(\theta_F) + \lambda \lVert \theta_F \rVert_p.$$
Common choices of norms are $\ell_1$ or $\ell_2$, and we use the $\ell_1$-norm for the layers to be pruned.
To achieve higher sparsity, thresholding is used in addition to regularization. The idea of thresholding is to set to zero particular entries within the parameter $\theta_F$ that are deemed inconsequential (Han et al., 2015; Wen et al., 2016; Guo et al., 2016). Instead of using one uniform threshold to judge the significance, the threshold on a particular parameter is dependent on how significantly this parameter affects the underlying loss function $\mathcal{L}_F$. The thresholding function $T_{\epsilon,\beta}$ is of the form
$$T_{\epsilon,\beta}(w) = \begin{cases} 0 & \text{if } |w| \left( \left| \dfrac{\partial \mathcal{L}_F}{\partial w} \right| + \beta \right) < \epsilon, \\ w & \text{otherwise.} \end{cases}$$
We apply the thresholding function to a vector simply by applying it on each element, denoted with the slightly abused notation $T_{\epsilon,\beta}(\theta_F)$. The numbers $\epsilon$ and $\beta$ are part of the “hyper-parameters” of the training procedure. This scheme uses smaller thresholds as the magnitude of gradient increases (the threshold is $\epsilon/\beta$, $\epsilon/(2\beta)$, and 0 when the magnitude of gradient is 0, $\beta$, and $\infty$, respectively). We find that the choice of $\epsilon$=1e-4 and $\beta$=0.1 works well. We emphasize that thresholding the parameters, including Winograd parameters, is now straightforward because the Winograd parameters are the direct independent parameters of the Winograd layers.
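A gradient-dependent threshold of this kind can be sketched in a few lines. The exact rule below is an assumption chosen to be consistent with the description above: a parameter w is zeroed when |w|·(|∂L/∂w| + β) < ε, which is equivalent to a magnitude threshold of ε/(|∂L/∂w| + β) that shrinks as the gradient grows. The function name `gradient_threshold` is hypothetical.

```python
import numpy as np

def gradient_threshold(w, grad, eps=1e-4, beta=0.1):
    """Zero the entries of w whose magnitude is below eps / (|grad| + beta).

    Assumed rule for this sketch: keep w where |w| * (|grad| + beta) >= eps.
    """
    keep = np.abs(w) * (np.abs(grad) + beta) >= eps
    return np.where(keep, w, 0.0)

w = np.array([5e-4, 5e-4, 2e-3])
grad = np.array([0.0, 1.0, 0.0])
pruned = gradient_threshold(w, grad)
print(pruned)
```

With ε = 1e-4 and β = 0.1, a parameter with zero gradient is pruned if its magnitude is below 1e-3, so the first entry is zeroed; the second entry has the same magnitude but a large gradient and survives.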
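In the Winograd domain, where the $L$-layer network contains both Winograd layers and other layers (e.g., pooling) with parameters $\theta_F$, the pruning step applies regularization and thresholding as follows, where $\nabla_B$ is the familiar stochastic gradient with a mini-batch $B$ and $\eta$ is the learning rate:
$$\theta_F^{(k+1)} = T_{\epsilon,\beta}\left( \theta_F^{(k)} - \eta \, \nabla_B E_F\left(\theta_F^{(k)}\right) \right).$$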
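One pruning iteration can be sketched as a regularized gradient step followed by thresholding. This is a toy illustration under assumptions, not the authors' Caffe solver: the L1 penalty enters through its subgradient λ·sign(w), the threshold rule zeroes w when |w|·(|∂L/∂w| + β) < ε, and the gradient used in the threshold is the same mini-batch gradient used in the update. The name `prune_step` and the sample numbers are hypothetical.

```python
import numpy as np

def prune_step(w, grad_loss, lr, lam, eps=1e-4, beta=0.1):
    """One SGD step on loss + lam * ||w||_1, then gradient-based thresholding."""
    grad_energy = grad_loss + lam * np.sign(w)     # subgradient of the L1 penalty
    w_new = w - lr * grad_energy
    keep = np.abs(w_new) * (np.abs(grad_loss) + beta) >= eps
    return np.where(keep, w_new, 0.0)

w = np.array([1.0, 1e-4, -0.5])
grad_loss = np.zeros(3)                            # toy mini-batch gradient
w_next = prune_step(w, grad_loss, lr=0.1, lam=0.01)
print(w_next)
```

In this toy step the L1 term pulls every parameter toward zero by lr·λ = 1e-3, and the middle parameter ends up below the zero-gradient threshold of 1e-3 and is pruned.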
Similar to pruning in the spatial domain, pruning Winograd parameters will cause accuracy loss, which necessitates a fine-tuning step to recover the loss. As with pruning, fine-tuning is done only in the Winograd domain. During fine-tuning, the zero parameters obtained from the pruning step are fixed, while the network is trained to adjust the other non-zero parameters to recover the accuracy loss. We use L2 regularization during fine-tuning. The larger capacity of the Winograd domain gives another benefit here. Even with high sparsity, the remaining degrees of freedom allow a better recovery of accuracy by the fine-tuning step. This in general allows more aggressive regularization and thresholding during the pruning step.
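Fine-tuning with fixed zeros amounts to a masked update. The sketch below assumes plain SGD with an L2 penalty written as weight decay; the name `fine_tune_step` and the numbers are hypothetical, and a real solver would also handle momentum and learning-rate schedules.

```python
import numpy as np

def fine_tune_step(w, grad_loss, mask, lr, weight_decay):
    """SGD step with L2 regularization that keeps pruned entries at zero."""
    grad_energy = grad_loss + weight_decay * w     # gradient of 0.5 * wd * ||w||^2
    return (w - lr * grad_energy) * mask

w = np.array([0.0, 2.0, -1.0])                     # first entry was pruned
mask = (w != 0).astype(float)                      # sparsity pattern fixed after pruning
grad_loss = np.array([1.0, 1.0, 1.0])
w_next = fine_tune_step(w, grad_loss, mask, lr=0.1, weight_decay=0.5)
print(w_next)
```

The pruned entry stays at zero even though its loss gradient is non-zero, while the surviving entries continue to move.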
A highly optimized sparse Winograd convolution implementation is paramount to turn the sparsity obtained into actual performance gain in inference computations. Winograd convolution consists of three steps: (1) input transformation that corresponds to multiplications of each tile with small matrices $B_1$ and $B_2$ in Equation 3, (2) element-wise multiplications in the Winograd domain, and (3) output inverse transformation that corresponds to multiplications with $A_1$ and $A_2$. The bulk of computation is in the second step, which can be implemented as $pq$ independent multiplications of $K \times C$ matrices with $C \times T$ matrices, where we have $T$ tiles of $p \times q$ in size, $r \times s$ sized convolution kernels, $C$ input channels, and $K$ output channels. With a model pruned in the Winograd domain, the $K \times C$ matrices are sparse and the matrix multiplications become sparse-matrix times dense-matrix multiplications (SpMDM).
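The structure of the second step can be sketched as follows. This is a plain NumPy illustration of the idea, not the optimized implementation described here: for each position (i, j) in the Winograd tile, a sparse K×C parameter matrix stored in a simple CSR form multiplies a dense C×T matrix of transformed input tiles, with the inner update vectorized over tiles. The names `to_csr`, `spmdm`, and `winograd_elementwise_sparse` are hypothetical.

```python
import numpy as np

def to_csr(a):
    """Convert a dense 2D array to (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values, dtype=float), np.array(col_idx, dtype=int), np.array(row_ptr)

def spmdm(csr, dense):
    """Sparse (K x C, CSR) times dense (C x T); only non-zero parameters are visited."""
    values, col_idx, row_ptr = csr
    out = np.zeros((len(row_ptr) - 1, dense.shape[1]))
    for k in range(len(row_ptr) - 1):
        for idx in range(row_ptr[k], row_ptr[k + 1]):
            out[k] += values[idx] * dense[col_idx[idx]]   # one update across all tiles
    return out

def winograd_elementwise_sparse(w_f, x_f):
    """w_f: (K, C, p, q) pruned parameters; x_f: (C, T, p, q) transformed tiles."""
    K, C, p, q = w_f.shape
    T = x_f.shape[1]
    z = np.zeros((K, T, p, q))
    for i in range(p):
        for j in range(q):
            z[:, :, i, j] = spmdm(to_csr(w_f[:, :, i, j]), x_f[:, :, i, j])
    return z

rng = np.random.default_rng(0)
K, C, T = 8, 6, 10
w_f = rng.standard_normal((K, C, 4, 4)) * (rng.random((K, C, 4, 4)) < 0.1)  # roughly 90% zeros
x_f = rng.standard_normal((C, T, 4, 4))
z_sparse = winograd_elementwise_sparse(w_f, x_f)
z_dense = np.einsum("kcij,ctij->ktij", w_f, x_f)
print(np.allclose(z_sparse, z_dense))
```

In practice the CSR conversion would be done once after pruning rather than on every call; it is repeated here only to keep the sketch short.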
We first parallelize all three steps over multiple images within a batch. When the batch is not large enough for each thread to have its own image, we further parallelize over input/output channels. This minimizes data exchanges among cores and also allows fusing the second and third steps for cache locality. Alternatively, the second step can be parallelized over the $pq$ independent matrix multiplications. This approach has the benefit of increasing the multiplication size to $K \times C$ by $C \times NT$, where $N$ is the number of images per batch, exploiting more reuse out of the $K \times C$ matrices. However, this approach varies the decomposition over the three steps, incurring significant data exchanges among the cores. Our scheme replicates the $K \times C$ matrices at each core, but this is mitigated by the fact that the matrices are sparse and can be small enough to fit in the L1 or L2 caches.
Libraries for dense convolution operations in CNNs often lay out the data interleaving multiple channels so that vectorization can be done over channels. However, since the sparsity pattern of parameters varies over channels, this layout and vectorization scheme is inefficient for SpMDM in the second step of sparse Winograd convolution. Therefore, we use a layout where the fastest moving dimension is tiles and vectorize over tiles. Our open source project will show more implementation details (link omitted for double blind review).
This section describes Winograd training/pruning and sparse Winograd inference results. We implement Winograd layer (forward and backward propagation for training/pruning) as in Section 3 and sparse Winograd convolution as in Section 5 in our branch of Caffe (Jia et al., 2014).
We use a pre-trained spatial domain model to save overall training time. In particular, we start with the Caffe reference model from the Caffe model zoo (we call it AlexNet for simplicity even though it is a slight variation). We use the ImageNet ILSVRC-2012 dataset for pruning and test. Since the first convolution layer does not provide high enough sparsity to get speedups (Han et al., 2015; Wen et al., 2016), we do not attempt to prune that layer. We use the gradient-based thresholding described in Section 4 with $\epsilon$=1e-4 and $\beta$=0.1. We use learning rates of 5e-5 and 1e-4, and regularization factors of 5e-4 and 5e-5 in the pruning and fine-tuning steps, respectively. We use 200× smaller learning rates for the Winograd layers to ensure convergence.
Table 1 lists the accuracy and the sparsity of conv2-5 layers in AlexNet. Our method of training and pruning directly in Winograd domain (method A) results in 90.6–95.8% sparsity with only 0.1% top-1 accuracy drop from the reference model. Method B maintains convolution parameters both in spatial and Winograd domains and applies thresholding in 3 steps: (1) temporarily transform spatial parameters to Winograd, (2) threshold the Winograd parameters, and (3) find the least-square projection that maps the parameters back to the spatial domain. Since we have more parameters in Winograd domain, the Winograd parameters cannot be inversely transformed to the spatial domain exactly, hence the least-square projection. Due to this non-invertibility, method B either drops the accuracy by 8.1% (B1 in Table 1) or results in much lower sparsity (B2). Method C is from recent spatial domain pruning results (Park et al., 2017), which shows that, even when a model has 90% high sparsity in spatial domain, the sparsity significantly degrades to 25–70% once converted to Winograd domain.
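The round trip used by method B can be illustrated on a single kernel. The sketch below is only an illustration under assumptions: it uses the F(2×2, 3×3) kernel transform from (Lavin & Gray, 2016) instead of the tile size used in the experiments, a fixed magnitude threshold of 0.6 instead of the gradient-based rule, and an all-ones spatial kernel chosen so the numbers are easy to follow.

```python
import numpy as np

# Kernel transform matrix for F(2x2, 3x3) from Lavin & Gray (2016); illustrative choice.
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
M = np.kron(G, G)                       # 16x9 map on row-major flattened 3x3 kernels

w = np.ones((3, 3))                     # spatial kernel
u = G @ w @ G.T                         # step 1: transform to the Winograd domain
u_pruned = np.where(np.abs(u) < 0.6, 0.0, u)   # step 2: threshold Winograd parameters

# Step 3: least-squares projection back to the 9 spatial parameters.
w_back, *_ = np.linalg.lstsq(M, u_pruned.ravel(), rcond=None)
u_again = (M @ w_back).reshape(4, 4)    # what the projected kernel gives in Winograd domain

residual = float(np.linalg.norm(u_again - u_pruned))
print(np.count_nonzero(u_pruned), residual)
```

The pruned 4×4 kernel is not the transform of any 3×3 kernel, so the projection cannot reproduce it: the residual is non-zero, and the spatial parameters that are kept no longer correspond to the pruned Winograd parameters used in the forward pass.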
The results shown in Table 1 use the gradient-based thresholding described in Section 4, where, atop regularization, the gradient-based thresholding further reduces the non-zero density of each layer by up to 1.3× without affecting accuracy. The fine-tuning step improves the accuracy from 56.5% to 57.3% in method A. When natively pruned in the Winograd domain (method A), we frequently find sparsity patterns that have no counterparts in the spatial domain, such as a 6×6 kernel in the Winograd domain with only one non-zero at the second row of the second column. This illustrates that CNN architectures with layers of directly trainable parameters in the Winograd domain are more expressive and have more opportunity for pruning.
Table 1: Top-1 accuracy and sparsity of convolution layers in Winograd domain.
|Method|Top-1|conv2|conv3|conv4|conv5|
|C (Park et al., 2017)|57.4%|25.3% (85.6%)|69.3% (93.1%)|66.0% (91.8%)|61.3% (88.5%)|
Our sparse Winograd convolution inference is evaluated on a dual-socket server with Intel Xeon Platinum 8180 processors running at 2.5GHz, with a total of 56 cores and 77 MB of last-level cache. This platform represents the latest server-class systems inside data centers. In Figure 2, SparseWinograd shows the layer-wise performance of our sparse Winograd convolution design with the sparsity we obtained, shown as method A in Table 1. DenseDirect is measured with libxsmm (Heinecke et al., 2016), a state-of-the-art open source dense convolution implementation. DenseWinograd_ideal is an ideally projected performance assuming that the speedup over DenseDirect is commensurate with the reduction in floating-point operations from using the Winograd algorithm. We use this projection because there has yet to be a dense Winograd implementation optimized enough for the evaluated platform. Note that the ideal speedup is not usually realizable because Winograd convolution is less compute intensive and its performance is more bandwidth bound than direct convolution. SparseDirect is measured with an open source sparse direct convolution implementation from (Park et al., 2017) with the model pruned in the spatial domain, shown as method C in Table 1. SparseWinograd consistently outperforms the other methods: up to 5.4× over DenseDirect, 2.1× over DenseWinograd_ideal, and 1.5× over SparseDirect. Since dense Winograd convolution has been demonstrated to be the fastest (Lavin & Gray, 2016) for small convolution kernels in popular CNNs, the 2.1× speedup of our sparse Winograd over the ideal dense Winograd shows its great potential in accelerating inference for popular CNNs.
As CNNs have become pervasive, the relentless pursuit of fast convolution has inspired new algorithms and techniques for accelerating convolution. Transformation methods, especially Winograd convolution, and sparse methods are among the most successful approaches. This paper is the first to successfully combine them to construct sparse Winograd convolution. Moreover, we have demonstrated that our sparse Winograd can achieve 90+% sparsity with accuracy loss contained to within 0.1%, leading to up to 5.4× speedup over dense direct convolution in the 3×3 convolution layers of AlexNet on a latest Intel Xeon CPU. Looking ahead, our next steps include application to deeper networks like GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016) and to other platforms like FPGAs.
LeCun et al. The MNIST database of handwritten digits. 1998b.