Caffe for Sparse Convolutional Neural Network
Phenomenally successful in practical inference problems, convolutional neural networks (CNNs) are widely deployed in mobile devices, data centers, and even supercomputers. The number of parameters needed in CNNs, however, is often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. While pruning the fully connected layers reduces a CNN's size considerably, it does not improve inference speed noticeably, as the compute-heavy parts lie in convolutions. Pruning CNNs in a way that increases inference speed often imposes specific sparsity structures, thus limiting the achievable sparsity levels. We present a method to realize simultaneously size economy and speed improvement while pruning CNNs. Paramount to our success is an efficient general sparse-with-dense matrix multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns. Complementing this, we developed a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures. Together, these two allow us to demonstrate 3.1–7.3× convolution speedups over dense convolution in AlexNet, on Intel Atom, Xeon, and Xeon Phi processors, spanning the spectrum from mobile devices to supercomputers. We also open source our project at https://github.com/IntelLabs/SkimCaffe.
Due to the success of deep neural networks in a broad set of practical and even critical artificial intelligence tasks, they are now widely deployed in a spectrum of platforms: smart phones, autonomous cars, data center servers, and even supercomputers. While suitably designed and trained CNNs can be powerful, they are often large, requiring many parameters (e.g., the celebrated AlexNet (Krizhevsky et al., 2012) has 60 million). That large neural network models incur cost in terms of memory, energy, and inference speed is easy to see.
Consequently, various methods have been developed that try to prune the parameters after a CNN design is trained and proved useful. A common thread is to post-process a trained CNN. Post-processing may consist of retraining with sparsity-inducing regularization or of approximating tensors of parameters via tensor factorization. These methods reduce the size of CNNs significantly while preserving inference accuracy. Nevertheless, the inference speed gains in pruned networks are not nearly as impressive as the size reduction. In this sense, the benefits of CNN pruning seem not fully realized.
While seemingly unintuitive, the fact that significantly pruned CNNs do not run significantly faster can be easily explained. Fully connected (fc) layers usually contain the bulk of the parameters, while convolutional (conv) layers consume the bulk of computation time. Reducing the size of just the fc layers therefore readily leads to a meaningful reduction in size, as in Han et al. (2016b) and Guo et al. (2016), but to little speed improvement.
The crux of speed improvement thus lies in actual fast convolution of sparse kernels with feature maps (not just floating-point operations reduction), which is a challenging problem. It is well known in the field of numerical linear algebra that the performance of sparse matrix operations is typically memory bandwidth bound. Direct application of sparse matrix operations to compute the conv layers when the kernels are sparse will likely result in sub-optimal speed gains. This concern about the low efficiency of sparse operations is also discussed in the design of GoogLeNet (Szegedy et al., 2015). We will term methods that work directly with sparse data structures “sparse methods.” As an alternative to sparse methods, “dense methods” gather data in a way that allows the actual convolution to be performed by dense linear algebra functions such as GEMM. An example is found in (Lebedev & Lempitsky, 2015; Wen et al., 2016), which produce group-wise sparsity patterns that facilitate the use of existing and highly tuned dense matrix computation library functions to perform the convolutions. However, imposing sparsity patterns limits the sparsity level that would otherwise be achievable had arbitrary patterns been allowed. We note that high compression in the conv layers is gaining importance as these layers constitute a much larger percentage of parameters in recent networks such as GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2015).
We view sparse methods differently. Convolutions in CNNs involve multiple channels and thus offer much higher data reuse than typical sparse matrix operations in scientific computing. Specifically, we present a highly efficient direct sparse convolution design formulated as sparse-matrix-dense-matrix multiplication with the dense matrix columns generated on-the-fly from a single column vector. In addition to being highly efficient, this sparse convolution design is friendly to convolution kernels with arbitrary sparsity patterns. We call this element-wise sparsity to distinguish it from the group-wise sparsity mentioned previously. As shown later on, accepting element-wise sparsity significantly increases the achievable sparsity level.
Complementing our sparse convolution design, we formulate a performance model to elucidate when and how best to use sparse convolutions on different computer architectures and at different CNN layers. Our formulation follows the roofline model (Williams et al., 2009). In particular, our model suggests (correctly) that sparse convolution can improve inference speed even with a moderate sparsity level of around 70%. In addition, the model provides upper and lower bounds of sparsity levels that can contribute to speed improvements. Sparsity higher than the upper bound offers no further speed improvement, and sparsity lower than the lower bound can in fact slow down inference rather than accelerate it.
Combining the sparse convolution design with the performance model allows us to prune a CNN in a co-design manner, with our proposed new pruning algorithm, Guided Sparsity Learning (GSL). As illustrated later, we can adjust sparsity targets precisely at different layers so as to maximize inference speed, best preserve accuracy, and minimize a network’s size. In some cases, particular layers are identified as best not pruned at all due to no potential speedups; leaving them unchanged gives other layers more room for gainful pruning in terms of size and speed.
Our paper makes the following contributions:
A high performance sparse convolution design that takes advantage of arbitrary sparsity patterns and outperforms dense convolution even with a moderate sparsity.
A general performance model that (1) projects speedups over dense convolutions for varying levels/types of sparsity and different computing platforms and (2) provides training guidelines for precisely targeting layers and sparsity ranges with potential to accelerate inference.
Guided Sparsity Learning (GSL), the first pruning algorithm fusing the awareness of speedup potential into sparsity learning; and its application to AlexNet and GoogLeNet. In particular, in GoogLeNet, we prune out more than 80% of parameters of all 5×5/3×3 conv layers and fc layers with no accuracy drop.
An optimized sparse convolution implementation (http://github.com/IntelLabs/SkimCaffe) that provides 7.3×, 3.4×, and 3.1× speedups of convolution layers in AlexNet over dense methods on Intel Atom, Xeon, and Knights Landing processors, respectively, with no accuracy drop. In particular, this paper is one of the first evaluations of Xeon Phi processors on deep learning algorithms.
The rest of the paper is organized as follows. Section 2 presents the details of our sparse convolution design, formulation of the performance model, the Guided Sparsity Learning (GSL) pruning algorithm, and how they are combined to prune and accelerate CNNs. Section 3 demonstrates the effectiveness of these developments on AlexNet and GoogLeNet on a variety of platforms. Section 4 discusses related work and reviews the state of the art. Section 5 concludes and outlines a few next steps.
As explained previously, pruning CNN models does not yet benefit inference speed as much as model size reduction. This section first presents our efficient direct sparse convolution design that remedies this situation significantly. We then develop a performance model that projects speedup over different sparsity levels and on different processor architectures. The model guides our speedup-aware training method, Guided Sparsity Learning (GSL).
A sparse convolution for all output positions across all output channels can eventually be considered as a virtual sparse-matrix-dense-matrix multiplication (SpMDM), as described in the following. Consider a bank of $N$ filters, each with size $R \times S$, against an $H_{in} \times W_{in}$ feature with $C$ input channels. We denote the filter bank as a 4-mode tensor $\mathcal{W}$ with size $N \times C \times R \times S$, the input feature as a 3-mode tensor $\mathcal{I}$ with size $C \times H_{in} \times W_{in}$, and the output feature as a 3-mode tensor $\mathcal{O}$ with size $N \times H_{out} \times W_{out}$. The output value at the $(y, x)$th position of the $n$th output channel is computed by
$$\mathcal{O}(n, y, x) = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} \mathcal{W}(n, c, r, s)\, \mathcal{I}(c, y + r, x + s),$$
which is a dot-product of two 3D tensors as shown in Figure 2. This can be treated as a vector dot-product by: first vectorizing the 3D subtensor of $\mathcal{W}$ corresponding to the $n$th output channel, then vectorizing $\mathcal{I}$ (denoted as $\mathrm{vec}(\mathcal{I})$), and finally stretching the first vector to match the dimension of the two vectors. When $\mathcal{W}$ is sparse, this vector dot-product becomes a sparse-vector-dense-vector dot-product. Consider flattening all dimensions except the first one of $\mathcal{W}$ into a sparse matrix $W_{(1)}$ (i.e. mode-1 matricization of $\mathcal{W}$ as in Kolda & Bader (2009)), with its row vectors stretched to match the dimension of $\mathrm{vec}(\mathcal{I})$. $\mathcal{O}(n, y, x)$ is then the dot-product between the $n$th row of $W_{(1)}$ and $\mathrm{vec}(\mathcal{I})$. Subsequently, the values at the same given $(y, x)$th position of all $N$ output channels can be computed collectively as a sparse-matrix-dense-vector multiplication (SpMV):
$$\mathcal{O}(:, y, x) = W_{(1)} \cdot \mathrm{vec}(\mathcal{I}_{y,x}),$$
where $\mathcal{I}_{y,x}$ denotes the tensor $\mathcal{I}$ with its last two dimensions shifted by $(y, x)$. The values at different output positions can be computed as a sparse-matrix-dense-matrix multiplication (SpMDM), where the columns of the dense matrix are actually the same vector $\mathrm{vec}(\mathcal{I})$ but with different offsets. In order to save bandwidth usage, we operate with a virtual dense matrix, $I_{virtual}$, whose columns are generated on the fly by adjusting the indices through which we access $\mathrm{vec}(\mathcal{I})$.
Using the virtual dense matrix essentially skips the lowering step used in standard frameworks such as Caffe, and, therefore, we call our method direct sparse convolution, to distinguish it from sparse convolution with lowering such as the method used in Liu et al. (2015). The lowering approach replicates the input feature multiple times, significantly reducing arithmetic intensity. The lowering process has demonstrated overhead for dense convolution, as in Hadjis et al. (2015) and Chintala (2015), and is particularly problematic for sparse convolution, whose intensity is already lower than that of its dense counterpart. Figure 3 demonstrates the advantage of our direct sparse convolution, using the performance model that will be developed in Section 2.2, where our direct sparse convolution significantly outperforms lowering-based methods at a high level of sparsity.
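To make the replication cost of lowering concrete, the following sketch counts how many elements an im2col-style lowered matrix holds compared with the original input feature. It assumes no padding; the function names and the example sizes are hypothetical and are not taken from Caffe or SkimCaffe.

```python
def lowered_elements(channels, h_in, w_in, r, s, stride=1):
    """Elements in the (channels*r*s) x (h_out*w_out) lowered matrix."""
    h_out = (h_in - r) // stride + 1
    w_out = (w_in - s) // stride + 1
    return channels * r * s * h_out * w_out


def replication_factor(channels, h_in, w_in, r, s, stride=1):
    """How many times larger the lowered matrix is than the input feature."""
    return lowered_elements(channels, h_in, w_in, r, s, stride) / (channels * h_in * w_in)


# Hypothetical layer: 3x3 filters over a 5x5 single-channel input, stride 1.
factor = replication_factor(1, 5, 5, 3, 3)
```

For large inputs with unit stride the factor approaches the filter area, while a 1×1 filter needs no replication at all. The extra memory traffic for the replicated elements is what lowers arithmetic intensity.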
Even though direct sparse convolution may seem conceptually more complicated than the usual SpMDM or the lowering-based methods, it can be concisely expressed in the pseudo code shown in Figure 2. To decouple from a specific layout of tensor $\mathcal{I}$, the pseudo code uses a layout function $f$ such that $f(c, y, x)$ maps to the offset corresponding to the $(c, y, x)$th element of $\mathcal{I}$ (we assume $f(c, y + r, x + s) = f(c, y, x) + f(0, r, s)$). For example, in CHW layout, $f(c, y, x) = (c H_{in} + y) W_{in} + x$.
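The figure's pseudo code is not reproduced in this text. As an illustration of the same idea, here is a minimal Python sketch, assuming unit stride, no padding, a CHW layout, and a compressed sparse row (CSR) weight matrix whose column indices already hold the stretched input offsets. All function names are hypothetical, and this is not the optimized SkimCaffe kernel (no tiling, SIMD, or register blocking).

```python
def layout_chw(c, y, x, h_in, w_in):
    """Offset of element (c, y, x) in a CHW-ordered flat array."""
    return (c * h_in + y) * w_in + x


def to_csr(weights, h_in, w_in):
    """Flatten a dense N x C x R x S filter bank into CSR, one row per filter.

    Each non-zero at (c, r, s) stores the flat input offset of (c, r, s),
    so the input element for output (y, x) sits at colidx + offset of (0, y, x).
    """
    rowptr, colidx, values = [0], [], []
    for filt in weights:
        for c, plane in enumerate(filt):
            for r, row in enumerate(plane):
                for s, w in enumerate(row):
                    if w != 0.0:
                        colidx.append(layout_chw(c, r, s, h_in, w_in))
                        values.append(w)
        rowptr.append(len(values))
    return rowptr, colidx, values


def direct_sparse_conv(rowptr, colidx, values, flat_in, h_in, w_in, r_size, s_size):
    """Convolve without lowering; returns out[n][y][x]."""
    h_out, w_out = h_in - r_size + 1, w_in - s_size + 1
    out = []
    for n in range(len(rowptr) - 1):
        plane = [[0.0] * w_out for _ in range(h_out)]
        for j in range(rowptr[n], rowptr[n + 1]):
            off, w = colidx[j], values[j]  # read once, reused h_out * w_out times
            for y in range(h_out):
                base = off + layout_chw(0, y, 0, h_in, w_in)
                for x in range(w_out):
                    plane[y][x] += w * flat_in[base + x]  # contiguous in x
        out.append(plane)
    return out


def dense_conv(weights, inp):
    """Reference dense triple sum over (c, r, s), used only for checking."""
    r_size, s_size = len(weights[0][0]), len(weights[0][0][0])
    h_out = len(inp[0]) - r_size + 1
    w_out = len(inp[0][0]) - s_size + 1
    return [[[sum(filt[c][r][s] * inp[c][y + r][x + s]
                  for c in range(len(inp))
                  for r in range(r_size)
                  for s in range(s_size))
              for x in range(w_out)]
             for y in range(h_out)]
            for filt in weights]


# Tiny example: one 2x2 filter with two non-zeros over a 3x3 single-channel input.
weights = [[[[1.0, 0.0], [0.0, 2.0]]]]
inp = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
flat = [v for plane in inp for row in plane for v in row]
csr = to_csr(weights, 3, 3)
out_sparse = direct_sparse_conv(*csr, flat, 3, 3, 2, 2)
out_dense = dense_conv(weights, inp)
```

The inner loops never materialize a lowered matrix: the "virtual" dense columns are produced purely by adding the output position's offset to the stored column index.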
In convolutional layers in CNNs, an input channel is reused against multiple output channels and vice versa, and there is also ample reuse out of an input channel due to overlapping between dot-products, especially for a large filter size and a small stride. Therefore, the arithmetic intensity of sparse convolution can be significantly higher than that of typical sparse-matrix computations such as SpMV, thus leading to high compute efficiency. Our optimized implementation (https://github.com/IntelLabs/SkimCaffe/blob/intel_scnn/include/caffe/util/sconv.hpp) fully takes advantage of the reuse, applying loop tiling to both input and output channels, with column blocking (Buluç et al., 2009) applied to the weight sparse matrix. SIMDification and register blocking optimizations are also applied to the y and x loops in the pseudo code. Non-contiguous indirect access (i.e. gather) is another overhead of typical sparse-matrix computations. However, as shown in our pseudo code, the values read from the colidx and value arrays of a sparse matrix are reused $H_{out} W_{out}$ times. The access to tensor $\mathcal{I}$ is also contiguous as long as the tensor’s elements with contiguous $y$ or $x$ values are stored contiguously, which is a common case, as in the CHW format.
Even though the bulk of computation belongs to convolution layers (hence the focus of this paper), we also briefly discuss exploiting sparsity in fully connected layers. Exploiting sparsity in fully connected layers is actually simpler than in convolution layers, because fully connected layers are implemented as GEMM and we can leverage prior work on sparse-matrix-dense-matrix multiplication (SpMDM). As with sparse convolutions, the arithmetic intensity of SpMDM decreases with higher sparsity, and its actual FLOP/s is lower than that of GEMM. Therefore, we also have a range of useful sparsity that can guide training for balanced accuracy-speed-size trade-offs. We apply optimizations similar to the ones applied to direct sparse convolution, such as loop tiling and register blocking.
We briefly discuss how the sparse matrices arising from CNNs differ from the ones in scientific computing or graph analytics (Davis & Hu, 2011). Sparse matrices in scientific computing typically have banded non-zero patterns (i.e., a long diameter in their graph interpretations) that lead to high temporal locality. We observe, however, that matrices in sparse CNNs do not exhibit such banded non-zero patterns, and reordering algorithms such as reverse Cuthill-McKee (Cuthill & McKee, 1969) improve the locality little. Therefore, existing sparse linear algebra library routines like the ones in Intel MKL, which have been optimized for scientific matrices, do not provide the best performance, hence the need for a custom implementation like ours with loop tiling and register blocking. Graph analytics is another area where sparse matrices are extensively used. Sparse matrices in graph analytics are much bigger and often exhibit a “power-law” distribution that calls for a different data structure such as doubly compressed sparse column (Buluç & Gilbert, 2008).
We observe that input features also have high sparsity (up to 85%) in FC layers in AlexNet, which motivates using sparse-matrix-sparse-matrix multiplication (SpGEMM). However, we find that its performance is lower than that of SpMDM. A challenge is that the size of the output matrix is unknown a priori, requiring two passes over the input matrices for parallelization (the first pass to determine the number of non-zeros per output matrix row and the second pass for the actual computation) (Gustavson, 1978). There are ways to work around this, for example by outputting a dense matrix or by having a separate output sparse matrix per thread, but they are not without their own overheads. As a result, even the state-of-the-art SpGEMM implementation on a recent Xeon processor, which is significantly faster than its GPU counterparts, does not achieve more than 20 GFLOP/s (Patwary et al., 2015). In addition, the sparsity of FC layer inputs highly depends on activation functions. Even though ReLU with zero slope on the negative side is currently a popular choice of activation function resulting in high sparsity, we cannot be certain that this will continue. Still, it is interesting to see if these challenges can be overcome and SpGEMM can further improve the performance of FC layers in sparse CNNs.
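The two-pass structure mentioned above can be sketched as follows. This is a plain row-wise illustration on CSR inputs, assuming each matrix is given as a (rowptr, colidx, values) triple; the function names are hypothetical and no parallelization or tuning is shown.

```python
def spgemm_symbolic(a_rowptr, a_colidx, b_rowptr, b_colidx):
    """Pass 1: count structural non-zeros in each row of C = A * B."""
    counts = []
    for i in range(len(a_rowptr) - 1):
        cols = set()
        for j in range(a_rowptr[i], a_rowptr[i + 1]):
            k = a_colidx[j]
            cols.update(b_colidx[b_rowptr[k]:b_rowptr[k + 1]])
        counts.append(len(cols))
    return counts


def spgemm_numeric(a, b, counts):
    """Pass 2: fill the output arrays whose sizes pass 1 determined."""
    a_rowptr, a_colidx, a_val = a
    b_rowptr, b_colidx, b_val = b
    c_rowptr = [0]
    for n in counts:  # prefix sum gives each row its own output slice
        c_rowptr.append(c_rowptr[-1] + n)
    c_colidx = [0] * c_rowptr[-1]
    c_val = [0.0] * c_rowptr[-1]
    for i in range(len(a_rowptr) - 1):  # rows are independent once c_rowptr is known
        acc = {}
        for j in range(a_rowptr[i], a_rowptr[i + 1]):
            k, a_ik = a_colidx[j], a_val[j]
            for p in range(b_rowptr[k], b_rowptr[k + 1]):
                col = b_colidx[p]
                acc[col] = acc.get(col, 0.0) + a_ik * b_val[p]
        for slot, col in enumerate(sorted(acc), start=c_rowptr[i]):
            c_colidx[slot] = col
            c_val[slot] = acc[col]
    return c_rowptr, c_colidx, c_val


# A = [[1, 0], [0, 2]] and B = [[0, 3], [4, 0]] in CSR form.
A = ([0, 1, 2], [0, 1], [1.0, 2.0])
B = ([0, 1, 2], [1, 0], [3.0, 4.0])
counts = spgemm_symbolic(A[0], A[1], B[0], B[1])
C = spgemm_numeric(A, B, counts)
```

Because the output row pointers are fixed after the first pass, each row of the second pass writes to a disjoint slice, which is what makes row-level parallelization possible at the cost of traversing the inputs twice.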
The performance of sparse convolution depends highly on the sparsity level of the weight tensor. This section develops a performance model to determine the appropriate target sparsity range for pruning and to project theoretical speedup for any given sparsity, using the roofline model (Williams et al., 2009).
We denote the floating-point operations required in a convolution as $C$ (in FLOP), the size of input and output activation tensors as $S_A$ (in Bytes), and the size of the weight tensor as $S_W$, all without considering sparsity. We denote the density of non-zeros in filters as $x$ (the lower the $x$, the higher the sparsity of the weight tensor), the compute capability of the processor as $F$ (in FLOP/s), and the memory bandwidth as $B$ (in B/s). With these parameters, the time for dense convolution ($t_{\text{dense}}$), the time for sparse convolution bound by compute ($t_{\text{sparse\_compute}}$) and by bandwidth ($t_{\text{sparse\_bw}}$), and the theoretical speedup can be modeled as follows (we assume dense convolution is not bandwidth bound):
$$t_{\text{dense}} = \frac{C}{F}, \qquad t_{\text{sparse\_compute}} = \frac{\alpha x C}{F}, \qquad t_{\text{sparse\_bw}} = \frac{S_A + \beta x S_W}{B},$$
$$\text{speedup} = \frac{t_{\text{dense}}}{\max(t_{\text{sparse\_compute}},\, t_{\text{sparse\_bw}})},$$
where $\alpha$ and $\beta$ denote the compute and storage overheads of sparse representations, respectively. We observe $\alpha \sim 3$ on a Xeon E5-2697 v4 processor, and $\beta$ is typically 2 (in compressed sparse row representation, we need a 4B column index for each 4B single-precision floating-point non-zero value).
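A small sketch of this model follows, assuming the roofline-style form in which the compute-bound sparse time scales with the compute overhead times the non-zero density, and the bandwidth-bound time covers the activations plus the overhead-inflated non-zero weights. The function and parameter names are hypothetical, and the numbers in the example are made up for illustration rather than measured.

```python
def conv_times(flops, act_bytes, weight_bytes, density,
               peak_flops, bandwidth, alpha=3.0, beta=2.0):
    """Return (t_dense, t_sparse_compute, t_sparse_bw) in seconds."""
    t_dense = flops / peak_flops
    t_sparse_compute = alpha * density * flops / peak_flops
    t_sparse_bw = (act_bytes + beta * density * weight_bytes) / bandwidth
    return t_dense, t_sparse_compute, t_sparse_bw


def projected_speedup(flops, act_bytes, weight_bytes, density,
                      peak_flops, bandwidth, alpha=3.0, beta=2.0):
    """Dense time divided by the slower of the two sparse bounds."""
    t_dense, t_sc, t_sb = conv_times(flops, act_bytes, weight_bytes, density,
                                     peak_flops, bandwidth, alpha, beta)
    return t_dense / max(t_sc, t_sb)


# Made-up layer and machine: 1 GFLOP of work, 1 MB of activations, 1 MB of
# weights, 1 TFLOP/s peak, 100 GB/s bandwidth, 10% non-zero density.
speedup = projected_speedup(1e9, 1e6, 1e6, 0.1, 1e12, 1e11)
```

In this made-up case the compute bound dominates, so the projected speedup is simply the reciprocal of the overhead times the density.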
This analytical model is visualized in Figure 3. Here, we define effective FLOP/s with respect to the number of floating-point operations that would have been performed by dense convolutions, including the ones for zero weights (i.e. effective FLOP/s is $C$ divided by the sparse convolution time). With a moderate sparsity, convolution is likely to be compute bound, and hence effective FLOP/s rapidly increases as $x$ decreases. For conv5 in AlexNet, within a typical sparsity range without accuracy loss, direct sparse convolution can achieve 2–7× and 4–14× speedup on Xeon and Atom platforms, respectively, as shown in Figure 3 and as will be validated in Section 3.2.
However, decreasing arithmetic intensity further with lowering $x$ eventually makes the performance bandwidth bound. Thus, there is an upper bound of useful sparsity, and a sparsity higher than it does not provide additional speedup, while only making it more challenging for training to preserve accuracy. This upper bound can be found by solving for $x$ such that $t_{\text{sparse\_compute}} = t_{\text{sparse\_bw}}$. This analysis can be applied to various computing platforms including CPUs and GPUs because the model captures the essential platform-dependent characteristic, the ratio of compute capability to memory bandwidth ($F/B$). When the compute to bandwidth ratio is lower, as in a platform like Atom, the performance will be less quickly bandwidth bound. For example, for conv5 of AlexNet, the non-zero density at which performance becomes bandwidth bound is smaller on Atom C2750 than on the Xeon. The speedup to sparsity relation also varies over layers. For example, since 1×1 convolutions in GoogLeNet have low arithmetic intensity to begin with, their performance quickly becomes bandwidth bound at lower sparsity (or higher $x$).
The compute overhead, $\alpha$, depends on the quality of the sparse convolution implementation and on the target processor architecture. Since $t_{\text{sparse\_compute}} > t_{\text{dense}}$ for $x > 1/\alpha$, there is a lower bound of useful sparsity such that, with a sparsity lower than that, sparse convolution becomes slower than dense convolution. The previous section described our sparse convolution implementation, which achieves $\alpha = 3$ (since $\alpha$ is the compute overhead, lower is better) on the Xeon instead of $\alpha = 100$ as conjectured by Szegedy et al. (2015). The compute overhead of $\alpha = 3$ primarily comes from the fact that access to the input tensor is not aligned at cache line boundaries. Recent Xeon processors can execute 1 unaligned SIMD load per cycle, which is not enough to sustain 2 SIMD fused multiply-add operations per cycle. In addition to this 2× overhead, when $W_{out}$ is not a multiple of the SIMD width (8 for Xeon E5-2697 v4), we do not fully utilize the SIMD registers. Since Atom processors do not execute multiple SIMD floating-point operations per cycle anyway, and because their SIMD width is narrower at 4, their compute overhead is smaller at 1.2, as will be shown in Section 3.2.
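The two bounds on useful sparsity discussed above can be written out explicitly. The short derivation below assumes the model takes the form $t_{\text{dense}} = C/F$, $t_{\text{sparse\_compute}} = \alpha x C / F$, and $t_{\text{sparse\_bw}} = (S_A + \beta x S_W)/B$; the symbols $x_{\min}$ and $x_{\max}$ are introduced here only for illustration.

```latex
% Lower bound of useful sparsity (largest useful non-zero density):
% sparse convolution beats dense only if its compute-bound time is smaller.
\frac{\alpha x C}{F} < \frac{C}{F}
\quad\Longleftrightarrow\quad
x < x_{\max} = \frac{1}{\alpha}

% Upper bound of useful sparsity (smallest useful non-zero density):
% the point where the compute-bound and bandwidth-bound times meet.
\frac{\alpha x C}{F} = \frac{S_A + \beta x S_W}{B}
\quad\Longrightarrow\quad
x_{\min} = \frac{S_A F}{\alpha C B - \beta S_W F},
\qquad \text{valid when } \alpha C B > \beta S_W F

% Useful range of non-zero density:
x_{\min} \le x < x_{\max}
```

With $\alpha = 3$, the first condition gives $x < 1/3$, i.e. a sparsity above roughly 67%, consistent with the moderate sparsity level of around 70% mentioned in the introduction. If $\alpha C B \le \beta S_W F$, the bandwidth-bound time exceeds the compute-bound time at every density, so the layer is bandwidth bound from the start.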
The upper and lower bounds on useful sparsity can provide important insights for training/pruning. The model can tell that sparse convolution is not useful for certain layers, in which case we can skip pruning of those layers to provide more room for sparsity in the other layers. For example, layers like the first layer in AlexNet and GoogLeNet may not provide enough sparsity regardless of the amount of regularization applied, as long as the original inference accuracy is to be preserved. A layer may be already bandwidth bound even before pruning, like the 1×1 convolution layers in GoogLeNet, as shown by the inception_4a/5x5_reduce layer in Figure 3.
Guided Sparsity Learning (GSL), our new pruning algorithm, is inspired by these insights and our performance model. GSL is the first to fuse the awareness of speedup potential into sparsity learning. GSL is a generic algorithm and accepts different regularization methods for pruning. When GSL is used with element-wise regularization for pruning, thus denoted as Guided Element-wise Sparsity Learning (GESL), it learns the element-wise sparsity of layers where the model predicts speedups. GSL can also be used with regularization methods that are more complicated than basic ridge and lasso regularization. For example, GSL can be combined with dynamic network surgery (Guo et al., 2016), as will be shown in Section 3.1.
|Algorithm: Guided Sparsity Learning (GSL)|
|Input|Pruning layer set ($S$), performance model ($M$)|
|Initialize|Project speedup for each layer $L$ in $S$ using $M$; exclude all $L$s without speedup potential from $S$|
|Repeat|Train the whole neural network while actively pruning only $L$s in $S$|
||Periodically, project speedup of $L$s in $S$ using their current sparsity and $M$|
||Periodically, for each $L$ in $S$ do:|
||if (sparsity ≥ upper bound of the useful sparsity range), stop pruning $L$|
||if (stabilized sparsity < lower bound of the useful sparsity range), stop pruning $L$ and restore its original dense weights|
||/* Stopping pruning of $L$ gives other $L$s a better chance to prune further and achieve better accuracy */|
|Until|Maximum iterations or convergence reached|
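The control logic of the algorithm above can be sketched in a few lines. This is only an illustration of the layer-selection and periodic-check steps, not the training code: the function names, the callback-based interface, and the "projected speedup greater than 1" criterion for speedup potential are assumptions made for the example.

```python
def select_layers(layers, projected_speedup):
    """Initialize: keep only layers with speedup potential.

    Assumption: 'speedup potential' means a projected speedup above 1.
    """
    return [name for name in layers if projected_speedup(name) > 1.0]


def periodic_check(active, sparsity, bounds, stabilized, restore_dense):
    """One periodic check; returns the layers that continue to be pruned.

    sparsity[name]  -> current sparsity (fraction of zero weights)
    bounds[name]    -> (lower, upper) bounds of the useful sparsity range
    stabilized      -> callable telling whether a layer's sparsity has settled
    restore_dense   -> callable that puts back the layer's dense weights
    """
    still_active = []
    for name in active:
        lower, upper = bounds[name]
        if sparsity[name] >= upper:
            continue  # stop pruning: more sparsity gives no further speedup
        if stabilized(name) and sparsity[name] < lower:
            restore_dense(name)  # stop pruning and restore dense weights
            continue
        still_active.append(name)
    return still_active
```

A training loop would call `periodic_check` every so many iterations and apply the sparsity-inducing regularizer only to the returned layers, so that layers dropped from the set leave more room for the remaining ones.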
Although GSL as described above aims primarily at inference speed, GSL can balance the implications of pruning on inference speed, accuracy, and model size. To do this, optional constraints can be given to GSL to prioritize pruning of different layers in the network. For example, by using different regularization strengths on conv and fc, we can tune the priorities on speed and model size.
||Atom C2750 (Atom)|Xeon E5-2697 v4 (BDW)|Xeon Phi 7250 (KNL)|
|Achievable bandwidth (GB/s)|15|122|480|
Our sparse CNN design is evaluated on three platforms shown in Table 1. Intel Atom C2750 (Atom) represents resource-constrained mobile platforms or micro servers optimized for energy efficiency. Xeon E5-2697 v4 (BDW) represents data-center servers. Xeon Phi 7250 (KNL) is designed for high-performance computing, but its next version, Knights Mill, will specifically target machine learning. Our sparse CNN is implemented as an extension of the Caffe deep learning framework (Jia et al., 2014) and is available at https://github.com/IntelLabs/SkimCaffe. We use Intel compiler version 17.0.0 and use all cores available. The SGEMM performance and achievable memory bandwidth listed are measured with Intel MKL version 2017 and the STREAM benchmark (McCalpin), respectively.
We train with the ImageNet ILSVRC-2012 dataset (Deng et al., 2009), starting from the pre-trained Caffe reference model (a slight variation, but we call it AlexNet for simplicity) and the GoogLeNet model from the Caffe model zoo. Since we find it is easy to obtain high sparsity with smaller networks and datasets like LeNet and CIFAR regardless of pruning method, we do not present their results. Our training process is based on the method described in Wen et al. (2016) with the following differences. We look for element-wise sparsity with lasso instead of group lasso, and guide the training process to target the layers and range of sparsity where we see speedup potential. We have explored various solver methods and learning rate schedules, but found that they do not significantly affect the eventual accuracy and sparsity once hyper-parameters are tuned for the respective settings. In general, the pruning step no longer improves after 450K and 900K mini-batch iterations for AlexNet and GoogLeNet, respectively. The re-training step saturates around 150K and 300K mini-batch iterations. To see trade-offs among accuracy, speed, and model size, we try various weight decays ranging from 1e-5 to 1e-3, and, for AlexNet, decay multipliers for the fc layers ranging from 1e-2 to 1. We find that a starting learning rate of 1e-3 and a weight decay of 5e-5 in general give a high sparsity with minimal accuracy drop. We reduce the learning rate by 10× for the re-training step.
Figure 4 shows the effectiveness of our guided pruning and compares the level of element-wise and group-wise sparsity we can obtain. We should look at the results layer by layer because the speedup over dense convolution does not have a simple linear relation with sparsity, as shown by our model, and, therefore, the overall FLOP reduction does not necessarily closely correlate with the real speedup. In AlexNet, using the same element-wise regularization factor across all layers (element-wise sparsity learning, ESL) provides non-zero densities around 0.4 for conv2-5. This is a fine sparsity when the primary goal is reducing model size, but not high enough for speeding up inference. Therefore, guided ESL (GESL) reduces the regularization factor of fc layers (as they have fewer FLOPs) and avoids pruning conv1 entirely (as its sparsity is too low for any potential speedups with more regularization). This leads to less than 0.2 non-zero density for conv2-5, the range where we can get speedups from sparse convolution. Similarly, applying GSL to dynamic network surgery (DNS), a recent proposal to obtain high sparsity, as Guided DNS (GDNS), we can see that GSL effectively improves the obtained sparsity for accelerating inference by de-prioritizing conv1 and fc layers (we go further to not prune fc layers at all to see how much sparsity DNS can provide in conv layers). Although not shown in Figure 4(b), we also apply DNS and GDNS to GoogLeNet. Compared to DNS, GDNS on GoogLeNet successfully reduces non-zero density by 1.4× on average in layers with speedup potential by prioritizing these layers for pruning.
Structured sparsity learning (SSL) provides group-wise sparsity, for which we can use dense methods, but its sparsity is lower because of constrained forms of sparsity. According to our model, SSL performs better when $x_g \alpha_g < x_e \alpha_e$, where $x_e$ and $x_g$ are the non-zero densities of ESL and SSL, and $\alpha_e$ and $\alpha_g$ are the compute overheads of ESL and SSL, respectively. Even if we use an ideal 100% efficiency for SSL ($\alpha_g = 1$) and the measured overhead $\alpha_e = 3$ for ESL, the $x_g$ shown in Figure 4(a) is not small enough to outperform GESL. (Note that certain kinds of group-wise sparsity like the “column-wise sparsity” defined in Wen et al. (2016) need lowering, which can be a considerable overhead, making it hard to approach the ideal efficiency.) Note that our guiding principles are already applied to SSL, where conv1 and fc layers are not pruned. In short, the sparsity SSL can currently obtain is too low to outperform an optimized sparse convolution design for element-wise sparsity such as ours. This motivates further investigation of pruning methods for higher group-wise sparsity.
GoogLeNet has many 1×1 convolutions, for which sparse convolution does not provide speedups due to their low arithmetic intensity, as our model predicts. As shown in Figure 4(a), GESL successfully discovers this and avoids pruning the 1×1 convolutions in favor of higher sparsity in 3×3 and 5×5 convolutions, where our model projects speedup, and, more importantly, almost recovers the original accuracy. For 1×1 convolutions, group-wise sparsity implemented in SSL reduces to element-wise sparsity (this is because the filter coefficients for a given input and output channel pair are also a type of group that SSL is looking for), and dense methods can no longer be used. We believe that SSL provides higher sparsity for 1×1 convolutions than ESL because SSL does not prune fc layers, providing more room to prune other layers. However, no pruning in fc layers is not an inherent limitation of SSL; this is just because Wen et al. (2016) focus on conv layers, and we follow the same approach to see the maximum sparsity that SSL and GSSL can get in conv layers. For larger convolutions that contribute the bulk of FLOPs, ESL provides significantly higher sparsity than SSL; most of the larger convolution layers have non-zero density less than 0.2, where sparse convolutions can provide speedups. It is interesting to note that ESL, GESL, and SSL all achieve very high sparsity, with non-zero density less than 1.5%, in layers like inception_4e/3x3. This may indicate that the 3×3 path is unnecessary in that inception module.
Figure 5 shows the layer-wise performance of our direct sparse convolution design with the useful high sparsity obtained from GESL. We evaluate with the sparse matrices from multiple pruned AlexNet models with up to 3% top-1 accuracy drop. Since the performance of sparse matrix operations highly depends on the specific sparsity pattern, it is important not to evaluate with random sparse matrices. We use SGEMM as a proxy of dense convolution performance to quantify layer-wise speedups of direct sparse convolution. SGEMM is a good proxy because it has a long history of extensive optimizations, and it allows us not to depend on the quality of a specific dense convolution implementation. This is the same reason this paper focuses on layer-wise performance instead of overall end-to-end speedup: an end-to-end speedup may be measured relative to a baseline whose efficiency is suboptimal, with performance bottlenecks in other parts/layers of the code. For a more scientific comparison among different CNN speedup techniques, we recommend using the dense matrix multiplication (GEMM) FLOP/s of the evaluated platform as the baseline, because many platforms readily have vendor-provided, extensively optimized GEMM implementations that can be a proxy of a highly optimized dense CNN implementation. This also aligns with a long-accepted standard practice in the high performance computing community. We use batch sizes of 32, 144, and 272 for Atom, BDW, and KNL, multiples of the number of hardware threads in the respective platforms.
BDW achieves a 3.4× speedup with non-zero density $x = 0.09$, a sparsity similar to those of conv2-5 with no accuracy drop. The actual TF/s (as opposed to effective TF/s, which also counts FLOPs for zeros) is 0.76 when sparse convolution is sufficiently compute bound. This performance corresponds to about a third of SGEMM, from which we can derive the compute overhead of sparse convolution as $\alpha = 3$. As explained in Section 2.2, this leads to the lower bound of sparsity to get speedups at $x = 1/\alpha \approx 0.33$, which matches Figure 5(b). Atom, with a higher bandwidth-to-FLOP ratio, achieves a higher 7.3× speedup. The actual GF/s is 51, which is 1.2× lower than SGEMM performance (i.e. $\alpha = 1.2$). Note that the performance projection for conv5 in Figure 3 using the $\alpha$s derived here resembles the measured performance in Figure 5 (conv2-5 share similar performance characteristics). KNL achieves an impressive 13.9 effective TF/s (3.1× over SGEMM).
Table 2:
A: Lebedev & Lempitsky (2015), Wen et al. (2016)
B: Han et al. (2015), Han et al. (2016b), Liu et al. (2015), Guo et al. (2016), GESL
C: Denton et al. (2014), Jaderberg et al. (2014), Lebedev et al. (2015), Zhang et al. (2015), Kim et al. (2016), Ioannou et al. (2016), Tai et al. (2016), Denil et al. (2013)
Recent research has achieved great success in reducing model size and accelerating inference of CNNs while maintaining accuracy, exploring a large design space as shown in Table 2. Regularization-based and factorization-based approaches are the two main camps. Regularization-based approaches use a separate training step to discover and prune redundant parameters in a pre-trained model using various regularizations, including ridge (Han et al., 2015, 2016b), lasso, and group lasso (Liu et al., 2015; Wen et al., 2016), combined with thresholding. Factorization-based approaches use low-rank decomposition and can quickly produce compressed models without additional pruning steps. Both approaches can use a fine-tuning step to recover the accuracy loss caused by model pruning.
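As a rough illustration of the regularization-plus-thresholding idea, the sketch below shows the textbook proximal operators associated with lasso and group lasso, followed by a hard magnitude threshold. These are generic operators chosen for illustration, not the exact training procedures of GESL or of the cited works; the function names and the grouping of weights by rows are assumptions.

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the lasso (L1) penalty: shrinks each weight
    toward zero and zeroes those with magnitude below lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def group_soft_threshold(w, lam):
    """Proximal operator of the group lasso penalty.
    Each row of w is treated as one group (an assumption of this sketch);
    a whole group is zeroed when its L2 norm is below lam."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return w * scale

def prune_by_magnitude(w, threshold):
    """Hard thresholding: keep weights whose magnitude reaches the threshold."""
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.array([[3.0, 4.0], [0.1, 0.1]])
print(soft_threshold(w, 1.0))        # element-wise sparsity
print(group_soft_threshold(w, 1.0))  # second row removed as a group
print(prune_by_magnitude(w, 0.5))
```

The element-wise operator yields arbitrary sparsity patterns, whereas the group operator removes entire groups, which is why group regularization produces models that dense methods can still exploit.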
Work focusing on fully connected layers (Han et al., 2015, 2016b; Denil et al., 2013) has achieved 10–50× model size reduction for networks such as AlexNet (Krizhevsky et al., 2012). However, it achieved only marginal inference speedup because fully connected layers usually account for less than 10% of the total computation in modern CNNs. The work in groups A and C of Table 2 aims at speeding up inference by focusing more on convolution layers, with most of it relying on dense methods for computing convolution. While factorization-based approaches (group C) naturally obtain smaller models in dense format, regularization-based approaches (group A) need group regularization to impose group-wise sparsity. Although Liu et al. (2015) explore sparse methods for computing convolution layers, their approach incurs lowering overhead and hard-codes the non-zeros of the sparse matrix with full unrolling, which leads to a large instruction footprint.
While our direct sparse convolution is demonstrated to achieve high speedups on convolution when there is enough sparsity, factorization-based approaches can complement it. This is because the inherent sparsity in the first few convolution layers may not be high enough, while factorization-based approaches can achieve speedups there. Liu et al. (2015) also show that factorization-based and regularization-based approaches can be combined.
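A minimal sketch of the low-rank idea behind factorization-based approaches is given below, using a truncated SVD of a weight matrix. The cited works use a variety of matrix and tensor decompositions; the truncated SVD here is our own simplification for illustration, and `low_rank_factorize` is a hypothetical helper.

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Approximate an (m, n) weight matrix as a @ b with
    a of shape (m, rank) and b of shape (rank, n), via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
# Build a 64 x 96 matrix that is exactly rank 4, so the factorization is lossless.
w = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 96))
a, b = low_rank_factorize(w, rank=4)
assert np.allclose(a @ b, w)

dense_params = w.size              # m * n
factored_params = a.size + b.size  # rank * (m + n)
print(dense_params, factored_params)
```

Both factors remain dense, so the compressed layer can be computed with ordinary dense kernels regardless of how sparse the original weights are.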
Winograd (Lavin & Gray, 2015) and FFT-based algorithms (Vasilache et al., 2015) also aim to speed up convolution. While orthogonal, these techniques can have synergies with our direct sparse convolution. For example, FFT-based convolutions are more effective for large filters, which usually reside in the first few layers where sparsity is low. While this paper focuses on convolution layer performance, our technical report (Park et al., 2016) also considers optimizations for fully connected layers and sparsity in activations, which is also discussed in Han et al. (2016a).
Powerful CNNs are often quite compute demanding. Pruning as a post-processing step has been effective in drastically reducing the model size while boosting inference speed only moderately. We aim to more fully realize the potential performance benefits of the reduced FLOP counts resulting from pruned convolution kernels. By combining our high-performance direct sparse convolution method with a performance model, we developed a guided approach that prunes CNNs in a co-design fashion for different computer architectures and for different layers of the CNN in question. In particular, we demonstrated 3.1–7.3× convolution speedups in AlexNet on a variety of platforms, all in comparison to extensively optimized dense linear algebra operations.
Looking ahead, as this paper shows that pruning can boost inference speed significantly in addition to reducing model size, further pruning techniques should be explored. While our direct sparse convolution algorithm is successful, our performance model also reveals that sparse convolution cannot speed up all convolution layers, as seen from the 1×1 convolutions in GoogLeNet. We plan to expand our performance model to cover other FLOP-reduction methods such as FFT, Winograd, and tensor factorization, so that we can make informed decisions to choose the best-performing method for each layer and guide the training process accordingly.
We would like to thank Yiwen Guo, Anbang Yao, and Yurong Chen for sharing the dynamic network surgery source code and their insights. We would also like to thank Nitish Shirish Keskar for his recommendations on hyper-parameter settings.