1 Introduction
Deep learning with convolutional neural networks (CNNs) has recently achieved performance breakthroughs in many computer vision applications (LeCun et al., 2015). However, the large model size and huge computational complexity hinder the deployment of state-of-the-art CNNs on resource-limited platforms such as battery-powered mobile devices. Thus, it is of great interest to compress large CNNs into compact forms to lower their storage requirements and to reduce their computational costs (Sze et al., 2017; Cheng et al., 2018).
CNN size compression has been actively investigated for memory and storage size reduction. Han et al. (2016) showed impressive compression results by weight pruning, quantization using k-means clustering, and Huffman coding. This work has been followed by further analysis and mathematical optimization, and more efficient CNN compression schemes have been suggested afterwards, e.g., in Choi et al. (2017); Ullrich et al. (2017); Agustsson et al. (2017); Molchanov et al. (2017); Louizos et al. (2017); Choi et al. (2018); Dai et al. (2018). CNN computational complexity reduction has also been investigated. The major computational cost of CNNs comes from the multiply-accumulate (MAC) operations in their convolutional layers (Sze et al., 2017, Table II). There have been two directions to reduce the complexity of convolutions in CNNs:

First, instead of conventional spatial-domain convolution, it is suggested to use either frequency-domain convolution (Mathieu et al., 2013; Vasilache et al., 2014) or Winograd convolution (Lavin and Gray, 2016). In particular, for typical small filter sizes, Lavin and Gray (2016) showed that Winograd convolution is more efficient than both spatial-domain convolution and frequency-domain convolution.

Second, weight pruning is another approach to reduce the number of MACs required for convolution by skipping the MACs involving pruned (zero) weights. Previous work mostly focused on spatial-domain weight pruning, which enables sparse spatial-domain convolution of low complexity, e.g., see Han et al. (2015); Lebedev and Lempitsky (2016); Wen et al. (2016); Guo et al. (2016); Lin et al. (2017); Park et al. (2017). More recently, there have been some attempts to prune Winograd-domain weights and reduce the complexity of Winograd convolution (Li et al., 2017; Liu et al., 2018).
Previous works either focused on spatial-domain weight pruning and compression, or on Winograd-domain weight pruning and complexity reduction. To the best of our knowledge, compression of Winograd CNNs has never been addressed before. Another shortcoming of the previous works on complexity reduction of Winograd CNNs is that their final CNNs are no longer backward compatible with spatial-domain convolution, due to the non-invertibility of the Winograd transformation; hence, they suffer accuracy losses if they need to run on platforms that only support spatial-domain convolution. To our knowledge, this paper is the first to present a universal CNN pruning and compression framework for both Winograd and spatial-domain convolutions.
Our proposed solution is summarized in Figure 1. The main novelty of the proposed framework is that it optimizes CNNs such that their convolutional filters can be pruned either in the Winograd domain or in the spatial domain without accuracy losses and without extra training or fine-tuning in that domain. Our CNNs can be further optimized for and compressed by universal quantization and universal source coding such that their decompressed convolutional filters still have sparsity in both the Winograd and spatial domains. Hence, one universally compressed model can be deployed on any platform, whether it uses spatial-domain or Winograd convolution, and the sparsity of its convolutional filters can be utilized for complexity reduction in either domain, with no need for further training. Since many low-power platforms, such as mobile phones, are expected to support only the inference of CNNs, and not their training, our approach is desirable for wide-scale deployment of pre-trained models without worrying about the underlying inference engines.
2 Preliminary
2.1 Winograd convolution
We first review the Winograd convolution algorithm (Winograd, 1980) in this subsection. It is well known that spatial-domain convolution is equivalent to element-wise product in the frequency domain or in the Winograd domain (e.g., see Blahut (2010, Section 5)). In particular, the Winograd convolution algorithm is designed to compute a convolution with the minimum number of multiplications possible (Selesnick and Burrus, 1998, Section 8.4).
For the sake of illustration, suppose that we are given a two-dimensional (2D) input and a 2D filter of size $r \times r$ for convolution. To produce output patches of size $m \times m$, we first prepare a set of input patches of size $n \times n$, where $n = m + r - 1$, extracted from the input with stride $m$. Each of the patches is convolved with the filter by the Winograd convolution algorithm and produces an output patch of size $m \times m$. Finally, the output patches are combined into one output image.

Let $x$ and $y$ be one of the $n \times n$ input patches and its corresponding $m \times m$ output patch, respectively, and let $g$ be the filter. In Winograd convolution, the input and the filter are transformed into the Winograd domain by $B^T x B$ and $G g G^T$ using the Winograd transformation matrices $B$ and $G$, respectively, where the superscript $T$ denotes the matrix transpose. In the Winograd domain, both $B^T x B$ and $G g G^T$ are of size $n \times n$, and element-wise product of them follows. Then, the output is transformed back to the spatial domain using the matrix $A$ by

$$y = A^T \left[ (G g G^T) \odot (B^T x B) \right] A, \qquad (1)$$
where $\odot$ is the element-wise product of two matrices. The transformation matrices $A$, $B$, and $G$ are specific to $(m, r)$ and can be obtained from the Chinese remainder theorem (e.g., see Blahut (2010, Section 5.3)). In the case of $C$ input channels, the inverse transformation in (1) can be deployed once after summation over all channels of the element-wise product outputs in the Winograd domain (see Lavin and Gray (2016, Section 4)), i.e.,

$$y = A^T \left[ \sum_{c=1}^{C} (G g_c G^T) \odot (B^T x_c B) \right] A,$$

where $g_c$ and $x_c$ are the filter and input patch of channel $c$, respectively (see Lavin and Gray (2016)).
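As a concrete, minimal illustration of (1), the following sketch implements F(2×2, 3×3) Winograd convolution with the standard transformation matrices from Lavin and Gray (2016) and checks it against direct spatial-domain convolution; the function names are ours, for illustration only:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transformation matrices (Lavin & Gray, 2016).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(x, g):
    """One 2x2 output patch from a 4x4 input patch x and a 3x3 filter g."""
    U = G @ g @ G.T              # filter in the Winograd domain (4x4)
    V = BT @ x @ BT.T            # input patch in the Winograd domain (4x4)
    return AT @ (U * V) @ AT.T   # element-wise product, then inverse transform

def direct_2x2_3x3(x, g):
    """Reference: direct spatial-domain (cross-correlation) convolution."""
    y = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            y[i, j] = np.sum(x[i:i + 3, j:j + 3] * g)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_2x2_3x3(x, g), direct_2x2_3x3(x, g))
```

Note that the element-wise product uses 16 multiplications per patch instead of the 36 needed by direct convolution for the four output values.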
2.2 Sparse Winograd convolution
Similar to spatial-domain weight pruning for sparse spatial-domain convolution of low complexity, one can skip some of the computations in the Winograd domain by pruning (i.e., setting to zero) some of the Winograd-transformed filter weights (elements of $G g G^T$ in (1)) for sparse Winograd convolution. The work most related to this approach can be found in Li et al. (2017); Liu et al. (2018).
Pruning spatial-domain weights does not in general yield sparse Winograd-domain filters, since sparsity is not maintained after the transformation. Thus, Li et al. (2017) introduced new Winograd layers, which are similar to convolutional layers except that their learnable parameters are defined in the Winograd domain rather than in the spatial domain. In their framework, Winograd-domain weights are directly learned in training, where the loss and gradients are computed with Winograd layers. For Winograd-domain weight pruning, insignificant Winograd-domain weights are nullified in every training iteration based on their magnitude and gradient values. In Liu et al. (2018), the complexity of Winograd layers is further reduced by putting rectified linear units (ReLUs) in the Winograd domain and skipping MACs not only for zero weights but also for zero activations in the Winograd domain.
However, if we learn Winograd-domain weights directly using Winograd layers, the trained model has to use Winograd layers in inference as well. We cannot transform the learned Winograd-domain weights back to the spatial domain without considerable accuracy loss, since the inverse transformation from the Winograd domain to the spatial domain is overdetermined. Hence, the model is not deployable on platforms that only support classical spatial-domain convolution. Moreover, storing Winograd-domain weights is inefficient, since the number of weights is larger in the Winograd domain. Thus, we suggest that it is better to compress weights in the spatial domain even if the target computational platform only deploys Winograd convolution.
2.3 Universal compression
A universal CNN compression framework was proposed in Choi et al. (2018), where CNNs are optimized for and compressed by universal quantization and universal entropy source coding, with schemes such as variants of Lempel–Ziv–Welch (Ziv and Lempel, 1977, 1978; Welch, 1984) and the Burrows–Wheeler transform (Effros et al., 2002). Of particular interest for universal quantization is randomized uniform quantization, where uniform random dithering makes the distortion independent of the source, and the gap of its rate from the rate-distortion bound at any distortion level is provably no more than approximately 0.754 bits per sample for any source (Zamir and Feder, 1992). Universal CNN compression has practical advantages: it is easily applicable to any CNN model at any desired compression rate, without the extra burden required by previous approaches of computing or estimating the statistics of the CNN weights, and it is guaranteed to achieve near-optimal performance.
3 Training with joint sparsity constraints
In this section, we present our CNN training method with regularization under joint spatial-Winograd sparsity constraints, to enable efficient deployment of pre-trained CNNs in either domain, without additional training for deployment.
3.1 CNN model
We consider a typical CNN model consisting of $L$ convolutional layers. For layer $l$, the input has $C_l$ channels and the output has $D_l$ channels, where the input is convolved with $C_l D_l$ filters of size $r_l \times r_l$. For $1 \le l \le L$, $1 \le c \le C_l$, and $1 \le d \le D_l$, let $w_l^{(c,d)}$ be the 2D convolutional filter for input channel $c$ and output channel $d$ of layer $l$.
3.2 Regularization for jointly sparse convolutional filters
In this subsection, we introduce our Winograd-domain and spatial-domain partial L2 regularizers to attain convolutional filters that are sparse in both the Winograd domain and the spatial domain. We choose L2 regularizers to promote sparsity, although other regularizers such as L1 regularizers can be used instead (see Section 5 for more discussion). For notational simplicity, let $\mathbf{w}$ be the set of all learnable convolutional filters of the $L$ layers, i.e.,

$$\mathbf{w} = \left\{ w_l^{(c,d)} : 1 \le l \le L,\ 1 \le c \le C_l,\ 1 \le d \le D_l \right\}.$$

Moreover, given any matrix $X$ and threshold $\theta \ge 0$, we define $\mathbb{1}_{|X| \le \theta}$ as the matrix that has the same size as $X$, whose element is one if the corresponding element $x$ of $X$ satisfies $|x| \le \theta$ and is zero otherwise.
Winograd-domain partial L2 regularization: To optimize CNNs under a Winograd-domain sparsity constraint, we introduce the Winograd-domain partial L2 regularizer given by

$$R_{\text{WD}}(\mathbf{w}; \theta_{\text{WD}}) = \frac{1}{N_{\text{WD}}} \sum_{l=1}^{L} \sum_{c=1}^{C_l} \sum_{d=1}^{D_l} \left\| \mathbb{1}_{|G_l w_l^{(c,d)} G_l^T| \le \theta_{\text{WD}}} \odot \left( G_l w_l^{(c,d)} G_l^T \right) \right\|_2^2, \qquad (2)$$

where $\|\cdot\|_2$ denotes the L2 norm and $G_l$ is the Winograd transformation matrix determined by the filter size and the input patch size of layer $l$ for Winograd convolution (see Section 2.1); $N_{\text{WD}}$ is the total number of Winograd-domain weights of all layers. Although the constraints are on the Winograd-domain weights, they translate into constraints on the corresponding spatial-domain weights, and the optimization is done in the spatial domain; this facilitates the optimization for additional sparsity constraints in the spatial domain, as will be clarified below.
Observe that the L2 regularization in (2) is applied only to the part of the Winograd-domain weights whose magnitudes are not greater than the threshold $\theta_{\text{WD}}$. Due to the partial L2 regularization, spatial-domain weights are updated towards a direction that yields partially diminishing Winograd-domain weights after training and transformation into the Winograd domain. Given a desired sparsity level $s_{\text{WD}}$ (%) in the Winograd domain, we set the threshold $\theta_{\text{WD}}$ to be the $s_{\text{WD}}$-th percentile of the Winograd-domain weight magnitudes. The threshold is updated at every training iteration as the weights are updated. Note that the threshold decreases as training goes on, since the regularized Winograd-domain weights gradually converge to small magnitudes (see Section 3.3). After finishing the regularized training, we finally have a set of Winograd-domain weights clustered very near zero, which can be pruned (i.e., set to zero) at minimal accuracy loss.
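The percentile-based threshold selection can be sketched as follows; this is a simplified numpy illustration with our own function names, shown here on spatial-domain magnitudes (the same idea applies to the Winograd-transformed weights):

```python
import numpy as np

def sparsity_threshold(filters, s):
    """Return the s-th percentile of weight magnitudes over all filters, so
    that pruning magnitudes <= threshold yields roughly s% sparsity."""
    mags = np.abs(np.concatenate([f.ravel() for f in filters]))
    return np.percentile(mags, s)

rng = np.random.default_rng(0)
filters = [rng.standard_normal((64, 3, 3)) for _ in range(4)]
theta = sparsity_threshold(filters, 80.0)

all_mags = np.abs(np.concatenate([f.ravel() for f in filters]))
pruned_frac = np.mean(all_mags <= theta)
assert abs(pruned_frac - 0.80) < 0.01   # ~80% of weights fall below threshold
```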
Spatial-domain partial L2 regularization: To optimize CNNs while having sparsity in the spatial domain, we regularize the cost function by the partial sum of the L2 norms of spatial-domain weights, determined by a threshold $\theta_{\text{SD}}$ given a target sparsity level $s_{\text{SD}}$ (%), similar to (2), as below:

$$R_{\text{SD}}(\mathbf{w}; \theta_{\text{SD}}) = \frac{1}{N_{\text{SD}}} \sum_{l=1}^{L} \sum_{c=1}^{C_l} \sum_{d=1}^{D_l} \left\| \mathbb{1}_{|w_l^{(c,d)}| \le \theta_{\text{SD}}} \odot w_l^{(c,d)} \right\|_2^2, \qquad (3)$$

where $N_{\text{SD}}$ is the total number of spatial-domain weights of all layers.
3.3 Regularized training with learnable regularization coefficients
Combining the regularizers in (2) and (3), the cost function to minimize in training is given by

$$C(\mathcal{X}; \mathbf{w}) = E(\mathcal{X}; \mathbf{w}) + \lambda_{\text{WD}} R_{\text{WD}}(\mathbf{w}; \theta_{\text{WD}}) + \lambda_{\text{SD}} R_{\text{SD}}(\mathbf{w}; \theta_{\text{SD}}), \qquad (4)$$

where $\mathcal{X}$ is a training dataset and $E$ is the network loss function, such as the cross-entropy loss for classification or the mean-squared-error loss for regression. We emphasize that training is performed in the spatial domain with conventional spatial-domain convolution and we update the spatial-domain filters in $\mathbf{w}$, while the regularizers steer the filters to have a desired percentage of weights with small or near-zero values, either in the spatial domain or in the Winograd domain when transformed, which are safe to prune at little accuracy loss.

In (4), we introduce two regularization coefficients $\lambda_{\text{WD}}$ and $\lambda_{\text{SD}}$. Conventionally, one uses a fixed value for a regularization coefficient. However, we observe that using fixed regularization coefficients for the whole training is not efficient for finding good sparse models. For small fixed coefficients, regularization is weak and we cannot reach the desired sparsity after training. For large fixed coefficients, on the other hand, we can achieve the desired sparsity, but it likely comes with considerable performance loss due to strong regularization.
Learnable regularization coefficient: To overcome the problems with fixed regularization coefficients, we propose novel learnable regularization coefficients, i.e., we let the regularization coefficients be learnable parameters. Starting from a small initial coefficient value, we learn an accurate model with little regularization. As training goes on, we make the regularization coefficients increase gradually, so that the performance does not degrade much but we finally have sparse convolutional filters, in both the Winograd and spatial domains, at the end of training. Towards this end, we first replace $\lambda_{\text{WD}}$ and $\lambda_{\text{SD}}$ with $e^{\alpha}$ and $e^{\beta}$, respectively, and learn $\alpha$ and $\beta$ instead, to guarantee that the regularization coefficients are always positive in training. Moreover, we include an additional regularization term, e.g., $-\zeta(\alpha + \beta)$ for $\zeta > 0$, to penalize small regularization coefficients and encourage them to increase in training. The cost function in (4) is then altered into

$$C(\mathcal{X}; \mathbf{w}, \alpha, \beta) = E(\mathcal{X}; \mathbf{w}) + e^{\alpha} R_{\text{WD}}(\mathbf{w}; \theta_{\text{WD}}) + e^{\beta} R_{\text{SD}}(\mathbf{w}; \theta_{\text{SD}}) - \zeta(\alpha + \beta). \qquad (5)$$

Observe that we have introduced a new hyper-parameter $\zeta$ while making the regularization coefficients learnable. The trade-off between the loss and the regularization is now controlled by $\zeta$ instead of the regularization coefficients, which is beneficial since $\zeta$ is not directly tied to either the loss or the regularization, and we can induce a smooth transition to a sparse model.
L2 regularization for parameters corresponds to assuming a zero-mean Gaussian prior on the parameters (e.g., see Bishop (2006, Section 5.5)). The Winograd-domain partial L2 regularization can be interpreted as assuming a zero-mean Gaussian prior on the partial Winograd-domain weights within the threshold, and using the negative log-likelihood of the Gaussian prior as a regularization term. The regularization coefficient in (5) can be related to the variance of the Gaussian prior, i.e., the reciprocal of the variance corresponds to the regularization coefficient. In this Bayesian model, we can even consider the variance of the Gaussian prior as a random variable and find the optimal variance by learning, which leads to the learnable regularization coefficient idea with the penalty term in (5). A similar interpretation applies to the spatial-domain partial L2 regularization. Training with Gaussian priors has been considered in other contexts, e.g., a Gaussian mixture is used for weight quantization in Ullrich et al. (2017).

Gradient descent: From (5), we have
$$\frac{\partial C}{\partial \mathbf{w}} = \frac{\partial E}{\partial \mathbf{w}} + e^{\alpha} \frac{\partial R_{\text{WD}}}{\partial \mathbf{w}} + e^{\beta} \frac{\partial R_{\text{SD}}}{\partial \mathbf{w}}, \qquad (6)$$

where $\partial E / \partial \mathbf{w}$ is provided by the CNN backpropagation algorithm. It can be shown that

$$\frac{\partial R_{\text{WD}}}{\partial w_l^{(c,d)}} = \frac{2}{N_{\text{WD}}} G_l^T \left[ \mathbb{1}_{|G_l w_l^{(c,d)} G_l^T| \le \theta_{\text{WD}}} \odot \left( G_l w_l^{(c,d)} G_l^T \right) \right] G_l, \qquad (7)$$

$$\frac{\partial R_{\text{SD}}}{\partial w_l^{(c,d)}} = \frac{2}{N_{\text{SD}}}\, \mathbb{1}_{|w_l^{(c,d)}| \le \theta_{\text{SD}}} \odot w_l^{(c,d)}. \qquad (8)$$

The detailed proof of (7) can be found in the Appendix, while (8) is straightforward to show. We note that the indicator functions in (2) and (3) are non-differentiable, which is however not a problem in practice when computing the derivatives for stochastic gradient descent. Combining (6)–(8), we can perform gradient descent for the weights in $\mathbf{w}$. We update $\alpha$ and $\beta$ by gradient descent using

$$\alpha \leftarrow \alpha - \eta \left( e^{\alpha} R_{\text{WD}} - \zeta \right), \qquad \beta \leftarrow \beta - \eta \left( e^{\beta} R_{\text{SD}} - \zeta \right),$$

respectively, where $\eta$ is the learning rate. Observe that $e^{\alpha} R_{\text{WD}}$ tends to $\zeta$. This implies that as the regularizer $R_{\text{WD}}$ decreases, the regularization coefficient $e^{\alpha}$ gets larger. A larger regularization coefficient further encourages spatial-domain weights to move towards the direction where the regularized Winograd-domain weights converge to zero in the following update. In this way, we gradually sparsify the Winograd-domain filters. Similarly, the spatial-domain filters are sparsified owing to increasing $e^{\beta}$ and decreasing $R_{\text{SD}}$.
[Figure 2: Weight histograms of the AlexNet second convolutional layer in the Winograd domain (top) and the spatial domain (bottom) at (a) 0, (b) 100k, (c) 120k, and (d) 200k training iterations.]
Evolution of weight histogram: In Figure 2, we present how the weight histogram (distribution) of the AlexNet second convolutional layer evolves in the Winograd domain and in the spatial domain as training goes on due to the proposed partial L2 regularizers with the learnable regularization coefficients. Observe that a part of the weights converges to zero in both domains. Finally, we have a peak at zero, which can be pruned at little accuracy loss, in each domain.
[Figure 3: Examples of pruned convolutional filters in the Winograd domain (top) and the spatial domain (bottom).]
Examples of pruned filters: In Figure 3, we present convolutional filter samples that are sparse after pruning either in the Winograd domain or in the spatial domain, obtained by our regularization method for different sparsity levels. The AlexNet second convolutional layer consists of 5×5 filters, and we assume the use of the Winograd convolution described in Section 2.1.
As observed above, we have presented our algorithms using L2 regularizers. Often L1 norms are used to promote sparsity (e.g., see Chen et al. (2001)), but here we suggest using L2 instead, since our goal is to induce small-value weights rather than to drive them to exactly zero. The model retrained with our L2 regularizers is still dense, not sparse, before pruning. However, it is jointly regularized to have many small-value weights, which can be pruned at negligible loss, in both domains. The sparsity is actually attained only after pruning the small-value weights in either domain, as shown in Figure 4. This is to avoid the fundamental limit of joint sparsity, similar to the uncertainty principle of the Fourier transform (Donoho and Stark, 1989). See Section 5 for more discussion.

4 Universal compression and dual-domain deployment
We compress the jointly sparse CNN model from Section 3.3 by universal compression in the spatial domain for universal deployment. Universal compression consists of the following three steps, as illustrated in Figure 1.
Universal quantization and pruning: First, we randomize the spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with interval $\Delta$ by

$$q_i = \Delta \cdot \operatorname{round}\!\left( \frac{w_i + u_i}{\Delta} \right), \qquad (9)$$

where $w_i$, $1 \le i \le N$, are the individual spatial-domain weights of all layers, and $u_i$ are independent and identically distributed uniform random dithers with support $[-\Delta/2, \Delta/2]$; the rounding function satisfies $\operatorname{round}(x) = \lfloor x + 1/2 \rfloor$, where $\lfloor x \rfloor$ is the largest integer smaller than or equal to $x$. The weights rounded to zero in (9) are pruned and fixed to zero for the rest of the fine-tuning and compression steps. The random dithering values, or their random seed, are assumed to be known at deployment, and the dithers are cancelled for the unpruned weights after decompression by

$$\bar{w}_i = q_i - u_i, \qquad (10)$$

where $\bar{w}_i$ is the final deployed value of weight $w_i$ for inference. If $u_i = 0$ (no dithering), this simply reduces to uniform quantization.
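A minimal numpy sketch of the quantization in (9) and dither cancellation in (10) follows (function names are ours; numpy's `round` uses round-half-to-even rather than ⌊x + 1/2⌋, which does not affect the half-cell error bound):

```python
import numpy as np

def universal_quantize(w, delta, rng):
    """Eq. (9): add i.i.d. uniform dither, round to the nearest multiple of delta."""
    u = rng.uniform(-delta / 2, delta / 2, size=w.shape)
    q = delta * np.round((w + u) / delta)
    return q, u

def cancel_dither(q, u):
    """Eq. (10): cancel the dither for unpruned weights; weights quantized
    to zero are pruned and stay exactly zero."""
    wbar = q - u
    wbar[q == 0] = 0.0
    return wbar

rng = np.random.default_rng(0)
w = 0.05 * rng.standard_normal(10000)
delta = 0.02
q, u = universal_quantize(w, delta, rng)
wbar = cancel_dither(q, u)
# Unpruned weights are reconstructed to within half a quantization cell.
assert np.max(np.abs((wbar - w)[q != 0])) <= delta / 2 + 1e-9
```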
Fine-tuning the uniform codebook: Second, we fine-tune the uniform codebook to compensate for the accuracy loss from quantization. The average gradient is computed over the unpruned weights that are quantized to the same value in (9). Then, their shared quantized value in the codebook is updated by gradient descent using this average gradient, i.e.,

$$c_n^{(t)} = c_n^{(t-1)} - \eta \frac{1}{|\mathcal{I}_n|} \sum_{i \in \mathcal{I}_n} \frac{\partial E}{\partial w_i}, \qquad (11)$$

where $t$ is the iteration time, $\eta$ is the learning rate, $c_n$ is the shared quantized value of codebook entry $n$, and $\mathcal{I}_n$ is the index set of all weights that are quantized to the same value $n\Delta$ in (9), for some non-zero integer $n$. After the codebook is updated, the individual weights are updated by following their shared quantized value in the codebook, i.e., $q_i = c_n^{(t)}$ for $i \in \mathcal{I}_n$.

We emphasize that the pruned weights in (9) are not fine-tuned and stay zero. We do not include the spatial-domain regularizer in (11), since this step follows the joint sparsity optimization, as shown in Figure 1; we determine which spatial-domain weights to prune in (9) and fix them to zero. However, to maintain the sparsity in the Winograd domain while optimizing the quantization codebook in the spatial domain, we keep the Winograd-domain regularizer, i.e., we use the cost function $E(\mathcal{X}; \mathbf{w}) + e^{\alpha} R_{\text{WD}}(\mathbf{w}; \theta_{\text{WD}})$ in this fine-tuning step.
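The codebook bookkeeping of (11) can be sketched as follows; the gradients here are random stand-ins for ∂E/∂w from backpropagation, and dithering is omitted, so this illustrates only the update mechanics, not actual fine-tuning:

```python
import numpy as np

delta, lr = 0.02, 0.01
rng = np.random.default_rng(0)
w = 0.05 * rng.standard_normal(1000)
q = delta * np.round(w / delta)           # quantized weights (dither omitted)
idx = np.round(q / delta).astype(int)     # codebook index of each weight
grads = 0.1 * rng.standard_normal(1000)   # stand-in for dE/dw from backprop

# Update each shared value by the average gradient of its members;
# index 0 corresponds to pruned weights and is left untouched.
codebook = {}
for n in np.unique(idx):
    if n == 0:
        continue
    codebook[n] = n * delta - lr * grads[idx == n].mean()

# Individual weights follow their shared quantized value in the codebook.
w_ft = np.zeros_like(q)
for n, v in codebook.items():
    w_ft[idx == n] = v
assert np.all(w_ft[idx == 0] == 0.0)      # pruned weights stay zero
```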
Universal lossless source coding: Finally, universal lossless source coding follows for compression. The encoder and the decoder are assumed to share the information on the random dithers, e.g., through a compression protocol that sends the random seed. The indexes in the codebook of the universally quantized weights are passed as an input stream to a universal entropy source coding scheme, such as Lempel–Ziv–Welch (Ziv and Lempel, 1977, 1978; Welch, 1984), gzip (Gailly and Adler, 2003), or bzip2 (Seward, 1998), which uses the Burrows–Wheeler transform (Effros et al., 2002), producing a compressed stream. We also deploy the codebook that contains the indexes and the corresponding fine-tuned shared quantized values for decompression.
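For the source coding step, Python's `bz2` module can stand in for the bzip2 coder; the sketch below compresses a stream of quantization indexes (uniform quantization without dither, for simplicity):

```python
import bz2

import numpy as np

rng = np.random.default_rng(0)
w = 0.05 * rng.standard_normal(100000)
delta = 0.02
idx = np.round(w / delta).astype(np.int8)   # codebook indexes, small range

raw = idx.tobytes()
compressed = bz2.compress(raw)
assert bz2.decompress(compressed) == raw    # lossless round trip
assert len(compressed) < len(raw)           # peaked index distribution compresses
```

Because the index distribution is heavily concentrated around zero, its entropy is far below 8 bits per symbol, which is what the entropy coder exploits.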
Deployment: At deployment, the compressed stream is decompressed, and the random dithers are cancelled to obtain the unpruned spatial-domain weights as in (10). Then, the CNN can be deployed in the spatial domain with the desired sparsity. If we deploy the CNN in the Winograd domain, its convolutional filters are transformed into the Winograd domain and pruned to the desired sparsity level (see deployment in Figure 1).
5 Discussion
5.1 Joint sparsity
The uncertainty principle for the Fourier transform establishes the fundamental limit of the joint sparsity of a signal in the time and frequency domains, e.g., see Donoho and Stark (1989, Theorem 1 and Corollary 1) and Eldar (2015, Section 11.3.4). The Winograd transform was originally proposed as a method to calculate the discrete Fourier transform (DFT) efficiently, by reordering the input such that the DFT can be implemented as cyclic convolutions (Winograd, 1978). However, reordering the input sequence by the Winograd transform does not transform it to the frequency domain. Hence, one may not directly apply the time-frequency uncertainty principles to the Winograd transform. Indeed, some sparse spatial-domain filters retain a considerable number of zero elements after the Winograd transformation, while they become dense under the Fourier transform.
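An illustrative example (ours, not the paper's original one): a 3×3 filter with a single non-zero tap keeps several exact zeros after the F(2×2, 3×3) Winograd filter transform, while its zero-padded 2D DFT is fully dense:

```python
import numpy as np

# Filter transform for F(2x2, 3x3) Winograd convolution (Lavin & Gray, 2016).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

g = np.zeros((3, 3))
g[0, 0] = 1.0                        # sparse filter: one non-zero tap

W = G @ g @ G.T                      # 4x4 Winograd-domain filter
F = np.fft.fft2(g, s=(4, 4))         # zero-padded 2D DFT of the same filter

assert np.sum(W == 0) == 7           # several exact zeros survive the transform
assert np.all(np.abs(F) > 0)         # the DFT is dense: every bin is non-zero
```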
The fundamental limit of joint sparsity explains why we use L2 instead of L1 for joint sparsity regularization. Here, we need to clarify that the model retrained with our Winograd-domain and spatial-domain L2 regularizers is still dense, not sparse, before pruning. However, it is jointly regularized to have many small-value weights, which can be pruned at negligible loss, in both domains. In other words, our regularized model is not simultaneously sparse in both domains. The sparsity is actually attained only after pruning its small-value weights in either domain, i.e., in either the spatial or the Winograd domain (see Figure 4).
We further considered the compression of jointly sparse models for universal deployment. In this case, we make the model actually sparse in the spatial domain by pruning small-value weights in the spatial domain. Then, we quantize the model in the spatial domain for compression (see Figure 1). The resulting quantized model is sparse in the spatial domain, but it becomes dense in the Winograd domain. Thus, in order to recover the sparsity in the Winograd domain as much as possible, we fine-tune the spatial-domain quantization codebook with the Winograd-domain L2 regularizer (see Section 4) and induce small-value Winograd-domain weights that can be pruned at small loss.
[Figure 5: Compression ratio versus top-1 accuracy for compressed AlexNet models on ImageNet classification.]
Figure 5 shows the compression ratio versus top-1 accuracy for compressed AlexNet models on ImageNet classification. The models are pruned, quantized, fine-tuned, and compressed, as explained above, in the spatial domain at the same pruning ratio but for different quantization cell sizes (the larger the cell size, the higher the compression ratio). In the Winograd domain, we decompressed them and applied different pruning ratios to evaluate the accuracy at different sparsity levels. Observe that the accuracy degrades as the pruning ratio increases in the Winograd domain.
Finally, we collect statistics on the number of non-zero elements in each filter, after pruning in the spatial domain and in the Winograd domain, respectively, for our ResNet-18 model targeting 80% joint sparsity. In Figure 6, observe that 60% of the filters are all-zero and can be filter-pruned, but we still have a considerable number of sparse filters with non-zero elements, which come from our regularization method and contribute an additional 20% sparsity.
5.2 Partial L2 regularization for sparsity
We compare the performance of our partial L2 regularization method to the conventional L1 and elastic net (Zou and Hastie, 2005) regularization methods for sparsity. Figure 7 shows the results from our experiments, where our partial L2 regularizer empirically outperformed the others. It remains future work to test more recent relaxations of sparsity constraints, such as the support norm (Argyriou et al., 2012).
6 Experiments
6.1 ResNet-18 for ImageNet classification
We evaluate our universal CNN pruning and compression scheme on the ResNet-18 model of He et al. (2016) for the ImageNet ILSVRC 2012 dataset (Russakovsky et al., 2015). As in Liu et al. (2018), we modify the original ResNet-18 model by replacing its strided convolutional layers with convolutional layers of stride 1 followed by max-pooling layers, in order to utilize Winograd convolution for all possible convolutional layers. One difference from Liu et al. (2018) is that we place max-pooling after convolution (Conv+Maxpool) instead of before convolution (Maxpool+Conv). Our modification provides better accuracy (see Figure 8), although it comes with more computations.

Regularization (sparsity)  Inference domain  Pruning ratio  Top-1 / Top-5 accuracy  # MACs per image
Pretrained model  SD    68.2 / 88.6  2347.1M 
SD (80%)  SD  80%  67.8 / 88.4  837.9M 
WD (80%)  SD  80%  44.0 / 70.5  819.7M 
WD+SD (80%)  SD  80%  67.8 / 88.5  914.9M 
Pretrained model  WD    68.2 / 88.6  1174.0M 
SD (80%)  WD  80%  56.9 / 80.7  467.0M 
WD (80%)  WD  80%  68.4 / 88.6  461.9M 
WD+SD (80%)  WD  80%  67.8 / 88.5  522.6M 
The Winograd-domain regularizer is applied to all convolutional filters for which Winograd convolution can be used, following the Winograd convolution setting of Section 2.1. The spatial-domain regularizer is applied to all convolutional and fully-connected layers, not only for pruning but also for later compression in the spatial domain. We train with the Adam optimizer (Kingma and Ba, 2014), setting $\zeta$ in (5) to a fixed value; the initial values for $\alpha$ and $\beta$ are set equal, and they are also updated using the Adam optimizer.
We follow the definition of the compression ratio from Han et al. (2016): the ratio of the original model size (without entropy coding or zipping) to the compressed model size (pruned, quantized, and entropy-coded or zipped); we used bzip2 (Seward, 1998) as our entropy coding scheme after quantization, instead of the Huffman coding used in Han et al. (2016). Many previous DNN compression papers follow this definition, e.g., see Choi et al. (2017); Ullrich et al. (2017); Agustsson et al. (2017); Louizos et al. (2017); Choi et al. (2018), and we use the same one to be consistent with them. We also compare the pruning ratio and the number of MACs in our tables, which are not impacted by coding or zipping. Top-k accuracy is the percentage of images whose ground-truth label is among the k highest-confidence predictions of the network (see Russakovsky et al. (2015, Section 4.1)).
Table 1 summarizes the average pruning ratio, the accuracy, and the number of MACs to process one input image for the pruned ResNet-18 models. The number of MACs for Winograd convolution is counted following Lavin and Gray (2016, Section 5). We compare three models obtained by spatial-domain regularization only (SD), Winograd-domain regularization only (WD), and both Winograd-domain and spatial-domain regularization (WD+SD). The accuracy is evaluated using (1) spatial-domain convolution and (2) Winograd convolution for the convolutional layers (we used https://github.com/IntelLabs/SkimCaffe (Li et al., 2017) for Winograd convolution in the accuracy evaluation). In case (2), the filters are transformed into the Winograd domain and pruned to the desired ratio.
As expected, the proposed regularization method produces the desired sparsity only in the regularized domain; if we prune weights in the other domain, we suffer considerable accuracy loss. Using both Winograd-domain and spatial-domain regularizers, we can produce one model that is sparse and accurate in both domains, substantially reducing the number of MACs when using sparse convolution in either the spatial or the Winograd domain, at an accuracy loss of less than 0.5%.
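To give a sense of the arithmetic behind the MAC counts (ignoring transform overhead and sparsity), the multiplication counts of direct and F(2×2, 3×3) Winograd convolution for one layer can be compared as follows; the layer dimensions are hypothetical and the function names are ours:

```python
def direct_macs(C, K, H, W, r=3):
    """Multiplications for direct convolution: C in-channels, K out-channels,
    H x W output, r x r filter."""
    return C * K * H * W * r * r

def winograd_macs(C, K, H, W, m=2, r=3):
    """Element-wise multiplications for F(m x m, r x r) Winograd convolution
    (transform costs ignored)."""
    n = m + r - 1                    # Winograd-domain tile size
    tiles = (H // m) * (W // m)      # output tiles per channel pair
    return C * K * tiles * n * n

# Hypothetical ResNet-style layer: 64 -> 64 channels, 56 x 56 output.
# Per 2x2 output tile: 36 direct multiplications vs 16 in the Winograd domain.
assert direct_macs(64, 64, 56, 56) / winograd_macs(64, 64, 56, 56) == 2.25
```

Pruning in either domain then skips a further fraction of these products, which is where the sparsity ratios in Table 1 translate into MAC savings.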
We compare the accuracy of our pruned ResNet-18 models to those from Liu et al. (2018) in Figure 8. Observe that our models outperform those from Liu et al. (2018) in the Winograd domain. We emphasize that the major advantage of our scheme is that it produces one model that can use either sparse spatial-domain convolution or sparse Winograd convolution. In contrast, the models from Liu et al. (2018) are constrained to use their special Winograd-ReLU layers, even though those layers can additionally exploit the dynamic sparsity of ReLU activations in the Winograd domain, as explained in Section 2.2.

In Figure 9, we present the layer-by-layer sparsity of our jointly sparse ResNet-18 model, obtained by the regularization of WD+SD (80%) in Table 1. Observe that the Winograd-domain sparsity is provided only for the convolutional layers where Winograd convolution can be used, while the spatial-domain sparsity is given for all layers. We have different pruning ratios by layer, since we use one threshold value for pruning all layers (see Section 3.2), in contrast to Liu et al. (2018), where all Winograd convolutional layers are pruned at the same ratio in the Winograd domain. Allowing different pruning ratios by layer can be beneficial, since some layers can be more important than others and we may want to prune less in the important layers.
| Regularization (sparsity %) | Quantization (cell size) | Compression ratio | Inference domain | Top-1 / Top-5 accuracy (%) | # MACs per image |
|---|---|---|---|---|---|
| Pretrained model | – | – | SD | 68.2 / 88.6 | 2347.1M |
| WD+SD (80%) | UQ (0.005) | 24.2 | SD | 67.4 / 88.2 | 888.6M |
| | UQ (0.010) | 28.9 | SD | 66.9 / 87.9 | 859.0M |
| | UQ (0.020) | 38.4 | SD | 63.7 / 86.0 | 749.6M |
| | DUQ (0.005) | 23.8 | SD | 67.5 / 88.2 | 886.5M |
| | DUQ (0.010) | 28.7 | SD | 66.8 / 87.8 | 848.1M |
| | DUQ (0.020) | 38.6 | SD | 60.0 / 83.5 | 708.8M |
| Pretrained model | – | – | WD | 68.2 / 88.6 | 1174.0M |
| WD+SD (80%) | UQ (0.005) | 24.2 | WD | 67.4 / 88.2 | 516.4M |
| | UQ (0.010) | 28.9 | WD | 66.9 / 87.9 | 516.5M |
| | UQ (0.020) | 38.4 | WD | 63.7 / 86.0 | 495.1M |
| | DUQ (0.005) | 23.8 | WD | 67.4 / 88.3 | 516.3M |
| | DUQ (0.010) | 28.7 | WD | 66.6 / 87.7 | 512.9M |
| | DUQ (0.020) | 38.6 | WD | 60.0 / 83.5 | 502.5M |
Table 2 shows our universal CNN quantization and compression results for the ResNet-18 model. We take the model obtained by the regularization of WD+SD (80%) in Table 1 and compress its weights as described in Section 4. We compare uniform quantization (UQ) and dithered uniform quantization (DUQ), and we use bzip2 (Seward, 1998) for universal source coding. The results show that we can achieve roughly 24× compression at an accuracy loss of less than 1% with both UQ and DUQ.
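The quantize-then-compress pipeline compared in Table 2 can be sketched as follows. This is a simplified illustration of UQ, of DUQ with a shared pseudo-random dither (Zamir and Feder, 1992), and of bzip2 as the universal source coder; the weight statistics and cell size are illustrative, not our exact implementation:

```python
import bz2
import numpy as np

def uniform_quantize(w, delta):
    """Plain uniform quantization (UQ) with cell size delta."""
    return np.round(w / delta).astype(np.int32)

def dithered_uniform_quantize(w, delta, rng):
    """Dithered uniform quantization (DUQ): a shared pseudo-random dither
    u ~ Uniform(-delta/2, delta/2) is subtracted before quantization and
    added back after dequantization."""
    u = rng.uniform(-delta / 2, delta / 2, size=w.shape)
    q = np.round((w - u) / delta).astype(np.int32)
    return q, u

def dequantize(q, delta, u=None):
    w_hat = q.astype(np.float64) * delta
    return w_hat if u is None else w_hat + u

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=10_000)   # stand-in for trained weights
delta = 0.005                            # quantization cell size

q_uq = uniform_quantize(w, delta)
q_duq, u = dithered_uniform_quantize(w, delta, rng)

# Universal source coding of the quantized indices with bzip2.
raw = q_uq.astype(np.int16).tobytes()
compressed = bz2.compress(raw)
ratio = len(raw) / len(compressed)

# Both quantizers bound the per-weight reconstruction error by delta/2.
err_uq = np.max(np.abs(dequantize(q_uq, delta) - w))
err_duq = np.max(np.abs(dequantize(q_duq, delta, u) - w))
```

Since the decoder regenerates the same pseudo-random dither from a shared seed, DUQ needs no side information beyond the coded indices.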
6.2 AlexNet for ImageNet classification
We perform similar pruning and compression experiments for AlexNet (Krizhevsky et al., 2012). The AlexNet model has a huge number of weights in its fully-connected (FC) layers (58.6M out of 61M in total), so we first prune roughly 90% of the spatial-domain weights, mostly from the FC layers, by incremental pruning as suggested in Han et al. (2015).
We then retrain the pruned AlexNet model, similarly to the ResNet-18 case above. In particular, we apply the proposed regularizers only to the second through fifth convolutional layers (Conv2–Conv5), whose filters are small (5×5 and 3×3), and we use Winograd convolution for both filter sizes. The first convolutional layer (Conv1) is excluded since its 11×11 filters are too large for Winograd convolution to be efficient.
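The efficiency of Winograd convolution for small filters can be seen in the 1-D F(2, 3) algorithm, the building block behind the 2-D variants used for small filters: two outputs of a 3-tap filter are computed with 4 multiplications instead of 6. The sketch below uses the standard F(2, 3) transform matrices from Lavin and Gray (2016), not necessarily the exact configuration of our experiments:

```python
import numpy as np

# Transform matrices for 1-D Winograd convolution F(2, 3).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """y = A^T [(G g) o (B^T d)]. Pruning the 4 Winograd-domain weights G g
    skips the corresponding elementwise multiplications."""
    return AT @ ((G @ g) * (BT @ d))

def direct_corr(d, g):
    """Direct correlation: y[i] = sum_j g[j] * d[i + j]."""
    return np.array([g @ d[i:i + 3] for i in range(2)])

rng = np.random.default_rng(0)
d, g = rng.normal(size=4), rng.normal(size=3)
# winograd_f23(d, g) matches direct_corr(d, g) up to floating-point error.
```

The elementwise product in the middle is exactly where Winograd-domain sparsity pays off: a zero in G g removes one multiplication per output tile.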
| Regularization (sparsity %) | Quantization (cell size) | Compression ratio | Inference domain* | Top-1 / Top-5 accuracy (%) | # MACs per image |
|---|---|---|---|---|---|
| Pretrained model | – | – | SD | 56.8 / 80.0 | 724.4M |
| WD+SD (70%) | UQ (0.005) | 40.7 | SD | 56.4 / 79.7 | 253.7M |
| | UQ (0.010) | 47.5 | SD | 56.0 / 79.5 | 237.1M |
| | UQ (0.020) | 62.8 | SD | 54.3 / 78.0 | 211.3M |
| | DUQ (0.005) | 40.7 | SD | 56.4 / 79.7 | 256.1M |
| | DUQ (0.010) | 47.7 | SD | 56.1 / 79.3 | 240.0M |
| | DUQ (0.020) | 65.0 | SD | 52.8 / 77.1 | 213.5M |
| Han et al. (2016) | – | 35.0 | SD | 57.2 / 80.3 | 301.1M |
| Guo et al. (2016) | – | N/A | SD | 56.9 / 80.0 | 254.2M |
| Pretrained model | – | – | WD | 56.8 / 80.0 | 330.0M |
| WD+SD (70%) | UQ (0.005) | 40.7 | WD | 56.4 / 79.7 | 146.2M |
| | UQ (0.010) | 47.5 | WD | 56.0 / 79.5 | 144.2M |
| | UQ (0.020) | 62.8 | WD | 54.3 / 78.0 | 134.9M |
| | DUQ (0.005) | 40.7 | WD | 56.4 / 79.7 | 145.7M |
| | DUQ (0.010) | 47.7 | WD | 56.0 / 79.3 | 142.6M |
| | DUQ (0.020) | 65.0 | WD | 52.8 / 77.0 | 132.6M |
| Li et al. (2017) | – | N/A | WD | 57.3 / N/A | 319.8M |

\* Winograd convolution is used for Conv2–Conv5 in WD inference.
| Regularization (sparsity %) | Quantization (cell size) | Inference domain* | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 | FC1 | FC2 | FC3 |
|---|---|---|---|---|---|---|---|---|---|---|
| WD+SD (70%) | UQ (0.005) | SD | 15.7 | 62.9 | 81.2 | 75.2 | 71.9 | 93.2 | 92.1 | 80.2 |
| | UQ (0.010) | SD | 17.2 | 68.3 | 81.9 | 76.1 | 72.7 | 93.5 | 92.2 | 80.4 |
| | UQ (0.020) | SD | 25.4 | 73.8 | 83.0 | 77.5 | 73.9 | 94.9 | 93.1 | 81.7 |
| | DUQ (0.005) | SD | 15.8 | 62.2 | 80.9 | 74.8 | 71.6 | 93.2 | 92.1 | 80.2 |
| | DUQ (0.010) | SD | 18.3 | 66.8 | 81.7 | 75.8 | 72.4 | 93.7 | 92.3 | 80.6 |
| | DUQ (0.020) | SD | 26.8 | 71.9 | 83.1 | 77.5 | 73.9 | 95.4 | 93.6 | 82.6 |
| Han et al. (2016) | – | SD | 16.0 | 62.0 | 65.0 | 63.0 | 63.0 | 91.0 | 91.0 | 75.0 |
| Guo et al. (2016) | – | SD | 46.2 | 59.4 | 71.0 | 67.7 | 67.5 | 96.3 | 93.4 | 95.4 |
| WD+SD (70%) | UQ (0.005) | WD | 15.7 | 43.6 | 72.0 | 63.7 | 62.5 | 93.2 | 92.1 | 80.2 |
| | UQ (0.010) | WD | 17.2 | 43.9 | 72.0 | 63.7 | 62.4 | 93.5 | 92.2 | 80.4 |
| | UQ (0.020) | WD | 25.4 | 45.2 | 72.0 | 63.6 | 62.1 | 94.9 | 93.1 | 81.7 |
| | DUQ (0.005) | WD | 15.8 | 47.4 | 71.7 | 63.3 | 62.0 | 93.2 | 92.1 | 80.2 |
| | DUQ (0.010) | WD | 18.3 | 47.4 | 71.7 | 63.3 | 62.0 | 93.7 | 92.3 | 80.6 |
| | DUQ (0.020) | WD | 26.8 | 45.7 | 71.9 | 63.5 | 62.0 | 95.4 | 93.6 | 82.6 |
| Li et al. (2017) | – | WD | 0.0 | 90.6 | 95.8 | 94.3 | 93.9 | 0.0 | 0.0 | 0.0 |

Columns Conv1–FC3 report the per-layer sparsity (%). \* Winograd convolution is used for Conv2–Conv5 in WD inference.
In Table 3 and Table 4, we provide the compression ratio, the per-layer sparsity, the accuracy, and the number of MACs needed to process one input image for the compressed AlexNet models. We compare our results to Han et al. (2016) and Guo et al. (2016) in the spatial domain, and to Li et al. (2017) in the Winograd domain. We note that Han et al. (2016), Guo et al. (2016), and Li et al. (2017) produce sparse models in one domain only. The AlexNet model in Guo et al. (2016) has more pruning in the FC layers and less pruning in the Conv layers than ours. Hence, the overall pruning ratio is larger in Guo et al. (2016), since the FC layers are dominant in the number of parameters, but our model achieves a larger reduction in computational cost, since the Conv layers are dominant in the number of MACs. Furthermore, our model can even be sparse in the Winograd domain. Compared to Li et al. (2017), our method yields less pruning in the Conv2–Conv5 layers in the Winograd domain, but we also prune the Conv1 and FC layers heavily in the spatial domain. Observe that the number of MACs is substantially reduced when using either sparse spatial-domain convolution or sparse Winograd convolution, at an accuracy loss of less than 1%. The results also show that we can achieve more than 40× compression.
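The trade-off just described, where Guo et al. (2016) remove more parameters while our model removes more MACs, follows directly from how MACs are counted. A minimal sketch with illustrative (hypothetical) layer dimensions:

```python
def conv_macs(h_out, w_out, c_in, c_out, k, sparsity=0.0):
    """MACs of a k x k convolution producing an h_out x w_out x c_out output
    from c_in input channels; sparse convolution skips the MACs of pruned
    (zero) weights."""
    return int(h_out * w_out * c_out * c_in * k * k * (1.0 - sparsity))

def fc_macs(c_in, c_out, sparsity=0.0):
    """A fully-connected layer has c_in * c_out weights and as many MACs."""
    return int(c_in * c_out * (1.0 - sparsity))

# Each conv weight is reused at every output position (56 * 56 times here),
# so Conv layers dominate the MAC count even when FC layers dominate the
# parameter count.
dense = conv_macs(56, 56, 64, 64, 3)                  # MACs from only 36,864 weights
fc = fc_macs(4096, 4096)                              # MACs equal to the weight count
pruned = conv_macs(56, 56, 64, 64, 3, sparsity=0.8)   # 80% of conv MACs skipped
```

Pruning 80% of this conv layer removes roughly as many MACs as pruning several entire FC layers would, which is why our Conv-focused sparsity wins on computation despite a smaller overall pruning ratio.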
6.3 CT-SRCNN for image super-resolution
| Regularization (sparsity %) | Quantization (cell size) | Compression ratio | Inference domain | PSNR (dB) | SSIM | # MACs per image |
|---|---|---|---|---|---|---|
| Pretrained model | – | – | SD | 29.70 | 0.8301 | 233.2G |
| WD+SD (90%) | UQ (0.005) | 27.2 | SD | 29.39 | 0.8236 | 21.1G |
| | UQ (0.010) | 30.5 | SD | 29.38 | 0.8237 | 19.7G |
| | UQ (0.020) | 35.4 | SD | 29.32 | 0.8225 | 17.4G |
| | DUQ (0.005) | 27.1 | SD | 29.38 | 0.8234 | 21.1G |
| | DUQ (0.010) | 30.3 | SD | 29.37 | 0.8233 | 19.8G |
| | DUQ (0.020) | 34.8 | SD | 29.30 | 0.8222 | 18.0G |
| Pretrained model | – | – | WD | 29.70 | 0.8301 | 56.7G |
| WD+SD (90%) | UQ (0.005) | 27.2 | WD | 29.38 | 0.8235 | 10.7G |
| | UQ (0.010) | 30.5 | WD | 29.38 | 0.8237 | 10.3G |
| | UQ (0.020) | 35.4 | WD | 29.32 | 0.8225 | 9.9G |
| | DUQ (0.005) | 27.1 | WD | 29.37 | 0.8232 | 10.7G |
| | DUQ (0.010) | 30.3 | WD | 29.37 | 0.8233 | 10.3G |
| | DUQ (0.020) | 34.8 | WD | 29.31 | 0.8222 | 10.0G |
Finally, we evaluate the proposed scheme on the cascade-trained SRCNN (CT-SRCNN) model with 9 convolutional layers (Ren et al., 2018). We apply the Winograd-domain regularizer to the small filters of the second through the last layers, using Winograd convolution for them; the filters of the first layer are excluded. The spatial-domain regularizer is applied to all 9 layers.
The average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the compressed CT-SRCNN models are compared on the Set14 dataset (Zeyde et al., 2010) in Table 5. We also summarize in Table 5 the compression ratio and the number of MACs needed for the super resolution of one high-definition image. Observe that we achieve roughly 27× compression at a PSNR loss of about 0.3 dB, and that the number of MACs is substantially reduced when using either sparse spatial-domain convolution or sparse Winograd convolution.
7 Conclusion
We introduced a framework for hardware- and software-platform-independent pruning and compression of CNNs. The proposed scheme produces one compressed model whose convolutional filters can be made sparse either in the Winograd domain or in the spatial domain, with minimal loss of accuracy and without further training. Thus, one compressed model can be deployed on any platform, and the sparsity of its convolutional filters can be utilized for complexity reduction in either domain, unlike previous approaches that yield sparse models in one domain only. We showed by experiments that the proposed scheme successfully prunes and compresses ResNet-18, AlexNet, and the 9-layer CT-SRCNN with significant compression ratios while also reducing their computational complexity. Finally, our regularization method for joint sparsity can be extended to sparse frequency-domain convolution, which remains future work. It will also be interesting to compare our partial L2 norm to the k-support norm (Argyriou et al., 2012) for sparsity regularization.
Appendix
In this appendix, we show the proof of (7) of our paper.
Proof.
We have (e.g., see (Golub and Van Loan, 2013, Section 1.3.7))
$$\operatorname{vec}(AXB) = (B^T \otimes A)\operatorname{vec}(X), \qquad (12)$$
where $\operatorname{vec}(X)$ is the column-vectorization of matrix $X$ and $\otimes$ denotes the Kronecker product of two matrices. Thus, for the Winograd-domain weights $GWG^T$, it follows that
$$\operatorname{vec}(GWG^T) = (G \otimes G)\operatorname{vec}(W). \qquad (13)$$
For any matrix $A$, column vector $x$ and column vector $y$, it is straightforward to show that
$$y \circ (Ax) = D_y A x,$$
where $D_y$ is the diagonal matrix whose diagonal elements are from $y$, and then it follows that
$$\operatorname{vec}\big(M \circ (GWG^T)\big) = D_{\operatorname{vec}(M)} (G \otimes G)\operatorname{vec}(W), \qquad (14)$$
where $M = \mathbf{1}_{|GWG^T| \le \theta}$ is the Winograd-domain pruning mask for threshold $\theta$. Combining (12)–(14) leads us to
$$\nabla_W \, \tfrac{1}{2}\big\|M \circ (GWG^T)\big\|_F^2 = G^T \big(M \circ (GWG^T)\big) G, \qquad (15)$$
which results in (7) of our paper. ∎
We note that the gradient is actually not defined at the points of discontinuity, where some element of $GWG^T$ is exactly equal to the threshold $\theta$ in magnitude; however, these points can be ignored in stochastic gradient descent.
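The identities used in the proof can be checked numerically. The sketch below verifies the vec/Kronecker identity and then checks the closed-form gradient of a thresholded (partial) L2 regularizer against finite differences; the transform matrix G, the filter size, and the threshold are illustrative random stand-ins, not the actual Winograd transform:

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 3, 4                          # illustrative sizes: r x r filter, n x r transform
W = rng.normal(size=(r, r))          # spatial-domain filter
G = rng.normal(size=(n, r))          # stand-in for the Winograd filter transform
theta = 0.5                          # illustrative pruning threshold

# vec(A X B) = (B^T kron A) vec(X), with column-major (Fortran-order) vec.
A, B = rng.normal(size=(n, r)), rng.normal(size=(r, n))
lhs = (A @ W @ B).flatten(order="F")
rhs = np.kron(B.T, A) @ W.flatten(order="F")
assert np.allclose(lhs, rhs)

# Closed-form gradient of R(W) = 0.5 * || M o (G W G^T) ||_F^2,
# with the below-threshold mask M held fixed (as in the proof).
Wt = G @ W @ G.T
M = (np.abs(Wt) < theta).astype(float)
grad = G.T @ (M * Wt) @ G

def R(V):
    """Partial L2 regularizer with the mask M frozen at W."""
    return 0.5 * np.sum((M * (G @ V @ G.T)) ** 2)

# Central finite differences, one entry of W at a time.
eps, num = 1e-6, np.zeros_like(W)
for i in range(r):
    for j in range(r):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (R(Wp) - R(Wm)) / (2 * eps)
# grad and num agree to finite-difference precision.
```

Freezing the mask inside R mirrors the remark above: the discontinuities of the thresholding are ignored, and the remaining function is a smooth quadratic whose gradient matches the closed form.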
References
 LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Sze et al. [2017] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.
 Cheng et al. [2018] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136, 2018.
 Han et al. [2016] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.
 Choi et al. [2017] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Towards the limit of network quantization. In International Conference on Learning Representations, 2017.
 Ullrich et al. [2017] Karen Ullrich, Edward Meeds, and Max Welling. Soft weightsharing for neural network compression. In International Conference on Learning Representations, 2017.
 Agustsson et al. [2017] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc V Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pages 1141–1151, 2017.

 Molchanov et al. [2017] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507, 2017.
 Louizos et al. [2017] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3290–3300, 2017.
 Choi et al. [2018] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network compression. arXiv preprint arXiv:1802.02271, 2018.
 Dai et al. [2018] Bin Dai, Chen Zhu, and David Wipf. Compressing neural networks using the variational information bottleneck. In International Conference on Machine Learning, 2018.
 Mathieu et al. [2013] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
 Vasilache et al. [2014] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.

 Lavin and Gray [2016] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016.
 Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 Lebedev and Lempitsky [2016] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.
 Wen et al. [2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 Guo et al. [2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 Lin et al. [2017] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems, pages 2178–2188, 2017.
 Park et al. [2017] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster CNNs with direct sparse convolutions and guided pruning. International Conference on Learning Representations, 2017.
 Li et al. [2017] Sheng Li, Jongsoo Park, and Ping Tak Peter Tang. Enabling sparse Winograd convolution by native pruning. arXiv preprint arXiv:1702.08597, 2017.
 Liu et al. [2018] Xingyu Liu, Jeff Pool, Song Han, and William J Dally. Efficient sparse-Winograd convolutional neural networks. In International Conference on Learning Representations, 2018.
 Winograd [1980] Shmuel Winograd. Arithmetic Complexity of Computations, volume 33. SIAM, 1980.
 Blahut [2010] Richard E Blahut. Fast Algorithms for Signal Processing. Cambridge University Press, 2010.
 Selesnick and Burrus [1998] Ivan W Selesnick and C Sidney Burrus. Fast convolution and filtering, 1998.
 Ziv and Lempel [1977] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
 Ziv and Lempel [1978] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.
 Welch [1984] Terry A. Welch. A technique for high-performance data compression. Computer, 17(6):8–19, 1984.
 Effros et al. [2002] Michelle Effros, Karthik Visweswariah, Sanjeev R Kulkarni, and Sergio Verdú. Universal lossless source coding with the Burrows-Wheeler transform. IEEE Transactions on Information Theory, 48(5):1061–1081, 2002.
 Zamir and Feder [1992] Ram Zamir and Meir Feder. On universal quantization by randomized uniform/lattice quantizers. IEEE Transactions on Information Theory, 38(2):428–436, 1992.
 Bishop [2006] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 Chen et al. [2001] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
 Donoho and Stark [1989] David L Donoho and Philip B Stark. Uncertainty principles and signal recovery. SIAM Journal on Applied Mathematics, 49(3):906–931, 1989.
 Gailly and Adler [2003] Jean-Loup Gailly and Mark Adler. gzip, 2003. URL www.gzip.org.
 Seward [1998] Julian Seward. bzip2, 1998. URL www.bzip.org.
 Eldar [2015] Yonina C Eldar. Sampling theory: Beyond bandlimited systems. Cambridge University Press, 2015.
 Winograd [1978] Shmuel Winograd. On computing the discrete Fourier transform. Mathematics of Computation, 32(141):175–199, 1978.
 Zou and Hastie [2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
 Argyriou et al. [2012] Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pages 1457–1465, 2012.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 Ren et al. [2018] Haoyu Ren, Mostafa El-Khamy, and Jungwon Lee. CT-SRCNN: Cascade trained and trimmed deep convolutional neural networks for image super resolution. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
 Zeyde et al. [2010] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
 Golub and Van Loan [2013] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 2013.