Deep convolutional neural networks (CNNs) have been ubiquitously utilized in various application domains. However, their performance comes at the cost of a significant amount of computation which keeps growing over time. As an example, for the ImageNet challenge (Russakovsky et al., 2015), Krizhevsky et al. (2012) proposed AlexNet, which requires hundreds of millions of multiplications per image. Later, in 2016, the ResNet-152 model (He et al., 2016) pushed the computation cost past ten billion multiplications. This high computation cost limits the deployment of larger and deeper CNN models.
There are two primary methods to reduce the required computation of CNN models: pruning techniques and Winograd/FFT convolution. Pruning removes redundant weight parameters, inducing sparsity into the network. On the other hand, Winograd convolution (Lavin & Gray, 2016) and FFT convolution (Mathieu et al., 2013) transform the computation into different domains. The convolution operations can then be replaced by element-wise multiplications. For the typical convolution kernel size of 3×3, Winograd convolution can achieve more than twofold speedup over highly optimized spatial convolution algorithms, and typically requires fewer flops than FFT-based approaches (Li et al., 2017). Therefore, in this paper, we focus on the Winograd convolution.
The pruning techniques and Winograd convolution are not directly compatible with each other. Sparse weight matrices, which are generated by pruning, lose most of the sparsity after the Winograd transformation from the spatial (original) domain to the Winograd domain. The remaining sparsity is much lower than what we need for improving computation performance.
To increase the Winograd-domain sparsity, Li et al. (2017) propose to perform pruning and retraining directly on Winograd-domain weights. However, this requires using an extremely small learning rate in retraining, e.g., 200x smaller for AlexNet, and is difficult to apply to deep networks. Besides, Winograd-ReLU pruning (Liu et al., 2018) moves the ReLU function into the Winograd domain, which helps increase Winograd-domain sparsity but requires changes in the network structure.
In this paper, to further improve the sparsity of Winograd-domain weights without changing the network structure, we propose a new pruning method, spatial-Winograd pruning. It includes two parts: spatial structured pruning and Winograd direct pruning. In spatial structured pruning, we prune the spatial-domain weights in a structured way, in which the structures are designed to transfer the spatial-domain sparsity into the Winograd domain efficiently. After spatial structured pruning, weights of the pruned layers will be converted to and kept in the Winograd domain. Then, for Winograd direct pruning, we perform pruning and retraining entirely in the Winograd domain to improve the sparsity further.
This paper makes the following contributions:
We propose a new pruning method, spatial-Winograd pruning. Without changing the network structure, it can achieve higher sparsity in Winograd-domain weights compared with previous methods.
As the first part of spatial-Winograd pruning, we provide a structured pruning method to transfer the spatial-domain sparsity into the Winograd domain efficiently. It can help avoid Winograd-domain retraining in this part and accelerate the pruning process.
In the second part, to perform pruning directly in the Winograd domain, we present a new approach to measuring the importance of each Winograd-domain weight based on its impact on output activations. Also, we propose to use an importance factor matrix to adjust the gradients of Winograd-domain weights, which makes it much faster to retrain deep networks directly in the Winograd domain without changing the network structure.
2 Preliminary and Related Work
Winograd convolution (Lavin & Gray, 2016) is a typical algorithm to reduce the arithmetic complexity of CNNs. It transforms the computation into the Winograd domain, and the convolution operations can then be replaced by element-wise multiplications. We refer to the domain in which the conventional convolution operation is executed as the spatial domain.
The basic block of Winograd convolution works on a 2D input tile, $d$, with a size of $n \times n$, and a 2D weight filter, $g$, with a size of $r \times r$. In this case, the generated 2D output tile, $Y$, will have a size of $(n-r+1) \times (n-r+1)$. For a typical convolutional layer, the input feature maps are first disassembled into input tiles and, after the Winograd convolution, the output tiles are reassembled into the output feature maps.
Figure 0(a) shows how conventional Winograd convolution works. As the first step, the weight filter and the input tile are converted into the Winograd domain using the predefined transform matrices $G$ and $B^T$, respectively. Element-wise multiplication is then applied to the Winograd-domain weight filter, $GgG^T$, and input tile, $B^T d B$, to generate the Winograd-domain output tile with a size of $n \times n$. In the last step, the output tile is converted back into the spatial domain with another predefined matrix $A^T$. With $\odot$ as the Hadamard product (element-wise multiplication), the entire process can be written as

$$Y = A^T \left[ (G g G^T) \odot (B^T d B) \right] A \qquad (1)$$
The transform/inverse-transform matrices $B^T$, $G$, and $A^T$ are only determined by $n$ and $r$. These matrices contain many repeating elements, and applying them requires only a few multiplications. In this case, considering only the element-wise multiplication between $GgG^T$ and $B^T d B$, the Winograd convolution can reduce the number of multiplications from $(n-r+1)^2 \times r^2$ to $n^2$.
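To make the transforms concrete, the following sketch (an illustration we add here, not the paper's code) computes a single F(2×2, 3×3) tile, i.e., $n = 4$ and $r = 3$, using the standard transform matrices from Lavin & Gray (2016), and checks it against direct convolution:

```python
import numpy as np

# Transform matrices for F(2x2, 3x3), i.e., n = 4, r = 3 (Lavin & Gray, 2016).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """One output tile: Y = A^T [ (G g G^T) . (B^T d B) ] A."""
    U = G @ g @ G.T          # Winograd-domain weight filter, 4x4
    V = B_T @ d @ B_T.T      # Winograd-domain input tile, 4x4
    return A_T @ (U * V) @ A_T.T  # 2x2 output tile

def direct_conv(d, g):
    """Reference: 2x2 valid correlation of a 4x4 tile with a 3x3 filter."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i+3, j:j+3] * g)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_2x2_3x3(d, g), direct_conv(d, g))
```

Here the element-wise product costs $4^2 = 16$ multiplications per tile instead of the $2^2 \times 3^2 = 36$ needed by direct convolution, a 2.25x reduction.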
In addition to Winograd convolution, pruning is also a well-explored method to reduce CNN computation. Han et al. (2015b, a) propose to perform pruning and retraining iteratively, which can reduce the number of weights by roughly an order of magnitude. To fully utilize the sparsity incurred by pruning to accelerate CNN computation, Wen et al. (2016) and Yu et al. (2017) prune networks in a structured way: weights are clustered into groups with hardware-friendly structures and then get pruned in groups.
However, Winograd convolution is not directly compatible with conventional pruning algorithms. The transformation fills in the zeros in the sparse weight filters generated by pruning. There have been several research attempts to solve this problem.
Liu & Turakhia (2016) propose to directly mask out Winograd-domain weights and use backpropagation to train the spatial-domain weights. However, compared with spatial-domain weights, Winograd-domain weights are in a higher-dimensional space. Directly setting Winograd-domain weights to zero causes an inconsistency between the spatial domain and the Winograd domain. This inconsistency leads to a significant accuracy loss or a low sparsity on networks, e.g., AlexNet, for large datasets (Li et al., 2017).
To address the inconsistency between the spatial and Winograd domains, Li et al. (2017) propose the sparse Winograd convolution. Figure 0(b) shows how it works. Weight values are stored in the Winograd domain instead of the spatial domain. Both pruning and retraining are applied directly to Winograd-domain weights. This native pruning algorithm achieves a high sparsity on AlexNet (Krizhevsky et al., 2012) but cannot do so for deep networks (Liu et al., 2018). Also, direct retraining in the Winograd domain requires an extremely small learning rate, e.g., 200x smaller for AlexNet, which makes the retraining much slower.
Based on sparse Winograd convolution, Liu et al. (2018) introduce Winograd-ReLU pruning. It moves the ReLU function from the spatial domain into the Winograd domain. In this case, the computation of Winograd convolution becomes

$$Y = A^T \left[\, w \odot \mathrm{ReLU}(B^T d B) \,\right] A \qquad (2)$$

where $w$ is the Winograd-domain weight filter that is stored, pruned, and trained directly.
The Winograd-domain inputs also become sparse, which helps further reduce the required computation. Besides, a higher sparsity in Winograd-domain weights can be achieved. However, with weight filters being sparse, it is challenging to utilize both the weight and input sparsity for CNN acceleration on general-purpose processors due to more irregularity in the access pattern and control flow. Also, Winograd-ReLU pruning cannot be applied to conventional CNN models since the new computation in Equation 2 does not correspond to the original convolution operation. It requires changing the network structure and retraining the network from scratch.
3 Spatial-Winograd Pruning
In this paper, to achieve a high Winograd-domain weight sparsity on deep CNN models without changing network structures, we propose the spatial-Winograd pruning. As shown in Figure 2, it consists of two parts: spatial structured pruning and Winograd direct pruning.
In spatial structured pruning, spatial-domain weights are pruned in a structured way and then retrained to regain the original accuracy. After spatial structured pruning, the weights of the pruned model are transferred into and kept in the Winograd domain. Winograd direct pruning then performs pruning and retraining directly on the Winograd-domain weights. The pruning and retraining steps in both spatial structured pruning and Winograd direct pruning are executed iteratively until we achieve the desired sparsity or the pruned model suffers an unacceptable accuracy loss.
3.1 Spatial Structured Pruning
The first part of the spatial-Winograd pruning is spatial structured pruning. Spatial-domain weights which affect the same Winograd-domain weight are clustered into the same group. Less important weight groups are removed, and the pruned network will be retrained to regain the accuracy. This structured pruning method can help transfer more spatial-domain sparsity into the Winograd domain.
Spatial Pruning For each spatial-domain filter $g$, we need to generate a mask $M$ to indicate the redundant weights. Assuming $g$ has a size of $r \times r$ and the Winograd-domain filter is $w$, we have $w = G g G^T$. Each element of the Winograd-domain filter, $w_{i,j}$, is a weighted sum of the spatial-domain weights:

$$w_{i,j} = \sum_{u=0}^{r-1}\sum_{v=0}^{r-1} S_{i,j,u,v}\, g_{u,v} \qquad (3)$$
$S$ is a 4D tensor containing the weight coefficients of the spatial-domain weights and is only determined by $n$ and $r$. Details about the calculation of $S$ can be found in Appendix A.1.
For each Winograd-domain weight $w_{i,j}$, we can create a set $Q_{i,j}$ containing the spatial-domain weights which affect the value of $w_{i,j}$. $Q_{i,j}$ is defined as

$$Q_{i,j} = \left\{\, g_{u,v} \mid S_{i,j,u,v} \neq 0 \,\right\} \qquad (4)$$
In this case, for each weight group $Q_{i,j}$, we use a function $t(\cdot)$ to measure its importance. In this paper, we use the maximum norm function as

$$t(Q_{i,j}) = \max_{g_{u,v} \in Q_{i,j}} |g_{u,v}| \qquad (5)$$
With a specific threshold $\sigma$, if $t(Q_{i,j}) < \sigma$, then $Q_{i,j}$ is considered redundant and all weights included need to be removed. In this case, the corresponding $w_{i,j}$ will be fixed to 0 and also removed. The set of redundant weights for the entire filter $g$ is the union of all redundant $Q_{i,j}$ and can be calculated as

$$R = \bigcup_{t(Q_{i,j}) < \sigma} Q_{i,j} \qquad (6)$$
Here we define the weight groups $Q_{i,j}$ in a structured way based on the relation between spatial-domain weights and Winograd-domain weights. It helps transfer as much spatial-domain sparsity into the Winograd domain as possible.
The mask matrix $M$ can be generated by

$$M_{u,v} = \begin{cases} 0 & g_{u,v} \in R \\ 1 & \text{otherwise} \end{cases} \qquad (7)$$
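The group construction and mask generation above can be sketched as follows (a minimal illustration we add here, assuming the F(2×2, 3×3) weight transform, so $n = 4$, $r = 3$; the function and variable names are ours, not from the paper's code):

```python
import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])  # weight transform for F(2x2, 3x3)

# S[i, j, u, v] = G[i, u] * G[j, v]: contribution of spatial weight g[u, v]
# to Winograd-domain weight w[i, j] (see Appendix A.1).
S = np.einsum('iu,jv->ijuv', G, G)

def spatial_prune_mask(g, sigma):
    """Return the spatial mask M: M[u, v] = 0 iff g[u, v] lies in some
    redundant group Q_{i,j} whose max-norm importance falls below sigma."""
    support = S != 0                      # (4, 4, 3, 3): membership of each Q_{i,j}
    # t(Q_{i,j}) = max |g[u, v]| over the group's spatial weights
    t = np.where(support, np.abs(g), 0.0).max(axis=(2, 3))
    redundant = support[t < sigma].any(axis=0)   # union of the redundant groups
    return (~redundant).astype(g.dtype)

g = np.arange(9, dtype=float).reshape(3, 3) / 10  # toy 3x3 spatial filter
M = spatial_prune_mask(g, sigma=0.05)
g_pruned = g * M   # removed weights fixed to 0 after each iteration
```

In this toy example, only the group $Q_{0,0} = \{g_{0,0}\}$ falls below the threshold, so exactly one spatial weight is removed.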
After spatial pruning, we can perform the spatial retraining with conventional training algorithms, e.g., stochastic gradient descent (SGD). The removed weights are fixed to 0 by applying $g \leftarrow g \odot M$ after each training iteration, where $\odot$ is element-wise multiplication. The steps of spatial pruning and spatial retraining are iteratively performed until the retrained model suffers a significant accuracy loss. The threshold $\sigma$ is gradually increased to incur more sparsity into the network.
In spatial structured pruning, both pruning and retraining steps are performed in the spatial domain. This avoids Winograd-domain retraining, accelerating the pruning process, while still incurring high Winograd-domain sparsity.
3.2 Winograd Direct Pruning
After spatial structured pruning, as in sparse Winograd convolution, weights of the pruned model will be transferred into and kept in the Winograd domain. In Winograd direct pruning, we measure the importance of each weight based on its impact on output activations, and unimportant weights are removed. The pruned network is then retrained in the Winograd domain, and an importance factor matrix is deployed to adjust the weight gradients.
Winograd Pruning Similar to spatial pruning, in Winograd pruning, we need to generate a mask matrix $M^w$ for each Winograd-domain filter to indicate the redundant weights. With the weight filter $w$ in the Winograd domain, the output tile $Y$ is calculated as

$$Y = A^T \left[\, w \odot (B^T d B) \,\right] A \qquad (8)$$
Each output element $Y_{i,j}$ can be considered as a weighted sum of the products of weights and inputs:

$$Y_{i,j} = \sum_{k,l=0}^{n-1} \sum_{u,v=0}^{n-1} H_{i,j,k,l,u,v}\, w_{k,l}\, d_{u,v} \qquad (9)$$
where $H$ is a 6D tensor containing the weight coefficients of the different products ($w_{k,l} \cdot d_{u,v}$) and is only determined by $n$ and $r$. Details about the calculation of $H$ can be found in Appendix A.2.
By removing one weight $w_{k,l}$, the change on each output $Y_{i,j}$ is

$$\Delta Y_{i,j} = -\sum_{u,v=0}^{n-1} H_{i,j,k,l,u,v}\, w_{k,l}\, d_{u,v} \qquad (10)$$
In Winograd pruning, we need to remove a certain amount of weights while minimizing the change of the output activations $Y_{i,j}$. Removing an important weight will lead to a larger change in the output activations. Therefore, we propose to measure the importance of each weight $w_{k,l}$ by the expected value of the total squared output change. In this case, we have

$$\mathrm{importance}(w_{k,l}) = \mathbb{E}\Big[ \sum_{i,j} (\Delta Y_{i,j})^2 \Big] = \mathbb{E}\Big[ \sum_{i,j} \Big( \sum_{u,v} H_{i,j,k,l,u,v}\, w_{k,l}\, d_{u,v} \Big)^2 \Big] \qquad (11)$$
For simplicity, we can assume the input values $d_{u,v}$ are independent and identically distributed (i.i.d.) and have expected values of 0. With this assumption, the cross terms vanish and we have

$$\mathrm{importance}(w_{k,l}) = w_{k,l}^2\; \mathbb{E}[d^2] \sum_{i,j}\sum_{u,v} H^2_{i,j,k,l,u,v} \qquad (12)$$
Since the importance values of weights are only used relative to each other, we can assume $\mathbb{E}[d^2] = 1$. In this case,

$$\mathrm{importance}(w_{k,l}) = w_{k,l}^2 \sum_{i,j}\sum_{u,v} H^2_{i,j,k,l,u,v} \qquad (13)$$
Based on Equation 13, we can generate an importance factor matrix $F$, where

$$F_{k,l} = \sqrt{\sum_{i,j}\sum_{u,v} H^2_{i,j,k,l,u,v}} \qquad (14)$$
Therefore, $F$ is only determined by $n$ and $r$, and stays the same for all 2D Winograd-domain filters in a specific layer. Then Equation 13 can be simplified (taking the square root, which preserves the ranking) to

$$\mathrm{importance}(w_{k,l}) = F_{k,l} \cdot |w_{k,l}| \qquad (15)$$
In this case, with a specific threshold $\sigma_w$, we can generate the mask matrix $M^w$ as

$$M^w_{k,l} = \begin{cases} 0 & F_{k,l}\,|w_{k,l}| < \sigma_w \\ 1 & \text{otherwise} \end{cases} \qquad (16)$$
For a specific weight $w_{k,l}$, conventional pruning algorithms (Han et al., 2015b; Guo et al., 2016) use its absolute value $|w_{k,l}|$ as the weight importance, which is equivalent to using an all-ones $F$. Therefore, in Equation 16, the employed weight importance, $F_{k,l}\,|w_{k,l}|$, can be considered as using the importance factor matrix $F$ to adjust the conventional weight importance $|w_{k,l}|$.
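A small sketch of computing $F$ and the mask $M^w$ for the F(2×2, 3×3) case (our illustration, not the paper's code; the coefficient tensor $H$ follows the closed form derived in Appendix A.2):

```python
import numpy as np

# Transform matrices for F(2x2, 3x3): A^T (2x4) and B^T (4x4).
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)

# 6D coefficient tensor: H[i,j,k,l,u,v] = A[k,i] * A[l,j] * B[u,k] * B[v,l]
H = np.einsum('ik,jl,ku,lv->ijkluv', A_T, A_T, B_T, B_T)

# Importance factor matrix: F[k,l] = sqrt(sum over i,j,u,v of H^2)
F = np.sqrt((H ** 2).sum(axis=(0, 1, 4, 5)))

def winograd_prune_mask(w, sigma_w):
    """Mask M^w: remove w[k,l] whenever F[k,l] * |w[k,l]| < sigma_w."""
    return (F * np.abs(w) >= sigma_w).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))   # a toy Winograd-domain filter
mask = winograd_prune_mask(w, sigma_w=1.0)
w_pruned = w * mask
```

For this tile size, $F$ assigns the four central positions ($k, l \in \{1, 2\}$) an importance factor of 4 versus 2 at the corners, already hinting that central Winograd-domain weights matter more than border ones.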
Winograd Retraining As with the spatial retraining, we fix the removed Winograd-domain weights to 0 by applying $w \leftarrow w \odot M^w$ after each training iteration.
However, using conventional SGD to retrain the Winograd-domain parameters will lead to divergence. This is because, as shown in Equation 14, different locations of Winograd-domain weights have different importance and, therefore, require different learning speeds. Using an extremely small learning rate can avoid the divergence but makes the retraining much slower.
To address this problem, in Winograd retraining, we propose to adjust the gradients of Winograd-domain weights with the importance factor matrix $F$. Assume $L$ to be the loss value. At the training step $t$, after the backward computation, the gradients of $w_t$, $\nabla_t = \partial L / \partial w_t$, will be adjusted by

$$\nabla'_t = \nabla_t \oslash F^{\circ \alpha} \qquad (17)$$
where $\oslash$ and $F^{\circ \alpha}$ are the Hadamard division (element-wise division) and the Hadamard power (element-wise power of $\alpha$), respectively. In this paper, based on empirical results, $\alpha$ is fixed to 1.5. In this case, with the learning rate of $\eta$, the SGD update for the Winograd-domain weights at the training step $t$ becomes

$$w_{t+1} = \left( w_t - \eta\, \nabla'_t \right) \odot M^w \qquad (18)$$
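One retraining update can then be sketched as follows (our illustration for a single F(2×2, 3×3) filter with $\alpha = 1.5$; in practice this update runs inside the training framework):

```python
import numpy as np

A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)

# Importance factor matrix F: root of the summed squared coefficients
# tying each Winograd-domain weight w[k, l] to the outputs.
H = np.einsum('ik,jl,ku,lv->ijkluv', A_T, A_T, B_T, B_T)
F = np.sqrt((H ** 2).sum(axis=(0, 1, 4, 5)))

ALPHA = 1.5  # empirical exponent from the paper

def adjusted_sgd_step(w, grad, mask, lr):
    """One Winograd-retraining update: divide the raw gradient element-wise
    by F^alpha, apply SGD, then re-zero the pruned positions."""
    return (w - lr * grad / (F ** ALPHA)) * mask

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4))
mask = (np.abs(w) * F) >= 1.0           # keep weights with high adjusted importance
w = w * mask                            # pruned weights start at zero
w_next = adjusted_sgd_step(w, rng.standard_normal((4, 4)), mask, lr=1e-3)
assert np.all(w_next[mask == 0] == 0)   # removed weights stay at zero
```

Because $F$ is larger at the central, more important positions, dividing by $F^{\circ \alpha}$ slows their updates relative to the border positions, which is what allows a usable global learning rate.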
4 Experiments

PyTorch (Paszke et al., 2017) is used to implement the pruning framework.
We use Winograd-ReLU pruning (Liu et al., 2018) as the baseline pruning technique. To show the effectiveness of our proposed method, we test the same models as in Winograd-ReLU pruning: VGG-nagadomi (Nagadomi, 2014), ConvPool-CNN-C (Springenberg et al., 2014), and ResNet-18 (He et al., 2016) on CIFAR-10, CIFAR-100, and ImageNet, respectively. These models are chosen since the majority of their convolutional layers use 3×3 kernels.
For 3×3 kernels, we set the input tile size to 6 instead of 4, since a larger input tile size can help achieve higher computation speedup. With our proposed method, we expect that smaller input tile sizes can lead to a similar or higher sparsity. This is because, with smaller input tiles, the spatial-domain weights have less correlation with each other and spatial structured pruning can achieve a higher sparsity.
4.1 CIFAR-10: VGG-nagadomi
For the CIFAR-10 dataset, we test the VGG-nagadomi model (Nagadomi, 2014). It contains 8 convolutional layers with 3×3 kernels. We use batch normalization instead of dropout to regularize the convolutional layers. The original model has a prediction accuracy of 93.96%. We prune the first convolutional layer with a fixed Winograd-domain sparsity of 20%. For the remaining convolutional layers, we incur a uniform Winograd-domain sparsity, increasing from 20% to 80%, for simplicity.
Figure 2(a) shows the pruning results. The baseline result reported in (Liu et al., 2018) is shown as the dashed line: it achieves a sparsity of 60% before incurring accuracy loss. With spatial-Winograd pruning, we can achieve a Winograd-domain sparsity of 63%. This is similar to Winograd-ReLU pruning, but spatial-Winograd pruning does not require changing the network structure.
4.2 CIFAR-100: ConvPool-CNN-C
For the CIFAR-100 dataset, the ConvPool-CNN-C model (Springenberg et al., 2014) is tested. It contains 9 convolutional layers, of which 7 use 3×3 kernels. The original model has a prediction accuracy of 69.95%. We prune the first convolutional layer with a fixed Winograd-domain sparsity of 20%. Similar to the VGG-nagadomi model, the remaining 6 convolutional layers with 3×3 kernels are iteratively pruned and retrained with uniform Winograd-domain sparsities.
Figure 2(b) shows the result of the relative accuracy against the Winograd-domain sparsity. The baseline result reported in (Liu et al., 2018) is shown as the dashed line: it achieves a sparsity of 40% before incurring accuracy loss. Winograd direct pruning is applied to the model pruned by spatial structured pruning with 30% sparsity. With no accuracy loss, spatial-Winograd pruning can reach a sparsity of 50%, which is 10% higher than Winograd-ReLU pruning.
4.3 ImageNet: ResNet-18
We test the ResNet-18 model on the ImageNet (ILSVRC-2012) dataset. As with Winograd-ReLU pruning, we replace each 2-stride convolutional layer with a 2-stride max-pooling layer followed by a 1-stride convolutional layer. This change makes it easier to apply Winograd convolution to most of the convolutional layers.
The original model has a top-1/top-5 prediction accuracy of 69.82%/89.55%. However, for Winograd-ReLU pruning, Liu et al. (2018) use a model with an original top-1/top-5 accuracy of only 66.67%/87.42%. Despite this, we still use the relative accuracies reported in (Liu et al., 2018) as the baseline.
We prune the 16 convolutional layers in the residual blocks with the same Winograd-domain sparsity. The first convolutional layer and the downsample layers are kept intact. Figure 4 shows the results of the relative accuracy against the Winograd-domain sparsity. As the dashed lines show, Winograd-ReLU pruning achieves a sparsity of 70%/65% before incurring top-1/top-5 accuracy loss.
We apply Winograd direct pruning to the models pruned by spatial structured pruning with 65% and 70% Winograd-domain sparsity, annotated as Winograd direct pruning - 0.65 and 0.70, respectively. As shown in the figure, applying Winograd direct pruning to the model with 70% Winograd-domain sparsity achieves a higher sparsity of 74%/72% before incurring top-1/top-5 accuracy loss. This is because, as the sparsity increases, Winograd direct pruning makes the prediction accuracy drop much faster than spatial structured pruning. Although we can use the importance factor matrix to adjust the weight gradients and accelerate the Winograd-domain retraining, the learning rate still needs to be much lower than in the spatial retraining. In this case, the accuracy loss recovered through Winograd retraining is limited, which makes the accuracy drop much faster when applying Winograd direct pruning.
4.4 Effectiveness of Importance Factor Matrix
In Winograd direct pruning, we use the importance factor matrix to adjust the weight importance and gradients for different locations of Winograd-domain weights. Here we test the effectiveness of employing the importance factor matrix in both the Winograd pruning and retraining.
We first test how the importance factor matrix helps in Winograd pruning. Winograd pruning without retraining is applied to the model pruned by spatial structured pruning with 70% sparsity. Figure 4(a) shows the relative accuracy against the sparsity when pruning with the weight importance unadjusted or adjusted with $F$. As shown in the figure, adjusting the weight importance with the importance factor matrix dramatically reduces the accuracy loss incurred by Winograd pruning. When pruning the model to 76% sparsity, using the absolute value as the weight importance causes a 22% accuracy loss. In comparison, using the importance factor matrix to adjust the weight importance reduces the accuracy loss to 10%.
We also test the effectiveness of the importance factor matrix in Winograd retraining. For the model pruned with spatial structured pruning (70% sparsity), Winograd pruning is applied to increase the sparsity to 74%. We then perform Winograd retraining for 10 epochs. Figure 4(b) shows the relative accuracy against the retraining epochs with unadjusted and adjusted gradients. For unadjusted gradients, we try three learning rates: 1e-7, 1e-8, and 1e-9. Higher learning rates, e.g., 1e-6, lead to an accuracy drop through retraining. As shown in the figure, adjusting the gradients with the importance factor matrix can substantially accelerate the convergence. With only 10 epochs of retraining, it reduces the accuracy loss to 0.2%, while retraining without gradient adjustment only reduces the accuracy loss to 0.7%.
In this paper, we present a new pruning method, spatial-Winograd pruning, to improve the Winograd-domain weight sparsity without changing network structures. It includes two steps: spatial structured pruning and Winograd direct pruning. In spatial structured pruning, we prune the spatial-domain weights based on the internal structure in the Winograd transformation. It can help efficiently transfer the spatial-domain sparsity into the Winograd domain. For Winograd direct pruning, we perform both pruning and retraining in the Winograd domain. An importance factor matrix is proposed to adjust the weight gradients in Winograd retraining, which makes it possible to effectively retrain the Winograd-domain network to regain the original accuracy without changing the network structure. We evaluate spatial-Winograd pruning on three datasets, CIFAR-10, CIFAR-100, ImageNet, and it can achieve the Winograd-domain sparsities of 63%, 50%, and 74%, respectively.
- Guo et al. (2016) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387, 2016.
- Han et al. (2015a) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. (2015b) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015b.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Lavin & Gray (2016) Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4021, 2016.
- Li et al. (2017) Sheng Li, Jongsoo Park, and Ping Tak Peter Tang. Enabling sparse winograd convolution by native pruning. arXiv preprint arXiv:1702.08597, 2017.
- Liu & Turakhia (2016) Xingyu Liu and Yatish Turakhia. Pruning of winograd and fft based convolution algorithm. 2016.
- Liu et al. (2018) Xingyu Liu, Jeff Pool, Song Han, and William J Dally. Efficient sparse-winograd convolutional neural networks. arXiv preprint arXiv:1802.06367, 2018.
- Mathieu et al. (2013) Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013.
- Nagadomi (2014) Nagadomi. Code for kaggle-cifar10 competition. 5th place. https://github.com/nagadomi/kaggle-cifar10-torch7, 2014.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
- Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
- Yu et al. (2017) Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In ACM SIGARCH Computer Architecture News, volume 45, pp. 548–560. ACM, 2017.
Appendix A Coefficient Tensors S and H
a.1 Coefficient Tensor S
To calculate the tensor $S$, we first introduce an equivalent transformation: for two vectors $a$ and $b$ with a size of $n$, and a matrix $T$ with a size of $m \times n$, we have

$$(T a) \odot (T b) = \left[ (T \otimes \mathbf{1}^T) \odot (\mathbf{1}^T \otimes T) \right] (a \otimes b) \qquad (19)$$

where $\mathbf{1}$ is a vector of size $n$ and all its entries are 1, and $\otimes$ is the Kronecker product.
For the matrix $w$, with the weight transform matrix $G$, we have $w = G g G^T$. Each element in the Winograd-domain weight filter is calculated as

$$w_{i,j} = \sum_{u=0}^{r-1}\sum_{v=0}^{r-1} G_{i,u}\, G_{j,v}\, g_{u,v} \qquad (20)$$

where $0 \le i, j \le n-1$ and $0 \le u, v \le r-1$. In this case, compared with Equation 3, each element in the coefficient tensor $S$ can be calculated as

$$S_{i,j,u,v} = G_{i,u}\, G_{j,v} \qquad (21)$$
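This closed form can be checked numerically (our sketch, using the F(2×2, 3×3) weight transform $G$):

```python
import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])  # weight transform for F(2x2, 3x3)

# S[i, j, u, v] = G[i, u] * G[j, v]
S = np.einsum('iu,jv->ijuv', G, G)

rng = np.random.default_rng(2)
g = rng.standard_normal((3, 3))
w_from_S = np.einsum('ijuv,uv->ij', S, g)   # weighted-sum form
assert np.allclose(w_from_S, G @ g @ G.T)    # matches w = G g G^T
```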
a.2 Coefficient Tensor H
With the Winograd-domain weight filter $w$, the output tile $Y$ is calculated as

$$Y = A^T \left[\, w \odot (B^T d B) \,\right] A \qquad (22)$$
Each element $Y_{i,j}$ is calculated as

$$Y_{i,j} = \sum_{k=0}^{n-1}\sum_{l=0}^{n-1} A_{k,i}\, A_{l,j} \left[\, w \odot (B^T d B) \,\right]_{k,l} \qquad (23)$$

where $0 \le i, j \le n-r$. Let $X = B^T d B$. Based on Equation 19, we have

$$Y_{i,j} = \sum_{k=0}^{n-1}\sum_{l=0}^{n-1} A_{k,i}\, A_{l,j}\, w_{k,l}\, X_{k,l} \qquad (24)$$

where $0 \le k, l \le n-1$. The element $X_{k,l}$ can be calculated as

$$X_{k,l} = \sum_{u=0}^{n-1}\sum_{v=0}^{n-1} B_{u,k}\, B_{v,l}\, d_{u,v} \qquad (25)$$
Therefore, compared with Equation 9, each element in the coefficient tensor $H$ is calculated as

$$H_{i,j,k,l,u,v} = A_{k,i}\, A_{l,j}\, B_{u,k}\, B_{v,l} \qquad (26)$$
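Again, this can be verified numerically (our sketch for F(2×2, 3×3)):

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

# H[i, j, k, l, u, v] = A[k, i] * A[l, j] * B[u, k] * B[v, l]
H = np.einsum('ik,jl,ku,lv->ijkluv', A_T, A_T, B_T, B_T)

rng = np.random.default_rng(3)
w = rng.standard_normal((4, 4))   # Winograd-domain filter
d = rng.standard_normal((4, 4))   # spatial-domain input tile

Y_sum = np.einsum('ijkluv,kl,uv->ij', H, w, d)    # weighted-sum form
Y_ref = A_T @ (w * (B_T @ d @ B_T.T)) @ A_T.T     # Y = A^T [w . (B^T d B)] A
assert np.allclose(Y_sum, Y_ref)
```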
Appendix B ResNet-18 Pruning with Varied Sparsities across Layers
In addition to pruning ResNet-18 with the same sparsity across all targeted layers, we also experiment with incurring different sparsities in different layers with spatial-Winograd pruning.
For spatial structured pruning, we first test the pruning sensitivity of each convolutional layer to decide which layers need to be pruned and the corresponding thresholds. To choose the targeted layers, we measure the accuracy loss when 60% of Winograd-domain weights are pruned for each layer. Only one layer is pruned at a time, and the other layers are kept intact. Figure 6 shows the results. In ResNet-18, the $i$-th residual block contains two convolutional layers, $i$-a and $i$-b. As shown in the figure, the first layer in each residual block is much more sensitive to pruning than the second layer. Therefore, we only prune the second convolutional layer, $i$-b, in each residual block.
Table 1: Pruning results of ResNet-18 with varied sparsities across layers.

| Layer | Winograd-domain sparsity (spatial structured pruning) | Spatial-domain sparsity | Winograd-domain sparsity (Winograd direct pruning) |
|---|---|---|---|
| 0-b | 78.0% | 88.8% | 85.0% |
| 1-b | 77.1% | 87.9% | 84.0% |
| 2-b | 76.4% | 88.5% | 84.4% |
| 3-b | 85.2% | 93.1% | 90.4% |
| 4-b | 76.9% | 88.9% | 84.7% |
| 5-b | 88.6% | 95.1% | 93.0% |
| 6-b | 74.4% | 88.3% | 82.4% |
| 7-b | 82.6% | 87.8% | 92.3% |
| Average | 79.4% | 88.9% | 87.6% |

| Top-1 accuracy | 69.92% | 69.94% |
|---|---|---|
| Top-5 accuracy | 89.34% | 89.51% |
For each targeted layer $i$-b, we determine the corresponding pruning threshold based on its pruning sensitivity. We gradually increase the threshold until the validation accuracy drops by 2%, and this threshold is recorded as $\sigma_i$. Then, in spatial structured pruning, the threshold used for layer $i$-b is calculated as $\gamma \cdot \sigma_i$, where $\gamma$ is a multiplier shared across all targeted layers. With a larger $\gamma$, the threshold and, therefore, the sparsity will be higher. Also, in Winograd direct pruning, we use the same strategy to choose the thresholds for different layers.
Table 1 lists the pruning results. After spatial structured pruning, we can reach an average Winograd-domain sparsity of 79.4% for the pruned layers. The corresponding spatial-domain sparsity is 88.9%, which is 9.5% higher. Winograd direct pruning can further improve the Winograd-domain sparsity to 87.6%, and layer 5-b has the highest sparsity of 93.0%.
Appendix C Sparsity Distribution
For the pruned ResNet-18 model, we analyze the detailed sparsity distribution across and inside 2D weight filters. Here each Winograd-domain weight matrix is considered as a 2D filter. We use the last convolutional layer (7-b) as an example. The model with a uniform sparsity of 74% across all pruned layers, which corresponds to point P in Figure 3(a), is tested. Figure 8 shows the sparsity distribution across the filters. As shown in the figure, more than half (62%) of the filters have all weights removed. An interesting observation is that a large portion of the filters have exactly 20 weights removed.
To explain why many filters have exactly 20 weights removed, we visualize the sparsity distribution inside the filters. Figure 8a shows the sparsity of different locations for the filters with exactly 20 weights removed. Darker locations have higher sparsities, where more weights are removed. The border of the filter, which includes 20 weights, has much higher sparsity than the central part. It means the border weights in the Winograd domain are much less important than the central weights. A potential reason is that the central weights are correlated with more spatial-domain weights, and, therefore, removing them leads to a larger difference in the output activations. In Figure 8b, we also visualize the sparsity distribution inside the filters with at least one weight remaining, and it shows a similar pattern.