FSNet: Compression of Deep Convolutional Neural Networks by Filter Summary

02/08/2019 ∙ by Yingzhen Yang, et al. ∙ Microsoft ∙ Baidu, Inc.

We present a novel method for compressing deep Convolutional Neural Networks (CNNs). The proposed method reduces the number of parameters of each convolutional layer by learning a 3D tensor termed Filter Summary (FS). The convolutional filters are extracted from the FS as overlapping 3D blocks, and nearby filters in the FS naturally share weights in their overlapping regions. The resulting network based on this weight sharing scheme, termed Filter Summary CNN or FSNet, has a FS in each convolution layer instead of the set of independent filters of a conventional convolution layer. FSNet has the same architecture as the baseline CNN to be compressed, and each convolution layer of FSNet generates the same number of filters from its FS as the corresponding layer of the baseline CNN in the forward process. Without hurting the inference speed, the parameter space of FSNet is much smaller than that of the baseline CNN. In addition, FSNet is compatible with weight quantization, which leads to an even higher compression ratio when the two are combined. Experiments demonstrate the effectiveness of FSNet in compressing CNNs for computer vision tasks including image classification and object detection. For the classification task, FSNet with 0.22M effective parameters has a prediction accuracy of 93.91% using ResNet-18 with 11.18M parameters as baseline. Furthermore, the FSNet version of ResNet-50 with 2.75M effective parameters achieves top-1 and top-5 accuracy of 63.80% [...] For the object detection task, FSNet is used to compress the Single Shot MultiBox Detector (SSD300) with 26.32M parameters. FSNet with 0.45M effective parameters achieves an mAP of 67.63% on the VOC2007 test data with weight quantization, and FSNet with 0.68M parameters achieves an mAP of 70.00% on the same test data.




1 Introduction

Deep Convolutional Neural Networks (CNNs) have achieved stunning success in various machine learning and pattern recognition tasks by learning highly semantic and discriminative representations of data. Despite their power, CNNs are usually over-parameterized and have a large parameter space, which makes them difficult to deploy on mobile platforms or other platforms with limited storage. In recently emerging architectures such as the Residual Network (He et al., 2016) and the Densely Connected Network (Huang et al., 2017), most parameters are concentrated in the convolution filters, which are used to learn deformation-invariant features in the input volume. The deep learning community has developed several compression methods for reducing the parameter space of filters, such as filter pruning (Luo et al., 2017), weight sharing and quantization (Han et al., 2016), and low-rank and sparse representation of the filters (Ioannou et al., 2016; Yu et al., 2017).

Weight sharing has proven to be an effective way of reducing the parameter space of CNNs. The success of deep compression (Han et al., 2016) and filter pruning (Luo et al., 2017) suggests that there is considerable redundancy in the parameter space of the filters of CNNs. Based on this observation, our goal of compression can be achieved by encouraging filters to share weights.

In this paper, we propose a novel representation of filters, termed Filter Summary (FS), which enforces weight sharing across filters so as to achieve model compression. FS is a 3D tensor from which filters are extracted as overlapping 3D blocks. Because of weight sharing across nearby filters that overlap each other, the parameter space of a convolution layer with FS is much smaller than that of its counterpart in conventional CNNs. In contrast, the model compression literature broadly adopts a two-step approach: learning a large CNN first, then compressing the model by various techniques such as pruning, quantization and coding (Han et al., 2016; Luo et al., 2017), or low-rank and sparse representation of filters (Ioannou et al., 2016; Yu et al., 2017). The weight sampling network (Jin et al., 2018) studies overlapping filters for compression of 1D CNNs, and our work generalizes it by accounting for weight sharing through overlapping in regular 2D CNNs. The idea of FS resembles that of the epitome (Jojic et al., 2003), which was developed for learning a condensed version of Gaussian Mixture Models (GMMs). In an epitome, the Gaussian means are represented by a two-dimensional matrix wherein each window contains the parameters of the Gaussian mean of one Gaussian component. Our work adopts this idea of an overlapping structure in a generative model for representing the filters of CNNs.

CNNs in which each convolution layer has a FS representing its filters are named FSNet. FSNet has a compact architecture compared to its conventional CNN counterpart. Instead of a two-step compression process, FSNet is trained from scratch without the need to train a large model beforehand, which could consume considerable space and energy. In the following text, we use $m:n$ to indicate the integers between $m$ and $n$ inclusive, and a subscript indicates the index of an element of a tensor.

2 Formulation

We propose Filter Summary Convolutional Neural Networks (FSNet) in this section. Each convolution layer of FSNet has a 3D tensor named Filter Summary (FS), where filters are 3D blocks residing in the FS in a weight sharing manner. FSNet and its baseline CNN have the same architecture, except that each convolution layer of FSNet has a compact representation of filters, namely a FS, rather than the set of independent filters in the baseline. The FS is designed to generate the same number of filters as the corresponding convolution layer of the baseline CNN. Figure 1 shows an example of a FS and how filters are extracted from it. To make the compact architecture of FSNet concrete, consider a convolution layer of the baseline CNN with a given number of filters of a given channel size and spatial size. The corresponding convolution layer in FSNet holds a single FS, and the filters are extracted from it by striding along each spatial dimension and along the channel dimension. The ratio of the parameter size of the filters to that of the corresponding FS is the factor by which the parameter space of the FS is smaller than that of the independent filters in the baseline CNN.

Formally, let a FS generate $K$ filters of size $S_1 \times S_2 \times C$, where $(S_1, S_2)$ is the spatial size and $C$ is the channel size. Let the sampling strides along the two spatial dimensions and the channel dimension of the FS be $(s_1, s_2, s_3)$ respectively, and let $(K_1, K_2, K_3)$ be the corresponding sampling numbers, so that $K = K_1 K_2 K_3$. Then the dimension of the FS is $K_1 s_1 \times K_2 s_2 \times K_3 s_3$, where $(K_1 s_1, K_2 s_2)$ is the spatial size and $K_3 s_3$ is the channel size. In this paper we set the channel size of the FS to that of the filter, i.e. $K_3 s_3 = C$. This is based on our observation that weight sharing along the channel dimension tends not to hurt the prediction performance of FSNet. Therefore, the ratio of the parameter size of the independent filters to that of the corresponding FS is as follows:

$$\frac{K \cdot S_1 S_2 C}{K_1 s_1 \cdot K_2 s_2 \cdot C} = \frac{S_1 S_2}{s_1 s_2} \cdot K_3 \qquad (1)$$

In a typical setting where $K_3 \ge 1$ and the spatial stride is smaller than the corresponding filter size, i.e. $s_1 < S_1$, $s_2 < S_2$, the FS has a more compact size than the individual filters. Note that a large $K_3$, namely the sampling number along the channel dimension, contributes to a better compression ratio, and this is the case in our experimental results. Compared to (Yang et al., 2018), FSNet also compresses $1 \times 1$ convolution layers using FS, and we offer much more extensive experimental results in this paper. In the case of $1 \times 1$ convolution, we set $K_1 = K_2 = 1$ and $K_3 = K$, indicating that all the filters are extracted along the channel dimension of the FS. In addition, the channel size of the FS can be larger than that of the filter so that a proper compression ratio for the $1 \times 1$ convolution layer is obtained.
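As a quick check of the compression ratio discussed above, the following Python sketch computes the size of a FS and the resulting ratio. It assumes that a FS with sampling numbers $(K_1, K_2, K_3)$ and strides $(s_1, s_2, s_3)$ has size $K_1 s_1 \times K_2 s_2 \times K_3 s_3$, and that the channel size of the FS equals that of the filter. The function names and the layer configuration (64 filters of size $3 \times 3 \times 64$) are hypothetical and chosen only for illustration.

```python
from math import prod


def fs_shape(sampling_numbers, strides):
    """Shape of a Filter Summary with K_i sampling positions at stride s_i per dimension."""
    return tuple(k * s for k, s in zip(sampling_numbers, strides))


def compression_ratio(filter_shape, sampling_numbers, strides):
    """Parameters of the independent filters divided by parameters of the FS."""
    n_filters = prod(sampling_numbers)
    n_filter_params = n_filters * prod(filter_shape)
    return n_filter_params / prod(fs_shape(sampling_numbers, strides))


# Hypothetical layer: 64 filters of size 3 x 3 x 64.
# 4 x 4 x 4 sampling positions, spatial stride 2, channel stride 16 (4 * 16 = 64 channels).
filter_shape = (3, 3, 64)
sampling_numbers = (4, 4, 4)
strides = (2, 2, 16)

print(fs_shape(sampling_numbers, strides))                         # (8, 8, 64)
print(compression_ratio(filter_shape, sampling_numbers, strides))  # 9.0
```

Under these assumptions the ratio equals $K_3 \cdot S_1 S_2 / (s_1 s_2) = 4 \cdot 9 / 4 = 9$, so increasing the sampling number along the channel dimension increases the ratio proportionally.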

Algorithm 1 describes the forward and backward processes in a convolution layer of FSNet. We use a mapping $T$ which maps the indices of the elements of the extracted filters to the indices of the corresponding elements in the FS. Namely, for a filter $W$ extracted from the FS $F$, $W_{\mathbf{i}} = F_{T(\mathbf{i})}$ for every element index $\mathbf{i}$ of $W$. The mapping is used to conveniently track the origin of the elements of the filters extracted from the FS. It should be emphasized that the inference speed of FSNet is the same as that of conventional CNNs, since filters are accessed by indexing into the FS and convolution is performed in the normal way. The model compression is achieved at the cost of the backward operation when training FSNet, wherein the gradients of the elements of filters that share the same weight are averaged to obtain the gradient of the corresponding weight of the FS. More details are given in Algorithm 1. Figure 2 illustrates the typical architecture of FSNet, where each convolution layer has a FS.
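The index mapping described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `extract_filters`, the flat-index representation of the mapping, and the wrap-around at every border of the FS are assumptions of this sketch (the text only states that a copy of the FS is appended along the channel dimension).

```python
import numpy as np


def extract_filters(fs, filter_shape, sampling_numbers, strides):
    """Extract overlapping filters from a Filter Summary.

    Returns (filters, index_map):
      filters   -- array of shape (K, S1, S2, C) with the K extracted filters
      index_map -- integer array of the same shape; index_map[k, i, j, c] is the
                   flat index into fs of the weight used at position (i, j, c)
                   of filter k.
    Blocks that cross a border of the FS wrap around (assumption of this sketch).
    """
    flat_ids = np.arange(fs.size).reshape(fs.shape)
    offsets = [np.arange(n) for n in filter_shape]
    maps = []
    for k1 in range(sampling_numbers[0]):
        for k2 in range(sampling_numbers[1]):
            for k3 in range(sampling_numbers[2]):
                start = (k1 * strides[0], k2 * strides[1], k3 * strides[2])
                axes = [(st + off) % dim
                        for st, off, dim in zip(start, offsets, fs.shape)]
                maps.append(flat_ids[np.ix_(*axes)])
    index_map = np.stack(maps)
    # Filters are read from the FS purely by indexing, so no weights are copied
    # into separate parameters.
    return fs.reshape(-1)[index_map], index_map


# Hypothetical layer: an 8 x 8 x 64 FS generating 64 filters of size 3 x 3 x 64.
rng = np.random.default_rng(0)
fs = rng.standard_normal((8, 8, 64))
filters, index_map = extract_filters(fs, (3, 3, 64), (4, 4, 4), (2, 2, 16))
print(filters.shape)  # (64, 3, 3, 64)
```

The extracted `filters` array can be passed to any standard convolution routine, which is why the forward pass costs the same as in the baseline CNN.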

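Algorithm 1 Forward and Backward Operation in a convolution layer of FSNet with FS $F$
1:  Forward: Perform regular convolution with the input and the $K$ filters extracted from the FS $F$. The filter $W^{(k_1,k_2,k_3)}$ with indices $(k_1, k_2, k_3)$, where $k_i \in \{1, \dots, K_i\}$ for $i = 1, 2, 3$, is the block of size $S_1 \times S_2 \times C$ whose first element is at offset $((k_1-1)s_1, (k_2-1)s_2, (k_3-1)s_3)$ in the FS. $K = K_1 K_2 K_3$ is the number of filters in the corresponding convolution layer of the baseline CNN.
2:  Backward: Obtain the gradients of all the filters by ordinary back-propagation of the loss $L$ as $\partial L / \partial W^{(k_1,k_2,k_3)}$, $k_i \in \{1, \dots, K_i\}$. The gradient of each element $F_{\mathbf{p}}$ of the FS is computed by averaging the gradients of all the filter elements that share this weight:
$$\frac{\partial L}{\partial F_{\mathbf{p}}} = \frac{1}{|N_{\mathbf{p}}|} \sum_{(k, \mathbf{i}) \in N_{\mathbf{p}}} \frac{\partial L}{\partial W^{(k)}_{\mathbf{i}}},$$
where $N_{\mathbf{p}}$ is the set of pairs of a filter $k$ and an element index $\mathbf{i}$ such that element $\mathbf{i}$ of filter $k$ is the weight $F_{\mathbf{p}}$.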
Figure 1: Illustration of a Filter Summary (FS). The three filters in green, red and yellow are three overlapping filters extracted from the FS. The $(k_1, k_2, k_3)$-th filter of size $S_1 \times S_2 \times C$, marked in red, is a block in the FS located at offset $((k_1-1)s_1, (k_2-1)s_2, (k_3-1)s_3)$, where $(s_1, s_2, s_3)$ are the sampling strides in the two spatial dimensions and the channel dimension respectively. This filter is illustrated by the dashed red block. A copy of the original FS is appended along the channel dimension, illustrated by the dashed line, so as to ensure that all filters have valid elements.
Figure 2: Illustration of FSNet
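The backward step of Algorithm 1, in which the gradients of filter elements sharing one weight are averaged, can be sketched as below. The function name `fs_gradient` and the integer array `index_map` (the flat index of the FS weight behind each filter element) are hypothetical names introduced for this sketch, and the toy values are made up.

```python
import numpy as np


def fs_gradient(filter_grads, index_map, fs_shape):
    """Average the gradients of all filter elements that share one FS weight."""
    n = int(np.prod(fs_shape))
    idx = index_map.ravel()
    # Sum of the gradients and number of filter elements mapped to each FS weight.
    grad_sum = np.bincount(idx, weights=filter_grads.ravel(), minlength=n)
    counts = np.bincount(idx, minlength=n)
    # FS weights that are used by no filter keep a zero gradient.
    grad = np.divide(grad_sum, counts, out=np.zeros(n), where=counts > 0)
    return grad.reshape(fs_shape)


# Toy example: a FS with 4 weights and two overlapping filters of length 3.
index_map = np.array([[0, 1, 2],
                      [1, 2, 3]])
filter_grads = np.array([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]])
grad = fs_gradient(filter_grads, index_map, fs_shape=(1, 1, 4))
print(grad.ravel())  # [1. 3. 4. 6.]
```

Weights 1 and 2 are each shared by both filters, so their gradients are the means of (2, 4) and (3, 5); the two border weights are used once and keep their single gradient.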

3 Experimental Results

We conduct experiments with CNNs for image classification and object detection tasks in this section, demonstrating the compression results of FSNet.

3.1 FSNet for Classification

We demonstrate the performance of FSNet in this subsection by comparing FSNet with its baseline CNN on the classification task on the CIFAR-10 dataset (Krizhevsky, 2009). Using ResNet (He et al., 2016) or DenseNet (Huang et al., 2017) as baseline CNNs, we design FSNet by replacing all the convolution layers of ResNet or DenseNet with convolution layers with FS. We train FSNet and its baseline CNN, and show the test accuracy and the number of parameters of all the models in Table 2. The convolution layers of ResNet and DenseNet have filters of spatial size either $3 \times 3$ or $1 \times 1$. We design the size of the FS according to the number of filters in the corresponding convolution layer of the baseline CNN. Throughout this section, we use fixed spatial strides for the $3 \times 3$ convolution layers. According to the formula of compression ratio (1), a large sampling number along the channel dimension, $K_3$, indicates a large compression ratio, while it also risks a larger loss of prediction performance. In this experiment, we set $K_3$ according to Table 1. We do not specifically tune the sampling numbers along the three dimensions, and one can choose other settings of these hyperparameters as long as their product matches the number of filters in the baseline CNN. All the $1 \times 1$ convolution layers are compressed by the same factor.

It can be observed in Table 2 that FSNet with a compact parameter space achieves accuracy comparable with that of the different baselines, namely four ResNet variants and one DenseNet variant, the latter being identified by its growth rate and number of layers. The baseline CNNs are trained with a step schedule in which the initial learning rate is divided by a constant factor when half of the total epochs are finished and again at a later point of training. We use the idea of cyclical learning rates (Smith, 2015) for training FSNet. Multiple cycles are used for training FSNet, and each cycle uses the same learning rate schedule as the baseline. A new cycle starts with the initial learning rate after the previous cycle ends. The training loss and training error for the first cycle of FSNet are shown in Figure 3, and the test loss and test error are shown in Figure 4. We can see that the training and test patterns of FSNet are similar to those of its ResNet baseline.

Figure 3: Training loss and training error of FSNet on the CIFAR-10 dataset for the ResNet baseline
Figure 4: Test loss and test error of FSNet on the CIFAR-10 dataset for the ResNet baseline

Furthermore, FSNet with weight quantization, or FSNet-WQ, boosts the compression ratio without sacrificing performance. One-time weight quantization is performed for each convolution layer of the trained FSNet. 256 levels are evenly set between the maximum and minimum values of all the elements of the FS in a convolution layer, and then each element of the FS is set to its nearest level. In this way, a quantized FS uses one byte to store each of its elements, together with the original maximum and minimum values. The number of effective parameters of FSNet-WQ is computed by counting an element of a quantized FS as $1/4$ of a parameter, since the storage required for a byte is $1/4$ of that for a floating-point number.
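The one-time weight quantization used for FSNet-WQ can be sketched as follows: evenly spaced levels between the minimum and maximum of a FS, each element snapped to its nearest level and stored in one byte. The function names `quantize_fs` and `dequantize_fs` and the toy FS are hypothetical, and 256 levels are assumed so that a code fits in a `uint8`.

```python
import numpy as np


def quantize_fs(fs, levels=256):
    """Snap every element of fs to the nearest of `levels` evenly spaced values.

    Returns the one-byte codes together with the original minimum and maximum,
    which are all that needs to be stored for the layer.
    """
    lo, hi = float(fs.min()), float(fs.max())
    if hi == lo:  # constant FS: every element maps to level 0
        return np.zeros(fs.shape, dtype=np.uint8), lo, hi
    step = (hi - lo) / (levels - 1)
    codes = np.clip(np.rint((fs - lo) / step), 0, levels - 1).astype(np.uint8)
    return codes, lo, hi


def dequantize_fs(codes, lo, hi, levels=256):
    """Recover the quantized weight values from the stored codes."""
    step = (hi - lo) / (levels - 1)
    return lo + codes.astype(np.float32) * step


# Toy FS with values spread evenly over [-1, 1].
fs = np.linspace(-1.0, 1.0, 1000).reshape(10, 10, 10)
codes, lo, hi = quantize_fs(fs)
restored = dequantize_fs(codes, lo, hi)
print(codes.dtype, codes.nbytes)  # uint8 1000
print(float(np.abs(restored - fs).max()) <= (hi - lo) / 255 / 2 + 1e-6)  # True
```

The reconstruction error of each weight is bounded by half a quantization step, which is consistent with the observation that the quantization is applied once after training without further fine-tuning.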

FSNet-WQ achieves a high compression ratio for all four types of ResNet, with only a small accuracy drop for ResNet and DenseNet. It is interesting to observe that FSNet-WQ even enjoys slightly better accuracy than FSNet for two of the ResNet baselines. In addition, FSNet-WQ achieves exactly the same accuracy as the baseline for DenseNet. We argue that weight quantization imposes regularization on the filter summary, which may improve its prediction performance.

In order to evaluate FSNet on a large-scale dataset, Table 3 shows its performance on the ILSVRC-12 dataset (Russakovsky et al., 2015) using ResNet-50 as baseline. The accuracy is reported on the standard 50k validation set. We train ResNet-50 and the corresponding FSNet for the same number of epochs, using a step schedule that divides the initial learning rate by a constant factor at fixed epochs.
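The learning rate schedules described in this section, a step schedule for the baselines and the same schedule restarted in cycles for FSNet, can be sketched as below. All numeric values (initial learning rate 0.1, division by 10, milestones at 50% and 75% of the epochs, 100 epochs per cycle) are assumptions made for illustration, as are the function names.

```python
def step_lr(epoch, total_epochs, base_lr, milestones=(0.5, 0.75), factor=0.1):
    """Step schedule: multiply base_lr by `factor` at each fractional milestone."""
    drops = sum(epoch >= m * total_epochs for m in milestones)
    return base_lr * factor ** drops


def cyclical_step_lr(epoch, epochs_per_cycle, base_lr, **kwargs):
    """Restart the same step schedule from base_lr at the start of every cycle."""
    return step_lr(epoch % epochs_per_cycle, epochs_per_cycle, base_lr, **kwargs)


# One cycle of the (hypothetical) baseline schedule.
print([round(step_lr(e, 100, 0.1), 6) for e in (0, 50, 75)])  # [0.1, 0.01, 0.001]

# Second cycle of the cyclical variant: the rate jumps back to the initial value.
print([round(cyclical_step_lr(e, 100, 0.1), 6) for e in (99, 100, 150)])  # [0.001, 0.1, 0.01]
```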

#Filters           12   16   32   64   128   256   512
Sampling number     2    4    2    4     4     4     8
Table 1: Sampling number along the channel dimension for FSNet
Table 2: Performance of FSNet on the CIFAR-10 dataset. Columns: model; number of parameters and accuracy before compression; number of parameters, accuracy and compression ratio of FSNet; number of parameters, accuracy and compression ratio of FSNet-WQ. Rows: four ResNet baselines and one DenseNet baseline.

Table 3: Performance of FSNet on ImageNet. Columns: model, number of parameters, top-1 accuracy and top-5 accuracy, for the ResNet-50 baseline.

3.2 FSNet for Object Detection

We evaluate the performance of FSNet for object detection in this subsection. The baseline neural network is the Single Shot MultiBox Detector (SSD300) (Liu et al., 2016). The baseline is adjusted by adding batch normalization (Ioffe & Szegedy, 2015) layers so that it can be trained from scratch. Both SSD and FSNet are trained on the VOC training datasets, and the mean average precision (mAP) on the VOC 2007 test dataset is reported in Table 4. We employ two versions of FSNet with different compression ratios obtained by adjusting the sampling number along the channel dimension, denoted by FSNet-1 and FSNet-2 respectively. Again, weight quantization either slightly improves mAP for one version or only slightly hurts it for the other. Compared to Tiny SSD (Wong et al., 2018), FSNet with weight quantization enjoys a smaller parameter space while its mAP is much better. Note that the number of effective parameters of Tiny SSD is only half of its reported number of parameters, as the parameters are stored in half-precision floating point. In addition, the model size of the quantized FSNet is smaller than that of Tiny SSD.

Table 4: Performance of FSNet on object detection. Columns: model, number of parameters and mAP. Rows: Tiny SSD (Wong et al., 2018), FSNet-1, FSNet-1-WQ, FSNet-2 and FSNet-2-WQ.
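The effective-parameter accounting used when comparing models stored at different precisions can be made explicit with a short sketch. It takes a 32-bit floating-point number as the unit of one parameter, so a half-precision value counts as half a parameter and a one-byte quantized value as a quarter. The function names and the parameter count of one million are hypothetical and are not taken from the tables.

```python
def effective_params(n_values, bits_per_value, reference_bits=32):
    """Number of stored values expressed in units of one 32-bit float parameter."""
    return n_values * bits_per_value / reference_bits


def model_size_mb(n_values, bits_per_value):
    """Storage of the weights in megabytes (10**6 bytes)."""
    return n_values * bits_per_value / 8 / 1e6


n = 1_000_000  # hypothetical number of stored weights
for name, bits in (("float32", 32), ("float16", 16), ("8-bit quantized", 8)):
    print(name, effective_params(n, bits), model_size_mb(n, bits))
# float32 1000000.0 4.0
# float16 500000.0 2.0
# 8-bit quantized 250000.0 1.0
```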

4 Conclusion

We present a novel method for compression of CNNs through learning weight sharing by Filter Summary (FS). Each convolution layer of the proposed FSNet learns a FS from which the convolution filters are extracted, and nearby filters share weights naturally. By virtue of the weight sharing scheme, FSNet enjoys a much smaller parameter space than its baseline while maintaining competitive prediction performance. The compression ratio is further improved by one-time weight quantization. Experimental results demonstrate the effectiveness of FSNet in tasks of image classification and object detection.


We would like to express our gratitude to Boyang Li and Pradyumna Tambwekar for their efforts in training the baseline SSD on the VOC dataset.


  • Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016. doi: 10.1109/CVPR.2016.90.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, pp. 2261–2269, 2017.
  • Ioannou et al. (2016) Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training cnns with low-rank filters for efficient image classification. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456, 2015.
  • Jin et al. (2018) Xiaojie Jin, Yingzhen Yang, Ning Xu, Jianchao Yang, Nebojsa Jojic, Jiashi Feng, and Shuicheng Yan. Wsnet: Compact and efficient networks through weight sampling. In International Conference on Machine Learning, ICML, Stockholmsmässan, Stockholm, Sweden, pp. 2357–2366, 2018.
  • Jojic et al. (2003) Nebojsa Jojic, Brendan J. Frey, and Anitha Kannan. Epitomic analysis of appearance and shape. In 9th IEEE International Conference on Computer Vision ICCV, Nice, France, pp. 34–43, 2003.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Liu et al. (2016) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In European Conference on Computer Vision, ECCV, Amsterdam, The Netherlands, pp. 21–37, 2016.
  • Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, ICCV, Venice, Italy, 2017.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Smith (2015) Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. arXiv e-prints, art. arXiv:1506.01186, June 2015.
  • Wong et al. (2018) Alexander Wong, Mohammad Javad Shafiee, Francis Li, and Brendan Chwyl. Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for Real-time Embedded Object Detection. arXiv e-prints, art. arXiv:1802.06488, February 2018.
  • Yang et al. (2018) Yingzhen Yang, Jianchao Yang, Ning Xu, and Wei Han. Learning 3D-FilterMap for Deep Convolutional Neural Networks. arXiv e-prints, art. arXiv:1801.01609, January 2018.
  • Yu et al. (2017) Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 2017.