1 Introduction
Despite the great performance improvements brought to many fields, deep convolutional neural networks (CNNs) typically suffer from large model complexity [4]. Especially since the deep residual network [5] was proposed, the model complexity of CNNs has been increasing rapidly. The resulting high memory and computation costs hinder the deployment of CNNs on low-end devices, such as mobile phones, with limited memory and computation resources. Several compact CNN models requiring fewer resources have been proposed recently [6, 7, 8]. Among these models, the weight-sampling-based model WSNet [1] reduces the memory and computation cost significantly on 1D CNNs for speech processing, and shows great potential for compressing 2D and 3D CNNs for image and video processing.
WSNet [1] compresses CNNs by allowing convolution weights to be shared between two adjacent 1D convolution kernels through sampling the weights from a compact set. However, WSNet is limited to audio processing for three main reasons. First, WSNet deploys handcrafted hyperparameters for the parameter sharing. This works well on the ESC50 sound dataset [9] since the sound signal is redundant in the time domain. However, the redundancy pattern is much more complicated for images and videos, so it is impractical to handcraft the weight sharing and sampling procedures. Second, WSNet compresses the 1D convolution along the spatial dimensions by reusing the parameters from a compact weight set. This only works well on convolution layers with large kernel sizes; however, recent compact CNNs such as MobileNet [6], MobileNetV2 [7] and ShuffleNetV2 [8] deploy small convolution kernels. Third, WSNet repeats sampling along the channel dimension to reuse the convolution parameters, which does not apply well to more complex images and videos.
In this paper, we aim to alleviate the above limitations of WSNet and make the sampling procedure fully learnable and extensible to 2D or even higher-dimensional data. Specifically, we design a small neural network to learn the sampling rules instead of handcrafting them. This reduces the hyperparameter tuning complexity and ensures that the hyperparameters are optimized together with the neural network towards better performance. Besides, we extend the weight sampling to the channel dimension to remove the constraints on the spatial kernel size and increase the representation capability of the sampled filters along the channel dimension. To tackle the issue that the produced sampling position within the compact set may be fractional, we develop a principled method to transform the fractional outputs into an interpolation between adjacent sampling positions in the compact set. This transformation ensures that the sampling process is differentiable.
The proposed autosampling method can be encapsulated into a single convolution layer as a drop-in replacement. We apply it to compressing a variety of CNN models. To demonstrate its efficacy, we first conduct experiments on the ESC50 dataset [9] to verify the improvements over WSNet on 1D convolution. With the same compression ratio, our method outperforms WSNet by 6.5%. We then experiment on CIFAR10 [10] and ImageNet [11] to demonstrate the improvements on 2D convolutions. On ImageNet, our method outperforms the MobileNetV2 full model in classification accuracy with reduced FLOPs, and the MobileNetV2-0.35 baseline by an even larger margin. With the same backbone architecture as the baseline models, our method even outperforms some neural architecture search (NAS) based methods such as AMC [2] and MnasNet [3].
2 Related work
Recent popular model compression methods are based on network pruning [4, 12, 13], weight quantization [14, 15, 16], knowledge distillation [17, 18] and compact network design [1, 6, 7, 8, 19, 20]. Network pruning can achieve significant compression results with a properly designed neuron importance function. An AutoML method [2] uses reinforcement learning (RL) to prune the network automatically; however, the extra reinforcement network needs to be designed separately, and training it consumes considerable computation. In contrast, our method does not use any extra networks and learns to reuse the parameters end-to-end. Quantization methods can compress the network aggressively but typically suffer a large performance drop [21, 16]. Knowledge distillation transfers the knowledge of a teacher network to a more compact student network; however, a teacher network may be hard to obtain for new tasks. Recently, the Slimmable network [22] uses the width multiplier to adjust the layer width. Separate batch normalization layers, denoted as switches, are generated for each width multiplier; using them further improves the performance at each switch. However, both the width multiplier and the switch method suffer a significant performance drop, especially when the compression ratio is large. WSNet [1] uses a parameter-sharing scheme that allows overlapping between two adjacent filters. However, WSNet is mainly verified on 1D convolution layers, and its compression ratio is limited by the spatial size of the convolution kernel. Our method alleviates these limitations by extending the sampling along channel dimensions with a learned sampling strategy. In this paper, we propose a novel sampling method such that the model size and computation are reduced by a specified factor with significant performance improvement over the width multiplier and Slimmable network methods. Different from WSNet, our model can learn the sampling rules from a compact weight matrix end-to-end.
3 Method
Our proposed autosampling method compresses CNNs through learning to sample filter parameters from a compact set. First, based on the memory and FLOPs constraints, a layer-wise compact weight matrix is defined with a much smaller size than the corresponding layer of the vanilla model before compression. Then, sampling rules are learned to sample the filter weights from the compact matrix to constitute the actual convolution kernels. We use a small neural network to optimize the sampling rules on the initial positions and strides along with the model training. This is significantly different from WSNet [1], which uses handcrafted and fixed sampling rules. Based on this method, the originally separate convolution filters are sampled from the compact weight matrix with shared weights, and thus the model size can be reduced. To further lower the computation cost, we propose to precompute a product map from the compact weight matrix, which avoids redundant convolution computation at small extra cost. We now proceed to explain the details of our proposed method, starting with how to design the compact weight matrix.
3.1 Compact weight matrix design
The size of the compact weight matrix is determined by the original model and the desired compression ratio. For a CNN with multi-dimensional convolution layers and fully connected layers, its total number of parameters is the sum, over all layers, of the products of the convolution weight tensor lengths along each dimension, plus the products of the input and output dimensions of the fully connected layers. We assign a compact weight matrix and a sampling rule to each layer, i.e., the weight value of a filter at a given location equals the value of the compact matrix at the mapped position. After learning the mapping rule for a layer, we store the mapping as a lookup table. The total storage is thus the size of all compact weight matrices plus the size of all mapping tables, and the compression ratio can be calculated as
(1) 
We assume a uniform compression ratio for all the layers. Thus, given a target compression ratio, the size of the compact weight matrix for each layer can be calculated correspondingly.
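As a concrete illustration of Eqn. (1), the compression ratio is the original parameter count divided by the total size of the compact weight matrices plus the lookup tables. The sketch below is our own toy example (the layer sizes are made up, and `compression_ratio` is a hypothetical helper, not the paper's code):

```python
# Illustrative only: overall compression ratio as
# (original parameters) / (compact-matrix entries + lookup-table entries).

def compression_ratio(original_sizes, compact_sizes, table_sizes):
    """original_sizes: parameter count per layer of the vanilla model.
    compact_sizes: entries in each layer's compact weight matrix.
    table_sizes: entries in each layer's sampling lookup table."""
    original = sum(original_sizes)
    compressed = sum(compact_sizes) + sum(table_sizes)
    return original / compressed

# A toy two-layer model: 27k parameters stored as 3k compact entries.
r = compression_ratio([9000, 18000], [1000, 2000], [0, 0])
```

With a uniform target ratio, the same formula can be inverted per layer to size each compact matrix.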
3.2 Convolution layer construction via autosampling
With the above defined compact weight matrix, our method introduces a novel sampling procedure where the convolution filter weights are sampled from it with end-to-end learned sampling rules. To simplify the illustration, we consider the 2D convolution case in particular. Our method extends straightforwardly to higher-dimensional convolution and fully connected layers, with details provided in the supplementary material.
For a 2D convolution layer with a zero-padded input feature map, the layer produces an output feature map whose dimensions are the spatial width, the spatial height, and the numbers of input and output channels, respectively. Consider a patch of the input feature map that spatially spans the receptive field of one output position, and one filter of the convolution kernel, whose spatial extent is the kernel size. The convolution output can be computed as
(2) 
With our proposed method, all the weight parameters are sampled from the compact weight matrix by selecting a subset of weight tensors from it, as illustrated in Figure 1. The sampling function maps each convolutional kernel parameter to a position in the compact weight matrix. Thus, the above convolution is replaced as follows,
(3) 
Instead of designing the sampling rules manually as in WSNet [1], we use a small neural network to optimize the mapping in order to maximize the model performance. More details can be found in Section 4 of the supplementary material.
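As a rough illustration of what such a sampler could look like, the sketch below uses a tiny two-layer network whose output is squashed into the valid index range of a compact matrix. The architecture, sizes and names here are all our assumptions for illustration, not the paper's actual network:

```python
import numpy as np

# Hypothetical sketch: a per-filter embedding is mapped through one hidden
# layer to a fractional sampling position inside a compact matrix of size
# k_size. All sizes below are made up.
rng = np.random.default_rng(0)

def predict_positions(embed, w1, w2, k_size):
    h = np.tanh(embed @ w1)                            # hidden features
    raw = h @ w2                                       # unbounded scores
    return 1.0 / (1.0 + np.exp(-raw)) * (k_size - 1)   # squash into [0, k_size - 1]

embed = rng.standard_normal((16, 8))   # one learnable embedding per filter
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((4, 1))
pos = predict_positions(embed, w1, w2, k_size=256)     # 16 fractional indices
```

Because the outputs are fractional, they cannot index the compact matrix directly, which motivates the interpolation discussed next.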
However, as the output of a neural network is fractional, it cannot be used as a position index directly. We solve this problem by weighting the values at the two nearest integer positions by the fractional portion of the output and summing them as the final mapped value:
(4) 
where the interpolation weights enumerate over the integral positions adjacent to the fractional output. As most of the interpolation weights are zero, Eqn. (4) is fast to compute. This makes the mapping rule differentiable and end-to-end optimizable. In the following analysis, we allow position indices to be fractional; although not explicitly written, we implement Eqn. (4) to handle them.
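A minimal sketch of this interpolation in the 1D case (our own simplification of Eqn. (4); `sample_fractional` is a hypothetical helper, not from the paper):

```python
import numpy as np

# A fractional position p is turned into a weighted sum of the two nearest
# integer entries of the compact weight vector w. Both terms are linear in
# p and w, so the sampling stays differentiable w.r.t. both.

def sample_fractional(w, p):
    lo = int(np.floor(p))
    hi = min(lo + 1, len(w) - 1)
    frac = p - lo
    return (1.0 - frac) * w[lo] + frac * w[hi]

w = np.array([0.0, 1.0, 2.0, 3.0])
v = sample_fractional(w, 1.25)  # 0.75 * w[1] + 0.25 * w[2]
```

In 2D this becomes bilinear interpolation over the four neighboring entries, but the principle is the same.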
Computation cost reduction
Since the weight parameters of each convolution layer are sampled from the same compact matrix, there is computation redundancy when two sampled convolution kernels overlap in it. To reuse the multiplications during the convolution, we first compute the multiplication between each parameter in the compact weight matrix and the input feature map. The results are saved as a product map, so that the computation with the same parameters in the compact weight matrix is done only once and the following calculations reuse the results directly. However, a naive product map over the full compact weight matrix consumes large memory. To reduce its size, we add a third dimension to the compact weight matrix in order to group the computation results in the product map: each entry in the product map is calculated as the dot product between the channel dimension of the input feature map and the third dimension of the compact weight matrix. We set this dimension to be smaller than the number of input channels to further boost the compression. As illustrated in Figure 2, the feature map input channels are first grouped via
(5) 
where the compression ratio along the input channel dimension and the learned positions determine the grouping. The transformed input feature map has the same number of channels as the third dimension of the compact weight matrix. Let the indices below run over the transformed feature map locations. The product map is then calculated as
(6) 
The multiplications can be reused by replacing the convolution kernel with the product map. Based on Eqns. (3), (5) and (6), the convolution can be calculated as
(7) 
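The product-map idea can be sketched as follows, under our own simplifying assumptions (one pixel-wise dot product per compact row; the paper's version additionally involves the channel grouping of Eqn. (5)):

```python
import numpy as np

# Sketch of the product map in Eqn. (6): each entry is the dot product
# between an input pixel's channel vector and one row of the compact weight
# matrix, so identical weight rows are multiplied with the input only once
# and reused by every kernel that samples them.

def product_map(x, compact):
    # x: (H, W, C) input feature map; compact: (K, C) compact weight rows.
    return np.einsum('hwc,kc->hwk', x, compact)

x = np.random.rand(8, 8, 16)
compact = np.random.rand(4, 16)
p = product_map(x, compact)  # shape (8, 8, 4)
```

Every convolution output is then assembled from entries of `p` by additions alone, which is what the integral-image trick below accelerates.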
To reuse the addition operations, we extend the integral image, as detailed in Figure 2 of the original WSNet paper [1], from 1D to 2D convolution based on the product map:
(8) 
From Eqn. (8), 2D convolution results can be retrieved in a similar way as in WSNet but extended to 2D convolutions as follows,
(9) 
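The 2D integral image used here is the standard summed-area table; a sketch of how it replaces repeated additions (our own illustration, not the paper's implementation):

```python
import numpy as np

# A summed-area table over one slice of the product map lets any rectangular
# sum (the additions of one convolution window) be retrieved with at most
# four lookups instead of k*k additions.

def integral_image(p):
    return np.cumsum(np.cumsum(p, axis=0), axis=1)

def window_sum(ii, r0, c0, r1, c1):
    # Sum over the inclusive rectangle [r0..r1] x [c0..c1] via the integral
    # image ii, using the standard four-corner identity.
    total = ii[r1, c1]
    if r0 > 0:
        total = total - ii[r0 - 1, c1]
    if c0 > 0:
        total = total - ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total = total + ii[r0 - 1, c0 - 1]
    return total

p = np.arange(16.0).reshape(4, 4)
ii = integral_image(p)
```

This reduces the per-window addition cost from quadratic in the kernel size to constant.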
As the corresponding dimension of the compact weight matrix is set smaller than the output channel dimension of the convolution kernel, we reuse the computation results from the compact weight matrix via
(10) 
where the number of sampling repetitions along the output channel dimension and the learned mapping are defined as above; the sampling length is a hyperparameter. Thus, the computation FLOPs of a transformed 2D convolution can be calculated as
(11) 
Assuming a sliding window with a stride of 1 and no bias term, the FLOPs of a conventional convolution can be calculated based on Eqn. (2):
(12) 
Therefore, the total FLOPs reduction ratio is
(13) 
The compression ratio for the 2D convolution layers can be calculated via Eqn. (1):
(14) 
where the width and height of each layer's kernel and the spatial size of the corresponding compact weight matrix appear as before. From Eqn. (13) and Eqn. (14), it can be seen that the compression ratio is mainly decided by the size of the compact weight matrix relative to the original kernels.
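To make the FLOPs comparison concrete, a back-of-the-envelope sketch (our own accounting with made-up layer sizes, not the paper's exact Eqns. (11)-(13)):

```python
# Multiply-accumulate (MAC) counts for a standard stride-1 2D convolution
# versus first building a product map of K compact weight rows, whose
# entries are then reused with additions only.

def conv_macs(h, w, c_in, c_out, k):
    # One MAC per kernel element, per input channel, per output position.
    return h * w * c_out * k * k * c_in

def product_map_macs(h, w, c_in, k_rows):
    # Building the product map: one dot product per pixel per compact row.
    return h * w * k_rows * c_in

# Example: 56x56 map, 32 -> 64 channels, 3x3 kernel, 128 compact rows.
full = conv_macs(56, 56, 32, 64, 3)         # ~57.8M MACs
shared = product_map_macs(56, 56, 32, 128)  # ~12.8M MACs before reuse
```

The savings grow as more kernels share the same compact rows, since the product map is paid for only once per layer.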
3.3 Compact weight matrix learning
The compact weight matrix is updated together with the convolution layers through back-propagation. Since a parameter in the compact weight matrix can be sampled multiple times for the convolution computation, its gradient is the summation over all the positions where the parameter is used in the convolution. We use the set of all convolution kernel weights that are mapped to a given position of the compact matrix to index this summation. The gradient for updating the compact matrix can thus be calculated as
(15) 
where the loss is the cross-entropy function used for training the neural network and the summation runs over the kernel elements that are sampled from each position in the compact matrix.
4 Experiments
We first evaluate the efficacy of our method on the sound dataset ESC50 [9], comparing it with WSNet for 1D convolutional model compression. We then test our method with MobileNetV2 as the backbone for 2D convolutions on the ImageNet [11] and CIFAR10 [10] datasets. Detailed experimental settings can be found in the supplementary material. We evaluate our method in terms of three criteria: model size, computation cost and classification performance.
4.1 1D CNN compression
We use the same 8-layer CNN model as used in WSNet [1] for a fair comparison. The compression ratio in WSNet is decided by the stride (S) and the repetition times along the channel dimension (C), as shown in Table 1. From Table 1, one can see that with the same compression ratio, our method outperforms WSNet by 6.5% in classification accuracy. Compared to WSNet, our method enables learning proper weight-sharing patterns by sampling filters with flexible overlapping, overcoming the limitation of WSNet in which the sampling stride is fixed. More results can be found in the supplementary material.
Method    Conv1  Conv2  Conv3  Conv4  Conv5  Conv6  Conv7  Conv8  Acc.(%)
Config.   S C    S C    S C    S C    S C    S C    S C    S C
baseline  1 1    1 1    1 1    1 1    S 1    1 1    1 1    1 1    66.0
WSNet     8 1    4 1    2 2    1 2    S 4    1 8    1 8    1 8    66.5
Ours      8 1    4 1    2 2    1 2    S 4    1 8    1 8    1 8
4.2 2D CNN compression
Implementation details
We use MobileNetV2 [7] as our backbone for the evaluation on 2D convolutions; it is a state-of-the-art compact model and has achieved competitive performance on ImageNet and CIFAR10 with much fewer parameters and FLOPs than ResNet [5] and VGGNet [23]. During training, we use a batch size of 160 and a starting learning rate of 0.045 with step decay 0.98 for all experiments. We train the network with the Adam [24] optimizer for 480 epochs.
Methods            FLOPs(M)  Parameters  Accuracy(%) (top-1)
MobileNetV2-1.0    301       3.4M        71.8
MobileNetV2-0.75   217       2.94M       69.14
MobileNetV2-0.5    153       2.52M       67.22
MobileNetV2-0.35   115       2.26M       65.18
MobileNetV2-0.18   71        1.98M       60.70
Our method-0.75    220       2.94M       71.54
Our method-0.5     157       2.52M       69.42
Our method-0.35    120       2.26M       67.01
Our method-0.18    79        1.95M       64.48
GROUP       Methods                         FLOPs(M)  Parameters  Accuracy(%) (top-1)
60M FLOPs   MobileNetV2-0.35 [7]            59        1.7M        60.3
            S-MobileNetV2-0.35 [22]         59        3.6M        59.7
            US-MobileNetV2-0.35 [25]        59        3.6M        62.3
            MnasNet-A1 (0.35x) [3]          63        1.7M        62.4
            Ours-0.18                       79        2.0M
100M FLOPs  MobileNetV2-0.5 [7]             97        2.0M        65.4
            S-MobileNetV2-0.5 [22]          97        3.6M        64.4
            US-MobileNetV2-0.5 [25]         97        3.6M        65.1
            Ours-0.35                       120       2.2M
200M FLOPs  MobileNetV2-0.75 [7]            209       2.6M        69.8
            S-MobileNetV2-0.75 [26]         209       3.6M        68.9
            US-MobileNetV2-0.75 [25]        209       3.6M        69.6
            FBNet-A [27]                    246       4.3M        73
            AutoSlim-MobileNetV2-0.75 [26]  207       4.1M        73
            Our method-0.5
            Our method-0.75
            Our method-A
            Our method-E
300M FLOPs  MobileNetV2-1.0 [7]             300       3.4M        69.8
            EfficientNet-1.0 [28]           400       5.4M
            Our method-0.75E                          4.52M       76.242
Results on ImageNet
We first conduct experiments on the ImageNet dataset to investigate the effectiveness of our method. An efficient width multiplier has been used to trade off computation and accuracy in MobileNetV2 and some other compact models [8, 7, 20]; we thus adopt it in our baselines. We choose four common width multiplier values [7, 20, 22] for fair comparison with the baselines. Following MobileNetV2, we apply the width multiplier to all the bottleneck blocks except the last convolution layer.
The performance of our method and the baselines is summarized in Table 2. For all the width multiplier values, our method outperforms the baseline by a significant margin, and the performance gain is larger at higher compression ratios. The performance of MobileNetV2 drops significantly when the compression ratio is large. Noticeably, our autosampling method increases the classification performance by 3.78% at the same compression ratio. This is because when the compression ratio is high, each layer in the baseline model does not have enough capacity to learn good representations. Our method, however, can learn to generate more weight representations from the compact weight matrix.
We also compare our method with state-of-the-art model compression methods in Table 3. Since an optimized architecture tends to allocate more channels to the later layers [2, 26], we also run experiments with a larger compact weight matrix for the later layers, denoted with the suffix 'A' in Table 3. It is observed that our method performs even better than neural architecture search based methods, as shown in Figure 3 and Table 3. Although our method does not modify the model architecture, the transformation from the compact weight matrix to the convolution kernel optimizes the parameter allocation and enriches model capacity through optimized weight combination and sharing.
Results on CIFAR10
We also conduct experiments on the CIFAR10 dataset to verify the effectiveness of our method. Our method achieves substantial FLOPs reduction and 5.64x model size reduction with only a 1% classification accuracy drop, outperforming NAS methods as shown in Table 4. More experiments and implementation details are shown in Table 5 of the supplementary material.
Methods          Parameters  FLOPs(M)  Accuracy(%) (top-1)
MobileNetV2-1.0  2.2M                  93.52
                 0.4M        21.59     91.70
AutoSlim         0.7M        59        93.00
AutoSlim         0.3M        28        92.00
Our method       0.39M       26.8
From the results, one can observe significant improvements of our method over competitive baselines, including the latest architecture search based methods. The improvement mainly comes from alleviating the performance degradation due to insufficient model size, by learning richer and more reasonable combinations of weight parameters and allocating suitable weight sharing among different filters. The learned transformation from the compact weight matrix to the convolution kernel increases the weight representation capability with a minimal increase in memory footprint and computation cost. This property distinguishes our method from WSNet and improves the model performance significantly.
5 Conclusion
We present a novel sampling method which reuses parameters efficiently to reduce the model size and FLOPs with minimal classification accuracy drop, or even increased accuracy in certain cases. Motivated by the sampling method introduced in WSNet, we propose to learn the overlapping between filters, breaking the constraints of the WSNet sampling method and extending it to 2D or even higher-dimensional data. We demonstrate the effectiveness of the method on the CIFAR10 and ImageNet datasets with extensive experimental results.
References
 [1] Xiaojie Jin, Yingzhen Yang, Ning Xu, Jianchao Yang, Nebojsa Jojic, Jiashi Feng, and Shuicheng Yan. WSNet: Compact and efficient networks through weight sampling. arXiv preprint arXiv:1711.10067, 2017.

 [2] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
 [3] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 [4] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

 [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [8] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
 [9] Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1015–1018. ACM, 2015.
 [10] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
 [12] Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014.
 [13] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018.

 [15] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
 [16] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [18] Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
 [19] Yunpeng Chen, Haoqi Fang, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. arXiv preprint arXiv:1904.05049, 2019.
 [20] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
 [21] Yoni Choukroun, Eli Kravchik, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822, 2019.
 [22] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018.
 [23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [25] Jiahui Yu and Thomas Huang. Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134, 2019.
 [26] Jiahui Yu and Thomas Huang. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.
 [27] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
 [28] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
 [29] Ke Sun, Mingjie Li, Dong Liu, and Jingdong Wang. IGCV3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178, 2018.
 [30] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. ChamNet: Towards efficient network design through platform-aware model adaptation. arXiv preprint arXiv:1812.08934, 2018.
 [31] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. ChannelNets: Compact and efficient convolutional neural networks via channel-wise convolutions. In Advances in Neural Information Processing Systems, pages 5197–5205, 2018.