Deep Model Compression via Filter Auto-sampling

07/12/2019, by Daquan Zhou et al.

The recent WSNet [1] is a new model compression method that samples filter weights from a compact set and has been demonstrated to be effective for 1D convolutional neural networks (CNNs). However, the weight sampling strategy of WSNet is handcrafted and fixed, which may severely limit the expressive ability of the resulting CNNs and weaken its compression ability. In this work, we present a novel auto-sampling method that is applicable to both 1D and 2D CNNs with significant performance improvement over WSNet. Specifically, our proposed auto-sampling method learns the sampling rules end-to-end instead of being independent of the network architecture design. With such differentiable weight sampling rule learning, the sampling stride and channel selection from the compact set are optimized to achieve a better trade-off between model compression rate and performance. We demonstrate that at the same compression ratio, our method outperforms WSNet by 6.5%. Moreover, our method outperforms the MobileNetV2 full model by 1.47% in top-1 accuracy with 25% fewer FLOPs. With the same backbone architecture, our method even outperforms some neural architecture search (NAS) based methods such as AMC [2] and MNasNet [3].


1 Introduction

Despite the great performance improvements brought to many fields, deep convolutional neural networks (CNNs) typically suffer from large model complexity [4]. In particular, since the deep residual network [5] was proposed, the model complexity of CNNs has been increasing rapidly. The resulting high memory and computation costs hinder the deployment of CNNs on low-end devices, such as mobile phones, which have limited memory and computation resources. Several compact CNN models requiring fewer resources have been proposed recently [6, 7, 8]. Among these models, the weight sampling based model WSNet [1] reduces the memory and computation cost significantly on 1D CNNs for speech processing, and shows great potential for compressing 2D and 3D CNNs for image and video processing.

WSNet [1] compresses CNNs by allowing convolution weights to be shared between two adjacent 1D convolution kernels through sampling the weights from a compact set. However, WSNet is limited to audio processing for three main reasons. First, WSNet deploys handcrafted hyper-parameters for the parameter sharing. This works well on the ESC-50 sound dataset [9] since the sound signal is redundant in the time domain; however, such redundancy patterns are much more complicated for images and videos, so it is impractical to handcraft the weight sharing and sampling procedures. Second, WSNet compresses the 1D convolution along the spatial dimension by reusing parameters from a compact weight set. This only works well on convolution layers with large kernel sizes, whereas recent compact CNNs such as MobileNet [6], MobileNetV2 [7] and ShuffleNetV2 [8] deploy small 3×3 and 1×1 convolution kernels. Third, WSNet repeats sampling along the channel dimension to reuse the convolution parameters, which does not apply well to more complex images and videos.

In this paper, we aim to alleviate the above limitations of WSNet and make the sampling procedure fully learnable and extensible to 2D or even higher-dimensional data. Specifically, we design a small neural network to learn the sampling rules instead of handcrafting them. This reduces the hyper-parameter tuning complexity and ensures that the hyper-parameters are optimized together with the neural network towards better performance. Besides, we extend the weight sampling to the channel dimension to remove the constraint on the spatial kernel size and to increase the representation capability of the sampled filters along the channel dimension. To tackle the issue that the produced sampling positions within the compact set may be fractional, we develop a principled method that resolves fractional outputs by bilinear interpolation between adjacent sampling positions in the compact set. This transformation ensures that the sampling process is differentiable.

The proposed auto-sampling method can be encapsulated into a single convolution layer as a drop-in replacement. We apply it to compressing a variety of CNN models. To demonstrate its efficacy, we first conduct experiments on the ESC-50 dataset [9] to verify the improvements over WSNet on 1D convolution. With the same compression ratio, our method outperforms WSNet by 6.5%. We then experiment on CIFAR-10 [10] and ImageNet [11] to demonstrate the improvements on 2D convolutions. On ImageNet, our method outperforms the MobileNetV2 full model by 1.47% in classification accuracy with 25% FLOPs reduction, and outperforms the MobileNetV2-0.35 baseline by an even larger margin. With the same backbone architecture as the baseline models, our method even outperforms some neural architecture search (NAS) based methods such as AMC [2] and MNasNet [3].

2 Related work

Recent popular model compression methods are based on network pruning [4, 12, 13], weight quantization [14, 15, 16], knowledge distillation [17, 18] and compact network design [1, 6, 7, 8, 19, 20]. Network pruning can achieve significant compression with a properly designed neuron importance function. An AutoML method [2] has been proposed that uses reinforcement learning (RL) to prune the network automatically. However, the extra reinforcement learning agent needs to be designed separately, and training it consumes much computation. In contrast, our method does not use any extra networks and learns to reuse the parameters end-to-end. Quantization methods can compress the network aggressively but typically suffer a large performance drop [21, 16]. Knowledge distillation transfers the knowledge of a teacher network to a more compact student network; however, a teacher network may be hard to obtain for new tasks.

Recently, the Slimmable network [22] uses the width multiplier to adjust the layer width. Separate batch normalization layers, denoted as switches, are generated for each width multiplier. By using these separate batch normalization layers, the performance at each switch can be further improved. However, both the width multiplier and the switch method suffer significant performance drops, especially when the compression ratio is large. WSNet [1] uses a parameter-sharing scheme that allows overlapping between two adjacent filters. However, WSNet is mainly verified on 1D convolution layers, and its compression ratio is limited by the spatial size of the convolution kernel. Our method alleviates these limitations by extending the sampling along the channel dimension with a learned sampling strategy.

In this paper, we propose a novel sampling method such that the model size and computation are reduced by a specified factor, with significant performance improvement over the width multiplier and Slimmable network methods. Different from WSNet, our model learns the sampling rules over a compact weight matrix end-to-end.

3 Method

Our proposed auto-sampling method compresses CNNs by learning to sample filter parameters from a compact set. First, based on the memory and FLOPs constraints, a layer-wise compact weight matrix is defined, with a much smaller size than the corresponding layer of the vanilla model before compression. Then, sampling rules are learned to sample the filter weights from the compact matrix to constitute the actual convolution kernels. We use a small neural network to optimize the sampling rules, i.e., the initial positions and strides, along with the model training. This is significantly different from WSNet [1], which uses handcrafted and fixed sampling rules. Based on this method, the originally separate convolution filters are sampled from the compact weight matrix with shared weights, and thus the model size can be reduced. To further lower the computation cost, we propose to pre-compute a product map from the compact weight matrix, which avoids redundant convolution computation at only a small additional memory cost. We now proceed to explain the details of our proposed method, starting with how to design the compact weight matrix.

3.1 Compact weight matrix design

The size of the compact weight matrix is determined by the original model and the desired compression ratio $r$. For a CNN with $n$-dimensional convolution layers and fully connected layers, its total number of parameters can be calculated as

$N_{\text{orig}} = \sum_{l \in \text{conv}} \prod_{k=1}^{n} d_k^{(l)} + \sum_{l \in \text{fc}} I^{(l)} O^{(l)},$

where $d_k^{(l)}$ denotes the length of the convolution weight tensor along the $k$-th dimension of the $l$-th convolution layer, and $I^{(l)}$ and $O^{(l)}$ denote the input and output dimensions of the $l$-th fully connected layer.

We assign a compact weight matrix $\Phi^{(l)}$ to each layer $l$, together with a sampling rule $g^{(l)}$, i.e., the weight value of an $n$-d filter at location $p$ equals $\Phi^{(l)}(g^{(l)}(p))$. The size of the total compact weight matrix is thus $\sum_{l} |\Phi^{(l)}|$. After learning the mapping rule for layer $l$, we store the mapping as a look-up table $T^{(l)}$. The size of all the mapping tables is $\sum_{l} |T^{(l)}|$. Hence, the compression ratio can be calculated as

$r = \frac{N_{\text{orig}}}{\sum_{l}\big(|\Phi^{(l)}| + |T^{(l)}|\big)}. \qquad (1)$

We assume a uniform compression ratio for all the layers. Thus, given a compression ratio $r$, the size of the compact weight matrix for each layer can be calculated correspondingly.
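As a rough numerical illustration of Eqn. (1) under the notation above (a minimal sketch: the look-up table overhead is simplified away and the function and variable names are hypothetical, not the paper's code), a helper for sizing the per-layer compact matrices could look like this:

```python
# Minimal sketch (assumed notation): given a target compression ratio r,
# size each layer's compact weight matrix so that the total sampled model
# is roughly r times smaller than the original. Look-up table overhead is
# ignored here for simplicity.

def compact_sizes(layer_param_counts, r):
    """layer_param_counts: #parameters per layer of the vanilla model.
    r: desired compression ratio (> 1). Returns per-layer compact-matrix sizes."""
    assert r > 1.0
    return [max(1, int(round(n / r))) for n in layer_param_counts]

def achieved_ratio(layer_param_counts, compact_counts, table_counts=None):
    """Compression ratio as in Eqn. (1): original parameters over
    (compact matrices + mapping tables)."""
    table_counts = table_counts or [0] * len(compact_counts)
    original = sum(layer_param_counts)
    compressed = sum(c + t for c, t in zip(compact_counts, table_counts))
    return original / compressed

if __name__ == "__main__":
    params = [3 * 3 * 32 * 64, 3 * 3 * 64 * 128]   # two toy conv layers
    compact = compact_sizes(params, r=4.0)
    print(compact, achieved_ratio(params, compact))  # ratio is roughly 4.0
```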

3.2 Convolution layer construction via auto-sampling

With the above defined compact weight matrix $\Phi$, our method introduces a novel sampling procedure in which the convolution filter weights are sampled from $\Phi$ with end-to-end learned sampling rules $g$. To simplify the illustration, we consider the 2D convolution case in particular. Our method can be extended to $n$-dimensional convolutions and fully connected layers straightforwardly, with details provided in the supplementary material.

For a 2D convolution layer with an input feature map $F_i \in \mathbb{R}^{W \times H \times C}$ and zero padding, it produces an output feature map $F_o \in \mathbb{R}^{W \times H \times N}$, where $W$, $H$, $C$ and $N$ denote the spatial width, height, input channels and output channels respectively. Let $P_{x,y} \in \mathbb{R}^{K_w \times K_h \times C}$ denote a patch of the input feature map, which spatially spans from $x$ to $x + K_w - 1$ and from $y$ to $y + K_h - 1$. $\mathcal{K}_j$ denotes the $j$-th filter of the convolution kernel $\mathcal{K}$, where $j \in \{1, \dots, N\}$, and $K_w$ and $K_h$ are the spatial size of the convolution kernel. The convolution output can be computed as

$F_o(x, y, j) = \sum_{u=1}^{K_w} \sum_{v=1}^{K_h} \sum_{c=1}^{C} P_{x,y}(u, v, c)\, \mathcal{K}_j(u, v, c). \qquad (2)$
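As a sanity check of Eqn. (2) (not part of the paper; this is just the standard im2col identity, written in PyTorch's NCHW layout rather than the $W \times H \times C$ notation above), the patch/filter dot-product view matches the library convolution:

```python
# Minimal sketch: Eqn. (2) as patch-filter dot products (via unfold/im2col),
# checked against the library convolution. Standard convolution only; the
# compressed, weight-sampled variant corresponds to Eqn. (3).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_out, c_in, k_h, k_w = 4, 3, 3, 3
x = torch.randn(1, c_in, 8, 8)                    # input feature map (NCHW)
w = torch.randn(n_out, c_in, k_h, k_w)            # convolution kernel

ref = F.conv2d(x, w, padding=1)                   # stride 1, zero padding

# Each output value is the dot product of an input patch with one filter.
patches = F.unfold(x, kernel_size=(k_h, k_w), padding=1)       # (1, C*Kh*Kw, H*W)
out = torch.einsum('nk,bkl->bnl', w.view(n_out, -1), patches)  # (1, N, H*W)
out = out.view(1, n_out, 8, 8)

print(torch.allclose(ref, out, atol=1e-5))        # True
```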

With our proposed method, all the weight parameters are sampled from the compact weight matrix $\Phi$ by selecting a subset of weight tensors from it, as illustrated in Figure 1. The sampling function $g$ maps the convolutional kernel parameter $\mathcal{K}_j(u, v, c)$ to the value $\Phi(g(u, v, c, j))$. Thus, the above convolution is replaced as follows,

$F_o(x, y, j) = \sum_{u=1}^{K_w} \sum_{v=1}^{K_h} \sum_{c=1}^{C} P_{x,y}(u, v, c)\, \Phi\big(g(u, v, c, j)\big). \qquad (3)$

Instead of designing the sampling rules manually as in WSNet [1], we use a small neural network to optimize the mapping $g$ in order to maximize the model performance. More details can be found in Section 4 of the supplementary material.
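The architecture of this small network is deferred to the supplementary material, so the following is only a hypothetical sketch (all names and sizes are illustrative assumptions): a tiny MLP turns a learnable per-filter embedding into fractional sampling positions inside a flattened compact weight matrix.

```python
# Hypothetical sketch of the small position-predicting network; the actual
# architecture used by the paper is described in its supplementary material.
import torch
import torch.nn as nn

class PositionNet(nn.Module):
    """Maps a learnable embedding per filter to fractional positions
    (one index per kernel element) inside a flattened compact weight matrix."""
    def __init__(self, num_filters, kernel_elems, compact_size, hidden=32):
        super().__init__()
        self.embed = nn.Parameter(torch.randn(num_filters, hidden) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, kernel_elems), nn.Sigmoid(),  # outputs in (0, 1)
        )
        self.compact_size = compact_size

    def forward(self):
        # Scale the (0, 1) outputs to fractional positions in [0, compact_size - 1].
        return self.mlp(self.embed) * (self.compact_size - 1)

positions = PositionNet(num_filters=64, kernel_elems=3 * 3 * 16, compact_size=256)()
print(positions.shape)   # torch.Size([64, 144])
```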

However, as the output of a neural network is fractional, it cannot be used as a position index directly. We solve this problem by weighting the values at the nearest integer positions by the fractional portion of the output and summing them as the final mapped value:

$\Phi(p) = \sum_{q} G(q, p)\, \Phi(q), \qquad (4)$

where $p$ is the fractional position produced by the mapping network, $q$ enumerates over all integral locations within $\Phi$, and $G(q, p) = \prod_{k} \max(0,\, 1 - |q_k - p_k|)$ is the (bi-)linear interpolation kernel. As most of the values returned by $G$ are zero, Eqn. (4) is fast to compute. This makes the mapping rule differentiable and end-to-end optimizable. In the following analysis, we allow position indices to be fractional; although not explicitly written, we implement Eqn. (4) to handle them.
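A minimal sketch of Eqn. (4) in the 1D case (assuming a flattened compact weight vector; names are illustrative, not the paper's code): fractional indices are resolved by linear interpolation between the two nearest integer positions, so gradients flow both to the compact weights and to the predicted positions.

```python
# Minimal sketch of Eqn. (4) in 1D: fractional indices are resolved by linear
# interpolation between the two nearest integer positions, so gradients flow
# both to the compact weights and to the predicted positions.
import torch

def sample_compact(compact, pos):
    """compact: (S,) compact weight vector; pos: fractional indices in [0, S-1]."""
    pos = pos.clamp(0, compact.numel() - 1)
    lo = pos.detach().floor().long()
    hi = (lo + 1).clamp(max=compact.numel() - 1)
    frac = pos - lo.float()
    return (1.0 - frac) * compact[lo] + frac * compact[hi]

compact = torch.randn(16, requires_grad=True)
pos = torch.tensor([2.3, 7.9, 7.9], requires_grad=True)   # shared position -> shared weight
w = sample_compact(compact, pos)
w.sum().backward()
print(w)                  # interpolated weights
print(compact.grad[7:9])  # gradient accumulated from both uses of position 7.9
print(pos.grad)           # positions also receive gradients
```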

Figure 1: Our proposed auto-sampling process from the compact weight matrix, which learns the position mapping function $g$ between the compact weight matrix and the convolution kernel elements. Kernels that are sampled from the same location in the compact weight matrix share their weights. The mapping function is illustrated in more detail in the supplementary material.

Computation cost reduction

Since the weight parameters of a convolution layer are all sampled from the same compact matrix $\Phi$, there is computation redundancy when two sampled convolution kernels overlap in $\Phi$. To reuse the multiplications during the convolution, we first compute the product between each parameter in the compact weight matrix and the input feature map. The results are saved as a product map, so that the computation involving the same parameter of the compact weight matrix is done only once and subsequent calculations reuse the results directly. However, computing such a product map over the full input channel dimension would consume a large amount of memory. To reduce its size, we add a channel dimension of size $C'$ to the compact weight matrix in order to group the computation results in the product map. Each entry in the product map is calculated as the dot product between the grouped channel dimension of the input feature map and the third dimension of the compact weight matrix. We set $C'$ to be smaller than the number of input channels $C$ to further boost the compression. As illustrated in Figure 2, the input channels of the feature map are first grouped via

$\tilde{F}_i(x, y, m) = \sum_{c:\, p(c) = m} F_i(x, y, c), \qquad m = 1, \dots, C', \qquad (5)$

Figure 2: Sampling process along the input channel dimension. The input feature map is first grouped based on the sampling positions of the kernels: input channels that are multiplied with the same weight kernel are first added together, so that the multiplication with that shared kernel is performed only once.

where $r_{\text{in}} = C / C'$ is the compression ratio along the input channel dimension and $p(c) \in \{1, \dots, C'\}$ is the learned position in the compact weight matrix from which input channel $c$ samples its weights. The transformed input feature map $\tilde{F}_i$ has the same number of channels, $C'$, as the compact weight matrix. Let $(x, y, m)$ index the transformed feature map location. The product map is then calculated as

$M(x, y, s, t) = \sum_{m=1}^{C'} \tilde{F}_i(x, y, m)\, \Phi(s, t, m), \qquad (6)$

where $(s, t)$ indexes a spatial location in the compact weight matrix $\Phi \in \mathbb{R}^{S_w \times S_h \times C'}$.

The multiplication can be reused by replacing the convolution kernel with the product map. Based on Eqns. (3), (5), (6), the convolution can be calculated as

$F_o(x, y, j) = \sum_{u=1}^{K_w} \sum_{v=1}^{K_h} M\big(x + u - 1,\; y + v - 1,\; g_j(u, v)\big), \qquad (7)$

where $g_j(u, v)$ gives the (possibly fractional) location in $\Phi$ from which filter $j$ samples its weight at kernel offset $(u, v)$.
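The saving exploited by Eqns. (5)-(7) can be illustrated with a toy example (a scalar shared weight per channel group, with a fixed `assign` array standing in for the learned positions; this is not the paper's full product-map construction): summing channels that share a weight before the multiplication gives the same result with fewer multiplications.

```python
# Minimal, illustrative sketch of the reuse idea behind Eqns. (5)-(7):
# input channels whose weights are sampled from the same compact-matrix
# position are summed first, so each shared weight is multiplied only once.
import torch

torch.manual_seed(0)
C, C_compact, H, W = 8, 4, 6, 6
x = torch.randn(C, H, W)                       # input feature map (channels first)
compact_w = torch.randn(C_compact)             # one shared scalar weight per group (toy case)
assign = torch.randint(0, C_compact, (C,))     # stand-in for the learned channel-to-position mapping

# Naive: multiply every channel by its sampled weight, then sum.
naive = (compact_w[assign].view(C, 1, 1) * x).sum(dim=0)

# Reuse: group-sum channels sharing a weight (Eqn. (5)), then one multiply per group.
grouped = torch.zeros(C_compact, H, W).index_add_(0, assign, x)
reused = (compact_w.view(C_compact, 1, 1) * grouped).sum(dim=0)

print(torch.allclose(naive, reused, atol=1e-5))   # True: C multiplies reduced to C_compact
```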

To reuse the addition operations, we extend the integral image, as detailed in the original WSNet [1] paper, from 1D to 2D convolution based on the product map $M$:

(8)

From Eqn. (8), the 2D convolution results can be retrieved in a way similar to WSNet, extended to 2D convolution as follows,

(9)

As we set the corresponding dimension of the compact weight matrix to be smaller than the output channel dimension of the convolution kernel, we reuse the computation results from the compact weight matrix via

(10)

where the number of sampling steps along the output channel dimension and the learned mapping determine which entries of the compact weight matrix each output filter reuses; the sampling length is a hyper-parameter. Thus, the computation FLOPs of a transformed 2D convolution can be calculated as

(11)

Suppose we use a sliding window with stride 1 and no bias term; the FLOPs (multiply-adds) of the conventional convolution can then be calculated based on Eqn. (2):

$\text{FLOPs}_{\text{conv}} = W \cdot H \cdot K_w \cdot K_h \cdot C \cdot N. \qquad (12)$

Therefore, the total FLOPs reduction ratio is

(13)

The compression ratio for the 2D convolution layers can be calculated via Eqn. (1):

(14)

where $K_w^{(l)}$ and $K_h^{(l)}$ denote the width and height of the kernel of the $l$-th layer, and $S_w^{(l)}$ and $S_h^{(l)}$ denote the spatial size of the compact weight matrix for the $l$-th layer. From Eqn. (13) and Eqn. (14), it can be seen that the compression ratio is mainly decided by the size of the compact weight matrix.
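For reference, Eqn. (12) is straightforward to evaluate; the sketch below computes it for a standard stride-1, zero-padded convolution (the compressed layer's FLOPs in Eqn. (11) depend on the product-map and sampling configuration, so they are not reproduced here).

```python
# Minimal sketch of Eqn. (12): multiply-accumulate count of a standard 2D
# convolution with stride 1, "same" zero padding and no bias.
def conv2d_flops(width, height, k_w, k_h, in_channels, out_channels):
    """Multiply-adds of a stride-1, zero-padded 2D convolution (Eqn. (12))."""
    return width * height * k_w * k_h * in_channels * out_channels

# Example: a 3x3 convolution on a 56x56x32 feature map producing 64 channels.
print(conv2d_flops(56, 56, 3, 3, 32, 64) / 1e6, "M multiply-adds")  # about 57.8M
```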

3.3 Compact weight matrix learning

The compact weight matrix is updated together with the convolution layers through back-propagation. Since a parameter in the compact weight matrix can be sampled multiple times for the convolution computation, its gradient is the summation over all the positions where that parameter is used in the convolution. Here we use $S(p)$ to denote the set of all convolution kernel weights that are mapped to position $p$ in $\Phi$. The gradient for updating $\Phi(p)$ can thus be calculated as

$\frac{\partial \mathcal{L}}{\partial \Phi(p)} = \sum_{\mathcal{K}(k) \in S(p)} \frac{\partial \mathcal{L}}{\partial \mathcal{K}(k)}, \qquad (15)$

where $\mathcal{L}$ is the cross-entropy loss function used for training the neural network and $\mathcal{K}(k)$ denotes a kernel element that is sampled from position $p$ in $\Phi$.
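A minimal sketch of Eqn. (15) (integer indices for simplicity; with the interpolation of Eqn. (4) the same accumulation happens over the neighbouring positions): automatic differentiation already sums the gradients of all kernel elements gathered from the same compact-matrix position.

```python
# Minimal sketch of Eqn. (15): when several kernel elements are gathered from
# the same position of the compact matrix, autograd accumulates (sums) their
# gradients at that position, which matches the update rule described above.
import torch

compact = torch.randn(8, requires_grad=True)
idx = torch.tensor([2, 2, 5])          # two kernel elements share position 2
kernel = compact[idx]                   # sampled kernel weights
loss = (kernel * torch.tensor([1.0, 10.0, 100.0])).sum()
loss.backward()
print(compact.grad)                     # grad[2] = 1 + 10 = 11, grad[5] = 100
```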

4 Experiments

We first evaluate the efficacy of our method on the sound dataset ESC-50 [9], comparing it with WSNet for 1D convolutional model compression. We then test our method with MobileNetV2 as the backbone for 2D convolutions on the ImageNet dataset [11] and the CIFAR-10 dataset [10]. Detailed experiment settings can be found in the supplementary material. We evaluate our method in terms of three criteria: model size, computation cost and classification performance.

4.1 1D CNN compression

We use the same 8-layer CNN model as used in WSNet [1] for a fair comparison. The compression ratio in WSNet is decided by the stride (S) and the repetition times along the channel dimension (C), as shown in Table 1. From Table 1, one can see that with the same compression ratio, our method outperforms WSNet by 6.5% in classification accuracy. Compared to WSNet, our method enables learning proper weight-sharing patterns by sampling filters with flexible overlapping, overcoming the limitation of WSNet in which the sampling stride is fixed. More results can be found in the supplementary material.

Method     Conv1    Conv2    Conv3    Conv4    Conv5    Conv6    Conv7    Conv8    Acc.(%)
           S   C    S   C    S   C    S   C    S   C    S   C    S   C    S   C
baseline   1   1    1   1    1   1    1   1    1   1    1   1    1   1    1   1    66.0
WSNet      8   1    4   1    2   2    1   2    1   4    1   8    1   8    1   8    66.5
Ours       8   1    4   1    2   2    1   2    1   4    1   8    1   8    1   8    73.0

Table 1: Comparison with WSNet on the ESC-50 dataset. We use the same configuration but make the sampling stride learnable. ‘S’ denotes the stride and ‘C’ denotes the repetition times along the input channel dimension. Our method uses the ‘S’ values of WSNet as initial values and learns offsets over these initial values via a small network.
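A hypothetical sketch of the learnable-stride idea described in the caption of Table 1 (module and parameter names are illustrative, not the paper's code): each 1D filter's start position in the compact set is initialized with WSNet's fixed stride and then adjusted by a learned offset.

```python
# Hypothetical sketch of the 1D setting in Table 1: sampling positions start
# from WSNet's fixed stride 'S' and a small learnable offset is added to each
# filter's start position. Names and shapes are illustrative.
import torch
import torch.nn as nn

class LearnableStride1D(nn.Module):
    """Start position of each 1D filter in the compact set = fixed stride * index + learned offset."""
    def __init__(self, num_filters, init_stride, compact_len, kernel_len):
        super().__init__()
        self.register_buffer("base", init_stride * torch.arange(num_filters, dtype=torch.float32))
        self.offset = nn.Parameter(torch.zeros(num_filters))    # learned, starts at 0
        self.max_start = compact_len - kernel_len

    def forward(self):
        # Fractional start positions, clamped to stay inside the compact set.
        return (self.base + self.offset).clamp(0, self.max_start)

starts = LearnableStride1D(num_filters=10, init_stride=8, compact_len=128, kernel_len=9)()
print(starts)   # initially 0, 8, 16, ..., 72; the offsets move them during training
```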

4.2 2D CNN compression

Implementation details

We use MobileNetV2 [7] as our backbone for the evaluation on 2D convolutions; it is a state-of-the-art compact model and achieves competitive performance on ImageNet and CIFAR-10 with much fewer parameters and FLOPs than ResNet [5] and VGGNet [23]. During training, we use a batch size of 160 and a starting learning rate of 0.045 with a step decay of 0.98 for all experiments. We train the network with the Adam [24] optimizer for 480 epochs.
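Assuming the 0.98 decay is applied once per epoch (the paper does not state the decay interval), the reported training setup could be configured as follows; the backbone and data pipeline below are stand-ins, not the paper's code.

```python
# Minimal sketch of the reported training setup (MobileNetV2-style backbone,
# Adam, initial LR 0.045 decayed by 0.98 per epoch, 480 epochs, batch size 160).
# The model, dataset and loss are placeholders, not the paper's actual code.
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(num_classes=1000)          # stand-in backbone
optimizer = torch.optim.Adam(model.parameters(), lr=0.045)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
criterion = nn.CrossEntropyLoss()

# Training loop (schematic):
# for epoch in range(480):
#     for images, labels in train_loader:       # batch size 160 in the paper
#         optimizer.zero_grad()
#         loss = criterion(model(images), labels)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()                           # multiply the LR by 0.98 each epoch
```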

Methods             FLOPs(M)   Parameters   Accuracy(%) (top-1)
MobileNetV2-1.0*    301        3.4M         71.8
MobileNetV2-0.75*   217        2.94M        69.14
MobileNetV2-0.5*    153        2.52M        67.22
MobileNetV2-0.35*   115        2.26M        65.18
MobileNetV2-0.18*   71         1.98M        60.70
Our method-0.75     220        2.94M        71.54
Our method-0.5      157        2.52M        69.42
Our method-0.35     120        2.26M        67.01
Our method-0.18     79         1.95M        64.48

Table 2: Results of ImageNet classification. Our method uses the vanilla MobileNetV2 as backbone. For fair comparison, we evaluate multiple width multiplier values and only apply the width multiplier on the filter dimension of the first convolution. We apply the proposed method on all the inverted residual blocks equally to disentangle the effect of architecture on performance. * denotes our own implementation.
GROUP         Methods                          FLOPs(M)   Parameters   Accuracy(%) (top-1)
60M FLOPs     MobileNetV2-0.35 [7]             59         1.7M         60.3
              S-MobileNetV2-0.35 [22]          59         3.6M         59.7
              US-MobileNetV2-0.35 [25]         59         3.6M         62.3
              MnasNet-A1 (0.35x) [3]           63         1.7M         62.4
              Ours-0.18                        79         2.0M         64.48
100M FLOPs    MobileNetV2-0.5 [7]              97         2.0M         65.4
              S-MobileNetV2-0.5 [22]           97         3.6M         64.4
              US-MobileNetV2-0.5 [25]          97         3.6M         65.1
              Ours-0.35                        120        2.2M         67.01
200M FLOPs    MobileNetV2-0.75 [7]             209        2.6M         69.8
              S-MobileNetV2-0.75 [26]          209        3.6M         68.9
              US-MobileNetV2-0.75 [25]         209        3.6M         69.6
              FBNet-A [27]                     246        4.3M         73
              AUTO-S-MobileNetV2-0.75 [26]     207        4.1M         73
              Our method-0.5
              Our method-0.75
              Our method-A
              Our method-E
300M FLOPs    MobileNetV2-1.0 [7]              300        3.4M         69.8
              EfficientNet_b0-1.0 [28]         400        5.4M
              Our method-0.75-E                           4.52M        76.242

Table 3: Comparison of our method with other state-of-the-art models on ImageNet, where our method shows superior performance over all other methods. The suffix ‘-M’ denotes that we use MobileNetV2 as our backbone and the suffix ‘-E’ denotes that we use EfficientNet_b0 as our backbone. The suffix ‘-A’ means we use a larger compression ratio for the front layers. Our method does not modify the backbone model architecture and applies a uniform compression ratio, unless specified with the suffix ‘-A’.
Figure 3: ImageNet classification accuracy of our method, MobileNetV2 baselines and other NAS-based methods including AMC [2], IGCV3 [29], MNasNet [3], ChamNet [30] and ChannelNet [31]. Our method outperforms all the other methods within the same level of FLOPs. Here, the marked baselines are our implementation, and MobileNetV2 is the original model with width multipliers of 0.35, 0.5, 0.75 and 1. The backbone models for our results are EfficientNet_b0 [28] with multipliers of 0.75 and 0.5, and MobileNetV2 with multipliers of 0.75, 0.5, 0.35 and 0.2, respectively.

Results on ImageNet

We first conduct experiments on the ImageNet dataset to investigate the effectiveness of our method. An efficient width multiplier has been used to trade off computation and accuracy in MobileNetV2 and some other compact models [8, 7, 20]; we thus adopt it in our baselines. We choose four width multiplier values commonly used in prior work [7, 20, 22] for fair comparison with the baselines. Following MobileNetV2, we apply the width multiplier to all the bottleneck blocks except the last convolution layer.

The performance of our method and the baselines is summarized in Table 2. For all the width multiplier values, our method outperforms the baseline by a significant margin. It is observed that the performance gain is larger at higher compression ratios. The performance of MobileNetV2 drops significantly when the compression ratio becomes large. Noticeably, our auto-sampling method increases the classification performance by 3.78% at the same compression ratio. This is because when the compression ratio is high, each layer in the baseline model does not have enough capacity to learn a good representation. Our method, however, can learn to generate more weight representations from the compact weight matrix.

We also compare our method with state-of-the-art model compression methods in Table 3. Since an optimized architecture tends to allocate more channels to the later layers [2, 26], we also run experiments with larger compact weight matrices for the later layers, denoted with the suffix ‘-A’ in Table 3. It is observed that our method performs even better than neural architecture search based methods, as shown in Figure 3 and Table 3. Although our method does not modify the model architecture, the transformation from the compact weight matrix to the convolution kernels optimizes the parameter allocation and enriches model capacity through optimized weight combination and sharing.

Results on CIFAR-10

We also conduct experiments on the CIFAR-10 dataset to verify the efficiency of our method. Our method achieves a large FLOPs reduction and a 5.64× model size reduction with only a 1% classification accuracy drop, outperforming NAS methods as shown in Table 4. More experiments and implementation details are shown in Table 5 in the supplementary material.

Methods            Parameters   FLOPs(M)   Accuracy(%) (top-1)
MobileNetV2-1.0    2.2M                    93.52
                   0.4M         21.59      91.70
AutoSlim           0.7M         59         93.00
AutoSlim           0.3M         28         92.00
Our method         0.39M        26.8

Table 4: Comparison with other state-of-the-art models w.r.t. classification accuracy on CIFAR-10. Our method outperforms the recently proposed AutoSlim [26], a compression method based on AutoML.

From the results, one can observe significant improvements of our method over competitive baselines, including the latest architecture search based methods. The improvement mainly comes from alleviating the performance degradation caused by insufficient model size: our method learns richer and more reasonable combinations of weight parameters and allocates suitable weight sharing among different filters. The learned transformation from the compact weight matrix to the convolution kernels increases the weight representation capability with minimal increase in memory footprint and computation cost. This property distinguishes our method from WSNet and improves the model performance significantly.

5 Conclusion

We present a novel sampling method which reuses parameters efficiently to reduce the model size and FLOPs with minimal classification accuracy drop, or even increased accuracy in certain cases. Motivated by the sampling method introduced in WSNet, we propose to learn the overlap between filters, removing the constraints of the WSNet sampling scheme and extending it to 2D or even higher-dimensional data. We demonstrate the effectiveness of the method on CIFAR-10 and ImageNet with extensive experimental results.

References