Simultaneously Learning Architectures and Features of Deep Neural Networks

06/11/2019 ∙ by Tinghuai Wang, et al. ∙ 0

This paper presents a novel method which simultaneously learns the number of filters and network features over multiple epochs. We propose a novel pruning loss that explicitly forces the optimizer to focus on promising candidate filters while suppressing the contributions of less relevant ones. In addition, we propose to enforce diversity between filters; this diversity-based regularization term improves the trade-off between model size and accuracy. It turns out that the interplay between architecture and feature optimization improves the final compressed models, and the proposed method compares favorably to existing methods, in terms of both model size and accuracy, for a wide range of applications including image classification, image compression and audio classification.




1 Introduction

Large and deep neural networks, despite their great success in a wide variety of applications, call for compact and efficient model representations to reduce the vast number of network parameters and computational operations, which are resource-hungry in terms of memory, energy and communication bandwidth. This need is especially pressing for resource-constrained devices such as mobile phones, wearables and Internet of Things (IoT) devices. Neural network compression is a set of techniques that address these challenges raised in real-life industrial applications.

Minimizing network size without compromising the original network performance has been pursued by a wealth of methods, which often adopt a three-phase learning process, i.e. training-pruning-tuning. In essence, network features are first learned, followed by a pruning stage to reduce the network size. The subsequent fine-tuning phase aims to restore the performance lost to undue pruning. This ad hoc three-phase approach, although empirically justified e.g. in [14, 17, 12, 20, 22], was recently questioned with regard to its efficiency and effectiveness. Specifically, [15, 3] argued that the network architecture should be optimized first, and that features should then be learned from scratch in subsequent steps.

In contrast to the two aforementioned opposing approaches, the present paper illustrates a novel method which simultaneously learns both the number of filters and network features over multiple optimization epochs. This integrated optimization process brings about immediate benefits and challenges. On the one hand, separate processing steps such as training, pruning and fine-tuning are no longer needed, and the integrated optimization step guarantees consistent performance for the given neural network compression scenarios. On the other hand, the dynamic change of network architecture has a significant influence on the optimization of features, which in turn might affect the optimal network architecture. It turns out that the interplay between architecture and feature optimization plays a crucial role in improving the final compressed models.

2 Related Work

Network pruning was pioneered [11, 6, 4] in the early development of neural networks, and a broad range of methods has been developed since. We focus on neural network compression methods that prune filters or channels. For a thorough review of other approaches we refer to a recent survey paper [2].

Li et al. [12] proposed to prune filters with small effects on the output accuracy and managed to reduce about one third of the inference cost without compromising the original accuracy on the CIFAR-10 dataset. Wen et al. [20] proposed a structured sparsity regularization framework, in which a group lasso constraint term was incorporated to penalize and remove unimportant filters and channels. Zhou et al. [22] also adopted a similar regularization framework, with tensor trace norm and group sparsity incorporated to penalize the number of neurons. Up to 70% of model parameters were removed without sacrificing classification accuracy on the CIFAR-10 dataset. Recently, Liu et al. [14] proposed an interesting network slimming method, which imposes L1 regularization on channel-wise scaling factors in batch-normalization layers, and demonstrated remarkable compression ratio and speedup using a surprisingly simple implementation. Nevertheless, network slimming based on scaling factors is not guaranteed to achieve the desired accuracy, and separate fine-tuning is needed to restore the reduced accuracy. Qin et al. [17] proposed a functionality-oriented filter pruning method to remove less important filters, in terms of their contributions to classification accuracy. It was shown that the effort for model retraining is moderate but still necessary, as in most state-of-the-art compression methods.

DIVNET adopted the Determinantal Point Process (DPP) to enforce diversity between individual neural activations [16]. The diversity of filter weights defined in (4) is related to the orthogonality of the weight matrix, which has been extensively studied. One example is [5], which proposed to learn Stiefel layers, which have orthogonal weights, and demonstrated their applicability in compressing network parameters. Interestingly, the notion of a diversity regularized machine (DRM) has been proposed to generate an ensemble of SVMs in the PAC learning framework [21], yet its definition of diversity is critically different from our definition in (4), and its applicability to deep neural networks is unclear.

3 Simultaneous Learning of Architecture and Feature

The proposed compression method belongs to the general category of filter-pruning approaches. In contrast to existing methods [14, 17, 12, 20, 22, 15, 3], we adopt the following techniques to ensure that simultaneous optimization of network architecture and features is technically sound. First, we introduce an explicit pruning loss estimation as an additional regularization term in the optimization objective function. As demonstrated by the experiment results in Section 4, the introduced pruning loss forces the optimizer to focus on promising candidate filters while suppressing the contributions of less relevant ones. Second, based on the importance of filters, we explicitly turn off unimportant filters below a given percentile threshold. We found that explicitly shutting down less relevant filters is indispensable to prevent biased estimation of the pruning loss. Third, we propose to enforce diversity between filters; this diversity-based regularization term improves the trade-off between model size and accuracy, as demonstrated in various applications.

Our proposed method is inspired by network slimming [14], and the main differences from this prior art are twofold: a) we introduce the pruning loss and incorporate explicit pruning into the learning process, without resorting to multi-pass pruning-retraining cycles; b) we introduce a filter-diversity based regularization term which improves the trade-off between model size and accuracy.

3.1 Loss Function

Liu et al. [14] proposed to push the scaling factor in the batch normalization (BN) step towards zero during learning; subsequently, the insignificant channels with small scaling factors are pruned. This sparsity-inducing penalty is introduced by regularizing the L1-norm of the learnable scaling parameter $\gamma$ in the BN step, i.e.,

$$\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}; \quad z_{out} = \gamma \hat{z} + \beta \quad (1)$$

in which $z_{in}$ denotes the filter inputs, $\mu_B$ and $\sigma_B^2$ the filter-wise mean and variance of the inputs, $\gamma$ and $\beta$ the scaling and offset parameters of batch normalization (BN), and $\epsilon$ a small constant to prevent numerical instability for small variance. It is assumed that there is always a BN filter appended after each convolution and fully connected filter, so that the scaling factor is directly leveraged to prune unimportant filters with small $\gamma$ values. Alternatively, we propose to directly introduce a scaling factor to each filter, since this is more universal than reusing BN parameters, especially considering networks which have no BN layers.

By incorporating a filter-wise sparsity term, the objective function to be minimized is given by:

$$\mathcal{L}(W, \Gamma) + \lambda_1 R_s(\Gamma), \quad R_s(\Gamma) = \sum_{\gamma \in \Gamma} |\gamma| \quad (2)$$

where the first term is the task-based loss, $W$ denotes the network weights, and $\Gamma$ denotes the set of scaling factors for all filters. This pruning scheme, however, suffers from two main drawbacks: 1) since scaling factors are minimized equally for all filters, it is likely that the pruned filters have non-negligible contributions and should not be removed; 2) the pruning process, i.e., architecture selection, is performed independently of the feature learning, so the performance of the pruned network is inevitably compromised and has to be recovered by single-pass or multi-pass fine-tuning, which imposes additional computational burden.
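The sparsity-regularized objective described above can be sketched in a few lines. This is a minimal illustration, assuming the sparsity term is the plain L1 norm of the per-filter scaling factors as in network slimming [14]; the names `sparsity_penalty`, `regularized_loss` and `lam` are hypothetical and not taken from the paper.

```python
def sparsity_penalty(scaling_factors):
    """L1 norm of the per-filter scaling factors."""
    return sum(abs(g) for g in scaling_factors)


def regularized_loss(task_loss, scaling_factors, lam):
    """Task loss plus lam times the L1 sparsity penalty (illustrative helper)."""
    return task_loss + lam * sparsity_penalty(scaling_factors)


# Toy values: every scaling factor is penalized equally,
# regardless of how much its filter contributes to the task.
gammas = [0.5, -0.25, 0.0, 1.0]
print(regularized_loss(task_loss=2.0, scaling_factors=gammas, lam=0.5))  # 2.875
```

Because the penalty treats all filters alike, it carries no information about which filters will actually be removed, which is the first drawback noted in the text.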

3.1.1 An integrated optimization

Let $W$, $W^p$, $W^r$ denote the sets of neural network weights for, respectively, all filters, the pruned ones and the remaining ones, i.e. $W = W^p \cup W^r$. In the same vein, $\Gamma$, $\Gamma^p$, $\Gamma^r$ denote the sets of scaling factors for all filters, the removed ones and the remaining ones respectively.

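To mitigate the aforementioned drawbacks, we propose to introduce two additional regularization terms to Eq. 2,

$$\mathcal{L}(W, \Gamma) + \lambda_1 R_s(\Gamma) + \lambda_2 R_p(\Gamma) + \lambda_3 R_d(W) \quad (3)$$

where $\mathcal{L}(W, \Gamma)$ and $R_s(\Gamma)$ are defined as in Eq. 2, the third term $R_p$ is the pruning loss and the fourth term $R_d$ is the diversity loss, both of which are elaborated below. $\lambda_1, \lambda_2, \lambda_3$ are the weights of the corresponding regularization terms.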

Figure 1: Comparison of scaling factors for three methods, i.e., the baseline with no regularization, network-slimming [14], and the proposed method with diversified filters, trained with CIFAR-10 and CIFAR-100. Note that the pruning losses defined in Eq. 3 are 0.2994, 0.0288 and 1.36e-6, respectively, for the three methods on CIFAR-10. The accuracy deterioration is 60.76% and 0% for network-slimming [14] and the proposed method, while the baseline networks completely failed after pruning, due to insufficient preserved filters at certain layers.

3.1.2 Estimation of pruning loss

The second regularization term in Eq. 3, i.e. the pruning loss (and its complement), is closely related to the performance deterioration incurred by undue pruning; in the rest of the paper we refer to it as the estimated pruning loss. The scaling factors of the pruned filters, as in [14], are determined by first ranking all scaling factors and taking those below the given percentile threshold. Incorporating this pruning loss forces the optimizer to increase the scaling factors of promising filters while suppressing the contributions of less relevant ones.

The rationale of this pruning strategy can also be empirically justified by Figure 1, in which the scaling factors of three different methods are illustrated. When the proposed regularization terms are added, we clearly observe a tendency for the scaling factors to be dominated by a small number of filters: when 70% of filters are pruned from a VGG network trained with the CIFAR-10 dataset, the estimated pruning loss equals 0.2994, 0.0288 and 1.36e-6, respectively, for the three compared methods. The corresponding accuracy deterioration is 60.76% and 0% for network-slimming [14] and the proposed method. Therefore, retraining of the pruned network is no longer needed for the proposed method, while [14] has to regain the original accuracy through single-pass or multi-pass pruning-retraining cycles.
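The ranking-and-thresholding step can be illustrated with a short sketch. The definition used here is an assumption made for illustration only: the estimated pruning loss is taken to be the share of the total absolute scaling-factor mass carried by the filters that fall below the percentile threshold. The function names are hypothetical.

```python
def split_by_percentile(scaling_factors, prune_ratio):
    """Rank filters by |gamma| and return (pruned, remaining) index lists."""
    order = sorted(range(len(scaling_factors)),
                   key=lambda i: abs(scaling_factors[i]))
    n_pruned = int(len(scaling_factors) * prune_ratio)
    return order[:n_pruned], order[n_pruned:]


def estimated_pruning_loss(scaling_factors, prune_ratio):
    """Assumed definition: sum of |gamma| over pruned filters / sum over all filters."""
    pruned, _ = split_by_percentile(scaling_factors, prune_ratio)
    total = sum(abs(g) for g in scaling_factors)
    if total == 0.0:
        return 0.0
    return sum(abs(scaling_factors[i]) for i in pruned) / total


# Scaling factors dominated by a few filters lose nothing when half are pruned,
# whereas evenly spread factors lose a quarter of their total mass.
print(estimated_pruning_loss([4.0, 0.0, 0.0, 4.0], 0.5))  # 0.0
print(estimated_pruning_loss([1.0, 1.0, 3.0, 3.0], 0.5))  # 0.25
```

Under this reading, a value near zero means that pruning removes almost none of the scaling-factor mass, which matches the qualitative behaviour reported for the proposed method.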

3.1.3 Turning off candidate filters

It must be noted that the original task loss is independent of the pruning operation. If we adopt this loss in Eq. 3, the estimated pruning loss might be seriously biased, because undue assignments of scaling factors are not penalized. It is likely that some candidate filters are assigned rather small scaling factors and nevertheless retain decisive contributions to the final classification. Pruning these filters blindly leads to serious performance deterioration: in our empirical study, we observe over 50% accuracy loss at high pruning ratios.

In order to prevent such biased pruning loss estimation, we explicitly shut down the outputs of the selected filters by setting the corresponding scaling factors to exactly zero. The adopted task loss is then evaluated with the scaling factors of the pruned filters set to zero. This way, the undue loss due to the biased estimation is reflected in the task loss, which is minimized during the learning process. We found that turning off candidate filters is indispensable.
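A minimal sketch of the turning-off step follows, assuming each filter output is multiplied by its scaling factor and using scalar outputs for brevity; `turn_off_filters` and `scaled_outputs` are hypothetical names, not the authors' code.

```python
def turn_off_filters(scaling_factors, pruned_indices):
    """Return a copy with the scaling factors of the selected filters set to exactly zero."""
    pruned = set(pruned_indices)
    return [0.0 if i in pruned else g for i, g in enumerate(scaling_factors)]


def scaled_outputs(filter_outputs, scaling_factors):
    """Scale each filter output (a scalar here, for illustration) by its factor."""
    return [g * z for g, z in zip(scaling_factors, filter_outputs)]


gammas = [0.9, 0.01, 0.5]
outputs = [1.0, 20.0, 2.0]

# Filter 1 has a tiny scaling factor but a large raw output, so it still
# contributes to the forward pass until it is explicitly switched off.
masked = turn_off_filters(gammas, pruned_indices=[1])
print(masked)                           # [0.9, 0.0, 0.5]
print(scaled_outputs(outputs, masked))  # [0.9, 0.0, 1.0]
```

Once the masked factors are used in the forward pass, any accuracy lost by removing such a filter shows up directly in the task loss during training rather than only after pruning.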

Algorithm 1: Proposed algorithm (procedure Online Pruning, looping over the training epochs)

3.1.4 Online pruning

We use a global pruning threshold, determined as a percentile over all channel scaling factors. Pruning is performed throughout the whole training process, i.e., pruning and learning are simultaneous. To this end, we compute a pruning ratio that increases linearly from the first epoch (e.g., 0%) to the last epoch (e.g., 100%), where the ultimate target pruning ratio is applied. This gives neurons sufficient time to evolve, driven by the diversity term and the pruning loss, and avoids prematurely pruning neurons that produce crucial features. Consequently, our architecture learning is seamlessly integrated with feature learning. After each pruning operation, a narrower and more compact network is obtained, and its weights are copied from the previous network.
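The schedule and the global selection can be sketched as follows. The sketch assumes that the ratio starts at zero in the first epoch and reaches the target ratio in the last one, and that scaling factors from all layers are pooled before ranking; `pruning_ratio_at` and `filters_to_prune` are hypothetical helpers.

```python
def pruning_ratio_at(epoch, num_epochs, target_ratio):
    """Linearly increase the ratio from 0 at epoch 0 to target_ratio at the last epoch."""
    if num_epochs <= 1:
        return target_ratio
    return target_ratio * epoch / (num_epochs - 1)


def filters_to_prune(layer_gammas, ratio):
    """Pool the scaling factors of all layers and pick the globally smallest fraction."""
    pooled = [(abs(g), layer, idx)
              for layer, gammas in enumerate(layer_gammas)
              for idx, g in enumerate(gammas)]
    pooled.sort()
    n_pruned = int(len(pooled) * ratio)
    return [(layer, idx) for _, layer, idx in pooled[:n_pruned]]


# Two layers with 2 and 4 filters, pruned towards a 50% target over 5 epochs.
layer_gammas = [[0.9, 0.1], [0.05, 0.7, 0.3, 0.2]]
for epoch in range(5):
    ratio = pruning_ratio_at(epoch, num_epochs=5, target_ratio=0.5)
    print(epoch, ratio, filters_to_prune(layer_gammas, ratio))
```

In a real training loop the scaling factors would change between epochs, so the selected set is recomputed each time instead of being fixed once.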

3.1.5 Filter-wise diversity

The third regularization term in Eq. 3 encourages high diversity between filter weights, as shown below. Empirically, we found that this term improves the trade-off between model size and accuracy (see the experiment results in Section 4).

We treat each filter weight at layer $l$ as a weight (feature) vector $w_i^l$ of length $k_w \times k_h \times c$, where $k_w$ and $k_h$ are the filter width and height and $c$ is the number of channels in the filter. The diversity between two weight vectors of the same length is based on the normalized cross-correlation of the two vectors:

$$div(w_i, w_j) = 1 - |\langle \bar{w}_i, \bar{w}_j \rangle| \quad (4)$$

in which $\bar{w} = w / \|w\|$ are normalized weight vectors and $\langle \cdot, \cdot \rangle$ is the dot product of two vectors. Clearly, the diversity is bounded in $[0, 1]$, with values close to 0 indicating low diversity between highly correlated vectors and values near 1 indicating high diversity between uncorrelated vectors. In particular, a diversity of 1 means that the two vectors are orthogonal to each other.

The diversity between the $N$ filters at the same layer is thus characterized by an N-by-N matrix whose elements $d_{ij} = div(w_i^l, w_j^l)$ are the pairwise diversities between weight vectors. Note that the diagonal elements $d_{ii}$ are constantly 0. The total diversity between all filters is thus defined as the sum of all elements:

$$\sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}$$
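The pairwise diversity and its layer-wise total can be sketched in plain Python. The sketch assumes the diversity is one minus the absolute normalized dot product, which is consistent with the stated bounds (0 for highly correlated vectors, 1 for orthogonal ones); filters are given as flattened, non-zero weight vectors, and the function names are hypothetical.

```python
import math


def normalize(w):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


def diversity(w_i, w_j):
    """1 - |<w_i/||w_i||, w_j/||w_j||>|: 0 for parallel vectors, 1 for orthogonal ones."""
    a, b = normalize(w_i), normalize(w_j)
    # Clamp at 0 to absorb floating-point rounding in the dot product.
    return max(0.0, 1.0 - abs(sum(x * y for x, y in zip(a, b))))


def total_diversity(filters):
    """Sum of all entries of the N-by-N pairwise diversity matrix (diagonal fixed at 0)."""
    n = len(filters)
    return sum(diversity(filters[i], filters[j])
               for i in range(n) for j in range(n) if i != j)


# Filters 0 and 2 are parallel (diversity 0); both are orthogonal to filter 1.
filters = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
print(total_diversity(filters))  # 4.0
```

Because the matrix is symmetric, each pair is counted twice in the sum; a training implementation would typically compute the same quantity with one matrix product over the normalized weight matrix.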

Models / Pruning Ratio 0.0 0.5 0.6 0.7 0.8
VGG-19 (Base-line) 0.9366 - - - -
VGG-19 (Network-slimming) - - - 0.9380 NA
VGG-19 (Ours) - 0.9353 0.9394 0.9393 0.9302
ResNet-164 (Base-line) 0.9458 - - - -
ResNet-164 (Network-slimming) - - 0.9473 NA NA
ResNet-164 (Ours) - 0.9478 0.9483 0.9401 NA
Table 1: Results on CIFAR-10 dataset
Models / Pruning Ratio 0.0 0.3 0.4 0.5 0.6
VGG-19 (Base-line) 0.7326 - - - -
VGG-19 (Network-slimming) - - 0.7348 - -
VGG-19 (Ours) - 0.7332 0.7435 0.7340 0.7374
ResNet-164 (Base-line) 0.7663 - - - -
ResNet-164 (Network-slimming) - - 0.7713 - 0.7609
ResNet-164 (Ours) - 0.7716 0.7749 0.7727 0.7745
Table 2: Results on CIFAR-100 dataset

4 Experiment Results

In this section, we evaluate the effectiveness of our method on various applications with both visual and audio data.

4.1 Datasets

For visual tasks, we adopt the ImageNet and CIFAR datasets. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. CIFAR-10 [10] consists of 50K training and 10K testing RGB images with 10 classes. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes. The input image is a 32×32 random crop from a zero-padded 40×40 image or its flipped version. For the audio task, we adopt the ISMIR Genre dataset [1], which was assembled for training and development in the ISMIR 2004 Genre Classification contest. It contains 1458 full-length audio recordings distributed across 6 genre classes: Classical, Electronic, JazzBlues, MetalPunk, RockPop, World.

4.2 Image Classification

We evaluate the performance of our proposed method for image classification on CIFAR-10/100 and ImageNet. We investigate two popular network architectures: a classical plain network, VGG-Net [18], and a deep residual network, ResNet [8]. We take a variation of the original VGG-Net, i.e., the VGG-19 used in [14], for comparison purposes. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) is adopted. As baselines, we compare with the original networks without regularization terms and with their counterparts in network-slimming [14]. For ImageNet, we adopt VGG-16 and ResNet-50 in order to compare with the original networks.

To make a fair comparison with [14], we adopt BN-based scaling factors for optimization and pruning. On CIFAR, we train all the networks from scratch using SGD with mini-batch size 64 for 160 epochs. The learning rate is initially set to 0.1 and is divided by 10 twice, at 50% and 75% of the training epochs respectively. Nesterov momentum [19] of 0.9 without dampening and weight decay are used. The robust weight initialization method proposed by [7] is adopted. We use the same channel sparsity regularization term and hyperparameter value as defined in [14].
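The step learning-rate schedule described for the CIFAR experiments can be written as a small function. The sketch assumes zero-indexed epochs and that each drop takes effect at the first epoch on or after the 50% and 75% marks; `learning_rate` is a hypothetical helper, not the authors' training code.

```python
def learning_rate(epoch, num_epochs=160, base_lr=0.1):
    """Step schedule: divide by 10 at 50% and again at 75% of the epochs."""
    lr = base_lr
    if epoch >= num_epochs * 0.5:
        lr /= 10
    if epoch >= num_epochs * 0.75:
        lr /= 10
    return lr


# Learning rate at the start, just before and at each drop, and at the end.
for epoch in (0, 79, 80, 119, 120, 159):
    print(epoch, learning_rate(epoch))
```

With 160 epochs the drops fall at epochs 80 and 120, giving rates of roughly 0.1, 0.01 and 0.001 for the three phases.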

CIFAR-10 / Methods Base-line Network-slimming [14] Ours
ACC orig. 0.9377 0.9330 0.9388
ACC pruned NA 0.3254 0.9389
Pruning loss 0.2994 0.0288 1.36e-6
CIFAR-100 / Methods Base-line Network-slimming [14] Ours
ACC orig. 0.7212 0.7205 0.75
ACC pruned NA 0.0531 0.7436
Pruning loss 0.2224 0.0569 4.75e-4
Table 3: Accuracies of different methods before (orig.) and after pruning (pruned), together with the estimated pruning loss. For CIFAR-10 and CIFAR-100, 70% and 50% of filters are pruned respectively. Note that 'NA' indicates that the baseline networks completely failed after pruning, due to insufficient preserved filters at certain layers.

4.2.1 Overall performance

The results on CIFAR-10 and CIFAR-100 are shown in Table 1 and Table 2 respectively. On both datasets, we observe that when typically 50-70% of the filters of the evaluated networks are pruned, the new networks can still achieve accuracy higher than the original network. For instance, with 70% of filters pruned, VGG-19 achieves an accuracy of 0.9393, compared to 0.9366 for the original model on CIFAR-10. We attribute this improvement to the introduced diversity between filter weights, which naturally provides discriminative feature representations in intermediate layers of the networks.

As a comparison, our method consistently outperforms network-slimming without resorting to fine-tuning or multi-pass pruning-retraining cycles. It is also worth noting that our method is capable of pruning networks at very high ratios which are not possible with network-slimming. Taking the VGG-19 network on the CIFAR-10 dataset as an example, network-slimming prunes as much as 70%, beyond which point the network cannot be reconstructed because some layers are completely destroyed. In contrast, our method is able to reconstruct a much narrower network by pruning 80% of the filters while producing a marginally degraded accuracy of 0.9302. We conjecture that this improvement is enabled by our simultaneous feature and architecture learning, which avoids pruning filters prematurely as in network-slimming, where the pruning operation (architecture selection) is isolated from the feature learning process and the performance of the pruned network can only be restored via fine-tuning.

The results on ImageNet are shown in Table 4 where we also present comparison with [9] which reported top-1 and top-5 errors on ImageNet. On VGG-16, our method provides 1.2% less accuracy loss while saving additionally 20.5M parameters and 0.8B FLOPs compared with [9]. On ResNet-50, our method saves 5M more parameters and 1.4B more FLOPs than [9] while providing 0.21% higher accuracy.

Models Top-1 Top-5 Params FLOPs
VGG-16 [9] 31.47 11.8 130.5M 7.66B
VGG-16 (Ours) 30.29 10.62 44M 6.86B
VGG-16 (Ours) 31.51 11.92 23.5M 5.07B
ResNet-50 [9] 25.82 8.09 18.6M 2.8B
ResNet-50 (Ours) 25.61 7.91 13.6M 1.4B
ResNet-50 (Ours) 26.32 8.35 11.2M 1.1B
Table 4: Results on ImageNet dataset (top-1 and top-5 errors in %)

4.2.2 Ablation study

In this section we investigate the contribution of each proposed component through ablation study.

Filter Diversity
Figure 2: (a) Scaling factors of the VGG-19 network trained with diversified filters, at various epochs during training. (b) Sorted scaling factors of the VGG-19 network trained with various pruning ratios on CIFAR-10.

Fig. 2 (a) shows the sorted scaling factors of the VGG-19 network trained with the proposed filter diversity loss at various training epochs. As training progresses, the scaling factors become increasingly sparse and the number of large scaling factors, i.e., the area under the curve, decreases. Fig. 1 shows the sorted scaling factors of the VGG-19 network for the baseline model with no regularization, network-slimming [14], and the proposed method with diversified filters, trained with CIFAR-10 and CIFAR-100. We observe significantly improved sparsity when filter diversity is introduced to the network, compared with network-slimming, as indicated by nsf. Recall that the scaling factors essentially determine the importance of filters; thus, maximizing nsf ensures that the deterioration due to filter pruning is minimized. Furthermore, the number of filters associated with large scaling factors is greatly reduced, allowing more irrelevant filters to be pruned harmlessly. This observation is quantitatively confirmed in Table 3, which lists the accuracies of the three schemes before and after pruning for both the CIFAR-10 and CIFAR-100 datasets. Retraining of the pruned network is no longer needed for the proposed method, while network-slimming has to restore the original accuracy through single-pass or multi-pass pruning-retraining cycles. The accuracy deterioration is 60.76% and 0% for network-slimming and the proposed method respectively, whilst the baseline networks completely fail after pruning, due to insufficient preserved filters at certain layers.

Online Pruning

We first empirically investigate the effectiveness of the proposed pruning loss. We train a VGG-19 network on the CIFAR-10 dataset with the pruning loss switched off and on, i.e. with its weight set to zero and to a non-zero value respectively. By adding the proposed pruning loss, we observe an improved accuracy of 0.9325, compared to 0.3254, at a pruning ratio of 70%. When pruning at 80%, the network without pruning loss cannot be constructed due to insufficient preserved filters at certain layers, whereas the network trained with pruning loss attains an accuracy of 0.9298. This experiment demonstrates that the proposed pruning loss enables online pruning which dynamically selects the architecture while evolving filters to achieve extremely compact structures.

Fig. 2 (b) shows the sorted scaling factors of VGG-19 network trained with pruning loss subject to various target pruning ratios on CIFAR-10. We can observe that given a target pruning ratio, our algorithm adaptively adjusts the distribution of scaling factors to accommodate the pruning operation. Such a dynamic evolution warrants little accuracy loss at a considerably high pruning ratio, as opposed to the static offline pruning approaches, e.g., network-slimming, where pruning operation is isolated from the training process causing considerable accuracy loss or even network destruction.

Figure 3: Network architecture for image compression.
Models PSNR Params Pruned (%) FLOPs Pruned (%)
Base-line 30.13 75888 - 46M -
Ours 29.12 (-3%) 43023 43% 23M 50%
Ours 28.89 (-4%) 31663 58% 17M 63%
Table 5: Results of image compression on CIFAR-100 dataset

4.3 Image Compression

The proposed approach is applied to an end-to-end image compression task which follows a general autoencoder architecture, as illustrated in Fig. 3. We use a general scaling layer, added after each convolutional layer, with each scaling factor initialized to 1. The evaluation is performed on the CIFAR-100 dataset. We train all the networks from scratch using Adam with mini-batch size 128 for 600 epochs. The learning rate is set to 0.001 and the MSE loss is used. The results are listed in Table 5, where both parameters and floating-point operations (FLOPs) are reported. Our method saves about 40%-60% of the parameters and 50%-60% of the computational cost with a minor loss of performance (PSNR).
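The percentages reported in Table 5 can be checked with simple arithmetic on the raw counts. The short sketch below recomputes them; `percent_reduction` is a hypothetical helper and the numbers are copied from the table.

```python
def percent_reduction(baseline, pruned):
    """Relative reduction of `pruned` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - pruned) / baseline


# Parameter counts: baseline 75888, pruned models 43023 and 31663.
print(round(percent_reduction(75888, 43023)))  # 43
print(round(percent_reduction(75888, 31663)))  # 58

# FLOPs in millions: baseline 46, pruned models 23 and 17.
print(round(percent_reduction(46, 23)))        # 50
print(round(percent_reduction(46, 17)))        # 63

# PSNR: baseline 30.13, pruned models 29.12 and 28.89.
print(round(percent_reduction(30.13, 29.12)))  # 3
print(round(percent_reduction(30.13, 28.89)))  # 4
```

The recomputed values agree with the rounded percentages listed in the table.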

4.4 Audio Classification

We further apply our method to an audio classification task, specifically music genre classification. The preprocessing of audio data is similar to [13] and produces a Mel spectrogram matrix of size 80×80. The network architecture is illustrated in Fig. 4, where the scaling layer is added after both convolutional layers and fully connected layers. The evaluation is performed on the ISMIR Genre dataset. We train all the networks from scratch using Adam with mini-batch size 64 for 50 epochs. The learning rate is set to 0.003. The results are listed in Table 6, where both parameters and FLOPs are reported. Our approach prunes about 92% of the parameters while achieving 1% higher accuracy and saving 80% of the computational cost. With a minor accuracy loss of about 1%, 99.5% of the parameters are pruned, resulting in an extremely narrow network with a 50-times speedup.

Figure 4: Network architecture for music genre classification.
Models Accuracy Params Pruned (%) FLOPs Pruned (%)
Base-line 0.808 106506 - 20.3M -
Ours 0.818 (+1%) 8056 92.5 4M 80.3
Ours 0.798 (-1.3%) 590 99.5 0.44M 98.4
Table 6: Results of music genre classification on ISMIR Genre dataset

5 Conclusions

In this paper, we have proposed a novel approach to simultaneously learning architectures and features in deep neural networks. It is mainly underpinned by a novel pruning loss and an online pruning strategy which explicitly guide the optimization toward an optimal architecture driven by a target pruning ratio or model size. The proposed pruning loss enables online pruning which dynamically selects the architecture while evolving filters to achieve extremely compact structures. In order to improve the representation power of the remaining filters, we further proposed to enforce diversity between filters, which in turn improved the trade-off between architecture and accuracy. We conducted comprehensive experiments to show that the interplay between architecture and feature optimization improves the final compressed models, in terms of both model size and accuracy, for various tasks on both visual and audio data.