1 Introduction
These days, network pruning has become the workhorse of network compression, which aims at lightweight and efficient models for fast inference [12, 18, 17, 41, 40, 32]. This is of particular importance for the deployment of tiny artificial intelligence (Tiny AI) algorithms on smartphones and edge devices [45]. Since the emergence of network pruning, a couple of methods have been proposed based on the analysis of gradients, Hessians, or filter distributions [28, 13, 9, 46, 58, 30, 62, 16]. With the advent of AutoML and neural architecture search (NAS) [68, 7], a new trend of network compression and pruning has emerged, i.e. pruning with automatic algorithms that target distinct mini-architectures (e.g. layers or building blocks). Among them, reinforcement learning and evolutionary algorithms have become the natural choice [17, 40]. The core idea is to search for a certain fine-grained layer-wise configuration among all of the possible choices (the population, in the terminology of evolutionary algorithms). After the searching stage, the most promising candidate that optimizes the network capacity under the constrained budget is chosen. The advantage of these automatic pruning methods is the final layer-wise distinguishing configuration, so that handcrafted design is no longer necessary. However, the main concern about these algorithms is their convergence property. For example, reinforcement learning is notorious for its difficulty of convergence even under a medium number of states [55]. Evolutionary algorithms need to choose the best candidate from an already converged population. But the dilemma lies in the impossibility of training the whole population until convergence and the difficulty of choosing the best candidate from an unconverged population [40, 14].
A promising solution to this problem is endowing the searching mechanism with differentiability, or directly resorting to an approximately differentiable algorithm, since differentiability guarantees theoretical convergence and has the potential to make the searching stage efficient. Indeed, differentiability has already facilitated a couple of machine learning approaches, the typical one being NAS. Early works on NAS had an insatiable demand for computing resources, consuming tens of thousands of GPU hours for a satisfactory convergence [68, 69]. The introduction of differentiable architecture search (DARTS) reduced this consumption to tens of GPU hours, which has boosted the development of NAS during the past year [39].

Another noteworthy direction for automatic pruning is brought by MetaPruning [40], which introduces hypernetworks [11] into network compression. The output of the so-called hypernetwork is used as the parameters of the backbone network. During training, the gradients are also backpropagated to the hypernetwork. This method falls into the paradigm of meta learning since the parameters of the hypernetwork act as the meta-data of the parameters of the backbone network. But the problem with this method is that the hypernetwork can only output fixed-size weights, which cannot serve as a layer-wise configuration searching mechanism. Thus, a searching algorithm such as an evolutionary algorithm is still necessary for the discovery of a good candidate. Although this is quite a natural choice, one interesting question remains, namely, whether one can design a hypernetwork whose output size depends on the input (termed the latent vector in this paper), so that by operating only on the latent vector, the backbone network can be automatically pruned.
To solve the aforementioned problem, we propose a differentiable meta pruning approach via hypernetworks, DHP (D – Differentiable, H – Hyper, P – Pruning). A new design of hypernetwork is proposed to meet the requirements of differentiability. Each layer is endowed with a latent vector that controls the output channels of that layer. The hypernetwork takes as input the latent vectors of the current layer and the previous layer, which control the output and input channels of the current layer, respectively. By forward-passing the latent vectors through the hypernetwork, the derived output is used as the parameters of the backbone network. To achieve the effect of automatic pruning, a sparsity regularizer is applied to the latent vectors. A pruned model is discovered by updating the latent vectors with proximal gradient. The searching stage stops when the compression ratio drops to the target level. After the searching stage, the latent vectors become sparse, with some elements approaching zero, which can be removed. Accordingly, the output of the hypernetwork, which is covariant with the latent vectors, is also compressed. Thus, the advantage of the proposed method is that operating only on the latent vectors makes automatic network pruning easier, without other bells and whistles.
With the fast development of efficient network design and NAS, the usefulness of network pruning is frequently challenged. However, by analyzing the pruning performance on MobileNetV1 [19] and MobileNetV2 [52] in Fig. 1, we conclude that automatic network pruning is of vital importance for further exploring the capacity of efficient networks. Efficient network design and NAS can only result in an overall architecture whose building blocks share the same mini-architecture. By automatic network pruning, the efficient networks obtained by either human experts or NAS can be further compressed, leading to block-wise distinct configurations, which can be seen as a fine-grained architecture search. We conjecture that this per-layer distinguishing configuration might help to discover the potential capacity of efficient networks without losing too much of the accuracy of the original network.
Thus, the contributions of this paper are as follows.

A new architecture of hypernetwork is designed. Different from the classical hypernetwork composed of linear layers, the new design is tailored to automatic network pruning. The input latent vector of the hypernetwork controls the parameters of the layers of the backbone network. By only operating on the latent vector, the backbone network can be pruned.

A differentiable automatic network pruning method is proposed based on the newly designed hypernetwork. Different from the existing methods based on reinforcement learning or evolutionary algorithms, the proposed method has a theoretical convergence guarantee based on the proximal gradient algorithm.

The potential of automatic network pruning as fine-grained architecture search is revealed by comparing the compression results of DHP with those of efficient networks.

The proposed differentiable automatic pruning method is not limited to a specific network architecture or layer type. A wide range of networks are compressed, including VGG [53], ResNet [15], DenseNet [22], MobileNetV1 [19], MobileNetV2 [52], ShuffleNetV2 [42], MNasNet [56], DnCNN [66], UNet [51], SRResNet [29], and EDSR [35], which contain various layer and block types, including standard convolution, depthwise convolution, point-wise convolution, and transposed convolution, combined with bottleneck blocks, skip connections, and densely connected blocks.

Extensive experiments are conducted on both high-level and low-level vision tasks, including image classification, single image super-resolution, and denoising. The experimental results show that the proposed method sets the new state-of-the-art in automatic network pruning.
2 Related Works
Network pruning. Aiming at removing the weak filter connections that have the least influence on the accuracy of the network, network pruning has attracted increasing attention recently. Early attempts emphasized storage consumption, and various criteria were explored to remove inconsequential connections in an unstructured manner [12, 37]. Despite their success in reducing network parameters, unstructured pruning leads to irregular weight parameters, which limits the actual acceleration of the pruned network. To further address the efficiency issue, structured pruning methods directly zero out structured groups of the convolutional filters. For example, Wen et al. [60] and Alvarez et al. [2] first proposed to resort to group sparsity regularization during training to reduce the number of feature maps in each layer. Since then, the field has witnessed a variety of regularization strategies [3, 63, 31, 57, 32]. These elaborately designed regularization methods considerably advance the pruning performance, but they often rely on carefully adjusted hyperparameters selected for a specific network architecture and dataset.
AutoML. Recently, there is an emerging trend of exploiting the idea of AutoML for automatic network compression. The rationale lies in exploring the total population of network configurations for a final best candidate. He et al. exploited reinforcement learning agents to prune networks, so that handcrafted design is no longer necessary. Hayashi et al. [14] utilized a genetic algorithm to enumerate candidates in the designed hypergraph for tensor network decomposition. Liu et al. trained a hypernetwork to generate the weights of the backbone network and used an evolutionary algorithm to search for the best candidate. The problem with these approaches is that the searching algorithm is not differentiable, which does not guarantee convergence.

NAS. NAS automates the manual task of neural network architecture design. Optimally, searched networks achieve smaller test error, require fewer parameters, and need less computation than their manually designed counterparts [68, 50]. But the main drawback of the early strategies is their almost insatiable demand for computational resources. To alleviate the computational burden, several methods [69, 38, 39] proposed to search for a basic building block, i.e. a cell, as opposed to an entire network. Then, stacking multiple cells with equivalent structure but different weights defines a full network [49, 5]. Another recent trend in NAS is differentiable search methods such as DARTS [39]. The differentiability allows the fast convergence of the searching algorithm and has thus boosted the fast development of NAS during the past year. In this paper we propose a differentiable counterpart for automatic network pruning.
Meta learning and hypernetworks. Meta learning is a broad family of machine learning techniques that deal with the problem of learning to learn. Recent works have witnessed its application to various vision tasks including object detection [61], instance segmentation [20], and super-resolution [21]. An emerging trend of meta learning uses hypernetworks to predict the weight parameters of the backbone network [11]. Since the introduction of hypernetworks, they have found wide applications in NAS [5], multi-task learning [47], Bayesian neural networks [27], and also network pruning [40]. In this paper, we propose a new design of hypernetwork which is especially suitable for network pruning and makes differentiability possible for automatic network pruning.
3 Methodology
The pipeline of the proposed method is shown in Fig. 2. The two cores of the whole pipeline are the designed hypernetwork and the optimization algorithm, i.e. proximal gradient. In the forward pass, the designed hypernetwork takes the latent vectors as input and predicts the weight parameters of the backbone network. The gradients are backpropagated to the hypernetwork in the backward pass. The sparsity regularizer is enforced on the latent vectors, and proximal gradient is used to solve the problem. The dimension of the output of the hypernetwork is covariant with that of the input. Due to this property, the output weights are pruned along with the sparse latent vectors after the optimization step. The differentiability comes from the covariance property of the hypernetwork, the sparsity enforced on the latent vectors, and the proximal gradient used to solve the problem. The automation of pruning is due to the fact that all of the latent vectors are non-discriminatively regularized and that proximal gradient discovers the potentially less important elements automatically.
3.1 Hypernetwork design
We first introduce the design of the hypernetwork, shown in Fig. 3. In summary, the hypernetwork consists of three layers. The latent layer takes as input the latent vectors and computes a latent matrix from them. The embedding layer projects the elements of the latent matrix to an embedding space. The last, explicit layer converts the embedded vectors to the final output. This design is inspired by the fully connected layers in [11, 40] but differs from those designs in that the output dimension is covariant with the input latent vectors. The design is applicable to all types of convolutions, including standard convolution, depthwise convolution, point-wise convolution, and transposed convolution; for simplicity of reference, the term convolution is used to denote any of them. Unless otherwise stated, we use normal ($x$), lowercase bold ($\mathbf{x}$), and uppercase bold ($\mathbf{X}$) letters to denote scalars, vectors, and matrices or high-dimensional tensors, respectively. The elements of a tensor are indexed by subscripts, as in $\mathbf{X}_{i,j}$, and could be scalars or vectors depending on the dimension of the indexed tensor.
Suppose that the given network is an $L$-layer convolutional neural network (CNN) with layers indexed by $l$. The dimension of the weight parameter of a convolutional layer is $n \times c \times w \times h$, where $n$, $c$, and $w \times h$ denote the output channels, input channels, and kernel size of the convolutional layer, respectively. Every convolutional layer is endowed with a latent vector with the same size as the output channels, namely, $\mathbf{z}^{l} \in \mathbb{R}^{n}$. Thus, the layer previous to the current one is given a latent vector $\mathbf{z}^{l-1} \in \mathbb{R}^{c}$. The hypernetwork receives the latent vectors $\mathbf{z}^{l}$ and $\mathbf{z}^{l-1}$ of the current and the previous layer as input. A latent matrix is first derived from the two latent vectors, namely,
$$\mathbf{Z}^{l} = \mathbf{z}^{l} \cdot (\mathbf{z}^{l-1})^{T}, \qquad (1)$$
where $(\cdot)^{T}$ and $\cdot$ denote matrix transpose and matrix multiplication, and $\mathbf{Z}^{l} \in \mathbb{R}^{n \times c}$. Then every element $\mathbf{Z}^{l}_{i,j}$ in the latent matrix is projected to an $m$-dimensional embedding space with the vectors $\mathbf{w}_{1}$ and $\mathbf{b}_{1}$, namely,
$$\mathbf{e}^{l}_{i,j} = \mathbf{w}_{1} \cdot \mathbf{Z}^{l}_{i,j} + \mathbf{b}_{1}, \qquad (2)$$
where $\mathbf{e}^{l}_{i,j}, \mathbf{w}_{1}, \mathbf{b}_{1} \in \mathbb{R}^{m}$. The vectors $\mathbf{w}_{1}$ and $\mathbf{b}_{1}$ are element-wise unique and, for simplicity of notation, their subscripts are omitted. We denote the ensemble of the element-wise embedding operations with the following high-dimensional tensor operation
$$\mathbf{E}^{l} = \mathbf{W}_{1} \circ u(\mathbf{Z}^{l}) + \mathbf{B}_{1}, \qquad (3)$$
where $\mathbf{E}^{l}, \mathbf{W}_{1}, \mathbf{B}_{1} \in \mathbb{R}^{n \times c \times m}$; $\mathbf{e}^{l}_{i,j}$, $\mathbf{w}_{1}$, and $\mathbf{b}_{1}$ are the $(i,j)$-th slices of $\mathbf{E}^{l}$, $\mathbf{W}_{1}$, and $\mathbf{B}_{1}$ along their first two dimensions; $\circ$ denotes the broadcastable element-wise tensor multiplication; and $u(\cdot)$ inserts a third dimension for $\mathbf{Z}^{l}$. Note that after the operation in Eqn. 3, all of the elements of $\mathbf{Z}^{l}$ are converted to embedded vectors in the embedding space. The final step is to obtain the output that can be explicitly used as the weights of the convolutional layer. To achieve that, every embedded vector is multiplied by an explicit matrix, that is,
$$\mathbf{o}^{l}_{i,j} = \mathbf{w}_{2} \cdot \mathbf{e}^{l}_{i,j} + \mathbf{b}_{2}, \qquad (4)$$
where $\mathbf{o}^{l}_{i,j}, \mathbf{b}_{2} \in \mathbb{R}^{wh}$ and $\mathbf{w}_{2} \in \mathbb{R}^{wh \times m}$. The operation above can be easily rewritten as a batched matrix multiplication following the convention of tensor operations,
$$\mathbf{O}^{l} = \mathbf{W}_{2} * \mathbf{E}^{l} + \mathbf{B}_{2}, \qquad (5)$$
where $\mathbf{O}^{l}, \mathbf{B}_{2} \in \mathbb{R}^{n \times c \times wh}$, $\mathbf{W}_{2} \in \mathbb{R}^{n \times c \times wh \times m}$, and $*$ denotes batched matrix multiplication. Again note that $\mathbf{w}_{2}$ and $\mathbf{b}_{2}$ in Eqn. 4 are unique for every embedded vector and denote the slices from $\mathbf{W}_{2}$ and $\mathbf{B}_{2}$.
A simplified representation of the output of the hypernetwork is given by
$$\mathbf{O}^{l} = h(\mathbf{z}^{l}, \mathbf{z}^{l-1}; \mathbf{W}^{l}, \mathbf{B}^{l}), \qquad (6)$$
where $h(\cdot)$ denotes the functionality of the hypernetwork, and $\mathbf{W}^{l}$ and $\mathbf{B}^{l}$ are its weights and biases. The final output can be reshaped to form the weight parameter of the $l$-th layer. The output is said to be covariant with the input latent vectors in that its first two dimensions are the same as the sizes of the two latent vectors. By imposing sparsity regularization on the latent vectors, the corresponding elements in the output can also be removed, thus achieving the effect of network pruning.
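To make the shapes concrete, the computation in Eqns. 1–5 can be sketched in a few lines of NumPy. This is a minimal illustration rather than our actual implementation: for brevity it shares one embedding vector and one explicit matrix across all matrix elements (the design above uses element-wise unique ones), and the dimensions are chosen arbitrarily.

```python
import numpy as np

def hypernet_forward(z_cur, z_prev, w1, b1, W2, B2):
    """Predict the weight of one conv layer from two latent vectors.

    z_cur:  (n,) latent vector of the current layer (output channels)
    z_prev: (c,) latent vector of the previous layer (input channels)
    w1, b1: (m,) embedding parameters (shared here for brevity)
    W2:     (w*h, m) explicit matrix; B2: (w*h,) explicit bias
    """
    # Eqn. 1: latent matrix, one scalar per (output, input) channel pair
    Z = np.outer(z_cur, z_prev)          # (n, c)
    # Eqns. 2-3: embed every element of Z into an m-dim space (broadcast)
    E = Z[:, :, None] * w1 + b1          # (n, c, m)
    # Eqns. 4-5: explicit layer as a batched matrix multiplication
    O = E @ W2.T + B2                    # (n, c, w*h)
    return O

n, c, m, w, h = 8, 4, 3, 3, 3
rng = np.random.default_rng(0)
O = hypernet_forward(rng.standard_normal(n), rng.standard_normal(c),
                     rng.standard_normal(m), rng.standard_normal(m),
                     rng.standard_normal((w * h, m)), rng.standard_normal(w * h))
# reshaped to a conv weight; the first two dimensions are covariant with
# the sizes of the two latent vectors
weight = O.reshape(n, c, h, w)
```

Removing an element of `z_cur` (respectively `z_prev`) removes the corresponding output (input) channel slice of `weight`, which is exactly the covariance property exploited for pruning.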
For the initialization of the parameters, all biases are initialized as zero, the latent vectors with the standard normal distribution, and the embedding-layer weights with Xavier uniform [10]. The weight of the explicit layer is initialized with Hyperfan-in, which guarantees stable backbone network weights and fast convergence [6].

3.2 Sparsity regularization and proximal gradient
The core of the approximate differentiability comes not only from the specifically designed hypernetwork but also from the mechanism used to search for the potential candidate. To achieve that, we enforce sparsity constraints on the latent vectors. Thus, the loss function of the aforementioned $L$-layer CNN is denoted as

$$\min_{\mathbf{W}, \mathbf{B}, \mathbf{z}} \mathcal{L}\big(y, f(x; h(\mathbf{z}; \mathbf{W}, \mathbf{B}))\big) + \gamma \mathcal{D}(\mathbf{W}) + \lambda \mathcal{R}(\mathbf{z}), \qquad (7)$$
where $\mathcal{L}(\cdot)$, $\mathcal{D}(\cdot)$, and $\mathcal{R}(\cdot)$ are the general loss function of the CNN, the weight decay term, and the sparsity regularization term, respectively. For simplicity of notation, the layer superscript is omitted. The sparsity regularization takes the form of the $\ell_{1}$ norm, namely,
$$\mathcal{R}(\mathbf{z}) = \sum_{l=1}^{L} \|\mathbf{z}^{l}\|_{1}. \qquad (8)$$
To solve the problem in Eqn. 7, the weights and biases of the hypernetwork are updated with SGD. Note that the gradients are backpropagated from the backbone network to the hypernetwork. Thus, neither the forward pass nor the backward pass obstructs the information flow between the backbone network and the hypernetwork. As for the latent vectors, they are updated with the proximal gradient algorithm, that is,
$$\mathbf{z}[k+1] = \mathbf{prox}_{\mu\lambda\mathcal{R}}\big(\mathbf{z}[k] - \mu \nabla \mathcal{L}(\mathbf{z}[k])\big), \qquad (9)$$
where $\mu$ is the step size of the proximal gradient method, set equal to the learning rate of the SGD updates. As can be seen in the equation, the proximal gradient update contains a gradient descent step and a proximal operation step. When the regularizer has the form of the $\ell_{1}$ norm, the proximal operator has a closed-form solution, i.e.
$$\mathbf{z}[k+1] = \mathrm{sgn}\big(\mathbf{z}[k+\tfrac{1}{2}]\big) \odot \max\big(|\mathbf{z}[k+\tfrac{1}{2}]| - \mu\lambda, 0\big), \qquad (10)$$
where $\mathbf{z}[k+\tfrac{1}{2}] = \mathbf{z}[k] - \mu \nabla \mathcal{L}(\mathbf{z}[k])$ is the intermediate SGD update, and the sign operator $\mathrm{sgn}(\cdot)$, the thresholding operator $\max(\cdot, 0)$, and the absolute value operator $|\cdot|$ act element-wise on the vector. Eqn. 10 is the well-known soft-thresholding function.
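The update of Eqns. 9–10 can be sketched in NumPy as follows (a minimal sketch with arbitrary latent values; the gradient is set to zero here so that only the shrinkage step is visible):

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form proximal operator of t * ||.||_1 (Eqn. 10)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_step(z, grad, step, lam):
    """One proximal gradient update (Eqn. 9): SGD step, then shrinkage."""
    return soft_threshold(z - step * grad, step * lam)

z = np.array([0.50, -0.02, 0.30, -0.80])
z_new = proximal_step(z, grad=np.zeros_like(z), step=0.1, lam=1.0)
# elements with magnitude below step*lam = 0.1 are zeroed,
# the others shrink towards zero by 0.1
```

Repeated application of this step is what drives the unimportant latent elements towards exact zero, which a plain SGD update on the $\ell_1$ penalty would not achieve.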
In practice, the latent vectors first receive SGD updates along with the other parameters, after which the proximal operator is applied. Given the SGD updates and the fact that the proximal operator has a closed-form solution, we regard the whole solution as approximately differentiable (although the $\ell_{1}$ norm is not differentiable at 0), which guarantees the fast convergence of the algorithm compared with reinforcement learning or evolutionary algorithms. Actually, the speed-up of proximal gradient lies in that, instead of searching for the best candidate among the total population, it forces the solution towards the best sparse one.
The automation of pruning follows from the way the sparsity is applied in Eqn. 8 and from the proximal gradient solution. First of all, all latent vectors are regularized together without any distinction between them. During the optimization, information and gradients flow freely between the backbone network and the hypernetwork. The proximal gradient algorithm forces the potentially unimportant elements of the latent vectors to approach zero more quickly than the others, without any human effort or interference in this process. The optimization stops immediately when the target compression ratio is reached. In total, there are only two additional hyperparameters in the algorithm, i.e. the sparsity regularization factor and the mask threshold in Subsec. 3.3. Thus, running the algorithm is just like pressing a button, which enables the application of the algorithm to all CNNs without much interference from domain experts.
3.3 Network pruning
Different from fully connected layers, the proposed design of the hypernetwork can adapt the dimension of its output according to that of the latent vectors. After the searching stage, sparse versions of the latent vectors $\tilde{\mathbf{z}}^{l}$ and $\tilde{\mathbf{z}}^{l-1}$ are derived. Some of their elements are zero or approaching zero. Thus, 0-1 masks can be derived by comparing the sparse latent vectors with a predefined small threshold $\tau$, that is,
$$\mathbf{m}^{l} = \mathbb{1}\big[\,|\tilde{\mathbf{z}}^{l}| \geq \tau\,\big], \qquad (11)$$
where the indicator function $\mathbb{1}[\cdot]$ element-wise compares the latent vector with the threshold $\tau$ and returns 1 if the element is not smaller than $\tau$ and 0 otherwise. By applying the masks to the latent vectors and analyzing the three layers together, we can get a direct impression of how the backbone layers are automatically pruned. That is,
$$h\big(\mathbf{m}^{l} \odot \tilde{\mathbf{z}}^{l}, \mathbf{m}^{l-1} \odot \tilde{\mathbf{z}}^{l-1}\big) = \mathbf{W}_{2} * \Big(\mathbf{W}_{1} \circ u\big((\mathbf{m}^{l} \odot \tilde{\mathbf{z}}^{l}) \cdot (\mathbf{m}^{l-1} \odot \tilde{\mathbf{z}}^{l-1})^{T}\big)\Big) \qquad (12)$$
$$= u\big(\mathbf{m}^{l} \cdot (\mathbf{m}^{l-1})^{T}\big) \circ h\big(\tilde{\mathbf{z}}^{l}, \tilde{\mathbf{z}}^{l-1}\big). \qquad (13)$$
The equality follows from the broadcastability of the operations $\circ$ and $*$. As shown in the above equations, applying the masks to the latent vectors has the same effect as applying them to the final output. Note that in the above analysis the bias terms of the three layers are omitted, since they have a very small influence on the output of the hypernetwork; in any case, the same mask matrix is applied to the biases. In conclusion, the final output can be pruned according to the same criterion as the latent vectors.
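In code, the masking and pruning step of Eqn. 11 amounts to boolean indexing of the predicted weight tensor. A minimal NumPy sketch (with illustrative latent values and a randomly initialized weight, not our actual code):

```python
import numpy as np

def channel_mask(z, tau):
    """Eqn. 11: 0-1 mask, 1 where |z_i| >= tau."""
    return np.abs(z) >= tau

# sparse latent vectors after the searching stage (illustrative values)
z_cur  = np.array([0.9, 1e-4, 0.6, -2e-4])   # output channels of layer l
z_prev = np.array([0.7, -1e-4, 0.8])         # input channels of layer l
tau = 1e-3

m_cur, m_prev = channel_mask(z_cur, tau), channel_mask(z_prev, tau)

# pruning the hypernetwork output = keeping the rows/columns of the conv
# weight that correspond to the surviving latent elements
W = np.random.default_rng(0).standard_normal((4, 3, 3, 3))  # (n, c, h, w)
W_pruned = W[m_cur][:, m_prev]   # drops near-zero output and input channels
```

The two near-zero elements of `z_cur` and the one near-zero element of `z_prev` are removed, shrinking the 4x3 channel configuration to 2x2 without any per-layer human decision.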
3.4 Latent vector sharing
Network (Baseline Top-1 Error %)  Method  Top-1 Error (%)  FLOPs Ratio (%)  Parameter Ratio (%)
CIFAR10  
ResNet56 7.05  Variational [67]  7.74  79.70  79.51 
GAL0.6 [36]  6.62  63.40  88.20  
56prunedB [30]  6.94  62.40  86.30  
NISP [64]  6.99  56.39  57.40  
DHP50 (Ours)  6.31  50.98  55.62  
CaP [44]  6.78  50.20  –  
ENC [25]  7.00  50.00  –  
AMC [17]  8.10  50.00  –  
KSE [34]  6.77  48.00  45.27  
FPGM [16]  6.74  47.70  –  
GAL0.8 [36]  8.42  39.80  34.10  
DHP38 (Ours)  6.86  39.60  49.00  
ResNet110 5.31  Variational [67]  7.04  63.56  58.73 
DHP62 (Ours)  5.36  62.50  59.28  
GAL0.5 [36]  7.26  51.50  55.20  
DHP20 (Ours)  6.93  21.77  21.30  
ResNet164 4.97  SSS [24]  5.78  53.53  – 
DHP50 (Ours)  5.29  51.86  44.10  
Variational [67]  6.84  50.92  43.30  
DHP20 (Ours)  5.91  21.47  20.24  
DenseNet1240 5.26  Variational [67]  6.84  55.22  40.33 
DHP38 (Ours)  6.06  39.80  63.76  
DHP28 (Ours)  6.51  29.52  26.01  
GAL0.1 [36]  6.77  28.60  25.00  
VGG16 6.34  DHP40 (Ours)  7.61  40.11  35.61 
DHP20 (Ours)  8.60  21.65  17.18  
TinyImageNet 

ResNet50 42.35  DHP40 (Ours)  43.97  41.57  45.89 
MetaPruning [40]  49.82  10.21  37.32  
DHP9 (Ours)  47.00  10.95  13.15  
MobileNetV1 51.87  DHP242 (Ours)  50.47  98.95  79.65 
MobileNetV10.75  53.18  57.74  57.75  
MetaPruning [40]  54.66  56.49  88.06  
DHP50 (Ours)  51.96  51.37  47.76  
MobileNetV2 43.83  DHP242 (Ours)  43.14  96.34  107.57 
MobileNetV20.3  53.21  11.08  11.89  
MetaPruning [40]  56.53  11.08  89.92  
DHP10 (Ours)  47.65  11.94  16.86 
Due to the existence of skip connections in residual networks such as ResNet, MobileNetV2, SRResNet, and EDSR, the residual blocks are interconnected in such a way that their input and output dimensions are related. Therefore, skip connections are notoriously tricky to deal with. But with the design of the proposed hypernetwork, a quite simple and straightforward solution is to let the hypernetworks of the correlated layers share the same latent vector. Thus, by automatically pruning the single latent vector, all of the relevant layers are pruned together. Actually, we first tried to use unique latent vectors for the correlated layers and applied group sparsity to them. But the experimental results showed that this is not a good choice, because it yields lower accuracy than sharing the latent vectors (see details in the supplementary).
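The sharing scheme can be sketched as follows: every correlated layer reads the same latent vector, so a single mask prunes all of them consistently and the tensors added by the skip connection keep matching shapes (a minimal NumPy sketch with made-up shapes, not our actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
# one latent vector shared by every layer whose output feeds the same
# skip-connection path (e.g. the last conv of each residual block)
shared_z = np.array([0.8, 2e-4, 0.5, 0.9])
mask = np.abs(shared_z) >= 1e-3

# output convs of two residual blocks on that path, 4 output channels each
block1_w = rng.standard_normal((4, 16, 3, 3))
block2_w = rng.standard_normal((4, 16, 3, 3))

# pruning the single shared latent vector prunes all correlated layers
# together, so the residual additions remain dimensionally valid
block1_p, block2_p = block1_w[mask], block2_w[mask]
```

With per-layer unique latent vectors, each block could end up with a different surviving channel set, which is exactly the inconsistency that group sparsity had to be recruited to repair.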
4 Experimental Results
To validate the effectiveness of DHP, extensive experiments have been conducted on various network architectures for different computer vision tasks: VGG [53], ResNet [15], and DenseNet [22] for CIFAR10 [26] image classification; ResNet50, MobileNetV1 [19], and MobileNetV2 [52] for TinyImageNet [8] image classification; SRResNet [29] and EDSR [35] for single image super-resolution; and DnCNN [66] and UNet [51] for gray-scale image denoising. For super-resolution, the networks are trained on the DIV2K [1] dataset and tested on Set5 [4], Set14 [65], B100 [43], Urban100 [23], and the DIV2K validation set. For image denoising, the networks are trained on the gray-scale version of the DIV2K dataset and tested on BSD68 and the DIV2K validation set.
We train and prune the networks from scratch without relying on pretrained models. The proximal gradient algorithm is first used to prune the network, with the initialization detailed in Subsec. 3.1. A target ratio is set for the pruning procedure. When the difference between the target ratio and the actual compression ratio falls below 2%, the automatic pruning procedure stops. The training of the pruned network then continues with the same protocol as used for the original network. Please refer to the supplementary for the detailed training protocol and the selection of hyperparameters.
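The searching schedule above can be sketched as a toy loop. This is a simulation rather than our training code: `flops_ratio` is a stand-in proxy that counts surviving channels instead of real FLOPs, and the SGD gradient step is omitted so that only the proximal shrinkage and the 2% stopping criterion are exercised.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def flops_ratio(zs, tau=1e-3):
    """Toy proxy: fraction of surviving channels across all layers."""
    kept = sum(int((np.abs(z) >= tau).sum()) for z in zs)
    total = sum(z.size for z in zs)
    return kept / total

rng = np.random.default_rng(0)
zs = [rng.standard_normal(64) for _ in range(4)]   # toy latent vectors
target, step, lam = 0.50, 0.1, 0.2

# searching stage: proximal updates until the actual ratio is within 2%
# of the target ratio (gradient step omitted in this toy simulation)
for _ in range(1000):
    zs = [soft_threshold(z, step * lam) for z in zs]
    if flops_ratio(zs) - target < 0.02:
        break
ratio = flops_ratio(zs)
```

After the loop breaks, the surviving latent elements define the pruned configuration, and regular training of the pruned network resumes.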
4.1 Image classification
The experimental results of different network compression algorithms on image classification are shown in Table 1. For ResNet56, the proposed method is compared with 9 different network compression methods and achieves the best performance, i.e. a 6.31% Top-1 error rate at the most intensively investigated 50% compression level. Note that this error rate is even lower than that of the uncompressed baseline. The compression on DenseNet1240 is reasonable compared with the other methods. For ResNet110 and ResNet164, the accuracy of our higher operating points, DHP62 and DHP50, is not far from that of the baselines. More results on ResNet110 and ResNet164 are shown in Fig. 4. As can be seen in the figure, especially for ResNet164, when the compression ratio is not too severe (above 20%), the accuracy does not drop too much. The extreme compression prunes about 90% of the computation and parameters (for ResNet164, the extreme compression keeps only 6.87% of the parameters). Thus, the drop in accuracy is reasonable.
On TinyImageNet, DHP achieves lower Top-1 error rates than MetaPruning under the same FLOPs constraint. Our lower operating points achieve lower Top-1 error rates than the narrowed versions of the mobile networks. On MobileNetV1, the error rate of DHP10 is lower than that of MobileNetV10.75 by 1.08%, with 6.37% fewer FLOPs and nearly 10% fewer parameters, while on MobileNetV2 the accuracy gain of DHP10 over MobileNetV20.3 reaches 8.88%. Thus, we hypothesize that we can target an error rate lower than that of the original version by pruning the widened mobile networks. And this is confirmed by comparing the accuracy of our DHP242 with the baseline accuracy in Fig. 1.
4.2 Superresolution
The image super-resolution results are shown in Table 2. We compare our method with the factorized convolution method (Factor) [59], the filter basis method (Basis) [33], and the K-means clustering method (Clustering) [54]. To fairly compare the methods and measure the practical compression effectiveness, five metrics are involved: Peak Signal-to-Noise Ratio (PSNR), floating point operations (FLOPs), number of parameters, runtime, and GPU memory consumption. By observing the five metrics, we draw several conclusions. Firstly, previous methods mainly focus on the reduction of FLOPs and the number of parameters without paying special attention to the actual acceleration. Although some methods such as Clustering can remove a substantial number of parameters while maintaining quite good PSNR accuracy, the actual computing resources remain unchanged (GPU memory) or even increase (runtime) due to the overhead introduced by centroid indexing. Secondly, convolution factorization or decomposition results in additional CUDA kernel calls, which is not efficient for actual acceleration. Thirdly, for the proposed method, the two model complexity metrics change consistently across different operating points, which leads to a consistent reduction of computing resources, including runtime and memory consumption. Fourthly, the proposed DHP results in inference-efficient models as well as accuracy-preserving ones. The visual results are shown in Fig. 5. As can be seen, the proposed method produces images almost indistinguishable from those of the baseline while achieving actual acceleration.

Network  Method  PSNR  FLOPs  Params  Runtime  GPU Mem
Set5  Set14  B100  Urban100  DIV2K  
SRResNet  Baseline  32.03  28.50  27.52  25.88  28.85  32.81  1535.62  7.80  2.2763 
[29]  Clustering [54]  31.93  28.44  27.47  25.71  28.75  32.81  341.25  10.84  2.2727 
FactorSIC3 [59]  31.86  28.38  27.40  25.58  28.65  20.83  814.73  17.59  2.2737  
DHP60 (Ours)  31.97  28.47  27.48  25.76  28.79  20.27  948.63  6.98  2.0568  
Basis3232 [33]  31.90  28.42  27.44  25.65  28.69  19.77  738.18  13.14  0.9331  
FactorSIC2 [59]  31.68  28.32  27.37  25.47  28.58  18.38  661.13  15.77  2.2763  
Basis6414 [33]  31.84  28.38  27.39  25.54  28.63  17.49  598.91  9.77  0.7267  
DHP40 (Ours)  31.90  28.45  27.47  25.72  28.75  13.68  638.75  5.75  1.7152  
DHP20 (Ours)  31.77  28.34  27.40  25.55  28.60  7.75  357.92  4.52  1.3362  
EDSR  Baseline  32.10  28.55  27.55  26.02  28.93  90.36  3696.67  16.83  0.4438 
[35]  Clustering [54]  31.92  28.46  27.48  25.76  28.80  90.36  821.48  21.70  0.6775 
FactorSIC3 [59]  31.96  28.47  27.49  25.81  28.81  65.49  2189.34  33.22  1.5007  
Basis12840 [33]  32.03  28.45  27.50  25.81  28.82  62.65  2003.46  16.59  0.4679  
FactorSIC2 [59]  31.82  28.40  27.43  25.63  28.70  60.90  1904.67  27.69  1.1247  
Basis12827 [33]  31.95  28.42  27.46  25.76  28.76  58.28  1739.52  17.13  0.4772  
DHP60 (Ours)  31.99  28.52  27.53  25.92  28.88  55.67  2279.64  7.81  0.4658  
DHP40 (Ours)  32.01  28.49  27.52  25.86  28.85  37.77  1529.78  6.01  0.4658  
DHP20 (Ours)  31.94  28.42  27.47  25.69  28.77  19.40  785.85  4.37  0.4658 
4.3 Denoising
The compression results for image denoising are shown in Table 3. The same metrics as for super-resolution are reported. An additional method, i.e. the filter group approximation (Group) [48], is included. In addition to the same conclusions as in Subsec. 4.2, another two conclusions can be drawn here. The grouped convolution approximation method fails to reduce the actual computing resources although it achieves quite good accuracy and a satisfactory reduction of FLOPs and number of parameters. This might be due to the introduced additional convolution and possibly the inefficient implementation of group convolution in current deep learning toolboxes. For DnCNN, one interesting phenomenon is that the Factor method achieves even better accuracy than the baseline but has a larger appetite for the other resources. This is not surprising due to two facts. The SIC layer of Factor doubles the actual number of convolutional layers, so FactorSIC3 has five times more convolutional layers, which definitely slows down the execution. Another fact is that Factor has skip connections within the SIC layer. The outperformance of Factor in accuracy just validates the effectiveness of skip connections; the performance of the other methods could also be improved if skip connections were added. The visual results are shown in Fig. 6.

Network  Method  PSNR  FLOPs  Params  Runtime  GPU Mem
BSD68  DIV2K  
DnCNN [66]  Baseline  24.93  26.73  9.10  557.06  23.69  3.99 
Clustering [54]  24.90  26.67  9.10  123.79  21.84  3.99  
DHP60 (Ours)  24.91  26.69  5.62  344.61  17.11  2.70  
DHP40 (Ours)  24.89  26.65  3.81  233.75  17.19  2.33  
FactorSIC3 [59]  24.97  26.83  3.53  219.14  126.22  5.96  
Group [48]  24.88  26.64  3.32  204.74  26.81  4.02  
FactorSIC2 [59]  24.93  26.76  2.36  147.14  85.14  5.96  
DHP20 (Ours)  24.84  26.58  2.00  122.72  10.39  1.99  
UNet [51]  Baseline  25.17  27.17  3.41  7759.52  7.27  3.75 
Clustering [54]  25.01  26.90  3.41  1724.34  9.66  3.73  
DHP60 (Ours)  25.14  27.11  2.12  4760.27  6.09  2.93  
FactorSIC3 [59]  25.04  26.94  1.56  3415.65  40.36  4.71  
Group [48]  25.13  27.08  1.49  2063.91  9.00  3.75  
DHP40 (Ours)  25.12  27.08  1.43  3238.12  5.17  2.04  
FactorSIC2 [59]  25.01  26.90  1.22  2510.82  29.31  4.71  
DHP20 (Ours)  25.04  26.97  0.75  1611.03  3.99  1.77 
[Fig. 5: visual comparison on super-resolution. Panels: (a) LR, (b) EDSR, (c) Basis, (d) Cluster, (e) DHP, (f) Factor; PSNR/FLOPs/Runtime values: 32.85/28.59/14.10, 32.50/28.59/19.75, 32.65/19.82/14.71, 32.24/19.28/25.49, 32.64/17.61/5.40.]
[Fig. 6: visual comparison on denoising. Panels: (a) Noisy, (b) UNet, (c) DHP, (d) Factor, (e) Group, (f) Cluster; PSNR/FLOPs/Runtime values: 25.60/1.08/7.27, 25.30/1.08/9.66, 25.37/0.49/40.36, 25.51/0.47/9.00, 25.57/0.45/6.09.]
5 Conclusion and Future Work
In this paper, we proposed a differentiable automatic meta pruning method via hypernetworks for network compression. The differentiability comes from the specially designed hypernetwork and the proximal gradient used to search for the potential candidate network configurations. The automation of pruning lies in the uniformly applied sparsity on the latent vectors and the proximal gradient that solves the problem. By pruning widened mobile networks, we obtained models with higher accuracy but lower computational complexity than the original-width versions. We hypothesize that this is due to the per-layer distinguishing configuration resulting from the automatic pruning. Future work might investigate whether this phenomenon reoccurs for other networks.
Acknowledgements
This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and an Nvidia grant.
References
 [1] (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In Proc. CVPRW, Cited by: §4.

 [2] (2016) Learning the number of neurons in deep networks. In Proc. NeurIPS, pp. 2270–2278. Cited by: §2.
 [3] (2017) Compression-aware training of deep networks. In Proc. NeurIPS, pp. 856–867. Cited by: §2.
 [4] (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proc. BMVC, Cited by: §4.
 [5] (2018) SMASH: one-shot model architecture search through hypernetworks. In Proc. ICLR, Cited by: §2, §2.
 [6] (2020) Principled weight initialization for hypernetworks. In Proc. ICLR, Cited by: §3.1.
 [7] (2018) Constraint-aware deep neural network compression. In Proc. ECCV, pp. 400–415. Cited by: §1.
 [8] (2009) ImageNet: a large-scale hierarchical image database. In Proc. CVPR, pp. 248–255. Cited by: §4.
 [9] (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Proc. NeurIPS, pp. 4857–4867. Cited by: §1.
 [10] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pp. 249–256. Cited by: §3.1.
 [11] (2017) HyperNetworks. In Proc. ICLR, Cited by: §1, §2, §3.1.
 [12] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proc. ICLR, Cited by: §1, §2.
 [13] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Proc. NeurIPS, pp. 164–171. Cited by: §1.

 [14] (2019) Einconv: exploring unexplored tensor decompositions for convolutional neural networks. In Proc. NeurIPS, pp. 5553–5563. Cited by: §1, §2.
 [15] (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: §0.A.1.1, item (iv), §4.
 [16] (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proc. CVPR, pp. 4340–4349. Cited by: Table 5, §1, Table 1.
 [17] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In Proc. ECCV, pp. 784–800. Cited by: Table 5, §1, Table 1.
 [18] (2017) Channel pruning for accelerating very deep neural networks. In Proc. ICCV, pp. 1389–1397. Cited by: §1.
 [19] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: Table 6, item (iv), §1, §4.
 [20] (2018) Learning to segment every thing. In Proc. CVPR, pp. 4233–4241. Cited by: §2.
 [21] (2019) Meta-SR: a magnification-arbitrary network for super-resolution. In Proc. CVPR, pp. 1575–1584. Cited by: §2.
 [22] (2017) Densely connected convolutional networks. In Proc. CVPR, pp. 2261–2269. Cited by: §0.A.1.1, item (iv), §4.
 [23] (2015) Single image super-resolution from transformed self-exemplars. In Proc. CVPR, pp. 5197–5206. Cited by: §4.
 [24] (2018) Data-driven sparse structure selection for deep neural networks. In Proc. ECCV, pp. 304–320. Cited by: Table 5, Table 1.
 [25] (2019) Efficient neural network compression. In Proc. CVPR, Cited by: Table 5, Table 1.
 [26] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: §0.A.1.1, §4.
 [27] (2017) Bayesian hypernetworks. arXiv preprint arXiv:1710.04759. Cited by: §2.
 [28] (1990) Optimal brain damage. In Proc. NeurIPS, pp. 598–605. Cited by: §1.
 [29] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proc. CVPR, pp. 105–114. Cited by: item (iv), Table 2, §4.
 [30] (2017) Pruning filters for efficient ConvNets. In Proc. ICLR, Cited by: Table 5, §1, Table 1.
 [31] (2019) OICSR: out-in-channel sparsity regularization for compact deep neural networks. In Proc. CVPR, pp. 7046–7055. Cited by: §2.
 [32] (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In Proc. CVPR, Cited by: §1, §2.
 [33] (2019) Learning filter basis for convolutional neural network compression. In Proc. ICCV, pp. 5623–5632. Cited by: §4.2, Table 2.
 [34] (2019) Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proc. CVPR, Cited by: Table 5, Table 1.
 [35] (2017) Enhanced deep residual networks for single image super-resolution. In Proc. CVPRW, pp. 1132–1140. Cited by: item (iv), Table 2, §4.
 [36] (2019) Towards optimal structured CNN pruning via generative adversarial learning. In Proc. CVPR, pp. 2790–2799. Cited by: Table 5, Table 1.
 [37] (2015) Sparse convolutional neural networks. In Proc. CVPR, pp. 806–814. Cited by: §2.
 [38] (2018) Progressive neural architecture search. In Proc. ECCV, Cited by: §2.
 [39] (2019) DARTS: differentiable architecture search. In Proc. ICLR, Cited by: §1, §2.
 [40] (2019) MetaPruning: meta learning for automatic neural network channel pruning. In Proc. ICCV, Cited by: Table 6, §1, §1, §1, §2, §3.1, Table 1.
 [41] (2019) Rethinking the value of network pruning. In Proc. ICLR, Cited by: §1.
 [42] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proc. ECCV, pp. 116–131. Cited by: §0.B.6, Table 6, Appendix 0.C, item (iv).
 [43] (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. ICCV, Vol. 2, pp. 416–423. Cited by: §4.
 [44] (2019) Cascaded projection: end-to-end network compression and acceleration. In Proc. CVPR, Cited by: Table 5, Table 1.
 [45] MIT Technology Review: 10 breakthrough technologies 2020. Note: https://www.technologyreview.com/lists/technologies/2020/#tinyai. Accessed: 2020-03-01. Cited by: §1.

 [46] (2019) Importance estimation for neural network pruning. In Proc. CVPR, pp. 11264–11272. Cited by: §1.
 [47] (2018) HyperST-Net: hypernetworks for spatio-temporal forecasting. arXiv preprint arXiv:1809.10889. Cited by: §2.
 [48] (2018) Extreme network compression via filter group approximation. In Proc. ECCV, pp. 300–316. Cited by: §4.3, Table 3.
 [49] (2018) Efficient neural architecture search via parameter sharing. In Proc. ICML, pp. 4095–4104. Cited by: §2.

 [50] (2019) Regularized evolution for image classifier architecture search. In Proc. AAAI, Vol. 33, pp. 4780–4789. Cited by: §2.
 [51] (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. MICCAI, pp. 234–241. Cited by: item (iv), Table 3, §4.
 [52] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. CVPR, pp. 4510–4520. Cited by: Table 6, item (iv), §1, §4.
 [53] (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: item (iv), §4.
 [54] (2018) Clustering convolutional kernels to compress deep neural networks. In Proc. ECCV, pp. 216–232. Cited by: §4.2, Table 2, Table 3.
 [55] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
 [56] (2019) MnasNet: platform-aware neural architecture search for mobile. In Proc. CVPR, pp. 2820–2828. Cited by: Table 6, Appendix 0.C, item (iv).
 [57] (2019) GASL: guided attention for sparsity learning in deep neural networks. arXiv preprint arXiv:1901.01939. Cited by: §2.
 [58] (2019) EigenDamage: structured pruning in the Kronecker-factored eigenbasis. In Proc. ICML, pp. 6566–6575. Cited by: §1.
 [59] (2017) Factorized convolutional neural networks. In Proc. ICCVW, pp. 545–553. Cited by: §4.2, Table 2, Table 3.
 [60] (2016) Learning structured sparsity in deep neural networks. In Proc. NeurIPS, pp. 2074–2082. Cited by: §2.
 [61] (2018) MetaAnchor: learning to detect objects with customized anchors. In Proc. NeurIPS, pp. 320–330. Cited by: §2.
 [62] (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In Proc. ICLR, Cited by: §1.
 [63] (2017) Combined group and exclusive sparsity for deep neural networks. In Proc. ICML, pp. 3958–3966. Cited by: §2.
 [64] (2018) NISP: pruning networks using neuron importance score propagation. In Proc. CVPR, pp. 9194–9203. Cited by: Table 5, Table 1.
 [65] (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730. Cited by: §4.
 [66] (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE TIP 26 (7), pp. 3142–3155. Cited by: item (iv), Table 3, §4.
 [67] (2019) Variational convolutional neural network pruning. In Proc. CVPR, pp. 2780–2789. Cited by: Table 5, Table 1.
 [68] (2017) Neural architecture search with reinforcement learning. In Proc. ICLR, Cited by: §1, §1, §2.
 [69] (2018) Learning transferable architectures for scalable image recognition. In Proc. CVPR, Cited by: §1, §2.
Appendix 0.A Training Protocol
As explained in the main paper, the proposed DHP method does not rely on a pretrained model. Thus, all of the networks are trained and pruned from scratch. The hypernetworks are first initialized and used along with the proximal gradient to sparsify the latent vectors. When the difference between the target and the actual FLOPs compression ratio falls below 2%, the pruning procedure stops. Then the pruned latent vectors as well as the pruned outputs of the hypernetworks are derived. After that, the outputs of the hypernetworks are used as the weight parameters of the backbone network and updated directly by SGD or Adam, and the hypernetworks are removed. After the pruning procedure, training continues with the same protocol as for the original uncompressed network. The number of pruning epochs is much smaller than that used for training the original network: the pruning procedure typically runs for about 10 epochs, compared with the hundreds of epochs needed to train the uncompressed network. The remainder of this section describes the training protocols of the different tasks.
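The sparsification and stopping logic above can be sketched in plain Python. This is a minimal illustration, assuming an ℓ1 regularizer on the latent vectors (so the proximal operator is element-wise soft-thresholding); the function names are hypothetical and the hypernetwork machinery is omitted.

```python
def soft_threshold(z, tau):
    """Proximal operator of tau * ||z||_1, applied element-wise to a latent
    vector z (assumed ℓ1 sparsity; stands in for the paper's prox step)."""
    return [max(abs(v) - tau, 0.0) * (1.0 if v > 0 else -1.0) for v in z]

def pruning_converged(actual_ratio, target_ratio, tol=0.02):
    """Stop pruning once the actual FLOPs compression ratio is within 2%
    of the target ratio, as described in the training protocol."""
    return abs(actual_ratio - target_ratio) / target_ratio < tol
```

After `pruning_converged` returns True, the surviving (non-zero) latent entries determine the pruned architecture, and normal training resumes on the backbone weights.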
0.A.1 Image Classification
0.A.1.1 CIFAR10
We evaluate the performance of the compressed models on the CIFAR10 [26] dataset. The dataset contains 10 classes; the training and testing subsets contain 50,000 and 10,000 images, respectively. As done by prior works [15, 22], we normalize all images using the channel-wise mean and standard deviation of the training set. Standard data augmentation is also applied. We train the networks for 300 epochs with the SGD optimizer and an initial learning rate of 0.1. The learning rate is decayed by 10 after 50% and 75% of the epochs. The momentum of SGD is 0.9, the weight decay factor is set to 0.0001, and the batch size is 64.
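The step schedule above can be written as a small helper. This is a sketch assuming the decay points fall exactly at 50% and 75% of the total epochs; the function name is hypothetical.

```python
def cifar_lr(epoch, total_epochs=300, base_lr=0.1):
    """Step learning-rate schedule from the CIFAR10 protocol: the rate
    is divided by 10 after 50% and again after 75% of the epochs."""
    milestones = (int(total_epochs * 0.5), int(total_epochs * 0.75))
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)
```

With the defaults, epochs 0-149 use 0.1, epochs 150-224 use 0.01, and the remaining epochs use 0.001.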
0.A.1.2 Tiny-ImageNet
For image classification, we also apply the pruning method to Tiny-ImageNet. It has 200 classes; each class has 500 training images and 50 validation images. The images are normalized with the channel-wise mean and standard deviation. Horizontal flipping is used to augment the dataset. The networks are trained for 220 epochs with SGD and an initial learning rate of 0.1. The learning rate is decayed by a factor of 10 at Epochs 200, 205, 210, and 215. The momentum of SGD is 0.9, the weight decay factor is set to 0.0001, and the batch size is 64.
0.A.2 Super-Resolution
0.A.2.1 Training protocol
For image super-resolution, we train the networks on the DIV2K dataset. It contains 800 training images, 100 validation images, and 100 test images. We use the 800 training images to train our networks and validate on the validation subset. Image patches are extracted from the training images; EDSR and SRResNet use different low-resolution input patch sizes. The batch size is 16. The networks are optimized with the Adam optimizer using its default hyperparameters. The weight decay factor is 0.0001. The networks are trained for 300 epochs. The learning rate starts from 0.0001 and decays by 10 after 200 epochs.
0.A.2.2 Simplified EDSR architecture
Note that, in order to speed up the training of EDSR, we adopt a simplified version of it. The original EDSR contains 32 residual blocks, and each convolutional layer in the residual blocks has 256 channels. Our simplified version has 8 residual blocks, each with two convolutional layers of 128 channels.
0.A.3 Denoising
For image denoising, we also train the networks on the DIV2K dataset, with all images converted to gray-scale. As for super-resolution, image patches are extracted from the training images. For DnCNN, the batch size is 64; for UNet, it is 16, and the two networks use different input patch sizes. Gaussian noise with a fixed noise level is added to degrade the input patches on the fly. Again, the Adam optimizer is used to train the networks. The weight decay factor is 0.0001. The networks are trained for 60 epochs, and each epoch contains 10,000 iterations, i.e. 600k iterations in total. The learning rate starts from 0.0001 and decays by 10 at Epoch 40.
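The on-the-fly degradation can be sketched as follows. This is a minimal illustration on a flat list of pixel values; the function name and the `rng` parameter are hypothetical, and the noise level sigma is left as an argument since its value is not stated in this excerpt.

```python
import random

def add_gaussian_noise(patch, sigma, rng=None):
    """Degrade a clean gray-scale patch (flat list of pixel values) with
    additive Gaussian noise of standard deviation sigma, sampled on the
    fly so every epoch sees a freshly degraded version of each patch."""
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, sigma) for v in patch]
```

Sampling the noise per iteration (rather than precomputing noisy images) effectively augments the training set for free.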
Appendix 0.B Latent Vector Sharing Strategy for Different Networks
0.B.1 Basic criteria
Every convolutional layer, including standard convolution, depthwise convolution, pointwise convolution, group convolution, and transposed convolution, is attached a latent vector. The latent vectors control the channels of the convolutional layers. Thus, by dealing with only the latent vectors, we can control how the convolutional layers are pruned. Some complicated cases occur in modern network architectures, where the latent vectors have to be shared among different layers. Thus, during the development of the algorithm, we summarized some basic rules for latent vectors. In the following, we first describe the general rules for latent vectors and then detail the specific rules for special network blocks.

Every convolutional layer is attached a latent vector.

The channels that the latent vector controls and the dimension of the latent vector vary with the type of convolutional layer.

For standard convolution, pointwise convolution, and transposed convolution, the latent vector controls the output channels of the layer, and the dimension of the latent vector is the same as the number of output channels.

For depthwise convolution and group convolution, the latent vector controls the input channels per group, and the dimension of the latent vector is the same as the number of input channels per group. That is, the latent vector of a depthwise convolution contains only one element.


Since the output and input channels of consecutive layers are correlated, the latent vectors have to be shared among consecutive layers. That is, a hypernetwork has to receive the latent vectors of both the previous layer and the current layer as input in order to keep the input and output channels consistent with the neighboring layers.

Not every latent vector needs to be sparsified. The latent vectors free from sparsification are listed as follows.

The latent vector that controls the input channels of the first convolutional layer. This latent vector has the same dimension as the number of input image channels, e.g. 3 for RGB images and 1 for gray images. Naturally, the input images do not need to be pruned.

The latent vector that controls the output channels of the last convolutional layer of an image classification network. This latent vector is related to the fully connected layers of the classifier. Since we do not intend to prune the fully connected layers, the correlated latent vector is not pruned either.

The latent vectors attached to depthwise convolution and group convolution. These latent vectors control the input channels per group. The input and output channels of depthwise and group convolution are correlated. In order to compress depthwise and group convolution, we prune their output channels. This has the same effect as reducing the number of groups, which is controlled by the latent vector of the previous layer, instead of reducing the input channels per group, which is controlled by the latent vector of the current layer.
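The dimension rules above can be summarized in a small helper. This is a sketch; the function name and the string labels for the convolution types are hypothetical.

```python
def latent_dim(conv_type, out_channels, in_channels, groups=1):
    """Dimension of the latent vector attached to a convolutional layer,
    following the basic criteria listed above."""
    if conv_type in ("standard", "pointwise", "transposed"):
        # The latent vector controls the output channels.
        return out_channels
    if conv_type in ("depthwise", "group"):
        # The latent vector controls the input channels per group,
        # which is 1 for depthwise convolution.
        return in_channels // groups
    raise ValueError(f"unknown convolution type: {conv_type}")
```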

0.B.2 Residual block
Residual networks, including ResNet, SRResNet, and EDSR, are constructed by stacking a number of residual blocks. Depending on the dimension of the feature maps, a residual network contains several stages with progressively decreasing feature map resolution and increasing number of feature maps. (Note that the feature map dimension of EDSR and SRResNet does not change across the residual blocks, so there is only one stage for those networks.) For the residual blocks within the same stage, the output channels are correlated due to the skip connections. In order to prune the second convolution of the residual blocks within the same stage, we set a shared latent vector for them. Thus, by only dealing with this shared latent vector, all of the second convolutions of the residual blocks can be pruned together. Please refer to Table 4 for the ablation study on the latent vector sharing and non-sharing strategies.
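The sharing mechanism can be illustrated with a toy sketch in which all blocks of a stage alias one and the same latent vector, so zeroing an entry prunes the corresponding channel in every block at once. The class and method names are hypothetical; the real method operates on hypernetwork latent vectors updated by the proximal gradient.

```python
class Stage:
    """Residual blocks in one stage share a single latent vector for their
    second convolutions (toy sketch; hypernetwork machinery omitted)."""

    def __init__(self, num_blocks, num_channels):
        self.shared_latent = [1.0] * num_channels   # one vector per stage
        # Every block references the SAME list, i.e. the shared vector.
        self.blocks = [self.shared_latent for _ in range(num_blocks)]

    def prune_channel(self, idx):
        # Zeroing one entry prunes that channel in all blocks of the stage.
        self.shared_latent[idx] = 0.0

    def kept_channels(self, block):
        return [i for i, v in enumerate(self.blocks[block]) if v != 0.0]
```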
0.B.3 Dense block
Similar to residual networks, DenseNet also contains several stages with different feature map configurations. But different from residual networks, each dense block concatenates its input and output to form the final output of the block. As a result, each dense block receives as input the outputs of all of the previous dense blocks within the same stage. Thus, the hypernetwork of a dense block also has to receive the latent vectors of the corresponding dense blocks as input.
0.B.4 Inverted residual block
Inverted residual blocks are a special case of residual blocks, so latent vectors are shared across different blocks in the same way as for normal residual blocks. Here we specifically address the sharing strategy within a block, which is needed because of the depthwise convolution. The inverted residual block has the architecture “pointwise conv + depthwise conv + pointwise conv”. As explained earlier, the latent vector of the depthwise convolution controls the input channels per group, i.e. 1 here. Thus, the latent vector of the first pointwise convolution controls not only its own output channels but also the input channels of the depthwise convolution and of the second pointwise convolution, and it has to be passed to the hypernetworks of those convolutional layers.
0.B.5 Upsampler of super-resolution networks
Image super-resolution networks attach upsampler blocks at the tail of the network to increase the spatial resolution of the feature maps. For a scaling factor of 4, two upsamplers are attached and each doubles the spatial resolution. Each upsampler block contains a standard convolutional layer that increases the number of feature maps by a factor of 4 and a pixel shuffler that shuffles every 4 consecutive feature maps into the spatial dimension. Thus, the output channels of the convolutional layer in the upsampler are correlated with its input channels: if one input channel is pruned, the 4 corresponding consecutive output channels should also be pruned. To achieve this pruning control, a common latent vector is used for the input and output channels, and the vector is repeated in an interleaved manner to form the one controlling the output channels.
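The interleaved repetition can be sketched as follows, assuming a ×2 upsampler so that each latent entry maps to scale² = 4 consecutive output channels; the function name is hypothetical.

```python
def upsampler_output_latent(latent, scale=2):
    """Expand the shared latent vector to cover the conv output channels of
    a pixel-shuffle upsampler: each entry is repeated scale**2 times
    consecutively, so pruning one input channel (a zero entry) removes the
    4 matching consecutive output channels."""
    return [v for v in latent for _ in range(scale ** 2)]
```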
| Share | Regularizer | Target FLOPs Ratio (%) | Actual FLOPs Ratio (%) | Actual Parameter Ratio (%) | Top-1 Error (%) |
| --- | --- | --- | --- | --- | --- |
| Yes | | 38 | 39.96 | 52.49 | 7.41 |
| No | | 38 | 39.75 | 55.43 | 7.32 |
| No | | 38 | 39.40 | 54.24 | 7.91 |
| Yes | | 38 | 39.60 | 49.00 | 6.86 |
| No | | 38 | 39.30 | 58.35 | 7.98 |
| No | | 38 | 39.54 | 54.04 | 7.03 |
| Yes | | 50 | 51.27 | 56.84 | 7.13 |
| No | | 50 | 51.44 | 65.47 | 6.85 |
| No | | 50 | 50.96 | 64.05 | 6.85 |
| Yes | | 50 | 51.68 | 57.74 | 6.52 |
| No | | 50 | 50.23 | 62.83 | 7.11 |
| No | | 50 | 50.18 | 59.15 | 6.74 |
| Network (Top-1 Error %) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%) |
| --- | --- | --- | --- | --- |
| ResNet-56 (7.05) | Variational [67] | 7.74 | 79.70 | 79.51 |
| | GAL-0.6 [36] | 6.62 | 63.40 | 88.20 |
| | 56-pruned-B [30] | 6.94 | 62.40 | 86.30 |
| | NISP [64] | 6.99 | 56.39 | 57.40 |
| | DHP-50 (Ours) | 6.31 | 50.98 | 55.62 |
| | CaP [44] | 6.78 | 50.20 | – |
| | ENC [25] | 7.00 | 50.00 | – |
| | AMC [17] | 8.10 | 50.00 | – |
| | KSE [34] | 6.77 | 48.00 | 45.27 |
| | FPGM [16] | 6.74 | 47.70 | – |
| | GAL-0.8 [36] | 8.42 | 39.80 | 34.10 |
| | DHP-38 (Ours) | 6.86 | 39.60 | 49.00 |
| ResNet-110 (5.31) | Variational [67] | 7.04 | 63.56 | 58.73 |
| | DHP-62 (Ours) | 5.36 | 62.50 | 59.28 |
| | GAL-0.5 [36] | 7.26 | 51.50 | 55.20 |
| | DHP-20 (Ours) | 6.93 | 21.77 | 21.30 |
| ResNet-164 (4.97) | SSS [24] | 5.78 | 53.53 | – |
| | DHP-50 (Ours) | 5.29 | 51.86 | 44.10 |
| | Variational [67] | 6.84 | 50.92 | 43.30 |
| | DHP-20 (Ours) | 5.91 | 21.47 | 20.24 |
| DenseNet-12-40 (5.26) | Variational [67] | 6.84 | 55.22 | 40.33 |
| | DHP-38 (Ours) | 6.06 | 39.80 | 63.76 |
| | DHP-28 (Ours) | 6.51 | 29.52 | 26.01 |
| | GAL-0.1 [36] | 6.77 | 28.60 | 25.00 |
| | DHP-24 (Ours) | 6.53 | 27.12 | 25.76 |
| | DHP-20 (Ours) | 7.17 | 22.85 | 20.38 |
| VGG16 (6.34) | DHP-40 (Ours) | 7.61 | 40.11 | 35.61 |
| | DHP-20 (Ours) | 8.60 | 21.65 | 17.18 |
| Network (Top-1 Error %) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%) |
| --- | --- | --- | --- | --- |
| ResNet-50 (42.35) | DHP-40 (Ours) | 43.97 | 41.57 | 45.89 |
| | MetaPruning [40] | 49.82 | 10.21 | 37.32 |
| | DHP-9 (Ours) | 47.00 | 10.95 | 13.15 |
| MobileNetV1 (51.87) | DHP-242 (Ours) | 50.47 | 98.95 | 79.65 |
| | MobileNetV1-0.75 [19] | 53.18 | 57.74 | 57.75 |
| | MetaPruning [40] | 54.66 | 56.49 | 88.06 |
| | DHP-50 (Ours) | 51.96 | 51.37 | 47.76 |
| | DHP-30 (Ours) | 54.47 | 31.79 | 26.28 |
| | MobileNetV1-0.5 [19] | 55.44 | 26.99 | 27.00 |
| | DHP-10 (Ours) | 55.71 | 11.8 | 9.18 |
| | MobileNetV1-0.3 [19] | 60.54 | 10.46 | 10.63 |
| MobileNetV2 (43.83) | DHP-242 (Ours) | 43.14 | 96.34 | 107.57 |
| | MobileNetV2-0.75 [52] | 45.00 | 58.60 | 58.94 |
| | DHP-50 (Ours) | 44.13 | 51.71 | 56.32 |
| | DHP-30 (Ours) | 44.42 | 31.85 | 36.29 |
| | MobileNetV2-0.5 [52] | 48.01 | 28.17 | 28.59 |
| | DHP-10 (Ours) | 47.65 | 11.94 | 16.86 |
| | MobileNetV2-0.3 [52] | 53.21 | 11.08 | 11.89 |
| | MetaPruning [40] | 56.53 | 11.08 | 89.92 |
| MNasNet (50.96) | DHP-282 (Ours) | 50.11 | 98.52 | 58.51 |
| | MNasNet-0.75 [56] | 52.32 | 71.22 | 63.89 |
| | DHP-69 (Ours) | 50.95 | 70.86 | 63.5 |
| | MNasNet-0.5 [56] | 52.99 | 41.70 | 35.61 |
| | DHP-39 (Ours) | 51.73 | 40.93 | 31.26 |
| | MNasNet-0.3 [56] | 55.87 | 24.72 | 19.90 |
| | DHP-22 (Ours) | 53.92 | 23.72 | 15.76 |
| ShuffleNetV2-1.5 (43.71) | ShuffleNetV2-1.0 [42] | 46.35 | 48.29 | 54.35 |
| | DHP-46 (Ours) | 45.13 | 47.34 | 48.1 |
0.B.6 Channel shuffle in ShuffleNetV2
The distinguishing feature of ShuffleNetV2 [42] is the channel shuffle operation. In each inverted residual block of ShuffleNetV2, the input feature map is split into two branches. Different operations are applied to the two branches, after which a channel shuffle operation is conducted between the feature maps of the two branches for the purpose of information exchange between them. Due to this operation, the branches of all of the inverted residual blocks within the same stage are interconnected. Thus, if one channel in a branch is to be pruned, the corresponding channel in the other branch should also be pruned. This is not a problem before the channel shuffle operation of the current inverted residual block. But after the channel shuffle operation, the two pruned channels appear in the same branch of the next inverted residual block while none of the channels in the other branch are pruned, which results in imbalanced branches. Considering that the channels are also shuffled in the next inverted residual block and that the shuffle operation needs balanced branches, pruning ShuffleNetV2 is almost impossible for traditional network pruning methods. With the proposed DHP method, we simply assign a shared latent vector to all of the interconnected branches of the inverted residual blocks within the same stage. By pruning the shared latent vector, all of the branches are compressed together. Although the channel shuffle operation still complicates the situation, and the pruned channels before and after the channel shuffle operation do not match exactly, the proposed DHP still makes it possible to prune ShuffleNetV2. To the best of our knowledge, this is the first attempt to automatically prune ShuffleNetV2. As with the other networks, a layer-wise distinguishing configuration is found by the automatic pruning method.
Appendix 0.C More Results
The ablation study on latent vector sharing is shown in Table 4. The latent vector sharing strategy outperforms the non-sharing strategy consistently, except in the cases where the gap between the parameter compression ratios of the two strategies is relatively large. Based on this observation, we developed the latent vector sharing rules above for easier and better automatic network pruning.
More results on image classification networks are shown in Table 5 and Table 6. In addition to the results in the main paper, compression results on MNasNet [56] and ShuffleNetV2 [42] are also shown. Note that MobileNetV1, MobileNetV2, ShuffleNetV2, and MNasNet are already quite efficient networks designed by human experts or by automatic architecture search methods. The proposed DHP leads to even more efficient versions of those networks with different width multipliers. This phenomenon validates the importance of per-layer distinguishing configurations.