DHP: Differentiable Meta Pruning via HyperNetworks

03/30/2020 · by Yawei Li, et al.

Network pruning has been the driving force for the efficient inference of neural networks and for alleviating the model storage and transmission burden. Traditional network pruning methods focus on the per-filter influence on network accuracy by analyzing the filter distribution. With the advent of AutoML and neural architecture search (NAS), pruning has become topical with automatic mechanisms and search-based architecture optimization. However, current automatic designs rely on either reinforcement learning or evolutionary algorithms, which often lack a theoretical convergence guarantee or do not converge within a meaningful time limit. In this paper, we propose a differentiable pruning method via hypernetworks for automatic network pruning and layer-wise configuration optimization. A hypernetwork is designed to generate the weights of the backbone network. The inputs of the hypernetwork, namely the latent vectors, control the output channels of the layers of the backbone network. By applying ℓ_1 sparsity regularization to the latent vectors and utilizing proximal gradient, sparse latent vectors can be obtained and their zero elements removed. Thus, the corresponding elements of the hypernetwork outputs can also be removed, achieving the effect of network pruning. The latent vectors of all the layers are pruned together, resulting in an automatic layer configuration. Extensive experiments are conducted on various networks for image classification, single image super-resolution, and denoising, and the experimental results validate the proposed method.


1 Introduction

Figure 1: Top-1 error vs. (a) FLOP compression ratio and (b) parameter compression ratio on MobileNetV1 and MobileNetV2. Comparison between the models compressed by the proposed DHP method and the original models with different width multipliers. The DHP operating points near 100% FLOPs ratio are obtained by pruning the two networks with width multiplier 2.

These days, network pruning has become the workhorse of network compression, which aims at lightweight and efficient models for fast inference [12, 18, 17, 41, 40, 32]. This is of particular importance for the deployment of tiny artificial intelligence (Tiny AI) algorithms on smart phones and edge devices [45]. Since the emergence of network pruning, a couple of methods have been proposed based on the analysis of gradients, Hessians, or the filter distribution [28, 13, 9, 46, 58, 30, 62, 16]. With the advent of AutoML and neural architecture search (NAS) [68, 7], a new trend of network compression and pruning has emerged, i.e. pruning with automatic algorithms that targets distinct mini-architectures (e.g. layers or building blocks). Among them, reinforcement learning and evolutionary algorithms have become the natural choices [17, 40]. The core idea is to search for a fine-grained, layer-wise distinct configuration among all of the possible choices (the population in the terminology of evolutionary algorithms). After the searching stage, the most promising candidate that optimizes the network capacity under the constrained budget is chosen.

The advantage of these automatic pruning methods is the final layer-wise distinct configuration, so hand-crafted design is no longer necessary. However, the main concern with these algorithms is their convergence behavior. For example, reinforcement learning is notorious for its difficulty to converge with a large or even moderate number of states [55]. Evolutionary algorithms need to choose the best candidate from an already converged population. But the dilemma lies in the impossibility of training the whole population until convergence and the difficulty of choosing the best candidate from an unconverged population [40, 14]. A promising solution to this problem is endowing the searching mechanism with differentiability or directly resorting to an approximately differentiable algorithm, since differentiability guarantees theoretical convergence and has the potential to make the searching stage efficient. Differentiability has in fact facilitated a couple of machine learning approaches, a typical one being NAS. Early works on NAS had an insatiable demand for computing resources, consuming tens of thousands of GPU hours for a satisfactory convergence [68, 69]. The introduction of differentiable architecture search (DARTS) reduced this consumption to tens of GPU hours, which has boosted the development of NAS during the past year [39].

Another noteworthy direction for automatic pruning is brought by MetaPruning [40], which introduces hypernetworks [11] into network compression. The output of the so-called hypernetwork is used as the parameters of the backbone network. During training, the gradients are also back-propagated to the hypernetworks. This method falls into the paradigm of meta learning since the parameters of the hypernetwork act as the meta-data of the parameters of the backbone network. The problem with this method is that the hypernetworks can only output fixed-size weights, which cannot serve as a layer-wise configuration searching mechanism. Thus, a searching algorithm such as an evolutionary algorithm is necessary for the discovery of a good candidate. Although this is quite a natural choice, one interesting question remains, namely, whether one can design a hypernetwork whose output size depends on the input (termed the latent vector in this paper) so that by only dealing with the latent vector, the backbone network can be automatically pruned.

Figure 2: The workflow of the proposed differentiable pruning method. Orange arrows denote the information flow in the backbone network while green arrows denote the flow between the hypernetwork and the backbone. The pink arrow denotes how proximal gradient plays a role in the pipeline. Every convolutional layer in the backbone network is anchored with a latent vector that controls the output dimension of the layer. The hypernetwork receives as input the latent vectors of the current and previous layer and emits as output the weight of the backbone layer. The differentiability comes with the hypernetwork tailored to pruning and the proximal gradient exploited to solve the sparsity regularization. After the pruning optimization stage, sparse latent vectors are obtained, which result in contracted weights after being passed through the hypernetwork.

To solve the aforementioned problem, we propose a differentiable meta pruning approach via hypernetworks, termed DHP (D – Differentiable, H – Hyper, P – Pruning). A new design of hypernetwork is proposed to adapt to the requirements of differentiability. Each layer is endowed with a latent vector that controls the output channels of this layer. The hypernetwork takes as input the latent vectors of the current layer and the previous layer, which control the output and input channels of the current layer, respectively. By forward passing the latent vectors through the hypernetwork, the derived output is used as the parameters of the backbone layer. To achieve the effect of automatic pruning, a sparsity regularizer is applied to the latent vectors. A pruned model is discovered by updating the latent vectors with proximal gradient. The searching stage stops when the compression ratio drops to the target level. After the searching stage, the latent vectors become sparse, with some elements approaching zero which can be removed. Accordingly, the output of the hypernetwork, which is covariant with the latent vector, is also compressed. Thus, the advantage of the proposed method is that operating only on the latent vectors makes automatic network pruning easier, without other bells and whistles.

With the fast development of efficient network design and NAS, the usefulness of network pruning is frequently challenged. However, by analyzing the pruning performance on MobileNetV1 [19] and MobileNetV2 [52] in Fig. 1, we conclude that automatic network pruning is of vital importance for further exploring the capacity of efficient networks. Efficient network design and NAS can only result in an overall architecture whose building blocks are endowed with the same mini-architecture. By automatic network pruning, the efficient networks obtained by either human experts or NAS can be further compressed, leading to block-wise distinct configurations, which can be seen as a fine-grained architecture search. We conjecture that this per-layer distinct configuration might help to discover the potential capacity of efficient networks without losing too much of the accuracy of the original network.

Thus, the contributions of this paper are as follows.

  • A new architecture of hypernetwork is designed. Different from the classical hypernetwork composed of linear layers, the new design is tailored to automatic network pruning. The input latent vector of the hypernetwork controls the parameters of the layers of the backbone network. By only operating on the latent vector, the backbone network can be pruned.

  • A differentiable automatic network pruning method is proposed based on the newly designed hypernetwork. Different from the existing methods based on reinforcement learning or evolutionary algorithms, the proposed method has a theoretical convergence guarantee based on the proximal gradient algorithm.

  • The potential of automatic network pruning as fine-grained architecture search is revealed by comparing the compression results from DHP with those from efficient networks.

  • The proposed differentiable automatic pruning method is not limited to a specific network architecture or layer type. A wide range of networks is compressed, including VGG [53], ResNet [15], DenseNet [22], MobileNetV1 [19], MobileNetV2 [52], ShuffleNetV2 [42], MNasNet [56], DnCNN [66], UNet [51], SRResNet [29], and EDSR [35], which contain various layer and block types including standard convolution, depth-wise convolution, point-wise convolution, and transposed convolution, combined with bottleneck blocks, skip connections, and densely connected blocks.

  • Extensive experiments are done on both high-level and low-level vision tasks including image classification, single image super-resolution, and denoising. The experimental results show that the proposed method sets the new state-of-the-art in automatic network pruning.

The rest of the paper is organized as follows. Sec. 2 introduces the related works. Sec. 3 explains the proposed differentiable pruning method in detail. Experimental results are shown in Sec. 4, and Sec. 5 concludes the paper.

2 Related Works

Network pruning. Aiming at removing the weak filter connections that have the least influence on the accuracy of the network, network pruning has attracted increasing attention recently. Early attempts emphasized storage consumption, and various criteria were explored to remove inconsequential connections in an unstructured manner [12, 37]. Despite their success in reducing network parameters, unstructured pruning leads to irregular weight parameters, which limits the actual acceleration of the pruned network. To further address the efficiency issue, structured pruning methods directly zero out structured groups of the convolutional filters. For example, Wen et al. [60] and Alvarez et al. [2] first proposed to resort to group sparsity regularization during training to reduce the number of feature maps in each layer. Since then, the field has witnessed a variety of regularization strategies [3, 63, 31, 57, 32]. These elaborately designed regularization methods considerably advance the pruning performance, but they often rely on carefully adjusted hyper-parameters selected for a specific network architecture and dataset.

AutoML. Recently, there is an emerging trend of exploiting the idea of AutoML for automatic network compression. The rationale lies in exploring the total population of network configurations for the final best candidate. He et al. [17] exploited reinforcement learning agents to prune networks so that hand-crafted design is no longer necessary. Hayashi et al. [14] utilized a genetic algorithm to enumerate candidates in a designed hypergraph for tensor network decomposition. Liu et al. [40] trained a hypernetwork to generate the weights of the backbone network and used an evolutionary algorithm to search for the best candidate. The problem with these approaches is that the searching algorithm is not differentiable, which does not result in guaranteed convergence.

NAS. NAS automates the manual task of neural network architecture design. Optimally, searched networks achieve smaller test error, require fewer parameters, and need less computation than their manually designed counterparts [68, 50]. But the main drawback of both early strategies is their almost insatiable demand for computational resources. To alleviate the computational burden, several methods [69, 38, 39] proposed to search for a basic building block, i.e. a cell, as opposed to an entire network. Then, stacking multiple cells with equivalent structure but different weights defines a full network [49, 5]. Another recent trend in NAS is differentiable search methods such as DARTS [39]. The differentiability allows the fast convergence of the searching algorithm and thus boosts the fast development of NAS during the past year. In this paper we propose a differentiable counterpart for automatic network pruning.

Meta learning and hypernetworks. Meta learning is a broad family of machine learning techniques that deal with the problem of learning to learn. Recent works have witnessed its application to various vision tasks including object detection [61], instance segmentation [20], and super-resolution [21]. An emerging trend of meta learning uses hypernetworks to predict the weight parameters of the backbone network [11]. Since the introduction of hypernetworks, they have found wide application in NAS [5], multi-task learning [47], Bayesian neural networks [27], and also network pruning [40]. In this paper, we propose a new design of hypernetwork which is especially suitable for network pruning and makes differentiability possible for automatic network pruning.

3 Methodology

The pipeline of the proposed method is shown in Fig. 2. The two cores of the whole pipeline are the designed hypernetwork and the optimization algorithm, i.e. proximal gradient. In the forward pass, the designed hypernetwork takes the latent vectors as input and predicts the weight parameters of the backbone network. The gradients are back-propagated to the hypernetwork in the backward pass. The sparsity regularizer is enforced on the latent vectors, and proximal gradient is used to solve the problem. The dimension of the output of the hypernetwork is covariant with that of the input. Due to this property, the output weights are pruned along with the sparse latent vectors after the optimization step. The differentiability comes with the covariance property of the hypernetworks, the sparsity enforced on the latent vectors, and the proximal gradient used to solve the problem. The automation of pruning is due to the fact that all of the latent vectors are non-discriminatively regularized and that proximal gradient automatically discovers the potentially less important elements.
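For concreteness, the following PyTorch-style sketch outlines one optimization step of this pipeline. The names (`backbone`, `hypernets`, `latents`) and the assumption that the backbone accepts externally generated weights are ours; it only illustrates the flow described above, not the authors' implementation.

```python
import torch

def train_step(backbone, hypernets, latents, batch, optimizer, lmbd, lr):
    """One pruning-stage step: hypernetworks emit the backbone weights,
    gradients flow back to the hypernetworks and latent vectors, and the
    latent vectors additionally receive a proximal (soft-thresholding) update."""
    x, y = batch
    # Forward: each backbone layer uses the weights predicted by its hypernetwork
    # from the latent vectors of the previous and current layer.
    weights = [h(latents[l - 1], latents[l]) for l, h in enumerate(hypernets, start=1)]
    loss = torch.nn.functional.cross_entropy(backbone(x, weights), y)  # assumed functional backbone

    optimizer.zero_grad()
    loss.backward()          # gradients reach hypernetwork parameters and latents
    optimizer.step()         # SGD update for hypernetworks and latents

    with torch.no_grad():    # proximal step enforces sparsity on the latents only
        for z in latents[1:]:   # latents[0] (input image channels) is never pruned
            z.copy_(torch.sign(z) * torch.clamp(z.abs() - lr * lmbd, min=0))
    return loss.item()
```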

Figure 3: The functionality of the hypernetwork. It contains three layers in total, i.e. the latent layer, the embedding layer, and the explicit layer. See Subsec. 3.1 for details.

3.1 Hypernetwork design

We first introduce the design of the hypernetwork shown in Fig. 3. In summary, the hypernetwork consists of three layers. The latent layer takes the latent vectors as input and computes a latent matrix from them. The embedding layer projects the elements of the latent matrix to an embedding space. The last explicit layer converts the embedded vectors to the final output. This design is inspired by the fully connected layers used in [11, 40] but differs from those designs in that the output dimension is covariant with the input latent vector. This design is applicable to all types of convolutions including the standard convolution, depth-wise convolution, point-wise (1 × 1) convolution, and transposed convolution. For simplicity of reference, the term convolution is used to denote any of them. Unless otherwise stated, we use normal, lowercase bold, and uppercase bold letters to denote scalars, vectors, and matrices or higher-dimensional tensors, respectively. The elements of a tensor are indexed by subscripts and can be scalars or vectors depending on the dimension of the indexed tensor.

Suppose that the given network is an $L$-layer convolutional neural network (CNN) with layers indexed by $l = 1, \dots, L$. The dimension of the weight parameter $\mathbf{W}^l$ of a convolutional layer is $n \times c \times w \times h$, where $n$, $c$, and $w \times h$ denote the output channel, input channel, and kernel size of the convolutional layer, respectively. Every convolutional layer is endowed with a latent vector with the same size as the output channel, namely, $\mathbf{z}^l \in \mathbb{R}^n$. Thus, the layer previous to the current one is given a latent vector $\mathbf{z}^{l-1} \in \mathbb{R}^c$. The hypernetwork receives the latent vectors $\mathbf{z}^l$ and $\mathbf{z}^{l-1}$ of the current and the previous layer as input. A latent matrix is first derived from the two latent vectors, namely,

$$\mathbf{Z}^l = \mathbf{z}^l \cdot \mathbf{z}^{l-1,T}, \qquad (1)$$

where $\cdot^{T}$ and $\cdot$ denote matrix transpose and multiplication, and $\mathbf{Z}^l \in \mathbb{R}^{n \times c}$. Then every element $z^l_{i,j}$ of the latent matrix is projected to an $m$-dimensional embedding space with the vectors $\mathbf{w}_1$ and $\mathbf{b}_1$, namely,

$$\mathbf{e}^l_{i,j} = \mathbf{w}_1 \cdot z^l_{i,j} + \mathbf{b}_1, \qquad (2)$$

where $\mathbf{w}_1, \mathbf{b}_1 \in \mathbb{R}^m$ are element-wise unique and, for the simplicity of notation, the subscript $(i,j)$ is omitted. We denote the ensemble of the element-wise embedding operation with the following high-dimensional tensor operation

$$\mathbf{E}^l = \mathbf{W}_1^l \odot \psi(\mathbf{Z}^l) + \mathbf{B}_1^l, \qquad (3)$$

where $\mathbf{W}_1^l, \mathbf{B}_1^l, \mathbf{E}^l \in \mathbb{R}^{n \times c \times m}$, $\mathbf{w}_1$ and $\mathbf{b}_1$ are the $(i,j)$-th slices of $\mathbf{W}_1^l$ and $\mathbf{B}_1^l$ along their first two dimensions, $\odot$ denotes the broadcastable element-wise tensor multiplication, and $\psi(\cdot)$ inserts a third dimension for $\mathbf{Z}^l$. Note that after the operation in Eqn. 3, all of the elements of $\mathbf{Z}^l$ are converted to embedded vectors in the embedding space. The final step is to obtain the output that can be explicitly used as the weights of the convolutional layer. To achieve that, every embedded vector is multiplied by an explicit matrix, that is,

$$\mathbf{o}^l_{i,j} = \mathbf{w}_2 \cdot \mathbf{e}^l_{i,j} + \mathbf{b}_2, \qquad (4)$$

where $\mathbf{w}_2 \in \mathbb{R}^{wh \times m}$, $\mathbf{b}_2 \in \mathbb{R}^{wh}$. The operation above can be easily rewritten as batched matrix multiplication as in the convention of tensor operations,

$$\mathbf{O}^l = \mathbf{W}_2^l \times \mathbf{E}^l + \mathbf{B}_2^l, \qquad (5)$$

where $\mathbf{W}_2^l \in \mathbb{R}^{n \times c \times wh \times m}$, $\mathbf{B}_2^l, \mathbf{O}^l \in \mathbb{R}^{n \times c \times wh}$, and $\times$ denotes batched matrix multiplication. Again note that in Eqn. 4, $\mathbf{w}_2$ and $\mathbf{b}_2$ are unique for every embedded vector and denote the slices from $\mathbf{W}_2^l$ and $\mathbf{B}_2^l$.

A simplified representation of the output of the hypernetwork is given by

$$\mathbf{O}^l = h(\mathbf{z}^l, \mathbf{z}^{l-1}; \mathbf{W}_1^l, \mathbf{B}_1^l, \mathbf{W}_2^l, \mathbf{B}_2^l), \qquad (6)$$

where $h(\cdot)$ denotes the functionality of the hypernetwork. The final output $\mathbf{O}^l$ can be reshaped to form the weight parameter of the $l$-th layer. The output is said to be covariant with the input latent vectors in that its first two dimensions are the same as the dimensions of the two latent vectors. By imposing sparsity regularization on the latent vectors, the corresponding elements in the output can also be removed, thus achieving the effect of network pruning.

For the initialization of the parameters, all biases are initialized as zero, the latent vectors are initialized with the standard normal distribution, and the weights of the embedding layer with Xavier uniform [10]. The weight of the explicit layer is initialized with Hyperfan-in, which guarantees stable backbone network weights and fast convergence [6].
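A minimal PyTorch sketch of such a per-layer hypernetwork is given below. It paraphrases Eqns. (1)-(5) as reconstructed above; the class and parameter names are ours, and the initialization is simplified here (the paper uses Xavier uniform and Hyperfan-in).

```python
import torch
import torch.nn as nn

class LayerHypernet(nn.Module):
    """Sketch of the three-stage hypernetwork of Subsec. 3.1 (latent, embedding,
    and explicit layers); names and initialization are ours, not the authors'."""
    def __init__(self, n, c, kernel_size, m=8):
        super().__init__()
        k2 = kernel_size * kernel_size
        # Embedding layer: element-wise unique vectors w1, b1 in R^m.
        self.w1 = nn.Parameter(torch.randn(n, c, m))
        self.b1 = nn.Parameter(torch.zeros(n, c, m))
        # Explicit layer: element-wise unique matrices w2 in R^{wh x m} and biases b2.
        self.w2 = nn.Parameter(torch.randn(n, c, k2, m))
        self.b2 = nn.Parameter(torch.zeros(n, c, k2))
        self.n, self.c, self.k = n, c, kernel_size

    def forward(self, z_prev, z_cur):
        # Eqn. (1): latent matrix from the two latent vectors, shape (n, c).
        Z = z_cur.unsqueeze(1) * z_prev.unsqueeze(0)
        # Eqns. (2)-(3): broadcast every element of Z into the m-dim embedding space.
        E = self.w1 * Z.unsqueeze(-1) + self.b1                            # (n, c, m)
        # Eqns. (4)-(5): batched matrix multiplication to the explicit weights.
        O = torch.matmul(self.w2, E.unsqueeze(-1)).squeeze(-1) + self.b2   # (n, c, wh)
        return O.view(self.n, self.c, self.k, self.k)  # weight of the backbone layer
```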

3.2 Sparsity regularization and proximal gradient

The core of the approximate differentiability comes not only from the specifically designed hypernetwork but also from the mechanism used to search for the potential candidates. To achieve that, we enforce sparsity constraints on the latent vectors. Thus, the loss function of the aforementioned $L$-layer CNN is denoted as

$$\min_{\mathbf{W}, \mathbf{z}} \; \mathcal{L}\big(f(\mathbf{x}; h(\mathbf{z}; \mathbf{W})), \mathbf{y}\big) + \gamma \mathcal{D}(\mathbf{W}) + \lambda \mathcal{R}(\mathbf{z}), \qquad (7)$$

where $\mathcal{L}(\cdot)$, $\mathcal{D}(\cdot)$, and $\mathcal{R}(\cdot)$ are the general loss function of the CNN, the weight decay term, and the sparsity regularization term, respectively. For the simplicity of notation, the layer superscript $l$ is omitted. The sparsity regularization takes the form of the $\ell_1$ norm, namely,

$$\mathcal{R}(\mathbf{z}) = \sum_{l=1}^{L} \|\mathbf{z}^l\|_1. \qquad (8)$$
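As a small illustration, the regularizer of Eqn. 8 can be computed by summing the ℓ_1 norms of the prunable latent vectors. The container `latents` and the `skip` set are hypothetical names for this sketch.

```python
import torch

def l1_sparsity(latents, skip=()):
    """ℓ1 sparsity regularizer summed over all prunable latent vectors (Eqn. 8).
    `latents` is a hypothetical dict {layer_name: latent tensor}; entries listed
    in `skip` (e.g. the first-layer latent, see Appendix 0.B.1) are excluded."""
    return sum(z.abs().sum() for name, z in latents.items() if name not in skip)
```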

To solve the problem in Eqn. 7, the weights and biases of the hypernetwork are updated with SGD. Note that the gradients are back-propagated from the backbone network to the hypernetwork. Thus, neither the forward pass nor the backward pass challenges the information flow between the backbone network and the hypernetwork. As for the latent vectors, they are updated with the proximal gradient algorithm, that is,

$$\mathbf{z}[k+1] = \mathbf{prox}_{\lambda\mu\mathcal{R}}\big(\mathbf{z}[k] - \mu \nabla \mathcal{L}(\mathbf{z}[k])\big), \qquad (9)$$

where $\mu$ is the step size of the proximal gradient method, set as the learning rate of the SGD updates. As can be seen in the equation, the proximal gradient update contains a gradient descent step and a proximal operation step. When the regularizer has the form of the $\ell_1$ norm, the proximal operator has a closed-form solution, i.e.

$$\mathbf{z}[k+1] = \mathrm{sgn}\big(\mathbf{z}[k+\Delta]\big) \odot \max\big(|\mathbf{z}[k+\Delta]| - \lambda\mu,\, \mathbf{0}\big), \qquad (10)$$

where $\mathbf{z}[k+\Delta]$ is the intermediate SGD update, and the sign operator $\mathrm{sgn}(\cdot)$, the thresholding operator $\max(\cdot, \mathbf{0})$, and the absolute value operator $|\cdot|$ act element-wise on the vector. Eqn. 10 is the well-known soft-thresholding function.

In practice, the latent vectors first get SGD updates along with the other parameters, after which the proximal operator is applied. Given the existence of the SGD updates and the fact that the proximal operator has a closed-form solution, we regard the whole solution as approximately differentiable (although the $\ell_1$ norm is not differentiable at 0), which guarantees fast convergence of the algorithm compared with reinforcement learning or evolutionary algorithms. Actually, the speed-up of proximal gradient lies in the fact that, instead of searching for the best candidate among the total population, it forces the solution towards the best sparse one.
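For reference, a sketch of the closed-form proximal operator of Eqn. 10 (soft-thresholding), with `lmbd` and `mu` standing for the sparsity weight λ and the step size μ:

```python
import torch

def soft_threshold(z_sgd, lmbd, mu):
    """Closed-form proximal operator of the ℓ1 norm (Eqn. 10).
    `z_sgd` is the latent vector after its intermediate SGD step."""
    return torch.sign(z_sgd) * torch.clamp(z_sgd.abs() - lmbd * mu, min=0.0)
```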

The automation of pruning follows from the way the sparsity is applied in Eqn. 8 and from the proximal gradient solution. First of all, all latent vectors are regularized together without any distinction between them. During the optimization, information and gradients flow fluently between the backbone network and the hypernetwork. The proximal gradient algorithm forces the potentially redundant elements of the latent vectors to approach zero faster than the others, without any human effort or interference in this process. The optimization stops immediately when the target compression ratio is reached. In total, there are only two additional hyper-parameters in the algorithm, i.e. the sparsity regularization factor $\lambda$ and the mask threshold $\epsilon$ in Subsec. 3.3. Thus, running the algorithm is just like pressing a button, which enables applying the algorithm to all kinds of CNNs without much interference from domain experts.

3.3 Network pruning

Different from fully connected layers, the proposed design of hypernetwork can adapt the dimension of the output according to that of the latent vectors. After the searching stage, sparse versions of the latent vectors $\mathbf{z}^{l-1*}$ and $\mathbf{z}^{l*}$ are derived. For those vectors, some of their elements are zero or approaching zero. Thus, 1-0 masks can be derived by comparing the sparse latent vectors with a predefined small threshold $\epsilon$, that is,

$$\mathbf{m}^l = \mathbb{1}\big(|\mathbf{z}^{l*}| \geq \epsilon\big), \qquad (11)$$

where the indicator function $\mathbb{1}(\cdot)$ element-wise compares the latent vector with the threshold and returns 1 if the element is not smaller than $\epsilon$ and 0 otherwise. By applying the masks to the latent vectors and analyzing the three layers together, we can get a direct impression of how the backbone layers are automatically pruned. That is,

$$h\big(\mathbf{m}^l \odot \mathbf{z}^{l*}, \mathbf{m}^{l-1} \odot \mathbf{z}^{l-1*}\big) = \mathbf{W}_2^l \times \Big[\mathbf{W}_1^l \odot \psi\big((\mathbf{m}^l \odot \mathbf{z}^{l*}) \cdot (\mathbf{m}^{l-1} \odot \mathbf{z}^{l-1*})^{T}\big)\Big] \qquad (12)$$
$$= \psi\big(\mathbf{m}^l \cdot \mathbf{m}^{l-1,T}\big) \odot h\big(\mathbf{z}^{l*}, \mathbf{z}^{l-1*}\big). \qquad (13)$$

The equality follows from the broadcastability of the operations $\odot$ and $\times$. As shown in the above equations, applying the masks to the latent vectors has the same effect as applying them to the final output. Note that in the above analysis the bias terms of the hypernetwork are omitted since they have a really small influence on the output of the hypernetwork. In any case, the same mask matrix is applied to the biases. In conclusion, the final output can be pruned according to the same criterion as the latent vectors.
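A sketch of this masking step, assuming the latent vectors live in a dictionary and using an illustrative threshold value rather than the one from the paper:

```python
import torch

def prune_latents(latents, eps=5e-3):
    """Derive 1-0 masks from the sparse latent vectors (Eqn. 11) and drop the
    zeroed entries; the hypernetwork outputs shrink with them. The threshold
    value here is illustrative only."""
    masks, pruned = {}, {}
    for name, z in latents.items():
        m = (z.abs() >= eps)
        masks[name] = m
        pruned[name] = z[m]          # removing elements shrinks the latent vector
    return masks, pruned

# The weight predicted for a layer can then be sliced with the masks of the
# current and previous layer, e.g. for a weight W of shape (n, c, k, k):
#   W_pruned = W[masks['cur']][:, masks['prev']]
```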

3.4 Latent vector sharing

Network Baseline Top-1 Error (%) Compression Method Top-1 Error (%) FLOPs Ratio (%) Parameter Ratio (%)
CIFAR10
ResNet-56 7.05 Variational [67] 7.74 79.70 79.51
GAL-0.6 [36] 6.62 63.40 88.20
56-pruned-B [30] 6.94 62.40 86.30
NISP [64] 6.99 56.39 57.40
DHP-50 (Ours) 6.31 50.98 55.62
CaP [44] 6.78 50.20
ENC [25] 7.00 50.00
AMC [17] 8.10 50.00
KSE [34] 6.77 48.00 45.27
FPGM [16] 6.74 47.70
GAL-0.8 [36] 8.42 39.80 34.10
DHP-38 (Ours) 6.86 39.60 49.00
ResNet-110 5.31 Variational [67] 7.04 63.56 58.73
DHP-62 (Ours) 5.36 62.50 59.28
GAL-0.5 [36] 7.26 51.50 55.20
DHP-20 (Ours) 6.93 21.77 21.30
ResNet-164 4.97 SSS [24] 5.78 53.53
DHP-50 (Ours) 5.29 51.86 44.10
Variational [67] 6.84 50.92 43.30
DHP-20 (Ours) 5.91 21.47 20.24
DenseNet-12-40 5.26 Variational [67] 6.84 55.22 40.33
DHP-38 (Ours) 6.06 39.80 63.76
DHP-28 (Ours) 6.51 29.52 26.01
GAL-0.1 [36] 6.77 28.60 25.00
VGG-16 6.34 DHP-40 (Ours) 7.61 40.11 35.61
DHP-20 (Ours) 8.60 21.65 17.18

Tiny-ImageNet

ResNet50 42.35 DHP-40 (Ours) 43.97 41.57 45.89
MetaPruning [40] 49.82 10.21 37.32
DHP-9 (Ours) 47.00 10.95 13.15
MobileNetV1 51.87 DHP-24-2 (Ours) 50.47 98.95 79.65
MobileNetV1-0.75 53.18 57.74 57.75
MetaPruning [40] 54.66 56.49 88.06
DHP-50 (Ours) 51.96 51.37 47.76
MobileNetV2 43.83 DHP-24-2 (Ours) 43.14 96.34 107.57
MobileNetV2-0.3 53.21 11.08 11.89
MetaPruning [40] 56.53 11.08 89.92
DHP-10 (Ours) 47.65 11.94 16.86
Table 1: Results on image classification networks, sorted w.r.t. FLOPs ratio. CIFAR10 and Tiny-ImageNet are used, respectively. The FLOPs ratio and parameter ratio denote the remaining quantities, respectively; the lower, the better. 'DHP-**' represents the target compression ratio set for our pruning method. As in Fig. 1, the operating point DHP-24-2 is derived by compressing the mobile networks with the width multiplier set to 2 and the target compression ratio set to 24%. More results are given in the supplementary.

Due to the existence of skip connections in residual networks such as ResNet, MobileNetV2, SRResNet, and EDSR, the residual blocks are interconnected with each other in the sense that their input and output dimensions are related. Therefore, the skip connections are notoriously tricky to deal with. But going back to the design of the proposed hypernetwork, a quite simple and straightforward solution is to let the hypernetworks of the correlated layers share the same latent vector. Thus, by automatically pruning the single shared latent vector, all of the relevant layers are pruned together. Actually, we first tried to use unique latent vectors for the correlated layers and applied group sparsity to them. But the experimental results showed that this is not a good choice because it resulted in lower accuracy than sharing the latent vectors (see details in the supplementary).
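The sharing scheme can be sketched as follows; the structure (one shared latent vector per stage, one private latent vector inside each block) follows the description above, while the variable names and widths are ours.

```python
import torch
import torch.nn as nn

# Minimal sketch of latent vector sharing for residual blocks in one stage:
# the output of every block's second convolution is tied to one shared latent
# vector, so pruning that single vector prunes all correlated layers at once.
stage_width, num_blocks = 64, 3
shared_z = nn.Parameter(torch.randn(stage_width))    # one shared latent per stage

blocks = []
for _ in range(num_blocks):
    z_mid = nn.Parameter(torch.randn(stage_width))   # private to the block
    blocks.append({
        'conv1': (shared_z, z_mid),    # (input latent, output latent)
        'conv2': (z_mid, shared_z),    # output latent is the shared one
    })
```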

4 Experimental Results

To validate the effectiveness of DHP, extensive experiments have been conducted on various network architectures for different computer vision tasks, including VGG [53], ResNet [15], and DenseNet [22] for CIFAR10 [26] image classification, ResNet50, MobileNetV1 [19], and MobileNetV2 [52] for Tiny-ImageNet [8] image classification, SRResNet [29] and EDSR [35] for single image super-resolution, and DnCNN [66] and UNet [51] for gray-image denoising. For super-resolution, the networks are trained on the DIV2K [1] dataset and tested on Set5 [4], Set14 [65], B100 [43], Urban100 [23], and the DIV2K validation set. For image denoising, the networks are trained on the gray version of the DIV2K dataset and tested on BSD68 and the DIV2K validation set.

We train and prune the networks from scratch without relying on a pre-trained model. The proximal gradient algorithm is first used to prune the network with the initialization detailed in Subsec. 3.1. A target ratio is set for the pruning procedure. When the difference between the target ratio and the actual compression ratio is below 2%, the automatic pruning procedure stops. The training of the pruned network then continues with the same protocol as for the original network. Please refer to the supplementary for the detailed training protocol and the selection of hyper-parameters.
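The stopping criterion can be expressed as a small helper; this is a sketch of the rule stated above, not the authors' code.

```python
def reached_target(current_flops, original_flops, target_ratio, tol=0.02):
    """Stop pruning once the actual FLOPs compression ratio is within 2% of the
    target ratio (the criterion described above)."""
    actual_ratio = current_flops / original_flops
    return abs(actual_ratio - target_ratio) <= tol
```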

4.1 Image classification

The experimental results of different network compression algorithms on image classification are shown in Table 1. For ResNet-56, the proposed method is compared with 9 different network compression methods and achieves the best performance, i.e. a 6.31% Top-1 error rate at the most intensively investigated 50% compression level. Note that this error rate is even lower than that of the uncompressed baseline. The compression on DenseNet-12-40 is reasonable compared with the other methods. For ResNet-110 and ResNet-164, the accuracy of our higher operating points DHP-62 and DHP-50 is not far from that of the baselines. More results on ResNet-110 and ResNet-164 are shown in Fig. 4. As can be seen in the figure, especially for ResNet-164, when the compression ratio is not too severe (above 20%), the accuracy does not drop much. The extreme compression prunes about 90% of the computation and parameters (for ResNet-164, the extreme compression only keeps 6.87% of the parameters). Thus, the drop in accuracy is reasonable.

On Tiny-ImageNet, DHP achieves lower Top-1 error rates than MetaPruning under the same FLOPs constraints. Our lower operating points achieve lower Top-1 error rates than the narrowed versions of the mobile networks. On MobileNetV1, the error rate of DHP-50 is 1.22% lower than that of MobileNetV1-0.75 with 6.37% fewer FLOPs and nearly 10% fewer parameters, while on MobileNetV2 the accuracy gain of DHP-10 over MobileNetV2-0.3 reaches 5.56%. Thus, we hypothesize that an error rate lower than that of the original network can be targeted by pruning the widened mobile networks, which is confirmed by comparing the accuracy of our DHP-24-2 with the baseline accuracy in Fig. 1.

Figure 4: Top-1 error vs. (a) FLOP compression ratio and (b) parameter compression ratio on ResNet164 and ResNet110.

4.2 Super-resolution

The image super-resolution results are shown in Table 2. We compare our method with the factorized convolution method (Factor) [59], the filter basis method (Basis) [33], and the K-means clustering method (Clustering) [54]. To fairly compare the methods and measure the practical compression effectiveness, five metrics are reported: Peak Signal-to-Noise Ratio (PSNR), floating point operations (FLOPs), number of parameters, runtime, and GPU memory consumption. By observing the five metrics, we draw several conclusions. Firstly, previous methods mainly focus on the reduction of FLOPs and number of parameters without paying special attention to the actual acceleration. Although some methods such as Clustering can remove a substantial number of parameters while maintaining quite good PSNR accuracy, the actual computing resources remain the same (GPU memory) or even increase (runtime) due to the overhead introduced by centroid indexing. Secondly, convolution factorization or decomposition results in additional CUDA kernel calls, which is not efficient for actual acceleration. Thirdly, for the proposed method, the two model complexity metrics change consistently across different operating points, which leads to a consistent reduction of computing resources including runtime and memory consumption. Fourthly, the proposed DHP results in inference-efficient models that also preserve accuracy. The visual results are shown in Fig. 5. As can be seen, the proposed method produces images almost indistinguishable from those of the baseline while achieving actual acceleration.

Network Method PSNR (Set5 / Set14 / B100 / Urban100 / DIV2K) FLOPs Params Runtime GPU Mem
SRResNet Baseline 32.03 28.50 27.52 25.88 28.85 32.81 1535.62 7.80 2.2763
[29] Clustering [54] 31.93 28.44 27.47 25.71 28.75 32.81 341.25 10.84 2.2727
Factor-SIC3 [59] 31.86 28.38 27.40 25.58 28.65 20.83 814.73 17.59 2.2737
DHP-60 (Ours) 31.97 28.47 27.48 25.76 28.79 20.27 948.63 6.98 2.0568
Basis-32-32 [33] 31.90 28.42 27.44 25.65 28.69 19.77 738.18 13.14 0.9331
Factor-SIC2 [59] 31.68 28.32 27.37 25.47 28.58 18.38 661.13 15.77 2.2763
Basis-64-14 [33] 31.84 28.38 27.39 25.54 28.63 17.49 598.91 9.77 0.7267
DHP-40 (Ours) 31.90 28.45 27.47 25.72 28.75 13.68 638.75 5.75 1.7152
DHP-20 (Ours) 31.77 28.34 27.40 25.55 28.60 7.75 357.92 4.52 1.3362
EDSR Baseline 32.10 28.55 27.55 26.02 28.93 90.36 3696.67 16.83 0.4438
[35] Clustering [54] 31.92 28.46 27.48 25.76 28.80 90.36 821.48 21.70 0.6775
Factor-SIC3 [59] 31.96 28.47 27.49 25.81 28.81 65.49 2189.34 33.22 1.5007
Basis-128-40 [33] 32.03 28.45 27.50 25.81 28.82 62.65 2003.46 16.59 0.4679
Factor-SIC2 [59] 31.82 28.40 27.43 25.63 28.70 60.90 1904.67 27.69 1.1247
Basis-128-27 [33] 31.95 28.42 27.46 25.76 28.76 58.28 1739.52 17.13 0.4772
DHP-60 (Ours) 31.99 28.52 27.53 25.92 28.88 55.67 2279.64 7.81 0.4658
DHP-40 (Ours) 32.01 28.49 27.52 25.86 28.85 37.77 1529.78 6.01 0.4658
DHP-20 (Ours) 31.94 28.42 27.47 25.69 28.77 19.40 785.85 4.37 0.4658
Table 2: Compression results on single image super-resolution networks. The upscaling factor is 4. 'SIC*' denotes the number of SIC layers in the Factor [59] method. The practical FLOPs instead of the theoretical FLOPs are reported for the Clustering [54] method since it is cumbersome to implement the ideal acceleration method and thus to calculate the actual FLOPs. 'DHP-*' denotes the target FLOPs compression ratio of the proposed automatic pruning method. The runtime is averaged over the B100 dataset, GPU memory is reported for testing B100 images, and FLOPs are reported for a single image patch.

4.3 Denoising

The compression results for image denoising are shown in Table 3. The same metrics as for super-resolution are reported. An additional method, i.e. the filter group approximation (Group) [48], is included. In addition to the same conclusions as in Subsec. 4.2, another two conclusions can be drawn here. The grouped convolution approximation method fails to reduce the actual computing resources although it achieves quite good accuracy and a satisfactory reduction of FLOPs and number of parameters. This might be due to the additional convolutions it introduces and possibly the inefficient implementation of group convolution in current deep learning toolboxes. For DnCNN, one interesting phenomenon is that the Factor method achieves even better accuracy than the baseline but has a larger appetite for the other resources. This is not surprising due to two facts. The SIC layer of Factor doubles the actual number of convolutional layers, so Factor-SIC3 has five times more convolutional layers, which definitely slows down the execution. The other fact is that Factor has skip connections within the SIC layer. The outperformance of Factor in accuracy just validates the effectiveness of skip connections. The performance of the other methods could also be improved if skip connections were added. The visual results are shown in Fig. 6.

Network Method PSNR (BSD68 / DIV2K) FLOPs Params Runtime GPU Mem
DnCNN [66] Baseline 24.93 26.73 9.10 557.06 23.69 3.99
Clustering [54] 24.90 26.67 9.10 123.79 21.84 3.99
DHP-60 (Ours) 24.91 26.69 5.62 344.61 17.11 2.70
DHP-40 (Ours) 24.89 26.65 3.81 233.75 17.19 2.33
Factor-SIC3 [59] 24.97 26.83 3.53 219.14 126.22 5.96
Group [48] 24.88 26.64 3.32 204.74 26.81 4.02
Factor-SIC2 [59] 24.93 26.76 2.36 147.14 85.14 5.96
DHP-20 (Ours) 24.84 26.58 2.00 122.72 10.39 1.99
UNet [51] Baseline 25.17 27.17 3.41 7759.52 7.27 3.75
Clustering [54] 25.01 26.90 3.41 1724.34 9.66 3.73
DHP-60 (Ours) 25.14 27.11 2.12 4760.27 6.09 2.93
Factor-SIC3 [59] 25.04 26.94 1.56 3415.65 40.36 4.71
Group [48] 25.13 27.08 1.49 2063.91 9.00 3.75
DHP-40 (Ours) 25.12 27.08 1.43 3238.12 5.17 2.04
Factor-SIC2 [59] 25.01 26.90 1.22 2510.82 29.31 4.71
DHP-20 (Ours) 25.04 26.97 0.75 1611.03 3.99 1.77
Table 3: Compression results on image denoising networks. The noise level is 70. The naming convention follows Table 2.
Figure 5: Single image super-resolution visual results for (a) LR, (b) EDSR, (c) Basis, (d) Cluster, (e) DHP, and (f) Factor. PSNR and FLOPs are measured on the shown image; runtime is averaged on Set5.
Figure 6: Image denoising visual results for (a) Noisy, (b) UNet, (c) DHP, (d) Factor, (e) Group, and (f) Cluster. PSNR and FLOPs are measured on the shown image; runtime is averaged on B100.

5 Conclusion and Future Work

In this paper, we proposed a differentiable automatic meta pruning method via hypernetworks for network compression. The differentiability comes with the specially designed hypernetwork and the proximal gradient used to search for the potential candidate network configurations. The automation of pruning lies in the uniformly applied sparsity on the latent vectors and the proximal gradient that solves the problem. By pruning mobile networks with width multiplier 2, we obtained models with higher accuracy but lower computational complexity than those with width multiplier 1. We hypothesize that this is due to the per-layer distinct configuration resulting from the automatic pruning. Future work will investigate whether this phenomenon reoccurs for other networks.

Acknowledgements

This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and an Nvidia grant.

References

  • [1] E. Agustsson and R. Timofte (2017-07) NTIRE 2017 challenge on single image super-resolution: dataset and study. In Proc. CVPRW, Cited by: §4.
  • [2] J. M. Alvarez and M. Salzmann (2016) Learning the number of neurons in deep networks. In Proc. NeurIPS, pp. 2270–2278. Cited by: §2.
  • [3] J. M. Alvarez and M. Salzmann (2017) Compression-aware training of deep networks. In Proc. NeurIPS, pp. 856–867. Cited by: §2.
  • [4] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proc. BMVC, Cited by: §4.
  • [5] A. Brock, T. Lim, J.M. Ritchie, and N. Weston (2018) SMASH: one-shot model architecture search through hypernetworks. In Proc. ICLR, Cited by: §2, §2.
  • [6] O. Chang, L. Flokas, and H. Lipson (2020) Principled weight initialization for hypernetworks. In Proc. ICLR, Cited by: §3.1.
  • [7] C. Chen, F. Tung, N. Vedula, and G. Mori (2018) Constraint-aware deep neural network compression. In Proc. ECCV, pp. 400–415. Cited by: §1.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proc. CVPR, pp. 248–255. Cited by: §4.
  • [9] X. Dong, S. Chen, and S. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Proc. NIPS, pp. 4857–4867. Cited by: §1.
  • [10] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pp. 249–256. Cited by: §3.1.
  • [11] D. Ha, A. Dai, and Q. V. Le (2017) HyperNetworks. In Proc. ICLR, Cited by: §1, §2, §3.1.
  • [12] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proc. ICLR, Cited by: §1, §2.
  • [13] B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Proc. NeurIPS, pp. 164–171. Cited by: §1.
  • [14] K. Hayashi, T. Yamaguchi, Y. Sugawara, and S. Maeda (2019) Einconv: exploring unexplored tensor decompositions for convolutional neural networks. In Proc. NeurIPS, pp. 5553–5563. Cited by: §1, §2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: §0.A.1.1, item (iv), §4.
  • [16] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proc. CVPR, pp. 4340–4349. Cited by: Table 5, §1, Table 1.
  • [17] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: autoML for model compression and acceleration on mobile devices. In Proc. ECCV, pp. 784–800. Cited by: Table 5, §1, Table 1.
  • [18] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proc. ICCV, pp. 1389–1397. Cited by: §1.
  • [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: Table 6, item (iv), §1, §4.
  • [20] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick (2018) Learning to segment every thing. In Proc. CVPR, pp. 4233–4241. Cited by: §2.
  • [21] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun (2019) Meta-SR: a magnification-arbitrary network for super-resolution. In Proc. CVPR, pp. 1575–1584. Cited by: §2.
  • [22] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proc. CVPR, pp. 2261–2269. Cited by: §0.A.1.1, item (iv), §4.
  • [23] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proc. CVPR, pp. 5197–5206. Cited by: §4.
  • [24] Z. Huang and N. Wang (2018) Data-driven sparse structure selection for deep neural networks. In Proc. ECCV, pp. 304–320. Cited by: Table 5, Table 1.
  • [25] H. Kim, M. Umar Karim Khan, and C. Kyung (2019-06) Efficient neural network compression. In Proc. CVPR, Cited by: Table 5, Table 1.
  • [26] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §0.A.1.1, §4.
  • [27] D. Krueger, C. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville (2017) Bayesian hypernetworks. arXiv preprint arXiv:1710.04759. Cited by: §2.
  • [28] Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Proc. NeurIPS, pp. 598–605. Cited by: §1.
  • [29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proc. CVPR, pp. 105–114. Cited by: item (iv), Table 2, §4.
  • [30] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. In Proc. ICLR, Cited by: Table 5, §1, Table 1.
  • [31] J. Li, Q. Qi, J. Wang, C. Ge, Y. Li, Z. Yue, and H. Sun (2019) OICSR: out-in-channel sparsity regularization for compact deep neural networks. In Proc. CVPR, pp. 7046–7055. Cited by: §2.
  • [32] Y. Li, S. Gu, C. Mayer, L. Van Gool, and R. Timofte (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In Proc. CVPR, Cited by: §1, §2.
  • [33] Y. Li, S. Gu, L. Van Gool, and R. Timofte (2019) Learning filter basis for convolutional neural network compression. In Proc. ICCV, pp. 5623–5632. Cited by: §4.2, Table 2.
  • [34] Y. Li, S. Lin, B. Zhang, J. Liu, D. Doermann, Y. Wu, F. Huang, and R. Ji (2019) Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proc. CVPR, Cited by: Table 5, Table 1.
  • [35] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proc. CVPRW, pp. 1132–1140. Cited by: item (iv), Table 2, §4.
  • [36] S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang, and D. Doermann (2019) Towards optimal structured cnn pruning via generative adversarial learning. In Proc. CVPR, pp. 2790–2799. Cited by: Table 5, Table 1.
  • [37] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In Proc. CVPR, pp. 806–814. Cited by: §2.
  • [38] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018-09) Progressive neural architecture search. In Proc. ECCV, Cited by: §2.
  • [39] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In Proc. ICLR, Cited by: §1, §2.
  • [40] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, T. K. Cheng, and J. Sun (2019) MetaPruning: meta learning for automatic neural network channel pruning. In Proc. ICCV, Cited by: Table 6, §1, §1, §1, §2, §3.1, Table 1.
  • [41] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In Proc. ICLR, Cited by: §1.
  • [42] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet V2: practical guidelines for efficient cnn architecture design. In Proc ECCV, pp. 116–131. Cited by: §0.B.6, Table 6, Appendix 0.C, item (iv).
  • [43] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001-07) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. ICCV, Vol. 2, pp. 416–423. Cited by: §4.
  • [44] B. Minnehan and A. Savakis (2019-06) Cascaded projection: end-to-end network compression and acceleration. In Proc. CVPR, Cited by: Table 5, Table 1.
  • [45] MIT Technology Review: 10 breakthrough technologies 2020. Note: https://www.technologyreview.com/lists/technologies/2020/#tiny-ai. Accessed: 2020-03-01. Cited by: §1.
  • [46] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz (2019) Importance estimation for neural network pruning. In Proc. CVPR, pp. 11264–11272. Cited by: §1.
  • [47] Z. Pan, Y. Liang, J. Zhang, X. Yi, Y. Yu, and Y. Zheng (2018) HyperST-net: hypernetworks for spatio-temporal forecasting. arXiv preprint arXiv:1809.10889. Cited by: §2.
  • [48] B. Peng, W. Tan, Z. Li, S. Zhang, D. Xie, and S. Pu (2018) Extreme network compression via filter group approximation. In Proc. ECCV, pp. 300–316. Cited by: §4.3, Table 3.
  • [49] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameters sharing. In Proc. ICML, pp. 4095–4104. Cited by: §2.
  • [50] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proc. AAAI, Vol. 33, pp. 4780–4789. Cited by: §2.
  • [51] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. MICCAI, pp. 234–241. Cited by: item (iv), Table 3, §4.
  • [52] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. CVPR, pp. 4510–4520. Cited by: Table 6, item (iv), §1, §4.
  • [53] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: item (iv), §4.
  • [54] S. Son, S. Nah, and K. Mu Lee (2018) Clustering convolutional kernels to compress deep neural networks. In Proc. ECCV, pp. 216–232. Cited by: §4.2, Table 2, Table 3.
  • [55] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
  • [56] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proc. CVPR, pp. 2820–2828. Cited by: Table 6, Appendix 0.C, item (iv).
  • [57] A. Torfi, R. A. Shirvani, S. Soleymani, and N. M. Nasrabadi (2019) GASL: guided attention for sparsity learning in deep neural networks. arXiv preprint arXiv:1901.01939. Cited by: §2.
  • [58] C. Wang, R. Grosse, S. Fidler, and G. Zhang (2019) EigenDamage: structured pruning in the kronecker-factored eigenbasis. In Proc. ICML, pp. 6566–6575. Cited by: §1.
  • [59] M. Wang, B. Liu, and H. Foroosh (2017) Factorized convolutional neural networks. In Proc. ICCVW, pp. 545–553. Cited by: §4.2, Table 2, Table 3.
  • [60] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Proc. NeurIPS, pp. 2074–2082. Cited by: §2.
  • [61] T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun (2018) Metaanchor: learning to detect objects with customized anchors. In Proc. NeurIPS, pp. 320–330. Cited by: §2.
  • [62] J. Ye, X. Lu, Z. Lin, and J. Z. Wang (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In Proc. ICLR, Cited by: §1.
  • [63] J. Yoon and S. J. Hwang (2017) Combined group and exclusive sparsity for deep neural networks. In Proc. ICML, pp. 3958–3966. Cited by: §2.
  • [64] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis (2018) NISP: pruning networks using neuron importance score propagation. In Proc. CVPR, pp. 9194–9203. Cited by: Table 5, Table 1.
  • [65] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730. Cited by: §4.
  • [66] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE TIP 26 (7), pp. 3142–3155. Cited by: item (iv), Table 3, §4.
  • [67] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian (2019) Variational convolutional neural network pruning. In Proc. CVPR, pp. 2780–2789. Cited by: Table 5, Table 1.
  • [68] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In Proc. ICLR, Cited by: §1, §1, §2.
  • [69] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018-06) Learning transferable architectures for scalable image recognition. In Proc. CVPR, Cited by: §1, §2.

Appendix 0.A Training Protocol

As explained in the main paper, the proposed DHP method does not rely on a pretrained model. Thus, all of the networks are trained and pruned from scratch. The hypernetworks are first initialized and used along with the proximal gradient to sparsify the latent vectors. When the difference between the target and the actual FLOPs compression ratio is below 2%, the pruning procedure stops. Then the pruned latent vectors as well as the pruned outputs of the hypernetworks are derived. After that, the outputs of the hypernetworks are used as the weight parameters of the backbone network and updated directly by the SGD or Adam algorithm; the hypernetworks are of course removed. After the pruning procedure, the training continues with the same protocol as for training the original uncompressed network. The number of pruning epochs is much smaller than that used for training the original network: usually, the pruning procedure runs for about 10 epochs, compared with the hundreds of epochs for training the uncompressed network. The rest of this section describes the training protocols of the different tasks.

0.a.1 Image Classification

0.a.1.1 Cifar10

We evaluate the performance of the compressed models on the CIFAR10 [26] dataset. The dataset contains 10 different classes. The training and testing subsets contain 50,000 and 10,000 images with resolution 32 × 32, respectively. As done in prior works [15, 22], we normalize all images using the channel-wise mean and standard deviation of the training set. Standard data augmentation is also applied. We train the networks for 300 epochs with the SGD optimizer and an initial learning rate of 0.1. The learning rate is decayed by 10 after 50% and 75% of the epochs. The momentum of SGD is 0.9. The weight decay factor is set to 0.0001. The batch size is 64.
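A sketch of this training configuration in PyTorch (the helper name is ours; the data pipeline and batch size of 64 are assumed to be set elsewhere):

```python
import torch

def make_cifar10_optim(model, epochs=300):
    """Optimizer and schedule matching the CIFAR10 protocol above: SGD with
    lr 0.1, momentum 0.9, weight decay 1e-4, and the learning rate divided by
    10 at 50% and 75% of the epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1)
    return optimizer, scheduler
```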

0.a.1.2 Tiny-ImageNet

For image classification, we also apply the pruning method on Tiny-ImageNet. It has 200 classes; each class has 500 training images and 50 validation images, and the resolution of the images is 64 × 64. The images are normalized with the channel-wise mean and standard deviation. Horizontal flipping is used to augment the dataset. The networks are trained for 220 epochs with SGD and an initial learning rate of 0.1. The learning rate is decayed by a factor of 10 at Epochs 200, 205, 210, and 215. The momentum of SGD is 0.9. The weight decay factor is set to 0.0001. The batch size is 64.

0.a.2 Super-Resolution

0.a.2.1 Training protocol

For image super-resolution, we train the networks on DIV2K dataset. It contains 800 training images, 100 validation images, and 100 test images. We use the 800 training images to train our network, and validate on the validation subset. Image patches are extracted from the training images. For EDSR, the patch size of the low-resolution input patch is while for SRResNet the patch size is . The batch size is 16. The networks are optimized with Adam optimizer. We use the default hyper-parameter for Adam optimizer. The weight decay factor is 0.0001. The networks are trained for 300 epochs. The learning rate starts from 0.0001 and decays by 10 after 200 epochs.

0.a.2.2 Simplified EDSR architecture

Note that, in order to speed up the training of EDSR, we adopt a simplified version of EDSR. The original EDSR contains 32 residual blocks and each convolutional layer in the residual blocks has 256 channels. Our simplified version has 8 residual blocks and each has two convolutional layers with 128 channels.

0.a.3 Denoising

For image denoising, we also train the networks on the DIV2K dataset, but all of the images are converted to gray images. As done for super-resolution, image patches are extracted from the training images. The batch size is 64 for DnCNN and 16 for UNet. Gaussian noise with noise level 70 is added on the fly to degrade the input patches. Again, the Adam optimizer is used to train the networks. The weight decay factor is 0.0001. The networks are trained for 60 epochs and each epoch contains 10,000 iterations, i.e. 600k iterations in total. The learning rate starts at 0.0001 and decays by 10 at Epoch 40.

Appendix 0.B Latent Vector Sharing Strategy for Different Networks

0.b.1 Basic criteria

Every convolutional layer, including standard convolution, depth-wise convolution, point-wise convolution, group convolution, and transposed convolution, is attached a latent vector. The latent vector controls the channels of the convolutional layer. Thus, by dealing only with the latent vector, we can control how the convolutional layers are pruned. Some complicated cases occur in modern network architectures, where the latent vectors have to be shared among different layers. Thus, during the development of the algorithm, we summarized some basic rules for the latent vectors. In the following, we first describe the general rules for latent vectors and then detail the specific rules for special network blocks.

  1. Every convolutional layer is attached a latent vector.

  2. The channels that the latent vector controls and the dimension of the latent vector vary with the type of convolutional layer.

    1. For standard convolution, point-wise convolution and transposed convolution, the latent vector controls the output channel of the layer and the dimension of the latent vector is the same as the number of output channels.

    2. For depth-wise convolution and group convolution, the latent vector controls the input channels per group and the dimension of the latent vector is the same as the number of input channels per group. That is, the latent vector of depth-wise convolution contains only one element.

  3. Since the output and input channels of consecutive layers are correlated, the latent vectors have to be shared among consecutive layers. That is, the hypernetwork has to receive the latent vectors of the previous layer and the current layer as input in order to make the input and output channels consistent with the previous and latter layers.

  4. Not every latent vector needs to be sparsified. The latent vectors free from sparsification are listed as follows (a code sketch of this rule is given after the list).

    1. The latent vector that controls the input channels of the first convolutional layer. This latent vector has the same dimension as the number of input image channels, e.g. 3 for RGB images and 1 for gray images. Of course, the input images do not need to be pruned.

    2. The latent vector that controls the output channel of the last convolutional layer of image classification network. This latent vector is related to the fully connected linear layers of image classifiers. Since we do not intend to prune the fully connected layer, the correlated latent vector is not pruned either.

    3. The latent vector attached to depth-wise convolution and group convolution. This latent vector controls the input channels per group. The input and output channels of depth-wise and group convolution are correlated. In order to compress depth-wise and group convolution, we prune the output channels. This has the same effect as reducing the number of groups, which is controlled by the latent vector of the previous layer, rather than the input channels per group, which is controlled by the latent vector of the current layer.
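The exclusion rules above can be summarized in a small predicate; the record fields below are hypothetical and only illustrate rules 4(a)-4(c).

```python
def is_sparsified(layer):
    """Decide whether a layer's latent vector receives the sparsity regularizer,
    following rules 4(a)-4(c) above. `layer` is a hypothetical dict with the
    fields used below; an illustrative paraphrase, not the authors' code."""
    if layer['is_first']:                        # input channels = image channels
        return False
    if layer['is_classifier_input']:             # feeds the fully connected classifier
        return False
    if layer['type'] in ('depthwise', 'group'):  # controls channels per group only
        return False
    return True
```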

0.b.2 Residual block

Residual networks, including ResNet, SRResNet, and EDSR, are constructed by stacking a number of residual blocks. Depending on the dimension of the feature maps, a residual network contains several stages with progressively decreasing feature map resolution and increasing number of feature maps. (Note that the feature map dimension of EDSR and SRResNet does not change across the residual blocks, so those networks have only one stage.) For the residual blocks within the same stage, the output channels are correlated due to the skip connections. In order to prune the second convolution of the residual blocks within the same stage, we assign a shared latent vector to them. Thus, by operating only on this shared latent vector, all of the second convolutions of the residual blocks can be pruned together. Please refer to Table 4 for the ablation study on the latent vector sharing and non-sharing strategies.
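A minimal sketch of this sharing rule, reusing the hypothetical SimpleHyperConv class from the sketch in Subsec. 0.B.1; the channel counts and block count are illustrative only.

```python
import torch
import torch.nn as nn

num_blocks = 3                               # illustrative number of residual blocks in one stage
z_stage = nn.Parameter(torch.ones(64))       # shared latent vector: stage / skip-connection channels
stage = []
for _ in range(num_blocks):
    z_mid = nn.Parameter(torch.ones(64))     # per-block latent vector for the first convolution
    conv1 = SimpleHyperConv(z_in=z_stage, z_out=z_mid)   # block input comes from the stage features
    conv2 = SimpleHyperConv(z_in=z_mid, z_out=z_stage)   # the second conv shares z_stage across blocks
    stage.append((conv1, conv2))
# Pruning zero elements of z_stage removes the corresponding output channels of every second
# convolution in the stage, keeping the skip connections dimensionally consistent.
```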

0.b.3 Dense block

Similar to residual networks, DenseNet also contains several stages with different feature map configurations. Different from residual networks, however, each dense block concatenates its input and output to form the final output of the block. As a result, each dense block receives as input the outputs of all of the previous dense blocks within the same stage. Thus, the hypernetwork of a dense block also has to receive the latent vectors of those preceding dense blocks as input.
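A sketch of the corresponding latent-vector wiring, again reusing the hypothetical SimpleHyperConv class from above. The growth rate and block count are illustrative, and in a real implementation the concatenation would be performed inside the hypernetwork's forward pass so that gradients keep flowing to every latent vector.

```python
import torch
import torch.nn as nn

z_input = nn.Parameter(torch.ones(16))                        # stage input channels (illustrative)
z_blocks = [nn.Parameter(torch.ones(12)) for _ in range(4)]   # growth rate 12, 4 blocks (illustrative)
convs = []
for l, z_out in enumerate(z_blocks):
    # The input latent vector of block l mirrors the channel concatenation of DenseNet:
    # the stage input plus the outputs of all preceding blocks in the same stage.
    z_in = torch.cat([z_input] + z_blocks[:l])
    convs.append(SimpleHyperConv(z_in=z_in, z_out=z_out))
```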

0.b.4 Inverted residual block

The inverted residual block is a special case of the residual block, so latent vectors are shared across different blocks in the same way as for normal residual blocks. Here we specifically address the sharing strategy within the block due to the presence of the depth-wise convolution. The inverted residual block has the architecture “point-wise conv + depth-wise conv + point-wise conv”. As explained earlier, the latent vector of the depth-wise convolution controls the input channels per group, i.e. 1 here. Thus, the latent vector of the first point-wise convolution controls not only its own output channels but also the input channels of the depth-wise convolution and of the second point-wise convolution. This latent vector therefore has to be passed to the hypernetworks of those convolutional layers.
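The intra-block wiring can be sketched as follows, with the same hypothetical SimpleHyperConv standing in for the per-layer hypernetworks. The channel numbers are illustrative, and the grouped-convolution mechanics of the depth-wise layer are omitted; only the latent-vector routing is shown.

```python
import torch
import torch.nn as nn

z_in  = nn.Parameter(torch.ones(32))    # block input channels (illustrative)
z_exp = nn.Parameter(torch.ones(192))   # expanded channels: output of the first point-wise conv
z_out = nn.Parameter(torch.ones(32))    # block output channels
z_dw  = nn.Parameter(torch.ones(1))     # depth-wise latent vector: one input channel per group, never sparsified

pw1 = SimpleHyperConv(z_in=z_in,  z_out=z_exp, kernel_size=1)
# Depth-wise weights have shape (192, 1, 3, 3): the output channels are governed by z_exp
# of the previous layer, the single input channel per group by z_dw (group handling omitted here).
dw  = SimpleHyperConv(z_in=z_dw,  z_out=z_exp, kernel_size=3)
pw2 = SimpleHyperConv(z_in=z_exp, z_out=z_out, kernel_size=1)
# Pruning z_exp therefore shrinks pw1's outputs, dw's outputs (i.e. its groups), and pw2's inputs at once.
```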

0.b.5 Upsampler of super-resolution networks

Image super-resolution networks append upsampler blocks at the tail of the network to increase the spatial resolution of the feature maps. For a scaling factor of 4, two upsamplers are attached, each doubling the spatial resolution. Each upsampler block contains a standard convolutional layer that increases the number of feature maps by a factor of 4 and a pixel shuffler that shuffles every 4 consecutive feature maps into the spatial dimension. Thus, the output channels of the convolutional layer in the upsampler are correlated with its input channels: if one input channel is pruned, the 4 corresponding consecutive output channels should also be pruned. To achieve this pruning behavior, a common latent vector is used for the input and output channels, and the vector is repeated in an interleaved fashion to form the one controlling the output channels.
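The interleaved repetition can be expressed as below, again with the hypothetical SimpleHyperConv. The feature width of 64 and the 2x pixel shuffle follow the description above; `repeat_interleave` is one straightforward way to realize the "four consecutive output channels per input channel" correspondence.

```python
import torch
import torch.nn as nn

z_feat = nn.Parameter(torch.ones(64))        # latent vector of the body features (illustrative width)
z_up = z_feat.repeat_interleave(4)           # [z1, z1, z1, z1, z2, z2, ...]: 4 consecutive copies per channel
conv_up = SimpleHyperConv(z_in=z_feat, z_out=z_up, kernel_size=3)   # 64 -> 256 channels
shuffle = nn.PixelShuffle(2)                 # 256 channels -> 64 channels at twice the resolution
# A pruned element of z_feat removes one input channel of conv_up together with its 4 consecutive
# output channels, so the pixel shuffler still receives complete groups of 4 feature maps.
```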

| Share | Regularizer | Target FLOPs Ratio (%) | Actual FLOPs Ratio (%) | Actual Parameter Ratio (%) | Top-1 Error (%) |
|-------|-------------|------------------------|------------------------|----------------------------|-----------------|
| Yes |  | 38 | 39.96 | 52.49 | 7.41 |
| No  |  | 38 | 39.75 | 55.43 | 7.32 |
| No  |  | 38 | 39.40 | 54.24 | 7.91 |
| Yes |  | 38 | 39.60 | 49.00 | 6.86 |
| No  |  | 38 | 39.30 | 58.35 | 7.98 |
| No  |  | 38 | 39.54 | 54.04 | 7.03 |
| Yes |  | 50 | 51.27 | 56.84 | 7.13 |
| No  |  | 50 | 51.44 | 65.47 | 6.85 |
| No  |  | 50 | 50.96 | 64.05 | 6.85 |
| Yes |  | 50 | 51.68 | 57.74 | 6.52 |
| No  |  | 50 | 50.23 | 62.83 | 7.11 |
| No  |  | 50 | 50.18 | 59.15 | 6.74 |
Table 4: Ablation study on ResNet56 for CIFAR10 image classification: exploring the latent vector sharing strategy among correlated convolutional layers. “Share” denotes whether the latent vector sharing strategy described in Subsec. 0.B.2 is adopted. When the latent vectors are not shared among the correlated layers, the group sparsity regularizer is enforced on them; otherwise, the normal sparsity regularizer is used. The regularization factor and the mask threshold are introduced in the main paper. The FLOPs ratio and parameter ratio denote the remaining quantities, respectively; the lower the better. As shown in the table, the latent vector sharing entries consistently outperform their non-sharing counterparts except for the third configuration at the 50% target FLOPs compression ratio. The inconsistency is largely due to the gap between the actual parameter compression ratios of the sharing and non-sharing strategies.
| Network | Baseline Top-1 Error (%) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%) |
|---------|--------------------------|--------------------|-----------------|-----------------|---------------------|
| ResNet-56 | 7.05 | Variational [67] | 7.74 | 79.70 | 79.51 |
| | | GAL-0.6 [36] | 6.62 | 63.40 | 88.20 |
| | | 56-pruned-B [30] | 6.94 | 62.40 | 86.30 |
| | | NISP [64] | 6.99 | 56.39 | 57.40 |
| | | DHP-50 (Ours) | 6.31 | 50.98 | 55.62 |
| | | CaP [44] | 6.78 | 50.20 | |
| | | ENC [25] | 7.00 | 50.00 | |
| | | AMC [17] | 8.10 | 50.00 | |
| | | KSE [34] | 6.77 | 48.00 | 45.27 |
| | | FPGM [16] | 6.74 | 47.70 | |
| | | GAL-0.8 [36] | 8.42 | 39.80 | 34.10 |
| | | DHP-38 (Ours) | 6.86 | 39.60 | 49.00 |
| ResNet-110 | 5.31 | Variational [67] | 7.04 | 63.56 | 58.73 |
| | | DHP-62 (Ours) | 5.36 | 62.50 | 59.28 |
| | | GAL-0.5 [36] | 7.26 | 51.50 | 55.20 |
| | | DHP-20 (Ours) | 6.93 | 21.77 | 21.30 |
| ResNet-164 | 4.97 | SSS [24] | 5.78 | 53.53 | |
| | | DHP-50 (Ours) | 5.29 | 51.86 | 44.10 |
| | | Variational [67] | 6.84 | 50.92 | 43.30 |
| | | DHP-20 (Ours) | 5.91 | 21.47 | 20.24 |
| DenseNet-12-40 | 5.26 | Variational [67] | 6.84 | 55.22 | 40.33 |
| | | DHP-38 (Ours) | 6.06 | 39.80 | 63.76 |
| | | DHP-28 (Ours) | 6.51 | 29.52 | 26.01 |
| | | GAL-0.1 [36] | 6.77 | 28.60 | 25.00 |
| | | DHP-24 (Ours) | 6.53 | 27.12 | 25.76 |
| | | DHP-20 (Ours) | 7.17 | 22.85 | 20.38 |
| VGG-16 | 6.34 | DHP-40 (Ours) | 7.61 | 40.11 | 35.61 |
| | | DHP-20 (Ours) | 8.60 | 21.65 | 17.18 |
Table 5: Results on CIFAR10 image classification networks. Entries are sorted w.r.t. the FLOPs ratio. The FLOPs ratio and parameter ratio denote the remaining quantities, respectively; the lower the better. ‘DHP-**’ denotes the target compression ratio set for our pruning method. A few more operating points are reported in addition to those in the table of the main paper.
| Network | Baseline Top-1 Error (%) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%) |
|---------|--------------------------|--------------------|-----------------|-----------------|---------------------|
| ResNet50 | 42.35 | DHP-40 (Ours) | 43.97 | 41.57 | 45.89 |
| | | MetaPruning [40] | 49.82 | 10.21 | 37.32 |
| | | DHP-9 (Ours) | 47.00 | 10.95 | 13.15 |
| MobileNetV1 | 51.87 | DHP-24-2 (Ours) | 50.47 | 98.95 | 79.65 |
| | | MobileNetV1-0.75 [19] | 53.18 | 57.74 | 57.75 |
| | | MetaPruning [40] | 54.66 | 56.49 | 88.06 |
| | | DHP-50 (Ours) | 51.96 | 51.37 | 47.76 |
| | | DHP-30 (Ours) | 54.47 | 31.79 | 26.28 |
| | | MobileNetV1-0.5 [19] | 55.44 | 26.99 | 27.00 |
| | | DHP-10 (Ours) | 55.71 | 11.80 | 9.18 |
| | | MobileNetV1-0.3 [19] | 60.54 | 10.46 | 10.63 |
| MobileNetV2 | 43.83 | DHP-24-2 (Ours) | 43.14 | 96.34 | 107.57 |
| | | MobileNetV2-0.75 [52] | 45.00 | 58.60 | 58.94 |
| | | DHP-50 (Ours) | 44.13 | 51.71 | 56.32 |
| | | DHP-30 (Ours) | 44.42 | 31.85 | 36.29 |
| | | MobileNetV2-0.5 [52] | 48.01 | 28.17 | 28.59 |
| | | DHP-10 (Ours) | 47.65 | 11.94 | 16.86 |
| | | MobileNetV2-0.3 [52] | 53.21 | 11.08 | 11.89 |
| | | MetaPruning [40] | 56.53 | 11.08 | 89.92 |
| MNasNet | 50.96 | DHP-28-2 (Ours) | 50.11 | 98.52 | 58.51 |
| | | MNasNet-0.75 [56] | 52.32 | 71.22 | 63.89 |
| | | DHP-69 (Ours) | 50.95 | 70.86 | 63.50 |
| | | MNasNet-0.5 [56] | 52.99 | 41.70 | 35.61 |
| | | DHP-39 (Ours) | 51.73 | 40.93 | 31.26 |
| | | MNasNet-0.3 [56] | 55.87 | 24.72 | 19.90 |
| | | DHP-22 (Ours) | 53.92 | 23.72 | 15.76 |
| ShuffleNetV2-1.5 | 43.71 | ShuffleNetV2-1.0 [42] | 46.35 | 48.29 | 54.35 |
| | | DHP-46 (Ours) | 45.13 | 47.34 | 48.10 |
Table 6: Results on Tiny-ImageNet image classification networks. Entries are sorted w.r.t. the FLOPs ratio. The FLOPs ratio and parameter ratio denote the remaining quantities, respectively; the lower the better. ‘DHP-**’ denotes the target compression ratio set for our pruning method. The operating points DHP-24-2 and DHP-28-2 are obtained by compressing the corresponding networks with width multiplier 2 and target compression ratios of 24% and 28%, respectively. In addition to the entries in the main paper, compression results on MNasNet [56] and ShuffleNetV2 [42] are also reported.

0.b.6 Channel shuffle in ShuffleNetV2

The specialty of ShuffleNetV2 [42] lies in the channel shuffle operation. In each inverted residual block of ShuffleNetV2, the input feature map is split into two branches. Different operations are applied to the two branches, after which a channel shuffle operation is conducted between the feature maps of the two branches for the purpose of information exchange. Due to this operation, the branches of all the inverted residual blocks within the same stage are inter-connected. Thus, if one channel in a branch ought to be pruned, the corresponding channel in the other branch should also be pruned. This is not a problem before the channel shuffle operation of the current inverted residual block. But after the channel shuffle operation, the two pruned channels appear in the same branch of the next inverted residual block while none of the channels in the other branch are pruned, which leads to imbalanced branches. Considering that the channels are shuffled again in the next inverted residual block and that the shuffle operation needs balanced branches, pruning ShuffleNetV2 is almost impossible for traditional network pruning methods. For the proposed DHP method, we simply assign a shared latent vector to all of the inter-connected branches of the inverted residual blocks within the same stage. By pruning this shared latent vector, all of the branches are compressed. Although the channel shuffle operation still complicates the situation and the pruned channels before and after the channel shuffle do not match exactly, the proposed DHP makes it possible to prune ShuffleNetV2. To the best of the authors’ knowledge, this is the first attempt to automatically prune ShuffleNetV2. As with the other networks, a layer-wise distinct configuration is found by the automatic pruning method.
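The sketch below illustrates the shared latent vector and the standard two-branch channel shuffle; the branch width is illustrative, and SimpleHyperConv is the hypothetical class from the sketch in Subsec. 0.B.1.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Standard ShuffleNetV2-style shuffle: interleave the channels of the two branches.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

z_branch = nn.Parameter(torch.ones(58))   # shared by both branches of every block in the stage (illustrative width)
# Every branch convolution in the stage receives z_branch as its output latent vector, e.g.:
conv_branch = SimpleHyperConv(z_in=z_branch, z_out=z_branch, kernel_size=1)
# Removing a zero element of z_branch thus prunes the matching channel in both branches of
# all blocks in the stage, keeping the shuffled branches balanced.
```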

Appendix 0.C More Results

The ablation study on latent vector sharing is shown in Table 4. As shown by the table, the latent vector sharing strategy consistently outperforms the non-sharing strategy, except for the case where the gap between the parameter compression ratios of the two strategies is relatively large. Based on this observation, we develop the latent vector sharing rules above for easier and better automatic network pruning.

More results on image classification networks are shown in Table 5 and Table 6. In addition to the results in the main paper, compression results on MNasNet [56] and ShuffleNetV2 [42] are also reported. Note that MobileNetV1, MobileNetV2, ShuffleNetV2, and MNasNet are already quite efficient networks designed by human experts or by automatic architecture search. The proposed DHP leads to more efficient versions of those networks than their counterparts with reduced width multipliers, which validates the importance of per-layer distinct configurations.