This is the official implementation of "DHP: Differentiable Meta Pruning via HyperNetworks".
Network pruning has been the driving force for the efficient inference of neural networks and the alleviation of model storage and transmission burden. Traditional network pruning methods focus on the per-filter influence on the network accuracy by analyzing the filter distribution. With the advent of AutoML and neural architecture search (NAS), pruning has become topical with automatic mechanisms and search-based architecture optimization. However, current automatic designs rely on either reinforcement learning or evolutionary algorithms, which often do not have a theoretical convergence guarantee or do not converge within a meaningful time limit. In this paper, we propose a differentiable pruning method via hypernetworks for automatic network pruning and layer-wise configuration optimization. A hypernetwork is designed to generate the weights of the backbone network. The inputs of the hypernetwork, namely the latent vectors, control the output channels of the layers of the backbone network. By applying ℓ_1 sparsity regularization to the latent vectors and utilizing proximal gradient, sparse latent vectors can be obtained and their zero elements removed. Thus, the corresponding elements of the hypernetwork outputs can also be removed, achieving the effect of network pruning. The latent vectors of all the layers are pruned together, resulting in an automatic layer configuration. Extensive experiments are conducted on various networks for image classification, single image super-resolution, and denoising. The experimental results validate the proposed method.
These days, network pruning has become the workhorse for network compression, which aims at lightweight and efficient models for fast inference [12, 18, 17, 41, 40, 32]. This is of particular importance for the deployment of tiny artificial intelligence (Tiny AI) algorithms on smartphones and edge devices [45]. Since the emergence of network pruning, a number of methods have been proposed based on the analysis of gradients, Hessians, or filter distribution [28, 13, 9, 46, 58, 30, 62, 16]. With the advent of AutoML and neural architecture search (NAS) [68, 7], a new trend of network compression and pruning has emerged, i.e. pruning with automatic algorithms and targeting distinguishing mini-architectures (e.g. layers or building blocks). Among them, reinforcement learning and evolutionary algorithms have become the natural choice [17, 40]. The core idea is to search for a fine-grained, layer-wise differentiated configuration among all of the possible choices (the population, in the terminology of evolutionary algorithms). After the searching stage, the most promising candidate that optimizes the network capacity under constrained budgets is chosen.
The advantage of these automatic pruning methods is the final layer-wise distinguishing configuration. Thus, hand-crafted design is no longer necessary. However, the main concern of these algorithms is the convergence property. For example, reinforcement learning is notorious for its difficulty of convergence under a large or even moderate number of states [55]. An evolutionary algorithm needs to choose the best candidate from candidates that have already converged. But the dilemma lies in the impossibility of training the whole population until convergence and the difficulty of choosing the best candidate from an unconverged population [40, 14]. A promising solution to this problem is endowing the searching mechanism with differentiability or directly resorting to an approximately differentiable algorithm. This is due to the fact that differentiability guarantees theoretical convergence and has the potential to make the searching stage efficient. Actually, differentiability has facilitated a number of machine learning approaches, and the typical one among them is NAS. Early works on NAS have an insatiable demand for computing resources, consuming tens of thousands of GPU hours for a satisfactory convergence [68, 69]. The incorporation of differentiable architecture search (DARTS) reduces this consumption to tens of GPU hours, which has boosted the development of NAS during the past year [39].
Another noteworthy direction for automatic pruning is brought by MetaPruning [40], which introduces hypernetworks [11] into network compression. The output of the so-called hypernetwork is used as the parameters of the backbone network. During training, the gradients are also back-propagated to the hypernetworks. This method falls in the paradigm of meta learning since the parameters in the hypernetwork act as the meta-data of the parameters in the backbone network. But the problem of this method is that the hypernetworks can only output fixed-size weights, which cannot serve as a layer-wise configuration searching mechanism. Thus, a searching algorithm such as an evolutionary algorithm is necessary for the discovery of a good candidate. Although this is quite a natural choice, there is still one interesting question, namely, whether one can design a hypernetwork whose output size depends on the input (termed the latent vector in this paper) so that by only dealing with the latent vector, the backbone network can be automatically pruned.
To solve the aforementioned problem, we propose the differentiable meta pruning approach via hypernetworks, DHP (D: Differentiable, H: Hyper, P: Pruning). A new design of hypernetwork is proposed to adapt to the requirements of differentiability. Each layer is endowed with a latent vector that controls the output channels of this layer. The hypernetwork takes as input the latent vectors of the current layer and the previous layer, which control the output and input channels of the current layer, respectively. By forward passing the latent vectors through the hypernetwork, the derived output is used as the parameters of the backbone network. To achieve the effect of automatic pruning, a sparsity regularizer is applied to the latent vectors. A pruned model is discovered by updating the latent vectors with proximal gradient. The searching stage stops when the compression ratio drops to the target level. After the searching stage, the latent vectors become sparse, with some elements approaching zero that can be removed. Accordingly, the output of the hypernetwork, which is covariant with the latent vector, is also compressed. Thus, the advantage of the proposed method is that operating only on the latent vectors makes automatic network pruning easier without other bells and whistles.
With the fast development of efficient network design and NAS, the usefulness of network pruning is frequently challenged. However, by analyzing the pruning performance on MobileNetV1 [19] and MobileNetV2 [52] in Fig. 1, we conclude that automatic network pruning is of vital importance for further exploring the capacity of efficient networks. Efficient network design and NAS can only result in an overall architecture with building blocks endowed with the same mini-architecture. By automatic network pruning, the efficient networks obtained by either human experts or NAS can be further compressed, leading to block-wise differentiated configurations, which can be seen as a fine-grained architecture search. We conjecture that this per-layer distinguishing configuration might help to discover the potential capacity of efficient networks without losing too much accuracy of the original network.
Thus, the contribution of this paper is as follows.
A new architecture of hypernetwork is designed. Different from the classical hypernetwork composed of linear layers, the new design is tailored to automatic network pruning. The input latent vector of the hypernetwork controls the parameters of the layers of the backbone network. By only operating on the latent vector, the backbone network can be pruned.
A differentiable automatic network pruning method is proposed based on the newly designed hypernetwork. Different from the existing methods based on reinforcement learning or evolutionary algorithms, the proposed method has a theoretical convergence guarantee based on the proximal gradient algorithm.
The potential of automatic network pruning as fine-grained architecture search is revealed by comparing the compression results from DHP with those from efficient networks.
The proposed differentiable automatic pruning method is not limited to a specific network architecture or layer type. A wide range of networks are compressed, including VGG [53], ResNet [15], DenseNet [22], MobileNetV1 [19], MobileNetV2 [52], ShuffleNetV2 [42], MNasNet [56], DnCNN [66], UNet [51], SRResNet [29], and EDSR [35], which contain various layer and block types including standard convolution, depth-wise convolution, $1 \times 1$ convolution, and transposed convolution, combined with bottleneck blocks, skip connections, and densely connected blocks.
Extensive experiments are done on both high-level and low-level vision tasks including image classification, single image super-resolution, and denoising. The experimental results show that the proposed method sets the new state-of-the-art in automatic network pruning.
Network pruning. Aiming at removing the weak filter connections that have the least influence on the accuracy of the network, network pruning has attracted increasing attention recently. Early attempts put more emphasis on storage consumption, and various criteria have been explored to remove inconsequential connections in an unstructured manner [12, 37]. Despite its success in reducing network parameters, unstructured pruning leads to irregular weight parameters, which limits the actual acceleration of the pruned network. To further address the efficiency issue, structured pruning methods directly zero out structured groups of the convolutional filters. For example, Wen et al. [60] and Alvarez et al. [2] first proposed to resort to group sparsity regularization during training to reduce the number of feature maps in each layer. Since then, the field has witnessed a variety of regularization strategies [3, 63, 31, 57, 32]. These elaborately designed regularization methods considerably advance the pruning performance. But they often rely on carefully adjusted hyper-parameters selected for a specific network architecture and dataset.
AutoML. Recently, there is an emerging trend of exploiting the idea of AutoML for automatic network compression. The rationale lies in the exploration among the total population of network configurations for a final best candidate. He et al. exploited reinforcement learning agents to prune the networks so that hand-crafted design is no longer necessary. Hayashi et al. utilized a genetic algorithm to enumerate candidates in the designed hypergraph for tensor network decomposition [14]. Liu et al. trained a hypernetwork to generate the weights of the backbone network and used an evolutionary algorithm to search for the best candidate. The problem of these approaches is that the searching algorithm is not differentiable, which does not result in guaranteed convergence.
NAS. NAS automates the manual task of neural network architecture design. Optimally, searched networks achieve smaller test error, require fewer parameters, and need less computation than their manually designed counterparts [68, 50]. But the main drawback of the early strategies is their almost insatiable demand for computational resources. To alleviate the computational burden, several methods [69, 38, 39] proposed to search for a basic building block, i.e. a cell, as opposed to an entire network. Then, stacking multiple cells with equivalent structure but different weights defines a full network [49, 5]. Another recent trend in NAS is differentiable search methods such as DARTS [39]. The differentiability allows the fast convergence of the searching algorithm and has thus boosted the development of NAS during the past year. In this paper we propose a differentiable counterpart for automatic network pruning.
Meta learning and hypernetworks. Meta learning is a broad family of machine learning techniques that deal with the problem of learning to learn. Recent works have witnessed its application to various vision tasks including object detection [61], instance segmentation [20], and super-resolution [21]. An emerging trend of meta learning uses hypernetworks to predict the weight parameters in the backbone network [11]. Since their introduction, hypernetworks have found wide applications in NAS [5], multi-task learning [47], Bayesian neural networks [27], and also network pruning [40]. In this paper, we propose a new design of hypernetwork which is especially suitable for network pruning and makes differentiability possible for automatic network pruning.
The pipeline of the proposed method is shown in Fig. 2. The two cores of the whole pipeline are the designed hypernetwork and the optimization algorithm, i.e. proximal gradient. In the forward pass, the designed hypernetwork takes as input the latent vectors and predicts the weight parameters for the backbone network. The gradients are back-propagated to the hypernetwork in the backward pass. The sparsity regularizer is enforced on the latent vectors and proximal gradient is used to solve the problem. The dimension of the output of the hypernetwork is covariant with that of the input. Due to this property, the output weights are pruned along with the sparse latent vectors after the optimization step. The differentiability comes with the covariance property of the hypernetworks, the sparsity enforced on the latent vectors, and the proximal gradient used to solve the problem. The automation of pruning is due to the fact that all of the latent vectors are regularized without discrimination and that proximal gradient automatically discovers the potentially less important elements.
We first introduce the design of the hypernetwork shown in Fig. 3. In summary, the hypernetwork consists of three layers. The latent layer takes as input the latent vectors and computes a latent matrix from them. The embedding layer projects the elements of the latent matrix to an embedding space. The last explicit layer converts the embedded vectors to the final output. This design is inspired by fully connected layers as in [11, 40] but differs from those designs in that the output dimension is covariant with the input latent vector. This design is applicable to all types of convolutions including the standard convolution, depth-wise convolution, point-wise or $1 \times 1$ convolution, and transposed convolution. For simplicity of reference, the term convolution is used to denote any of them. Unless otherwise stated, we use normal ($x$), lowercase bold ($\mathbf{x}$), and capital bold ($\mathbf{X}$) letters to denote scalars, vectors, and matrices or high-dimensional tensors, respectively. The elements of a tensor are indexed by subscripts, as in $\mathbf{X}_{i,j}$, which could be scalars or vectors depending on the dimension of the indexed tensor.
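Suppose that the given network is an $L$-layer convolutional neural network (CNN) with layers indexed by $l$. The dimension of the weight parameter of a convolutional layer is $n \times c \times w \times h$, where $n$, $c$, and $w \times h$ denote the output channel, input channel, and kernel size of the convolutional layer, respectively. Every convolutional layer is endowed with a latent vector $\mathbf{z}^l$ with the same size as the output channel, namely, $\mathbf{z}^l \in \mathbb{R}^n$. Thus, the layer previous to the current one is given a latent vector $\mathbf{z}^{l-1} \in \mathbb{R}^c$. The hypernetwork receives the latent vectors $\mathbf{z}^l$ and $\mathbf{z}^{l-1}$ of the current and the previous layer as input. A latent matrix is first derived from the two latent vectors, namely,
$$\mathbf{Z}^l = \mathbf{z}^l \cdot {\mathbf{z}^{l-1}}^T + \mathbf{B}_0^l, \tag{1}$$
where $T$ and $\cdot$ denote matrix transpose and multiplication, and $\mathbf{Z}^l, \mathbf{B}_0^l \in \mathbb{R}^{n \times c}$. Then every element in the latent matrix is projected to an $m$-dimensional embedding space with the vectors $\mathbf{w}_1^l$ and $\mathbf{b}_1^l$, namely,
$$\mathbf{E}_{i,j}^l = \mathbf{Z}_{i,j}^l \mathbf{w}_1^l + \mathbf{b}_1^l, \quad i = 1, \dots, n, \; j = 1, \dots, c, \tag{2}$$
where $\mathbf{E}_{i,j}^l, \mathbf{w}_1^l, \mathbf{b}_1^l \in \mathbb{R}^m$. The vectors $\mathbf{w}_1^l$ and $\mathbf{b}_1^l$ are element-wise unique and, for the simplicity of notation, the subscript $i,j$ is omitted. We denote the ensemble of the element-wise embedding operation with the following high-dimensional tensor operation
$$\mathbf{E}^l = \mathcal{U}^3(\mathbf{Z}^l) \circ \mathbf{W}_1^l + \mathbf{B}_1^l, \tag{3}$$
where $\mathbf{W}_1^l, \mathbf{B}_1^l, \mathbf{E}^l \in \mathbb{R}^{n \times c \times m}$, $\mathbf{w}_1^l$ and $\mathbf{b}_1^l$ are the $(i,j)$-th slices of $\mathbf{W}_1^l$ and $\mathbf{B}_1^l$ along their first two dimensions, $\circ$ denotes the broadcastable element-wise tensor multiplication, and $\mathcal{U}^3(\cdot)$ inserts a third dimension for $\mathbf{Z}^l$. Note that after the operation in Eqn. 3, all of the elements of $\mathbf{Z}^l$ are converted to embedded vectors in the embedding space. The final step is to obtain the output that can be explicitly used as the weights of the convolutional layer. To achieve that, every embedded vector $\mathbf{E}_{i,j}^l$ is multiplied by an explicit matrix, that is,
$$\mathbf{O}_{i,j}^l = \mathbf{w}_2^l \cdot \mathbf{E}_{i,j}^l + \mathbf{b}_2^l, \quad i = 1, \dots, n, \; j = 1, \dots, c, \tag{4}$$
where $\mathbf{O}_{i,j}^l, \mathbf{b}_2^l \in \mathbb{R}^{wh}$ and $\mathbf{w}_2^l \in \mathbb{R}^{wh \times m}$. The operation above can be easily rewritten as a batched matrix multiplication, following the convention of tensor operations,
$$\mathbf{O}^l = \mathbf{W}_2^l \otimes \mathbf{E}^l + \mathbf{B}_2^l, \tag{5}$$
where $\mathbf{W}_2^l \in \mathbb{R}^{n \times c \times wh \times m}$, $\mathbf{O}^l, \mathbf{B}_2^l \in \mathbb{R}^{n \times c \times wh}$, and $\otimes$ denotes batched matrix multiplication. Again note that in Eqn. 4 $\mathbf{w}_2^l$ and $\mathbf{b}_2^l$ are unique for every embedded vector and denote the slices from $\mathbf{W}_2^l$ and $\mathbf{B}_2^l$.
A simplified representation of the output of the hypernetwork is given by
$$\mathbf{O}^l = h(\mathbf{z}^l, \mathbf{z}^{l-1}; \mathbf{W}^l, \mathbf{B}^l), \tag{6}$$
where $h(\cdot)$ denotes the functionality of the hypernetwork, and $\mathbf{W}^l$ and $\mathbf{B}^l$ collect its weights and biases. The final output $\mathbf{O}^l$ can be reshaped to form the weight parameter of the $l$-th layer. The output is said to be covariant with the input latent vectors in that its first two dimensions are the same as those of the two latent vectors. By imposing sparsity regularization on the latent vectors, the corresponding elements in the output can also be removed, thus achieving the effect of network pruning.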
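The three layers in Eqns. 1 to 5 can be sketched with NumPy as below. This is only an illustrative sketch, not the official implementation: the function name `hypernetwork_forward`, the argument names, the random initialization, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def hypernetwork_forward(z_cur, z_prev, W1, B1, W2, B2, B0=None):
    """Map two latent vectors to conv weights of shape (n, c, w*h)."""
    # Eqn. 1: latent matrix as the outer product of the two latent vectors.
    Z = np.outer(z_cur, z_prev)                      # (n, c)
    if B0 is not None:
        Z = Z + B0
    # Eqn. 3: insert a third axis and embed every element with its own vector.
    E = Z[:, :, None] * W1 + B1                      # (n, c, m)
    # Eqn. 5: batched matrix-vector product with per-element explicit matrices.
    return np.einsum("ncpm,ncm->ncp", W2, E) + B2    # (n, c, w*h)

# Toy sizes (hypothetical): output channels, input channels, embedding dim, kernel size.
n, c, m, k = 4, 3, 8, 3
rng = np.random.default_rng(0)
z_cur, z_prev = rng.standard_normal(n), rng.standard_normal(c)
W1, B1 = rng.standard_normal((n, c, m)), np.zeros((n, c, m))
W2, B2 = rng.standard_normal((n, c, k * k, m)), np.zeros((n, c, k * k))

O = hypernetwork_forward(z_cur, z_prev, W1, B1, W2, B2)
weight = O.reshape(n, c, k, k)   # weight tensor of the backbone convolution

# Covariance: dropping one latent element drops the matching output channel.
O_small = hypernetwork_forward(z_cur[:-1], z_prev, W1[:-1], B1[:-1], W2[:-1], B2[:-1])
```

Because every output slice depends only on its own pair of latent elements, shrinking a latent vector (and the matching slices of the hypernetwork parameters) shrinks the generated weight along the same axis.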
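For the initialization of the parameters, all biases are initialized as zero, the latent vectors with the standard normal distribution, and $\mathbf{W}_1^l$ with Xavier uniform [10]. The weight of the explicit layer $\mathbf{W}_2^l$ is initialized with Hyperfan-in, which guarantees stable backbone network weights and fast convergence [6].
The core of approximate differentiability comes with not only the specifically designed hypernetwork but also the mechanism used to search for the potential candidate. To achieve that, we enforce sparsity constraints on the latent vectors. Thus, the loss function of the aforementioned $L$-layer CNN is denoted as
$$\min_{\mathbf{W}, \mathbf{B}, \mathbf{z}} \mathcal{L}\big(\mathbf{y}, f(\mathbf{x}; h(\mathbf{z}; \mathbf{W}, \mathbf{B}))\big) + \gamma \mathcal{D}(\mathbf{W}) + \gamma \mathcal{D}(\mathbf{B}) + \lambda \mathcal{R}(\mathbf{z}), \tag{7}$$
where $\mathcal{L}(\cdot, \cdot)$, $\mathcal{D}(\cdot)$, and $\mathcal{R}(\cdot)$ are the general loss function of the CNN, the weight decay term, and the sparsity regularization term, respectively, $f$ is the backbone network applied to input $\mathbf{x}$ with target $\mathbf{y}$, and $\gamma$ and $\lambda$ are the weight decay and sparsity regularization factors. For the simplicity of notation, the superscript $l$ is omitted. The sparsity regularization takes the form of the $\ell_1$ norm, namely,
$$\mathcal{R}(\mathbf{z}) = \sum_{l=1}^{L} \|\mathbf{z}^l\|_1. \tag{8}$$
To solve the problem in Eqn. 7, the weights and biases of the hypernetwork are updated with SGD. Note that the gradients are back-propagated from the backbone network to the hypernetwork. Thus, neither the forward pass nor the backward pass challenges the information flow between the backbone network and the hypernetwork. As for the latent vectors, they are updated with the proximal gradient algorithm, that is,
$$\mathbf{z}[k+1] = \mathbf{prox}_{\lambda \mu \mathcal{R}}\big(\mathbf{z}[k] - \mu \nabla \mathcal{L}(\mathbf{z}[k])\big), \tag{9}$$
where $\mu$ is the step size of the proximal gradient method, which is set as the learning rate of the SGD updates. As can be seen in the equation, the proximal gradient update contains a gradient descent step and a proximal operation step. When the regularizer has the form of the $\ell_1$ norm, the proximal operator has a closed-form solution, i.e.
$$\mathbf{z}[k+1] = \operatorname{sgn}(\mathbf{z}[k+\Delta]) \big[\, |\mathbf{z}[k+\Delta]| - \lambda \mu \,\big]_+, \tag{10}$$
where $\mathbf{z}[k+\Delta]$ is the intermediate SGD update, and the sign operator $\operatorname{sgn}(\cdot)$, the thresholding operator $[\cdot]_+$, and the absolute value operator $|\cdot|$ act element-wise on the vector. Eqn. 10 is the well-known soft-thresholding function.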
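The two-step update in Eqns. 9 and 10 (a gradient step followed by soft-thresholding) can be illustrated with a short NumPy sketch. The function names, the argument names `lr` and `lam` (standing for the step size and the sparsity regularization factor), and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Closed-form proximal operator of the l1 norm (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def proximal_step(z, grad, lr, lam):
    """One proximal gradient update of a latent vector."""
    z_mid = z - lr * grad                    # intermediate SGD update
    return soft_threshold(z_mid, lr * lam)   # shrink every element towards zero

z = np.array([0.50, -0.03, 0.02, -1.20])
grad = np.zeros_like(z)                      # zero gradient isolates the shrinkage effect
z_next = proximal_step(z, grad, lr=0.1, lam=0.5)   # threshold is 0.1 * 0.5 = 0.05
```

Elements whose magnitude is below the threshold become exactly zero, while the remaining elements shrink by the threshold; this is what produces sparse latent vectors during the searching stage.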
In practice, the latent vectors first get SGD updates along with the other parameters, after which the proximal operator is applied. Because of the SGD updates and the fact that the proximal operator has a closed-form solution, we regard the whole solution as approximately differentiable (although the $\ell_1$ norm is not differentiable at 0), which guarantees the fast convergence of the algorithm compared with reinforcement learning or evolutionary algorithms. Actually, the speed-up of proximal gradient lies in the fact that, instead of searching for the best candidate among the total population, it forces the solution towards the best sparse one.
The automation of pruning follows from the way the sparsity is applied in Eqn. 8 and from the proximal gradient solution. First of all, all latent vectors are regularized together without any distinction between them. During the optimization, information and gradients flow smoothly between the backbone network and the hypernetwork. The proximal gradient algorithm forces the potentially less important elements of the latent vectors to approach zero quicker than the others without any human effort or interference in this process. The optimization stops immediately when the target compression ratio is reached. In total, there are only two additional hyper-parameters in the algorithm, i.e. the sparsity regularization factor $\lambda$ and the mask threshold $\tau$ in Subsec. 3.3. Thus, running the algorithm is just like pressing a button, which enables the application of the algorithm to all kinds of CNNs without much intervention from domain experts.
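Different from the fully connected layers, the proposed design of hypernetwork can adapt the dimension of the output according to that of the latent vectors. After the searching stage, sparse versions of the latent vectors are derived as $\hat{\mathbf{z}}^l$ and $\hat{\mathbf{z}}^{l-1}$. For those vectors, some of their elements are zero or approaching zero. Thus, 1-0 masks can be derived by comparing the sparse latent vectors with a predefined small threshold $\tau$, that is,
$$\mathbf{m}^l = \mathcal{T}(\hat{\mathbf{z}}^l, \tau), \quad \mathbf{m}^{l-1} = \mathcal{T}(\hat{\mathbf{z}}^{l-1}, \tau), \tag{11}$$
where the function $\mathcal{T}(\cdot)$ element-wise compares the latent vector with the threshold and returns 1 if the element is not smaller than $\tau$ and 0 otherwise. By applying the masks to the latent vectors and analyzing the three layers together, we can get a direct impression of how the backbone layers are automatically pruned. That is,
$$\hat{\mathbf{O}}^l = \mathbf{W}_2^l \otimes \Big[ \mathcal{U}^3\big( (\mathbf{m}^l \circ \hat{\mathbf{z}}^l) \cdot (\mathbf{m}^{l-1} \circ \hat{\mathbf{z}}^{l-1})^T \big) \circ \mathbf{W}_1^l \Big] \tag{12}$$
$$\phantom{\hat{\mathbf{O}}^l} = \mathcal{U}^3\big( \mathbf{m}^l \cdot {\mathbf{m}^{l-1}}^T \big) \circ \Big[ \mathbf{W}_2^l \otimes \big( \mathcal{U}^3( \hat{\mathbf{z}}^l \cdot {\hat{\mathbf{z}}^{l-1}}^T ) \circ \mathbf{W}_1^l \big) \Big]. \tag{13}$$
The equality follows from the broadcastability of the operations $\circ$ and $\otimes$. As shown in the above equations, applying the masks to the latent vectors has the same effect as applying them to the final output. Note that in the above analysis the bias terms $\mathbf{B}_0^l$, $\mathbf{B}_1^l$, and $\mathbf{B}_2^l$ are omitted since they have a really small influence on the output of the hypernetwork. In any case, the same mask matrix is applied to the biases. In conclusion, the final output can be pruned according to the same criterion as for the latent vectors.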
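The equality between Eqn. 12 and Eqn. 13 can be checked numerically with a small NumPy sketch. The sizes, the random parameters, and the threshold value are hypothetical, the biases are dropped as in the analysis above, and comparing magnitudes (`np.abs`) against the threshold is an assumption made here so that negative latent elements are handled sensibly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, m, p = 5, 4, 6, 9              # toy sizes; p stands for w*h
z_cur, z_prev = rng.standard_normal(n), rng.standard_normal(c)
W1 = rng.standard_normal((n, c, m))
W2 = rng.standard_normal((n, c, p, m))
tau = 0.5                            # hypothetical mask threshold

def hyper_no_bias(z_a, z_b):
    """Hypernetwork output with all bias terms omitted."""
    E = np.outer(z_a, z_b)[:, :, None] * W1
    return np.einsum("ncpm,ncm->ncp", W2, E)

# 1-0 masks; thresholding the magnitude is an assumption of this sketch.
m_cur = (np.abs(z_cur) >= tau).astype(float)
m_prev = (np.abs(z_prev) >= tau).astype(float)

lhs = hyper_no_bias(m_cur * z_cur, m_prev * z_prev)                     # mask the latent vectors
rhs = np.outer(m_cur, m_prev)[:, :, None] * hyper_no_bias(z_cur, z_prev)  # mask the output

# Compact pruned weight: keep only the surviving output and input channels.
pruned = rhs[m_cur.astype(bool)][:, m_prev.astype(bool)]
```

Since `lhs` and `rhs` agree, the masked rows and columns of the generated weight can simply be deleted, which is the step that turns sparse latent vectors into a smaller layer.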
Table 1: Image classification results.
Network | Baseline Top-1 Error (%) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%)
CIFAR10
ResNet-56 | 7.05 | Variational [67] | 7.74 | 79.70 | 79.51
ResNet-56 | 7.05 | GAL-0.6 [36] | 6.62 | 63.40 | 88.20
ResNet-56 | 7.05 | 56-pruned-B [30] | 6.94 | 62.40 | 86.30
ResNet-56 | 7.05 | NISP [64] | 6.99 | 56.39 | 57.40
ResNet-56 | 7.05 | DHP-50 (Ours) | 6.31 | 50.98 | 55.62
ResNet-56 | 7.05 | CaP [44] | 6.78 | 50.20 | –
ResNet-56 | 7.05 | ENC [25] | 7.00 | 50.00 | –
ResNet-56 | 7.05 | AMC [17] | 8.10 | 50.00 | –
ResNet-56 | 7.05 | KSE [34] | 6.77 | 48.00 | 45.27
ResNet-56 | 7.05 | FPGM [16] | 6.74 | 47.70 | –
ResNet-56 | 7.05 | GAL-0.8 [36] | 8.42 | 39.80 | 34.10
ResNet-56 | 7.05 | DHP-38 (Ours) | 6.86 | 39.60 | 49.00
ResNet-110 | 5.31 | Variational [67] | 7.04 | 63.56 | 58.73
ResNet-110 | 5.31 | DHP-62 (Ours) | 5.36 | 62.50 | 59.28
ResNet-110 | 5.31 | GAL-0.5 [36] | 7.26 | 51.50 | 55.20
ResNet-110 | 5.31 | DHP-20 (Ours) | 6.93 | 21.77 | 21.30
ResNet-164 | 4.97 | SSS [24] | 5.78 | 53.53 | –
ResNet-164 | 4.97 | DHP-50 (Ours) | 5.29 | 51.86 | 44.10
ResNet-164 | 4.97 | Variational [67] | 6.84 | 50.92 | 43.30
ResNet-164 | 4.97 | DHP-20 (Ours) | 5.91 | 21.47 | 20.24
DenseNet-12-40 | 5.26 | Variational [67] | 6.84 | 55.22 | 40.33
DenseNet-12-40 | 5.26 | DHP-38 (Ours) | 6.06 | 39.80 | 63.76
DenseNet-12-40 | 5.26 | DHP-28 (Ours) | 6.51 | 29.52 | 26.01
DenseNet-12-40 | 5.26 | GAL-0.1 [36] | 6.77 | 28.60 | 25.00
VGG-16 | 6.34 | DHP-40 (Ours) | 7.61 | 40.11 | 35.61
VGG-16 | 6.34 | DHP-20 (Ours) | 8.60 | 21.65 | 17.18
Tiny-ImageNet
ResNet50 | 42.35 | DHP-40 (Ours) | 43.97 | 41.57 | 45.89
ResNet50 | 42.35 | MetaPruning [40] | 49.82 | 10.21 | 37.32
ResNet50 | 42.35 | DHP-9 (Ours) | 47.00 | 10.95 | 13.15
MobileNetV1 | 51.87 | DHP-24-2 (Ours) | 50.47 | 98.95 | 79.65
MobileNetV1 | 51.87 | MobileNetV1-0.75 | 53.18 | 57.74 | 57.75
MobileNetV1 | 51.87 | MetaPruning [40] | 54.66 | 56.49 | 88.06
MobileNetV1 | 51.87 | DHP-50 (Ours) | 51.96 | 51.37 | 47.76
MobileNetV2 | 43.83 | DHP-24-2 (Ours) | 43.14 | 96.34 | 107.57
MobileNetV2 | 43.83 | MobileNetV2-0.3 | 53.21 | 11.08 | 11.89
MobileNetV2 | 43.83 | MetaPruning [40] | 56.53 | 11.08 | 89.92
MobileNetV2 | 43.83 | DHP-10 (Ours) | 47.65 | 11.94 | 16.86
Due to the existence of skip connections in residual networks such as ResNet, MobileNetV2, SRResNet, and EDSR, the residual blocks are interconnected with each other in such a way that their input and output dimensions are related. Therefore, the skip connections are notoriously tricky to deal with. But back to the design of the proposed hypernetwork, a quite simple and straightforward solution is to let the hypernetworks of the correlated layers share the same latent vector. Thus, by automatically pruning the single latent vector, all of the relevant layers are pruned together. Actually, we first tried to use a unique latent vector for each of the correlated layers and applied group sparsity to them. But the experimental results showed that this is not a good choice because it yielded lower accuracy than sharing the latent vectors (see details in the Supplementary).
To validate the effectiveness of DHP, extensive experiments have been conducted on various network architectures for different computer vision tasks, including VGG [53], ResNet [15], and DenseNet [22] for CIFAR10 [26] image classification, ResNet50, MobileNetV1 [19], and MobileNetV2 [52] for Tiny-ImageNet [8] image classification, SRResNet [29] and EDSR [35] for single image super-resolution, and DnCNN [66] and UNet [51] for gray image denoising. For super-resolution, the networks are trained on the DIV2K [1] dataset and tested on Set5 [4], Set14 [65], B100 [43], Urban100 [23], and the DIV2K validation set. For image denoising, the networks are trained on the gray version of the DIV2K dataset and tested on BSD68 and the DIV2K validation set.
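The effect of sharing one latent vector across layers tied by skip connections can be sketched in plain Python. The layer names, latent values, and threshold below are hypothetical, and the magnitude-based mask is an assumption of this sketch; the point is only that every layer referencing the shared vector ends up with the same channel count.

```python
TAU = 0.1  # hypothetical mask threshold

# Hypothetical sparse latent vectors after the searching stage.
latents = {
    "stage": [0.9, 0.0, -0.4, 0.02],   # shared by all layers on the skip path
    "block1_mid": [0.7, 0.0, 0.3],
    "block2_mid": [0.0, 0.05, 0.6],
}

# (layer name, latent controlling output channels, latent controlling input channels)
layers = [
    ("block1.conv1", "block1_mid", "stage"),
    ("block1.conv2", "stage", "block1_mid"),
    ("block2.conv1", "block2_mid", "stage"),
    ("block2.conv2", "stage", "block2_mid"),
]

def kept(name):
    """Number of latent elements that survive the threshold."""
    return sum(abs(v) >= TAU for v in latents[name])

# Pruned (output channels, input channels) of every layer.
pruned_shapes = {layer: (kept(out_key), kept(in_key)) for layer, out_key, in_key in layers}
```

Both residual blocks keep the same two channels on the skip path, so the element-wise additions in the pruned network remain dimensionally consistent without any extra bookkeeping.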
We train and prune the networks from scratch without relying on the pre-trained model. The proximal gradient algorithm is first used to prune the network with initialization detailed in Subsec. 3.1. A target ratio is set for the pruning procedure. When the difference between the target ratio and the actual compression ratio is below 2%, the automatic pruning procedure stops. The training of the pruned network continues with the same protocol as done for the original network. Please refer to the supplementary for the detailed training protocol and the selection of hyper-parameters.
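The stopping rule described above can be sketched as follows. This is a simplified illustration: the cost model (a plain chain of convolutions with equal spatial resolution), the function names, and the channel counts are assumptions, and only the 2% tolerance comes from the text.

```python
def conv_chain_cost(channels, kernel=3):
    """Multiply-accumulate count per output pixel for a plain chain of convolutions."""
    return sum(c_in * c_out * kernel * kernel
               for c_in, c_out in zip(channels[:-1], channels[1:]))

def reached_target(kept_channels, full_channels, target_ratio, tolerance=0.02):
    """Return (stop flag, current ratio); stop once the ratio is within the tolerance."""
    ratio = conv_chain_cost(kept_channels) / conv_chain_cost(full_channels)
    return abs(ratio - target_ratio) < tolerance, ratio

full = [3, 64, 64, 64]   # hypothetical original channel counts (input first)
kept = [3, 45, 46, 45]   # hypothetical channel counts surviving the masks
done, ratio = reached_target(kept, full, target_ratio=0.5)
```

In a training loop, such a check would run after each proximal update, and the searching stage would end as soon as `done` becomes true.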
The experimental results of different network compression algorithms on image classification are shown in Table 1. For ResNet-56, the proposed method is compared with 9 different network compression methods and achieves the best performance, i.e. a 6.31% Top-1 error rate at the most intensively investigated 50% compression level. Note that this error rate is even lower than that of the uncompressed baseline. The compression on DenseNet-12-40 is reasonable compared with the other methods. For ResNet-110 and ResNet-164, the accuracy of our higher operating points DHP-62 and DHP-50 is not far from that of the baseline. More results on ResNet-110 and ResNet-164 are shown in Fig. 4. As can be seen in the figure, especially for ResNet-164, when the compression is not too severe (a ratio above 20%), the accuracy does not drop too much. The extreme compression prunes about 90% of the computation and parameters (for ResNet-164, the extreme compression only keeps 6.87% of the parameters). Thus, the drop in accuracy is reasonable.
On Tiny-ImageNet, DHP achieves lower Top-1 error rates than MetaPruning under the same FLOPs constraint. Our lower operating points achieve lower Top-1 error rates than the narrowed versions of the mobile networks. On MobileNetV1, the error rate of DHP-50 is lower than that of MobileNetV1-0.75 by 1.08% with 6.37% fewer FLOPs and nearly 10% fewer parameters, while on MobileNetV2 the accuracy gain of DHP-10 over MobileNetV2-0.3 goes to 8.88%. Thus, we hypothesize that we can target an error rate lower than that of the original version by pruning the widened mobile networks. This is confirmed by comparing the accuracy of our DHP-24-2 with the baseline accuracy in Fig. 1.
The image super-resolution results are shown in Table 2. We compare our method with the factorized convolution (Factor) [59], filter basis (Basis) [33], and K-means clustering (Clustering) [54] methods. To fairly compare the methods and measure the practical compression effectiveness, five metrics are involved, including Peak Signal-to-Noise Ratio (PSNR), floating point operations (FLOPs), number of parameters, runtime, and GPU memory consumption. By observing the five metrics, we draw several conclusions. Firstly, previous methods mainly focus on the reduction of FLOPs and the number of parameters without paying special attention to the actual acceleration. Although some methods such as Clustering can substantially reduce the number of parameters while maintaining quite good PSNR, the actual computing resources are unchanged (GPU memory) or even increased (runtime) due to the overhead introduced by centroid indexing. Secondly, convolution factorization or decomposition results in additional CUDA kernel calls, which is not efficient for the actual acceleration. Thirdly, for the proposed method, the two model complexity metrics change consistently across different operating points, which leads to a consistent reduction of computation resources including runtime and memory consumption. Fourthly, the proposed DHP results in models that are inference-efficient as well as accuracy-preserving. The visual results are shown in Fig. 5. As can be seen, the proposed method produces images almost indistinguishable from those of the baseline while achieving actual acceleration.
Table 2: Image super-resolution results.
Network | Method | PSNR Set5 | PSNR Set14 | PSNR B100 | PSNR Urban100 | PSNR DIV2K | FLOPs | Params | Runtime | GPU Mem
SRResNet [29] | Baseline | 32.03 | 28.50 | 27.52 | 25.88 | 28.85 | 32.81 | 1535.62 | 7.80 | 2.2763
SRResNet [29] | Clustering [54] | 31.93 | 28.44 | 27.47 | 25.71 | 28.75 | 32.81 | 341.25 | 10.84 | 2.2727
SRResNet [29] | Factor-SIC3 [59] | 31.86 | 28.38 | 27.40 | 25.58 | 28.65 | 20.83 | 814.73 | 17.59 | 2.2737
SRResNet [29] | DHP-60 (Ours) | 31.97 | 28.47 | 27.48 | 25.76 | 28.79 | 20.27 | 948.63 | 6.98 | 2.0568
SRResNet [29] | Basis-32-32 [33] | 31.90 | 28.42 | 27.44 | 25.65 | 28.69 | 19.77 | 738.18 | 13.14 | 0.9331
SRResNet [29] | Factor-SIC2 [59] | 31.68 | 28.32 | 27.37 | 25.47 | 28.58 | 18.38 | 661.13 | 15.77 | 2.2763
SRResNet [29] | Basis-64-14 [33] | 31.84 | 28.38 | 27.39 | 25.54 | 28.63 | 17.49 | 598.91 | 9.77 | 0.7267
SRResNet [29] | DHP-40 (Ours) | 31.90 | 28.45 | 27.47 | 25.72 | 28.75 | 13.68 | 638.75 | 5.75 | 1.7152
SRResNet [29] | DHP-20 (Ours) | 31.77 | 28.34 | 27.40 | 25.55 | 28.60 | 7.75 | 357.92 | 4.52 | 1.3362
EDSR [35] | Baseline | 32.10 | 28.55 | 27.55 | 26.02 | 28.93 | 90.36 | 3696.67 | 16.83 | 0.4438
EDSR [35] | Clustering [54] | 31.92 | 28.46 | 27.48 | 25.76 | 28.80 | 90.36 | 821.48 | 21.70 | 0.6775
EDSR [35] | Factor-SIC3 [59] | 31.96 | 28.47 | 27.49 | 25.81 | 28.81 | 65.49 | 2189.34 | 33.22 | 1.5007
EDSR [35] | Basis-128-40 [33] | 32.03 | 28.45 | 27.50 | 25.81 | 28.82 | 62.65 | 2003.46 | 16.59 | 0.4679
EDSR [35] | Factor-SIC2 [59] | 31.82 | 28.40 | 27.43 | 25.63 | 28.70 | 60.90 | 1904.67 | 27.69 | 1.1247
EDSR [35] | Basis-128-27 [33] | 31.95 | 28.42 | 27.46 | 25.76 | 28.76 | 58.28 | 1739.52 | 17.13 | 0.4772
EDSR [35] | DHP-60 (Ours) | 31.99 | 28.52 | 27.53 | 25.92 | 28.88 | 55.67 | 2279.64 | 7.81 | 0.4658
EDSR [35] | DHP-40 (Ours) | 32.01 | 28.49 | 27.52 | 25.86 | 28.85 | 37.77 | 1529.78 | 6.01 | 0.4658
EDSR [35] | DHP-20 (Ours) | 31.94 | 28.42 | 27.47 | 25.69 | 28.77 | 19.40 | 785.85 | 4.37 | 0.4658
The compression results for image denoising are shown in Table 3. The same metrics as for super-resolution are reported for denoising. An additional method, i.e. filter group approximation (Group) [48], is included. In addition to the same conclusions as in Subsec.0.A.2, another two conclusions are drawn here. The grouped convolution approximation method fails to reduce the actual computation resources despite quite good accuracy and a satisfactory reduction of FLOPs and number of parameters. This might be due to the introduced additional $1 \times 1$ convolution and possibly the inefficient implementation of group convolution in current deep learning toolboxes. For DnCNN, one interesting phenomenon is that the Factor method achieves even better accuracy than the baseline but has a larger appetite for other resources. This is not surprising due to two facts. First, the SIC layer of Factor doubles the actual convolutional layers. So Factor-SIC3 has five times more convolutional layers, which definitely slows down the execution. Second, Factor has skip connections within the SIC layer. The higher accuracy of Factor just validates the effectiveness of skip connections. The performance of the other methods could also be improved if skip connections were added. The visual results are shown in Fig. 6.
Table 3: Image denoising results.
Network | Method | PSNR BSD68 | PSNR DIV2K | FLOPs | Params | Runtime | GPU Mem
DnCNN [66] | Baseline | 24.93 | 26.73 | 9.10 | 557.06 | 23.69 | 3.99
DnCNN [66] | Clustering [54] | 24.90 | 26.67 | 9.10 | 123.79 | 21.84 | 3.99
DnCNN [66] | DHP-60 (Ours) | 24.91 | 26.69 | 5.62 | 344.61 | 17.11 | 2.70
DnCNN [66] | DHP-40 (Ours) | 24.89 | 26.65 | 3.81 | 233.75 | 17.19 | 2.33
DnCNN [66] | Factor-SIC3 [59] | 24.97 | 26.83 | 3.53 | 219.14 | 126.22 | 5.96
DnCNN [66] | Group [48] | 24.88 | 26.64 | 3.32 | 204.74 | 26.81 | 4.02
DnCNN [66] | Factor-SIC2 [59] | 24.93 | 26.76 | 2.36 | 147.14 | 85.14 | 5.96
DnCNN [66] | DHP-20 (Ours) | 24.84 | 26.58 | 2.00 | 122.72 | 10.39 | 1.99
UNet [51] | Baseline | 25.17 | 27.17 | 3.41 | 7759.52 | 7.27 | 3.75
UNet [51] | Clustering [54] | 25.01 | 26.90 | 3.41 | 1724.34 | 9.66 | 3.73
UNet [51] | DHP-60 (Ours) | 25.14 | 27.11 | 2.12 | 4760.27 | 6.09 | 2.93
UNet [51] | Factor-SIC3 [59] | 25.04 | 26.94 | 1.56 | 3415.65 | 40.36 | 4.71
UNet [51] | Group [48] | 25.13 | 27.08 | 1.49 | 2063.91 | 9.00 | 3.75
UNet [51] | DHP-40 (Ours) | 25.12 | 27.08 | 1.43 | 3238.12 | 5.17 | 2.04
UNet [51] | Factor-SIC2 [59] | 25.01 | 26.90 | 1.22 | 2510.82 | 29.31 | 4.71
UNet [51] | DHP-20 (Ours) | 25.04 | 26.97 | 0.75 | 1611.03 | 3.99 | 1.77
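Figure (visual results for super-resolution, image not reproduced):
Panel | (a) LR | (b) EDSR | (d) Cluster | (c) Basis | (f) Factor | (e) DHP
---|---|---|---|---|---|---
PSNR/FLOPs/Runtime | – | 32.85/28.59/14.10 | 32.50/28.59/19.75 | 32.65/19.82/14.71 | 32.24/19.28/25.49 | 32.64/17.61/5.40
Figure (visual results for denoising, image not reproduced):
Panel | (a) Noisy | (b) UNet | (f) Cluster | (d) Factor | (e) Group | (c) DHP
---|---|---|---|---|---|---
PSNR/FLOPs/Runtime | – | 25.60/1.08/7.27 | 25.30/1.08/9.66 | 25.37/0.49/40.36 | 25.51/0.47/9.00 | 25.57/0.45/6.09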
In this paper, we proposed a differentiable automatic meta pruning method via hypernetworks for network compression. The differentiability comes from the specially designed hypernetwork and the proximal gradient used to search the candidate network configurations. The automation of pruning lies in the sparsity regularization applied uniformly to the latent vectors and the proximal gradient that solves the resulting problem. By pruning mobile networks with width multiplier , we obtained models with higher accuracy but lower computation complexity than that with . We hypothesize that this is due to the distinct per-layer configuration resulting from the automatic pruning. Future work might investigate whether this phenomenon recurs for other networks.
This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and an Nvidia grant.
Learning the number of neurons in deep networks. In Proc. NeurIPS, pp. 2270–2278.
Einconv: exploring unexplored tensor decompositions for convolutional neural networks. In Proc. NeurIPS, pp. 5553–5563.
Importance estimation for neural network pruning. In Proc. CVPR, pp. 11264–11272.
Regularized evolution for image classifier architecture search. In Proc. AAAI, Vol. 33, pp. 4780–4789.
As explained in the main paper, the proposed DHP method does not rely on a pretrained model. Thus, all of the networks are trained and pruned from scratch. The hypernetworks are first initialized and used along with the proximal gradient to sparsify the latent vectors. When the difference between the target and the actual FLOPs compression ratio is below 2%, the pruning procedure stops. Then the pruned latent vectors as well as the pruned outputs of the hypernetworks are derived. After that, the outputs of the hypernetworks are used as the weight parameters of the backbone network and updated directly by SGD or Adam. The hypernetworks themselves are removed at this point. After the pruning procedure, the training continues, and the training protocol is the same as that used for training the original uncompressed network. The number of pruning epochs is much smaller than the number used for training the original network. Usually, the pruning procedure continues for about 10 epochs, compared with the hundreds of epochs for training the uncompressed network. The rest of this section describes the training protocols of the different tasks.
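The pruning phase described above can be illustrated with a minimal sketch. It is not the released implementation: the FLOPs ratio is replaced by a toy proxy (the fraction of nonzero latent elements), the task-loss gradient step is omitted, and the names `soft_threshold`, `toy_flops_ratio`, `should_stop`, and `prune_latent` are hypothetical. Only the ℓ1 proximal step (soft-thresholding) and the 2% stopping rule are taken from the text.

```python
def soft_threshold(latent, threshold):
    """Proximal operator of threshold * ||latent||_1 (element-wise shrinkage)."""
    shrunk = []
    for v in latent:
        magnitude = max(abs(v) - threshold, 0.0)
        shrunk.append(magnitude if v >= 0 else -magnitude)
    return shrunk


def toy_flops_ratio(latent):
    """Assumption: the FLOPs ratio is approximated by the fraction of nonzero elements."""
    return sum(1 for v in latent if v != 0) / len(latent)


def should_stop(actual_ratio, target_ratio, tolerance=0.02):
    """Stop pruning once the actual ratio is within 2% of the target."""
    return abs(actual_ratio - target_ratio) < tolerance


def prune_latent(latent, target_ratio, threshold=0.01, max_steps=1000):
    """Shrink the latent vector until the toy FLOPs ratio reaches the target."""
    steps = 0
    while steps < max_steps and not should_stop(toy_flops_ratio(latent), target_ratio):
        # A real training step would first apply the gradient of the task loss here.
        latent = soft_threshold(latent, threshold)
        steps += 1
    kept = [v for v in latent if v != 0]  # zero elements are removed
    return kept, toy_flops_ratio(latent), steps


# Toy latent vector with evenly spaced positive entries (illustrative values only).
latent = [(i + 0.5) / 100 for i in range(100)]
kept, ratio, steps = prune_latent(latent, target_ratio=0.5)
print(len(kept), ratio, steps)
```

In the actual method the ratio would be computed from the FLOPs of the backbone network generated by the hypernetworks, and the surviving latent elements determine which hypernetwork outputs are kept as backbone weights.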
We evaluate the performance of compressed models on the CIFAR10 [26] dataset. The dataset contains 10 different classes. The training and testing subsets contain 50,000 and 10,000 images, respectively, with a resolution of 32×32. As is done by prior works [15, 22], we normalize all images using the channel-wise mean and standard deviation of the training set. Standard data augmentation is also applied. We train the networks for 300 epochs with the SGD optimizer and an initial learning rate of 0.1. The learning rate is decayed by a factor of 10 after 50% and 75% of the epochs. The momentum of SGD is 0.9. The weight decay factor is set to 0.0001. The batch size is 64.
For image classification, we also apply the pruning method on Tiny-ImageNet. It has 200 classes. Each class has 500 training images and 50 validation images. The resolution of the images is 64×64. The images are normalized with the channel-wise mean and standard deviation. Horizontal flip is used to augment the dataset. The networks are trained for 220 epochs with SGD and an initial learning rate of 0.1. The learning rate is decayed by a factor of 10 at Epoch 200, Epoch 205, Epoch 210, and Epoch 215. The momentum of SGD is 0.9. The weight decay factor is set to 0.0001. The batch size is 64.
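The two classification schedules can be written as one step-decay rule. The sketch below is only an illustration: the helper name `step_lr` is hypothetical, epochs are assumed to be 0-indexed, and "after 50% and 75% of the epochs" is read as milestones at epochs 150 and 225 of 300.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at `epoch` when the rate is multiplied by gamma at each milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Assumed milestone readings of the schedules described in the text.
CIFAR10_MILESTONES = [150, 225]                  # 50% and 75% of 300 epochs
TINY_IMAGENET_MILESTONES = [200, 205, 210, 215]  # 220 epochs in total

for epoch in (0, 150, 225, 299):
    print("CIFAR10", epoch, step_lr(epoch, 0.1, CIFAR10_MILESTONES))
for epoch in (0, 200, 205, 210, 215):
    print("Tiny-ImageNet", epoch, step_lr(epoch, 0.1, TINY_IMAGENET_MILESTONES))
```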
For image super-resolution, we train the networks on the DIV2K dataset. It contains 800 training images, 100 validation images, and 100 test images. We use the 800 training images to train our networks and validate on the validation subset. Image patches are extracted from the training images. For EDSR, the patch size of the low-resolution input patch is while for SRResNet the patch size is . The batch size is 16. The networks are optimized with the Adam optimizer, using its default hyper-parameters. The weight decay factor is 0.0001. The networks are trained for 300 epochs. The learning rate starts from 0.0001 and decays by a factor of 10 after 200 epochs.
Note that, in order to speed up the training of EDSR, we adopt a simplified version of EDSR. The original EDSR contains 32 residual blocks and each convolutional layer in the residual blocks has 256 channels. Our simplified version has 8 residual blocks and each has two convolutional layers with 128 channels.
For image denoising, we also train the networks on the DIV2K dataset, but all of the images are converted to gray images. As done for super-resolution, image patches are extracted from the training images. For DnCNN, the patch size of the input image is and the batch size is 64. For UNet, the patch size and the batch size are and 16, respectively. Gaussian noise is added to degrade the input patches on the fly with noise level . The Adam optimizer is again used to train the network. The weight decay factor is 0.0001. The networks are trained for 60 epochs and each epoch contains 10,000 iterations, i.e., 600k iterations in total. The learning rate starts at 0.0001 and decays by a factor of 10 at Epoch 40.
Every convolutional layer, including standard convolution, depth-wise convolution, point-wise convolution, group convolution, and transposed convolution, has a latent vector attached. The latent vectors control the channels of the convolutional layer. Thus, by dealing with only the latent vectors, we can control how the convolutional layers are pruned. In practice, some complicated cases occur in modern network architectures, and the latent vectors have to be shared among different layers. Thus, during the development of the algorithm, we summarized some basic rules for latent vectors. In the following, we first describe the general rules for latent vectors and then detail the specific rules for special network blocks.
Every convolutional layer has a latent vector attached.
The channels that the latent vector controls and the dimension of the latent vector vary with the type of convolutional layer.
For standard convolution, point-wise convolution and transposed convolution, the latent vector controls the output channels of the layer and the dimension of the latent vector is the same as the number of output channels.
For depth-wise convolution and group convolution, the latent vector controls the input channels per group and the dimension of the latent vector is the same as the number of input channels per group. That is, the latent vector of depth-wise convolution contains only one element.
Since the output and input channels of consecutive layers are correlated, the latent vectors have to be shared among consecutive layers. That is, the hypernetworks have to receive the latent vectors of the previous layer and the current layer as input in order to make the input and output channels consistent with the previous and following layers.
Not every latent vector needs to be sparsified. The latent vectors exempt from sparsification are listed as follows.
The latent vector that controls the input channels of the first convolutional layer. This latent vector has the same dimension as the number of input image channels, e.g. 3 for RGB images and 1 for gray images. Of course, the input images do not need to be pruned.
The latent vector that controls the output channels of the last convolutional layer of an image classification network. This latent vector is related to the fully connected linear layers of image classifiers. Since we do not intend to prune the fully connected layer, the correlated latent vector is not pruned either.
The latent vector attached to depth-wise convolution and group convolution. This latent vector controls the input channels per group. The input and output channels of depth-wise and group convolution are correlated. In order to compress depth-wise and group convolution, we prune the output channels. This has the same effect as reducing the number of groups, which is controlled by the latent vectors of the previous layer, instead of the input channels per group, which are controlled by the latent vectors of the current layer.
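The general rules above can be summarized in a small sketch. The function names (`latent_dim`, `is_sparsified`, `pruned_weight_shape`) and the `(out, in, k, k)` weight layout are assumptions made for illustration and are not taken from the released code.

```python
def latent_dim(conv_type, in_channels, out_channels, groups=1):
    """Dimension of the latent vector attached to a convolutional layer."""
    if conv_type in ("standard", "pointwise", "transposed"):
        return out_channels            # controls the output channels
    if conv_type in ("depthwise", "group"):
        return in_channels // groups   # controls the input channels per group
    raise ValueError(f"unknown conv type: {conv_type}")


def is_sparsified(conv_type, controls_network_input=False, feeds_classifier=False):
    """Whether a latent vector is subject to sparsification under the listed exemptions."""
    if controls_network_input or feeds_classifier:
        return False
    return conv_type not in ("depthwise", "group")


def pruned_weight_shape(prev_latent, cur_latent, kernel_size):
    """Shape of a standard conv weight after removing channels with zero latent elements.

    The previous layer's latent vector selects the input channels and the
    current layer's latent vector selects the output channels.
    """
    kept_in = sum(1 for v in prev_latent if v != 0)
    kept_out = sum(1 for v in cur_latent if v != 0)
    return (kept_out, kept_in, kernel_size, kernel_size)


print(latent_dim("standard", 16, 64))              # one element per output channel
print(latent_dim("depthwise", 32, 32, groups=32))  # a single element
print(pruned_weight_shape([1.0, 0.0, 0.5], [0.0, 0.2, 0.3, 0.0], 3))
```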
The residual networks, including ResNet, SRResNet, and EDSR, are constructed by stacking a number of residual blocks. Depending on the dimension of the feature maps, a residual network contains several stages with progressively decreasing feature map dimension and increasing number of feature maps. (Note that the feature map dimension of EDSR and SRResNet does not change across the residual blocks, so there is only one stage for those networks.) For the residual blocks within the same stage, the output channels are correlated due to the existence of the skip connections. In order to prune the second convolution of the residual blocks within the same stage, we set a shared latent vector for them. Thus, by dealing with only this shared latent vector, all of the second convolutions of the residual blocks can be pruned together. Please refer to Table 4 for the ablation study on latent vector sharing and non-sharing strategies.
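As a rough illustration of the sharing rule for residual stages, the sketch below derives per-block channel counts from one shared latent vector plus one private latent vector per block. The helper names and the toy latent values are hypothetical.

```python
def count_nonzero(latent):
    """Number of channels that survive pruning."""
    return sum(1 for v in latent if v != 0)


def residual_stage_channels(shared_latent, first_conv_latents):
    """Channel counts (first conv output, second conv output) for every block in a stage.

    Each block's first convolution has its own latent vector, while the second
    convolutions of all blocks in the stage are controlled by `shared_latent`,
    so their outputs stay compatible with the skip connections.
    """
    stage_out = count_nonzero(shared_latent)
    return [(count_nonzero(z), stage_out) for z in first_conv_latents]


shared = [0.4, 0.0, 0.7, 0.1]
per_block = [[0.2, 0.0, 0.0, 0.9], [0.5, 0.3, 0.0, 0.6]]
print(residual_stage_channels(shared, per_block))
```

Pruning one element of the shared vector removes the same output channel from every block of the stage at once, which is what keeps the additions along the skip connections well defined.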
Similar to residual networks, DenseNet also contains several stages with different feature map configurations. Unlike residual networks, however, each dense block concatenates its input and output to form the final output of the block. As a result, each dense block receives as input the outputs of all of the previous dense blocks within the same stage. Thus, the hypernetwork of a dense block also has to receive the latent vectors of the corresponding dense blocks as input.
The inverted residual blocks are just a special case of residual blocks, so the way the latent vectors are shared across different blocks is the same as for normal residual blocks. Here we specifically address the sharing strategy within the block, which is needed because of the depth-wise convolution. The inverted residual block has the architecture of "point-wise conv + depth-wise conv + point-wise conv". As explained earlier, the latent vector of depth-wise convolution controls the input channels per group, i.e. 1 here. Thus, the latent vector of the first point-wise convolution controls not only its output channels but also the input channels of the depth-wise convolution and the second point-wise convolution. Therefore, this latent vector has to be passed to the hypernetworks of those convolutional layers.
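The within-block sharing can be sketched as follows. The function `inverted_residual_config` and its dictionary layout are hypothetical; the sketch only shows that one latent vector fixes the width of all three layers' hidden dimension.

```python
def inverted_residual_config(in_channels, expand_latent, out_latent):
    """Channel configuration of a pruned inverted residual block.

    `expand_latent` is the latent vector of the first point-wise convolution.
    Its nonzero count sets the hidden width, which is also the number of
    groups of the depth-wise convolution and the input width of the second
    point-wise convolution. `out_latent` controls the block output.
    """
    hidden = sum(1 for v in expand_latent if v != 0)
    out_channels = sum(1 for v in out_latent if v != 0)
    return {
        "pointwise_1": {"in": in_channels, "out": hidden},
        "depthwise": {"in": hidden, "out": hidden, "groups": hidden},
        "pointwise_2": {"in": hidden, "out": out_channels},
    }


config = inverted_residual_config(
    in_channels=16,
    expand_latent=[0.3, 0.0, 0.8, 0.0, 0.1, 0.6],
    out_latent=[0.5, 0.2, 0.0, 0.9],
)
print(config)
```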
The image super-resolution networks have upsampler blocks attached at the tail of the network to increase the spatial resolution of the feature map. For the scaling factor of ×4, two upsamplers are attached and each doubles the spatial resolution. Each upsampler block contains a standard convolutional layer that increases the number of feature maps by a factor of 4 and a pixel shuffler that shuffles every 4 consecutive feature maps into the spatial dimension. Thus, the output channels of the convolutional layer in the upsampler are correlated with its input channels. If one input channel is pruned, then the 4 corresponding consecutive output channels should also be pruned. To achieve this pruning control, a common latent vector is used for the input and output channels, and the vector is repeated in an interleaved manner to form the one controlling the output channels.
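The interleaved repetition can be sketched in a few lines. The function name `expand_for_pixel_shuffle` is hypothetical, and the sketch assumes that the pixel shuffler consumes output channels in consecutive groups of `scale * scale`, as described above.

```python
def expand_for_pixel_shuffle(latent, scale=2):
    """Repeat each latent element scale*scale times, keeping repeats adjacent.

    Element i of `latent` controls input channel i and output channels
    i*scale*scale ... (i+1)*scale*scale - 1 of the upsampler convolution.
    """
    repeats = scale * scale
    return [v for v in latent for _ in range(repeats)]


latent = [1.0, 0.0, 0.5]  # the second input channel is pruned
print(expand_for_pixel_shuffle(latent))
```

With `scale=2`, a zero in the common latent vector therefore zeroes out exactly the 4 consecutive output channels that the pixel shuffler would fold into one upsampled feature map.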
Share | Regularizer | Target FLOPs Ratio (%) | Actual FLOPs Ratio (%) | Actual Parameter Ratio (%) | Top-1 Error (%)
---|---|---|---|---|---
Yes | | 38 | 39.96 | 52.49 | 7.41
No | | 38 | 39.75 | 55.43 | 7.32
No | | 38 | 39.40 | 54.24 | 7.91
Yes | | 38 | 39.60 | 49.00 | 6.86
No | | 38 | 39.30 | 58.35 | 7.98
No | | 38 | 39.54 | 54.04 | 7.03
Yes | | 50 | 51.27 | 56.84 | 7.13
No | | 50 | 51.44 | 65.47 | 6.85
No | | 50 | 50.96 | 64.05 | 6.85
Yes | | 50 | 51.68 | 57.74 | 6.52
No | | 50 | 50.23 | 62.83 | 7.11
No | | 50 | 50.18 | 59.15 | 6.74
Network | Baseline Top-1 Error (%) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%)
---|---|---|---|---|---
ResNet-56 | 7.05 | Variational [67] | 7.74 | 79.70 | 79.51
ResNet-56 | 7.05 | GAL-0.6 [36] | 6.62 | 63.40 | 88.20
ResNet-56 | 7.05 | 56-pruned-B [30] | 6.94 | 62.40 | 86.30
ResNet-56 | 7.05 | NISP [64] | 6.99 | 56.39 | 57.40
ResNet-56 | 7.05 | DHP-50 (Ours) | 6.31 | 50.98 | 55.62
ResNet-56 | 7.05 | CaP [44] | 6.78 | 50.20 | –
ResNet-56 | 7.05 | ENC [25] | 7.00 | 50.00 | –
ResNet-56 | 7.05 | AMC [17] | 8.10 | 50.00 | –
ResNet-56 | 7.05 | KSE [34] | 6.77 | 48.00 | 45.27
ResNet-56 | 7.05 | FPGM [16] | 6.74 | 47.70 | –
ResNet-56 | 7.05 | GAL-0.8 [36] | 8.42 | 39.80 | 34.10
ResNet-56 | 7.05 | DHP-38 (Ours) | 6.86 | 39.60 | 49.00
ResNet-110 | 5.31 | Variational [67] | 7.04 | 63.56 | 58.73
ResNet-110 | 5.31 | DHP-62 (Ours) | 5.36 | 62.50 | 59.28
ResNet-110 | 5.31 | GAL-0.5 [36] | 7.26 | 51.50 | 55.20
ResNet-110 | 5.31 | DHP-20 (Ours) | 6.93 | 21.77 | 21.30
ResNet-164 | 4.97 | SSS [24] | 5.78 | 53.53 | –
ResNet-164 | 4.97 | DHP-50 (Ours) | 5.29 | 51.86 | 44.10
ResNet-164 | 4.97 | Variational [67] | 6.84 | 50.92 | 43.30
ResNet-164 | 4.97 | DHP-20 (Ours) | 5.91 | 21.47 | 20.24
DenseNet-12-40 | 5.26 | Variational [67] | 6.84 | 55.22 | 40.33
DenseNet-12-40 | 5.26 | DHP-38 (Ours) | 6.06 | 39.80 | 63.76
DenseNet-12-40 | 5.26 | DHP-28 (Ours) | 6.51 | 29.52 | 26.01
DenseNet-12-40 | 5.26 | GAL-0.1 [36] | 6.77 | 28.60 | 25.00
DenseNet-12-40 | 5.26 | DHP-24 (Ours) | 6.53 | 27.12 | 25.76
DenseNet-12-40 | 5.26 | DHP-20 (Ours) | 7.17 | 22.85 | 20.38
VGG-16 | 6.34 | DHP-40 (Ours) | 7.61 | 40.11 | 35.61
VGG-16 | 6.34 | DHP-20 (Ours) | 8.60 | 21.65 | 17.18
Network | Baseline Top-1 Error (%) | Compression Method | Top-1 Error (%) | FLOPs Ratio (%) | Parameter Ratio (%)
---|---|---|---|---|---
ResNet50 | 42.35 | DHP-40 (Ours) | 43.97 | 41.57 | 45.89
ResNet50 | 42.35 | MetaPruning [40] | 49.82 | 10.21 | 37.32
ResNet50 | 42.35 | DHP-9 (Ours) | 47.00 | 10.95 | 13.15
MobileNetV1 | 51.87 | DHP-24-2 (Ours) | 50.47 | 98.95 | 79.65
MobileNetV1 | 51.87 | MobileNetV1-0.75 [19] | 53.18 | 57.74 | 57.75
MobileNetV1 | 51.87 | MetaPruning [40] | 54.66 | 56.49 | 88.06
MobileNetV1 | 51.87 | DHP-50 (Ours) | 51.96 | 51.37 | 47.76
MobileNetV1 | 51.87 | DHP-30 (Ours) | 54.47 | 31.79 | 26.28
MobileNetV1 | 51.87 | MobileNetV1-0.5 [19] | 55.44 | 26.99 | 27.00
MobileNetV1 | 51.87 | DHP-10 (Ours) | 55.71 | 11.8 | 9.18
MobileNetV1 | 51.87 | MobileNetV1-0.3 [19] | 60.54 | 10.46 | 10.63
MobileNetV2 | 43.83 | DHP-24-2 (Ours) | 43.14 | 96.34 | 107.57
MobileNetV2 | 43.83 | MobileNetV2-0.75 [52] | 45.00 | 58.60 | 58.94
MobileNetV2 | 43.83 | DHP-50 (Ours) | 44.13 | 51.71 | 56.32
MobileNetV2 | 43.83 | DHP-30 (Ours) | 44.42 | 31.85 | 36.29
MobileNetV2 | 43.83 | MobileNetV2-0.5 [52] | 48.01 | 28.17 | 28.59
MobileNetV2 | 43.83 | DHP-10 (Ours) | 47.65 | 11.94 | 16.86
MobileNetV2 | 43.83 | MobileNetV2-0.3 [52] | 53.21 | 11.08 | 11.89
MobileNetV2 | 43.83 | MetaPruning [40] | 56.53 | 11.08 | 89.92
MNasNet | 50.96 | DHP-28-2 (Ours) | 50.11 | 98.52 | 58.51
MNasNet | 50.96 | MNasNet-0.75 [56] | 52.32 | 71.22 | 63.89
MNasNet | 50.96 | DHP-69 (Ours) | 50.95 | 70.86 | 63.5
MNasNet | 50.96 | MNasNet-0.5 [56] | 52.99 | 41.70 | 35.61
MNasNet | 50.96 | DHP-39 (Ours) | 51.73 | 40.93 | 31.26
MNasNet | 50.96 | MNasNet-0.3 [56] | 55.87 | 24.72 | 19.90
MNasNet | 50.96 | DHP-22 (Ours) | 53.92 | 23.72 | 15.76
ShuffleNetV2-1.5 | 43.71 | ShuffleNetV2-1.0 [42] | 46.35 | 48.29 | 54.35
ShuffleNetV2-1.5 | 43.71 | DHP-46 (Ours) | 45.13 | 47.34 | 48.1
The special treatment of ShuffleNetV2 [42] is needed because of the channel shuffle operation. For each inverted residual block in ShuffleNetV2, the input feature map is divided into two branches. Different operations are applied to the two branches, after which a channel shuffle operation is conducted between the feature maps of the two branches so that information is exchanged between them. Due to this operation, the branches in all of the inverted residual blocks within the same stage are inter-connected. Thus, if one channel in a branch is to be pruned, the corresponding channel in the other branch should also be pruned. This is not a problem before the channel shuffle operation of the current inverted residual block. But after the channel shuffle operation, the two pruned channels appear in the same branch of the next inverted residual block while none of the channels in the other branch are pruned. This causes imbalanced branches. Considering that the channels are also shuffled in the next inverted residual block and that the shuffle operation needs balanced branches, pruning ShuffleNetV2 is almost impossible for traditional network pruning methods. For the proposed DHP method, however, we simply assign a shared latent vector to all of the inter-connected branches of the inverted residual blocks within the same stage. By pruning the shared latent vector, all of the branches are compressed. Although the channel shuffle operation still complicates the situation and the pruned channels before and after the channel shuffle operation do not match exactly, the proposed DHP still makes it possible to prune ShuffleNetV2. To the best of the authors' knowledge, this is the first attempt to automatically prune ShuffleNetV2. As with the other networks, the automatic pruning method finds a distinct layer-wise configuration.
The ablation study on the latent vector sharing is shown in Table 4. As shown by the table, the latent vector sharing strategy outperforms the non-sharing strategy consistently except for the case where the gap between the parameter compression ratio of different strategies is relatively large. Due to this fact, we develop various latent vector sharing rules for easier and better automatic network pruning.
More results on image classification networks are shown in Table 5 and Table 6. In addition to the results in the main paper, compression results on MNasNet [56] and ShuffleNetV2 [42] are also shown. Note that MobileNetV1, MobileNetV2, ShuffleNetV2, and MNasNet are quite efficient networks designed by human experts or automatic architecture search methods. The proposed DHP can lead to versions of those networks that are more efficient than the variants obtained with different width multipliers. This phenomenon validates the importance of distinct per-layer configurations.