# Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon

How to develop slim and accurate deep neural networks has become crucial for real-world applications, especially for those employed in embedded systems. Though previous work along this research line has shown some promising results, most existing methods either fail to significantly compress a well-trained deep network or require a heavy retraining process for the pruned deep network to re-boost its prediction performance. In this paper, we propose a new layer-wise pruning method for deep neural networks. In our proposed method, parameters of each individual layer are pruned independently based on second order derivatives of a layer-wise error function with respect to the corresponding parameters. We prove that the final prediction performance drop after pruning is bounded by a linear combination of the reconstructed errors caused at each layer. Therefore, there is a guarantee that one only needs to perform a light retraining process on the pruned network to resume its original prediction performance. We conduct extensive experiments on benchmark datasets to demonstrate the effectiveness of our pruning method compared with several state-of-the-art baseline methods.

## 1 Introduction

Intuitively, deep neural networks deepnature can approximate predictive functions of arbitrary complexity well when they have a huge number of parameters, i.e., many layers and neurons. In practice, the size of deep neural networks has increased tremendously, from LeNet-5 with less than 1M parameters lenet5 to VGG-16 with 138M parameters vgg16 . Such a large number of parameters not only makes deep models memory intensive and computationally expensive, but also urges researchers to dig into the redundancy of deep neural networks. On one hand, in neuroscience, recent studies point out that there are significant redundant neurons in the human brain, and that memory may be related to the disappearance of specific synapses de2017ultrastructural . On the other hand, in machine learning, both theoretical analysis and empirical experiments have shown evidence of redundancy in several deep models r1 ; nettrim . Therefore, it is possible to compress deep neural networks with little or no loss in prediction performance by pruning parameters with carefully designed criteria.

However, finding an optimal pruning solution is NP-hard because the search space for pruning is exponential in terms of parameter size. Recent work mainly focuses on developing efficient algorithms to obtain a near-optimal pruning solution reed1993pruning ; gong2014compressing ; hannips ; sun2016sparsifying ; dynamic . A common idea behind most existing approaches is to select parameters for pruning based on certain criteria, such as the increase in training error or the magnitude of the parameter values. As most of the existing pruning criteria are designed heuristically, there is no guarantee that the prediction performance of a deep neural network is preserved after pruning. Therefore, a time-consuming retraining process is usually needed to boost the performance of the trimmed neural network.

Instead of spending effort on a whole deep network, a layer-wise pruning method, Net-Trim, was proposed to learn sparse parameters by minimizing the reconstructed error for each individual layer nettrim . A theoretical analysis shows that the overall performance drop of the deep network is bounded by the sum of the reconstructed errors for each layer. In this way, the pruned deep network has a theoretical guarantee on its error. However, as Net-Trim adopts the $\ell_1$-norm to induce sparsity for pruning, it fails to obtain a high compression ratio compared with other methods hannips ; dynamic .

In this paper, we propose a new layer-wise pruning method for deep neural networks, aiming to achieve the following three goals: 1) For each layer, parameters can be highly compressed after pruning, while the reconstructed error is small. 2) There is a theoretical guarantee on the overall prediction performance of the pruned deep neural network in terms of reconstructed errors for each layer. 3) After the deep network is pruned, only a light retraining process is required to resume its original prediction performance.

To achieve our first goal, we borrow an idea from some classic pruning approaches for shallow neural networks, such as optimal brain damage (OBD) obd and optimal brain surgeon (OBS) obs . These classic methods approximate a change in the error function via functional Taylor Series, and identify unimportant weights based on second order derivatives. Though these approaches have proven to be effective for shallow neural networks, it remains challenging to extend them for deep neural networks because of the high computational cost on computing second order derivatives, i.e., the inverse of the Hessian matrix over all the parameters. In this work, as we restrict the computation on second order derivatives w.r.t. the parameters of each individual layer only, i.e., the Hessian matrix is only over parameters for a specific layer, the computation becomes tractable. Moreover, we utilize characteristics of back-propagation for fully-connected layers in well-trained deep networks to further reduce computational complexity of the inverse operation of the Hessian matrix.

To achieve our second goal, based on the theoretical results in nettrim , we provide a proof on the bound of performance drop before and after pruning in terms of the reconstructed errors for each layer. With such a layer-wise pruning framework using second-order derivatives for trimming parameters for each layer, we empirically show that after significantly pruning parameters, there is only a little drop of prediction performance compared with that before pruning. Therefore, only a light retraining process is needed to resume the performance, which achieves our third goal.

The contributions of this paper are summarized as follows. 1) We propose a new layer-wise pruning method for deep neural networks, which is able to significantly trim networks and preserve the prediction performance of networks after pruning with a theoretical guarantee. In addition, with the proposed method, a time-consuming retraining process for re-boosting the performance of the pruned network is waived. 2) We conduct extensive experiments to verify the effectiveness of our proposed method compared with several state-of-the-art approaches.

## 2 Related Works and Preliminary

Pruning methods have been widely used for model compression in early neural networks reed1993pruning and modern deep neural networks nettrim ; gong2014compressing ; hannips ; sun2016sparsifying ; dynamic . In the past, with relatively small training sets, pruning was crucial to avoid overfitting. Classical methods include OBD and OBS. These methods aim to prune parameters with the least increase of error approximated by second order derivatives. However, computation of the Hessian inverse over all the parameters is expensive. In OBD, the Hessian matrix is restricted to be a diagonal matrix to make it computationally tractable. However, this approach implicitly assumes parameters have no interactions, which may hurt the pruning performance. Different from OBD, OBS makes use of the full Hessian matrix for pruning. It obtains better performance but is much more computationally expensive, even when using the Woodbury matrix identity kailath1980linear , which is an iterative method to compute the Hessian inverse. For example, using OBS on VGG-16 requires computing the inverse of a Hessian matrix of size $138\text{M} \times 138\text{M}$, i.e., the number of parameters squared.

Regarding pruning for modern deep models, Han et al.  hannips proposed to delete unimportant parameters based on the magnitude of their absolute values, and retrain the remaining ones to recover the original prediction performance. This method achieves a considerable compression ratio in practice. However, as pointed out by pioneering research work obd ; obs , parameters with low magnitude can be necessary for low error. Therefore, magnitude-based approaches may eliminate the wrong parameters, resulting in a big drop of prediction performance right after pruning, and poor robustness before retraining r2 . Though some variants have tried to find better magnitude-based criteria cuhk ; li2016pruning , the significant drop of prediction performance after pruning still remains. To avoid pruning the wrong parameters, Guo et al.  dynamic introduced a mask matrix to indicate the state of network connections for dynamic pruning after each gradient descent step. Jin et al.  ith proposed an iterative hard thresholding approach to re-activate the pruned parameters after each pruning phase.

Besides Net-Trim, which is a layer-wise pruning method discussed in the previous section, other work has proposed to induce sparsity or low-rank approximation on certain layers for pruning lowrank ; lasso . However, as the $\ell_0$-norm or the $\ell_1$-norm sparsity-inducing regularization term increases the difficulty of optimization, the deep neural networks pruned with these methods either obtain a much smaller compression ratio nettrim compared with direct pruning methods or require retraining of the whole network to prevent accumulation of errors sun2016sparsifying .

Optimal Brain Surgeon. As our proposed layer-wise pruning method is an extension of OBS to deep neural networks, we briefly review the basics of OBS here. Consider a network in terms of parameters $w$ trained to a local minimum in error. The functional Taylor series of the error w.r.t. $w$ is: $\delta E = \left(\frac{\partial E}{\partial w}\right)^\top \delta w + \frac{1}{2}\delta w^\top H \delta w + O(\|\delta w\|^3)$, where $\delta$ denotes a perturbation of a corresponding variable, $H \equiv \partial^2 E / \partial w^2 \in \mathbb{R}^{m \times m}$ is the Hessian matrix, where $m$ is the number of parameters, and $O(\|\delta w\|^3)$ is the third and all higher order terms. For a network trained to a local minimum in error, the first term vanishes, and the term $O(\|\delta w\|^3)$ can be ignored. In OBS, the goal is to set one of the parameters to zero, denoted by $w_q$ (scalar), to minimize $\delta E$ in each pruning iteration. The resultant optimization problem is written as follows,

$$\min_q \frac{1}{2}\delta w^\top H \delta w, \quad \text{s.t. } e_q^\top \delta w + w_q = 0, \tag{1}$$

where $e_q$ is the unit selecting vector whose $q$-th element is 1 and otherwise 0. As shown in lag , the optimization problem (1) can be solved by the Lagrange multipliers method. Note that a computational bottleneck of OBS is to calculate and store the non-diagonal Hessian matrix and its inverse, which makes it impractical for pruning deep models, which usually have a huge number of parameters.
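For readers who want the intermediate step, the Lagrangian solution of problem (1) can be worked out as follows. This is a standard derivation sketched under the assumption that $H$ is invertible; the multiplier symbol $\lambda$ is introduced here for illustration only.

$$
\begin{aligned}
\mathcal{L}(\delta w, \lambda) &= \tfrac{1}{2}\,\delta w^\top H\, \delta w + \lambda\,\big(e_q^\top \delta w + w_q\big), \\
\frac{\partial \mathcal{L}}{\partial\, \delta w} = 0 \;&\Rightarrow\; \delta w = -\lambda\, H^{-1} e_q, \\
e_q^\top \delta w + w_q = 0 \;&\Rightarrow\; \lambda = \frac{w_q}{[H^{-1}]_{qq}}, \\
\delta w &= -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q, \qquad
\delta E = \tfrac{1}{2}\,\delta w^\top H\, \delta w = \frac{w_q^2}{2\,[H^{-1}]_{qq}} .
\end{aligned}
$$

The last expression is the increase in error attributed to removing $w_q$, so the parameter with the smallest value of $w_q^2 / (2[H^{-1}]_{qq})$ is the cheapest one to prune.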

## 3 Layer-wise Optimal Brain Surgeon

### 3.1 Problem Statement

We are given a training set of $n$ instances, $\{(x_j, y_j)\}_{j=1}^{n}$, and a well-trained deep neural network of $L$ layers (excluding the input layer). For simplicity in presentation, we suppose the neural network is a feed-forward (fully-connected) network; in Section 3.4, we will show how to extend our method to filter layers in Convolutional Neural Networks. Denote the input and the output of the whole deep neural network by $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ and $Y$, respectively. For a layer $l$, we denote the input and output of the layer by $Y^{l-1} = [y^{l-1}_1, \dots, y^{l-1}_n] \in \mathbb{R}^{m_{l-1} \times n}$ and $Y^{l} = [y^{l}_1, \dots, y^{l}_n] \in \mathbb{R}^{m_{l} \times n}$, respectively, where $y^l_j$ can be considered as a representation of $x_j$ in layer $l$, and $Y^0 = X$, $Y^L = Y$, and $m_0 = d$. Using one forward-pass step, we have $Y^l = \sigma(Z^l)$, where $Z^l = W_l^\top Y^{l-1}$ with $W_l \in \mathbb{R}^{m_{l-1} \times m_l}$ being the matrix of parameters for layer $l$, and $\sigma(\cdot)$ is the activation function. For convenience in presentation and proof, we define the activation function $\sigma(\cdot)$ as the rectified linear unit (ReLU) relu . We further denote by $\Theta_l \in \mathbb{R}^{m_{l-1} m_l \times 1}$ the vectorization of $W_l$. For a well-trained neural network, $Y^l$, $Z^l$ and $\Theta_l$ are all fixed and contain most of the information of the neural network. The goal of pruning is to set the values of some elements in $\Theta_l$ to zero.

### 3.2 Layer-Wise Error

During layer-wise pruning in layer $l$, the input $Y^{l-1}$ is fixed to be the same as in the well-trained network. Suppose we set the $q$-th element of $\Theta_l$, denoted by $\Theta_l[q]$, to zero, and get a new parameter vector, denoted by $\hat{\Theta}_l$. With $\hat{\Theta}_l$, we obtain a new output for layer $l$, denoted by $\hat{Y}^l$. Consider the root of the mean square error between $\hat{Y}^l$ and $Y^l$ over the whole training data as the layer-wise error:

$$\varepsilon^l = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\big(\hat{y}^l_j - y^l_j\big)^\top\big(\hat{y}^l_j - y^l_j\big)} = \frac{1}{\sqrt{n}}\big\|\hat{Y}^l - Y^l\big\|_F, \tag{2}$$

where $\|\cdot\|_F$ is the Frobenius norm. Note that for any single parameter pruning, one can compute its error $\varepsilon^l_q$, where $1 \le q \le m_{l-1} m_l$, and use it as a pruning criterion. This idea has been adopted by some existing methods r2 . However, in this way, for each parameter at each layer, one has to pass over the whole training data once to compute its error measure, which is computationally very expensive. A more efficient approach is to make use of the second order derivatives of the error function to help identify the importance of each parameter.

We first define an error function $E(\cdot)$ as

$$E^l = E(\hat{Z}^l) = \frac{1}{n}\big\|\hat{Z}^l - Z^l\big\|_F^2, \tag{3}$$

where $Z^l$ is the outcome of the weighted sum operation right before the activation function at layer $l$ of the well-trained neural network, and $\hat{Z}^l$ is the outcome of the weighted sum operation after pruning at layer $l$. Note that $Z^l$ is considered as the desired output of layer $l$ before activation. The following lemma shows that the layer-wise error is bounded by the error defined in (3).

###### Lemma 3.1.

With the error function (3) and $Y^l = \sigma(Z^l)$, the following holds: $\varepsilon^l \le \sqrt{E(\hat{Z}^l)}$.
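The relation between the layer-wise error (2) and the error function (3) can be checked numerically on a toy layer. The sketch below is only an illustration: the function names and the small matrices are made up for this example, rows are output units, and columns are training instances.

```python
import math

def relu(v):
    return max(0.0, v)

def frobenius_norm(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))

def layer_wise_error(Z_hat, Z):
    """Layer-wise error (2): (1/sqrt(n)) * ||relu(Z_hat) - relu(Z)||_F."""
    n = len(Z[0])  # number of instances (columns)
    diff = [[relu(a) - relu(b) for a, b in zip(row_hat, row)]
            for row_hat, row in zip(Z_hat, Z)]
    return frobenius_norm(diff) / math.sqrt(n)

def pre_activation_error(Z_hat, Z):
    """Error function (3): (1/n) * ||Z_hat - Z||_F^2."""
    n = len(Z[0])
    diff = [[a - b for a, b in zip(row_hat, row)]
            for row_hat, row in zip(Z_hat, Z)]
    return frobenius_norm(diff) ** 2 / n

# Hypothetical pre-activations: 2 output units x 3 instances.
Z = [[1.0, -2.0, 0.5], [-0.3, 0.8, 2.0]]      # well-trained network
Z_hat = [[0.9, -1.5, 0.7], [0.1, 0.6, 1.8]]   # after pruning

eps = layer_wise_error(Z_hat, Z)
bound = math.sqrt(pre_activation_error(Z_hat, Z))
print(eps, bound, eps <= bound)
```

The inequality holds because ReLU never increases the distance between two inputs, so the error measured after the activation cannot exceed the error measured before it.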

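Therefore, finding parameters whose deletion (setting them to zero) minimizes (2) can be translated into finding parameters whose deletion minimizes the error function (3). Following obd ; obs , the error function can be approximated by a functional Taylor series as follows,

$$E(\hat{Z}^l) - E(Z^l) = \delta E^l = \left(\frac{\partial E^l}{\partial \Theta_l}\right)^\top \delta\Theta_l + \frac{1}{2}\delta\Theta_l^\top H_l\, \delta\Theta_l + O\big(\|\delta\Theta_l\|^3\big), \tag{4}$$

where $\delta$ denotes a perturbation of a corresponding variable, $H_l \equiv \partial^2 E^l / \partial \Theta_l^2$ is the Hessian matrix w.r.t. $\Theta_l$, and $O(\|\delta\Theta_l\|^3)$ is the third and all higher order terms. It can be proven that with the error function defined in (3), the first (linear) term $\partial E^l / \partial \Theta_l$, evaluated at the well-trained parameters, and $O(\|\delta\Theta_l\|^3)$ are equal to $0$.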

Suppose that each time one aims to find a parameter $\Theta_l[q]$ to set to zero such that the change $\delta E^l$ is minimal. Similar to OBS, we can formulate it as the following optimization problem:

$$\min_q \frac{1}{2}\delta\Theta_l^\top H_l\, \delta\Theta_l, \quad \text{s.t. } e_q^\top \delta\Theta_l + \Theta_l[q] = 0, \tag{5}$$

where $e_q$ is the unit selecting vector whose $q$-th element is 1 and otherwise 0. By using the Lagrange multipliers method as suggested in lag , we obtain the closed-form solutions of the optimal parameter pruning and the resultant minimal change in the error function as follows,

$$\delta\Theta_l = -\frac{\Theta_l[q]}{[H_l^{-1}]_{qq}} H_l^{-1} e_q, \quad \text{and} \quad L_q = \delta E^l = \frac{1}{2}\frac{(\Theta_l[q])^2}{[H_l^{-1}]_{qq}}. \tag{6}$$

Here $L_q$ is referred to as the sensitivity of parameter $\Theta_l[q]$. We then select parameters to prune based on their sensitivity scores instead of their magnitudes. As mentioned in Section 2, magnitude-based criteria, which merely consider the numerator in (6), are a poor estimate of the sensitivity of parameters. Moreover, as the inverse Hessian matrix over the training data is involved in (6), it is able to capture the data distribution when measuring the sensitivities of parameters.
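A two-parameter toy case makes the difference between magnitude and sensitivity concrete. The following is a minimal sketch of the closed-form expressions in (6); the helper names, the Hessian `H`, and the parameter vector `theta` are hypothetical values chosen for illustration, not taken from the experiments.

```python
def inverse_2x2(H):
    """Closed-form inverse of a 2x2 matrix (enough for this toy case)."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def sensitivities(theta, H_inv):
    """L_q = theta[q]^2 / (2 * [H^-1]_qq) for every parameter q, as in (6)."""
    return [theta[q] ** 2 / (2.0 * H_inv[q][q]) for q in range(len(theta))]

def obs_update(theta, H_inv, q):
    """theta + delta, with delta = -(theta[q] / [H^-1]_qq) * H^-1 e_q."""
    scale = -theta[q] / H_inv[q][q]
    # H^-1 e_q is simply column q of H^-1.
    return [theta[p] + scale * H_inv[p][q] for p in range(len(theta))]

# Hypothetical positive definite Hessian and parameters.
H = [[4.0, 1.0], [1.0, 0.5]]
theta = [0.3, 0.5]

H_inv = inverse_2x2(H)
L = sensitivities(theta, H_inv)
q = min(range(len(theta)), key=lambda i: L[i])   # smallest sensitivity
theta_hat = obs_update(theta, H_inv, q)
print(L, q, theta_hat)
```

In this example the parameter with the smaller magnitude (0.3) has the larger sensitivity, so a magnitude-based rule and the sensitivity-based rule pick different parameters. The update also adjusts the surviving parameter to compensate for the removed one.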

After pruning the parameter $\Theta_l[q]$ with the smallest sensitivity, the parameter vector is updated via $\hat{\Theta}_l = \Theta_l + \delta\Theta_l$. With Lemma 3.1 and (6), we have that the layer-wise error for layer $l$ is bounded by

$$\varepsilon^l_q \le \sqrt{E(\hat{Z}^l)} = \sqrt{E(\hat{Z}^l) - E(Z^l)} = \sqrt{\delta E^l} = \frac{|\Theta_l[q]|}{\sqrt{2[H_l^{-1}]_{qq}}}. \tag{7}$$

Note that the first equality is obtained because $E(Z^l) = 0$. It is worth mentioning that though we merely focus on layer $l$, the Hessian matrix is still a square matrix of size $m_{l-1} m_l \times m_{l-1} m_l$. However, we will show how to significantly reduce the computation of $H_l^{-1}$ for each layer in Section 3.4.

### 3.3 Layer-Wise Error Propagation and Accumulation

So far, we have shown how to prune parameters for each layer, and how to estimate the errors they introduce independently. However, our aim is to control the consistency of the network's final output before and after pruning. To do this, in the following, we show how the layer-wise errors propagate to the final output layer, and that the accumulated error over multiple layers does not explode.

###### Theorem 3.2.

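Given a deep network pruned via the layer-wise pruning introduced in Section 3.2, where each layer has its own layer-wise error $\sqrt{\delta E^l}$ for $1 \le l \le L$, the accumulated error of the ultimate network output, $\tilde{\varepsilon}^L = \frac{1}{\sqrt{n}}\|\tilde{Y}^L - Y^L\|_F$, obeys:

$$\tilde{\varepsilon}^L \le \sum_{k=1}^{L-1}\left(\prod_{l=k+1}^{L}\big\|\hat{\Theta}_l\big\|_F \sqrt{\delta E^k}\right) + \sqrt{\delta E^L}, \tag{8}$$

where $\tilde{Y}^l = \sigma(\hat{W}_l^\top \tilde{Y}^{l-1})$, for $1 < l \le L$, denotes the 'accumulated pruned output' of layer $l$, and $\tilde{Y}^1 = \sigma(\hat{W}_1^\top X)$.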

Theorem 3.2 shows that: 1) The layer-wise error of a layer $l$ is scaled by the product of the Frobenius norms of the parameters of the following layers, i.e., the layers after the $l$-th layer, when it propagates to the final output; 2) The final error of the ultimate network output is bounded by the weighted sum of the layer-wise errors. The proof of Theorem 3.2 can be found in the Appendix.
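To see how the right-hand side of (8) combines the per-layer quantities, it can be evaluated directly. The function below is a small sketch; `theta_norms` and `delta_E` are hypothetical inputs holding $\|\hat{\Theta}_l\|_F$ and $\delta E^l$ for $l = 1, \dots, L$, and the numbers are invented for illustration.

```python
import math

def accumulated_error_bound(theta_norms, delta_E):
    """Right-hand side of (8).

    theta_norms[l-1] is ||Theta_hat_l||_F and delta_E[l-1] is delta E^l,
    for layers l = 1..L (lists are 0-indexed, layers are 1-indexed).
    """
    L = len(delta_E)
    bound = math.sqrt(delta_E[L - 1])          # error of the last layer
    for k in range(1, L):                      # layers k = 1..L-1
        scale = 1.0
        for l in range(k + 1, L + 1):          # layers after layer k
            scale *= theta_norms[l - 1]
        bound += scale * math.sqrt(delta_E[k - 1])
    return bound

# Hypothetical three-layer network.
norms = [3.0, 2.0, 1.5]
errors = [0.01, 0.04, 0.09]
print(accumulated_error_bound(norms, errors))
```

With these numbers each layer happens to contribute the same amount to the bound: the earlier the layer, the smaller its own error but the larger the product of norms that scales it.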

Consider a general case with (6) and (8): the parameter $\Theta_l[q]$ that has the smallest sensitivity in layer $l$ is pruned by the $i$-th pruning operation, and this finally adds $\prod_{k=l+1}^{L}\|\hat{\Theta}_k\|_F \sqrt{\delta E^l}$ to the ultimate network output error. It is worth mentioning that although the layer-wise error seems to be scaled by a quite large product factor when it propagates to the final layer, this scaling is still tractable in practice because the ultimate network output is also scaled by the same product factor compared with the output of layer $l$. For example, the norm of the ultimate network output can be estimated as the norm of the first layer's output multiplied by the same product of Frobenius norms. If one pruning operation in the 1st layer causes the layer-wise error $\sqrt{\delta E^1}$, then the relative ultimate output error is

$$\xi^L_r = \frac{\|\tilde{Y}^L - Y^L\|_F}{\|Y^L\|_F} \approx \frac{\sqrt{\delta E^1}}{\left\|\frac{1}{n} Y^1\right\|_F}.$$

Thus, we can see that even though the product factor may be quite large, the relative ultimate output error would still be about $\sqrt{\delta E^1} / \|\frac{1}{n} Y^1\|_F$, which is controllable in practice, especially when most modern deep networks adopt a maxout layer maxout as the ultimate output. The product factor is called the network gain, representing the ratio of the magnitude of the network output to the magnitude of the network input.

### 3.4 The Proposed Algorithm

#### 3.4.1 Pruning on Fully-Connected Layers

To selectively prune parameters, our approach needs to compute the inverse Hessian matrix at each layer to measure the sensitivities of each parameter of the layer, which is still computationally expensive though tractable. In this section, we present an efficient algorithm that can reduce the size of the Hessian matrix and thus speed up computation on its inverse.

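For each layer $l$, according to the definition of the error function used in Lemma 3.1, the first derivative of the error function with respect to $\hat{\Theta}_l$ is, up to a constant factor, $\frac{1}{n}\sum_{j=1}^{n}\frac{\partial z^l_j}{\partial \Theta_l}(\hat{z}^l_j - z^l_j)$, where $\hat{z}^l_j$ and $z^l_j$ are the $j$-th columns of the matrices $\hat{Z}^l$ and $Z^l$, respectively, and the Hessian matrix is defined as $H_l \equiv \partial^2 E^l / \partial \Theta_l^2$. Note that in most cases $\hat{z}^l_j$ is quite close to $z^l_j$, so we simply ignore the term of the Hessian containing $\hat{z}^l_j - z^l_j$. Even in the late stage of pruning, when this difference is not small, we can still ignore the corresponding term obs . For layer $l$ that has $m_l$ output units, $z^l_j = [z^l_{1j}, \dots, z^l_{m_l j}]$, the Hessian matrix can be calculated via

$$H_l = \frac{1}{n}\sum_{j=1}^{n} H^j_l = \frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{m_l}\frac{\partial z^l_{ij}}{\partial \Theta_l}\left(\frac{\partial z^l_{ij}}{\partial \Theta_l}\right)^\top, \tag{9}$$

where the Hessian matrix for a single instance $j$ at layer $l$, $H^j_l$, is a block diagonal square matrix of size $m_{l-1} m_l \times m_{l-1} m_l$. Specifically, the gradient of the first output unit $z^l_{1j}$ w.r.t. $\Theta_l$ is $\frac{\partial z^l_{1j}}{\partial \Theta_l} = \left[\frac{\partial z^l_{1j}}{\partial w_1}, \dots, \frac{\partial z^l_{1j}}{\partial w_{m_l}}\right]$, where $w_i$ is the $i$-th column of $W_l$. As $z^l_{1j}$ is the layer output before the activation function, its gradient is simple to calculate, and more importantly all output units' gradients are equal to the layer input: $\frac{\partial z^l_{kj}}{\partial w_i} = y^{l-1}_j$ if $k = i$, otherwise $\frac{\partial z^l_{kj}}{\partial w_i} = 0$. An illustrated example is shown in Figure 1, where we ignore the scripts $j$ and $l$ for simplicity in presentation.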

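It can be shown that the diagonal blocks $H^j_{l,ii} \in \mathbb{R}^{m_{l-1} \times m_{l-1}}$ of the block diagonal square matrix $H^j_l$, where $1 \le i \le m_l$, are all equal to $\psi^j_l = y^{l-1}_j (y^{l-1}_j)^\top$, and the inverse Hessian matrix $H_l^{-1}$ is also a block diagonal square matrix with its diagonal blocks being $\left(\frac{1}{n}\sum_{j=1}^{n}\psi^j_l\right)^{-1}$. In addition, normally $\Psi^l = \frac{1}{n}\sum_{j=1}^{n}\psi^j_l$ is degenerate and its pseudo-inverse can be calculated recursively via the Woodbury matrix identity obs :

$$\big(\Psi^l_{j+1}\big)^{-1} = \big(\Psi^l_j\big)^{-1} - \frac{\big(\Psi^l_j\big)^{-1} y^{l-1}_{j+1}\big(y^{l-1}_{j+1}\big)^\top \big(\Psi^l_j\big)^{-1}}{n + \big(y^{l-1}_{j+1}\big)^\top \big(\Psi^l_j\big)^{-1} y^{l-1}_{j+1}},$$

where $\Psi^l_t$ denotes the running estimate after the first $t$ instances, with $(\Psi^l_0)^{-1} = \alpha I$ for a large constant $\alpha$, and $(\Psi^l)^{-1} = (\Psi^l_n)^{-1}$. The size of $\Psi^l$ is then reduced to $m_{l-1} \times m_{l-1}$, and the computational complexity of calculating $H_l^{-1}$ is $O(n\, m_{l-1}^2)$.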
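The recursion above is a sequence of rank-one inverse updates, one per training instance. The sketch below implements it in plain Python under two stated assumptions: the inverse is initialised as `alpha * I` with a large `alpha` (the value `1e6` is an arbitrary choice for this example), and `inputs` is a hypothetical list of layer-input vectors $y^{l-1}_j$.

```python
def woodbury_inverse(inputs, alpha=1e6):
    """Recursively estimate the inverse of (1/n) * sum_j y_j y_j^T.

    Starts from inv = alpha * I (assumed initialisation) and applies one
    rank-one update per instance; each update costs O(m^2) operations.
    """
    n = len(inputs)
    m = len(inputs[0])
    inv = [[alpha if r == c else 0.0 for c in range(m)] for r in range(m)]
    for y in inputs:
        # inv is symmetric, so (inv y)(inv y)^T equals inv y y^T inv.
        inv_y = [sum(inv[r][c] * y[c] for c in range(m)) for r in range(m)]
        denom = n + sum(y[r] * inv_y[r] for r in range(m))
        for r in range(m):
            for c in range(m):
                inv[r][c] -= inv_y[r] * inv_y[c] / denom
    return inv

# Hypothetical layer inputs: three instances with two features each.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
psi_inv = woodbury_inverse(inputs)
print(psi_inv)
```

For these inputs the averaged outer product is $\frac{1}{3}\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}$, whose exact inverse is $\begin{bmatrix}2 & -1\\ -1 & 2\end{bmatrix}$; the recursive estimate agrees with it up to a small offset introduced by the `alpha * I` initialisation.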

To make the estimated minimal change of the error function in (6) optimal, the layer-wise Hessian matrices need to be exact. Since the layer-wise Hessian matrices only depend on the corresponding layer inputs, they remain exact even after several pruning operations. The only parameter we need to control is the layer-wise error $\varepsilon^l$. Note that there may be a "pruning inflection point" after which the layer-wise error would drop dramatically. In practice, the user can incrementally increase the number of pruned parameters based on the sensitivity $L_q$, and make a trade-off between the pruning ratio and the performance drop to set a proper tolerable error threshold or pruning ratio.

The procedure of our pruning algorithm for a fully-connected layer is summarized as follows.

1. Get the layer input $y^{l-1}$ from a well-trained deep network.

2. Calculate the Hessian matrix $H_{l,ii}$, for $i = 1, \dots, m_l$, and its pseudo-inverse over the dataset, and get the whole pseudo-inverse of the Hessian matrix.

3. Compute the optimal parameter change $\delta\Theta_l$ and the sensitivity $L_q$ for each parameter at layer $l$. Set the tolerable error threshold $\epsilon$.

4. Pick the parameters $\Theta_l[q]$ with the smallest sensitivity scores.

5. If $\sqrt{L_q} \le \epsilon$, prune the parameters $\Theta_l[q]$ and get the new parameter values via $\hat{\Theta}_l = \Theta_l + \delta\Theta_l$, then repeat Step 4; otherwise stop pruning.
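The steps above can be sketched for a single fully-connected layer as follows. This is an illustrative simplification rather than the authors' implementation: `prune_layer`, `W`, `psi_inv` and `eps` are hypothetical names, one parameter is pruned per iteration, the shared inverse block `psi_inv` is assumed to be precomputed, and already pruned entries are forced back to zero after each update (an assumption made here to keep the sketch short).

```python
import math

def prune_layer(W, psi_inv, eps):
    """Prune one fully-connected layer by sensitivity.

    W[p][i] is the weight from input p to output unit i, psi_inv is the
    shared diagonal block of the inverse Hessian, eps is the tolerable
    error threshold. Returns the pruned weights and the pruned positions.
    """
    m_in, m_out = len(W), len(W[0])
    W = [row[:] for row in W]            # do not modify the caller's weights
    pruned = set()
    while len(pruned) < m_in * m_out:
        # Steps 3-4: sensitivity of each remaining parameter, keep the smallest.
        best, best_L = None, None
        for p in range(m_in):
            for i in range(m_out):
                if (p, i) in pruned:
                    continue
                L = W[p][i] ** 2 / (2.0 * psi_inv[p][p])
                if best_L is None or L < best_L:
                    best, best_L = (p, i), L
        # Step 5: stop once the estimated error exceeds the threshold.
        if math.sqrt(best_L) > eps:
            break
        p, i = best
        scale = -W[p][i] / psi_inv[p][p]
        for r in range(m_in):            # only column i changes (block structure)
            W[r][i] += scale * psi_inv[r][p]
        pruned.add(best)
        for r, c in pruned:              # simplification: keep pruned entries at zero
            W[r][c] = 0.0
    return W, pruned

# Hypothetical 2-input, 2-output layer.
W = [[1.0, 0.05], [0.2, 0.8]]
psi_inv = [[2.0, -1.0], [-1.0, 2.0]]
W_hat, pruned = prune_layer(W, psi_inv, eps=0.2)
print(W_hat, sorted(pruned))
```

Because the inverse Hessian is block diagonal with identical blocks, pruning a weight of output unit `i` only changes the other weights of that same unit, which is why the update loop touches a single column of `W`.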

#### 3.4.2 Pruning on Convolutional Layers

It is straightforward to generalize our method to a convolutional layer and its variants if we vectorize the filters of each channel and consider them as special fully-connected layers that have multiple inputs (patches) from a single instance. Consider a vectorized filter $w_i$ of channel $i$, $1 \le i \le m_l$: it acts similarly to the parameters that are connected to the same output unit in a fully-connected layer. The difference is that for a single input instance $j$, every step of the sliding window across it extracts a patch from the input volume. Similarly, each pixel in the 2-dimensional activation map that gives the response to each patch corresponds to one output unit in a fully-connected layer. Hence, for convolutional layers, (9) is generalized by additionally summing over all the patches extracted from each instance, where the Hessian matrix for each patch is a block diagonal square matrix whose diagonal blocks are all the same. Then, we can slightly revise the computation of the Hessian matrix, and extend the algorithm for fully-connected layers to convolutional layers.
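The patch view of a convolutional layer can be illustrated with a short sketch. It assumes a single input channel, stride 1 and no padding, and it normalises by the number of instances; these choices and the function names are assumptions made for this example only.

```python
def extract_patches(image, k):
    """Vectorised k x k patches of a 2-D image (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            patches.append([image[r + dr][c + dc]
                            for dr in range(k) for dc in range(k)])
    return patches

def conv_hessian_block(images, k):
    """Shared diagonal block: outer products of all patches, divided by n."""
    n = len(images)
    size = k * k
    block = [[0.0] * size for _ in range(size)]
    for image in images:
        for patch in extract_patches(image, k):
            for a in range(size):
                for b in range(size):
                    block[a][b] += patch[a] * patch[b] / n
    return block

# Hypothetical 3x3 single-channel input and a 2x2 filter.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
patches = extract_patches(image, 2)
block = conv_hessian_block([image], 2)
print(patches)
```

Each patch plays the role that the layer input $y^{l-1}_j$ plays in a fully-connected layer, so a single instance contributes one outer product per sliding-window position instead of just one.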

Note that the accumulated error of the ultimate network output can be linearly bounded by the layer-wise errors as long as the model is feed-forward. Thus, L-OBS is a general pruning method that is compatible with most feed-forward neural networks whose layer-wise Hessian can be computed conveniently with slight modifications. However, if models have very large layers, as in ResNet-101, L-OBS may not be economical because of the computational cost of the Hessian, which will be studied in our future work.

## 4 Experiments

In this section, we verify the effectiveness of our proposed Layer-wise OBS (L-OBS) using various architectures of deep neural networks in terms of compression ratio (CR), error rate before retraining, and the number of iterations required for retraining to resume satisfactory performance. CR is defined as the ratio of the number of preserved parameters to that of original parameters; lower is better. We compare L-OBS with the following pruning approaches: 1) random pruning, 2) OBD obd , 3) LWC hannips , 4) DNS dynamic , and 5) Net-Trim nettrim . The deep architectures used for experiments include: LeNet-300-100 lenet5 and LeNet-5 lenet5 on the MNIST dataset, CIFAR-Net cifarnet (a revised AlexNet for CIFAR-10 containing three convolutional layers and two fully connected layers) on the CIFAR-10 dataset, and AlexNet alexnet and VGG-16 vgg16 on the ImageNet ILSVRC-2012 dataset. For the experiments, we first train the networks well, and then apply the various pruning approaches to evaluate their performance. The retraining batch size, crop method and other hyper-parameters follow the same setting as used in LWC. Note that to make comparisons fair, we do not adopt any other pruning-related methods like Dropout or sparse regularizers on MNIST. In practice, L-OBS can work well along with these techniques, as shown on CIFAR-10 and ImageNet.

### 4.1 Overall Comparison Results

The overall comparison results are shown in Table 1. In the first set of experiments, we prune each layer of the well-trained LeNet-300-100 with compression ratios of 6.7%, 20% and 65%, achieving a slightly better overall compression ratio (7%) than LWC (8%). Under a comparable compression ratio, L-OBS suffers a much smaller drop of performance (before retraining) and needs lighter retraining than LWC, whose performance is almost ruined by pruning. The classic pruning approach OBD is also compared, though we observe that the Hessian matrices of most modern deep models are strongly non-diagonal in practice. Besides the relatively heavy cost of obtaining the second derivatives via the chain rule, OBD suffers from a drastic drop of performance when it is directly applied to modern deep models.

To properly prune each layer of LeNet-5, we increase the tolerable error threshold $\epsilon$ from a relatively small initial value to incrementally prune more parameters, monitor the model performance, and stop pruning and fix $\epsilon$ once we encounter the "pruning inflection point" mentioned in Section 3.4. In practice, we prune each layer of LeNet-5 with compression ratios of 54%, 43%, 6% and 25%, and retrain the pruned model with much fewer iterations compared with other methods. As DNS retrains the pruned network after every pruning operation, we are not able to report the error rate of its pruned network before retraining. However, as can be seen, similar to LWC, the total number of iterations used by DNS for re-boosting the network is very large compared with L-OBS. Results on the retraining iterations of DNS are taken from dynamic and the other experiments are implemented based on TensorFlow tensorflow . In addition, when a high pruning ratio is required, L-OBS can be flexibly adapted to an iterative version, which performs pruning and light retraining alternately to obtain a higher pruning ratio at a relatively higher pruning cost. With two iterations of pruning and retraining, L-OBS is able to achieve the same pruning ratio as DNS with much lighter total retraining: 643 iterations on LeNet-300-100 and 841 iterations on LeNet-5.

Regarding the comparison experiments on CIFAR-Net, we first train it to achieve a testing error of 18.57% with Dropout and Batch-Normalization. We then prune the well-trained network with LWC and L-OBS, and get results similar to those on the other network architectures. We also observe that LWC and other retraining-required methods always require a much smaller learning rate in retraining. This is because the representation capability of the pruned networks, which have far fewer parameters, is damaged during pruning, on the principle that the number of parameters is an important factor for representation capability. In contrast, L-OBS can still adopt the original learning rate to retrain the pruned networks. In this sense, L-OBS not only ensures a warm start for retraining, but also finds the important connections (parameters) and preserves the representation capability of the pruned network instead of ruining the model with pruning.

Regarding AlexNet, L-OBS achieves an overall compression ratio of 11% without loss of accuracy, taking 2.9 hours on 48 Intel Xeon(R) CPU E5-1650 to compute the Hessians and 3.1 hours on an NVIDIA Titan X GPU to retrain the pruned model (i.e., 18.1K iterations). The computation cost of the Hessian inverse in L-OBS is negligible compared with that of the heavy retraining in other methods. This claim is also supported by an analysis of time complexity. As mentioned in Section 3.4, the time complexity of calculating $H_l^{-1}$ is $O(n\, m_{l-1}^2)$. Assume that neural networks are retrained via SGD; then the approximate time complexity of retraining is $O(I d M)$, where $d$ is the size of the mini-batch, and $M$ and $I$ are the total numbers of parameters and iterations, respectively. Considering that $M = \sum_l m_{l-1} m_l$ is comparable to $\sum_l m_{l-1}^2$, and that retraining in other methods always requires millions of iterations ($I d \gg n$) as shown in the experiments, the cost of calculating the Hessian (inverse) in L-OBS is quite economical. More interestingly, there is a trade-off between compression ratio and pruning (including retraining) cost. Compared with other methods, L-OBS is able to provide fast compression: it prunes AlexNet to 16% of its original size without substantively impacting accuracy (pruned top-5 error 20.98%) even without any retraining. We further apply L-OBS to VGG-16, which has 138M parameters. To achieve a more promising compression ratio, we perform pruning and retraining alternately, twice. As can be seen from the table, L-OBS achieves an overall compression ratio of 7.5% without loss of accuracy, taking 10.2 hours in total on 48 Intel Xeon(R) CPU E5-1650 to compute the Hessian inverses and 86.3K iterations to retrain the pruned model.

We also apply L-OBS to ResNet-50 he2016deep . To the best of our knowledge, this is the first work to perform pruning on ResNet. We perform pruning on all the layers: all layers share the same compression ratio, and we change this compression ratio in each experiment. The results are shown in Figure 2(a). As we can see, L-OBS is able to maintain ResNet's accuracy (above 85%) when the compression ratio is larger than or equal to 45%.

### 4.2 Comparison between L-OBS and Net-Trim

As our proposed L-OBS is inspired by Net-Trim, which adopts the $\ell_1$-norm to induce sparsity, we conduct comparison experiments between these two methods. In Net-Trim, networks are pruned by formulating layer-wise pruning as an optimization problem: $\min_{W_l} \|W_l\|_1$ s.t. $\|\sigma(W_l^\top Y^{l-1}) - Y^l\|_F \le \xi^l$, where the tolerance $\xi^l$ plays the role of the layer-wise error bound in L-OBS. Due to the memory limitation of Net-Trim, we only prune the middle layer of LeNet-300-100 with L-OBS and Net-Trim under the same setting. As shown in Table 2, under the same pruned error rate, the CR of L-OBS outperforms that of Net-Trim by about six times. In addition, Net-Trim encounters an explosion of memory and time on large-scale datasets and large-size parameters. Specifically, the positive semidefinite matrix in the quadratic constraints used by Net-Trim for optimization grows with both the number of samples and the layer size. For example, it requires about 65.7Gb for 1,000 samples on MNIST, as illustrated in Figure 2(b). Moreover, Net-Trim is designed for multi-layer perceptrons and it is not clear how to deploy it on convolutional layers.

## 5 Conclusion

We have proposed a novel L-OBS pruning framework that prunes parameters based on second order derivative information of the layer-wise error function, and provided a theoretical guarantee on the overall error in terms of the reconstructed errors for each layer. Our proposed L-OBS can prune a considerable number of parameters with only a tiny drop of performance, and reduce or even omit retraining. More importantly, compared with previous methods, it identifies and preserves the truly important parts of networks when pruning, which may help to understand the nature of neural networks.

## Acknowledgements

This work is supported by NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020, Singapore MOE AcRF Tier-2 grant MOE2016-T2-2-060, and Singapore MOE AcRF Tier-1 grant 2016-T1-001-159.

## Proof of Theorem 3.2

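We prove Theorem 3.2 via induction. First, for $l = 1$, (8) holds as a special case of (2). Then suppose that Theorem 3.2 holds up to layer $l$:

$$\tilde{\varepsilon}^{l} \le \sum_{h=1}^{l-1}\left(\prod_{k=h+1}^{l}\|\hat{\Theta}_{k}\|_F \sqrt{\delta E^{h}}\right) + \sqrt{\delta E^{l}} \tag{10}$$

In order to show that (10) holds for layer $l+1$ as well, we refer to $\hat{Y}^{l+1} = \sigma(\hat{W}_{l+1}^\top Y^{l})$ as the "layer-wise pruned output", where the input $Y^{l}$ is fixed to be the same as in the originally well-trained network rather than the accumulated input $\tilde{Y}^{l}$, and have the following theorem.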

###### Theorem 5.1.

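Consider layer $l+1$ in a pruned deep network. The difference between its accumulated pruned output, $\tilde{Y}^{l+1}$, and its layer-wise pruned output, $\hat{Y}^{l+1}$, is bounded by:

$$\|\tilde{Y}^{l+1} - \hat{Y}^{l+1}\|_F \le \sqrt{n}\,\|\hat{\Theta}_{l+1}\|_F\,\tilde{\varepsilon}^{l}. \tag{11}$$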

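Proof sketch: Consider one arbitrary element of the layer-wise pruned output $\hat{Y}^{l+1}$:

$$\hat{y}^{l+1}_{ij} = \sigma\!\left(\hat{w}_i^\top \tilde{y}^{l}_{j} + \hat{w}_i^\top (y^{l}_{j} - \tilde{y}^{l}_{j})\right) \le \tilde{y}^{l+1}_{ij} + \sigma\!\left(\hat{w}_i^\top (y^{l}_{j} - \tilde{y}^{l}_{j})\right) \le \tilde{y}^{l+1}_{ij} + \left|\hat{w}_i^\top (y^{l}_{j} - \tilde{y}^{l}_{j})\right|,$$

where $\hat{w}_i$ is the $i$-th column of $\hat{W}_{l+1}$. The first inequality is obtained because we suppose the activation function $\sigma(\cdot)$ is ReLU, which satisfies $\sigma(a+b) \le \sigma(a) + \sigma(b)$; the second holds because $\sigma(x) \le |x|$. Similarly, for the accumulated pruned output it holds that:

$$\tilde{y}^{l+1}_{ij} \le \hat{y}^{l+1}_{ij} + \left|\hat{w}_i^\top (y^{l}_{j} - \tilde{y}^{l}_{j})\right|.$$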

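By combining the above two inequalities, we have

$$\left|\tilde{y}^{l+1}_{ij} - \hat{y}^{l+1}_{ij}\right| \le \left|\hat{w}_i^\top (y^{l}_{j} - \tilde{y}^{l}_{j})\right|,$$

and thus the following inequality holds in matrix form:

$$\|\tilde{Y}^{l+1} - \hat{Y}^{l+1}\|_F \le \|\hat{W}_{l+1}^\top (Y^{l} - \tilde{Y}^{l})\|_F \le \|\hat{\Theta}_{l+1}\|_F\,\|Y^{l} - \tilde{Y}^{l}\|_F.$$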

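As $\tilde{\varepsilon}^{l}$ is defined as $\tilde{\varepsilon}^{l} = \frac{1}{\sqrt{n}}\|\tilde{Y}^{l} - Y^{l}\|_F$, we have

$$\|\tilde{Y}^{l+1} - \hat{Y}^{l+1}\|_F \le \sqrt{n}\,\|\hat{\Theta}_{l+1}\|_F\,\tilde{\varepsilon}^{l}.$$

This completes the proof of Theorem 5.1.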
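As a numerical sanity check of inequality (11), the following minimal sketch builds one random ReLU layer and compares both sides of the bound. The layer sizes, the noise level used to simulate the accumulated input, and the function name `check_theorem_5_1` are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def check_theorem_5_1(m_in=20, m_out=15, n=50, noise=0.1, seed=0):
    """Return (lhs, rhs) of inequality (11) for one random ReLU layer.

    All sizes and the noise model are hypothetical; they only serve to
    produce an original input Y^l and a perturbed (accumulated) input.
    """
    rng = np.random.default_rng(seed)
    W_hat = rng.normal(size=(m_in, m_out))   # pruned weights; columns are w_i
    Y = relu(rng.normal(size=(m_in, n)))     # original layer-l output Y^l
    Y_tilde = relu(Y + noise * rng.normal(size=Y.shape))  # accumulated input
    Y_hat_next = relu(W_hat.T @ Y)           # layer-wise pruned output
    Y_tilde_next = relu(W_hat.T @ Y_tilde)   # accumulated pruned output
    # eps_tilde^l = ||Y_tilde^l - Y^l||_F / sqrt(n)
    eps_tilde = np.linalg.norm(Y_tilde - Y) / np.sqrt(n)
    lhs = np.linalg.norm(Y_tilde_next - Y_hat_next)
    # ||Theta_hat||_F equals ||W_hat||_F because Theta is W vectorized
    rhs = np.sqrt(n) * np.linalg.norm(W_hat) * eps_tilde
    return lhs, rhs

lhs, rhs = check_theorem_5_1()
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}, bound holds: {lhs <= rhs}")
```

In this kind of random setting the left-hand side is typically far below the right-hand side, since the Frobenius-norm product bound is loose.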

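By using (2), (11) and the triangle inequality, we are now able to extend (10) to layer $l+1$:

$$\tilde{\varepsilon}^{l+1} = \frac{1}{\sqrt{n}}\|\tilde{Y}^{l+1} - Y^{l+1}\|_F \le \frac{1}{\sqrt{n}}\|\tilde{Y}^{l+1} - \hat{Y}^{l+1}\|_F + \frac{1}{\sqrt{n}}\|\hat{Y}^{l+1} - Y^{l+1}\|_F \le \sum_{h=1}^{l}\left(\prod_{k=h+1}^{l+1}\|\hat{\Theta}_{k}\|_F \cdot \sqrt{\delta E^{h}}\right) + \sqrt{\delta E^{l+1}}.$$

Hence (10) holds for all layers, and Theorem 3.2 is the special case in which $l$ is the final layer.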
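The induction step amounts to the recursion "bound for layer $l+1$ = $\|\hat{\Theta}_{l+1}\|_F$ times the bound for layer $l$, plus $\sqrt{\delta E^{l+1}}$". The sketch below evaluates the right-hand side of (10) both in closed form and through this recursion. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
import math

def accumulated_error_bound(theta_norms, delta_E):
    """Closed-form right-hand side of (10) for the last layer in the lists.

    theta_norms[k] holds ||Theta_hat_{k+1}||_F and delta_E[h] holds
    delta E^{h+1} (0-based Python indices for 1-based layer indices).
    """
    l = len(delta_E)
    total = math.sqrt(delta_E[l - 1])
    for h in range(l - 1):
        prod = 1.0
        for k in range(h + 1, l):
            prod *= theta_norms[k]
        total += prod * math.sqrt(delta_E[h])
    return total

def accumulated_error_bound_recursive(theta_norms, delta_E):
    """Same bound, built layer by layer as in the induction step."""
    bound = 0.0
    for norm, dE in zip(theta_norms, delta_E):
        bound = norm * bound + math.sqrt(dE)
    return bound

# Hypothetical three-layer example.
norms = [2.0, 3.0, 1.5]
errors = [0.01, 0.04, 0.09]
print(accumulated_error_bound(norms, errors))
print(accumulated_error_bound_recursive(norms, errors))
```

Both functions should agree up to floating-point rounding, which mirrors the algebra used to pass from layer $l$ to layer $l+1$.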

## Extensive Experiments and Details

### Redundancy of Networks

LeNet-300-100 is a classical feed-forward network, which has three fully connected layers, with 267K learnable parameters. LeNet-5 is a convolutional neural network that has two convolutional layers and two fully connected layers, with 431K learnable parameters. CIFAR-Net is a revised AlexNet for CIFAR-10 containing three convolutional layers and two fully connected layers.

We first validate the redundancy of networks, and the ability of our proposed Layer-wise OBS to find the parameters with the smallest sensitivity scores, using LeNet-300-100 on MNIST. In all cases, we first obtain a well-trained network without dropout or regularization terms. Then, we use four pruning criteria, namely Random, LWC [9], ApoZW, and Layer-wise OBS, to prune parameters, and we evaluate the performance of the whole network after every 100 pruning operations. Here, LWC is a magnitude-based criterion proposed in [9], which prunes the parameters with the smallest absolute values. ApoZW is a revised version of ApoZ [16] that measures the importance of each parameter in a layer from the parameter together with its inputs, so that both the magnitude of the parameter and its inputs are taken into consideration.

The original well-trained LeNet-300-100 model achieves a 1.8% error rate on MNIST without dropout. Each of the four pruning criteria is applied to the first layer of the well-trained model, which has 235K parameters, while the parameters of the other two layers are kept fixed. The test accuracy of the whole network is recorded every 100 pruning operations without any retraining. Overall comparison results are summarized in Figure 2.
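The evaluation protocol above can be sketched for the two simplest criteria, Random and magnitude-based (LWC-style) pruning. This is a minimal illustration on a toy weight matrix, not the authors' code: the function names are hypothetical, ApoZW and Layer-wise OBS are not reproduced, and the evaluation of network accuracy at each checkpoint is left out.

```python
import numpy as np

def pruning_order(weights, criterion, rng=None):
    """Return flat (C-order) indices in the order they would be pruned."""
    flat = weights.ravel()
    if criterion == "magnitude":
        # Smallest absolute values are pruned first.
        return np.argsort(np.abs(flat), kind="stable")
    if criterion == "random":
        if rng is None:
            rng = np.random.default_rng(0)
        return rng.permutation(flat.size)
    raise ValueError(f"unknown criterion: {criterion}")

def prune_in_steps(weights, order, step=100):
    """Yield (num_pruned, pruned_copy) after every `step` pruning operations."""
    pruned = np.array(weights, dtype=float, order="C")  # fresh contiguous copy
    flat = pruned.reshape(-1)                           # view onto `pruned`
    for start in range(0, order.size, step):
        flat[order[start:start + step]] = 0.0
        yield min(start + step, order.size), pruned.copy()

# Toy layer; a real run would use the first layer of LeNet-300-100 and
# evaluate test accuracy of the whole network at each checkpoint.
W = np.array([[0.5, -2.0, 0.1], [3.0, -0.2, 1.0]])
for n_pruned, W_pruned in prune_in_steps(W, pruning_order(W, "magnitude"), step=2):
    print(n_pruned, np.count_nonzero(W_pruned))
```

The generator leaves the input weights untouched, so the same well-trained layer can be reused to compare several criteria.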

We also visualize the distribution of the parameters' sensitivity scores estimated by Layer-wise OBS in Figure 3, and find that parameters with little impact on the layer output dominate. This further verifies our hypothesis that deep neural networks usually contain many redundant parameters. As shown in the figure, the distribution of the parameters' sensitivity scores in Layer-wise OBS is heavy-tailed. This means that many parameters can be pruned with only a minor impact on the prediction outcome. Random pruning gives the poorest result, as expected, but can still preserve prediction accuracy when the pruning ratio is smaller than 30%. This also indicates the high redundancy of the network.

Compared with LWC and ApoZW, L-OBS is able to preserve the original accuracy until the pruning ratio reaches about 96%, which we call the "pruning inflection point". As mentioned in Section 3.4, the reason for this "pruning inflection point" is that the distribution of the parameters' sensitivity scores is heavy-tailed, so the sensitivity scores beyond the "pruning inflection point" become considerable all at once. The percentage of parameters with sensitivity smaller than 0.001 is about 92%, which matches well with the pruning ratio at the inflection point.

L-OBS not only preserves a model's performance when pruning one single layer, but also ensures only a tiny drop in performance when pruning all layers of a model. This claim holds because of the theoretical guarantee in Section 3.3 on the overall prediction performance of the pruned deep neural network in terms of the reconstructed errors of each layer. As shown in Figure 4, L-OBS is able to resume the original performance after 740 iterations for LeNet-5 with a compression ratio of 7%.

### How To Set Tolerable Error Threshold

One of the most important results we proved is a theoretical guarantee on the overall prediction performance of the pruned deep neural network in terms of the reconstructed errors of each pruning operation in each layer. This bound enables us to prune a whole model layer by layer without concern, because the accumulated error of the ultimate network output is bounded by a weighted sum of the layer-wise errors. As long as we control the layer-wise errors, we can control the accumulated error.

Although L-OBS allows users to control the accumulated error of the ultimate network output, this error measures the difference between the network outputs before and after pruning, and it is not strictly inversely proportional to the final accuracy. In practice, one can increase the tolerable error threshold from a relatively small initial value to incrementally prune more and more parameters while monitoring model performance, and then make a trade-off between compression ratio and performance drop. The corresponding relation (in the first layer of LeNet-300-100) between the tolerable error threshold and the pruning ratio is shown in Figure 5.
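The threshold sweep described above can be illustrated with a deliberately simplified sketch. It assumes that the layer-wise error after pruning a set of parameters can be approximated by the running sum of their individual sensitivity scores, which ignores the updates L-OBS applies to the remaining parameters after each pruning operation. The synthetic heavy-tailed scores, the threshold values, and the function name are hypothetical.

```python
import numpy as np

def pruning_ratio_for_threshold(scores, threshold):
    """Fraction of parameters pruned before the summed scores exceed `threshold`.

    Simplifying assumption: the layer-wise error is approximated by the
    cumulative sum of non-negative per-parameter sensitivity scores,
    pruned in ascending order.
    """
    cumulative = np.cumsum(np.sort(scores))
    # Number of leading entries whose running sum is <= threshold.
    n_pruned = int(np.searchsorted(cumulative, threshold, side="right"))
    return n_pruned / scores.size

# Synthetic heavy-tailed scores standing in for real sensitivity scores.
rng = np.random.default_rng(0)
scores = rng.lognormal(mean=-8.0, sigma=3.0, size=10_000)

# Raise the tolerable error threshold step by step and watch the ratio grow.
for threshold in [1e-3, 1e-2, 1e-1, 1.0]:
    ratio = pruning_ratio_for_threshold(scores, threshold)
    print(f"threshold = {threshold:g}: pruning ratio = {ratio:.3f}")
```

In an actual run, each threshold would be followed by an evaluation of the pruned network so that the compression ratio can be traded off against the observed accuracy drop.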

### Iterative Layer-wise OBS

As mentioned in Section 4.1, to achieve a better compression ratio, L-OBS can be flexibly adapted to an iterative version, which performs pruning and light retraining alternately. Specifically, the two-stage iterative L-OBS applied to LeNet-300-100, LeNet-5 and VGG-16 in this work follows this workflow: pre-train a well-trained model → prune the model → retrain the model to recover performance to some degree → prune again → lightly retrain the model. In practice, if the required compression ratio is beyond the "pruning inflection point", users have to deploy iterative L-OBS, even though the ultimate compression ratio is not of great importance. Experimental results are shown in Tables 3, 4 and 5, where CR($n$) means the ratio of the number of preserved parameters to the number of original parameters after the $n$-th pruning.
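The alternation of pruning and retraining, together with the bookkeeping of CR($n$), can be sketched as follows. The pruning rule used here (drop the smaller half of the remaining weights by magnitude) and the identity "retraining" step are placeholders chosen for illustration; they stand in for the L-OBS pruning step and the light retraining of the paper, and all function names are hypothetical.

```python
def prune_half_of_remaining(weights, stage):
    """Placeholder pruning step: zero the smaller half of the surviving weights."""
    alive = sorted(
        (i for i, w in enumerate(weights) if w != 0.0),
        key=lambda i: abs(weights[i]),
    )
    to_prune = set(alive[: len(alive) // 2])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]

def iterative_pruning(weights, prune_fn, retrain_fn, stages=2):
    """Alternate prune and retrain; return final weights and [CR(1), ..., CR(stages)]."""
    crs = []
    for n in range(1, stages + 1):
        weights = prune_fn(weights, n)
        # A real retraining step must keep pruned weights fixed at zero.
        weights = retrain_fn(weights)
        preserved = sum(1 for w in weights if w != 0.0)
        crs.append(preserved / len(weights))  # CR(n): preserved / original
    return weights, crs

w0 = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]
final_w, crs = iterative_pruning(w0, prune_half_of_remaining, lambda w: w, stages=2)
print(crs)
```

Passing the pruning and retraining steps as callables keeps the loop independent of the particular criterion, so the same skeleton could wrap any layer-wise pruning routine.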