# Revisiting hard thresholding for DNN pruning

The most common method for DNN pruning is hard thresholding of network weights, followed by retraining to recover any lost accuracy. Recently developed smart pruning algorithms use the DNN response over the training set, for a variety of cost functions, to determine redundant network weights, leading to less accuracy degradation and possibly less retraining time. In experiments on total pruning time (pruning time + retraining time), we show that hard thresholding followed by retraining remains the most efficient way of reducing the number of network parameters. However, smart pruning algorithms still have advantages when retraining is not possible. In this context we propose a novel smart pruning algorithm based on difference of convex functions optimisation and show that it is often orders of magnitude faster than competing approaches while achieving the lowest classification accuracy degradation. Furthermore, we investigate theoretically the effect of hard thresholding on DNN accuracy. We show that accuracy degradation increases with remaining network depth from the pruned layer. We also discover a link between the latent dimensionality of the training data manifold and network robustness to hard thresholding.

## 1 Introduction

Deep neural networks have achieved state-of-the-art results in a number of machine learning tasks [LeCun et al., 2015]. Training such networks is computationally intensive and often requires dedicated and expensive hardware. Furthermore, the resulting networks often require a considerable amount of memory to be stored. Using a Pascal Titan X GPU the popular AlexNet and VGG-16 models require 13 hours and 7 days, respectively, to train, while requiring 200MB and 600MB, respectively, to store. The large memory requirements limit the use of DNNs in embedded systems and portable devices such as smartphones, which are now ubiquitous.

A number of approaches have been proposed to reduce the DNN size during training time, often with little or no degradation to classification performance. Approaches include introducing Bayesian sparsity-inducing priors [Louizos et al., 2017], [Blundell et al., 2015], [Molchanov et al., 2017], [Dai et al., 2018] and binarization [Hou et al., 2016] [Courbariaux et al., 2016]. Other methods include the hashing trick used in [Chen et al., 2015], tensorisation [Novikov et al., 2015] and efficient matrix factorisations [Yang et al., 2015].

There has also been work on reducing the number of network parameters after training. Early work in this direction includes [Kim et al., 2015] [Han et al., 2015a] [Han et al., 2015b], where hard thresholding is applied to network weights. However, weight magnitude is a mediocre predictor of whether a weight is essential to network accuracy, and network accuracy often falls significantly after hard thresholding. Therefore hard thresholding often has to be followed by a retraining procedure. Ideally one would want to remove parameters by minimizing the reduction in network accuracy over the training set, but this is computationally prohibitive. Smart pruning algorithms aim to overcome this problem by minimizing a surrogate cost function while pruning. In [Aghasi et al., 2016] [Aghasi et al., 2018] the authors propose a convexified layerwise pruning algorithm termed Net-Trim, which minimizes the norm between pruned and unpruned layer output representations. The authors in [Dong et al., 2017] propose LOBS, an algorithm for layerwise pruning that approximates the loss function using a second order Taylor expansion. In [Baykal et al., 2018] the authors use a coreset approach to prune individual layers and enforce a norm penalty between pruned and unpruned layer output representations with high probability.

Given that DNNs represent a highly non-linear function, it is also of independent interest to explore theoretically the effect of pruning on DNN accuracy. Hard thresholding a neural network layer introduces a perturbation to the latent signal representations generated by that layer. As the perturbed signal passes through layers of non-linear projections, the perturbation could become arbitrarily large. DNN robustness to hidden layer perturbations has been investigated for random noise in [Raghu et al., 2016] and for the case of pruning in [Aghasi et al., 2016] and [Dong et al., 2017]. In the last two works the authors conduct a theoretical analysis using the Lipschitz properties of DNNs, showing the stability of the latent representations over the training set after pruning, under an assumption on the weights at a given layer. The Lipschitz properties of DNNs have also been used to analyze their generalization error (GE) [Sokolic et al., 2017] [Bartlett et al., 2017] [Neyshabur et al., 2017]. Very close to our work, [Bartlett et al., 2017] show that the difficulty of a training set can be captured by normalizing the classification margins of training samples with a "spectral complexity" term that includes the spectral norms of the layer weights. Randomly labelled data, as opposed to real labelled data, produce networks with much smaller normalized margins, reflecting the difficulty of the randomly labelled dataset.

For the practical pruning experiments, both with and without retraining, we focus on pruning only the fully connected non-linear layers of a network. Two of the algorithms compared are only applicable to fully connected layers. Furthermore, fully connected layers often contain a large fraction of the network parameters and are the most redundant layers in most architectures. Finally, the softmax linear layer can be readily compressed using an SVD [Yang et al., 2015].

### 1.1 Contributions

In this work we make the following contributions

• We perform pruning and retraining experiments for various datasets and architectures and discover that hard thresholding is the most efficient pruning pipeline often taking just a fraction of the total pruning and retraining time of other approaches. For the case without retraining smart pruning approaches lead to significant accuracy gains for the same level of sparsity, especially for DNNs with only fully connected layers.

• We propose a novel smart pruning algorithm called FeTa that can be cast as a difference of convex functions problem, and has an efficient solution. For a fully connected layer with input dimension , output dimension and training samples, the time complexity of our iterative algorithm scales like , where is the precision of the solution, is related to the Lipschitz and strong convexity constants, and is the outer iteration number. Competing approaches scale like and and we conduct experiments showing that our algorithm leads to higher accuracy in the pruned DNN and is often orders of magnitude faster.

• We build upon the work of [Sokolic et al., 2017] to bound the GE of a DNN for the case of hard thresholding of hidden layer weights. In contrast to the analysis of [Aghasi et al., 2016] and [Dong et al., 2017] our analysis correctly predicts that for the same level of sparsity, accuracy degrades with the remaining depth of the thresholded layer.

• Our theoretical analysis also predicts that a DNN trained on a data manifold with lower intrinsic dimensionality will be more robust to hard thresholding. We provide empirical evidence that validates this prediction.

### 1.2 Notation and Definitions

We use the following notation in the sequel: matrices, column vectors, scalars and sets are denoted by boldface upper-case letters ($\mathbf{X}$), boldface lower-case letters ($\mathbf{x}$), italic letters ($x$) and calligraphic upper-case letters ($\mathcal{X}$), respectively. The covering number of $\mathcal{X}$ with $d$-metric balls of radius $\rho$ is denoted by $\mathcal{N}(\mathcal{X}; d, \rho)$. A $C_M$-regular $k$-dimensional manifold, where $C_M$ is a constant that captures "intrinsic" properties, is one that has a covering number $\mathcal{N}(\mathcal{X}; d, \rho) = (C_M/\rho)^k$. We define $\|\cdot\|_0$ as the number of non-zeros of a vector or matrix, $\rho(\cdot)$ as the rectifier non-linearity, $I(\cdot)$ as the elementwise indicator function, and $\odot$ as the Hadamard product.

## 2 Retraining Experiments

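We consider a classification problem, where we observe a vector $x \in \mathcal{X}$ that has a corresponding class label $y \in \mathcal{Y}$. The set $\mathcal{X}$ is called the input space, $\mathcal{Y}$ is called the label space and $N_Y$ denotes the number of classes. The sample space is denoted by $\mathcal{S} = \mathcal{X} \times \mathcal{Y}$ and an element of $\mathcal{S}$ is denoted by $s = (x, y)$. We assume that samples from $\mathcal{S}$ are drawn according to a probability distribution $P$ defined on $\mathcal{S}$. A training set of $m$ samples drawn from $P$ is denoted by $S_m = \{s_i\}_{i=1}^m$.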

We consider DNN classifiers defined as

$$g(x) = \arg\max_{i \in [N_Y]} (f(x))_i, \tag{1}$$

where $(f(x))_i$ is the $i$th element of the $N_Y$-dimensional output of a DNN $f$. We assume that $f$ is composed of $L$ layers

$$f(x) = f_L(f_{L-1}(\dots f_1(x, W_1), \dots W_{L-1}), W_L), \tag{2}$$

where $f_l(\cdot, W_l)$ represents the $l$th layer with parameters $W_l$, $l = 1, \dots, L$. The output of the $l$th layer is denoted $z^l$, i.e. $z^l = f_l(z^{l-1}, W_l)$. The input layer corresponds to $z^0 = x$ and the output of the last layer is denoted by $z = f(x)$.

For each training signal $x_j$ we assume also that we have access to the inputs $a_j$ and the outputs $b_j$ of a fully connected layer to be pruned. For an unpruned weight matrix $W$ the layer is defined by the equation $b_j = \rho(W^T a_j)$. We denote by $A$ the matrix concatenating all the latent input vectors $a_j$ and by $B$ the matrix concatenating all the latent output vectors $b_j$. We are looking for a new pruned weight matrix $U$ such that the new layer will be defined as $\rho(U^T a_j)$.
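As a minimal sketch of this setup, the snippet below runs a small fully connected ReLU network and records the latent inputs and outputs of one chosen layer. The function name `forward_collect`, the column-per-sample layout and the toy layer sizes are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_collect(X, weights, layer):
    """Run a fully connected ReLU network and record one layer's inputs/outputs.

    X: (d_0, m) array whose columns are samples.
    weights: list of (d_l, d_{l+1}) arrays; the last layer is kept linear.
    layer: 0-based index of the layer to be pruned (hypothetical argument).
    """
    Z = X
    A = B = None
    last = len(weights) - 1
    for l, W in enumerate(weights):
        pre = W.T @ Z
        out = pre if l == last else relu(pre)
        if l == layer:
            A, B = Z, out          # latent inputs and outputs of the chosen layer
        Z = out
    predictions = np.argmax(Z, axis=0)  # predicted class per sample
    return predictions, A, B

rng = np.random.default_rng(0)
weights = [rng.normal(size=(20, 16)),
           rng.normal(size=(16, 12)),
           rng.normal(size=(12, 3))]
X = rng.normal(size=(20, 50))
predictions, A, B = forward_collect(X, weights, layer=1)
```

The matrices `A` and `B` returned here play the role of the concatenated latent inputs and outputs that a layerwise pruning algorithm receives.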

We compare three state-of-the-art smart pruning techniques with hard thresholding for pruning feedforward neural networks. A smart pruning algorithm is defined as a function that takes as inputs the original layer weights and the latent representation inputs and outputs of the layer, and outputs the new pruned weight matrix. We compare the following algorithms: (i) hard thresholding, which zeroes every weight whose magnitude does not exceed a positive constant threshold; (ii) Net-Trim [Aghasi et al., 2018]; (iii) LOBS [Dong et al., 2017]; (iv) Corenet [Baykal et al., 2018]. We refer to the original papers for details. We used implementations by the authors for Net-Trim and LOBS and created our own implementation of Corenet.
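Hard thresholding itself is simple enough to sketch in a few lines. The helper names below and the use of a magnitude quantile to pick the threshold for a target sparsity are illustrative assumptions; the paper only specifies that weights are compared against a positive constant.

```python
import numpy as np

def hard_threshold(W, t):
    """Zero every weight whose magnitude is at most t (t > 0)."""
    mask = np.abs(W) > t
    return W * mask

def threshold_for_sparsity(W, sparsity):
    """Magnitude threshold that zeroes roughly a `sparsity` fraction of W."""
    return np.quantile(np.abs(W), sparsity)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
t = threshold_for_sparsity(W, 0.9)
U = hard_threshold(W, t)
achieved = 1.0 - np.count_nonzero(U) / U.size  # fraction of zeroed weights
```

No training data is needed for this step, which is why its pruning time is negligible compared with the smart pruning algorithms.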

Experiments were performed on three commonly used datasets: Mnist [LeCun et al., 1998], FashionMnist [Xiao et al., 2017] and Cifar-10 [Krizhevsky and Hinton, 2009]. For each dataset we used half the test set as a validation set. We also tested two different architectures for each dataset, one fully connected and one convolutional. The architectures tested are shown in Figure 1. Implementation details are given in Appendix C.

For fully connected architectures both pruning and retraining were done on a MacBook Pro with an Intel Core i7 CPU @ 2.8GHz and 16GB of 1600 MHz DDR3 RAM, using only the CPU. For convolutional architectures pruning was done on 48 Intel Xeon E5-2650 v4 CPUs @ 2.20GHz and retraining was done using a single GeForce GTX 1080 GPU.

We first prune the fully connected nonlinear layers of the DNN architectures using the different algorithms to sparsity. We then retrain the entire pruned architecture until the lost accuracy is recovered. We present the results comparing total pruning time (pruning time + retraining time) for all architectures in Table 1. We see that, using retraining, hard thresholding is able to achieve the same accuracy in the pruned model compared to other approaches at only a fraction of the time. We see furthermore that Net-Trim is faster than Corenet and LOBS in the larger scale Cifar experiments. Finally LOBS is by far the most time consuming algorithm.

We conclude that for the architectures tested and the computational resources that were allocated for pruning and retraining, smart pruning techniques represent a significant increase in time complexity without providing any benefit in the final network accuracy. We note that, although we believe our experimental setting to be the most realistic one, it might still be possible to see a benefit from smart pruning techniques if more efficient computational resources are allocated to smart pruning or less efficient resources are available for retraining. For example, retraining a convolutional network using only CPUs might be less efficient than smart pruning.

We determine that pruning accuracy vs pruning speed is a crucial parameter to be optimised by any smart pruning algorithm. In this context we derive in the next section a novel pruning algorithm that is often orders of magnitude faster than competing approaches while still achieving the highest network accuracy after pruning.

## 3 The FeTa algorithm

In this section we derive our novel smart pruning algorithm.

### 3.1 Net-Trim

We first describe in detail the Net-Trim algorithm [Aghasi et al., 2018], which will be the basis for our approach. Net-Trim aims to solve

$$\min_U \|U\|_1 \quad \text{subject to} \quad \|\rho(U^T A) - B\|_F^2 \le \epsilon, \tag{3}$$

where $\epsilon$ is a tolerance parameter. Intuitively Net-Trim tries to find a sparse matrix $U$ such that the output of the pruned nonlinear layer $\rho(U^T A)$ stays close in terms of the Frobenius norm to the original unpruned output $B$. The optimisation above is non-convex and the authors of Net-Trim introduce a convexified formulation. For , and , Net-Trim minimizes

 (4)

which is solved using the ADMM approach. There are two main issues with this formulation. One is time complexity as the ADMM implementation has to solve a Cholesky factorisation which scales like where is the layer input dimension. The second is space complexity as the algorithm has to create inequality constraints, and thus allocate memory, proportional to the number of data samples . In [Aghasi et al., 2018] the space complexity under some assumptions has been shown to be of the order . We note however that for large layers can be higher than even for high sparsity levels.

### 3.2 DC decomposition

We aim to overcome the shortcomings of Net-Trim by reformulating the optimisation problem as a difference of convex functions problem. For we reformulate the optimisation problem that we want to solve as

$$\min_U \frac{1}{m} \sum_{s_j \in S_m} \|\rho(U^T a_j) - b_j\|_2^2 + \lambda \Omega(U), \tag{5}$$

where $\lambda$ is the sparsity parameter. The term $\|\rho(U^T a_j) - b_j\|_2^2$ ensures that the nonlinear projection remains the same for training signals. The term $\Omega(U)$ is any convex regulariser which imposes the desired structure on the weight matrix $U$.
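The objective in Equation 5 can be evaluated directly, as in the sketch below. The choice $\Omega(U) = \|U\|_1$ is an assumption made here for concreteness (the text allows any convex regulariser), and the function name `feta_objective` is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feta_objective(U, A, B, lam):
    """Equation 5 with Omega(U) = ||U||_1 (assumed regulariser).

    U: (d_in, d_out) candidate pruned weights.
    A: (d_in, m) latent inputs, B: (d_out, m) latent outputs.
    """
    m = A.shape[1]
    residual = relu(U.T @ A) - B
    data_term = np.sum(residual ** 2) / m
    return data_term + lam * np.sum(np.abs(U))

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 15))
A = rng.normal(size=(20, 100))
B = relu(W.T @ A)                 # outputs of the unpruned layer
value_at_W = feta_objective(W, A, B, lam=0.1)
```

At the unpruned weights the data term vanishes, so the objective reduces to the regulariser alone; pruning trades an increase in the data term for a smaller regulariser value.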

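The objective in Equation 5 is non-convex. We show that the optimisation of this objective can be cast as a difference of convex functions (DC) problem. We assume just one training sample $x$, for simplicity, with latent representations $a$ and $b$:

$$\begin{aligned}
\|\rho(U^T a) - b\|_2^2 + \lambda\Omega(U) &= \sum_i [\rho(u_i^T a) - b_i]^2 + \lambda\Omega(U) \\
&= \sum_i [\rho^2(u_i^T a) - 2\rho(u_i^T a) b_i + b_i^2] + \lambda\Omega(U) \\
&= \sum_i [\rho^2(u_i^T a) + b_i^2] + \lambda\Omega(U) + \sum_i [-2 b_i \rho(u_i^T a)] \\
&= \sum_i [\rho^2(u_i^T a) + b_i^2] + \lambda\Omega(U) + \sum_{i:\, b_i < 0} [-2 b_i \rho(u_i^T a)] + \sum_{i:\, b_i \ge 0} [-2 b_i \rho(u_i^T a)].
\end{aligned} \tag{6}$$

Notice that after the split the first term ($b_i < 0$) is convex while the second ($b_i \ge 0$) is concave. We note that $b_i \ge 0$ by definition of the ReLU, so the first of these two sums is empty, and set

$$g(U; x) = \sum_i [\rho^2(u_i^T a) + b_i^2], \tag{7}$$

$$h(U; x) = \sum_{i:\, b_i > 0} [2 b_i \rho(u_i^T a)]. \tag{8}$$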

Then by summing over all the samples we get

$$f(U) = \sum_j g(U; x_j) + \lambda\Omega(U) - \sum_j h(U; x_j) = g(U) + \lambda\Omega(U) - h(U), \tag{9}$$

which is a difference of convex functions. The rectifier nonlinearity is non-smooth, but we can alleviate that by assuming a smooth approximation. A common choice for this task is , with a positive constant.
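The decomposition can be checked numerically. The sketch below evaluates $g$ and $h$ from Equations 7 and 8 (summed over samples, using the exact ReLU rather than a smooth approximation) and compares $g - h$ with the squared reconstruction error. The function names and toy sizes are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def g_part(U, A, B):
    """Sum over samples of rho^2(u_i^T a) + b_i^2 (Equation 7)."""
    return np.sum(relu(U.T @ A) ** 2 + B ** 2)

def h_part(U, A, B):
    """Sum over samples of 2 b_i rho(u_i^T a) (Equation 8).

    Entries with b_i = 0 contribute nothing, so summing over all i
    equals summing over b_i > 0.
    """
    return np.sum(2.0 * B * relu(U.T @ A))

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 15))
A = rng.normal(size=(20, 50))
B = relu(W.T @ A)                              # b >= 0 by construction
U1 = W + 0.1 * rng.normal(size=W.shape)
U2 = W + 0.1 * rng.normal(size=W.shape)

direct = np.sum((relu(U1.T @ A) - B) ** 2)     # squared error without regulariser
dc = g_part(U1, A, B) - h_part(U1, A, B)       # same quantity as g - h
```

Both `g_part` and `h_part` are convex in `U` because the ReLU and its square are convex and the coefficients `B` are non-negative, which is what makes the DC structure usable.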

### 3.3 Optimisation

It is well known that DC programs have efficient optimisation algorithms. We propose to use the DCA algorithm [Tao and An, 1997]. DCA is an iterative algorithm that consists of solving, at each iteration, the convex optimisation problem obtained by linearizing $h$ (the source of non-convexity in $f$) around the current solution. Although DCA is only guaranteed to reach local minima, the authors of [Tao and An, 1997] state that DCA often converges to the global minimum, and it has been used successfully to optimise a fully connected DNN layer [Fawzi et al., 2015]. At iteration $k$ of DCA, the linearized optimisation problem is given by

$$\arg\min_U \{ g(U) + \lambda\Omega(U) - \mathrm{Tr}(U^T \nabla h(U^k)) \}, \tag{10}$$

where $U^k$ is the solution estimate at iteration $k$. The detailed procedure is then given in algorithms 1 and 2. We assume that the regulariser is convex but possibly non-smooth, in which case the optimisation can be performed using proximal methods.

In order to solve the linearized problem we propose to use Accelerated Proximal SVRG (Acc-Prox-SVRG), which was presented in [Nitanda, 2014]. We detail this method in Algorithm 2. At each iteration a minibatch and is drawn. The gradient for the smooth part is calculated and the algorithm takes a step in that direction with step size . Then the proximal operator for the non-smooth regulariser is applied to the result. The hyperparameters for Acc-Prox-SVRG are the acceleration parameter and the gradient step . We have found that in our experiments, using and gives the best results.

Our proposed algorithm has time complexity , where is the precision of the solution, is related to the Lipschitz and strong convexity constants, and is the outer iteration number compared to for Net-trim. Furthermore the optimisation can be done in a stochastic manner using data minibatches greatly reducing the algorithm space complexity. We name our algorithm FeTa, Fast and Efficient Trimming Algorithm.
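The outer DCA loop can be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: it assumes $\Omega(U) = \|U\|_1$, uses the exact ReLU with a subgradient of $h$, and solves each linearized problem with plain full-batch proximal gradient steps instead of Acc-Prox-SVRG. All function names and hyperparameter values are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def objective(U, A, B, lam):
    m = A.shape[1]
    return np.sum((relu(U.T @ A) - B) ** 2) / m + lam * np.sum(np.abs(U))

def dca_prune(W, A, B, lam, outer_iters=5, inner_iters=50):
    """DCA with full-batch proximal gradient inner steps (simplified sketch)."""
    m = A.shape[1]
    # 1/L step, with L = (2/m) * ||A||_2^2 bounding the Lipschitz constant of grad g.
    step = m / (2.0 * np.linalg.norm(A, 2) ** 2)
    U = W.copy()
    for _ in range(outer_iters):
        # Linearize h around the current estimate U^k (subgradient of 2 b rho(u^T a)).
        grad_h = (2.0 / m) * A @ (B * (U.T @ A > 0)).T
        for _ in range(inner_iters):
            grad_g = (2.0 / m) * A @ relu(U.T @ A).T
            U = soft_threshold(U - step * (grad_g - grad_h), step * lam)
    return U

rng = np.random.default_rng(0)
W = rng.normal(size=(30, 20))
A = rng.normal(size=(30, 200))
B = relu(W.T @ A)
U = dca_prune(W, A, B, lam=0.05)
before, after = objective(W, A, B, 0.05), objective(U, A, B, 0.05)
```

Because each linearized problem is a convex upper bound of the original objective that is tight at $U^k$, any descent on it cannot increase the objective of Equation 5, which is the property the usage lines above compare through `before` and `after`.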

## 4 Experiments without retraining

We now present detailed comparison experiments for pruning DNNs without retraining. The experimental setup (datasets and architectures) is identical to that of Section 2. We prune both fully connected layers of all architectures to sparsity and plot the tradeoff between accuracy degradation and time complexity in Figure 2. The first thing that we notice is that Corenet and Net-Trim perform inconsistently relative to hard thresholding. While in some experiments their results are better than or equal to hard thresholding, in others they are significantly worse. On the other hand, FeTa and LOBS enjoy significant improvements in accuracy after pruning, while FeTa is up to faster than LOBS in some experiments. We also notice that the gains over hard thresholding are significant in fully connected architectures. In contrast, fully connected layers in convolutional DNN architectures are extremely redundant, with the majority of the non-linear separation of the data manifold being performed by the convolutional layers. Overall, FeTa significantly outperforms hard thresholding and other smart pruning approaches both in final test accuracy and pruning speed, when retraining is not possible.

## 5 Generalization Error

We have already seen that hard thresholding followed by retraining is a very efficient pruning method. It is also interesting to explore theoretically what will be the response of the network after hard thresholding different layers at different sparsity levels. In the following section we use tools from the robustness framework [Xu and Mannor, 2012] to bound the generalization error of the new architecture induced by hard thresholding.

We need the following two definitions of the classification margin and the score that we take from [Sokolic et al., 2017]. These will be useful later for measuring the generalization error.

###### Definition 5.1.

(Score). For a classifier a training sample has a score

$$o(s_i) = o(x_i, g(x_i)) = \min_{j \ne g(x_i)} \sqrt{2}\, (\delta_{g(x_i)} - \delta_j)^T f(x_i), \tag{12}$$

where $\delta_i$ is the Kronecker delta vector with $(\delta_i)_i = 1$, and $g(x_i)$ is the output class for $s_i$ from classifier $g(x)$, which can also be $g(x_i) \ne y_i$.
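Since $(\delta_{g(x_i)} - \delta_j)^T f(x_i)$ is just the gap between the predicted-class output and the output for class $j$, the score reduces to $\sqrt{2}$ times the gap to the runner-up class. A small sketch, with a hypothetical function name and toy outputs:

```python
import numpy as np

def score(outputs, predicted):
    """Score of Definition 5.1 for one sample.

    outputs: network output vector f(x_i).
    predicted: index of the class g(x_i) chosen by the classifier.
    """
    outputs = np.asarray(outputs, dtype=float)
    runner_up = np.max(np.delete(outputs, predicted))
    return np.sqrt(2.0) * (outputs[predicted] - runner_up)

outputs = [2.0, 0.5, 1.0]
s_pred = score(outputs, int(np.argmax(outputs)))
```

The score is non-negative whenever `predicted` is the arg max of the outputs, and it is zero exactly when two classes tie.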

###### Definition 5.2.

(Training Sample Margin). For a classifier a training sample has a classification margin measured by the norm if

$$g(x) = g(x_i); \quad \forall x : \|x - x_i\|_2 < \gamma(s_i). \tag{13}$$

The classification margin of a training sample is the radius of the largest metric ball (induced by the norm) in centered at that is contained in the decision region associated with the classification label . Note that it is possible for a classifier to misclassify a training point .

We are now ready to state our main result.

###### Theorem 5.1.

Assume that is a (subset of) -regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Furthermore assume that we prune classifier on layer using hard thresholding to obtain a new classifier . Then for any , with probability at least , when ,

$$GE(g_2) \le A \cdot \left( \gamma - \frac{C_1 \prod_{i > i^\star} \|W_i\|_2}{\prod_i \|W_i\|_2} \right)^{-\frac{k}{2}} + B, \tag{14}$$

where and can be considered constants related to the data manifold and the training sample size, is the margin and is the maximum error of the layer over the training set.

The detailed proof can be found in Appendix A. The bound depends on two constants related to intrinsic properties of the data manifold, the regularity constant $C_M$ and the intrinsic data dimensionality $k$. In particular the bound depends exponentially on the intrinsic data dimensionality $k$. Thus more complex datasets are expected to lead to DNNs that are less robust to pruning. The bound also depends on the spectral norms of the hidden layers $\|W_i\|_2$. Small spectral norms lead to a larger base in Equation 14 and thus to tighter bounds.
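To make the dependence on the margin and on $k$ concrete, the sketch below evaluates the explicit form of the bound given in Corollary 7.2.1 (Equation 20) of the appendix, into which the pruned margin is substituted. The function name and all numeric values are illustrative assumptions; as noted later in the text, such bounds are loose in absolute terms, so only the trends are meaningful.

```python
import math

def ge_bound(gamma, k, c_m, n_classes, m, delta):
    """Right-hand side of Equation 20 for margin gamma and dimensionality k."""
    if gamma <= 0:
        raise ValueError("the bound requires a positive margin")
    covering = n_classes * 2 ** (k + 1) * (c_m / gamma) ** k
    first = math.sqrt(math.log(2) * covering / m)
    second = math.sqrt(2 * math.log(1 / delta) / m)
    return first + second

# Hypothetical values: 10 classes, 50,000 training samples, C_M = 1.
base = ge_bound(gamma=1.0, k=4, c_m=1.0, n_classes=10, m=50_000, delta=0.05)
smaller_margin = ge_bound(gamma=0.5, k=4, c_m=1.0, n_classes=10, m=50_000, delta=0.05)
higher_dim = ge_bound(gamma=0.5, k=8, c_m=1.0, n_classes=10, m=50_000, delta=0.05)
```

Halving the margin raises the bound, and doing so on a manifold of higher intrinsic dimensionality raises it much further, which is the qualitative behaviour the theorem predicts for pruning.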

Our result is quite pessimistic as the hard thresholding error $C_1$ is multiplied by the factor $\prod_{i > i^\star} \|W_i\|_2$. Thus in our analysis the GE grows exponentially with respect to the remaining layer depth of the perturbed layer. This is in line with previous work [Raghu et al., 2016] [Han et al., 2015b] that demonstrates that layers closer to the input are much less robust compared to layers close to the output.
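The role of the remaining depth can be illustrated by evaluating the margin term inside Equation 14 for different pruned layers. In this sketch the function name, the toy diagonal weight matrices and the values of the score and of $C_1$ are all hypothetical; the same $C_1$ is used for every layer only to isolate the effect of depth.

```python
import numpy as np

def pruned_margin_bound(weights, min_score, c1, pruned_layer):
    """Margin term of Equation 14 when layer `pruned_layer` (1-based) is pruned.

    min_score: smallest training score of the unpruned classifier.
    c1: maximum output error of the pruned layer over the training set.
    """
    norms = [np.linalg.norm(W, 2) for W in weights]   # spectral norms
    total = np.prod(norms)
    remaining = np.prod(norms[pruned_layer:])          # layers i > i*
    return (min_score - c1 * remaining) / total

# Three toy layers with spectral norms 2, 3 and 1.5.
weights = [2.0 * np.eye(4), 3.0 * np.eye(4), 1.5 * np.eye(4)]
unpruned = 18.0 / np.prod([2.0, 3.0, 1.5])
first_layer = pruned_margin_bound(weights, min_score=18.0, c1=1.0, pruned_layer=1)
last_layer = pruned_margin_bound(weights, min_score=18.0, c1=1.0, pruned_layer=3)
```

With spectral norms above one, the same layer error costs more margin when more layers follow the pruned one, so `first_layer` ends up smaller than `last_layer`.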

We can use a perturbation bound introduced in [Neyshabur et al., 2017] to extend the above bound to include pruning of multiple layers:

###### Theorem 5.2.

Assume that is a (subset of) -regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Furthermore assume that we prune classifier on all layers using hard thresholding, to obtain a new classifier . Then for any , with probability at least , when ,

 GE(g2)≤A⋅(γ−eD2∑i||Hi||2||Wi||2)−k2+B, (15)

where and can be considered constants related to the data manifold and the training sample size, is the margin and is the per layer perturbation matrix induced by hard thresholding.

The detailed proof can be found in Appendix B. We note also the generality of our result; even though we have assumed a specific form of pruning, the GE bound holds for any type of bounded perturbation to a hidden layer.

We make here an important note regarding generalization bounds for neural networks in general. All recent state-of-the-art results such as [Neyshabur et al., 2017] [Bartlett et al., 2017] [Arora et al., 2018] [Golowich et al., 2018] provide only vacuous estimates of the generalization error. At the same time, while the values of the generalization error estimated by these techniques are loose by orders of magnitude, they still provide meaningful insights into various properties of deep neural networks. In this context we will validate our theoretical framework by showing that remaining layer depth correlates with an exponential decrease in DNN test accuracy. We will also show that increased intrinsic data dimensionality correlates with decreased DNN robustness to pruning.

Finally, in the statement of our theorem we assume that the pruned margin bound $\gamma - C_1 \prod_{i > i^\star} \|W_i\|_2 / \prod_i \|W_i\|_2$ is positive; this implies that all training samples remain correctly classified after pruning. What happens if this quantity goes to zero? In this case we can imagine the following three step procedure. We first start from a clean network, prune it, and use our formula until the above quantity goes to zero. We can then recompute all relevant quantities for our bound including accuracy over the training set, margins, scores and spectral norms. We can then reapply our formula, as the quantity is by definition positive for a small enough pruning. Importantly, the rate of margin decrease remains the same for all sparsity levels.

## 6 Experiments on Generalization Error

### 6.1 Pruning different layers

In this experiment we use the convolutional MNIST and the convolutional CIFAR architectures from Section 2. We prune individual layers from the two networks using hard thresholding for different sparsity levels from to sparsity and compute the network accuracy. We plot the results in Figure 3. We see that, as predicted by our bound, the network accuracy drops exponentially with remaining layer depth (the number of layers from the pruned layer to the output). This exponential behaviour is the result of the term $\prod_{i > i^\star} \|W_i\|_2$, and not the result of the dimensionality exponent factor $-\frac{k}{2}$. This suggests that if one has to choose, it is beneficial to prune the layers close to the output first, as this will have a minimal impact on the network accuracy.

### 6.2 Manifold dimensionality and pruning

Our theoretical analysis suggests also that data with higher intrinsic dimensionality will lead to networks that are less robust to pruning. We test this hypothesis on the convolutional CIFAR architecture. We upper bound the intrinsic dimensionality of the data manifold by applying PCA for different number of latent dimensions, and then train different classification networks. Then we pruned using hard thresholding only the first convolutional layer and computed the accuracy over the test set. We average the results over 10 different networks for each upper bound. Denoting the dimensionality upper bound by we see that for the drop in accuracy is significantly larger compared to . The degradation is not exactly exponential as our dimensionality factor suggests, however this is expected as our theoretical bound involves also other quantities such as the margin and the spectral norm of different layers which change for each trained network and will also affect the resulting accuracy.
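One way to impose such an upper bound on the intrinsic dimensionality is to project the data onto its leading principal components. The sketch below does this with an SVD and maps the result back to the input space so that the network architecture can stay unchanged; whether the authors mapped back or trained directly on the reduced coordinates is not stated, so that choice and the function name are assumptions.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components.

    X: (n_samples, n_features). Returns an array of the same shape whose
    centred rows lie in a k-dimensional linear subspace.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                      # (n_features, k) principal directions
    return Xc @ V @ V.T + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # stand-in for flattened images
X_k = pca_project(X, k=5)
```

Training one network per value of `k` on the projected data then gives a family of models whose data manifolds have a controlled upper bound on dimensionality.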

## 7 Conclusion

We have seen that hard thresholding followed by retraining remains the most efficient way of pruning fully connected DNN layers. For the case without retraining we have introduced a novel algorithm called FeTa that is often orders of magnitude faster than competing approaches while maintaining network accuracy. We have also shown theoretically and empirically that it is more profitable to prune layers close to the output as the perturbation that is introduced results in a much smaller drop in accuracy compared to pruning earlier layers. We discover also that networks trained on data that lies on a manifold of high intrinsic dimensionality exhibit significantly reduced robustness to pruning.

## Appendix

### A. Proof of theorem 5.1.

We will proceed as follows. We first introduce some prior results which hold for the general class of robust classifiers. We will then give specific prior generalization error results for the case of classifiers operating on datapoints from -regular manifolds. Afterwards we will provide prior results for the specific case of DNN classifiers. Finally we will prove our novel generalization error bound and provide a link with prior bounds.

We first formalize robustness for generic classifiers . In the following we assume a loss function that is positive and bounded .

###### Definition 7.1.

An algorithm is robust if can be partitioned into K disjoint sets, denoted by , such that , ,

$$s_i, s \in T_t \;\Rightarrow\; |l(g(x_i), y_i) - l(g(x), y)| \le \epsilon(S_m). \tag{16}$$

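Now let $\hat{l}(g)$ and $l_{\text{emp}}(g)$ denote the expected error and the training error, i.e.,

$$\hat{l}(g) \triangleq \mathbb{E}_{s \sim \mathcal{S}}\, l(g(x), y); \qquad l_{\text{emp}}(g) \triangleq \frac{1}{m} \sum_{s_i \in S_m} l(g(x_i), y_i). \tag{17}$$

We can then state the following theorem from [Xu and Mannor, 2012]: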

###### Theorem 7.1.

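If $S_m$ consists of $m$ i.i.d. samples, and $g(x)$ is $(K, \epsilon(S_m))$-robust, then for any $\delta > 0$, with probability at least $1 - \delta$,

$$GE(g) = |\hat{l}(g) - l_{\text{emp}}(g)| \le \epsilon(S_m) + M \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{m}}. \tag{18}$$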

The above generic bound can be specified for the case of -regular manifolds as in [Sokolic et al., 2017]. We recall the definition of the sample margin as well as the following theorem:

###### Theorem 7.2.

If there exists $\gamma$ such that

$$\gamma(s_i) > \gamma > 0 \quad \forall s_i \in S_m, \tag{19}$$

then the classifier is -robust.

By direct substitution of the above result and the definition of a $C_M$-regular manifold into Theorem 7.1 we get:

###### Corollary 7.2.1.

Assume that is a (subset of) regular dimensional manifold, where . Assume also that classifier achieves a classification margin and take to be the loss. Then for any , with probability at least ,

$$GE(g) \le \sqrt{\frac{\log(2) \cdot N_Y \cdot 2^{k+1} \cdot (C_M)^k}{\gamma^k m}} + \sqrt{\frac{2 \log(1/\delta)}{m}}. \tag{20}$$

Note that in the above we have used the fact that and therefore . The above holds for a wide range of algorithms that includes as an example SVMs. We are now ready to specify the above bound for the case of DNNs, adapted from [Sokolic et al., 2017],

###### Theorem 7.3.

Assume a DNN classifier $g(x)$, as defined in Equation 1, and let $\tilde{s}$ be the training sample with the smallest score $o(\tilde{s})$. Then the classification margin is bounded as

$$\gamma(s_i) \ge \frac{o(\tilde{s})}{\prod_i \|W_i\|_2} = \gamma. \tag{21}$$

We now prove our main result. We want to relate the generalization error of an original unpruned classifier to that of a new pruned classifier at layer . Our analysis will be based on the maximum norm error of the layer over the training set, after pruning. In the notation we have omitted the layer weights and introduced instead a numbered superscript and denoting unpruned and pruned layers respectively. We will denote by the training sample with the smallest score. For this training sample we will denote the second best guess of the classifier . Throughout the proof, we will use the notation .

First we assume the score of the point for the original classifier . Then, for the second classifier , we take a point that lies on the decision boundary between and such that . We assume for simplicity that, after pruning, the classification decisions do not change such that . We then make the following calculations

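$$\begin{aligned}
o^1(\tilde{x}, g^1(\tilde{x})) &= o^1(\tilde{x}, g^1(\tilde{x})) - o^2(x^\star, g^2(\tilde{x})) \\
&= v^T_{g^1(\tilde{x}) j^\star} f^1(\tilde{x}) - v^T_{g^2(\tilde{x}) j^\star} f^2(x^\star) \\
&= v^T_{g^2(\tilde{x}) j^\star} \left( f^1(\tilde{x}) - f^2(x^\star) \right) \\
&\le \|v^T_{g^2(\tilde{x}) j^\star}\|_2 \, \|f^1(\tilde{x}) - f^2(x^\star)\|_2 \\
&= \|f^1_L(\tilde{x}) - f^2_L(x^\star)\|_2 \\
&\le \prod_{i > i^\star} \|W_i\|_2 \, \|f^1_{i^\star}(\tilde{x}) - f^2_{i^\star}(x^\star)\|_2 \\
&\le \prod_{i > i^\star} \|W_i\|_2 \left\{ \|f^1_{i^\star}(\tilde{x}) - f^1_{i^\star}(x^\star)\|_2 + \|f^1_{i^\star}(x^\star) - f^2_{i^\star}(x^\star)\|_2 \right\} \\
&\le \prod_{i > i^\star} \|W_i\|_2 \left\{ \|f^1_{i^\star}(\tilde{x}) - f^1_{i^\star}(x^\star)\|_2 + C_1 \right\} \\
&\le \prod_i \|W_i\|_2 \, \|\tilde{x} - x^\star\|_2 + C_1 \prod_{i > i^\star} \|W_i\|_2 \\
&\le \prod_i \|W_i\|_2 \, \gamma^2(s_i) + C_1 \prod_{i > i^\star} \|W_i\|_2.
\end{aligned} \tag{22}$$

From the above we can therefore write

$$\frac{o^1(\tilde{x}, g^1(\tilde{x})) - C_1 \prod_{i > i^\star} \|W_i\|_2}{\prod_i \|W_i\|_2} \le \gamma^2(\tilde{x}). \tag{23}$$

By following the derivation of the margin from the original paper [Sokolic et al., 2017] and taking into account the definition of the margin we know that

$$\gamma = \frac{o^1(\tilde{x}, g^1(\tilde{x}))}{\prod_i \|W_i\|_2} \le \gamma^1(\tilde{x}). \tag{24}$$

Therefore we can finally write

$$\gamma - \frac{C_1 \prod_{i > i^\star} \|W_i\|_2}{\prod_i \|W_i\|_2} \le \gamma^2(\tilde{x}). \tag{25}$$

The theorem follows from direct application of Corollary 7.2.1.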

### B. Proof of theorem 5.2.

We start as in Theorem 5.1, but instead of dealing with the maximum error over the training set we assume that pruning induces a perturbation matrix $H_i$ to the weights $W_i$ of each layer. We will then use the following stability result from [Neyshabur et al., 2017].

###### Theorem 7.4.

[Neyshabur et al., 2017](Perturbation Bound). For any , let be a layer neural network with ReLU activations. Then for any , and , and any perturbation such that , the change in the output of the network can be bounded as follows:

 ||fw(x)−fw+h(x)||2=||f1(x)−f2(x)||2≤eD2∏i||Wi||2∑i||Hi||2||Wi||2 (26)

We can directly apply the above result to our case by assuming that layer pruning results in perturbation matrices per layer .

We assume the score of the point for the original classifier . Then, for the second classifier , we take a point that lies on the decision boundary between and such that . We assume as before that the classification decisions do not change such that . We write

 o1(~x,g1(~x))=o1(~x,g1(~x))−o2(x⋆,g2(~x))=vTg1(~x)j⋆f1(~x)−vTg2(~x)j⋆f2(x⋆)=vTg2(~x)j⋆(f1(~x)−f2(x⋆))≤||vTg2(~x)j⋆||2||f1(~x)−f2(x⋆)||2=||f1(~x)−f2(x⋆)||2≤||f1(~x)−f1(x⋆)||2+||f1(x⋆)−f2(x⋆)||2≤∏i||Wi||2||~x−x⋆||2+eD2∏i||Wi||2∑i||Hi||2||Wi||2≤∏i||Wi||2γ2(si)+eD2∏i||Wi||2∑i||Hi||2||Wi||2 (27)

We can then write

 o1(~x,g1(~x))−eD2∏i||Wi||2∑i||Hi||2||Wi||2∏i||Wi||2≤γ2(~x). (28)

Then as before

 γ−eD2∑i||Hi||2||Wi||2≤γ2(~x). (29)

The theorem follows from direct application of Corollary 7.2.1.

### C. Experiment Details

All architectures were trained using SGD with momentum , learning rate of , exponential decay and minibatch size of . The architectures were trained until the validation accuracy saturated, usually for epochs.

We now give some details regarding the hyperparameters of the pruning algorithms. For all algorithms the entire training set was used to perform pruning.

For Corenet algorithm we implemented neuron pruning (Corenet+) but did not implement amplification (Corenet++). Amplification increased linearly the computation time of the algorithm and we did not notice any improvements in accuracy for high sparsity levels.

For FeTa the algorithm was run for to outer loops. The acceleration parameter was set to , the SVRG learning rate was set to with minibatch size of . In SVRG we need to set a number of full gradient computations, denoted by the parameter ; we set this to .

For Net-Trim we ran the algorithm for 400 iterations and . We set the sparsity level by modifying the parameter as detailed in the authors' instructions.

The LOBS algorithm does not contain any tunable parameters.

We did not implement GPU acceleration for any of the algorithms and it is not clear that this will provide speedups for common matrix-matrix and matrix-vector operations which are at the core of the implemented pruning algorithms.