1 Introduction
Recently, deep neural networks have achieved state-of-the-art results in a number of machine learning tasks LeCun et al. (2015). Training such networks is computationally intensive and often requires dedicated and expensive hardware. Furthermore, the resulting networks often require a considerable amount of memory to be stored. Using a Pascal Titan X GPU, the popular AlexNet and VGG16 models require 13 hours and 7 days, respectively, to train, while requiring 200MB and 600MB, respectively, to store. The large memory requirements limit the use of DNNs in embedded systems and portable devices such as smartphones, which are now ubiquitous.
A number of approaches have been proposed to reduce the DNN size during training time, often with little or no degradation to classification performance. Approaches include introducing Bayesian sparsity-inducing priors Louizos et al. (2017) Blundell et al. (2015) Molchanov et al. (2017) and binarization Hou et al. (2016) Courbariaux et al. (2016). Other methods include the hashing trick used in Chen et al. (2015), tensorisation Novikov et al. (2015) and efficient matrix factorisations Yang et al. (2015).
However, trained DNN models are used by researchers and developers that do not have dedicated hardware to train them, often as general feature extractors for transfer learning. In such settings it is important to introduce a cheap compression method, i.e., one that can be implemented as a post-processing step with little or no retraining. Some first work in this direction has been Kim et al. (2015) Han et al. (2015a) Han et al. (2015b), although these still require a lengthy retraining procedure. Closer to our approach, Aghasi et al. (2016) recently proposed a convexified layer-wise pruning algorithm termed NetTrim. Building upon NetTrim, the authors in Dong et al. (2017) propose LOBS, an algorithm for layer-wise pruning by loss function approximation.
Pruning a neural network layer introduces a perturbation to the latent signal representations generated by that layer. As the perturbed signal passes through layers of nonlinear projections, the perturbation could become arbitrarily large. DNN robustness to hidden layer perturbations has been investigated for random noise in Raghu et al. (2016). For the case of pruning, in Aghasi et al. (2016) and Dong et al. (2017) the authors conduct a theoretical analysis using the Lipschitz properties of DNNs, showing the stability of the latent representations, over the training set, after pruning. The methods employed have connections to recent work Sokolic et al. (2017) Bartlett et al. (2017) Neyshabur et al. (2017) that has used the Lipschitz properties to analyze the Generalization Error (GE) of DNNs, a more useful performance measure.
1.1 Contributions
In this work we introduce a cheap pruning algorithm for dense layers of DNNs. We also conduct a theoretical analysis of how pruning affects the Generalization Error of the trained classifier.

We show that the sparsity-inducing objective proposed in Aghasi et al. (2016) can be cast as a difference of convex functions problem, which has an efficient solution. For a fully connected layer with input dimension , output dimension and training samples, NetTrim and LOBS scale like and , respectively. Our iterative algorithm scales like , where is the precision of the solution, is related to the Lipschitz and strong convexity constants, and is the outer iteration number. Empirically, our algorithm is orders of magnitude faster than competing approaches. We also extend our formulation to allow retraining a layer with any convex regulariser.

We build upon the work of Sokolic et al. (2017) to bound the GE of a DNN for the case of bounded perturbations to the hidden layer weights, of which pruning is a special case. Our theoretical analysis provides a principled way of pruning while managing the GE. In sharp contrast to the analysis of Aghasi et al. (2016) and Dong et al. (2017), our analysis correctly predicts the previously observed phenomenon that accuracy degrades exponentially with the remaining depth of the pruned layer.
Experiments on common feedforward architectures show that our method is orders of magnitude faster than competing pruning methods, while allowing for a controlled increase in GE.
1.2 Notation and Definitions
We use the following notation in the sequel: matrices, column vectors, scalars and sets are denoted by boldface uppercase letters, boldface lowercase letters, italic letters and calligraphic uppercase letters, respectively. The covering number of with metric balls of radius is denoted by . A regular dimensional manifold, where is a constant that captures "intrinsic" properties, is one that has a covering number .
2 Our formulation
2.1 DC decomposition
We consider a classification problem, where we observe a vector that has a corresponding class label . The set is called the input space, is called the label space and denotes the number of classes. The sample space is denoted by and an element of is denoted by . We assume that samples from are drawn according to a probability distribution defined on . A training set of samples drawn from is denoted by .
We start from the NetTrim formulation and show that it can be cast as a difference of convex functions problem. For each training signal we assume also that we have access to the inputs and the outputs of the fully connected layer, with a rectifier nonlinearity . The optimisation problem that we want to solve is then
(1) 
where is the sparsity parameter. The term ensures that the nonlinear projection remains the same for training signals. The term is the convex regulariser which imposes the desired structure on the weight matrix .
The objective in Equation 1 is nonconvex. We show that the optimisation of this objective can be cast as a difference of convex functions (DC) problem. We assume just one training sample , for simplicity, with latent representations and
(2) 
Notice that after the split the first term () is convex while the second () is concave. We note that
by definition of the ReLU and set
(3) 
(4) 
Then by summing over all the samples we get
(5) 
which is a difference of convex functions. The rectifier nonlinearity is nonsmooth, but we can alleviate that by assuming a smooth approximation. A common choice for this task is , with a positive constant.
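The split described above can be checked numerically. The following is a minimal sketch, assuming the squared error is expanded as (act(z) - y)^2 = [act(z)^2 + y^2] - [2 y act(z)] with non-negative targets y, and assuming a softplus of the form log(1 + exp(beta z)) / beta as the smooth surrogate; the exact expressions of the paper are not reproduced here, and the names `dc_split`, `softplus` and `beta` are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softplus(z, beta=10.0):
    # Assumed smooth surrogate for the ReLU; logaddexp avoids overflow
    # when beta * z is large.
    return np.logaddexp(0.0, beta * z) / beta

def dc_split(z, y, act=relu):
    """Return (g, h) such that (act(z) - y)**2 == g - h.

    For y >= 0 and a convex, non-negative activation, both g and h are
    convex functions of the pre-activation z.
    """
    a = act(z)
    g = a ** 2 + y ** 2   # square of a convex non-negative function, plus a constant
    h = 2.0 * y * a       # non-negative multiple of a convex function
    return g, h

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
y = relu(rng.normal(size=1000))   # targets are ReLU outputs, hence non-negative

g, h = dc_split(z, y)
print("split reproduces the squared error:", np.allclose(g - h, (relu(z) - y) ** 2))
print("max |softplus - relu| for beta=50:", np.max(np.abs(softplus(z, beta=50.0) - relu(z))))
```

The gap between this softplus and the ReLU is largest at z = 0, where it equals log(2) / beta, so a larger beta gives a tighter but less smooth approximation.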
2.2 Optimisation
It is well known that DC programs have efficient optimisation algorithms. We propose to use the DCA algorithm Tao & An (1997). DCA is an iterative algorithm that consists in solving, at each iteration, the convex optimisation problem obtained by linearizing (the nonconvex part of ) around the current solution. Although DCA is only guaranteed to reach local minima, the authors of Tao & An (1997) state that DCA often converges to the global minimum, and it has been used successfully to optimise a fully connected DNN layer Fawzi et al. (2015). At iteration of DCA, the linearized optimisation problem is given by
(6) 
where is the solution estimate at iteration . The detailed procedure is then given in Algorithms 1 and 2. We assume that the regulariser is convex but possibly nonsmooth, in which case the optimisation can be performed using proximal methods.
(7)
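The DCA outer loop can be sketched for a single output neuron as follows. This is an illustration under stated assumptions, not the paper's Algorithms 1 and 2: the objective is taken to be the mean squared ReLU error plus an l1 penalty, the exact (nonsmooth) ReLU is used, and the convex subproblem is solved approximately with plain proximal gradient steps (ISTA) rather than AccProxSVRG. The names `dca_prune`, `lam`, `outer_iters` and `inner_iters` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_threshold(u, t):
    # Proximal operator of t * ||u||_1.
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def objective(u, X, y, lam):
    # Assumed objective: (1/N) * sum_j (relu(x_j^T u) - y_j)^2 + lam * ||u||_1
    return np.mean((relu(X @ u) - y) ** 2) + lam * np.sum(np.abs(u))

def dca_prune(X, y, u0, lam, outer_iters=20, inner_iters=50):
    n = X.shape[0]
    # Lipschitz constant of the gradient of g(u) = (1/N) * sum_j relu(x_j^T u)^2.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / lipschitz
    u = u0.copy()
    history = [objective(u, X, y, lam)]
    for _ in range(outer_iters):
        # Linearise the concave part: a subgradient of
        # h(u) = (2/N) * sum_j y_j * relu(x_j^T u) at the current iterate.
        grad_h = 2.0 * X.T @ (y * (X @ u > 0)) / n
        # Approximately solve the convex subproblem
        # min_u g(u) - <grad_h, u> + lam * ||u||_1 with proximal gradient steps.
        for _ in range(inner_iters):
            grad_g = 2.0 * X.T @ relu(X @ u) / n
            u = soft_threshold(u - step * (grad_g - grad_h), step * lam)
        history.append(objective(u, X, y, lam))
    return u, history

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
u_dense = rng.normal(size=30)
y = relu(X @ u_dense)             # outputs of the original dense neuron
u_sparse, history = dca_prune(X, y, u_dense, lam=0.2)
print("objective before/after:", history[0], history[-1])
print("nonzero weights:", np.count_nonzero(u_sparse), "of", u_sparse.size)
```

Because each linearised subproblem majorises the original objective and touches it at the current iterate, the recorded objective values cannot increase from one outer iteration to the next when the step size is at most the inverse Lipschitz constant.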
In order to solve the linearized problem we propose to use Accelerated Proximal SVRG (AccProxSVRG), which was presented in Nitanda (2014). We detail this method in Algorithm 2b. At each iteration a minibatch and is drawn. The gradient for the smooth part is calculated and the algorithm takes a step in that direction with step size . Then the proximal operator for the nonsmooth regulariser is applied to the result. The hyperparameters for AccProxSVRG are the acceleration parameter and the gradient step . We have found that in our experiments, using and gives the best results.
We name our algorithm FeTa, Fast and Efficient Trimming Algorithm.
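The inner solver can be illustrated on a generic problem. The sketch below applies an accelerated proximal SVRG loop in the spirit of Nitanda (2014) to an l1-regularised least-squares problem, which stands in for the linearised subproblem; it is not the paper's Algorithm 2b. The step size, momentum, batch size and stage counts are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_objective(w, X, y, lam):
    return 0.5 * np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

def acc_prox_svrg(X, y, lam, step, momentum, batch=10, stages=10, inner=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    snapshot = np.zeros(d)
    for _ in range(stages):
        # Full gradient of the smooth part at the snapshot point.
        full_grad = X.T @ (X @ snapshot - y) / n
        w = snapshot.copy()
        z = snapshot.copy()   # extrapolated point
        for _ in range(inner):
            idx = rng.choice(n, size=batch, replace=False)
            Xb = X[idx]
            # Variance-reduced minibatch gradient evaluated at z.
            v = Xb.T @ (Xb @ (z - snapshot)) / batch + full_grad
            # Gradient step followed by the proximal operator of the regulariser.
            w_next = soft_threshold(z - step * v, step * lam)
            # Momentum (acceleration) step.
            z = w_next + momentum * (w_next - w)
            w = w_next
        snapshot = w
    return snapshot

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [1.5, -2.0, 1.0]
y = X @ w_true
step = 0.2 / np.max(np.sum(X ** 2, axis=1))   # conservative step from per-sample smoothness
w_hat = acc_prox_svrg(X, y, lam=0.01, step=step, momentum=0.5)
print("objective at zero:", lasso_objective(np.zeros(10), X, y, 0.01))
print("objective at w_hat:", lasso_objective(w_hat, X, y, 0.01))
```

Swapping `soft_threshold` for the proximal operator of another convex regulariser changes the structure imposed on the solution without touching the rest of the loop.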
3 Generalization Error
3.1 Generalization Error of Pruned Layer
Having optimized our pruned layer for the training set we want to see if it is stable for the test set. We denote the original representation and the pruned representation. We assume that after training . Second, we assume that . Third, the linear operators in , are frames with upper frame bounds , respectively.
Theorem 3.1.
For any testing point , the distance between the original representation and the pruned representation is bounded by where .
The detailed proof can be found in Appendix A.
3.2 Generalization Error of Classifier
In this section we use tools from the robustness framework Xu & Mannor (2012) to bound the generalization error of the new architecture induced by our pruning. We consider DNN classifiers defined as
(8) 
where is the th element of dimensional output of a DNN . We assume that is composed of layers
(9) 
where represents the th layer with parameters , . The output of the th layer is denoted , i.e. . The input layer corresponds to and the output of the last layer is denoted by . We then need the following two definitions of the classification margin and the score that we take from Sokolic et al. (2017). These will be useful later for measuring the generalization error.
Definition 3.1.
(Score). For a classifier a training sample has a score
(10) 
where is the Kronecker delta vector with , and is the output class for from classifier which can also be .
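As a concrete reading of the score, the sketch below assumes the form used in Sokolic et al. (2017): the square root of 2 times the smallest gap between the output of a reference class and the output of any other class. Whether the reference class is the true label or the predicted class is left to the caller; the function name and arguments are hypothetical.

```python
import numpy as np

def classification_score(outputs, ref_class):
    """sqrt(2) * min over j != ref_class of (outputs[ref_class] - outputs[j])."""
    others = np.delete(outputs, ref_class)
    return np.sqrt(2.0) * (outputs[ref_class] - np.max(others))

outputs = np.array([2.0, 0.5, 1.0])
print(classification_score(outputs, 0))   # positive: class 0 has the largest output
print(classification_score(outputs, 1))   # negative: class 1 is not the largest output
```

A positive score means the reference class wins against every other class, and its magnitude measures by how much.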
Definition 3.2.
(Training Sample Margin). For a classifier a training sample has a classification margin measured by the norm if
(11) 
The classification margin of a training sample is the radius of the largest metric ball (induced by the norm) in centered at that is contained in the decision region associated with the classification label . Note that it is possible for a classifier to misclassify a training point . We then restate a useful result from Sokolic et al. (2017).
Corollary 3.1.1.
Assume that is a (subset of) regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Then for any , with probability at least ,
(12) 
where and can be considered constants related to the data manifold and the training sample size, and .
We are now ready to state our main result.
Theorem 3.2.
Assume that is a (subset of) regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Furthermore assume that we prune classifier on layer using Algorithm 1, to obtain a new classifier . Then for any , with probability at least , when ,
(13) 
where and can be considered constants related to the data manifold and the training sample size, and .
The detailed proof can be found in Appendix B. The bound depends on two constants related to intrinsic properties of the data manifold, the regularity constant and the intrinsic data dimensionality . In particular the bound depends exponentially on the intrinsic data dimensionality . Thus more complex datasets are expected to lead to less robust DNNs. This has been recently observed empirically in Bartlett et al. (2017). The bound also depends on the spectral norm of the hidden layers . Small spectral norms lead to a larger base in and thus to tighter bounds.
With respect to pruning, our result is quite pessimistic as the pruning error is multiplied by the factor . Thus in our analysis the GE grows exponentially with respect to the remaining layer depth of the perturbed layer. This is in line with previous work Raghu et al. (2016) Han et al. (2015b) that demonstrates that layers closer to the input are much less robust compared to layers close to the output. Our algorithm is applied to the fully connected layers of a DNN, which are much closer to the output compared to convolutional layers.
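The multiplicative effect of the remaining depth can be illustrated with the standard Lipschitz inequality for ReLU networks: a perturbation of a hidden representation grows by at most the product of the spectral norms of the layers it still has to cross. The sketch below checks only this inequality on a random bias-free ReLU network, which is an assumption made for illustration; it does not evaluate the GE bound itself.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_from(h, weights):
    # Propagate a hidden representation through the remaining ReLU layers.
    for W in weights:
        h = relu(W @ h)
    return h

def gap_and_bound(h, delta, remaining):
    """Output deviation caused by delta, and its spectral-norm upper bound."""
    gap = np.linalg.norm(forward_from(h + delta, remaining) - forward_from(h, remaining))
    bound = np.prod([np.linalg.norm(W, 2) for W in remaining]) * np.linalg.norm(delta)
    return gap, bound

rng = np.random.default_rng(0)
width, depth = 64, 5
weights = [rng.normal(scale=np.sqrt(2.0 / width), size=(width, width)) for _ in range(depth)]
h = rng.normal(size=width)
delta = 1e-2 * rng.normal(size=width)   # stands in for a pruning-induced perturbation

for start in range(depth):
    gap, bound = gap_and_bound(h, delta, weights[start:])
    print(f"remaining layers: {depth - start}  deviation: {gap:.4e}  bound: {bound:.4e}")
```

When the spectral norms exceed one, the bound grows geometrically with the number of remaining layers, which is the mechanism behind the pessimistic dependence on depth.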
We can extend the above bound to include pruning of multiple layers.
Theorem 3.3.
Assume that is a (subset of) regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Furthermore assume that we prune classifier on all layers using Algorithm 1, to obtain a new classifier . Then for any , with probability at least , when ,
(14) 
where and can be considered constants related to the data manifold and the training sample size, and .
The detailed proof can be found in Appendix C. The bound predicts that when pruning multiple layers the GE will be much greater than the sum of the GEs for each individual pruning. We note also the generality of our result; even though we have assumed a specific form of pruning, the GE bound holds for any type of bounded perturbation to a hidden layer.
4 Experiments
We run a number of experiments to compare FeTa with LOBS and NetTrimADMM. All experiments were run on a MacBook Pro with a 2.8GHz Intel Core i7 CPU and 16GB of 1600 MHz DDR3 RAM.
4.1 Time Complexity
First we compare the execution time of FeTa with that of LOBS and NetTrimADMM. We set and aim for sparsity. We set to be the input dimensions, to be the output dimensions and to be the number of training samples. Assuming that each is Lipschitz smooth and is strongly convex, if we optimise for an optimal solution and set , FeTa scales like . We obtain this by multiplying the number of outer iterations with the number of gradient evaluations required to reach an good solution in inner Algorithm 2, and finally multiplying with the gradient evaluation cost. Conversely LOBS scales like while NetTrimADMM scales like due to the required Cholesky factorisation. This gives a computational advantage to our algorithm in settings where the input dimension is large. We validate this by constructing a toy dataset with , and . The samples and are generated with Gaussian entries. We plot in Figure 1 the results, which are in line with the theoretical predictions.
4.2 Classification Accuracy
4.2.1 Sparse Regularisation
In this section we perform experiments on the proposed compression scheme with feedforward neural networks. We compare the original fullprecision network (without compression) with the following compressed networks: (i) FeTa with (ii) NetTrim (iii) LOBS (iv) Hard Thresholding. We refer to the respective papers for NetTrim and LOBS. Hard Thresholding is defined as , where is the elementwise indicator function, is the Hadamard product and is a positive constant.
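Hard Thresholding as described above amounts to an elementwise mask. The sketch below assumes a strict inequality in the indicator (the original expression is not reproduced here) and adds a hypothetical helper, `threshold_for_sparsity`, that picks the constant from a target fraction of zeroed weights.

```python
import numpy as np

def hard_threshold(W, t):
    # Hadamard product of W with the elementwise indicator of |W| > t.
    return W * (np.abs(W) > t)

def threshold_for_sparsity(W, sparsity):
    # Magnitude below which the requested fraction of the entries falls.
    return np.quantile(np.abs(W), sparsity)

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))
t = threshold_for_sparsity(W, 0.9)
W_pruned = hard_threshold(W, t)
print("threshold:", t)
print("fraction of zeroed weights:", np.mean(W_pruned == 0))
```

Unlike the optimisation-based methods, this baseline looks only at weight magnitudes and never at the layer's inputs or outputs.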
Experiments were performed on two commonly used datasets:

MNIST: This contains gray images from ten digit classes. We use 55000 images for training, another 5000 for validation, and the remaining 10000 for testing. We use the LeNet5 model:
(15)
where is a ReLU convolution layer, is a max-pooling layer, is a fully connected layer and is a linear softmax layer.

CIFAR10: This contains 60000 color images for ten object classes. We use 50000 images for training and the remaining 10000 for testing. The training data is augmented by random cropping to pixels, random flips from left to right, contrast and brightness distortions to 200000 images. We use a smaller variant of the AlexNet model:
(16)
We first prune only the first fully connected layer (the one furthest from the output) for clarity. Figure 2 shows the classification accuracy vs compression ratio for FeTa, NetTrim, LOBS and Hard Thresholding. We see that Hard Thresholding works adequately up to sparsity. From this level of sparsity and above the performance of Hard Thresholding degrades rapidly; FeTa has higher accuracy on average while being the same or marginally worse than LOBS and NetTrim.
For the task of pruning the first fully connected layer we also show detailed comparison results for all methods in Table 1. For the LeNet5 model, FeTa achieves the same accuracy as NetTrim while being faster. This is expected as the two algorithms optimise a similar objective, while FeTa exploits the structure of the objective to achieve lower complexity in optimisation. Furthermore FeTa achieves marginally lower classification accuracy compared to LOBS while being faster, and is significantly better than Thresholding.
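Table 1: Pruning the first fully connected layer.
LeNet5    | Original | CR  | Pruned | Time
NetTrim   | 99.2%    | 95% | 95%    | 455s
LOBS      | 99.2%    | 95% | 97%    | 90s
Threshold | 99.2%    | 95% | 83%    |
FeTa      | 99.2%    | 95% |        | s
CifarNet  | Original | CR  | Pruned | Time
NetTrim   | 86%      |     |        |
LOBS      | 86%      | 90% | 83.4%  | 3h 15min
Threshold | 86%      | 90% | 73%    |
FeTa      | 86%      | 90% |        | min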
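Table 2: Pruning both fully connected layers.
LeNet5    | Original | CR  | Pruned | Time
NetTrim   | 99.2%    | 90% | 95%    | 500s
LOBS      | 99.2%    | 90% | 97%    | 97s
Threshold | 99.2%    | 90% | 64%    |
FeTa      | 99.2%    | 90% |        | s
CifarNet  | Original | CR  | Pruned | Time
NetTrim   | 86%      |     |        |
LOBS      | 86%      | 90% | 83.4%  | 3h 15min
Threshold | 86%      | 90% | 64%    |
FeTa      | 86%      | 90% |        | min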
For the CifarNet model we see in Table 1 that NetTrim is not feasible on the machine used for the experiments as it requires over 16GB of RAM. Compared to LOBS FeTa again achieves marginally lower accuracy but is faster.
Next we prune both the fully connected layers in the two architectures to the same sparsity level and show the results in Table 2. We lower the achieved sparsity for all methods to . For MNIST the accuracy results are the same as pruning a single layer, with FeTa achieving the same or marginally worse results while being faster than NetTrim and faster than LOBS. For the Cifar experiment FeTa shows a bigger degradation in performance compared to LOBS while remaining faster. Thresholding achieves a notably bad result of accuracy, which makes the method essentially inapplicable for multilayer pruning.
We note here that the degraded performance of FeTa for two layer pruning in Cifar is due to a poor solution for the second dense layer. By combining FeTa for the first dense layer and Thresholding for the second dense layer one can achieve accuracy for the same computational cost. Furthermore, as mentioned in Dong et al. (2017) and Wolfe et al. (2017), retraining can recover classification accuracy that was lost during pruning. Starting from a good pruning that does not allow for much degradation significantly reduces retraining time.
4.2.2 Low Rank Regularisation
As a proof of concept for the generality of our approach we apply our method while imposing low-rank regularisation on the learned matrix . For low rank we compare two methods: (i) FeTa with and optimised with AccProxSVRG and (ii) Hard Thresholding of singular values using the truncated SVD defined as . We plot the results in Figure 3. In the above, the Compression Ratio (CR) is defined as . The results are in line with the regularisation, with significant degradation in classification accuracy for Hard Thresholding above CR.
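The low-rank baseline can be sketched as follows, assuming that Hard Thresholding of singular values means keeping the largest `rank` singular values and zeroing the rest. The function name and the choice of parameterising by rank rather than by a threshold value are illustrative.

```python
import numpy as np

def truncated_svd(W, rank):
    # Keep only the `rank` largest singular values of W.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 30))
rank = 5
W_low = truncated_svd(W, rank)

s = np.linalg.svd(W, compute_uv=False)
print("rank of approximation:", np.linalg.matrix_rank(W_low))
print("Frobenius error:", np.linalg.norm(W - W_low))
print("norm of discarded singular values:", np.sqrt(np.sum(s[rank:] ** 2)))
```

Storing the two factors of a rank-r approximation of an m-by-n matrix takes r(m + n) numbers instead of mn, which is where the compression comes from.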
4.3 Generalization Error
According to our theoretical analysis, the GE grows exponentially with remaining layer depth. To corroborate this we train a LeNet5 to high accuracy, then we pick a single layer and gradually increase its sparsity using Hard Thresholding. We find that the layers closer to the input are exponentially less robust to pruning, in line with our theoretical analysis. We plot the results in Figure 4.a. For some layers there is a sudden increase in accuracy around sparsity, which could be due to the small size of the DNN. We point out that in empirical results Raghu et al. (2016) Han et al. (2015b) for much larger networks the degradation is entirely smooth.
Next we test our multilayer pruning bound. We prune to the same sparsity levels all layers in the sets , , , . We plot the results in Figure 4.b. It is evident that the accuracy loss for layer groups is not simply the addition of the accuracy losses of the individual layers, but shows an exponential drop in accordance with our theoretical result.
We now aim to see how well our bound captures this exponential behaviour. We take two networks pruned at layer 3 and an unpruned network and make a number of simplifying assumptions. First we assume that in Theorem 3.3 such that . This is logical as includes only log terms. Assuming that the bounds are tight we now aim to calculate
(17) 
We can use the above to make predictions for the GE of the pruned network by noting that as we know that for the unpruned network and we have managed to avoid the cumbersome parameter. Next we make the assumption that . Dimensionality values are common for the MNIST dataset and result from a simple dimensionality analysis using PCA. We also deviate slightly from our theory by using the minimum layer-wise error for each sparsity level, as well as the average scores . We plot the theoretical predictions for single layer pruning in Figure 4.a and the theoretical predictions for multilayer pruning in Figure 4.b. We see that, while loose, the theoretical predictions correctly capture qualitatively the behaviour of the GE. Specifically, layers, as predicted, are exponentially less robust with remaining layer depth. Also, as predicted, when pruning multiple layers the resulting GE is exponentially greater than the sum of the individual GEs.
5 Conclusion
In this paper we have presented an efficient pruning algorithm for fully connected layers of DNNs, based on difference of convex functions optimisation. Our algorithm is orders of magnitude faster than competing approaches while allowing for a controlled increase in the GE. We provided a theoretical analysis of the increase in GE resulting from bounded perturbations to the hidden layer weights, of which pruning is a special case. This analysis correctly predicts the previously observed phenomenon that network layers closer to the input are exponentially less robust to pruning compared to layers close to the output. Experiments on common feedforward architectures validated our results.
References
 Aghasi et al. (2016) Aghasi, Alireza, Nguyen, Nam, and Romberg, Justin. Net-trim: A layer-wise convex pruning of deep neural networks. arXiv preprint arXiv:1611.05162, 2016.
 Bartlett et al. (2017) Bartlett, Peter, Foster, Dylan J, and Telgarsky, Matus. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.
 Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Chen et al. (2015) Chen, Wenlin, Wilson, James, Tyree, Stephen, Weinberger, Kilian, and Chen, Yixin. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294, 2015.
 Courbariaux et al. (2016) Courbariaux, Matthieu, Hubara, Itay, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 Dong et al. (2017) Dong, Xin, Chen, Shangyu, and Pan, Sinno Jialin. Learning to prune deep neural networks via layer-wise optimal brain surgeon. arXiv preprint arXiv:1705.07565, 2017.

 Fawzi et al. (2015) Fawzi, Alhussein, Davies, Mike, and Frossard, Pascal. Dictionary learning for fast classification based on soft-thresholding. International Journal of Computer Vision, 114(2-3):306–321, 2015.
 Han et al. (2015a) Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. (2015b) Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015b.
 Hou et al. (2016) Hou, Lu, Yao, Quanming, and Kwok, James T. Loss-aware binarization of deep networks. arXiv preprint arXiv:1611.01600, 2016.
 Kim et al. (2015) Kim, Yong-Deok, Park, Eunhyeok, Yoo, Sungjoo, Choi, Taelim, Yang, Lu, and Shin, Dongjun. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
 LeCun et al. (2015) LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.
 Louizos et al. (2017) Louizos, Christos, Ullrich, Karen, and Welling, Max. Bayesian compression for deep learning. arXiv preprint arXiv:1705.08665, 2017.
 Molchanov et al. (2017) Molchanov, Dmitry, Ashukha, Arsenii, and Vetrov, Dmitry. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
 Neyshabur et al. (2017) Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, and Srebro, Nathan. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
 Nitanda (2014) Nitanda, Atsushi. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pp. 1574–1582, 2014.
 Novikov et al. (2015) Novikov, Alexander, Podoprikhin, Dmitrii, Osokin, Anton, and Vetrov, Dmitry P. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 Raghu et al. (2016) Raghu, Maithra, Poole, Ben, Kleinberg, Jon, Ganguli, Surya, and Sohl-Dickstein, Jascha. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
 Sokolic et al. (2017) Sokolic, Jure, Giryes, Raja, Sapiro, Guillermo, and Rodrigues, Miguel RD. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 2017.
 Tao & An (1997) Tao, Pham Dinh and An, Le Thi Hoai. Convex analysis approach to dc programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
 Wolfe et al. (2017) Wolfe, Nikolas, Sharma, Aditya, Drude, Lukas, and Raj, Bhiksha. The incredible shrinking neural network: New perspectives on learning representations through the lens of pruning. arXiv preprint arXiv:1701.04465, 2017.
 Xu & Mannor (2012) Xu, Huan and Mannor, Shie. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
 Yang et al. (2015) Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483, 2015.