FeTa: A DCA Pruning Algorithm with Generalization Error Guarantees

03/12/2018 ∙ by Konstantinos Pitas, et al.

Recent DNN pruning algorithms have succeeded in reducing the number of parameters in fully connected layers, often with little or no drop in classification accuracy. However, most of the existing pruning schemes either have to be applied during training or require a costly retraining procedure after pruning to regain classification accuracy. We start by proposing a cheap pruning algorithm for fully connected DNN layers based on difference of convex functions (DC) optimisation, that requires little or no retraining. We then provide a theoretical analysis for the growth in the Generalization Error (GE) of a DNN for the case of bounded perturbations to the hidden layers, of which weight pruning is a special case. Our pruning method is orders of magnitude faster than competing approaches, while our theoretical analysis sheds light on previously observed problems in DNN pruning. Experiments on common feedforward neural networks validate our results.


1 Introduction

Recently, deep neural networks have achieved state-of-the-art results in a number of machine learning tasks LeCun et al. (2015). Training such networks is computationally intensive and often requires dedicated and expensive hardware. Furthermore, the resulting networks often require a considerable amount of memory to be stored. Using a Pascal Titan X GPU, the popular AlexNet and VGG-16 models require 13 hours and 7 days, respectively, to train, while requiring 200MB and 600MB, respectively, to store. The large memory requirements limit the use of DNNs in embedded systems and portable devices such as smartphones, which are now ubiquitous.

A number of approaches have been proposed to reduce the DNN size during training time, often with little or no degradation to classification performance. Approaches include introducing Bayesian, sparsity-inducing priors Louizos et al. (2017); Blundell et al. (2015); Molchanov et al. (2017) and binarization Hou et al. (2016); Courbariaux et al. (2016). Other methods include the hashing trick used in Chen et al. (2015), tensorisation Novikov et al. (2015) and efficient matrix factorisations Yang et al. (2015).

However, trained DNN models are often used by researchers and developers who do not have dedicated hardware to train them, frequently as general feature extractors for transfer learning. In such settings it is important to introduce a cheap compression method, i.e., one that can be implemented as a postprocessing step with little or no retraining. Some first work in this direction has been Kim et al. (2015); Han et al. (2015a); Han et al. (2015b), although these still require a lengthy retraining procedure. Closer to our approach, the authors in Aghasi et al. (2016) recently proposed a convexified layerwise pruning algorithm termed Net-Trim. Building upon Net-Trim, the authors in Dong et al. (2017) propose LOBS, an algorithm for layerwise pruning by loss function approximation.

Pruning a neural network layer introduces a perturbation to the latent signal representations generated by that layer. As the perturbed signal passes through layers of non-linear projections, the perturbation can become arbitrarily large. DNN robustness to hidden layer perturbations has been investigated for random noise in Raghu et al. (2016). For the case of pruning, the authors of Aghasi et al. (2016) and Dong et al. (2017) conduct a theoretical analysis using the Lipschitz properties of DNNs, showing the stability of the latent representations over the training set after pruning. The methods employed have connections to recent work Sokolic et al. (2017); Bartlett et al. (2017); Neyshabur et al. (2017) that has used Lipschitz properties to analyze the Generalization Error (GE) of DNNs, a more useful performance measure.

1.1 Contributions

In this work we introduce a cheap pruning algorithm for dense layers of DNNs. We also conduct a theoretical analysis of how pruning affects the Generalization Error of the trained classifier.

  • We show that the sparsity-inducing objective proposed in Aghasi et al. (2016) can be cast as a difference of convex functions problem that has an efficient solution. For a fully connected layer with input dimension , output dimension and training samples, Net-Trim and LOBS scale like and , respectively. Our iterative algorithm scales like , where is the precision of the solution, is related to the Lipschitz and strong convexity constants, and is the outer iteration number. Empirically, our algorithm is orders of magnitude faster than competing approaches. We also extend our formulation to allow retraining a layer with any convex regulariser.

  • We build upon the work of Sokolic et al. (2017) to bound the GE of a DNN for the case of bounded perturbations to the hidden layer weights, of which pruning is a special case. Our theoretical analysis provides a principled way of pruning while managing the GE. In sharp contrast to the analysis of Aghasi et al. (2016) and Dong et al. (2017) our analysis correctly predicts the previously observed phenomenon that accuracy degrades exponentially with the remaining depth of the pruned layer.

Experiments on common feedforward architectures show that our method is orders of magnitude faster than competing pruning methods, while allowing for a controlled increase in GE.

1.2 Notation and Definitions

We use the following notation in the sequel: matrices, column vectors, scalars and sets are denoted by boldface upper-case letters, boldface lower-case letters, italic letters and calligraphic upper-case letters, respectively. The covering number of with -metric balls of radius is denoted by . A -regular -dimensional manifold, where is a constant that captures "intrinsic" properties, is one that has a covering number .

2 Our formulation

2.1 DC decomposition

We consider a classification problem, where we observe a vector that has a corresponding class label . The set is called the input space, is called the label space and denotes the number of classes. The sample space is denoted by and an element of is denoted by . We assume that samples from are drawn according to a probability distribution defined on . A training set of samples drawn from is denoted by .

We start from the Net-Trim formulation and show that it can be cast as a difference of convex functions problem. For each training signal we also assume that we have access to the inputs and the outputs of the fully connected layer, with a rectifier non-linearity . The optimisation problem that we want to solve is then

(1)

where is the sparsity parameter. The term ensures that the nonlinear projection remains the same for training signals. The term is the convex regulariser which imposes the desired structure on the weight matrix .
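As a concrete illustration, one plausible form of such a layerwise objective (written in our own notation, not necessarily the exact display in (1): $\mathbf{W}$ denotes the retrained layer weights, $\mathbf{x}_i$ and $\mathbf{y}_i$ the recorded inputs and outputs of the layer, $\lambda$ the sparsity parameter and $\Omega$ the convex regulariser) is

$$\min_{\mathbf{W}} \; \frac{1}{2}\sum_{i=1}^{N} \big\| \sigma(\mathbf{W}^{\top}\mathbf{x}_i) - \mathbf{y}_i \big\|_2^2 \;+\; \lambda\,\Omega(\mathbf{W}), \qquad \sigma(z) = \max(0, z).$$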

The objective in Equation 1 is non-convex. We show that the optimisation of this objective can be cast as a difference of convex functions (DC) problem. We assume just one training sample , for simplicity, with latent representations and

(2)

Notice that after the split the first term is convex while the second is concave. We note that by definition of the ReLU and set

(3)
(4)

Then by summing over all the samples we get

(5)

which is a difference of convex functions. The rectifier nonlinearity is non-smooth, but we can alleviate this by assuming a smooth approximation. A common choice for this task is , with a positive constant.
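To make the decomposition concrete, the following sketch evaluates the two parts of the objective for a single sample using a softplus-style smooth approximation of the rectifier; this is an illustration under assumed names and notation (`W`, `x`, `y`, `lam`, `beta`), not the paper's reference code.

```python
import numpy as np

def smooth_relu(z, beta=10.0):
    # Smooth approximation of the rectifier; approaches max(0, z) as beta grows.
    return np.logaddexp(0.0, beta * z) / beta

def g_part(W, x, y, lam=0.1, beta=10.0):
    # Convex part: squared norm of the rectified projection, the constant ||y||^2,
    # and an l1 regulariser standing in for the generic convex regulariser.
    s = smooth_relu(W.T @ x, beta)
    return s @ s + y @ y + lam * np.abs(W).sum()

def h_part(W, x, y, beta=10.0):
    # Part subtracted from g; it is convex in W because y >= 0 (y is a ReLU output),
    # so -h is concave and f = g - h is a valid DC decomposition.
    return 2.0 * y @ smooth_relu(W.T @ x, beta)

def dc_objective(W, x, y, lam=0.1, beta=10.0):
    # f(W) = g(W) - h(W), approximately ||sigma(W^T x) - y||^2 + lam * ||W||_1.
    return g_part(W, x, y, lam, beta) - h_part(W, x, y, beta)
```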

2.2 Optimisation

It is well known that DC programs have efficient optimisation algorithms. We propose to use the DCA algorithm Tao & An (1997). DCA is an iterative algorithm that consists in solving, at each iteration, the convex optimisation problem obtained by linearizing the non-convex part of the objective around the current solution. Although DCA is only guaranteed to reach a local minimum, the authors of Tao & An (1997) state that DCA often converges to the global minimum, and it has been used successfully to optimise a fully connected DNN layer Fawzi et al. (2015). At iteration of DCA, the linearized optimisation problem is given by

(6)

where is the solution estimate at iteration . The detailed procedure is given in Algorithms 1 and 2. We assume that the regulariser is convex but possibly non-smooth, in which case the optimisation can be performed using proximal methods.

1:  Choose initial point:
2:  for k = 1,…,K do
3:     Compute .
4:     Solve with Algorithm 2 the convex optimisation problem:
(7)
5:  end for
6:  If return .
Algorithm 1 FeTa (Fast and Efficient Trimming Algorithm)
1:  Initialization:
2:  for s = 1,…,S do
3:     
4:     
5:     for t = 1,2,…,T do
6:        Choose a random minibatch.
7:        
8:        
9:        
10:     end for
11:     
12:  end for
13:  Return
Algorithm 2 Acc-Prox-SVRG

In order to solve the linearized problem we propose to use Accelerated Proximal SVRG (Acc-Prox-SVRG), which was presented in Nitanda (2014). We detail this method in Algorithm 2. At each iteration a minibatch is drawn. The gradient for the smooth part is calculated and the algorithm takes a step in that direction with step size . Then the proximal operator for the non-smooth regulariser is applied to the result. The hyperparameters for Acc-Prox-SVRG are the acceleration parameter and the gradient step . We have found that, in our experiments, using and gives the best results.

We name our algorithm FeTa, Fast and Efficient Trimming Algorithm.
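As a rough end-to-end illustration, the sketch below runs the outer DCA loop for an ℓ1 regulariser, replacing the inner Acc-Prox-SVRG solver with plain proximal gradient descent; the function names, step sizes and matrix shapes are our assumptions, so this is a simplified stand-in for Algorithms 1 and 2 rather than the reference implementation.

```python
import numpy as np

def smooth_relu(Z, beta=10.0):
    return np.logaddexp(0.0, beta * Z) / beta

def smooth_relu_grad(Z, beta=10.0):
    # Derivative of the smooth rectifier: a sigmoid in beta * Z.
    return 1.0 / (1.0 + np.exp(-beta * Z))

def soft_threshold(W, tau):
    # Proximal operator of tau * ||W||_1.
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def feta_sketch(W0, X, Y, lam=0.1, beta=10.0, K=5, T=200, lr=1e-3):
    """Simplified FeTa-style loop. X is (d_in, N), Y = relu(W0.T @ X) is (d_out, N)."""
    W = W0.copy()
    for _ in range(K):                                # outer DCA iterations
        # Gradient of the concave part h(W) = 2 <Y, sigma(W^T X)> at the current iterate.
        Gh = 2.0 * X @ (Y * smooth_relu_grad(W.T @ X, beta)).T
        for _ in range(T):                            # inner convex subproblem
            Z = W.T @ X
            # Gradient of the smooth convex part ||sigma(W^T X)||_F^2, minus the linearization term.
            G = 2.0 * X @ (smooth_relu(Z, beta) * smooth_relu_grad(Z, beta)).T - Gh
            W = soft_threshold(W - lr * G, lr * lam)  # proximal gradient step
    return W
```

Swapping the inner loop for Acc-Prox-SVRG would replace the full gradient with variance-reduced minibatch gradients plus the acceleration term, leaving the outer DCA structure unchanged.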

3 Generalization Error

3.1 Generalization Error of Pruned Layer

Having optimised the pruned layer on the training set, we want to see whether it is stable on the test set. We denote the original representation by and the pruned representation by . First, we assume that after training . Second, we assume that . Third, the linear operators in and are frames with upper frame bounds and , respectively.

Theorem 3.1.

For any testing point , the distance between the original representation and the pruned representation is bounded by where .

The detailed proof can be found in Appendix A.

3.2 Generalization Error of Classifier

In this section we use tools from the robustness framework Xu & Mannor (2012) to bound the generalization error of the new architecture induced by our pruning. We consider DNN classifiers defined as

(8)

where is the th element of the -dimensional output of a DNN . We assume that is composed of layers

(9)

where represents the th layer with parameters , . The output of the th layer is denoted , i.e. . The input layer corresponds to and the output of the last layer is denoted by . We then need the following two definitions of the classification margin and the score that we take from Sokolic et al. (2017). These will be useful later for measuring the generalization error.
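For concreteness, a layered composition of this kind can be written in the following standard form; the symbols here are our own and are not necessarily those used in (9).

$$f(\mathbf{x}) = f_L\big(f_{L-1}(\cdots f_1(\mathbf{x}))\big), \qquad \mathbf{z}^{l} = f_l(\mathbf{z}^{l-1}), \quad \mathbf{z}^{0} = \mathbf{x}, \quad l = 1, \dots, L.$$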

Definition 3.1.

(Score). For a classifier a training sample has a score

(10)

where is the Kronecker delta vector with , and is the output class for from classifier which can also be .

Definition 3.2.

(Training Sample Margin). For a classifier a training sample has a classification margin measured by the norm if

(11)

The classification margin of a training sample is the radius of the largest metric ball (induced by the norm) in centered at that is contained in the decision region associated with the classification label . Note that it is possible for a classifier to misclassify a training point . We then restate a useful result from Sokolic et al. (2017).

Corollary 3.1.1.

Assume that is a (subset of) -regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Then for any , with probability at least ,

(12)

where and can be considered constants related to the data manifold and the training sample size, and .

We are now ready to state our main result.

Theorem 3.2.

Assume that is a (subset of) -regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Furthermore assume that we prune classifier on layer using Algorithm 1, to obtain a new classifier . Then for any , with probability at least , when ,

(13)

where and can be considered constants related to the data manifold and the training sample size, and .

The detailed proof can be found in Appendix B. The bound depends on two constants related to intrinsic properties of the data manifold, the regularity constant and the intrinsic data dimensionality . In particular, the bound depends exponentially on the intrinsic data dimensionality . Thus more complex datasets are expected to lead to less robust DNNs. This has recently been observed empirically in Bartlett et al. (2017). The bound also depends on the spectral norm of the hidden layers . Small spectral norms lead to a larger base in and thus to tighter bounds.

With respect to pruning, our result is quite pessimistic, as the pruning error is multiplied by the factor . Thus, in our analysis, the GE grows exponentially with respect to the remaining layer depth of the perturbed layer. This is in line with previous work Raghu et al. (2016); Han et al. (2015b) demonstrating that layers closer to the input are much less robust compared to layers close to the output. Our algorithm is applied to the fully connected layers of a DNN, which are much closer to the output compared to convolutional layers.
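As a quick way to gauge this amplification numerically, one can multiply the spectral norms of the layers that the perturbation still has to traverse; the sketch below is our own illustration (the names `layer_weights` and `pruned_idx` are hypothetical) and not part of the paper's method.

```python
import numpy as np

def remaining_depth_amplification(layer_weights, pruned_idx):
    """Product of spectral norms of the layers after the pruned one.

    A bounded perturbation introduced at layer `pruned_idx` can be amplified by at
    most this factor on its way to the output; 1-Lipschitz nonlinearities such as
    ReLU and max-pooling do not increase it further.
    """
    norms = [np.linalg.norm(W, 2) for W in layer_weights[pruned_idx + 1:]]
    return float(np.prod(norms)) if norms else 1.0
```

Layers close to the input have more downstream layers, so the factor, and hence the predicted growth in GE, is larger for them.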

We can extend the above bound to include pruning of multiple layers.

Theorem 3.3.

Assume that is a (subset of) -regular k-dimensional manifold, where . Assume also that the DNN classifier achieves a lower bound to the classification score and take to be the loss. Furthermore assume that we prune classifier on all layers using Algorithm 1, to obtain a new classifier . Then for any , with probability at least , when ,

(14)

where and can be considered constants related to the data manifold and the training sample size, and .

The detailed proof can be found in Appendix C. The bound predicts that when pruning multiple layers the GE will be much greater than the sum of the GEs for each individual pruning. We note also the generality of our result; even though we have assumed a specific form of pruning, the GE bound holds for any type of bounded perturbation to a hidden layer.

4 Experiments

We perform a number of experiments to compare FeTa with LOBS and NetTrim-ADMM. All experiments were run on a MacBook Pro with a 2.8GHz Intel Core i7 CPU and 16GB of 1600MHz DDR3 RAM.

4.1 Time Complexity

First we compare the execution time of FeTa with that of LOBS and NetTrim-ADMM. We set and aim for sparsity. We set to be the input dimension, to be the output dimension and to be the number of training samples. Assuming that each is -Lipschitz smooth and is -strongly convex, if we optimise for an optimal solution and set , FeTa scales like . We obtain this by multiplying the number of outer iterations with the number of gradient evaluations required to reach a good solution in the inner Algorithm 2, and finally multiplying with the gradient evaluation cost. Conversely, LOBS scales like while NetTrim-ADMM scales like due to the required Cholesky factorisation. This gives a computational advantage to our algorithm in settings where the input dimension is large. We validate this by constructing a toy dataset with , and . The samples and are generated with Gaussian entries. We plot the results in Figure 1; they are in line with the theoretical predictions.

4.2 Classification Accuracy

Figure 1: Time Complexity: We plot the calculation time for FeTa, NetTrim and LOBS on the toy dataset. We see that the computation time is in line with the theoretical predictions. FeTa scales roughly as while NetTrim and LOBS scale like and . As the input dimension increases, FeTa becomes orders of magnitude faster than the competing approaches.

4.2.1 Sparse Regularisation

In this section we perform experiments on the proposed compression scheme with feedforward neural networks. We compare the original full-precision network (without compression) with the following compressed networks: (i) FeTa with , (ii) Net-Trim, (iii) LOBS, and (iv) Hard Thresholding. We refer to the respective papers for Net-Trim and LOBS. Hard Thresholding is defined as , where is the elementwise indicator function, is the Hadamard product and is a positive constant.
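For reference, the Hard Thresholding baseline amounts to a single elementwise operation; the following minimal sketch (with our own helper names and a hypothetical threshold parameter `t`) keeps every weight whose magnitude exceeds the threshold.

```python
import numpy as np

def hard_threshold(W, t):
    # Zero out all entries of W whose magnitude is at most t.
    return W * (np.abs(W) > t)

def threshold_for_sparsity(W, sparsity):
    # Choose t so that roughly the requested fraction of entries is zeroed out.
    return np.quantile(np.abs(W), sparsity)
```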

Experiments were performed on two commonly used datasets:

  1. MNIST: This contains grayscale images from ten digit classes. We use 55000 images for training, another 5000 for validation, and the remaining 10000 for testing. We use the LeNet-5 model:

    (15)

    where is a ReLU convolution layer, is a max-pooling layer, is a fully connected layer and is a linear softmax layer.

  2. CIFAR-10: This contains 60000 color images from ten object classes. We use 50000 images for training and the remaining 10000 for testing. The training data is augmented by random cropping to pixels, random flips from left to right, and contrast and brightness distortions, yielding 200000 images. We use a smaller variant of the AlexNet model:

    (16)

We first prune only the first fully connected layer (the one furthest from the output) for clarity. Figure 2 shows the classification accuracy vs compression ratio for FeTa, NetTrim, LOBS and Hard Thresholding. We see that Hard Thresholding works adequately up to sparsity. Above this level of sparsity the performance of Hard Thresholding degrades rapidly, while FeTa has higher accuracy on average, being the same as or marginally worse than LOBS and NetTrim.

Figure 2: Accuracy vs Sparsity: (a) We plot the classification accuracy of the pruned LeNet-5 architecture for different sparsity levels. Up to the 80% sparsity level all methods are roughly equal. For sparsity levels greater than 80%, FeTa clearly outperforms Hard Thresholding while remaining competitive with LOBS. (b) We plot the classification accuracy of the pruned CifarNet architecture for different sparsity levels. The results are consistent with the LeNet-5 experiment.

For the task of pruning the first fully connected layer we also show detailed comparison results for all methods in Table 1. For the LeNet-5 model, FeTa achieves the same accuracy as Net-Trim while being faster. This is expected as the two algorithms optimise a similar objective, while FeTa exploits the structure of the objective to achieve lower complexity in optimisation. Furthermore FeTa achieves marginally lower classification accuracy compared to LOBS while being faster, and is significantly better than Thresholding.


LeNet-5 Original CR Pruned Time
Net-Trim 99.2% 95% 95% 455s
LOBS 99.2% 95% 97% 90s
Threshold 99.2% 95% 83% -
FeTa 99.2% 95% s
CifarNet Original CR Pruned Time
Net-Trim 86% - - -
LOBS 86% 90% 83.4% 3h 15min
Threshold 86% 90% 73% -
FeTa 86% 90% min
Table 1: Test accuracy rates (%) when pruning only the first fully connected layer.

LeNet-5 Original CR Pruned Time
Net-Trim 99.2% 90% 95% 500s
LOBS 99.2% 90% 97% 97s
Threshold 99.2% 90% 64% -
FeTa 99.2% 90% s
CifarNet Original CR Pruned Time
Net-Trim 86% - - -
LOBS 86% 90% 83.4% 3h 15min
Threshold 86% 90% 64% -
FeTa 86% 90% min
Table 2: Test accuracy rates (%) when pruning all fully connected layers.

For the CifarNet model we see in Table 1 that Net-Trim is not feasible on the machine used for the experiments as it requires over 16GB of RAM. Compared to LOBS FeTa again achieves marginally lower accuracy but is faster.

Next we prune both fully connected layers in the two architectures to the same sparsity level and show the results in Table 2. We lower the achieved sparsity for all methods to . For MNIST the accuracy results are the same as for pruning a single layer, with FeTa achieving the same or marginally worse results while being faster than Net-Trim and faster than LOBS. For the Cifar experiment FeTa shows a bigger degradation in performance compared to LOBS while remaining faster. Thresholding achieves a notably bad result of accuracy, which makes the method essentially inapplicable for multilayer pruning.

We note here that the degraded performance of FeTa for two-layer pruning on Cifar is due to a poor solution for the second dense layer. By combining FeTa for the first dense layer and Thresholding for the second dense layer one can achieve accuracy for the same computational cost. Furthermore, as mentioned in Dong et al. (2017) and Wolfe et al. (2017), retraining can recover classification accuracy that was lost during pruning. Starting from a good pruning that does not allow for much degradation significantly reduces retraining time.

4.2.2 Low Rank Regularisation

As a proof of concept for the generality of our approach, we apply our method while imposing low-rank regularisation on the learned matrix . For low rank we compare two methods: (i) FeTa with , optimised with Acc-Prox-SVRG, and (ii) Hard Thresholding of singular values using the truncated SVD, defined as . We plot the results in Figure 3.
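The truncated-SVD baseline keeps only the leading singular components; the sketch below is our own illustration (the rank argument `r` is hypothetical) rather than the exact operator used in the experiments.

```python
import numpy as np

def truncated_svd(W, r):
    # Best rank-r approximation of W in the Frobenius norm.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```

If the layer is stored in factored form, a rank-r approximation of a d1-by-d2 matrix keeps r(d1 + d2) numbers instead of d1·d2, which is one common way to define a compression ratio.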

Figure 3: Accuracy vs CR: (a) LeNet-5: We plot the classification accuracy of the low-rank compressed LeNet-5 architecture for different CR levels. Up to the 85% CR level all methods are roughly equal. For CR levels greater than 85%, FeTa clearly outperforms Hard Thresholding. (b) CifarNet: We plot the classification accuracy of the compressed CifarNet architecture for different CR levels. The results are consistent with the LeNet-5 experiment.

In the above, the Compression Ratio (CR) is defined as . The results are in line with those for the sparse regularisation, with significant degradation in classification accuracy for Hard Thresholding above CR.

4.3 Generalization Error

Figure 4: Layer Robustness: We plot the theoretical prediction for the GE (dashed lines) and the empirical value of the GE (solid lines) for single layer pruning (a) and multilayer pruning (b). Our theoretical predictions are tight for layers with small remaining depth but are loose for layers with large remaining depth. We first focus on pruning for sparsity. Layer is, as predicted, exponentially less robust compared to layers . We then focus on pruning layer and layers for 80% sparsity. We see that even though the GE errors for are negligible, the GE error for is exponentially greater than the sum of the GEs when pruning and . Interestingly, in the empirical GE estimate there exists an artifact around 90% sparsity which is partially captured by our prediction.

According to our theoretical analysis the GE grows exponentially with the remaining layer depth. To corroborate this, we train a LeNet-5 to high accuracy, then pick a single layer and gradually increase its sparsity using Hard Thresholding. We find that the layers closer to the input are exponentially less robust to pruning, in line with our theoretical analysis. We plot the results in Figure 4.a. For some layers there is a sudden increase in accuracy around sparsity, which could be due to the small size of the DNN. We point out that in the empirical results of Raghu et al. (2016); Han et al. (2015b) for much larger networks the degradation is entirely smooth.

Next we test our multilayer pruning bound. We prune all layers in the sets , , , to the same sparsity levels. We plot the results in Figure 4.b. It is evident that the accuracy loss for layer groups is not simply the sum of the accuracy losses of the individual layers, but shows an exponential drop, in accordance with our theoretical result.

We now aim to see how well our bound captures this exponential behaviour. We take two networks pruned at layer 3 and an unpruned network and make a number of simplifying assumptions. First we assume that in Theorem 3.3 such that . This is logical as includes only log terms. Assuming that the bounds are tight we now aim to calculate

(17)

We can use the above to make predictions for the GE of the pruned network by noting that as we know that for the unpruned network and we have managed to avoid the cumbersome parameter. Next we make the assumption that . Dimensionality values are common for the MNIST dataset and result from a simple dimensionality analysis using PCA. We also deviate slightly from our theory by using the minimum layerwise error for each sparsity level, as well as the average scores . We plot the theoretical predictions for single layer pruning in Figure 4.a and the theoretical predictions for multilayer pruning in Figure 4.b. We see that, while loose, the theoretical predictions correctly capture the qualitative behaviour of the GE. Specifically, as predicted, layers are exponentially less robust with increasing remaining layer depth. Also, as predicted, when pruning multiple layers the resulting GE is exponentially greater than the sum of the individual GEs.

5 Conclusion

In this paper we have presented an efficient pruning algorithm for fully connected layers of DNNs, based on difference of convex functions optimisation. Our algorithm is orders of magnitude faster than competing approaches while allowing for a controlled increase in the GE. We provided a theoretical analysis of the increase in GE resulting from bounded perturbations to the hidden layer weights, of which pruning is a special case. This analysis correctly predicts the previously observed phenomenon that network layers closer to the input are exponentially less robust to pruning compared to layers close to the output. Experiments on common feedforward architectures validated our results.

References