Artificial Neural Networks (ANNs) achieve state-of-the-art performance in several tasks at the price of complex topologies with millions of learnable parameters. As an example, ResNet He et al. (2016) includes tens of millions of parameters, soaring to hundreds of millions for VGG-Net Simonyan and Zisserman (2014). A large parameter count, however, jeopardizes the possibility of deploying a network on a memory-constrained (e.g., embedded, mobile) device, calling for leaner architectures with fewer parameters.
The complexity of ANNs can be reduced by enforcing a sparse network topology. Namely, some connections between neurons can be pruned by setting the corresponding parameters to zero. Besides the reduction of parameters, some works also suggested other benefits coming from pruning ANNs, like improving the performance in transfer learning scenarios Liu et al. (2017). Popular methods such as Han et al. (2015), for example, introduce a regularization term in the cost function with the goal of shrinking some parameters to zero. Next, a threshold operator pinpoints the shrunk parameters to zero, eventually enforcing the sought sparse topology. However, such methods require that the topology to be pruned has been preliminarily trained via standard gradient descent, which adds to the total learning time.
This work contributes LOBSTER (LOss-Based SensiTivity rEgulaRization), a method for learning sparse neural topologies. In this context, let us define the sensitivity of a parameter of an ANN as the derivative of the loss function with respect to that parameter. Intuitively, low-sensitivity parameters have a negligible impact on the loss function when perturbed, and so are fit to be shrunk without compromising the network performance. Practically, LOBSTER shrinks to zero parameters with low sensitivity with a regularize-and-prune approach, achieving a sparse network topology. Unlike similar approaches in the literature Han et al. (2016); Guo et al. (2016); Gomez et al. (2019), LOBSTER does not require a preliminary training stage to learn the dense reference topology to prune. Moreover, differently from other sensitivity-based approaches, LOBSTER computes the sensitivity by exploiting the already available gradient of the loss function, avoiding additional derivative computations Mozer and Smolensky (1989); Tartaglione et al. (2018), or second-order derivatives LeCun et al. (1990). Our experiments, performed over different network topologies and datasets, show that LOBSTER outperforms several competitors in multiple tasks.
The rest of this paper is organized as follows. In Sec. 2 we review the relevant literature concerning sparse neural architectures. Next, in Sec. 3 we describe our method for training a neural network such that its topology is sparse. We provide a general overview on the technique in Sec. 4. Then, in Sec. 5 we experiment with our proposed training scheme over some deep ANNs on a number of different datasets. Finally, Sec. 6 draws the conclusions while providing further directions for future research.
2 Related Works
It is well known that many ANNs, trained on some tasks, are typically over-parametrized Mhaskar and Poggio (2016); Brutzkus et al. (2018).
There are many ways to reduce the size of an ANN.
In this work we focus on the so-called pruning problem: it consists in detecting and removing parameters from the ANN without excessively affecting its performance.
In a recent work Frankle and Carbin (2019), it has been observed that only a few parameters are actually updated during training: this suggests that all the other parameters can be removed from the learning process without affecting the performance.
Although similar approaches had already been taken years earlier Karnin (1990), their finding revived the research interest in this topic.
Much effort is devoted to making pruning mechanisms more efficient: for example, Wang et al. show that some sparsity is achievable by pruning weights at the very beginning of the training process Wang et al. (2020), while Lee et al., with their “SNIP”, are able to prune weights in a one-shot fashion Lee et al. (2019). However, these approaches achieve limited sparsity: iterative pruning-based strategies, when compared to one-shot or few-shot approaches, are able to achieve a higher sparsity Tartaglione et al. (2020). Although recent technological advances make this problem timely and relevant to the community working on ANN architecture optimization, it has its roots in the past.
In LeCun et al. (1990), the information from the second-order derivative of the error function is leveraged to rank the parameters of the trained model on a saliency basis: this allows selecting a trade-off between the size of the network (in terms of number of parameters) and its performance. In the same years, Mozer and Smolensky proposed skeletonization, a technique to identify, on a trained model, the less relevant neurons, and to remove them Mozer and Smolensky (1989). This is accomplished by evaluating the global effect of removing a given neuron, measured as the error function penalty with respect to a pre-trained model.
Recent technological advances allow ANN models to be very large, and pose questions about the efficiency of pruning algorithms: the target of the technique is to achieve the highest sparsity (i.e., the maximum percentage of removed parameters) with minimal performance loss (accuracy loss with respect to the “un-pruned” model). Towards this end, a number of different approaches to pruning exist.
Dropout-based approaches constitute another possibility to achieve sparsity. For example, Sparse VD relies on variational dropout to promote sparsity Molchanov et al. (2017), providing also a Bayesian interpretation for Gaussian dropout. Another dropout-based approach is Targeted Dropout Gomez et al. (2019): there, fine-tuning the ANN model self-reinforces its sparsity by stochastically dropping connections (or entire units).
Some approaches to introduce sparsity in ANNs attempt to rely on the optimal $\ell_0$ regularizer which, however, is a non-differentiable measure. A recent work Louizos et al. (2017) proposes a differentiable proxy measure to overcome this problem, introducing, though, some relevant computational overhead. Following a similar overall approach, another work proposes a regularizer based on group lasso whose task is to cluster filters in convolutional layers Wen et al. (2016). However, such a technique is not directly generalizable to the bulky fully-connected layers, where most of the complexity (as number of parameters) lies.
A sound approach towards pruning parameters consists in exploiting a regularizer in a shrink-and-prune framework. In particular, a standard regularization term is included in the minimized cost function (to penalize the magnitude of the parameters): all the parameters dropping below some threshold are pinpointed to zero, thus learning a sparser topology Han et al. (2015). Such an approach is effective since regularization replaces unstable (ill-posed) problems with nearby and stable (well-posed) ones by introducing a prior on the parameters Groetsch (1993). However, as a drawback, this method requires a preliminary training to learn the threshold value; furthermore, all the parameters are blindly and equally penalized by their norm: some parameters, which can introduce a large error if removed, might drop below the threshold because of the regularization term, and this introduces sub-optimalities as well as instabilities in the pruning process. Guo et al. attempted to address this issue with their DNS Guo et al. (2016): they proposed an algorithmic procedure to correct possible over-pruning by enabling the recovery of severed connections. Moving to sparsification methods not based on pruning, Soft Weight Sharing (SWS) Ullrich et al. (2019) shares redundant parameters among layers, resulting in fewer parameters to be stored. Approaches based on knowledge distillation, like Few Samples Knowledge Distillation (FSKD) Li et al. (2020), are also an alternative to reduce the size of a model: it is possible to successfully train a small student network from a larger teacher, which has been directly trained on the task. Quantization can also be combined with pruning: Yang et al., for example, considered the problem of ternarizing and pruning a pre-trained deep model Yang et al. (2020). Other recent approaches mainly focus on the pruning of convolutional layers, either leveraging the artificial bee colony optimization algorithm (dubbed ABCPruner) Lin et al. (2020) or using a small set of inputs to evaluate a saliency score and construct a sampling distribution Liebenwein et al. (2020).
In another recent work Tartaglione et al. (2018), it was proposed to measure how much the network output changes for small perturbations of some parameters, and to iteratively penalize just those which generate little or no performance loss. However, such a method requires the network to be already trained in order to measure the variation of the network output when a parameter is perturbed, increasing the overall learning time.
In this work, we overcome the basic limitation of pre-training the network, introducing the concept of loss-based sensitivity: it only penalizes the parameters whose small perturbation introduces little or no performance loss at training time.
3 Proposed Regularization
ANNs are typically trained via gradient descent-based optimization, i.e. by minimizing the loss function $\mathcal{L}$. Methods based on mini-batches of samples have gained popularity as they allow better generalization than stochastic learning while being memory and time efficient. In such a framework, a network parameter $w_i$ is updated towards the averaged direction which minimizes the averaged loss for the mini-batch, i.e. using the well-known stochastic gradient descent or its variations. If the gradient magnitude is close to zero, then the parameter is not modified. Our ultimate goal is to assess to which extent a variation of the value of $w_i$ would affect the error on the network output $\mathbf{y}$. We make a first attempt towards this end by introducing a small perturbation $\Delta w_i$ over $w_i$ and measuring the variation of $\mathbf{y}$ as
$$\Delta \mathbf{y} = \sum_k |\Delta y_k| \approx |\Delta w_i| \sum_k \left| \frac{\partial y_k}{\partial w_i} \right|. \tag{1}$$
Unfortunately, the evaluation of (1) is specific and restricted to the neighborhood of the network output. We would like to directly evaluate the error of the output of the ANN model over the learned data.
Towards this end, we estimate the error on the network output caused by the perturbation on $w_i$ as:
$$\Delta \mathcal{L} \approx |\Delta w_i| \left| \frac{\partial \mathcal{L}}{\partial w_i} \right|. \tag{2}$$
The use of (2) in place of (1) shifts the focus from the output to the error of the network. The latter is more accurate information for evaluating the real effect of the perturbation of a given parameter $w_i$. Let us define the sensitivity $S$ for a given parameter $w_i$ as
$$S(\mathcal{L}, w_i) = \left| \frac{\partial \mathcal{L}}{\partial w_i} \right|. \tag{3}$$
Large $S$ values indicate large variations of the loss function for small perturbations of $w_i$.
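The sensitivity definition can be illustrated with a minimal sketch in plain Python. The toy loss and the names `loss`, `grad`, `sensitivity` and `loss_variation` are assumptions made for this illustration only and are not part of the original method description; the sketch compares the first-order estimate of the loss variation with the variation actually measured after perturbing one parameter.

```python
def loss(w):
    # Toy loss (assumption): steep along w[0], nearly flat along w[1].
    return 2.0 * (w[0] - 1.0) ** 2 + 0.001 * (w[1] + 0.5) ** 2


def grad(w):
    # Analytic gradient of the toy loss above.
    return [4.0 * (w[0] - 1.0), 0.002 * (w[1] + 0.5)]


def sensitivity(w):
    # Sensitivity of each parameter: absolute value of the loss derivative.
    return [abs(g) for g in grad(w)]


def loss_variation(w, i, delta):
    # Measured |delta L| when only parameter i is perturbed by delta.
    perturbed = list(w)
    perturbed[i] += delta
    return abs(loss(perturbed) - loss(w))


w = [0.2, 0.3]
delta = 1e-4
S = sensitivity(w)
# First-order estimate of the loss variation: |delta w| times the sensitivity.
predicted = [delta * s for s in S]
measured = [loss_variation(w, i, delta) for i in range(len(w))]
```

In this toy setting the second parameter has a much lower sensitivity than the first, so perturbing it barely changes the loss; it is the kind of parameter the regularizer is meant to shrink.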
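Given the above sensitivity definition, we can promote sparse topologies by pruning parameters with both low sensitivity $S$ (i.e., in a flat region of the loss function gradient, where a small perturbation of the parameter has a negligible effect on the loss) and low magnitude, keeping unmodified those with large $S$. Towards this end, we propose the following parameter update rule to promote sparsity:
$$w_i^{t+1} := w_i^t - \eta \frac{\partial \mathcal{L}}{\partial w_i^t} - \lambda w_i^t \left[ 1 - S(\mathcal{L}, w_i^t) \right] P\left[ S(\mathcal{L}, w_i^t) \right], \tag{4}$$
where
$$P(x) = \Theta\left[ 1 - |x| \right], \tag{5}$$
$\Theta(\cdot)$ is the one-step function and $\eta$, $\lambda$ are two positive hyper-parameters.
Plugging the sensitivity $S(\mathcal{L}, w_i) = \left| \partial \mathcal{L} / \partial w_i \right|$ into (4), the update rule reads
$$w_i^{t+1} = w_i^t - \eta \frac{\partial \mathcal{L}}{\partial w_i^t} - \lambda \Gamma(\mathcal{L}, w_i^t) \left[ 1 - \left| \frac{\partial \mathcal{L}}{\partial w_i^t} \right| \right], \tag{6}$$
where
$$\Gamma(y, x) = x \cdot P\left( \frac{\partial y}{\partial x} \right). \tag{7}$$
After some algebraic manipulations (using $|x| = x \cdot \mathrm{sign}(x)$), we can rewrite (6) as
$$w_i^{t+1} = w_i^t - \lambda \Gamma(\mathcal{L}, w_i^t) - \frac{\partial \mathcal{L}}{\partial w_i^t} \left[ \eta - \mathrm{sign}\left( \frac{\partial \mathcal{L}}{\partial w_i^t} \right) \lambda \Gamma(\mathcal{L}, w_i^t) \right]. \tag{8}$$
In (8), we observe two different components of the proposed regularization term:
- a weight decay-like term $\Gamma$ which is enabled/disabled by the magnitude of the gradient on the parameter;
- a correction term for the learning rate. In particular, the full learning process follows an equivalent learning rate
$$\tilde{\eta} = \eta - \mathrm{sign}\left( \frac{\partial \mathcal{L}}{\partial w_i} \right) \lambda \Gamma(\mathcal{L}, w_i). \tag{9}$$
Let us analyze the corrections in the learning rate. If $\left| \partial \mathcal{L} / \partial w_i \right| \geq 1$ ($w_i$ has large sensitivity), it follows that $P\left( \partial \mathcal{L} / \partial w_i \right) = 0$ and $\Gamma(\mathcal{L}, w_i) = 0$, and the dominant contribution comes from the gradient. In this case our update rule reduces to the classical GD:
$$w_i^{t+1} = w_i^t - \eta \frac{\partial \mathcal{L}}{\partial w_i^t}. \tag{10}$$
When we consider a less sensitive $w_i$ with $\left| \partial \mathcal{L} / \partial w_i \right| < 1$, we get $\Gamma(\mathcal{L}, w_i) = w_i$ (weight decay term) and we can distinguish two sub-cases for the learning rate:
- if $\mathrm{sign}\left( \partial \mathcal{L} / \partial w_i \right) = \mathrm{sign}(w_i)$, then $\tilde{\eta} \leq \eta$;
- if $\mathrm{sign}\left( \partial \mathcal{L} / \partial w_i \right) \neq \mathrm{sign}(w_i)$, then $\tilde{\eta} \geq \eta$.
A schematic of all these cases can be found in Table 1 and a representation of the possible effects is shown in Fig. 1. The contribution coming from $\Gamma$ aims at minimizing the parameter magnitude, disregarding the loss minimization. If the loss minimization tends to minimize the magnitude as well, then the equivalent learning rate is reduced. On the contrary, when the gradient descent tends to increase the magnitude, the learning rate is increased, to compensate for the contribution coming from $\Gamma$. This mechanism allows us to succeed in the learning task while introducing sparsity.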
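The two forms of the update can be checked against each other numerically. The following is a minimal scalar sketch in plain Python, assuming a one-step function that returns 0 at the origin and a gating threshold of 1 on the gradient magnitude; the function names and the numeric values of the step size and regularization weight are hypothetical and chosen only for illustration.

```python
def heaviside(x):
    # One-step function: 1 for x > 0, else 0 (the value at 0 is an assumption).
    return 1.0 if x > 0 else 0.0


def sign(x):
    # Sign of x as -1, 0 or +1.
    return (x > 0) - (x < 0)


def update_direct(w, g, eta, lam):
    # Gradient step plus a shrinkage term gated by the sensitivity |g|.
    s = abs(g)
    return w - eta * g - lam * w * (1.0 - s) * heaviside(1.0 - s)


def update_equivalent(w, g, eta, lam):
    # Same update written as weight decay plus a gradient step
    # taken with the equivalent learning rate eta_eq.
    gamma = w * heaviside(1.0 - abs(g))
    eta_eq = eta - sign(g) * lam * gamma
    return w - lam * gamma - g * eta_eq, eta_eq


eta, lam = 0.1, 0.01  # hypothetical values
cases = [(0.5, 0.2), (0.5, -0.2), (-0.5, 0.2), (-0.5, -0.2), (0.5, 3.0)]
results = []
for w, g in cases:
    direct = update_direct(w, g, eta, lam)
    equivalent, eta_eq = update_equivalent(w, g, eta, lam)
    results.append((w, g, direct, equivalent, eta_eq))
```

For the low-sensitivity cases, the equivalent learning rate falls below the nominal one when the gradient and the parameter share the same sign, and rises above it otherwise; for the high-sensitivity case the update is a plain gradient step.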
In the next section we are going to detail the overall training strategy, which cascades a learning and a pruning stage.
4 Training Procedure
This section describes a procedure to train a sparse neural network leveraging the sensitivity-based rule above to update the network parameters. We assume that the parameters have been randomly initialized, albeit the procedure holds also if the network has been pre-trained. The procedure is illustrated in Fig. 1(a) and iterates over two stages as follows.
During the learning stage, the ANN is iteratively trained according to the update rule (4) on some training set. Let $e$ indicate the current learning stage iteration (i.e., epoch) and $W^e$ represent the network (i.e., the set of learnable parameters) at the end of the $e$-th iteration. Also let $\mathcal{L}^e$ be the loss measured on some validation set at the end of the $e$-th iteration and $\hat{\mathcal{L}}$ be the best (lowest) validation loss measured so far, attained by $\hat{W}$ (the network with the lowest validation loss so far). As initial condition, we assume $\hat{W} = W^0$, $\hat{\mathcal{L}} = \mathcal{L}^0$. If $\mathcal{L}^e < \hat{\mathcal{L}}$, the reference to the best network is updated as $\hat{W} = W^e$, $\hat{\mathcal{L}} = \mathcal{L}^e$. We iterate the learning stage again until the best validation loss has not decreased for $\mathrm{PWE}$ iterations of the learning stage in a row (we say the regularizer has reached a performance plateau). At that point, we move to the pruning stage.
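The plateau criterion of the learning stage can be sketched as a simple loop. This is a hedged illustration in plain Python: `train_one_epoch` and `validation_loss` are hypothetical callbacks standing in for one epoch of regularized training and for the validation pass, and the toy one-parameter model exists only to make the sketch runnable.

```python
import copy


def learning_stage(params, train_one_epoch, validation_loss, patience):
    # Keep training until the best validation loss has not improved
    # for `patience` consecutive epochs; return the best network seen.
    best_params = copy.deepcopy(params)
    best_loss = validation_loss(params)
    epochs_without_improvement = 0
    while epochs_without_improvement < patience:
        params = train_one_epoch(params)
        current = validation_loss(params)
        if current < best_loss:
            best_params, best_loss = copy.deepcopy(params), current
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
    return best_params, best_loss


# Toy stand-ins (assumptions): one parameter drifting past the optimum at 1.0.
history = []


def toy_epoch(params):
    history.append(len(history) + 1)  # record that one epoch was run
    return [p + 0.3 for p in params]


def toy_validation_loss(params):
    return (params[0] - 1.0) ** 2


best_params, best_loss = learning_stage([0.0], toy_epoch, toy_validation_loss, patience=2)
```

In the toy run the loss improves for three epochs and then worsens twice in a row, so the loop stops and hands back the parameters of the third epoch rather than the last ones.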
We provide the best network $\hat{W}$ (the one with the lowest validation loss $\hat{\mathcal{L}}$) as input for the pruning stage, where a number of parameters have been shrunk towards zero by our sensitivity-based regularizer.
In a nutshell, during the pruning stage parameters with magnitude below a threshold value $T$ are pinpointed to zero, eventually sparsifying the network topology as shown in Fig. 1(b). Namely, we look for the largest $T$ that worsens the classification loss $\hat{\mathcal{L}}$ at most by a relative quantity $\mathrm{TWT}$:
$$\mathcal{L}^b = (1 + \mathrm{TWT}) \, \hat{\mathcal{L}}, \tag{11}$$
where $\mathcal{L}^b$ is called loss boundary. $T$ is found using the bisection method, initializing $T$ with the average magnitude of the non-null parameters in the network. Then, we apply the threshold $T$ to $\hat{W}$, obtaining the pruned network $W^T$ with its loss $\mathcal{L}^T$ on the validation set. At the next pruning iteration, we update $T$ as follows:
- if $\mathcal{L}^b \geq \mathcal{L}^T$, the network tolerates that more parameters be pruned, so $T$ is increased;
- if $\mathcal{L}^b < \mathcal{L}^T$, then too many parameters have been pruned and we need to restore some: we decrease $T$.
The pruning stage ends when $\mathcal{L}^T \leq \mathcal{L}^b$ and we observe that $\mathcal{L}^{T'} > \mathcal{L}^b$ for any new threshold $T' > T$. Once $T$ is found, all the parameters whose magnitude is below $T$ are pinpointed to zero, i.e. they are pruned for good. If at least one parameter has been pruned during the last iteration of the pruning stage, a new iteration of the learning stage follows; otherwise, the procedure ends, returning the trained, sparse network.
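The threshold search of the pruning stage can be sketched as a bisection over the parameter magnitudes. This is a minimal illustration in plain Python, under the assumption that the validation loss does not decrease when the threshold grows; `prune`, `find_threshold`, the bracketing interval, the fixed iteration count and the toy validation loss are all hypothetical choices made for this sketch.

```python
def prune(params, threshold):
    # Set to zero every parameter whose magnitude is below the threshold.
    return [0.0 if abs(p) < threshold else p for p in params]


def find_threshold(params, validation_loss, best_loss, tolerance, iterations=30):
    # Largest tested threshold whose pruned network stays within the loss boundary.
    loss_boundary = (1.0 + tolerance) * best_loss
    nonzero = [abs(p) for p in params if p != 0.0]
    low, high = 0.0, max(nonzero)
    threshold = sum(nonzero) / len(nonzero)  # start from the mean magnitude
    for _ in range(iterations):
        if validation_loss(prune(params, threshold)) <= loss_boundary:
            low = threshold   # tolerated: try pruning more
        else:
            high = threshold  # too aggressive: prune less
        threshold = 0.5 * (low + high)
    return low


reference = [0.01, -0.02, 0.5, -0.8, 0.03, 1.2]


def toy_validation_loss(params):
    # Toy stand-in (assumption): the loss grows with the magnitude that was removed.
    return 1.0 + sum(abs(r) for r, p in zip(reference, params) if p == 0.0)


T = find_threshold(reference, toy_validation_loss, best_loss=1.0, tolerance=0.1)
pruned = prune(reference, T)
```

In the toy example the three smallest parameters can be removed within the loss boundary, while removing the next one would exceed it, so the search settles just below that parameter's magnitude.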
In this section we experimentally evaluate LOBSTER over multiple architectures and datasets commonly used as benchmarks in the literature:
- LeNet-300 on MNIST,
- LeNet-5 on MNIST,
- LeNet-5 on Fashion-MNIST,
- ResNet-32 on CIFAR-10,
- ResNet-18 on ImageNet,
- ResNet-101 on ImageNet.
We compare with other state-of-the-art approaches introduced in Sec. 2 wherever numbers are publicly available. Besides these, we also perform an ablation study with an $\ell_2$-based regularizer and our proposed pruning strategy (as discussed in Sec. 4). Performance is measured as the achieved model sparsity versus classification error (Top-1 or Top-5 error). The network sparsity is defined here as the percentage of pruned parameters in the ANN model. Our algorithms are implemented in Python, using PyTorch 1.2, and simulations are run on an NVIDIA RTX 2080 Ti GPU. All the hyper-parameters have been tuned via grid search. The validation set for all the experiments contains 5k samples. (The source code is provided in the supplementary materials and will be made publicly available upon acceptance of the article.) For all datasets, the learning and pruning stages take place on a random split of the training set, whereas the numbers reported below are related to the test set.
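The sparsity metric used throughout the experiments is straightforward to compute. A minimal sketch in plain Python follows; the layout of the network as a list of per-layer parameter lists and the example values are assumptions made for illustration.

```python
def sparsity(layers):
    # Percentage of parameters that are exactly zero across all layers.
    total = sum(len(layer) for layer in layers)
    pruned = sum(1 for layer in layers for p in layer if p == 0.0)
    return 100.0 * pruned / total


def layer_sparsity(layers):
    # Same measure, reported separately for each layer.
    return [sparsity([layer]) for layer in layers]


# Hypothetical two-layer network after pruning.
network = [[0.0, 0.3, 0.0, 0.0], [0.0, -1.2]]
overall = sparsity(network)
per_layer = layer_sparsity(network)
```

Reporting the per-layer values alongside the overall figure makes it easier to see which layers absorb most of the pruning.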
LeNet-300 on MNIST
As a first experiment, we train a sparse LeNet-300 LeCun et al. (1998) architecture, which consists of three fully-connected layers with 300, 100 and 10 neurons respectively. We trained the network on the MNIST dataset, made of 60k training and 10k test gray-scale images of 28×28 pixels, depicting handwritten digits. Starting from a randomly initialized network, we trained LeNet-300 via SGD with learning rate , , epochs and .
The related literature reports several compression results that can be clustered in two groups corresponding to classification error rates of about and , respectively. Fig. 2(a) provides results for the proposed procedure. Our method reaches higher sparsity than the approaches found in the literature. This is particularly noticeable around classification error (lower left in Fig. 2(a)), where we achieve almost twice the sparsity of the second best method. LOBSTER also achieves the highest sparsity for the higher error range (right side of the graph), gaining especially with regard to the number of parameters removed from the first fully-connected layer (the largest, consisting of 235k parameters), in which we observe just the of the parameters survives.
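The 235k figure quoted for the first fully-connected layer can be checked with a few lines of arithmetic. The sketch below is plain Python and assumes flattened 28×28 MNIST inputs; the variable names are illustrative.

```python
# Layer widths of LeNet-300: flattened input, two hidden layers, output layer.
layer_sizes = [28 * 28, 300, 100, 10]

# Weights of each fully-connected layer: inputs times outputs.
weights = [n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# One bias per output neuron.
biases = layer_sizes[1:]

total_weights = sum(weights)
total_with_biases = total_weights + sum(biases)
first_layer_share = weights[0] / total_weights
```

The first layer alone holds 235,200 of the 266,200 weights, which is why the sparsity reached in that layer dominates the overall compression of this architecture.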
[Table 2: Top-1 error (%), sparsity (%) and FLOPs for the two compared schemes.]
LeNet-5 on MNIST and Fashion-MNIST
Next, we experiment on the Caffe version of the LeNet-5 architecture, consisting of two convolutional and two fully-connected layers. Again, we use a randomly-initialized network, trained via SGD with learning rate, , epochs and . The results are shown in Fig. 2(b). Even with a convolutional architecture, we obtain a competitively small network with a sparsity of 99.57%. At higher compression rates, Sparse VD slightly outperforms all other methods in the LeNet5-MNIST experiment. We observe that LOBSTER, in this experiment, sparsifies the first convolutional layer ( sparsity) more than the Sparse VD solution (). In particular, LOBSTER prunes filters out of the original in the first layer (or in other words, just filters survive, and contain all the un-pruned parameters). We hypothesize that, in the case of Sparse VD and for this particular dataset, extracting a larger variety of features at the first convolutional layer both eases the classification task (hence the lower Top-1 error) and allows dropping more parameters in the next layers (a slightly improved sparsity). However, since we are above of sparsity, the difference between the two techniques is minimal.
To scale up the difficulty of the training task, we experimented on the classification of the Fashion-MNIST dataset Xiao et al. (2017), using again LeNet5. This dataset has the same size and image format as the MNIST dataset, yet it contains images of clothing items, resulting in a non-sparse distribution of the pixel intensity values. Since the images are not as sparse, this dataset is notoriously harder to classify than MNIST. For this experiment, we trained the network from scratch using SGD with, , epochs and . The results are shown in Fig. 2(c).
F-MNIST is an inherently more challenging dataset than MNIST, so the achievable sparsity is lower. Nevertheless, the proposed method still reaches higher sparsity than other approaches, removing a higher percentage of parameters, especially in the fully connected layers, while maintaining good generalization. In this case, we observe that the first layer is the least sparsified: this is an effect of the higher complexity of the classification task, which requires more features to be extracted.
ResNet-32 on CIFAR-10
To evaluate how our method scales to deeper, modern architectures, we applied it on a PyTorch implementation of the ResNet-32 network He et al. (2015) that classifies the CIFAR-10 dataset (https://github.com/akamaster/pytorch_resnet_cifar10). This dataset consists of 60k 32×32 RGB images divided in 10 classes (50k training images and 10k test images). We trained the network using SGD with momentum , , and . The full training is performed for 11k epochs. Our method performs well on this task and outperforms other state-of-the-art techniques. Furthermore, LOBSTER improves the network generalization ability, reducing the baseline Top-1 error from to of the sparsified network while removing of the parameters. This effect is most likely due to the LOBSTER technique itself, which self-tunes the regularization on the parameters as explained in Sec. 3.
ResNet on ImageNet
Finally, we further scale up both the output and the complexity of the classification problem by testing the proposed method on networks trained over the well-known ImageNet dataset (ILSVRC-2012), composed of more than 1.2 million training images, for a total of 1k classes. For this test we used SGD with momentum , and . The full training lasts 95 epochs. Due to time constraints, we decided to use the pre-trained networks offered by the torchvision library (https://pytorch.org/docs/stable/torchvision/models.html). Fig. 2(e) shows the results for ResNet-18 while Fig. 2(f) shows the results for ResNet-101. Even in this scenario, LOBSTER proves to be particularly efficient: we are able to remove, with no performance loss, of the parameters from ResNet-18 and from ResNet-101.
As a final ablation study, we replace our sensitivity-based regularizer with a simpler $\ell_2$ regularizer in our learning scheme in Fig. 2.
Such a scheme, “$\ell_2$+pruning”, uniformly applies an $\ell_2$ penalty to all the parameters regardless of their contribution to the loss.
This scheme is comparable with Han et al. (2015), yet enhanced with the same pruning strategy with adaptive thresholding shown in Fig. 1(b).
A comparison between LOBSTER and $\ell_2$+pruning is reported in Table 2.
In all the experiments we observe that dropping the sensitivity-based regularizer impairs the performance. This experiment verifies the role of the sensitivity-based regularization in the performance of our scheme. Finally, Table 2 also reports the corresponding inference complexity in FLOPs. For the same or lower Top-1 error, LOBSTER requires fewer operations at inference time, suggesting the presence of some structure in the sparsity it achieves.
We presented LOBSTER, a regularization method suitable to train neural networks with a sparse topology without a preliminary training.
Differently from $\ell_2$ regularization, LOBSTER is aware of the global contribution of the parameter to the loss function and self-tunes the regularization effect on the parameter depending on factors like the ANN architecture or the training problem itself (in other words, the dataset). Moreover, tuning its hyper-parameters is easy and the optimal threshold for parameter pruning is self-determined by the proposed approach employing a validation set.
LOBSTER achieves competitive results from shallow architectures like LeNet-300 and LeNet-5 to deeper topologies like ResNet over ImageNet.
In these scenarios we have observed the boost, in terms of achieved sparsity, provided by the proposed regularization approach over less aware approaches like $\ell_2$ regularization.
Future research includes the extension of LOBSTER to achieve sparsity with a structure and a thorough evaluation of the savings in terms of memory footprint.
- Brutzkus et al. (2018). SGD learns over-parameterized networks that provably generalize on linearly separable data.
- Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks.
- Gomez et al. (2019). Learning sparse networks using targeted dropout. CoRR abs/1905.13678.
- Groetsch (1993). Inverse problems in the mathematical sciences. Vieweg.
- Guo et al. (2016). Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pp. 1379–1387.
- Han et al. (2016). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding.
- Han et al. (2015). Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
- He et al. (2015). Deep residual learning for image recognition. CoRR abs/1512.03385.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Karnin (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks 1 (2), pp. 239–242.
- LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- LeCun et al. (1990). Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
- Lee et al. (2019). SNIP: single-shot network pruning based on connection sensitivity. ArXiv abs/1810.02340.
- Li et al. (2020). Few sample knowledge distillation for efficient network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14639–14647.
- Liebenwein et al. (2020). Provable filter pruning for efficient neural networks. In International Conference on Learning Representations.
- Lin et al. (2020). Channel pruning via automatic structure search. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 673–679. Main track.
- Liu et al. (2017). Sparse deep transfer learning for convolutional neural network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 2245–2251.
- Louizos et al. (2017). Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312.
- Mhaskar and Poggio (2016). Deep vs. shallow networks: an approximation theory perspective. Analysis and Applications 14 (06), pp. 829–848.
- Molchanov et al. (2017). Variational dropout sparsifies deep neural networks. Vol. 5, pp. 3854–3863.
- Mozer and Smolensky (1989). Skeletonization: a technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107–115.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Tartaglione et al. (2020). Pruning artificial neural networks: a way to find well-generalizing, high-entropy sharp minima. arXiv preprint arXiv:2004.14765.
- Tartaglione et al. (2018). Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems, pp. 3878–3888.
- Ullrich et al. (2019). Soft weight-sharing for neural network compression.
- Wang et al. (2020). Pruning from scratch. In AAAI, pp. 12273–12280.
- Wen et al. (2016). Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
- Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747.
- Yang et al. (2020). Harmonious coexistence of structured weight pruning and ternarization for deep neural networks. In AAAI, pp. 6623–6630.