1 Introduction
Recent successful advances of Deep Neural Networks are commonly attributed to their high architectural complexity and excessive size (overparametrization) (Denton et al., 2014; Neyshabur et al., 2019; Arora et al., 2018). Modern state-of-the-art architectures exhibit enormous parameter overhead, requiring prohibitive amounts of resources during both training and inference and leaving a significant environmental footprint (Shoeybi et al., 2019). In response to these challenges, much attention has turned to compression of neural networks and, in particular, parameter pruning. While initial approaches mostly focused on pruning models after training (LeCun et al., 1990; Hassibi et al., 1993), contemporary algorithms optimize the sparsity structure of a network while training its parameters (Mocanu et al., 2018; Evci et al., 2020) or even remove connections before any training whatsoever (Lee et al., 2019; Wang et al., 2020).
Compression rates initially considered in the pruning literature usually fall between and of the size of the original model. However, as contemporary model sizes grow into the billions of parameters, studying higher compression regimes becomes increasingly important. Recently, a bold new sparsity benchmark was set by Tanaka et al. (2020) with Iterative Synaptic Flow (SynFlow), a data-agnostic algorithm for pruning at initialization. Reportedly, it is capable of removing all but a few hundred parameters (a compression for VGG16) and still producing trainable subnetworks, while other pruning methods disconnect networks at much lower sparsity levels (Tanaka et al., 2020). Related work by de Jorge et al. (2021) proposes an iterative version of the one-shot pruning algorithm Single-shot Network Pruning (SNIP) (Lee et al., 2019) and evaluates it in a similar high-sparsity regime, reaching more than compression ratio.
Effective sparsity.
This increased focus on extreme sparsity leads us to consider what sparsity is meant to represent in neural networks and computational graphs at large. In the context of neural network pruning, sparsity to date is computed straightforwardly as the fraction of removed connections (direct sparsity)—and compression as the inverse fraction of unpruned connections (direct compression). We observe that this definition does not distinguish between connections that have actually been pruned, and those that have become effectively pruned because they have disconnected from the computational flow. In this work, we propose to instead focus on effective sparsity—the fraction of inactivated connections, be it through direct pruning or through otherwise disconnecting from either input or output of a network (see Figure 1 for an illustration).
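To make the distinction concrete, here is a minimal sketch (ours, not from the paper's released code) that computes both quantities for a tiny fully-connected network represented by binary masks; an edge counts as active only if it lies on an unpruned path from input to output:

```python
def direct_sparsity(masks):
    """Fraction of directly pruned connections; masks[l][i][j] == 1 keeps
    the edge from neuron i in layer l to neuron j in layer l + 1."""
    total = sum(len(row) for m in masks for row in m)
    kept = sum(v for m in masks for row in m for v in row)
    return 1 - kept / total

def effective_sparsity(masks):
    """Fraction of inactive connections: an edge counts as active only if
    it is unpruned AND lies on some unpruned input-to-output path."""
    # Forward pass: which neurons are reachable from the input?
    fwd = [[True] * len(masks[0])]
    for m in masks:
        fwd.append([any(m[i][j] and fwd[-1][i] for i in range(len(m)))
                    for j in range(len(m[0]))])
    # Backward pass: which neurons still reach the output?
    bwd = [[True] * len(masks[-1][0])]
    for m in reversed(masks):
        bwd.insert(0, [any(m[i][j] and bwd[0][j] for j in range(len(m[0])))
                       for i in range(len(m))])
    total = sum(len(row) for m in masks for row in m)
    active = sum(bool(m[i][j]) and fwd[l][i] and bwd[l + 1][j]
                 for l, m in enumerate(masks)
                 for i in range(len(m)) for j in range(len(m[0])))
    return 1 - active / total

# Toy 2 -> 2 -> 1 network in the spirit of Figure 1.
masks = [[[1, 1], [0, 0]],  # second input neuron loses both edges
         [[1], [0]]]        # second hidden neuron loses its only output
```

On this toy network, direct sparsity is 3/6, but the surviving edge into the second hidden neuron carries no computation, so effective sparsity is 4/6.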
Figure 1: Dashed connections incident to inactivated neurons are effectively pruned, yielding twice as large an effective compression.

In this work, we advocate that effective sparsity (effective compression) be used universally in place of its direct counterpart, since it more accurately depicts what one would reasonably consider the network's sparsity state. Using the lens of effective compression for benchmarking allows for a fairer comparison between different unstructured pruning algorithms. Note that effective compression is lower-bounded by direct compression, which means that some pruning algorithms will give improved sparsity-accuracy tradeoffs in this new framework. In Section 3, we critically re-examine a plethora of recent pruning algorithms for a variety of architectures to find that, in this refined framework, conclusions drawn in previous works appear overstated or incorrect. Figure 2 gives a sneak preview of this effect for three ab-initio pruning algorithms: SynFlow (Tanaka et al., 2020), SNIP (Lee et al., 2019),
and plain random pruning for LeNet-300-100 on MNIST. While SynFlow appears superior to the other methods when evaluated against direct compression, it loses its advantage in the effective framework. Such radical performance changes are partly explained by the differing gaps between effective and direct compression inherent to different pruning algorithms (Figure 2). We can see that significant departure between direct and effective compression kicks in at relatively low rates below , making our work relevant even in these moderate regimes. For example, using random pruning to compress LeNet-300-100 by (sparsity ) results in effective compression; yet, removing the same number of parameters with SynFlow yields an unchanged effective compression. What makes certain iterative algorithms like SynFlow less likely to amass disconnected edges? In Section 3, we show that they are fortuitously designed to achieve a close convergence of direct and effective sparsity, hinting that preserving connectivity is an important aspect of the strong performance of high-compression pruning algorithms (Tanaka et al., 2020; de Jorge et al., 2021). Moreover, the lens of effective compression gives access to more extreme compression regimes for some pruning algorithms, which appear to disconnect much earlier when not accounting for inactive connections. For these high effective compression ratios, all three pruning methods from Figure 2 perform surprisingly similarly, even though they use varying degrees of information on data and parameter values.

Layerwise Sparsity Quotas (LSQ) and Ideal Gas Quotas (IGQ).
A recent thread of research by Frankle et al. (2021) and Su et al. (2020) shows that the performance of trained subnetworks produced by algorithms for pruning at initialization is robust to random reshuffling of unpruned edges within layers before training. This observation led to the conjecture that these algorithms essentially generate successful distributions of sparsity across layers, while the exact connectivity patterns are unimportant. In Section 4, we re-examine this conjecture through the lens of effective sparsity, confirm it for the moderate compression regimes (–) studied by Frankle et al. (2021) and Su et al. (2020), but find the truth to be more nuanced at higher compression rates. Nonetheless, this result highlights the importance of algorithms that carefully engineer layerwise sparsity quotas (LSQ) to obtain very simple and adequately performing pruning algorithms that are data- and parameter-agnostic. Another important motivation to search for good LSQ is that global pruning algorithms frequently remove entire layers prematurely (Lee et al., 2020) (cf. layer-collapse (Tanaka et al., 2020)), even before any significant differences between direct and effective sparsity emerge. Well-engineered LSQ could avoid this and enforce a proper redistribution of compression across layers (see (Gale et al., 2019; Mocanu et al., 2018; Evci et al., 2020) for existing baselines). In Section 4, we propose a novel LSQ coined Ideal Gas Quotas (IGQ) by drawing intuitive analogies from physics. Effortlessly computable for any network-sparsity combination, IGQ performs similarly to or better than every other baseline in the context of random pruning at initialization and of magnitude pruning after training.
Effective pruning.
Pruning to any desired direct sparsity is straightforward: one simply masks out the corresponding number of parameters. Effective sparsity, unfortunately, is more unpredictable and difficult to control. In particular, several known pruning algorithms suffer from layer-collapse once they reach a certain sparsity level, leading to unstable effective sparsity just before disconnection. As a result, most pruning methods are unable to deliver certain values of effective sparsity regardless of how many connections are pruned. When possible, however, one needs to carefully tune the number of pruned parameters so that effective sparsity lands near the desired value. In Section 5, we suggest a simple extension to algorithms for pruning at initialization or after training that brings effective sparsity close to any predefined achievable value while incurring costs that are at most logarithmic in model size.
2 Related work
Neural network compression encompasses a number of orthogonal approaches such as parameter regularization (Lebedev and Lempitsky, 2016; Louizos et al., 2018), variational dropout (Molchanov et al., 2017)
, vector quantization and parameter sharing
(Gong et al., 2014; Chen et al., 2015; Han et al., 2016), low-rank matrix decomposition (Denton et al., 2014; Jaderberg et al., 2014), and knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015). Network pruning, however, is by far the most common technique for model compression, and can be partitioned into structured (at the level of entire neurons/units) and unstructured (at the level of individual connections). While the former offers resource efficiency unconditioned on the use of specialized hardware (Liu et al., 2019) and constitutes a fruitful research area (Li et al., 2017; Liu et al., 2017), we focus on the more actively studied unstructured pruning, which is where differences between effective and direct sparsity emerge. In what follows, we give a quick overview, naturally grouping pruning methods by the time they are applied relative to training (see (Frankle and Carbin, 2019) and (Wang et al., 2020) for a similar taxonomy).

Pruning after training.
These earliest pruning techniques were designed to remove the least “salient” learned connections without sacrificing predictive performance. Optimal Brain Damage (LeCun et al., 1990) and its sequel Optimal Brain Surgeon (Hassibi et al., 1993)
use the Hessian of the loss to estimate sensitivity to removal of individual parameters.
Han et al. (2015) popularized magnitude as a simple and effective pruning criterion. It proved to be especially successful when applied alternately with several fine-tuning cycles, a scheme commonly referred to as Iterative Magnitude Pruning (IMP), a modification of which was used by Frankle and Carbin (2019) to discover lottery tickets. Later, Dong et al. (2017) showed that magnitude-based pruning minimizes the distortion of each layer's output incurred by parameter removal. Recently, Lee et al. (2021) extended this idea and proposed Layer-Adaptive Magnitude-Based Pruning (LAMP), which approximately minimizes an upper bound on the distortion of the entire network. While equivalent to magnitude pruning within individual layers, LAMP automatically discovers state-of-the-art layerwise sparsity quotas (see Section 4) that yield better performance (as a function of direct compression) than existing alternatives in the context of IMP.

Pruning during training.
Algorithms in this category learn sparsity structures together with parameter values, hoping that continued training will correct for damage incurred by pruning. To avoid the inefficient prune-retrain cycles inherent to IMP, Narang et al. (2017) introduce gradual magnitude pruning over a single training round. Subsequently, Zhu and Gupta (2018) modify this algorithm by introducing a simpler pruning schedule and keeping layerwise sparsities uniform throughout training. Sparse Evolutionary Training (SET) (Mocanu et al., 2018) starts with an already sparse subnetwork and restructures it during training by pruning and randomly reviving connections. Unlike SET, Mostafa and Wang (2019) allow redistribution of sparsity across layers, while Dettmers and Zettlemoyer (2019) use gradient momentum as the criterion for parameter regrowth. Evci et al. (2020) rely on the instantaneous gradient to revive weights but follow SET in maintaining the initial layerwise sparsity distribution during training.
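The gradual schedule of Zhu and Gupta (2018) can be sketched as follows (variable names are ours); sparsity ramps cubically from an initial value s_i to a final value s_f over n pruning steps spaced dt training steps apart:

```python
def gradual_sparsity(t, t0, n, dt, s_i, s_f):
    """Zhu-Gupta cubic schedule: sparsity at training step t, for n pruning
    steps of spacing dt starting at step t0; ramps from s_i up to s_f."""
    t = min(max(t, t0), t0 + n * dt)  # clamp to the pruning window
    progress = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3
```

The cubic exponent prunes aggressively early, when redundant weights are plentiful, and gently later, when the surviving weights matter most.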
Pruning before training.
Pruning at initialization is especially alluring to deep learning practitioners as it promises lower costs of both optimization and inference. While this may seem too ambitious, the Lottery Ticket Hypothesis (LTH) postulates that randomly initialized dense networks do indeed contain highly trainable and equally well-performing sparse subnetworks
(Frankle and Carbin, 2019). Inspired by the LTH, Lee et al. (2019) design SNIP, which uses connection sensitivity as a parameter saliency score. Wang et al. (2020) notice that SNIP creates bottlenecks or even removes entire layers and propose Gradient Signal Preservation (GraSP) as an alternative that aims to maximize gradient flow in a pruned network. de Jorge et al. (2021) improve SNIP by applying it iteratively, allowing for reassessment of saliency scores during pruning and helping networks stay connected at higher compression rates. A truly new compression benchmark was set by Tanaka et al. (2020); their algorithm, SynFlow, iteratively prunes subsets of parameters according to their path norm and helps networks reach maximum compression without disconnecting. For example, SynFlow achieves non-random test accuracy on CIFAR-10 with a compressed VGG16, while SNIP and GraSP fail already at and , respectively. An extensive ablation study by Frankle et al. (2021) examines SNIP, GraSP and SynFlow within moderate compression rates (up to ) and reveals that the performance of subnetworks produced by these methods is stable under layerwise rearrangement of connections prior to training. Later, this result was independently confirmed by Su et al. (2020) for SNIP and GraSP only. This observation suggests that these algorithms perform as well as random pruning with corresponding layerwise quotas, putting the spotlight on designing competitive LSQ (Mocanu et al., 2018; Gale et al., 2019; Lee et al., 2021).

3 Effective sparsity
In this section, we present our comparisons of a variety of pruning algorithms under the lens of effective compression. To illustrate the striking difference between direct and effective sparsity, and to expose the often radical change in the relative performance of pruning algorithms when switching from the former to the latter, we evaluate several recent methods (SNIP, GraSP, SynFlow, LAMP (as a state-of-the-art representative of magnitude pruning after training and, in particular, of lottery tickets (Frankle and Carbin, 2019)), and SNIP-iterative) and random pruning with a uniform sparsity distribution across layers in both frameworks. Our experiments encompass modern architectures on commonly used computer vision benchmark datasets: LeNet-300-100
(LeCun et al., 1998) on MNIST, LeNet-5 (LeCun et al., 1998) on CIFAR-10, VGG19 (Simonyan and Zisserman, 2015) on CIFAR-100, and ResNet-18 (He et al., 2016) on Tiny-ImageNet. We place results of VGG16 (Simonyan and Zisserman, 2015) on CIFAR-10 in Appendix B, as they closely resemble those of VGG19. Further experimental details are presented in Appendix A. Our code is made available at github.com/avysogorets/effectivesparsity.

Notation.
Consider an $L$-layer neural network with weight tensors $W_1, \ldots, W_L$. A subnetwork is specified by a set of binary masks $M_1, \ldots, M_L$ that indicate unpruned parameters; with $\odot$ denoting pointwise multiplication, its weights are given by $M_l \odot W_l$. Note that biases and batch normalization parameters (Ioffe and Szegedy, 2015) are normally considered unprunable. Direct sparsity, the fraction of pruned weights, is given by $s = 1 - \sum_l \|M_l\|_0 / \sum_l |W_l|$, and the direct compression rate is defined as $c = 1 / (1 - s)$.

Figure 3 reveals that different algorithms tend to develop varying amounts of inactive connections. For example, effective compression of subnetworks pruned by LAMP consistently reaches of their direct compression across all architectures, at which point at least nine in ten unpruned connections are effectively inactivated. Other methods (e.g., SNIP on VGG19) remove entire layers early on, before any substantial differences between effective and direct compression emerge. SNIP-iterative and especially SynFlow, however, demonstrate a truly unique property: subnetworks pruned by these two algorithms exhibit practically equal effective and direct compression and, in the case of SynFlow, disconnect only at very high compression rates. What makes them special? Both SynFlow and SNIP-iterative are multi-shot pruning algorithms that remove parameters over and iterations, respectively. SynFlow ranks connections by their path norm (the sum of weighted paths passing through an edge, where the weight of a path is the product of the magnitudes of the weights of its edges). SNIP uses connection sensitivity scores $|w \cdot \partial\mathcal{L}/\partial w|$ from Lee et al. (2019) as a saliency measure, where $\mathcal{L}$ is the loss function. Both of these pruning criteria assign the lowest possible score of zero to inactive connections, scheduling them for immediate removal in a subsequent pruning iteration. Thus, by virtue of their iterative design, these two methods produce subnetworks with little to no difference between effective and direct compression. They are fortuitously designed to prune inactivated edges, which might explain their high-compression performance.
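The zero-score property is easy to verify on a toy example. The sketch below (our illustration, with dense layers stored as nested lists) computes per-edge path norms as described above; any edge incident to a neuron with no remaining path to the output receives a score of exactly zero and would be removed on the next iteration:

```python
def path_norm_scores(weights):
    """Per-edge path norms for a chain of dense layers (nested lists):
    score of edge (i, j) in layer l = sum over all input-output paths
    through that edge of the product of weight magnitudes."""
    # Forward: total path magnitude flowing into each neuron from the input.
    fwd = [[1.0] * len(weights[0])]
    for W in weights:
        fwd.append([sum(abs(W[i][j]) * fwd[-1][i] for i in range(len(W)))
                    for j in range(len(W[0]))])
    # Backward: total path magnitude flowing from each neuron to the output.
    bwd = [[1.0] * len(weights[-1][0])]
    for W in reversed(weights):
        bwd.insert(0, [sum(abs(W[i][j]) * bwd[0][j] for j in range(len(W[0])))
                       for i in range(len(W))])
    return [[[fwd[l][i] * abs(W[i][j]) * bwd[l + 1][j]
              for j in range(len(W[0]))]
             for i in range(len(W))]
            for l, W in enumerate(weights)]

# 2 -> 2 -> 1 toy network: the second hidden unit's only outgoing weight
# is pruned (0.0), so both edges into it get a path norm of exactly zero.
weights = [[[0.5, -1.0], [2.0, 0.3]],
           [[1.5], [0.0]]]
scores = path_norm_scores(weights)
```

Here both edges into the second hidden unit score zero because that unit no longer reaches the output, while every edge on a surviving path keeps a positive score.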
Tanaka et al. (2020) compare SynFlow to SNIP and GraSP using direct sparsity, claiming it to be vastly superior in high compression regimes. However, pruning methods that generate large amounts of inactivated connections are clearly at a significant disadvantage in the original direct framework. Figure 4 shows that the performance gap between SynFlow and other methods shrinks on all tested architectures under effective compression. The most dramatic changes are perhaps evident with LeNet-300-100, where SynFlow significantly dominates both SNIP and GraSP in the direct comparison but becomes strictly inferior under the more telling effective compression. On the other hand, differences are not as pronounced on purely convolutional architectures such as VGG19 and ResNet-18. Feature maps in convolutional layers are connected via groups of several parameters (kernels), making them more robust to inactivation compared to neurons in fully-connected layers.
Computing effective sparsity: In advocating the use of effective sparsity, we must make sure that it can be calculated efficiently. We propose an easily computable approach leveraging SynFlow. Note that a connection is inactive if and only if it is not part of any path from input to output. Assuming that unpruned weights are nonzero, this is equivalent to having zero path norm. Tanaka et al. (2020) observe that path norms can be efficiently computed with one pass on the all-ones input $\mathbb{1}$ as $w \cdot \partial\mathcal{R}/\partial w$, where $\mathcal{R} = \bar{f}(\mathbb{1})$ and $\bar{f}$ is the linearized version of the original network $f$, obtained by replacing its weights with their absolute values. For deep architectures, rescaling of weights might be required to avoid numerical instability (Tanaka et al., 2020).
4 Layerwise sparsity quotas (LSQ) and a novel allocation method (IGQ)
Inspired by Frankle et al. (2021) and Su et al. (2020), we wish to confirm that SNIP, GraSP, and SynFlow work no better than random pruning with the corresponding layerwise sparsity allocation. While Frankle et al. (2021) and Su et al. (2020) only considered moderate compression rates up to and used direct sparsity as a reference frame, we reconfirm their conjecture in the effective framework and test it across the entire compression spectrum. We generate and train two sets of subnetworks: one pruned by SNIP, GraSP, or SynFlow (original), and one randomly pruned while preserving the layerwise sparsity quotas produced by each of these three methods (random).
Our results in Figure 5 agree with observations made by Frankle et al. (2021) and Su et al. (2020): in the – compression range, all three random pruning algorithms perform similarly to (LeNet-300-100, LeNet-5, VGG19) or better than (ResNet-18) their original counterparts. Effective sparsity allows us to faithfully examine higher compression, where the evidence is more equivocal. Similar patterns are still seen on ResNet-18; however, the original SNIP and GraSP beat random pruning with corresponding layerwise sparsities by a wide margin starting at about compression on LeNet-300-100. Random pruning associated with SynFlow matches the original SynFlow on the same network for longer, up to compression. On VGG19, SynFlow bests the corresponding random pruning from about compression onward, while the original SNIP suffers from disconnection early on, together with its random variant. Despite these nuances in the high compression regime, random pruning with specific layerwise sparsity quotas fares extremely well in the moderate sparsity regime (up to ) and is even competitive with full-fledged SynFlow (see Figure 7). Therefore, random pruning can be a cheap and competitive alternative to more sophisticated and resource-consuming algorithms. The random methods from Figure 5, however, still require running SNIP, GraSP, or SynFlow to identify appropriate sparsity quotas and are thus just as expensive. Furthermore, sparsity distributions inherited from global pruning methods frequently suffer from premature removal of entire layers (e.g., SNIP on VGG19), which is undesired. Can we engineer readily computable and consistently well-performing sparsity quotas?
To our knowledge, there are only a few ab-initio approaches in the literature that allocate sparsity in a principled fashion. Uniform is the simplest solution, keeping sparsity constant across all layers. Gale et al. (2019) give a modification (denoted Uniform+ following Lee et al. (2021)) that retains all parameters in the first convolutional layer and caps the sparsity of the last fully-connected layer at . A more sophisticated approach, Erdős-Rényi-Kernel (ERK), sets the density of a convolutional layer with kernel dimensions $w \times h$, fan-in $n_{\text{in}}$, and fan-out $n_{\text{out}}$ proportional to $(n_{\text{in}} + n_{\text{out}} + w + h) / (n_{\text{in}} \cdot n_{\text{out}} \cdot w \cdot h)$ (Mocanu et al., 2018; Evci et al., 2020). The last two approaches are unable to support the entire range of sparsities: Uniform+ can only achieve moderate direct compression because of the prunability constraints on its first and last layers, while both direct and effective sparsity levels achievable with ERK are often lower-bounded. For example, the density of certain layers of VGG16 set by ERK exceeds when cutting less than of parameters, unless the excessive density is redistributed. We suggest a formal definition for layerwise sparsity quotas to guide future research into sparsity allocation and avoid the problems that riddle Uniform+ and ERK.
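For illustration, the following sketch implements the ERK allocation as just described (the layer shapes and target density are hypothetical, and the normalization is ours); at mild compression, the smallest layer is assigned a density above 1, which is exactly the invalidity discussed above:

```python
def erk_densities(layers, target_density):
    """ERK allocation sketch: the density of a convolutional layer with
    fan-in n_in, fan-out n_out and a w x h kernel is proportional to
    (n_in + n_out + w + h) / (n_in * n_out * w * h); the proportionality
    constant is set so the parameter-weighted average density matches
    the target."""
    raw = [(n_in + n_out + w + h) / (n_in * n_out * w * h)
           for (n_in, n_out, w, h) in layers]
    params = [n_in * n_out * w * h for (n_in, n_out, w, h) in layers]
    scale = target_density * sum(params) / sum(r * p for r, p in zip(raw, params))
    return [scale * r for r in raw]

# Hypothetical VGG-style prefix: 3 -> 64 -> 64 -> 128 channels, 3x3 kernels.
layers = [(3, 64, 3, 3), (64, 64, 3, 3), (64, 128, 3, 3)]
dens = erk_densities(layers, target_density=0.9)
# dens[0] > 1 here: the tiny first layer is told to keep more than 100%
# of its weights, so the excess must be redistributed across other layers.
```

The average density is correct by construction, but per-layer densities are unconstrained, which is why ERK cannot serve as an LSQ over the whole sparsity range without modification.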
Definition 1 (Layerwise Sparsity Quotas).
A function $S: s \mapsto (s_1, \ldots, s_L)$ mapping a target network sparsity $s \in [0, 1]$ to layerwise sparsities $s_l \in [0, 1]$ is called Layerwise Sparsity Quotas (LSQ) if it satisfies the following properties: (i) total sparsity: for any $s$, $\sum_l s_l N_l = s \sum_l N_l$, where $N_l$ is the number of parameters in layer $l$; and (ii) monotonicity: for any $l$, $s_l(s) \le s_l(s')$ whenever $s \le s'$.
We now present Ideal Gas Quotas (IGQ), our formula for sparsity allocation that satisfies Definition 1 and outperforms the above-mentioned baselines, while faring very well (over effective sparsity) compared to the allocation quotas derived from sophisticated pruning methods such as SynFlow. To develop an intuition for what constitutes a good LSQ construction, we study the layerwise sparsities induced by contemporary pruning algorithms such as LAMP and SynFlow (Figure 6). As a rule, they prune larger layers more aggressively than smaller ones, i.e., $s_l \ge s_{l'}$ whenever $N_l \ge N_{l'}$ for any layers $l$ and $l'$, and they avoid premature removal of entire layers, i.e., $s_l = 1$ if and only if $s = 1$.
Our approach is to interpret compression of the layers of a network as compression of stacked cylinders of unit volume filled with gas, where the height of each cylinder is proportional to the number of parameters in the corresponding layer. We then use the Ideal Gas Law to derive the compression of each of the stacked, coupled cylinders. More formally, model each layer $l$ as a cylinder of height $h_l \propto N_l$ and cross-section area $a_l = 1/h_l$. Further, assume that these stacked weightless cylinders, equipped with frictionless pistons and filled with the same amount of ideal gas, are in thermodynamic equilibrium with common pressure $p$ and temperature $T$. Isothermal compression of this system by an external force $f$ is governed by the Ideal Gas Law $pV = nRT$: for each cylinder, $p \cdot h_l a_l = (p + f/a_l) \cdot h'_l a_l$, where $h'_l$ is its new compressed height. Then $h_l / h'_l = 1 + f h_l / p$ or, equivalently, $h_l / h'_l = 1 + f N_l$ after absorbing constants into $f$. Interpreting $c_l = h_l / h'_l$ as the compression ratio of layer $l$, we arrive at compression quotas $c_l(f) = 1 + f N_l$ (or sparsity quotas $s_l(f) = f N_l / (1 + f N_l)$) parameterized by the force $f$ controlling the overall sparsity of the network. Given a target sparsity $s$, the needed value of $f$ can simply be found with binary search to any desired precision. Our IGQ clearly satisfies all conditions of Definition 1 as well as the other properties identified above. It is surprising how closely the sparsity quotas achieved by IGQ resemble those of SynFlow, considering that they describe a physical process (see Figure 6).
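Concretely, IGQ can be computed as in the sketch below, assuming the quotas s_l(f) = f*N_l / (1 + f*N_l) derived above with constants absorbed into f; only a binary search over the scalar force f is needed:

```python
def igq_sparsities(layer_sizes, target_sparsity, tol=1e-10):
    """Ideal Gas Quotas sketch: layer l with N_l parameters receives
    sparsity s_l(f) = f * N_l / (1 + f * N_l); the scalar 'force' f is
    found by binary search so that the overall fraction of pruned
    parameters matches target_sparsity (must lie in [0, 1))."""
    total = sum(layer_sizes)

    def overall(f):
        # parameter-weighted average sparsity induced by force f
        return sum(f * n / (1 + f * n) * n for n in layer_sizes) / total

    lo, hi = 0.0, 1.0
    while overall(hi) < target_sparsity:  # grow bracket until it contains f
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):
        mid = (lo + hi) / 2.0
        if overall(mid) < target_sparsity:
            lo = mid
        else:
            hi = mid
    f = (lo + hi) / 2.0
    return [f * n / (1 + f * n) for n in layer_sizes]

# Hypothetical three-layer network: larger layers get pruned harder.
quotas = igq_sparsities([1000, 10000, 100000], target_sparsity=0.9)
```

Larger layers receive strictly higher sparsity, no layer ever reaches sparsity 1 for any finite force, and the parameter-weighted average matches the target, so both conditions of Definition 1 hold.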
We now evaluate IGQ for random pruning, comparing it against ERK, Uniform, and Uniform+, as well as against random pruning with the sparsity quotas induced by SynFlow for reference (Figure 7). Across all architectures, random pruning with IGQ and with SynFlow sparsity quotas are almost indistinguishable from each other, suggesting that IGQ successfully mimics the quotas produced by SynFlow. While ERK sometimes exhibits similar (ResNet-18) or even better (VGG19 compressed to or higher) performance than IGQ, it yields invalid layerwise sparsity quotas when removing less than and of parameters from ResNet-18 and VGG19, respectively. In the moderate sparsity regime (up to ), subnetworks pruned by IGQ reach unparalleled performance after training.
In Appendix C, we test IGQ in the context of magnitude pruning after training. Here, the performance of IGQ practically coincides with that of LAMP, making IGQ the only known LSQ that consistently performs best in both settings and a competitive method for layerwise sparsity allocation.
5 Effective pruning
Unlike pruning to a target direct sparsity, pruning to achieve a particular effective sparsity can be nontrivial. Here, we present an extension to algorithms for pruning at initialization or after training that achieves this goal efficiently, when possible (see Figure 8).
Rankingbased pruning.
Algorithms like GraSP, SynFlow, and LAMP rank parameters by some notion of importance to guide pruning. When such a ranking is available, the naive solution is to iterate through all scores in order, considering each as a potential pruning threshold and recording the effective sparsity of the corresponding subnetwork with parameters removed. While provably identifying the optimal threshold that yields a subnetwork with effective sparsity as close to the desired value as possible, this approach requires prune-evaluate cycles, which is unreasonable for most contemporary architectures. To achieve an efficient overhead of at the price of minor inaccuracy, we utilize binary search for the cutoff threshold instead, leveraging the following monotonicity property: given two pruning thresholds $\tau_1 \le \tau_2$ and subnetworks with masks $M^{(1)}$ and $M^{(2)}$ obtained by pruning scores below them, we have $\tau_1 \le \tau_2$ if and only if $M^{(2)} \subseteq M^{(1)}$, which implies that the effective sparsity of the second subnetwork is at least that of the first (note that an ordering of direct sparsities alone does not in general imply the last inequality above). Thus, binary search will branch in the correct direction.
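The sketch below (ours, with hypothetical helper names and toy shapes) illustrates the procedure for a small fully-connected network: scores are sorted once, and a binary search over the number k of lowest-scored connections to remove finds the smallest k whose effective sparsity reaches the target, using logarithmically many prune-evaluate cycles. It re-implements the reachability-based effective sparsity check from Section 3:

```python
import random

def effective_sparsity(masks):
    """Reachability-based effective sparsity for a chain of dense layers
    given as nested-list binary masks."""
    fwd = [[True] * len(masks[0])]
    for m in masks:
        fwd.append([any(m[i][j] and fwd[-1][i] for i in range(len(m)))
                    for j in range(len(m[0]))])
    bwd = [[True] * len(masks[-1][0])]
    for m in reversed(masks):
        bwd.insert(0, [any(m[i][j] and bwd[0][j] for j in range(len(m[0])))
                       for i in range(len(m))])
    total = sum(len(r) for m in masks for r in m)
    active = sum(bool(m[i][j]) and fwd[l][i] and bwd[l + 1][j]
                 for l, m in enumerate(masks)
                 for i in range(len(m)) for j in range(len(m[0])))
    return 1 - active / total

def prune_to_effective(shapes, scores, target):
    """Binary-search the number k of lowest-scored connections to remove
    so that effective sparsity first reaches `target`. Masks for larger k
    are nested inside masks for smaller k, so effective sparsity is
    monotone in k and the search branches correctly."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    def mask_for(k):
        pruned = set(order[:k])  # the k least important connections
        masks, base = [], 0
        for (n_in, n_out) in shapes:
            masks.append([[0 if (base + i * n_out + j) in pruned else 1
                           for j in range(n_out)] for i in range(n_in)])
            base += n_in * n_out
        return masks

    lo, hi = 0, len(scores)  # O(log n) prune-evaluate cycles
    while lo < hi:
        mid = (lo + hi) // 2
        if effective_sparsity(mask_for(mid)) < target:
            lo = mid + 1
        else:
            hi = mid
    return mask_for(lo)

# Hypothetical 3 -> 4 -> 2 network with random importance scores.
shapes = [(3, 4), (4, 2)]
rng = random.Random(0)
scores = [rng.random() for _ in range(3 * 4 + 4 * 2)]
masks = prune_to_effective(shapes, scores, target=0.5)
```

Because each evaluation reuses the same fixed ranking, the resulting masks form a nested chain, which is precisely the property that makes the binary search sound.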
Random pruning.
Another class of pruning algorithms designs layerwise sparsities and then prunes each layer randomly (see Section 4). Binary search will not work directly, because here a higher direct sparsity does not imply a higher effective sparsity, and random pruning is unlikely to produce a neat chain of embedded subnetworks as before. To circumvent this issue, we design an improved algorithm that produces embedded subnetworks on each iteration, allowing binary search to work (Algorithm 1). Starting from the extreme subnetworks (fully-dense) and (fully-sparse), we narrow the sparsity gap between them while preserving their embedding. For each layer, we keep track of the unpruned connections of and the pruned connections of , randomly sample parameters from according to , and form another network by pruning out from (or, equivalently, reviving in ). Depending on where the effective sparsity of lands relative to the target , we update either or and branch. Since connections to be pruned from (or revived in ) are chosen randomly at each step, weights within the same layer have an equal probability of being pruned. Once and are only one parameter away from each other, the algorithm returns , yielding a connected model.

6 Discussion
In our work, we argue that effective sparsity (effective compression) is the correct benchmarking measure for pruning algorithms, since it discards effectively inactive connections and represents the true remaining connectivity pattern. Moreover, effective sparsity allows us to study extreme compression regimes for subnetworks that otherwise appear disconnected at much lower direct sparsities. We initiate the study of current pruning algorithms in this refined frame of reference and rectify previous benchmarks. To facilitate the use of effective sparsity in future research, we describe low-cost procedures to both compute and achieve desired effective sparsity when pruning. Lastly, with effective sparsity allowing us to fairly zoom into higher compression regimes than previously possible, we examine random pruning with prescribed layerwise sparsities and propose our own readily computable quotas (IGQ), after establishing conditions that reasonable LSQ should fulfill. We show that IGQ, while allowing for any level of sparsity, is more advantageous than all existing comparable baselines (Uniform, ERK) and gives performance comparable to sparsity quotas derived from more sophisticated and computationally expensive algorithms like SynFlow.
Limitations and Broader Impacts: We hope that the lens of effective compression will spur more research in high compression regimes. One possible limitation is that effective compression is harder to control exactly. In particular, using different seeds might lead to slightly different effective compression rates; however, these perturbations are minor. Another small limitation is that our effective pruning strategies are not immediately applicable to some algorithms that prune while training (e.g., RigL (Evci et al., 2020)), although in most cases our approach can be adapted. Lastly, one might argue that for some architectures accuracy drops precipitously at higher compression, making very sparse subnetworks less practical. We hope that opening the study of high compression will make it possible to explore how to use sparse networks as building blocks, for instance through the power of ensembling. Our framework allows a principled study of this regime.
Both authors were supported by the National Science Foundation under NSF Award 1922658. Neither of the authors has any competing interests to report.
References

S. Arora, N. Cohen, and E. Hazan (2018). On the optimization of deep networks: implicit acceleration by overparameterization. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pp. 372–389.
C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pp. 535–541.
W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015). Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), PMLR 37, pp. 2285–2294.
P. de Jorge, A. Sanyal, H. Behl, P. Torr, G. Rogez, and P. Dokania (2021). Progressive skeletonization: trimming more fat from a network at initialization. In International Conference on Learning Representations (ICLR 2021).
E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27.
T. Dettmers and L. Zettlemoyer (2019). Sparse networks from scratch: faster training without losing performance. CoRR abs/1907.04840.
X. Dong, S. Chen, and S. Pan (2017). Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems 30.
U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen (2020). Rigging the lottery: making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR 119, pp. 2943–2952.
J. Frankle and M. Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR 2019).
J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin (2021). Pruning neural networks at initialization: why are we missing the mark? In International Conference on Learning Representations (ICLR 2021).
T. Gale, E. Elsen, and S. Hooker (2019). The state of sparsity in deep neural networks. arXiv:1902.09574.
Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014). Compressing deep convolutional networks using vector quantization. CoRR abs/1412.6115.
S. Han, H. Mao, and W. J. Dally (2016). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR 2016).
S. Han, J. Pool, J. Tran, and W. J. Dally (2015). Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems 28.
B. Hassibi, D. G. Stork, and G. J. Wolff (1993). Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299.
K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
G. Hinton, O. Vinyals, and J. Dean (2015). Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
S. Ioffe and C. Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), PMLR 37, pp. 448–456.
M. Jaderberg, A. Vedaldi, and A. Zisserman (2014). Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference.
V. Lebedev and V. Lempitsky (2016). Fast ConvNets using group-wise brain damage. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2554–2564.
 Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document Cited by: Appendix A, §3.
 Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605. Cited by: §1, §2.
 Layeradaptive sparsity for the magnitudebased pruning. In International Conference on Learning Representations, External Links: Link Cited by: §2, §2, §4.
 A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, External Links: Link Cited by: §1.
 Snip: singleshot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 69, 2019, External Links: Link Cited by: Table 1, §1, §1, §1, §2, §3.
 Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.
 Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2755–2763. External Links: Document Cited by: §2.
 Rethinking the value of network pruning. In ICLR, Cited by: §2.
 Learning sparse neural networks through regularization. In International Conference on Learning Representations, External Links: Link Cited by: §2.
 Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9 (1), pp. 2383. External Links: Document, ISBN 20411723, Link Cited by: §1, §1, §2, §2, §4.
 Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 2498–2507. External Links: Link Cited by: §2.
 Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, pp. 4646–4655. Cited by: §2.

Exploring sparsity in recurrent neural networks
. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.  The role of overparametrization in generalization of neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §1.

Megatronlm: training multibillion parameter language models using model parallelism
. CoRR abs/1909.08053. External Links: Link, 1909.08053 Cited by: §1.  Very deep convolutional networks for largescale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Appendix A, §3.
 Sanitychecking pruning methods: random tickets can win the jackpot. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 20390–20401. External Links: Link Cited by: Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity, §1, §2, §4, §4.
 Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 6377–6389. External Links: Link Cited by: Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity, §1, §1, §1, §2, §3, §3.
 Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, External Links: Link Cited by: Table 1, Appendix A, §1, §2, §2.
 To prune, or not to prune: exploring the efficacy of pruning for model compression. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Workshop Track Proceedings, External Links: Link Cited by: §2.
Appendix A Experimental details
Our experimental work encompasses five architecture-dataset combinations: LeNet-300-100 (Lecun et al., 1998) on MNIST (Creative Commons Attribution-Share Alike 3.0 license), LeNet-5 (Lecun et al., 1998) and VGG16 (Simonyan and Zisserman, 2015) on CIFAR-10 (MIT license), VGG19 (Simonyan and Zisserman, 2015) on CIFAR-100 (MIT license), and ResNet-18 (He et al., 2016) on TinyImageNet (MIT license). Following Frankle et al. (2021), we do not reinitialize subnetworks after pruning (we revert to the original initialization when pruning a pretrained model by LAMP). We use our own implementation of all pruning algorithms in TensorFlow, except for GraSP, for which we use the original Torch code published by Wang et al. (2020). All runs were repeated 3 times for stability of results. Training was performed on an internal cluster equipped with NVIDIA RTX8000, NVIDIA V100, and AMD MI50 GPUs. Hyperparameters and training schedules are adopted from related work and listed in Table 1.

Model          Epochs  Drop epochs  Batch  LR  Weight decay  Source                  Node type
LeNet-300-100  –       –            –      –   –             (Lee et al., 2019)      CPU
LeNet-5        –       –            –      –   –             (Lee et al., 2019)      CPU
VGG16          –       –            –      –   –             (Frankle et al., 2021)  GPU
VGG19          –       –            –      –   –             (Wang et al., 2020)     GPU
ResNet-18      –       –            –      –   –             (Frankle et al., 2021)  GPU

Table 1: Summary of experimental work. All architectures include batch normalization layers followed by ReLU activations. Models are initialized using the Kaiming normal scheme (fan-avg) and optimized by SGD with momentum, using a stepwise LR schedule (a fixed drop factor applied at the specified drop epochs). The categorical cross-entropy loss function is used for all models. We apply standard augmentations to images during training. In particular, we normalize examples per-channel for all datasets and additionally apply, at random: shifts by at most 4 pixels in any direction and horizontal flips (CIFAR-10, CIFAR-100, and TinyImageNet), or rotations by up to 4 degrees (MNIST).
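For concreteness, the shift and flip augmentations described above can be sketched in NumPy as follows. This is an illustrative sketch, not our training code (which is in TensorFlow); the function names are hypothetical.

```python
import numpy as np

def per_channel_normalize(batch, mean, std):
    """Normalize an NHWC batch with per-channel dataset statistics."""
    return (batch - mean) / std

def shift2d(img, dy, dx):
    """Translate an HWC image by (dy, dx) pixels, zero-padding the border."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out

def augment(img, rng, max_shift=4):
    """Random shift by up to `max_shift` pixels in any direction,
    plus a horizontal flip with probability 0.5."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    img = shift2d(img, dy, dx)
    return img[:, ::-1] if rng.random() < 0.5 else img
```

The MNIST-only rotation augmentation is omitted here, since rotating by fractional pixels requires interpolation beyond this sketch.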
Appendix B Experiments with VGG16
In Figure 9, we display the results of our experiments with VGG16 on CIFAR-10. As we argued in Section 3, purely convolutional architectures (such as VGG16) require higher sparsities to develop inactive connections, since feature maps are harder to disconnect. At the same time, several algorithms (SNIP, iterative SNIP, GraSP) suffer from layer-collapse at modest sparsities and hence fail to develop significant numbers of inactive parameters. For this reason, as evident from Figures 3, 4, and 9, VGG16 arguably shows the smallest gap between effective and direct compression among all tested architectures.
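To make the distinction between direct and effective sparsity concrete, effective sparsity for a plain feed-forward MLP can be computed with a forward/backward reachability sweep over the binary pruning masks. The sketch below is a simplified illustration for MLPs without biases or skip connections, not our full TensorFlow implementation; extending it to convolutions requires more bookkeeping.

```python
import numpy as np

def effective_sparsity(masks):
    """Fraction of inactive connections: those pruned directly plus those
    cut off from the network's input or output.
    `masks[l][i, j] = 1` keeps the connection from unit i of layer l
    to unit j of layer l + 1."""
    # Forward sweep: which units still receive signal from the input?
    fwd = [np.ones(masks[0].shape[0], dtype=bool)]
    for m in masks:
        fwd.append((m.T @ fwd[-1].astype(float)) > 0)
    # Backward sweep: which units still influence the output?
    bwd = [np.ones(masks[-1].shape[1], dtype=bool)]
    for m in reversed(masks):
        bwd.insert(0, (m @ bwd[0].astype(float)) > 0)
    # A kept connection is active iff both endpoints are reachable.
    active = sum((m * np.outer(fwd[l], bwd[l + 1])).sum()
                 for l, m in enumerate(masks))
    total = sum(m.size for m in masks)
    return 1.0 - active / total
```

For example, a mask whose surviving connection feeds a unit with no remaining path to the output contributes to effective sparsity even though it was never pruned directly.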
Appendix C Magnitude pruning and IGQ
In addition to the ab-initio pruning experiments in Section 4, we test IGQ in the context of magnitude pruning after training. In this set of experiments, we pretrain fully dense models and prune them by magnitude, either globally (Global Magnitude Pruning, LAMP) or layer by layer respecting sparsity allocation quotas (Uniform, Uniform+, ERK, and IGQ). We then revert the unpruned weights to their original random values and fully retrain the resulting subnetworks to convergence. Results are displayed in Figure 10 in the framework of effective compression. Overall, our method for distributing sparsity performs consistently well across all architectures in the context of magnitude pruning and compares favorably to the other baselines, especially in moderate compression regimes. Even though Global Magnitude Pruning can marginally outperform IGQ, it is completely unreliable on VGG19. ERK appears slightly better than IGQ on VGG19 and ResNet-18 at extreme sparsities; however, it performs much worse on LeNet-300-100 and has other general deficiencies, as discussed in Section 4. The closest rival of IGQ is LAMP, which performs very similarly but still falls short of IGQ's performance on VGG19 and ResNet-18 in moderate compression regimes. Note, however, that all presented methods require practically equal compute and time; thus, the evidence in Figure 10 is not meant to advertise IGQ as a cheaper alternative to LAMP but rather to illustrate the effectiveness of IGQ.
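The two magnitude-pruning modes compared above can be sketched as follows. This is a simplified illustration under assumed interfaces (the function name and signature are hypothetical, and schemes like LAMP rescale scores before thresholding, which is not shown here).

```python
import numpy as np

def magnitude_masks(weights, sparsity=None, quotas=None):
    """Binary masks keeping the largest-magnitude weights.
    With `quotas=None`, a single global threshold removes a `sparsity`
    fraction of all weights (global magnitude pruning). Otherwise,
    `quotas[l]` is the fraction to remove in layer l, as produced by an
    allocation scheme such as Uniform, ERK, or IGQ."""
    if quotas is None:
        # one threshold over the concatenated magnitudes of all layers
        flat = np.concatenate([np.abs(w).ravel() for w in weights])
        thresh = np.quantile(flat, sparsity)
        return [(np.abs(w) > thresh).astype(np.uint8) for w in weights]
    # per-layer thresholds respecting each layer's sparsity quota
    return [(np.abs(w) > np.quantile(np.abs(w), q)).astype(np.uint8)
            for w, q in zip(weights, quotas)]
```

In the retraining protocol above, these masks would be applied to the pretrained weights, after which the surviving positions are reset to their original random initialization and the subnetwork is trained to convergence.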