Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity

07/05/2021
by Artem Vysogorets, et al.
New York University

Neural network pruning is a fruitful area of research with surging interest in high sparsity regimes. Benchmarking in this domain heavily relies on a faithful representation of the sparsity of subnetworks, which has traditionally been computed as the fraction of removed connections (direct sparsity). This definition, however, fails to account for unpruned parameters that have become detached from the input or output of the underlying subnetwork, potentially underestimating actual effective sparsity: the fraction of inactivated connections. While this effect might be negligible for moderately pruned networks (up to 10-100 compression rates), we find that it plays an increasing role for thinner subnetworks, greatly distorting comparisons between different pruning algorithms. For example, we show that effective compression of a randomly pruned LeNet-300-100 can be orders of magnitude larger than its direct counterpart, while no discrepancy is ever observed when using SynFlow for pruning [Tanaka et al., 2020]. In this work, we adopt the lens of effective sparsity to reevaluate several recent pruning algorithms on common benchmark architectures (e.g., LeNet-300-100, VGG-19, ResNet-18) and discover that their absolute and relative performance changes dramatically in this new and more appropriate framework. To aim for effective, rather than direct, sparsity, we develop a low-cost extension to most pruning algorithms. Further, equipped with effective sparsity as a reference frame, we partially reconfirm that random pruning with appropriate sparsity allocation across layers performs as well as or better than more sophisticated algorithms for pruning at initialization [Su et al., 2020]. In response to this observation, using a simple analogy of pressure distribution in coupled cylinders from physics, we design novel layerwise sparsity quotas that outperform all existing baselines in the context of random pruning.


1 Introduction

Recent successes of Deep Neural Networks are commonly attributed to their high architectural complexity and excessive size (over-parametrization) (Denton et al., 2014; Neyshabur et al., 2019; Arora et al., 2018). Modern state-of-the-art architectures exhibit enormous parameter overhead, requiring prohibitive amounts of resources during both training and inference and leaving a significant environmental footprint (Shoeybi et al., 2019). In response to these challenges, much attention has turned to compression of neural networks and, in particular, parameter pruning. While initial approaches mostly focused on pruning models after training (LeCun et al., 1990; Hassibi et al., 1993), contemporary algorithms optimize the sparsity structure of a network while training its parameters (Mocanu et al., 2018; Evci et al., 2020) or even remove connections before any training whatsoever (Lee et al., 2019; Wang et al., 2020).

Compression rates initially considered in the pruning literature are usually moderate relative to the size of the original model. However, as contemporary model sizes grow into the billions of parameters, studying higher compression regimes becomes increasingly important. Recently, a bold new sparsity benchmark was set by Tanaka et al. (2020) with Iterative Synaptic Flow (SynFlow), a data-agnostic algorithm for pruning at initialization. Reportedly, it is capable of removing all but a few hundred parameters of VGG-16 while still producing trainable subnetworks, whereas other pruning methods disconnect networks at much lower sparsity levels (Tanaka et al., 2020). Related work by de Jorge et al. (2021) proposes an iterative version of the one-shot pruning algorithm Single-shot Network Pruning (SNIP) (Lee et al., 2019) and evaluates it in a similar high sparsity regime, reaching very high compression ratios.

Effective sparsity.

This increased focus on extreme sparsity leads us to consider what sparsity is meant to represent in neural networks and computational graphs at large. In the context of neural network pruning, sparsity to date is computed straightforwardly as the fraction of removed connections (direct sparsity)—and compression as the inverse fraction of unpruned connections (direct compression). We observe that this definition does not distinguish between connections that have actually been pruned, and those that have become effectively pruned because they have disconnected from the computational flow. In this work, we propose to instead focus on effective sparsity—the fraction of inactivated connections, be it through direct pruning or through otherwise disconnecting from either input or output of a network (see Figure 1 for an illustration).

Figure 1: Pruning edges from a fully-connected network. Top: direct sparsity does not account for the disconnected edges. Bottom: effective sparsity accounts for the dashed connections incident to inactivated neurons, yielding an effective compression twice as large as the direct one.
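To make the distinction concrete, the following is a minimal sketch (our own illustration, not the paper's released code) that computes effective sparsity of a small multilayer perceptron directly from its binary masks: a connection counts as active only if its source unit is still reachable from the input and its target unit still reaches the output. Biases, convolutions, and skip connections are ignored for simplicity.

```python
import numpy as np

def effective_sparsity(masks):
    """Fraction of inactive connections in a feed-forward network, given boolean
    masks of shape (fan_in, fan_out), one per layer (biases ignored)."""
    # Forward pass: which units still receive signal from the input?
    fwd = [np.ones(masks[0].shape[0], dtype=bool)]
    for M in masks:
        fwd.append((M & fwd[-1][:, None]).any(axis=0))   # unit alive if any live incoming edge
    # Backward pass: which units still reach the output?
    bwd = [np.ones(masks[-1].shape[1], dtype=bool)]
    for M in reversed(masks):
        bwd.append((M & bwd[-1][None, :]).any(axis=1))   # unit alive if any live outgoing edge
    bwd = bwd[::-1]
    # A connection is effective iff it is unpruned AND both of its endpoints are alive.
    active = sum(int((M & fwd[l][:, None] & bwd[l + 1][None, :]).sum())
                 for l, M in enumerate(masks))
    total = sum(M.size for M in masks)
    return 1.0 - active / total

# Toy example in the spirit of Figure 1: effective sparsity is never below direct sparsity.
rng = np.random.default_rng(0)
masks = [rng.random((4, 3)) > 0.5, rng.random((3, 2)) > 0.5]
direct = 1.0 - sum(int(M.sum()) for M in masks) / sum(M.size for M in masks)
print(direct, effective_sparsity(masks))
```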

In this work, we advocate that effective sparsity (effective compression) be used universally in place of its direct counterpart, since it more accurately depicts what one would reasonably consider the network's sparsity state. Using the lens of effective compression for benchmarking allows for a fairer comparison between different unstructured pruning algorithms. Note that effective compression is lower bounded by direct compression, which means that some pruning algorithms will give improved sparsity-accuracy trade-offs in this new framework. In Section 3, we critically reexamine a plethora of recent pruning algorithms for a variety of architectures and find that, in this refined framework, conclusions drawn in previous works appear overstated or incorrect. Figure 2 gives a sneak preview of this effect for three ab-initio pruning algorithms: SynFlow (Tanaka et al., 2020), SNIP (Lee et al., 2019), and plain random pruning for LeNet-300-100 on MNIST. While SynFlow appears superior to the other methods when evaluated against direct compression, it loses its advantage in the effective framework. Such radical performance changes are partly explained by the differing gaps between effective and direct compression inherent to different pruning algorithms (Figure 2). Significant departure between direct and effective compression kicks in at relatively low rates, making our work relevant even in these moderate regimes. For example, using random pruning to compress LeNet-300-100 results in a considerably larger effective compression; yet, removing the same number of parameters with SynFlow yields an unchanged effective compression. What makes certain iterative algorithms like SynFlow less likely to amass disconnected edges? In Section 3, we show that they are fortuitously designed to achieve a close convergence of direct and effective sparsity, hinting that preserving connectivity is an important aspect of the strong performance of high-compression pruning algorithms (Tanaka et al., 2020; de Jorge et al., 2021). Moreover, the lens of effective compression gives access to more extreme compression regimes for some pruning algorithms, which appear to disconnect much earlier when not accounting for inactive connections. At these high effective compression ratios, all three pruning methods from Figure 2 perform surprisingly similarly, even though they use varying degrees of information on data and parameter values.

Layerwise Sparsity Quotas (LSQ) and Ideal Gas Quotas (IGQ).

A recent thread of research by Frankle et al. (2021) and Su et al. (2020) shows that the performance of trained subnetworks produced by algorithms for pruning at initialization is robust to randomly reshuffling unpruned edges within layers before training. This observation led to the conjecture that these algorithms essentially generate successful distributions of sparsity across layers, while the exact connectivity patterns are unimportant. In Section 4, we reexamine this conjecture through the lens of effective sparsity, confirm it for the moderate compression regimes studied by Frankle et al. (2021) and Su et al. (2020), but find the truth to be more nuanced at higher compression rates. Nonetheless, this result highlights the importance of algorithms that carefully engineer layerwise sparsity quotas (LSQ) to obtain very simple and adequately performing pruning algorithms that are data- and parameter-agnostic. Another important motivation to search for good LSQ is that global pruning algorithms frequently remove entire layers prematurely (Lee et al., 2020) (cf. layer-collapse (Tanaka et al., 2020)), even before any significant differences between direct and effective sparsity emerge. Well-engineered LSQ could avoid this and enforce a proper redistribution of compression across layers (see (Gale et al., 2019; Mocanu et al., 2018; Evci et al., 2020) for existing baselines). In Section 4, we propose a novel LSQ, coined Ideal Gas Quotas (IGQ), by drawing intuitive analogies from physics. Effortlessly computable for any network-sparsity combination, IGQ performs on par with or better than all other baselines in the context of random pruning at initialization and of magnitude pruning after training.

Figure 2: LeNet-300-100 trained on MNIST after pruning with SNIP, SynFlow, and layerwise uniform random pruning. Left: gaps between direct and effective compression. Right: SynFlow has a better sparsity-accuracy trade-off than SNIP when plotted against direct compression (dashed), but not against effective compression (solid curves fitted to dots that represent individual experiments). Dashed and solid curves coincide for SynFlow.

Effective pruning.

Pruning to any desired direct sparsity is straightforward: one simply needs to mask out the corresponding number of parameters from a network. Effective sparsity, unfortunately, is more unpredictable and difficult to control. In particular, several known pruning algorithms suffer from layer-collapse once reaching a certain sparsity level, leading to unstable effective sparsity just before the disconnection. As a result, most pruning methods are unable to deliver certain values of effective sparsity regardless of how many connections are pruned. When possible, however, one needs to carefully tune the number of pruned parameters so that effective sparsity lands near a desired value. In Section 5, we suggest a simple extension to algorithms for pruning at initialization or after training that helps bring effective sparsity close to any predefined achievable value while incurring costs that are at most logarithmic in model size.

2 Related work

Neural network compression encompasses a number of orthogonal approaches such as parameter regularization (Lebedev and Lempitsky, 2016; Louizos et al., 2018), variational dropout (Molchanov et al., 2017), vector quantization and parameter sharing (Gong et al., 2014; Chen et al., 2015; Han et al., 2016), low-rank matrix decomposition (Denton et al., 2014; Jaderberg et al., 2014), and knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015). Network pruning, however, is by far the most common technique for model compression, and can be partitioned into structured (at the level of entire neurons/units) and unstructured (at the level of individual connections). While the former offers resource efficiency unconditioned on use of specialized hardware (Liu et al., 2019) and constitutes a fruitful research area (Li et al., 2017; Liu et al., 2017), we focus on the more actively studied unstructured pruning, which is where differences between effective and direct sparsity emerge. In what follows we give a quick overview, naturally grouping pruning methods by the time they are applied relative to training (see (Frankle and Carbin, 2019) and (Wang et al., 2020) for a similar taxonomy).

Pruning after training.

These earliest pruning techniques were designed to remove the least "salient" learned connections without sacrificing predictive performance. Optimal Brain Damage (LeCun et al., 1990) and its sequel Optimal Brain Surgeon (Hassibi et al., 1993) use the Hessian of the loss to estimate sensitivity to the removal of individual parameters.

Han et al. (2015) popularized magnitude as a simple and effective pruning criterion. It proved to be especially successful when applied alternately with several finetuning cycles, which is commonly referred to as Iterative Magnitude Pruning (IMP), a modification of which was used by Frankle and Carbin (2019) to discover lottery tickets. Later, Dong et al. (2017) showed that magnitude-based pruning minimizes the distortion of each layer's output incurred by parameter removal. Recently, Lee et al. (2021) extended this idea and proposed Layer-Adaptive Magnitude-Based Pruning (LAMP), which approximately minimizes an upper bound on the distortion of the entire network. While equivalent to magnitude pruning within individual layers, LAMP automatically discovers state-of-the-art layerwise sparsity quotas (see Section 4) that yield better performance (as a function of direct compression) than existing alternatives in the context of IMP.

Pruning during training.

Algorithms in this category learn sparsity structures together with parameter values, hoping that continued training will correct for damage incurred by pruning. To avoid inefficient prune-retrain cycles inherent to IMP, Narang et al. (2017) introduce gradual magnitude pruning over a single training round. Subsequently, Zhu and Gupta (2018) modify this algorithm by introducing a simpler pruning schedule and keeping layerwise sparsities uniform throughout training. Sparse Evolutionary Training (SET) (Mocanu et al., 2018) starts with an already sparse subnetwork and restructures it during training by pruning and randomly reviving connections. Unlike SET, Mostafa and Wang (2019) allow redistribution of sparsity across layers, while Dettmers and Zettlemoyer (2019) use gradient momentum as the criterion for parameter regrowth. Evci et al. (2020) rely on the instantaneous gradient to revive weights but follow SET to maintain the initial layerwise sparsity distribution during training.

Pruning before training.

Pruning at initialization is especially alluring to deep learning practitioners as it promises lower costs of both optimization and inference. While this may seem too ambitious, the Lottery Ticket Hypothesis (LTH) postulates that randomly initialized dense networks do indeed contain highly trainable and equally well-performing sparse subnetworks (Frankle and Carbin, 2019). Inspired by the LTH, Lee et al. (2019) design SNIP, which uses connection sensitivity as a parameter saliency score. Wang et al. (2020) notice that SNIP creates bottlenecks or even removes entire layers and propose Gradient Signal Preservation (GraSP) as an alternative that aims to maximize gradient flow in the pruned network. de Jorge et al. (2021) improve SNIP by applying it iteratively, allowing for reassessment of saliency scores during pruning and helping networks stay connected at higher compression rates. A truly new compression benchmark was set by Tanaka et al. (2020); their algorithm, SynFlow, iteratively prunes subsets of parameters according to their path norm and helps networks reach maximum compression without disconnecting. For example, SynFlow achieves non-random test accuracy on CIFAR-10 with a highly compressed VGG-16, while SNIP and GraSP fail already at much lower compression rates. An extensive ablation study by Frankle et al. (2021) examines SNIP, GraSP, and SynFlow within moderate compression rates and reveals that the performance of subnetworks produced by these methods is stable under layerwise rearrangement of unpruned connections prior to training. Later, this result was independently confirmed by Su et al. (2020) for SNIP and GraSP only. This observation suggests that these algorithms perform no better than random pruning with corresponding layerwise quotas, putting the spotlight on designing competitive LSQ (Mocanu et al., 2018; Gale et al., 2019; Lee et al., 2021).

3 Effective sparsity

In this section, we present our comparisons of a variety of pruning algorithms under the lens of effective compression. To illustrate the striking difference between direct and effective sparsity and expose the often radical change in relative performance of pruning algorithms when switching from the former to the latter, we evaluate several recent methods (SNIP, GraSP, SynFlow, LAMP (as a state-of-the-art representative of magnitude pruning after training and, in particular, of lottery tickets (Frankle and Carbin, 2019)), and SNIP-iterative) as well as random pruning with a uniform sparsity distribution across layers in both frameworks. Our experiments encompass modern architectures on commonly used computer vision benchmark datasets: LeNet-300-100 (Lecun et al., 1998) on MNIST, LeNet-5 (Lecun et al., 1998) on CIFAR-10, VGG-19 (Simonyan and Zisserman, 2015) on CIFAR-100, and ResNet-18 (He et al., 2016) on TinyImageNet. We place results for VGG-16 (Simonyan and Zisserman, 2015) on CIFAR-10 in Appendix B, as they closely resemble those of VGG-19. Further experimental details are presented in Appendix A. Our code is made available at github.com/avysogorets/effective-sparsity.

Notation. Consider an $L$-layer neural network $f(x; W_1, \dots, W_L)$ with weight tensors $W_l$ for $l = 1, \dots, L$. A subnetwork is specified by a set of binary masks $M_l \in \{0, 1\}^{|W_l|}$ that indicate unpruned parameters. Writing $W = (W_1, \dots, W_L)$ and $M = (M_1, \dots, M_L)$, the subnetwork is given by $f(x; M \odot W)$, where $\odot$ indicates pointwise multiplication. Note that biases and batch normalization parameters (Ioffe and Szegedy, 2015) are normally considered unprunable. Direct sparsity, the fraction of pruned weights, is given by $s = 1 - \sum_l \lVert M_l \rVert_0 / \sum_l |W_l|$, and the direct compression rate is defined as $c = 1 / (1 - s)$.

Figure 3 reveals that different algorithms tend to develop varying amounts of inactive connections. For example, the effective compression of subnetworks pruned by LAMP consistently reaches ten times their direct compression across all architectures, at which point at least nine in ten unpruned connections are effectively inactivated. Other methods (e.g., SNIP on VGG-19) remove entire layers early on, before any substantial differences between effective and direct compression emerge. SNIP-iterative and especially SynFlow, however, demonstrate a truly unique property: subnetworks pruned by these two algorithms exhibit practically equal effective and direct compression and, in the case of SynFlow, disconnect only at very high compression rates. What makes them special? Both SynFlow and SNIP-iterative are multi-shot pruning algorithms that remove parameters over many iterations. SynFlow ranks connections by their path norm (the sum of weighted paths passing through the edge, where the weight of a path is the product of the magnitudes of the weights of its edges). SNIP uses the connection sensitivity score $\left|\frac{\partial \mathcal{L}}{\partial w} \cdot w\right|$ from Lee et al. (2019) as a saliency measure, where $\mathcal{L}$ is the loss function. Both of these pruning criteria assign the lowest possible score of zero to inactive connections, scheduling them for immediate removal in the subsequent pruning iteration. Thus, by virtue of their iterative design, these two methods produce subnetworks with little to no difference between effective and direct compression. They are fortuitously designed to prune inactivated edges, which might explain their high-compression performance.

Figure 3: Effective versus direct compression across different pruning methods and architectures (curves and bands represent min/average/max across 3 seeds where subnetworks disconnect last among a total of 5 seeds).

Tanaka et al. (2020) compare SynFlow to SNIP and GraSP using direct sparsity, claiming it vastly superior in high compression regimes. However, pruning methods that generate large amounts of inactivated connections are clearly at a significant disadvantage in the original direct framework. Figure 4 shows that the performance gap between SynFlow and the other methods shrinks on all tested architectures under effective compression. The most dramatic changes are perhaps evident with LeNet-300-100, where SynFlow significantly dominates both SNIP and GraSP in the direct comparison but becomes strictly inferior when evaluated against the more telling effective compression. On the other hand, differences are not as pronounced on purely convolutional architectures such as VGG-19 and ResNet-18. Feature maps in convolutional layers are connected via groups of several parameters (kernels), making them more robust to inactivation compared to neurons in fully-connected layers.

Computing effective sparsity: In advocating the use of effective sparsity, we must make sure that it can be calculated efficiently. We propose an easily computable approach leveraging SynFlow. Note that a connection is inactive if and only if it is not part of any path from input to output. Assuming that unpruned weights are non-zero, this is equivalent to having zero path norm. Tanaka et al. (2020) observe that path norms can be efficiently computed with a single forward-backward pass on the all-ones input $\mathbb{1}$: the path norm of an unpruned parameter $\theta$ equals $\frac{\partial \mathcal{F}(\mathbb{1})}{\partial |\theta|} \cdot |\theta|$, where $\mathcal{F}$ is the linearized version of the original network $f$ obtained by removing all non-linearities and replacing every weight with its absolute value. For deep architectures, rescaling of weights might be required to avoid numerical instability (Tanaka et al., 2020).
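As a sketch of this computation for a plain fully-connected network (our own helper with hypothetical names; convolutional and residual architectures require the full linearized forward-backward pass described by Tanaka et al. (2020)), one can replace every unpruned weight by 1, propagate the all-ones input forward and the all-ones signal backward through the masked matrices, and flag as inactive every unpruned connection whose resulting path norm is zero:

```python
import numpy as np

def effective_sparsity_pathnorm(masks):
    """SynFlow-style computation for a multilayer perceptron: set every unpruned
    weight to 1, propagate the all-ones input forward and the all-ones signal
    backward, and flag an unpruned connection as inactive iff no input-output
    path passes through it (zero path norm)."""
    A = [M.astype(float) for M in masks]               # masked |weights|, here 0/1
    # Forward signal entering each layer: 1^T * (product of earlier layers).
    fwd = [np.ones(A[0].shape[0])]
    for M in A[:-1]:
        fwd.append(fwd[-1] @ M)
    # Backward signal leaving each layer: (product of later layers) * 1.
    bwd = [np.ones(A[-1].shape[1])]
    for M in reversed(A[1:]):
        bwd.append(M @ bwd[-1])
    bwd = bwd[::-1]
    pruned = inactive = 0
    for l, M in enumerate(A):
        path = fwd[l][:, None] * M * bwd[l][None, :]   # per-edge path norm (path count)
        pruned += int((M == 0).sum())
        inactive += int(((M > 0) & (path == 0)).sum())
    return (pruned + inactive) / sum(M.size for M in A)

# For very deep networks the products above can overflow; rescaling intermediate
# signals (as suggested by Tanaka et al., 2020) or working in log space avoids this.
```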

Figure 4: Test accuracy (min/average/max) of subnetworks trained from scratch after being pruned by different algorithms plotted against direct (dashed) and effective (solid) compression. Dashed and solid curves overlap for SynFlow and SNIP-iterative. Solid curves are fitted to scatter data (not shown for clarity of the presentation) as in Figure 2.

4 Layerwise sparsity quotas (LSQ) and a novel allocation method (IGQ)

Inspired by Frankle et al. (2021) and Su et al. (2020), we wish to confirm that SNIP, GraSP, and SynFlow work no better than random pruning with the corresponding layerwise sparsity allocation. While Frankle et al. (2021) and Su et al. (2020) only considered moderate compression rates and used direct sparsity as a reference frame, we reconfirm their conjecture in the effective framework and test it across the entire compression spectrum. We generate and train two sets of subnetworks: one pruned by SNIP, GraSP, or SynFlow (original), and one pruned randomly while preserving the layerwise sparsity quotas produced by each of these three methods (random).

Our results in Figure 5 agree with observations made by Frankle et al. (2021) and Su et al. (2020): in the moderate compression range, all three random pruning variants perform on par with (LeNet-300-100, LeNet-5, VGG-19) or better than (ResNet-18) their original counterparts. Effective sparsity allows us to faithfully examine higher compression, where the evidence is more equivocal. Similar patterns are still seen on ResNet-18; however, the original SNIP and GraSP beat random pruning with corresponding layerwise sparsities by a wide margin beyond a certain compression on LeNet-300-100. Random pruning associated with SynFlow matches the original SynFlow on the same network for longer, up to higher compression. On VGG-19, SynFlow bests the corresponding random pruning from a certain compression onward, while the original SNIP suffers from disconnection early on together with its random variant. Despite these nuances in the high compression regime, random pruning with specific layerwise sparsity quotas fares extremely well in the moderate sparsity regime and is even competitive with full-fledged SynFlow (see Figure 7). Therefore, random pruning can be a cheap and competitive alternative to more sophisticated and resource-consuming algorithms. The random methods from Figure 5, however, still require running SNIP, GraSP, or SynFlow to identify appropriate sparsity quotas and thus are just as expensive. Furthermore, sparsity distributions inherited from global pruning methods frequently suffer from premature removal of entire layers (e.g., SNIP on VGG-19), which is undesirable. Can we engineer readily computable and consistently well-performing sparsity quotas?

Figure 5: Original methods for pruning at initialization (solid) and random pruning with corresponding layerwise sparsity quotas (dashdot). Test accuracy of the unpruned networks is shown in grey.

To our knowledge, there are only a few ab-initio approaches in the literature that allocate sparsity in a principled fashion. Uniform is the simplest solution, keeping sparsity constant across all layers. Gale et al. (2019) give a modification (denoted Uniform+ following Lee et al. (2021)) that retains all parameters in the first convolutional layer and caps the sparsity of the last fully-connected layer. A more sophisticated approach, Erdős–Rényi-Kernel (ERK), sets the density of a convolutional layer with kernel size $k$, fan-in $n_{\text{in}}$, and fan-out $n_{\text{out}}$ proportional to $(n_{\text{in}} + n_{\text{out}} + 2k) / (n_{\text{in}} \cdot n_{\text{out}} \cdot k^2)$ (Mocanu et al., 2018; Evci et al., 2020). The last two approaches are unable to support the entire range of sparsities: Uniform+ can only achieve moderate direct compression because of the prunability constraints on its first and last layers, while both direct and effective sparsity levels achievable with ERK are often lower bounded. For example, the density that ERK assigns to certain layers of VGG-16 exceeds 1 when only a small fraction of parameters is cut, unless the excessive density is redistributed. We suggest a formal definition for layerwise sparsity quotas to guide future research into sparsity allocation and avoid the problems that riddle Uniform+ and ERK.

Definition 1 (Layerwise Sparsity Quotas).

A function $S$ mapping a target sparsity $s \in [0, 1]$ to layerwise sparsities $S(s) = (s_1, \dots, s_L)$ is called Layerwise Sparsity Quotas (LSQ) if it satisfies the following properties: (i) total sparsity: for any $s$, $\sum_{l} s_l |W_l| = s \sum_{l} |W_l|$; and (ii) monotonicity: for any layer $l$, $s_l(s) \le s_l(s')$ whenever $s \le s'$.
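As an illustration of Definition 1 (our own sanity check, not part of the paper), the two properties can be verified numerically for any candidate LSQ; the trivial Uniform allocation $S(s) = (s, \dots, s)$ passes both:

```python
import numpy as np

def uniform_lsq(s, layer_sizes):
    """Trivial LSQ: every layer receives the same sparsity s."""
    return [s] * len(layer_sizes)

def check_lsq(lsq, layer_sizes, grid=np.linspace(0.0, 1.0, 101)):
    total = float(sum(layer_sizes))
    prev = None
    for s in grid:
        quotas = np.asarray(lsq(s, layer_sizes), dtype=float)
        # (i) total sparsity: the size-weighted average of layer sparsities equals s.
        assert np.isclose(np.dot(quotas, layer_sizes) / total, s, atol=1e-6)
        # (ii) monotonicity: no layer gets denser as the target sparsity grows.
        if prev is not None:
            assert np.all(quotas >= prev - 1e-9)
        prev = quotas
    return True

print(check_lsq(uniform_lsq, [235200, 30000, 1000]))  # LeNet-300-100 weight counts
```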

We now present Ideal Gas Quotas (IGQ), our formula for sparsity allocation that satisfies Definition 1 and outperforms the above-mentioned baselines, while faring very well (over effective sparsity) compared to the allocation quotas derived from sophisticated pruning methods such as SynFlow. To develop an intuition for what constitutes a good LSQ construction, we study the layerwise sparsities induced by contemporary pruning algorithms such as LAMP and SynFlow (Figure 6). As a rule, they prune larger layers more aggressively than smaller ones, i.e., $s_l \ge s_{l'}$ whenever $|W_l| \ge |W_{l'}|$, and they avoid premature removal of entire layers, i.e., $s_l = 1$ if and only if $s = 1$.

Our approach is to interpret the compression of layers within a network as the compression of stacked cylinders of unit volume filled with gas, where the height of each cylinder is proportional to the number of parameters in the corresponding layer. We then use the Ideal Gas Law to derive the compression of each of the stacked, coupled cylinders. More formally, model each layer $l$ as a cylinder of height $h_l \propto |W_l|$ and cross-section area $\sigma_l = 1/h_l$. Further, assume that these stacked weightless cylinders with frictionless pistons, filled with the same amount of ideal gas, are in thermodynamic equilibrium with common pressure $p$ and temperature $T$. Isothermal compression of this system by an external force $F$ is governed by the Ideal Gas Law: $(p + F/\sigma_l)\,\sigma_l h_l' = p\,\sigma_l h_l$, where $\sigma_l h_l'$ is the compressed volume of cylinder $l$ and $h_l'$ is its new compressed height. Then, $h_l / h_l' = 1 + F/(p\sigma_l)$ or, equivalently, $h_l / h_l' = 1 + F h_l / p$. Interpreting $h_l / h_l'$ as the compression ratio of layer $l$, we arrive at compression quotas $c_l(F)$ (or sparsity quotas $s_l(F) = 1 - 1/c_l(F)$) parameterized by the force $F$ controlling the overall sparsity of the network. Given a target sparsity, the needed value of $F$ can simply be found by binary search to any desired precision. IGQ clearly satisfies all conditions of Definition 1 and the other properties identified above. It is surprising how closely the sparsity quotas achieved by IGQ resemble those of SynFlow, considering that IGQ merely describes a physical process (see Figure 6).
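The following is a minimal sketch of such a computation, assuming (as the derivation above suggests, with the equilibrium pressure and proportionality constants absorbed into the force $F$) layerwise compression of the form $c_l(F) = 1 + F \cdot |W_l|$; the exact parameterization is an assumption here, but any monotone family combined with binary search on $F$ follows the same pattern:

```python
import numpy as np

def igq_quotas(target_sparsity, layer_sizes, tol=1e-10):
    """Ideal-Gas-style layerwise sparsities: layer l is assigned compression
    c_l(F) = 1 + F * |W_l| (constants absorbed into the force F), and F is found
    by binary search so that the overall sparsity matches the target."""
    assert 0.0 <= target_sparsity < 1.0
    sizes = np.asarray(layer_sizes, dtype=float)

    def overall(force):
        s_l = 1.0 - 1.0 / (1.0 + force * sizes)    # layerwise sparsities at this force
        return float(np.dot(s_l, sizes) / sizes.sum())

    lo, hi = 0.0, 1.0
    while overall(hi) < target_sparsity:           # grow the bracket until it contains the target
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):            # plain bisection on the force F
        mid = 0.5 * (lo + hi)
        if overall(mid) < target_sparsity:
            lo = mid
        else:
            hi = mid
    force = 0.5 * (lo + hi)
    return 1.0 - 1.0 / (1.0 + force * sizes)

# Example: LeNet-300-100 weight tensors (784*300, 300*100, 100*10 weights).
# Larger layers receive higher sparsity and no layer is ever pruned completely.
print(igq_quotas(0.99, [235200, 30000, 1000]).round(4))
```

Because the overall sparsity is monotone in $F$, the bisection converges to any precision in a few dozen iterations, which is what makes IGQ essentially free to compute for any network-sparsity combination.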

Figure 6: Layerwise direct compression quotas of LeNet-5 (top row) and VGG-16 (bottom row) associated with SynFlow (left), our Ideal Gas Quotas (middle), and LAMP (right). Percentages indicate layer sizes relative to the total number of parameters; colors are assigned accordingly from blue (smaller layers) to red (larger layers). Curves of LAMP and SynFlow end when underlying networks disconnect.

We now evaluate IGQ for random pruning, comparing it against ERK, Uniform, and Uniform+, as well as against random pruning with the sparsity quotas induced by SynFlow for reference (Figure 7). Across all architectures, random pruning with IGQ quotas and with SynFlow-derived quotas is almost indistinguishable, suggesting that IGQ successfully mimics the quotas produced by SynFlow. While ERK sometimes exhibits performance similar to (ResNet-18) or even better than (VGG-19 at high compression) IGQ, it yields invalid layerwise sparsity quotas when removing only a small fraction of parameters from ResNet-18 and VGG-19. In the moderate sparsity regime, subnetworks pruned with IGQ reach unparalleled performance after training.

In Appendix C, we test IGQ in the context of magnitude pruning after training. Here, performance of IGQ practically coincides with that of LAMP, making it the only known LSQ to consistently perform best and a competitive method for layerwise sparsity allocation.

Figure 7: Test performance of trained subnetworks after random pruning with different layerwise sparsity distributions. Original SynFlow (black) is shown for reference. Uniform+ is not shown for LeNet-300-100 since it is designed for convolutional networks.

5 Effective pruning

Unlike pruning to a target direct sparsity, pruning to achieve a particular effective sparsity can be non-trivial. Here, we present an extension to algorithms for pruning at initialization or after training that achieves this goal efficiently, when possible (see Figure 8).

Figure 8: Effective compression produced by regular (dashdot) and our effective (solid) pruning on ResNet-18 according to ranking-based (left) and random (right) algorithms. Our procedures help pruning reach target effective sparsity, falling short only when the subnetwork is on the brink of disconnection.

Ranking-based pruning.

Algorithms like GraSP, SynFlow, and LAMP rank parameters by some notion of importance to guide pruning. When such a ranking is available, the naive solution is to iterate through all scores in order, considering each as a potential pruning threshold and recording the effective sparsity of the corresponding subnetwork with all lower-scored parameters removed. While provably identifying the threshold that yields a subnetwork with effective sparsity as close to the desired value as possible, this approach requires as many prune-evaluate cycles as there are prunable parameters, which is unreasonable for most contemporary architectures. To achieve an overhead that is only logarithmic in model size, at the price of minor inaccuracy, we use binary search for the cut-off threshold instead, leveraging the following monotonicity property: given two pruning thresholds $\tau_1 \le \tau_2$ and the corresponding nested subnetworks $f_1 \supseteq f_2$, effective sparsity satisfies $s_{\text{eff}}(f_1) \le s_{\text{eff}}(f_2)$ (note that higher direct sparsity alone, without this nesting, does not in general imply higher effective sparsity). Thus, binary search will branch in the correct direction.
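A sketch of this bisection (hypothetical helper names; `effective_sparsity` can be any routine such as the path-norm computation from Section 3):

```python
import numpy as np

def prune_to_effective_sparsity(scores, target, effective_sparsity):
    """Ranking-based effective pruning: binary-search the score threshold so that
    the subnetwork's effective sparsity lands as close to `target` as possible
    from below. `scores` is a list of per-layer saliency arrays (mask-shaped);
    `effective_sparsity` maps a list of boolean masks to a value in [0, 1]."""
    flat = np.sort(np.concatenate([s.ravel() for s in scores]))
    lo, hi, best = 0, flat.size - 1, None
    while lo <= hi:                                 # O(log |W|) prune-evaluate cycles
        mid = (lo + hi) // 2
        tau = flat[mid]
        masks = [s > tau for s in scores]           # prune every parameter scored <= tau
        if effective_sparsity(masks) <= target:
            best, lo = masks, mid + 1               # feasible cut: try pruning more
        else:
            hi = mid - 1                            # overshot the target: prune less
    return best   # None only if even the mildest cut already exceeds the target
```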

Random pruning.

Another class of pruning algorithms designs layerwise sparsities and then prunes each layer randomly (see Section 4). Binary search will not work directly, because a higher direct sparsity does not by itself imply a higher effective sparsity, and random pruning is unlikely to produce a neat chain of embedded subnetworks as before. To circumvent this issue, we design an improved algorithm that produces embedded subnetworks on each iteration, allowing binary search to work (Algorithm 1). Starting from the two extreme subnetworks, the fully-dense network with masks $\overline{M}$ and the fully-sparse network with masks $\underline{M}$, we narrow the sparsity gap between them while preserving $\underline{M} \subseteq \overline{M}$. For each layer, we keep track of the connections that are unpruned in $\overline{M}$ but pruned in $\underline{M}$, randomly sample parameters from this set according to the layerwise sparsity quotas, and form an intermediate network $M'$ by pruning them out of $\overline{M}$ (or, equivalently, reviving the rest in $\underline{M}$). Depending on where the effective sparsity of $M'$ lands relative to the target, we update either $\overline{M}$ or $\underline{M}$ and branch. Since the connections to be pruned from $\overline{M}$ (or revived in $\underline{M}$) are chosen randomly at each step, weights within the same layer have equal probability of being pruned. Once $\overline{M}$ and $\underline{M}$ are only one parameter away from each other, the algorithm returns $\overline{M}$, yielding a connected model.

Input :  Desired effective sparsity $s^*$; LSQ function $S$.
Initialize $\overline{M}_l \leftarrow$ all-ones and $\underline{M}_l \leftarrow$ all-zeros for all layers $l$;
while $\overline{M}$ and $\underline{M}$ differ in more than one connection do
       $s \leftarrow$ a direct sparsity between those of $\overline{M}$ and $\underline{M}$ (e.g., the midpoint); $(s_1, \dots, s_L) \leftarrow S(s)$;
       for all layers $l$: form $M'_l$ from $\overline{M}_l$ by randomly pruning connections that are unpruned in $\overline{M}_l$ but pruned in $\underline{M}_l$ until layer $l$ reaches sparsity $s_l$ (so that $\underline{M}_l \subseteq M'_l \subseteq \overline{M}_l$);
       if the effective sparsity of $M'$ is below $s^*$ then
              $\overline{M} \leftarrow M'$ ($M'$ is still too dense and becomes the new dense bound);
       else
              $\underline{M} \leftarrow M'$ ($M'$ is sparse enough and becomes the new sparse bound);
       end if
end while
Return :  Masks $\overline{M}$ satisfying $\underline{M} \subseteq \overline{M}$, with effective sparsity at most $s^*$.
Algorithm 1 Approximate Effective Random Pruning
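A simplified Python sketch of the same idea follows (our own illustration; for brevity it halves the gap by sampling uniformly across all layers, whereas Algorithm 1 constrains every intermediate mask to the prescribed layerwise quotas):

```python
import numpy as np

def effective_random_pruning(layer_shapes, target, effective_sparsity, seed=0):
    """Bisection over a chain of embedded random subnetworks: keep a dense bound
    `upper` and a sparse bound `lower` (lower is a subset of upper), build an
    intermediate mask between them, and tighten whichever bound keeps the target
    effective sparsity bracketed. Returns the dense bound, a connected subnetwork."""
    rng = np.random.default_rng(seed)
    upper = [np.ones(shape, dtype=bool) for shape in layer_shapes]    # fully dense
    lower = [np.zeros(shape, dtype=bool) for shape in layer_shapes]   # fully sparse

    def gap_indices():
        # (layer, flat index) of connections unpruned in `upper` but pruned in `lower`
        return [(l, i) for l, (u, lo) in enumerate(zip(upper, lower))
                for i in np.flatnonzero(u & ~lo)]

    gap = gap_indices()
    while len(gap) > 1:
        drop = rng.choice(len(gap), size=len(gap) // 2, replace=False)
        mid = [u.copy() for u in upper]
        for k in drop:
            l, i = gap[k]
            mid[l].flat[i] = False                  # lower subset of mid subset of upper
        if effective_sparsity(mid) < target:
            upper = mid                             # still too dense: tighten the dense bound
        else:
            lower = mid                             # sparse enough: tighten the sparse bound
        gap = gap_indices()
    return upper
```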

6 Discussion

In our work, we argue that effective sparsity (effective compression) is the correct benchmarking measure for pruning algorithms since it discards effectively inactive connections and represents the true remaining connectivity pattern. Moreover, effective sparsity allows us to study extreme compression regimes for subnetworks that otherwise appear disconnected at much lower direct sparsities. We initiate the study of current pruning algorithms in this refined frame of reference and rectify previous benchmarks. To facilitate the use of effective sparsity in future research, we describe low-cost procedures to both compute and achieve desired effective sparsity when pruning. Lastly, with effective sparsity allowing us to fairly zoom into higher compression regimes than previously possible, we examine random pruning with prescribed layerwise sparsities and propose our own readily computable quotas (IGQ) after establishing conditions reasonable LSQ should fulfill. We show that IGQ, while allowing for any level of sparsity, is more advantageous than all existing similar baselines (Uniform, ERK) and gives comparable performance to sparsity quotas derived from more sophisticated and computationally expensive algorithms like SynFlow.

Limitations and Broader Impacts: We hope that the lens of effective compression will spur more research in high compression regimes. One possible limitation is that effective compression is harder to control exactly; in particular, using different seeds might lead to slightly different effective compression rates. However, these perturbations are minor. Another small limitation is that our effective pruning strategies are not immediately applicable to some algorithms that prune while training (e.g., RigL (Evci et al., 2020)), although in most cases our approach can be adapted. Lastly, one might argue that for some architectures accuracy drops precipitously at higher compression, making very sparse subnetworks less practical. We hope that opening up the study of high compression will make it possible to explore how to use sparse networks as building blocks, for instance via ensembling. Our framework allows a principled study of this regime.

Both authors were supported by the National Science Foundation under NSF Award 1922658. Neither of the authors has any competing interests to report.

References

  • S. Arora, N. Cohen, and E. Hazan (2018) On the optimization of deep networks: implicit acceleration by overparameterization. In 35th International Conference on Machine Learning, ICML 2018, A. Krause and J. Dy (Eds.), pp. 372–389. Cited by: §1.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, New York, NY, USA, pp. 535–541. External Links: ISBN 1595933395, Link, Document Cited by: §2.
  • W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2285–2294. External Links: Link Cited by: §2.
  • P. de Jorge, A. Sanyal, H. Behl, P. Torr, G. Rogez, and P. K. Dokania (2021) Progressive skeletonization: trimming more fat from a network at initialization. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §2.
  • E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. External Links: Link Cited by: §1, §2.
  • T. Dettmers and L. Zettlemoyer (2019) Sparse networks from scratch: faster training without losing performance. CoRR abs/1907.04840. External Links: Link, 1907.04840 Cited by: §2.
  • X. Dong, S. Chen, and S. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. External Links: Link Cited by: §2.
  • U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen (2020) Rigging the lottery: making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 2943–2952. External Links: Link Cited by: §1, §1, §2, §4, §6.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §2, §2, §2, footnote 1.
  • J. Frankle, G. K. Dziugaite, D. Roy, and M. Carbin (2021) Pruning neural networks at initialization: why are we missing the mark?. In International Conference on Learning Representations, External Links: Link Cited by: Table 1, Appendix A, §1, §2, §4, §4.
  • T. Gale, E. Elsen, and S. Hooker (2019) The state of sparsity in deep neural networks. arXiv e-prints arXiv:1902.09574. External Links: Link Cited by: §1, §2, §4.
  • Y. Gong, L. Liu, M. Yang, and L. D. Bourdev (2014) Compressing deep convolutional networks using vector quantization. CoRR abs/1412.6115. External Links: Link, 1412.6115 Cited by: §2.
  • S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §2.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. External Links: Link Cited by: §2.
  • B. Hassibi, D.G. Stork, and G.J. Wolff (1993) Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, Vol. , pp. 293–299 vol.1. External Links: Document Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document Cited by: Appendix A, §3.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, External Links: Link Cited by: §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456. External Links: Link Cited by: §3.
  • M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference, External Links: Document Cited by: §2.
  • V. Lebedev and V. Lempitsky (2016) Fast convnets using group-wise brain damage. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2554–2564. External Links: Document Cited by: §2.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document Cited by: Appendix A, §3.
  • Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605. Cited by: §1, §2.
  • J. Lee, S. Park, S. Mo, S. Ahn, and J. Shin (2021) Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, External Links: Link Cited by: §2, §2, §4.
  • N. Lee, T. Ajanthan, S. Gould, and P. H. S. Torr (2020) A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • N. Lee, T. Ajanthan, and P. H. S. Torr (2019) Snip: single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: Table 1, §1, §1, §1, §2, §3.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.
  • Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2755–2763. External Links: Document Cited by: §2.
  • Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In ICLR, Cited by: §2.
  • C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta (2018) Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9 (1), pp. 2383. External Links: Document, ISBN 2041-1723, Link Cited by: §1, §1, §2, §2, §4.
  • D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 2498–2507. External Links: Link Cited by: §2.
  • H. Mostafa and X. Wang (2019) Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, pp. 4646–4655. Cited by: §2.
  • S. Narang, G. Diamos, S. Sengupta, and E. Elsen (2017) Exploring sparsity in recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.
  • B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro (2019) The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using model parallelism. CoRR abs/1909.08053. External Links: Link, 1909.08053 Cited by: §1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Appendix A, §3.
  • J. Su, Y. Chen, T. Cai, T. Wu, R. Gao, L. Wang, and J. D. Lee (2020) Sanity-checking pruning methods: random tickets can win the jackpot. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 20390–20401. External Links: Link Cited by: Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity, §1, §2, §4, §4.
  • H. Tanaka, D. Kunin, D. L. Yamins, and S. Ganguli (2020) Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 6377–6389. External Links: Link Cited by: Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity, §1, §1, §1, §2, §3, §3.
  • C. Wang, G. Zhang, and R. Grosse (2020) Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, External Links: Link Cited by: Table 1, Appendix A, §1, §2, §2.
  • M. Zhu and S. Gupta (2018) To prune, or not to prune: exploring the efficacy of pruning for model compression. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, External Links: Link Cited by: §2.

Appendix A Experimental details

Our experimental work encompasses five different architecture-dataset combinations: LeNet-300-100 (Lecun et al., 1998) on MNIST (Creative Commons Attribution-Share Alike 3.0 license), LeNet-5 (Lecun et al., 1998) and VGG-16 (Simonyan and Zisserman, 2015) on CIFAR-10 (MIT license), VGG-19 (Simonyan and Zisserman, 2015) on CIFAR-100 (MIT license), and ResNet-18 (He et al., 2016) on TinyImageNet (MIT license). Following Frankle et al. (2021), we do not reinitialize subnetworks after pruning (we revert back to the original initialization when pruning a pretrained model by LAMP). We use our own implementation of all pruning algorithms in TensorFlow, except for GraSP, for which we use the original code in Torch published by Wang et al. (2020). All runs were repeated 3 times for stability of results. Training was performed on an internal cluster equipped with NVIDIA RTX-8000, NVIDIA V-100, and AMD MI50 GPUs. Hyperparameters and training schedules used in our experiments are adopted from related works and are listed in Table 1.

Model Epochs Drop epochs Batch LR Weight decay Source Node type
LeNet-300-100 (Lee et al., 2019) CPU
LeNet-5 (Lee et al., 2019) CPU
VGG-16 (Frankle et al., 2021) GPU
VGG-19 (Wang et al., 2020) GPU
ResNet-18 (Frankle et al., 2021) GPU
Table 1: Summary of experimental work. All architectures include batch normalization layers followed by ReLU activations. Models are initialized using the Kaiming normal scheme (fan-avg) and optimized by SGD with momentum using a stepwise LR schedule (a fixed drop factor applied on the specified drop epochs). The categorical cross-entropy loss function is used for all models.

We apply standard augmentations to images during training. In particular, we normalize examples per-channel for all datasets and, additionally, randomly apply: shifts by at most 4 pixels in any direction and horizontal flips (CIFAR-10, CIFAR-100, and TinyImageNet), or rotations by up to 4 degrees (MNIST).

Appendix B Experiments with VGG-16

In Figure 9, we display the results of our experiments with VGG-16 on CIFAR-10. As argued in Section 3, higher sparsities are required for purely convolutional architectures (such as VGG-16) to develop inactive connections, since feature maps are harder to disconnect. At the same time, several algorithms (SNIP, SNIP-iterative, GraSP) suffer from layer-collapse at modest sparsities and hence fail to develop significant amounts of inactive parameters. For this reason, as evident from Figures 3, 4, and 9, VGG-16 arguably showcases the smallest differences between effective and direct compression among all tested architectures.

Figure 9: Left: effective versus direct compression of VGG-16 when pruned by different algorithms. Right: test accuracy (min/average/max) of VGG-16 trained from scratch after being pruned by different algorithms plotted against direct (dashed) and effective (solid) compression. Dashed and solid curves overlap for SynFlow and SNIP-iterative.

Appendix C Magnitude pruning and IGQ

In addition to the ab-initio pruning experiments in Section 4, we test IGQ in the context of magnitude pruning after training. In this set of experiments, we pretrain fully-dense models and prune them by magnitude, either with global methods (Global Magnitude Pruning, LAMP) or layer-by-layer respecting sparsity allocation quotas (Uniform, Uniform+, ERK, and IGQ). Then, we revert the unpruned weights back to their original random values and fully retrain the resulting subnetworks to convergence. Results are displayed in Figure 10 in the framework of effective compression. Overall, our method for distributing sparsity in the context of magnitude pruning performs consistently well across all architectures and compares favorably to other baselines, especially in moderate compression regimes. Even though Global Magnitude Pruning can marginally outperform IGQ, it is completely unreliable on VGG-19. ERK appears slightly better than IGQ on VGG-19 and ResNet-18 at extreme sparsities; however, it performs much worse on LeNet-300-100 and has other general deficiencies, as discussed in Section 4. The closest rival of IGQ is LAMP, which performs very similarly but is still unable to reach IGQ's performance on VGG-19 and ResNet-18 in moderate compression regimes. Note, however, that all presented methods require practically equal compute and time; thus, the evidence in Figure 10 is not meant to advertise IGQ as a cheaper alternative to LAMP but rather to illustrate the effectiveness of IGQ.
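For reference, layerwise magnitude pruning under a given sparsity allocation amounts to thresholding each weight tensor at its own magnitude quantile; a minimal sketch (our own helper, usable with quotas from, e.g., the IGQ sketch in Section 4):

```python
import numpy as np

def magnitude_prune_with_quotas(weights, quotas):
    """Prune each layer independently by magnitude, removing the prescribed
    fraction (quota) of smallest-magnitude weights in that layer."""
    masks = []
    for W, s in zip(weights, quotas):
        k = int(round(s * W.size))                   # number of weights to prune here
        if k == 0:
            masks.append(np.ones_like(W, dtype=bool))
            continue
        threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
        masks.append(np.abs(W) > threshold)          # ties at the threshold are pruned too
    return masks

# Usage: given pretrained `weights`, take quotas from a chosen LSQ, apply the masks,
# rewind unpruned weights to their original initialization, and retrain.
```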

Figure 10: Test performance of retrained subnetworks after pruning with different magnitude-based methods. Uniform+ is not shown for LeNet-300-100 since it is designed for convolutional networks.