Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy both during training and at test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design. We first mathematically formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a novel pruning algorithm Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization subject to a sparsity constraint. Notably, this algorithm makes no reference to the training data and consistently outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to 99.9 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm that data must be used to quantify which synapses are important.
Network pruning, or the compression of neural networks by removing parameters, has been an important subject both for reasons of practical deployment [1, 2, 3, 4, 5, 6, 7] and for theoretical understanding of artificial and biological neural networks. Conventionally, pruning algorithms have focused on compressing pre-trained models [1, 2, 3, 5, 6]. However, recent works [10, 11] have identified through iterative training and pruning cycles (iterative magnitude pruning) that there exist sparse subnetworks (winning tickets) in randomly-initialized neural networks that, when trained in isolation, can match the test accuracy of the original network. Moreover, it has been shown that some of these winning ticket subnetworks can generalize across datasets and optimizers. While these results suggest training can be made more efficient by identifying winning ticket subnetworks at initialization, they do not provide efficient algorithms to find them. Typically, identifying winning tickets through iterative training and pruning cycles requires significantly more computation than simply training the original network from scratch [10, 11]. Thus, the fundamental unanswered question is: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? Towards this goal, we start by investigating the limitations of existing pruning algorithms at initialization [13, 14], determine simple strategies for avoiding these limitations, and provide a novel data-agnostic algorithm that improves upon state-of-the-art results. Our main contributions are:
We study layer-collapse, the premature pruning of an entire layer making a network untrainable, and formulate the axiom Maximal Critical Compression that posits a pruning algorithm should avoid layer-collapse whenever possible (Sec. 3).
We demonstrate theoretically and empirically that synaptic saliency, a general class of gradient-based scores for pruning, is conserved at every hidden unit and layer of a neural network (Sec. 4).
We show that these conservation laws imply parameters in large layers receive lower scores than parameters in small layers, which elucidates why single-shot pruning disproportionately prunes the largest layer leading to layer-collapse (Sec. 4).
We prove that a pruning algorithm avoids layer-collapse entirely and satisfies Maximal Critical Compression if it uses iterative, positive synaptic saliency scores (Sec. 6).
We introduce a new data-agnostic algorithm Iterative Synaptic Flow Pruning (SynFlow) that satisfies Maximal Critical Compression (Sec. 6) and demonstrate empirically that this algorithm achieves state-of-the-art pruning performance on 12 distinct combinations of models and datasets (Sec. 7). All code is available at github.com/ganguli-lab/Synaptic-Flow.
While there are a variety of approaches to compressing neural networks, such as novel design of micro-architectures [15, 16, 17], dimensionality reduction of network parameters [18, 19], and training of dynamic sparse networks [20, 21], in this work we will focus on neural network pruning.
Pruning after training. Conventional pruning algorithms assign scores to parameters in neural networks after training and remove the parameters with the lowest scores [5, 22, 23]. Popular scoring metrics include weight magnitudes [4, 6], their generalization to multiple layers, first- [1, 25, 26, 27] and second-order [2, 3, 27] Taylor coefficients of the training loss with respect to the parameters, and more sophisticated variants [28, 29, 30]. While these pruning algorithms can indeed compress neural networks at test time, there is no reduction in the cost of training.
Pruning before training. Recent works demonstrated that randomly initialized neural networks can be pruned before training with little or no loss in the final test accuracy [10, 13, 31]. In particular, the Iterative Magnitude Pruning (IMP) algorithm [10, 11] repeats multiple cycles of training, pruning, and weight rewinding to identify extremely sparse neural networks at initialization that can be trained to match the test accuracy of the original network. While IMP is powerful, it requires multiple cycles of expensive training and pruning with very specific sets of hyperparameters. Avoiding these difficulties, a different approach uses the gradients of the training loss at initialization to prune the network in a single shot [13, 14]. While these single-shot pruning algorithms at initialization are much more efficient, and work as well as IMP at moderate levels of sparsity, they suffer from layer-collapse, or the premature pruning of an entire layer rendering a network untrainable [32, 33]. Understanding and circumventing this layer-collapse issue is the fundamental motivation for our study.
Broadly speaking, a pruning algorithm at initialization is defined by two steps. The first step scores the parameters of a network according to some metric and the second step masks the parameters (removes or keeps the parameter) according to their scores.
The pruning algorithms we consider will always mask the parameters by simply removing the parameters with the smallest scores. This ranking process can be applied globally across the network, or layer-wise. Empirically, it has been shown that global-masking performs far better than layer-masking, in part because it introduces fewer hyperparameters and allows for flexible pruning rates across the network. However, recent works [32, 14, 33] have identified a key failure mode, layer-collapse, for existing pruning algorithms using global-masking. Layer-collapse occurs when an algorithm prunes all parameters in a single weight layer even when prunable parameters remain elsewhere in the network. This renders the network untrainable, as is evident from the sudden drops in the achievable accuracy for the network shown in Fig. 1. To gain insight into the phenomenon of layer-collapse we will define some useful terms inspired by a recent paper studying the failure mode.
Given a network, compression ratio ($\rho$) is the number of parameters in the original network divided by the number of parameters remaining after pruning. For example, when the compression ratio $\rho = 10^3$, then only one out of a thousand of the parameters remain after pruning. Max compression ($\rho_{\max}$) is the maximal possible compression ratio for a network that doesn’t lead to layer-collapse. For example, for a network with $L$ layers and $N$ parameters, $\rho_{\max} = N/L$, which is the compression ratio associated with pruning all but one parameter per layer. Critical compression ($\rho_{\text{cr}}$) is the maximal compression ratio a given algorithm can achieve without inducing layer-collapse. In particular, the critical compression of an algorithm is always upper bounded by the max compression of the network: $\rho_{\text{cr}} \le \rho_{\max}$. This inequality motivates the following axiom we postulate any successful pruning algorithm should satisfy.
Maximal Critical Compression. The critical compression of a pruning algorithm applied to a network should always equal the max compression of that network.
In other words, this axiom implies a pruning algorithm should never prune a set of parameters that results in layer-collapse if there exists another set of the same cardinality that will keep the network trainable. To the best of our knowledge, no existing pruning algorithm with global-masking satisfies this simple axiom. Of course any pruning algorithm could be modified to satisfy the axiom by introducing specialized layer-wise pruning rates. However, to retain the benefits of global-masking, we will formulate an algorithm, Iterative Synaptic Flow Pruning (SynFlow), which satisfies this property by construction. SynFlow is a natural extension of magnitude pruning that preserves the total flow of synaptic strengths from input to output rather than the individual synaptic strengths themselves. We will demonstrate that not only does the SynFlow algorithm achieve Maximal Critical Compression, but it consistently outperforms existing state-of-the-art pruning algorithms (as shown in Fig. 1 and in Sec. 7), all while not using the data.
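The three quantities defined above can be made concrete with a small sketch. It assumes single-shot global masking with fixed, distinct scores, and the helper names `max_compression` and `critical_compression` as well as the toy scores are hypothetical, introduced only for illustration. Under that assumption, layer-collapse is avoided only while the best-scoring parameter of every layer is still kept, which fixes the smallest admissible number of kept parameters.

```python
def max_compression(layer_sizes):
    """rho_max = N / L: keep exactly one parameter per layer."""
    return sum(layer_sizes) / len(layer_sizes)


def critical_compression(scores_by_layer):
    """Largest compression ratio reachable by single-shot global masking
    (keep the top-k scores) before some layer loses all its parameters.
    Assumes all scores are distinct."""
    all_scores = [s for layer in scores_by_layer for s in layer]
    # Global rank (1 = highest) of the best-scoring parameter in each layer.
    best_ranks = [1 + sum(s > max(layer) for s in all_scores)
                  for layer in scores_by_layer]
    # Every layer survives only if we keep at least this many parameters.
    min_kept = max(best_ranks)
    return len(all_scores) / min_kept


# Toy scores (illustrative values): a 2-parameter layer and a 4-parameter layer.
scores = [[0.9, 0.8], [0.3, 0.2, 0.1, 0.05]]
rho_max = max_compression([len(layer) for layer in scores])
rho_cr = critical_compression(scores)
print(rho_cr, rho_max)  # 2.0 3.0
```

In this toy case the scoring rule reaches only $\rho_{\text{cr}} = 2$ although $\rho_{\max} = 3$ is attainable, so it would violate the axiom.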
Throughout this work, we benchmark our algorithm, SynFlow, against two simple baselines, random scoring and scoring based on weight magnitudes, as well as two state-of-the-art single-shot pruning algorithms, Single-shot Network Pruning based on Connection Sensitivity (SNIP) and Gradient Signal Preservation (GraSP). SNIP is a pioneering algorithm to prune neural networks at initialization by scoring weights based on the gradients of the training loss. GraSP is a more recent algorithm that aims to preserve gradient flow at initialization by scoring weights based on the Hessian-gradient product. Both SNIP and GraSP have been thoroughly benchmarked against other state-of-the-art pruning algorithms that involve training [2, 34, 10, 11, 35, 21, 20], demonstrating competitive performance.
In this section, we will further verify that layer-collapse is a key obstacle to effective pruning at initialization and explore what is causing this failure mode. As shown in Fig. 2, with increasing compression ratios, existing random, magnitude, and gradient-based pruning algorithms will prematurely prune an entire layer making the network untrainable. Understanding why certain score metrics lead to layer-collapse is essential to improve the design of pruning algorithms.
Random pruning prunes every layer in a network by the same amount, evident by the horizontal lines in Fig. 2. With random pruning the smallest layer, the layer with the fewest parameters, is the first to be fully pruned. Conversely, magnitude pruning prunes layers at different rates, evident by the staircase pattern in Fig. 2. Magnitude pruning effectively prunes parameters based on the variance of their initialization, which for common network initializations, such as Xavier or Kaiming, is inversely proportional to the width of a layer. With magnitude pruning the widest layers, the layers with largest input or output dimensions, are the first to be fully pruned. Gradient-based pruning algorithms SNIP and GraSP also prune layers at different rates, but it is less clear what the root cause for this preference is. In particular, both SNIP and GraSP aggressively prune the largest layer, the layer with the most trainable parameters, evident by the sharp peaks in Fig. 2. Based on this observation, we hypothesize that gradient-based scores averaged within a layer are inversely proportional to the layer size. We examine this hypothesis by constructing a theoretical framework grounded in flow networks. We first define a general class of gradient-based scores, prove a conservation law for these scores, and then use this law to prove that our hypothesis of inverse proportionality between layer size and average layer score holds exactly.
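A general class of gradient-based scores. Synaptic saliency is any score metric that can be expressed as the Hadamard product

$$\mathcal{S}(\theta) = \frac{\partial \mathcal{R}}{\partial \theta} \odot \theta,$$

where $\mathcal{R}$ is a scalar loss function of the output $y$ of a feed-forward network parameterized by $\theta$. When $\mathcal{R}$ is the training loss $\mathcal{L}$, the resulting synaptic saliency metric is equivalent (modulo sign) to $-\frac{\partial \mathcal{L}}{\partial \theta} \odot \theta$, the score metric used in Skeletonization, one of the first network pruning algorithms. The resulting metric is also closely related to $\left|\frac{\partial \mathcal{L}}{\partial \theta} \odot \theta\right|$, the score used in SNIP, $-\left(H \frac{\partial \mathcal{L}}{\partial \theta}\right) \odot \theta$, the score used in GraSP, and $\left(\frac{\partial \mathcal{L}}{\partial \theta} \odot \theta\right)^2$, the score used in the pruning after training algorithm Taylor-FO. This general class of score metrics, while not all-encompassing, exposes key properties of gradient-based scores used for pruning.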
The conservation of synaptic saliency. All synaptic saliency metrics respect two surprising conservation laws that hold at any initialization and step in training.
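Theorem 1. Neuron-wise Conservation of Synaptic Saliency. For a feedforward neural network with homogeneous activation functions, $\phi(x) = \phi'(x)x$ (e.g. ReLU, Leaky ReLU, linear), the sum of the synaptic saliency for the incoming parameters to a hidden neuron ($\mathcal{S}^{in} = \langle \frac{\partial \mathcal{R}}{\partial \theta^{in}}, \theta^{in} \rangle$) is equal to the sum of the synaptic saliency for the outgoing parameters from the hidden neuron ($\mathcal{S}^{out} = \langle \frac{\partial \mathcal{R}}{\partial \theta^{out}}, \theta^{out} \rangle$).

Proof. Consider the hidden neuron $j$ of a network with outgoing parameters $\theta^{out}_{ij}$ and incoming parameters $\theta^{in}_{jk}$, such that $\frac{\partial \mathcal{R}}{\partial \phi(z_j)} = \sum_i \frac{\partial \mathcal{R}}{\partial z_i} \theta^{out}_{ij}$ and $z_j = \sum_k \theta^{in}_{jk} \phi(z_k)$. The sum of the synaptic saliency for the outgoing parameters is

$$\mathcal{S}^{out} = \sum_i \frac{\partial \mathcal{R}}{\partial \theta^{out}_{ij}} \theta^{out}_{ij} = \sum_i \frac{\partial \mathcal{R}}{\partial z_i} \phi(z_j) \theta^{out}_{ij} = \left( \sum_i \frac{\partial \mathcal{R}}{\partial z_i} \theta^{out}_{ij} \right) \phi(z_j) = \frac{\partial \mathcal{R}}{\partial \phi(z_j)} \phi(z_j).$$

The sum of the synaptic saliency for the incoming parameters is

$$\mathcal{S}^{in} = \sum_k \frac{\partial \mathcal{R}}{\partial \theta^{in}_{jk}} \theta^{in}_{jk} = \sum_k \frac{\partial \mathcal{R}}{\partial z_j} \phi(z_k) \theta^{in}_{jk} = \frac{\partial \mathcal{R}}{\partial z_j} \left( \sum_k \theta^{in}_{jk} \phi(z_k) \right) = \frac{\partial \mathcal{R}}{\partial z_j} z_j.$$

When $\phi$ is homogeneous, then $\frac{\partial \mathcal{R}}{\partial \phi(z_j)} \phi(z_j) = \frac{\partial \mathcal{R}}{\partial \phi(z_j)} \phi'(z_j) z_j = \frac{\partial \mathcal{R}}{\partial z_j} z_j$, so $\mathcal{S}^{out} = \mathcal{S}^{in}$. ∎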
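The neuron-wise conservation law can be checked numerically on a tiny example. The sketch below is only an illustration under stated assumptions: a bias-free network with one ReLU hidden layer, the choice $\mathcal{R} = \sum_i y_i$, hand-written gradients, and hypothetical names (`W1`, `W2`, `saliency_in`, `saliency_out`) that do not come from the paper or its code.

```python
import random

rng = random.Random(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hid)]   # incoming weights
W2 = [[rng.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_out)]  # outgoing weights
x = [rng.gauss(0, 1) for _ in range(n_in)]

# Forward pass: z = W1 x, h = relu(z), y = W2 h, R = sum(y).
z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
h = [max(0.0, zj) for zj in z]

# Backward pass by hand: dR/dy_i = 1, so dR/dh_j = sum_i W2[i][j].
dR_dh = [sum(W2[i][j] for i in range(n_out)) for j in range(n_hid)]
dR_dz = [g if zj > 0 else 0.0 for g, zj in zip(dR_dh, z)]  # ReLU derivative


def saliency_in(j):
    # dR/dW1[j][k] = dR/dz_j * x_k, multiplied element-wise by W1[j][k].
    return sum(dR_dz[j] * x[k] * W1[j][k] for k in range(n_in))


def saliency_out(j):
    # dR/dW2[i][j] = dR/dy_i * h_j = h_j, multiplied element-wise by W2[i][j].
    return sum(h[j] * W2[i][j] for i in range(n_out))


for j in range(n_hid):
    print(j, saliency_in(j), saliency_out(j))
```

For every hidden neuron the two printed sums agree up to floating-point error, including neurons whose ReLU is inactive, where both sums are zero.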
The neuron-wise conservation of synaptic saliency implies network conservation as well.
Theorem 2. Network-wise Conservation of Synaptic Saliency. The sum of the synaptic saliency across any set of parameters that exactly separates the input neurons $x$ from the output neurons $y$ of a feedforward neural network with homogeneous activation functions equals $\langle \frac{\partial \mathcal{R}}{\partial x}, x \rangle = \langle \frac{\partial \mathcal{R}}{\partial y}, y \rangle$. Here, "exactly" means that every element of the set is needed to separate the input neurons from the output neurons.
We prove this theorem in Appendix 10 by applying the neuron-wise conservation law recursively. Similar conservation properties have been noted in the neural network interpretability literature and have motivated the construction of interpretability methods such as Conductance and Layer-wise Relevance Propagation, which have recently been modified for network pruning [9, 40]. While the interpretability literature has focused on attribution to the input pixels and hidden neuron activations, we have formulated conservation laws that are more general and applicable to any parameter and neuron in a network. Remarkably, these conservation laws of synaptic saliency apply to modern neural network architectures and a wide variety of neural network layers (e.g. dense, convolutional, batchnorm, pooling, residual) as visually demonstrated in Fig. 3.
Conservation and single-shot pruning leads to layer-collapse. The conservation laws of synaptic saliency provide us with the theoretical tools to validate our earlier hypothesis of inverse proportionality between layer size and average layer score as a root cause for layer-collapse of gradient-based pruning methods. Consider the set of parameters in a layer of a simple, fully connected neural network. This set would exactly separate the input neurons from the output neurons. Thus, by the network-wise conservation of synaptic saliency (theorem 2), the total score for this set is constant for all layers, implying the average is inversely proportional to the layer size. We can empirically evaluate this relationship at scale for existing pruning methods by computing the total score for each layer of a model, as shown in Fig. 4. While this inverse relationship is exact for synaptic saliency, other closely related gradient-based scores, such as the scores used in SNIP and GraSP, also respect this relationship. This validates the empirical observation that for a given compression ratio, gradient-based pruning methods will disproportionately prune the largest layers. Thus, if the compression ratio is large enough and the pruning score is only evaluated once, then a gradient-based pruning method will completely prune the largest layer leading to layer-collapse.
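The step from a constant layer total to an average that is inversely proportional to layer size can be written out explicitly. In the worked example below, the symbols $C$ (the shared layer total) and $N_l$ (the number of parameters in layer $l$), as well as the two layer sizes, are introduced here purely for illustration.

```latex
% Layer-wise conservation: every layer carries the same total score C.
\sum_{i=1}^{N_l} \mathcal{S}\big(\theta^{[l]}_i\big) = C
\quad\Longrightarrow\quad
\bar{\mathcal{S}}^{[l]} = \frac{1}{N_l}\sum_{i=1}^{N_l} \mathcal{S}\big(\theta^{[l]}_i\big) = \frac{C}{N_l}.

% Illustrative sizes: a small layer with N_1 = 10^3 and a large layer with N_2 = 10^6.
\frac{\bar{\mathcal{S}}^{[1]}}{\bar{\mathcal{S}}^{[2]}} = \frac{C / 10^{3}}{C / 10^{6}} = 10^{3}.
```

With scores evaluated only once, a global threshold therefore reaches the typical parameter of the large layer long before the typical parameter of the small one.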
Having demonstrated and investigated the cause of layer-collapse in single-shot pruning methods at initialization, we now explore an iterative pruning method that appears to avoid the issue entirely. Iterative Magnitude Pruning (IMP) is a recently proposed pruning algorithm that has proven to be successful in finding extremely sparse trainable neural networks at initialization (winning lottery tickets) [10, 11, 12, 41, 42, 43, 44]. The algorithm follows three simple steps: first, train a network; second, prune the parameters with the smallest magnitude; third, reset the unpruned parameters to their initialization and repeat until the desired compression ratio is reached. While simple and powerful, IMP is impractical as it involves training the network several times, essentially defeating the purpose of constructing a sparse initialization. That being said, it does not suffer from the same catastrophic layer-collapse that other pruning at initialization methods are susceptible to. Thus, understanding better how IMP avoids layer-collapse might shed light on how to improve pruning at initialization.
As has been noted previously [10, 11], iteration is essential for stabilizing IMP. In fact, without sufficient pruning iterations, IMP will suffer from layer-collapse, evident in the sudden accuracy drops for the darker curves in Fig. 4(a). However, the number of pruning iterations alone cannot explain IMP’s success at avoiding layer-collapse. Notice that if IMP didn’t train the network during each prune cycle, then, no matter the number of pruning iterations, it would be equivalent to single-shot magnitude pruning. Thus, something very critical must happen to the magnitude of the parameters during training, that when coupled with sufficient pruning iterations allows IMP to avoid layer-collapse. We hypothesize that gradient descent training effectively encourages the scores to observe an approximate layer-wise conservation law, which when coupled with sufficient pruning iterations allows IMP to avoid layer-collapse.
Gradient descent encourages conservation. To better understand the dynamics of the IMP algorithm during training, we will consider a differentiable score $\mathcal{S}(\theta_i) = \frac{1}{2}\theta_i^2$ algorithmically equivalent to the magnitude score. Consider these scores throughout training with gradient descent on a loss function $\mathcal{L}$ using an infinitesimal step size (i.e. gradient flow). In this setting, the temporal derivative of the parameters is equivalent to $\frac{d\theta}{dt} = -\frac{\partial \mathcal{L}}{\partial \theta}$, and thus the temporal derivative of the score is

$$\frac{d}{dt}\frac{1}{2}\theta_i^2 = \frac{d\theta_i}{dt} \odot \theta_i = -\frac{\partial \mathcal{L}}{\partial \theta_i} \odot \theta_i.$$

Surprisingly, this is a form of synaptic saliency and thus the neuron-wise and layer-wise conservation laws from Sec. 4 apply. In particular, this implies that for any two layers $l$ and $k$ of a simple, fully connected network, $\frac{d}{dt}\|W^{[l]}\|_F^2 = \frac{d}{dt}\|W^{[k]}\|_F^2$. This invariance has been noticed before as a form of implicit regularization and used to explain the empirical phenomenon that trained multi-layer models can have similar layer-wise magnitudes. In the context of pruning, this phenomenon implies that gradient descent training, with a small enough learning rate, encourages the squared magnitude scores to converge to an approximate layer-wise conservation, as shown in Fig. 4(b).
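The invariance under gradient flow can be observed in the smallest possible setting. The sketch below assumes a toy two-layer linear model $f(x) = w_2 w_1 x$ with a single weight per layer, a squared-error loss, and plain gradient descent with a small step size as a stand-in for gradient flow; all numerical values are illustrative choices.

```python
# Toy two-layer linear "network" f(x) = w2 * w1 * x with one weight per layer.
w1, w2 = 1.5, 0.5
x, target = 1.0, 2.0
lr, steps = 1e-3, 2000  # small step size approximates gradient flow

start = (w1 ** 2, w2 ** 2)  # squared layer norms before training
for _ in range(steps):
    err = w2 * w1 * x - target            # dL/df for L = 0.5 * (f - target)^2
    g1, g2 = err * w2 * x, err * w1 * x   # dL/dw1 and dL/dw2
    w1, w2 = w1 - lr * g1, w2 - lr * g2
end = (w1 ** 2, w2 ** 2)

change_layer1 = end[0] - start[0]
change_layer2 = end[1] - start[1]
print(change_layer1, change_layer2)
```

Both squared norms change substantially during training, yet by nearly the same amount; the small residual difference comes from the finite step size.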
Conservation and iterative pruning avoids layer-collapse. As explained in section 4, conservation alone leads to layer-collapse by assigning parameters in the largest layers lower scores relative to parameters in smaller layers. However, if conservation is coupled with iterative pruning, then as the largest layer is pruned and becomes smaller, the remaining parameters of this layer are assigned higher relative scores in subsequent iterations. With sufficient iterations, conservation coupled with iteration leads to a self-balancing pruning strategy allowing IMP to avoid layer-collapse. This insight on the importance of conservation and iteration applies more broadly to other algorithms with exact or approximate conservation properties (e.g. Skeletonization, SNIP, and GraSP as demonstrated in Sec. 3). Indeed, very recent work empirically confirms that iteration improves the performance of SNIP.
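The self-balancing effect can be illustrated with an idealized toy model. The sketch assumes that every remaining parameter of a layer with $n$ live parameters receives the score $1/n$, so each layer total is exactly 1; this uniform score is a simplification for illustration, not one of the scores studied in the paper, and the function names are hypothetical.

```python
def single_shot(sizes, n_prune):
    """Score once, then globally remove the n_prune lowest-scoring parameters."""
    scored = sorted((1.0 / n, layer)
                    for layer, n in enumerate(sizes) for _ in range(n))
    counts = list(sizes)
    for _, layer in scored[:n_prune]:
        counts[layer] -= 1
    return counts


def iterative(sizes, n_prune):
    """Remove one parameter at a time, re-evaluating the scores after each removal."""
    counts = list(sizes)
    for _ in range(n_prune):
        live = [layer for layer, n in enumerate(counts) if n > 0]
        # The globally lowest score 1/n sits in the currently largest layer.
        layer = min(live, key=lambda l: 1.0 / counts[l])
        counts[layer] -= 1
    return counts


sizes = [2, 8, 4]              # parameters per layer (illustrative)
print(single_shot(sizes, 10))  # [2, 0, 2]: the largest layer has collapsed
print(iterative(sizes, 10))    # [1, 1, 2]: every layer survives
```

Both runs keep 4 of 14 parameters, but only the iterative variant redistributes score toward a layer as it shrinks.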
In the previous section we identified two key ingredients of IMP’s ability to avoid layer-collapse: (i) approximate layer-wise conservation of the pruning scores, and (ii) the iterative re-evaluation of these scores. While these properties allow the IMP algorithm to identify high performing and highly sparse, trainable neural networks, it requires an impractical amount of computation to obtain them. Thus, we aim to construct a more efficient pruning algorithm while still inheriting the key aspects of IMP’s success. So what are the essential ingredients for a pruning algorithm to avoid layer-collapse and provably attain Maximal Critical Compression? We prove the following theorem in Appendix 10.
Theorem 3. Iterative, positive, conservative scoring achieves Maximal Critical Compression. If a pruning algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the algorithm re-evaluates the scores every time a parameter is pruned, then the algorithm satisfies the Maximal Critical Compression axiom.
Theorem 3 directly motivates the design of our novel pruning algorithm, SynFlow, that provably reaches Maximal Critical Compression. First, the necessity for iterative score evaluation discourages algorithms that involve backpropagation on batches of data, and instead motivates the development of an efficient data-independent scoring procedure. Second, positivity and conservation motivate the construction of a loss function that yields positive synaptic saliency scores. We combine these insights to introduce a new loss function (where $\mathbb{1}$ is the all ones vector and $|\theta^{[l]}|$ is the element-wise absolute value of parameters in the $l^{\text{th}}$ layer),

$$\mathcal{R}_{SF} = \mathbb{1}^T \left( \prod_{l=1}^{L} |\theta^{[l]}| \right) \mathbb{1},$$

that yields the positive, synaptic saliency scores ($\frac{\partial \mathcal{R}_{SF}}{\partial \theta} \odot \theta$) we term Synaptic Flow. For a simple, fully connected network (i.e. $f(x) = W^{[N]} \cdots W^{[1]} x$), we can factor the Synaptic Flow score for a parameter $w^{[l]}_{ij}$ as

$$\mathcal{S}_{SF}(w^{[l]}_{ij}) = \left[ \mathbb{1}^T \prod_{k=l+1}^{N} |W^{[k]}| \right]_i |w^{[l]}_{ij}| \left[ \prod_{k=1}^{l-1} |W^{[k]}| \mathbb{1} \right]_j.$$

This perspective demonstrates that the Synaptic Flow score is a generalization of the magnitude score ($|w^{[l]}_{ij}|$), where the scores consider the product of synaptic strengths flowing through each parameter, taking the inter-layer interactions of parameters into account. We use the Synaptic Flow score in the Iterative Synaptic Flow Pruning (SynFlow) algorithm summarized in the pseudocode below.
Given a network $f(x; \theta_0)$ and specified compression ratio $\rho$, the SynFlow algorithm requires only one additional hyperparameter, the number of pruning iterations $n$. We demonstrate in Appendix 11 that an exponential pruning schedule ($\rho^{-k/n}$) with $n = 100$ pruning iterations essentially prevents layer-collapse whenever avoidable (Fig. 1), while remaining computationally feasible, even for large networks.
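The procedure described above can be sketched in plain Python for a small, bias-free, fully connected network. This is a minimal illustration under stated assumptions, not the authors' implementation (which is in the repository named in this paper): scores are computed with the factorized Synaptic Flow formula instead of automatic differentiation, iteration $k$ keeps the top $N\rho^{-k/n}$ scores, and the helper names, layer shapes, and the rule that zero-score weights are never kept are choices made here.

```python
import random


def matvec(M, v):
    """Matrix-vector product M @ v for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vecmat(v, M):
    """Vector-matrix product v @ M for nested lists."""
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]


def synflow_scores(weights, masks):
    """Synaptic Flow score of every weight; weights[l] has shape (out_l, in_l)."""
    # Element-wise |W| with pruned entries zeroed out.
    A = [[[abs(w) * m for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]
         for W, M in zip(weights, masks)]
    # fwd[l][j]: total flow from the inputs into input unit j of layer l.
    fwd = [[1.0] * len(A[0][0])]
    for a in A[:-1]:
        fwd.append(matvec(a, fwd[-1]))
    # bwd[l][i]: total flow from output unit i of layer l to the outputs.
    bwd = [[1.0] * len(A[-1])]
    for a in reversed(A[1:]):
        bwd.insert(0, vecmat(bwd[0], a))
    return [[[bwd[l][i] * A[l][i][j] * fwd[l][j] for j in range(len(A[l][i]))]
             for i in range(len(A[l]))] for l in range(len(A))]


def synflow_prune(weights, rho, n_iters=100):
    """Iteratively re-score and globally mask, following rho^(-k/n)."""
    masks = [[[1.0] * len(row) for row in W] for W in weights]
    total = sum(len(row) for W in weights for row in W)
    for k in range(1, n_iters + 1):
        scores = synflow_scores(weights, masks)
        keep = max(1, round(total * rho ** (-k / n_iters)))
        ranked = sorted(((s, l, i, j)
                         for l, S in enumerate(scores)
                         for i, row in enumerate(S)
                         for j, s in enumerate(row)), reverse=True)
        for rank, (s, l, i, j) in enumerate(ranked):
            # Keep the top-scoring weights; zero-score weights carry no flow.
            masks[l][i][j] = 1.0 if (rank < keep and s > 0.0) else 0.0
    return masks


rng = random.Random(0)
shapes = [(8, 4), (8, 8), (2, 8)]  # (out, in) per layer, illustrative sizes
weights = [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
           for n_out, n_in in shapes]
masks = synflow_prune(weights, rho=10.0)
kept_per_layer = [int(sum(sum(row) for row in M)) for M in masks]
print(kept_per_layer)
```

No input data appears anywhere in the sketch: the scores depend only on the absolute values of the weights and the current mask.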
We empirically benchmark the performance of our algorithm, SynFlow (red), against the baselines random pruning and magnitude pruning, as well as the state-of-the-art algorithms SNIP and GraSP. In Fig. 6, we test the five algorithms on 12 distinct combinations of modern architectures (VGG-11, VGG-16, ResNet-18, WideResNet-18) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) over an exponential sweep of compression ratios. See Appendix 12 for more details and hyperparameters of the experiments. Consistently, SynFlow outperforms the other algorithms in the high compression regime and demonstrates significantly more stability, as indicated by its tight intervals. Furthermore, SynFlow is the only algorithm that reliably performs better than the random pruning baseline: SNIP and GraSP perform significantly worse than random pruning with ResNet-18 and WideResNet-18 trained on Tiny ImageNet. SynFlow is also quite competitive in the low compression regime. Although magnitude pruning can partially outperform SynFlow in this regime with models trained on Tiny ImageNet, it suffers from catastrophic layer-collapse as indicated by the sharp drops in accuracy.
In this paper, we developed a unifying theoretical framework that explains why existing single-shot pruning algorithms at initialization suffer from layer-collapse. We applied our framework to elucidate how iterative magnitude pruning overcomes layer-collapse to identify winning lottery tickets at initialization. Building on the theory, we designed a new data-agnostic pruning algorithm, SynFlow, that provably avoids layer-collapse and reaches Maximal Critical Compression. Finally, we empirically confirmed that our SynFlow algorithm consistently performs better than existing algorithms across 12 distinct combinations of models and datasets, despite the fact that our algorithm is data-agnostic and requires no pre-training. Promising future directions for this work are to (i) explore a larger space of potential pruning algorithms that satisfy Maximal Critical Compression, (ii) harness SynFlow as an efficient way to compute appropriate per-layer compression ratios to combine with existing scoring metrics, and (iii) incorporate pruning as a part of neural network initialization schemes. Overall, our data-agnostic pruning algorithm challenges the existing paradigm that data must be used to quantify which synapses of a neural network are important.
We thank Jonathan M. Bloom, Weihua Hu, Javier Sagastuy-Brena, Chengxu Zhuang, and members of the Stanford Neuroscience and Artificial Intelligence Laboratory for helpful discussions. We thank the Stanford Data Science Scholars program (DK), the Burroughs Wellcome, Simons and James S. McDonnell foundations, and an NSF career award (SG) for support.
Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 254–263. PMLR, 2018.
From deep learning to mechanistic understanding in neuroscience: the structure of retinal prediction. In Advances in Neural Information Processing Systems, pages 8535–8545, 2019.
Proceedings of the European Conference on Computer Vision (ECCV), pages 20–35, 2018.
Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014. doi: http://dx.doi.org/10.5244/C.28.88.
Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7), 2015.
We provide a proof for Theorem 2 which we rewrite below.
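Theorem 2. Network-wise Conservation of Synaptic Saliency. The sum of the synaptic saliency across any set of parameters that exactly separates the input neurons $x$ from the output neurons $y$ of a feedforward neural network with homogeneous activation functions equals $\langle \frac{\partial \mathcal{R}}{\partial x}, x \rangle = \langle \frac{\partial \mathcal{R}}{\partial y}, y \rangle$.

Proof. We begin by defining the set of neurons ($V$) and the set of prunable parameters ($E$) for a neural network.

Consider a subset of the neurons $S \subset V$, such that all output neurons $y_c \in S$ and all input neurons $x_i \in V \setminus S$. Consider the set of parameters cut by this partition

$$C(S) = \{ \theta_{uv} \in E : u \in S,\ v \in V \setminus S \}.$$

By theorem 1, we know that the sum of the synaptic saliency over $C(S)$ is equal to the sum of the synaptic saliency over the set of parameters adjacent to $C(S)$ and between neurons in $S$, $\{ \theta_{tu} \in E : t \in S,\ u \in \partial S \}$. Continuing this argument, we eventually get that this sum must be equal to the sum of the synaptic saliency over the set of parameters incident to the output neurons $y$, which is

$$\sum_{c,d} \frac{\partial \mathcal{R}}{\partial \theta_{cd}} \theta_{cd} = \sum_c \frac{\partial \mathcal{R}}{\partial y_c} \sum_d \theta_{cd} \phi(z_d) = \sum_c \frac{\partial \mathcal{R}}{\partial y_c} y_c = \left\langle \frac{\partial \mathcal{R}}{\partial y}, y \right\rangle.$$

We can repeat this argument iterating through the set $V \setminus S$ till we reach the input neurons $x$ to show that this sum is also equal to $\langle \frac{\partial \mathcal{R}}{\partial x}, x \rangle$. ∎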
We provide a proof for Theorem 3 which we rewrite below.
Theorem 3. Iterative, positive, conservative scoring achieves Maximal Critical Compression. If a pruning algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the algorithm re-evaluates the scores every time a parameter is pruned, then the algorithm satisfies the Maximal Critical Compression axiom.
We prove this theorem by contradiction. Assume that a pruning algorithm with global-masking and iterative, positive, conservative scoring does not satisfy the Maximal Critical Compression axiom. This implies that at some iteration, the algorithm will prune the last parameter in a layer (layer $k$), despite there existing more than one parameter ($N_l > 1$) in another layer (layer $l$). Because the algorithm uses global-masking, the score for the last parameter in layer $k$, $\mathcal{S}(\theta^k)$, is less than or equal to the score of each parameter, $\mathcal{S}(\theta^l_i)$, in layer $l$:

$$\mathcal{S}(\theta^k) \le \mathcal{S}(\theta^l_i) \quad \forall\, i \in \{1, \dots, N_l\}.$$

Because the scores respect a layer-wise conservation, $\mathcal{S}(\theta^k) = \sum_{i=1}^{N_l} \mathcal{S}(\theta^l_i)$. This implies, by the positivity of the scores and because $N_l > 1$, that for all $j$,

$$\mathcal{S}(\theta^k) = \sum_{i=1}^{N_l} \mathcal{S}(\theta^l_i) > \mathcal{S}(\theta^l_j).$$

This is a contradiction to the previous inequality. ∎
Theorem 3 required that an algorithm re-evaluates the scores every time a parameter is pruned. However, theorem 2 provides a theoretical insight to drastically reduce the number of iterations needed to practically attain Maximal Critical Compression. We now introduce a modification to theorem 3 that motivates practical hyperparameter choices used in the SynFlow algorithm.
Theorem 4. Achieving Maximal Critical Compression practically. If a pruning algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the prune size, the total score for the parameters pruned at any iteration, is strictly less than the cut size, the total score for an entire layer, whenever possible, then the algorithm satisfies the Maximal Critical Compression axiom.

We prove this theorem by contradiction. Assume there is an iterative pruning algorithm that uses positive, layer-wise conserved scores and maintains that the prune size at any iteration is less than the cut size whenever possible, but doesn’t satisfy the Maximal Critical Compression axiom. At some iteration the algorithm will prune a set of parameters containing a subset separating the input neurons from the output neurons, despite there existing a set of the same cardinality that does not lead to layer-collapse. By theorem 2, the total score for the separating subset is $\langle \frac{\partial \mathcal{R}}{\partial y}, y \rangle$, which implies, by the positivity of the scores, that the total prune size is at least $\langle \frac{\partial \mathcal{R}}{\partial y}, y \rangle$. This contradicts the assumption that the algorithm maintains that the prune size at any iteration is always strictly less than the cut size whenever possible. ∎
Motivated by Theorem 4, we can now choose a practical, yet effective, number of pruning iterations ($n$) and schedule for the compression ratios ($\rho_k$) applied at each iteration ($k$) for the SynFlow algorithm. Two natural candidates for a compression schedule would be either linear ($\rho_k = \frac{k}{n}\rho$) or exponential ($\rho_k = \rho^{k/n}$). Empirically we find that the SynFlow algorithm with 100 pruning iterations and an exponential compression schedule satisfies the conditions of theorem 4 over a reasonable range of compression ratios, as shown in Fig. 6(b). This is not true if we use a linear schedule for the compression ratios, as shown in Fig. 6(a). Interestingly, Iterative Magnitude Pruning also uses an exponential compression schedule, but no thorough explanation for this hyperparameter choice has been provided.
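The difference between the two schedules shows up in how many parameters a single iteration removes. The sketch below assumes the readings $\rho_k = \frac{k}{n}\rho$ (clamped at 1 so that a ratio below 1 is never requested) and $\rho_k = \rho^{k/n}$, and uses the fraction of the original parameters removed per step as a rough count-based proxy for the prune size of Theorem 4, which is defined in terms of scores. The function names and the values of `rho` and `n` are illustrative.

```python
def linear_schedule(rho, n):
    """rho_k = (k / n) * rho, clamped so the ratio never drops below 1."""
    return [max(1.0, rho * k / n) for k in range(1, n + 1)]


def exponential_schedule(rho, n):
    """rho_k = rho ** (k / n)."""
    return [rho ** (k / n) for k in range(1, n + 1)]


def fraction_pruned_per_step(schedule):
    """Fraction of the ORIGINAL parameters removed at each iteration."""
    fractions, prev = [], 1.0
    for rho_k in schedule:
        fractions.append(1.0 / prev - 1.0 / rho_k)
        prev = rho_k
    return fractions


rho, n = 1000.0, 100
worst_linear = max(fraction_pruned_per_step(linear_schedule(rho, n)))
worst_exponential = max(fraction_pruned_per_step(exponential_schedule(rho, n)))
print(worst_linear, worst_exponential)
```

Under these assumptions the linear schedule removes about 90 percent of all parameters in its very first step, while the exponential schedule never removes more than roughly 7 percent of the original parameters in one step.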
Potential numerical instability. The SynFlow algorithm involves computing the SynFlow objective, $\mathcal{R}_{SF} = \mathbb{1}^T \left( \prod_{l=1}^{L} |\theta^{[l]}| \right) \mathbb{1}$, which contains a product of matrices whose singular values may vanish or explode exponentially with depth. This may lead to potential numerical instability for very deep networks, although we did not observe this for the models presented in this paper. One way to address this potential challenge would be to appropriately scale network parameters at each layer to maintain stability. Because the SynFlow algorithm is scale invariant at each layer $\theta^{[l]}$, this modification will not affect the performance of the algorithm.
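The scale invariance invoked here follows in two lines from the form of the objective. In the sketch below, the scaling factor $\alpha > 0$ and the primed symbols are notation introduced for illustration; the argument assumes the bias-free product form of $\mathcal{R}_{SF}$.

```latex
% Rescale one layer: \theta'^{[l]} = \alpha\,\theta^{[l]} with \alpha > 0, all other layers unchanged.
% R_SF is linear in each |\theta^{[l]}| separately, so the objective scales by \alpha:
\mathcal{R}'_{SF} = \mathbb{1}^T \Big( \cdots\, \big|\alpha\,\theta^{[l]}\big| \,\cdots \Big) \mathbb{1} = \alpha\, \mathcal{R}_{SF}.

% Scores in the rescaled layer l:
\frac{\partial \mathcal{R}'_{SF}}{\partial \theta'^{[l]}} \odot \theta'^{[l]}
  = \frac{1}{\alpha} \frac{\partial (\alpha \mathcal{R}_{SF})}{\partial \theta^{[l]}} \odot \alpha\,\theta^{[l]}
  = \alpha \left( \frac{\partial \mathcal{R}_{SF}}{\partial \theta^{[l]}} \odot \theta^{[l]} \right).

% Scores in any other layer m \neq l:
\frac{\partial \mathcal{R}'_{SF}}{\partial \theta^{[m]}} \odot \theta^{[m]}
  = \alpha \left( \frac{\partial \mathcal{R}_{SF}}{\partial \theta^{[m]}} \odot \theta^{[m]} \right).
```

Every score is multiplied by the same positive factor $\alpha$, so the global ranking, and hence the mask chosen at each iteration, is unchanged.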
An open source version of our code and the data used to generate all the figures in this paper are available at github.com/ganguli-lab/Synaptic-Flow.
All pruning algorithms we considered in our experiments use the following two steps: (i) scoring parameters, and (ii) masking parameters globally across the network with the lowest scores. Here we describe details of how we computed scores used in each of the pruning algorithms.
Random: We sampled independently from a standard Gaussian.
Magnitude: We computed the absolute value of the parameters.
SNIP: We computed the score $\left|\frac{\partial \mathcal{L}}{\partial \theta} \odot \theta\right|$ using a random subset of the training dataset with a size ten times the number of classes, namely 100 for CIFAR-10, 1000 for CIFAR-100, 2000 for Tiny ImageNet, and 10000 for ImageNet. The score was computed on a batch of size 256 for CIFAR-10/100, 64 for Tiny ImageNet, and 16 for ImageNet, then summed across batches to obtain the score used for pruning.
GraSP: We computed the score $-\left(H \frac{\partial \mathcal{L}}{\partial \theta}\right) \odot \theta$ using a random subset of the training dataset with a size ten times the number of classes, namely 100 for CIFAR-10, 1000 for CIFAR-100, 2000 for Tiny ImageNet, and 10000 for ImageNet. The score was computed on a batch of size 256 for CIFAR-10/100, 64 for Tiny ImageNet, and 16 for ImageNet, then summed across batches to obtain the score used for pruning.
We adapted standard implementations of VGG-11 and VGG-16 from OpenLTH, and ResNet-18 and WideResNet-18 from PyTorch models. We considered all weights from convolutional and linear layers of these models as prunable parameters, but did not prune biases nor the parameters involved in batchnorm layers. For convolutional and linear layers, the weights were initialized with a Kaiming normal strategy and biases to be zero.
Here we provide hyperparameters that we used to train the models presented in Fig. 1 and Fig. 6. These hyperparameters were chosen for the performance of the original model and were not optimized for the performance of the pruned networks.
| | CIFAR-10/100 | Tiny ImageNet | CIFAR-10/100 | Tiny ImageNet | CIFAR-10/100 | Tiny ImageNet | CIFAR-10/100 | Tiny ImageNet |
| Learning Rate Drops | 60, 120 | 30, 60, 80 | 60, 120 | 30, 60, 80 | 60, 120 | 30, 60, 80 | 60, 120 | 30, 60, 80 |