1 Introduction
State-of-the-art deep learning typically operates in the overparametrized regime. However, a large body of literature has shown that a large number of carefully chosen parameters can be removed (i.e., pruned) while maintaining the network's predictive performance
(lecun; molchanov; evci2020rigging; su2020sanitychecking; lee2018snip; Wang2020Picking). It was first believed that sparse networks obtained by pruning pretrained networks cannot be retrained from scratch. However, (frankle2018the) presented the Lottery Ticket Hypothesis (LTH): randomly initialized deep neural networks contain sparse subnetworks (winning tickets) that – when trained in isolation – achieve test performance comparable to the fully trained dense model. This hypothesis suggests that we could prune a large number of a network's weights at initialization (i.e., before training) and still obtain the full performance after training. That being said, the procedure proposed in (frankle2018the) involves training the dense model to convergence multiple times, which is computationally very expensive. SNIP (lee2018snip) and GraSP (Wang2020Picking) were subsequently proposed with the goal of pruning a randomly initialized model at initialization using a sensitivity criterion for each weight.
Due to its decreased performance on large network/dataset combinations, the LTH was later revised for very deep networks. The authors note that to train the sparse model successfully, its weights must be initialized to the dense model's weights from a point early in training rather than from initialization (frankle2020linear). This suggests that the best-performing subnetworks can be found early in training (instead of before it). Finding the earliest point at which we can prune without losing performance is challenging, and the authors present Linear Mode Connectivity (LMC), a computationally very expensive approach involving training multiple copies of the network. Hence, to achieve a good tradeoff between the performance and the cost of finding these subnetworks, pruning should follow the same spirit as before-training methods but be applied early in training instead.
Besides the question of when to prune, an orthogonal dimension is structured vs. unstructured pruning. Unstructured pruning removes individual weights (i.e., sets weight-matrix elements to zero), while structured pruning removes entire neurons (i.e., rows/columns of weight matrices or convolutional filters). Thus, structured pruning can reduce the training/inference time, memory footprint, and carbon emissions of the model, whereas unstructured pruning has no significant impact on these metrics. On the other hand, structured pruning is generally much more challenging, and most previous works (including the LTH) perform unstructured pruning.
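The difference between the two settings can be made concrete with a minimal numpy sketch (illustrative only; the shapes, sparsity levels, and variable names are our own, not from the paper): unstructured pruning masks individual entries and leaves the matrix size unchanged, while structured pruning physically removes rows, shrinking every subsequent computation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # weight matrix of a toy linear layer

# Unstructured: zero out the 50% smallest-magnitude individual weights.
# The matrix keeps its (8, 16) shape, so compute and memory are unchanged.
threshold = np.median(np.abs(W))
mask = np.abs(W) >= threshold
W_unstructured = W * mask

# Structured: remove entire rows (output neurons) with the smallest L2 norm.
# The matrix physically shrinks, so every later matmul gets cheaper.
row_norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(row_norms)[4:])  # keep the 4 strongest of 8 neurons
W_structured = W[keep]

print(W_unstructured.shape)  # (8, 16): same size, just sparse
print(W_structured.shape)    # (4, 16): genuinely smaller
```

This is why only the structured variant translates directly into savings in time, memory, and emissions on standard hardware.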
Pruning hinders the learning process unless the learning dynamics of the network are maintained. The learning dynamics of a feedforward neural network can be described through the Neural Tangent Kernel (NTK), which remains approximately constant after some epochs of training (goldblum2020truth). If we preserve the NTK while pruning, we expect the training process to be unaffected. We develop a novel and principled pruning method that preserves the Gradient Flow (GF). By leveraging the close relation of the NTK to the GF, we show that we can prune a network while keeping the effect on the NTK minimal. Furthermore, we use the connection between GF and NTK to track when the learning dynamics become stable enough to perform early pruning.

We present Early Compression via Gradient Flow Preservation (EarlyCroP), a method for pruning a network early in training. EarlyCroP requires training the model only once, yet maintains the dense network's performance at high levels of sparsity. Thus, in the unstructured setting, EarlyCroP is about 5 times less expensive than the LTH. In addition, our method can be applied before or early in training, and extends to structured pruning. Performing structured pruning before training provides a better accuracy/efficiency tradeoff than most previous structured baselines, and enables us to train sparse networks whose dense versions would not fit into GPU memory. Furthermore, EarlyCroP reduces carbon emissions by up to 70% without sacrificing dense-model performance, and can thus help mitigate the environmental impact of deep learning while reducing training and inference costs.
Contributions. We approach neural network pruning with the explicit goal of unlocking real-world, practical improvements. Our key contributions are:


Why to prune? We transfer a GF-based pruning criterion to the structured setting, which allows faster forward and backward passes with lower GPU memory and computational cost, while surpassing baselines in test accuracy;

How to prune? We leverage a connection between the NTK and GF by using a pruning criterion that aims to minimally affect the GF, and therefore the NTK and the learning dynamics;

When to prune? We further utilize the connection between GF and NTK to detect the smooth transition to the lazy kernel regime, the phase during which we can prune the network with little effect on the training dynamics. This brings the cost savings of structured pruning to the training phase as well. We also show that our method can be applied before training, reducing costs even further at only a small drop in accuracy.
These contributions unlock substantial real-world benefits for practitioners and researchers: we can train large sparse models on commodity GPUs whose dense counterparts would be too large to fit. We evaluate our approach extensively over a diverse set of model architectures, datasets, and tasks.
2 Related work
Pruning Criterion. In order to prune network weights, they need to be ranked according to an importance score. This concept is not new; in fact, it was introduced in ‘Optimal Brain Damage’ (lecun) and ‘Optimal Brain Surgeon’ (hassibi). Yet, it only regained traction when (han) demonstrated successful deep compression by pruning based on weight magnitude. Most pruning research since then has followed this approach (NEURIPS2019_1113d7a7; evci2020rigging; mostafa2019parameter; bellec2018deep; dettmers2020sparse; mocanu2018scalable; You2020Drawing; chen2020lottery). However, the biggest drawback of using weight magnitudes is that the network needs to be trained first to achieve good accuracy. Therefore, more recent works have focused on scoring weights without the need for training, using first-order (lee2018snip; tanaka2020pruning) and second-order (Wang2020Picking; lubana2021gradientflowframework) information. Note that the pruning process can be applied in one shot or iteratively (jorge2021progressive; verdenius2020pruning).
Pruning Time. Up until the introduction of the LTH (frankle2018the), the consensus in the literature was that pruned models cannot be trained from scratch. Therefore, all sparse networks were extracted either from pretrained networks (han; lecun; hassibi; wangchaoqi; li2016pruning) or throughout training (srinivas2016generalized; louizos2018learning; evci2020rigging; mostafa2019parameter; bellec2018deep; dettmers2020sparse; mocanu2018scalable). However, the LTH showed that there exist sparse subnetworks within the original randomly initialized dense model that can achieve performance comparable to the dense model. That being said, the LTH's pruning algorithm, Iterative Magnitude Pruning (IMP), requires multiple iterations of a train-prune cycle. Nevertheless, the LTH's findings motivated works that strove to extract these sparse networks directly from the randomly initialized dense network (su2020sanitychecking; lee2018snip; Wang2020Picking; dejorge2020progressive; frankle2020pruning; verdenius2020pruning).
The first method to prune before training was SNIP (lee2018snip), which aims to preserve the weights that have the highest effect on the loss. A subsequent work, GraSP (Wang2020Picking), uses the Hessian-gradient product in its score and prunes weights with the goal of increasing the Gradient Flow (GF). Finally, lubana2021gradientflowframework show that GraSP can lead to an increasing loss and instead propose to prune the weights that least affect the GF.
The performance of the LTH degrades with bigger networks and datasets (frankle2020stabilizing). Subsequently, the LTH was updated to state that the best-performing sparse models do not necessarily exist at initialization but rather appear early in training.
To the best of our knowledge, the only work that explores the extraction of sparse models early in training is Early Bird Tickets (You2020Drawing). They perform structured pruning early in training, when the Hamming distance between pruning masks at subsequent epochs becomes smaller than some threshold. However, they do not offer any theoretical justification for pruning early in training, and they only show results for a maximum pruning ratio of 70%, suggesting that the Hamming distance is not an effective criterion for achieving high sparsities.
The Early Phase of DNN Training. Another line of work aims to analyze the early phase of neural network training. gurari2019gradient study the Hessian eigenspectrum and observe that during training, a few large eigenvalues emerge along which gradient descent happens, whereas the rest get close to zero. However, these observations depend on the architecture.
achille2019critical found that the network goes through critical training periods during which perturbing the data can cause irreversible damage to the network's final performance, after which the network becomes robust to these perturbations. However, the critical periods occur very late in the training process. Finally, frankle2020linear propose Linear Mode Connectivity (LMC) as a method for detecting when networks become stable to SGD noise. However, LMC is extremely expensive, requiring two copies of the network to be trained to completion at every epoch.

Structured Pruning. Pruning methods are divided into two categories: (1) unstructured methods, which generate a binary mask that is applied before every forward pass (frankle2018the; lee2018snip; tanaka2020pruning; Wang2020Picking), and (2) structured methods, which remove entire neurons or convolutional filters (ding; li2016pruning; louizos2018learning; verdenius2020pruning; you2019gate). Unstructured pruning is the more common variant for its simplicity and ease of implementation. However, since the network retains its dense size, unstructured pruning does not provide improvements in GPU RAM, time, or carbon emissions. While such improvements can be obtained for unstructured pruning by using operations on sparse compressed matrices, these require significant changes to the network when dealing with advanced layers. Conversely, structured pruning reduces the size of weight matrices, thereby requiring less space, time, and energy during training and inference. We highlight: (1) SNAP (verdenius2020pruning), which adapts the SNIP (lee2018snip) score to the structured setting to prune before training; (2) Gate Decorators (you2019gate), which builds on (liu2017learning) by adding a sensitivity-based criterion and pruning the network iteratively during training; and (3) EfficientConvNets (li2016pruning), which prunes a pretrained network by scoring filters by their norm.
Other recent works include (8816678; He_2019_CVPR; 10.1145/3295500.3356156), but we omit them since Gate Decorators outperforms them.
3 Background
Neural Tangent Kernel (NTK). The NTK is defined as $\Theta_t(x, x') = \nabla_\theta f_t(x)^\top \nabla_\theta f_t(x')$ (jacot2018), where $\nabla_\theta f_t(x)$ denotes the gradient of the model prediction w.r.t. the model parameters $\theta_t$ at time $t$. The NTK is known to accurately describe the dynamics of the network's prediction during training, under the assumption that the following Taylor expansion holds:

$f_{t+1}(x) \approx f_t(x) + \nabla_\theta f_t(x)^\top (\theta_{t+1} - \theta_t)$   (1)
Under the NTK assumption, a neural network reduces to a linear model with the Neural Tangent Kernel (NTK). The NTK assumption is particularly accurate for wide neural networks. In practice, this assumption holds (i.e., the NTK remains approximately constant) after the model's training dynamics have transitioned from the rich active regime to the lazy kernel regime (see Section 4.3).
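The linearization in (1) is exact for any model that is linear in its parameters, which makes the NTK picture easy to check numerically. The following numpy sketch (illustrative only; the feature matrix, step size, and variable names are our own) verifies that for $f(x) = \theta^\top \phi(x)$ the NTK is constant and the Taylor expansion of the predictions holds exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.standard_normal((5, 3))   # fixed features phi(x_i) for 5 inputs
theta = rng.standard_normal(3)
y = rng.standard_normal(5)

f = phi @ theta                     # predictions f_t(x_i)
grad_f = phi                        # d f(x_i) / d theta = phi(x_i)
ntk = grad_f @ grad_f.T             # NTK Gram matrix Theta_t(x_i, x_j)

# One gradient-descent step on the squared loss L = 0.5 ||f - y||^2.
eta = 0.1
grad_loss = grad_f.T @ (f - y)      # dL / d theta
theta_new = theta - eta * grad_loss

# Eq. (1): f_{t+1}(x) ≈ f_t(x) + grad_f(x)^T (theta_{t+1} - theta_t)
f_new = phi @ theta_new
f_lin = f + grad_f @ (theta_new - theta)
print(np.allclose(f_new, f_lin))    # True: the linearization is exact here
```

For a deep network, the same check only holds approximately, and the quality of the approximation is precisely what distinguishes the rich active regime from the lazy kernel regime.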
Gradient Flow (GF). We define the GF as $GF(t) = \nabla_\theta \mathcal{L}(\theta_t)^\top \nabla_\theta \mathcal{L}(\theta_t) = \|\nabla_\theta \mathcal{L}(\theta_t)\|_2^2$ (lubana2021gradientflowframework), where $\nabla_\theta \mathcal{L}(\theta_t)$ denotes the gradient of the model loss $\mathcal{L}$ w.r.t. the model parameters $\theta_t$. The GF is known to accurately describe the dynamics of the network's gradient norm during training, under the assumption that the following Taylor expansion holds:

$\nabla_\theta \mathcal{L}(\theta_{t+1}) \approx \nabla_\theta \mathcal{L}(\theta_t) + H_t (\theta_{t+1} - \theta_t)$   (2)

$\|\nabla_\theta \mathcal{L}(\theta_{t+1})\|_2^2 \approx \|\nabla_\theta \mathcal{L}(\theta_t)\|_2^2 + 2 (\theta_{t+1} - \theta_t)^\top H_t \nabla_\theta \mathcal{L}(\theta_t)$   (3)
where $H_t$ denotes the model's Hessian at time $t$. In order to prune the weights which least affect the GF, lubana2021gradientflowframework propose to use the following weight importance score:

$S(\theta_i) = \left| \theta_i \, [H_t \nabla_\theta \mathcal{L}(\theta_t)]_i \right|$   (4)

and remove the desired fraction of parameters with the lowest scores. Preserving the GF stands in stark contrast to the importance score of GraSP (Wang2020Picking), $-\theta_i \, [H_t \nabla_\theta \mathcal{L}(\theta_t)]_i$, which maximizes the GF. Note that while the importance score (4) was initially used before training, we propose to use it to prune during training, either in one shot or iteratively.
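The score in (4) can be sketched on a toy problem where gradient and Hessian are available in closed form. The quadratic loss, dimensions, and pruning ratio below are our own illustrative choices; in a real network the Hessian-gradient product would be computed with a Hessian-vector product rather than an explicit Hessian:

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 theta^T A theta - b^T theta,
# with gradient g = A theta - b and Hessian H = A in closed form.
rng = np.random.default_rng(2)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive-definite Hessian
b = rng.standard_normal(n)
theta = rng.standard_normal(n)

g = A @ theta - b                  # gradient of the loss
Hg = A @ g                         # Hessian-gradient product
score = np.abs(theta * Hg)         # Eq. (4), elementwise

# Remove the 30% of weights with the lowest scores (set them to zero).
k = int(0.3 * n)
prune_idx = np.argsort(score)[:k]
theta_pruned = theta.copy()
theta_pruned[prune_idx] = 0.0

# Gradient flow GF = ||g||^2 before vs. after pruning the low-score weights.
gf_before = g @ g
g_pruned = A @ theta_pruned - b
gf_after = g_pruned @ g_pruned
```

Pruning by the lowest $|\theta_i [H g]_i|$ is intended to perturb the gradient-norm dynamics, and hence the GF, as little as possible.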
4 Method
The core motivation of our work is to improve the applicability of sparse neural networks w.r.t. concrete real-world metrics such as carbon emissions, price, time, or memory at both training and inference time. To this end, our method first transfers the pruning criterion (4) to structured pruning, thus allowing faster forward and backward passes (see Sec. 4.1). Second, we derive a relation between the NTK and the GF suggesting that preserving the GF also preserves the NTK (see Sec. 4.2). Hence, the pruning criterion (4) is a suitable weight importance score for pruning a neural network once the NTK assumption holds, i.e., in the lazy kernel regime. Third, our method detects when we enter the lazy kernel regime to prune early in training without impacting the training dynamics (see Sec. 4.3), thereby extending the cost savings of our (structured) sparse neural networks to the training phase while achieving a high test accuracy.
4.1 Why to prune?
The main use case of unstructured pruning is to highlight the overparametrized nature of neural networks. In particular, while dense-like sparsity for deep learning (zhou_learning_2021) is a promising research direction, it suffers from multiple downsides: (a) models are typically only sparsified in the forward pass, so dense-like sparsity has limited potential for speeding up training; (b) not all deep learning frameworks (e.g., PyTorch) support it; and (c) only the newest GPUs (starting with Nvidia Ampere, 2020) support dense-like sparsity.

To truly benefit from pruning, we need to prune full structures (neurons and channels) instead. This reduction in dimensions/channels directly translates into lower computational cost on existing GPUs without further implementation effort or any specialized tensor operations, leading to a sparse model that provides improvements in time, memory, and carbon emissions. Combined with the fact that we can apply our pruning method before and early in training, we can drastically reduce model costs not only after training but during training as well (see Sec. 4.3).

In order to achieve structured pruning, we need to score entire nodes instead of individual weights, i.e., generate a score for a node's activation function $a$. However, since $a$ is simply a function and not a learnable parameter, we cannot use pruning score (4) directly. Instead, similarly to verdenius2020pruning, we define auxiliary gates $c = \mathbf{1}$ over each node's input, which act as learnable parameters whose gradient information represents the activation's information. We can formally define this for a linear layer with weight $W$ and bias $b$, and an input $x$, in the following way:

$z = a\big(c \odot (Wx + b)\big), \quad c = \mathbf{1}$   (5)

$S(c_j) = \left| c_j \, [H_c \nabla_c \mathcal{L}]_j \right|$   (6)

After scoring the nodes using the auxiliary gates, the pruning process follows the original one, removing the desired fraction of nodes with the lowest scores.
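The gate mechanism can be sketched in a few lines of numpy. For brevity, this illustrative example scores gates with the first-order sensitivity $|c_j \, \partial\mathcal{L}/\partial c_j|$ (as in SNAP) instead of the full second-order score in (6), which follows the same pattern with a Hessian-gradient product; the layer sizes and loss are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((6, 4))
bias = rng.standard_normal(6)
x = rng.standard_normal(4)
y = rng.standard_normal(6)

c = np.ones(6)               # one multiplicative gate per output node, Eq. (5)
z = W @ x + bias             # pre-activations
h = c * z                    # gated layer output (identity activation here)
dL_dc = z * (h - y)          # dL/dc for the squared loss L = 0.5 ||h - y||^2
score = np.abs(c * dL_dc)    # per-node importance at c = 1

# Structured pruning: drop the 2 lowest-scoring nodes, i.e. whole rows of W.
keep = np.sort(np.argsort(score)[2:])
W_pruned, bias_pruned = W[keep], bias[keep]
print(W_pruned.shape)        # (4, 4)
```

Because each gate multiplies a whole node, one scalar gradient summarizes the importance of an entire row of the weight matrix, which is what permits removing the row outright.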
4.2 How to prune?
In this section, we draw an important connection between GF and NTK, showing that pruning the weights with the lowest importance score (4) preserves the training dynamics of both the network's gradient norm and the network's predictions. First, we observe that GF and NTK are connected by the following relation:
$GF(t) = \nabla_\theta \mathcal{L}(\theta_t)^\top \nabla_\theta \mathcal{L}(\theta_t)$   (7)

$\phantom{GF(t)} = \big( \nabla_\theta f_t^\top \nabla_f \mathcal{L} \big)^\top \big( \nabla_\theta f_t^\top \nabla_f \mathcal{L} \big)$   (8)

$\phantom{GF(t)} = \nabla_f \mathcal{L}^\top \, \Theta_t \, \nabla_f \mathcal{L}$   (9)
Second, lubana2021gradientflowframework present evidence that preserving the GF also implicitly preserves the model loss $\mathcal{L}$. In particular, preserving the GF also preserves the gradient of the loss w.r.t. the prediction, $\nabla_f \mathcal{L}$. Hence, relation (7) and the preservation of $\nabla_f \mathcal{L}$ imply that the NTK is also preserved when the GF is preserved.
Furthermore, given that the Taylor expansions (1) and (3) hold, the pruning criterion (4), which preserves the GF (and thereby the gradient-norm dynamics), also preserves the NTK (and thereby the prediction dynamics). This remark is crucial: while the dynamics of the neural network's predictions during training can be approximated well by (1) during the lazy kernel regime, the approximation might not be accurate during the rich active regime.
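The chain of equalities (7)-(9) can be verified numerically. The toy model below (a model linear in its parameters, with squared loss; all concrete sizes are our own illustrative choices) checks that $\|\nabla_\theta \mathcal{L}\|^2$ equals $\nabla_f \mathcal{L}^\top \, \Theta \, \nabla_f \mathcal{L}$:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((7, 3))    # 7 inputs, 3 parameters; f(x) = theta^T x
theta = rng.standard_normal(3)
y = rng.standard_normal(7)

f = X @ theta
grad_f_L = f - y                   # dL/df for L = 0.5 ||f - y||^2
jac = X                            # d f(x_i) / d theta (Jacobian of predictions)
ntk = jac @ jac.T                  # NTK Gram matrix

gf_direct = np.sum((jac.T @ grad_f_L) ** 2)   # ||grad_theta L||^2, Eq. (7)
gf_via_ntk = grad_f_L @ ntk @ grad_f_L        # Eq. (9)
print(np.allclose(gf_direct, gf_via_ntk))     # True
```

The identity holds because $\nabla_\theta \mathcal{L} = \nabla_\theta f^\top \nabla_f \mathcal{L}$ by the chain rule, so the GF is a quadratic form of $\nabla_f \mathcal{L}$ under the NTK.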
4.3 When to prune?
First, we show in an introductory experiment that the pruning time has an important impact on the final accuracy of the pruned model. We train multiple ResNet50 models on CIFAR100 and prune each at a different epoch, ranging from before training to late in training (see Fig. 2). We observe that (1) the longer we train the dense model before pruning, the higher the final accuracy of the sparse model, and, most importantly, (2) after a certain point in time, further training of the dense model before pruning does not bring significant improvement in the final accuracy. Indeed, we observe a 3% improvement in the final accuracy of the sparse model when pruning at epoch 1 of training instead of before training, an 11% improvement when pruning at epoch 26, and no notable improvement when pruning after epoch 30.
We now introduce the pruning time detection score used by EarlyCroP, which is motivated from both practical and theoretical perspectives. EarlyCroP aims to detect the best time for pruning in two steps: (1) at every epoch $t$ we compute the pruning time score

$s_t = \|\theta_t - \theta_0\|_2$   (10)

and (2) if the difference of the scores at two subsequent epochs, relative to the initial weight norm, is smaller than a defined threshold $\epsilon$,

$\frac{|s_{t+1} - s_t|}{\|\theta_0\|_2} < \epsilon$   (11)

we run the EarlyCroP pruning algorithm described in Algorithm 1. The smaller the pruning time score changes, the more negligible the second-order term in Eq. (3) becomes, making the latter a good approximation. Additionally, by the triangle inequality, we can extract the following upper bound from (11):

$\frac{|s_{t+1} - s_t|}{\|\theta_0\|_2} \leq \frac{\|\theta_{t+1} - \theta_t\|_2}{\|\theta_0\|_2}$   (12)
which is expected to lie in $[0, 1]$ when weights change less significantly over time. Hence, the scale of the threshold $\epsilon$ is expected to be similar for different models and datasets. The complexity of computing the score is $\mathcal{O}(n)$, where $n$ is the number of model parameters, thus incurring only minor computational overhead at every epoch to detect the pruning time.
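The detector of Eqs. (10)-(11) amounts to a few lines of code. The sketch below is illustrative rather than the paper's implementation: the threshold value and the simulated weight trajectory (large early updates that shrink over time, mimicking the transition from the rich active to the lazy kernel regime) are our own assumptions:

```python
import numpy as np

def detection_score(theta_t, theta_0):
    # Eq. (10), normalized by the initial weight norm.
    return np.linalg.norm(theta_t - theta_0) / np.linalg.norm(theta_0)

def should_prune(score_prev, score_curr, eps=1e-2):
    # Eq. (11): fire once the score stops changing between epochs.
    return abs(score_curr - score_prev) < eps

rng = np.random.default_rng(5)
theta_0 = rng.standard_normal(1000)
theta = theta_0.copy()
prune_epoch = None
prev = 0.0
for epoch in range(1, 50):
    step = rng.standard_normal(1000) / epoch**2   # shrinking updates
    theta = theta + step
    curr = detection_score(theta, theta_0)
    if prune_epoch is None and should_prune(prev, curr):
        prune_epoch = epoch                       # transition detected: prune here
    prev = curr
```

Since each epoch only requires one pass over the flattened parameter vector, the overhead of running the detector alongside training is negligible.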
From a theoretical perspective, the pruning time detection algorithm's goal is to detect when the linearization of the prediction dynamics (1) assumed by the NTK holds during training. In the early phase of training, called the rich active regime, neural network parameters move a significant distance from the initial weights. Thus, the linearization of the prediction dynamics (1) usually does not hold in early training epochs, and the NTK quickly changes. This rich active regime is crucial to achieve high performance, in particular for deep models (woodworth2020regime). In the second phase of training, called the lazy kernel regime, the parameters move by a small distance, making the linearization (1) a good approximation of the training dynamics of the predictions (sun2019optimization). Since our weight importance score (4) assumes the linearization (1), the best moment to prune is when the model transitions to the lazy kernel regime, during which the NTK is approximately constant. Further, previous works (sun2019optimization; amari2020target; ghorbani2020linearized) showed that constancy of the NTK is a consequence of a constant weight norm during training. The transition to the lazy kernel training regime is gradual and can be detected when the relative change in the weight norm from initialization becomes roughly constant, i.e., when the detection criterion (11) becomes very close to 0.

In practice, we expect the pruning time criterion to be a reliable indicator of the final accuracy of the pruned model. Indeed, we observed that the detection score correlates well with the final test accuracy of the sparse model (see Fig. 2): the smaller the detection score at the moment of pruning, the higher the final test accuracy of the pruned model. We also observed, in Fig. 2 and in Fig. 8 in the appendix, that pruning to higher sparsities benefits more from longer training. Therefore, we couple the pruning threshold $\epsilon$ to the target sparsity so that, as desired, a higher target sparsity leads to longer dense training (see Fig. 8).
5 Empirical Evaluation
We now show the effectiveness of EarlyCroP for structured pruning (EarlyCroPS) and unstructured pruning (EarlyCroPU). For this, we determine the point for early pruning as described in Section 4.3. We use CroP for our pruning criterion applied before training, and CroPit if pruning is additionally performed iteratively. We use a cloud instance with GTX 1080 Ti GPUs for all experiments. Further details about the experimental setup can be found in the appendix. The code and further supplementary material are available online (www.cs.cit.tum.de/daml/earlycrop/).
Figure 3: Structured (left) and unstructured (right) test accuracy for ResNet18/CIFAR10 (a), ResNet18/TinyImagenet (b), VGG16/CIFAR10 (c), and VGG16/CIFAR100 (d) with increasing weight sparsity.
Image Classification. The datasets used for image classification are the common public benchmarks CIFAR10 (Krizhevsky09), CIFAR100 (Krizhevsky09), and TinyImagenet (5206848). Regarding networks, we use ResNet18, VGG16, ResNeXt101 32x16d, and ResNeXt101 32x48d. For unstructured pruning baselines, we use random pruning, SNIP (lee2018snip), GraSP (Wang2020Picking), and LTR (frankle2020linear). For structured pruning baselines, we use random pruning, EfficientConvNets (li2016pruning), GateDecorators (you2019gate), and SNAP (verdenius2020pruning). All models are trained for 80 epochs, except for LTR, which retrains the network up to 10 times. We report train and test accuracy, weight and node sparsity, batch and total training time in seconds, GPU memory in GB, disk size in MB, and the carbon emitted from the extraction and training of the sparse model in grams, measured using CodeCarbon (codecarbon). Note that the total training time includes the time to find and train the sparse model.

Regression. We evaluate a Fully Convolutional Residual Network (laina2016deeper) on the NYU depth estimation task (Silberman:ECCV12). We compare EarlyCroPS and EarlyCroPU against all unstructured baselines, since they are stronger than the structured baselines. All pruned models are trained for 10 epochs. We report performance using the Root Mean Squared Error (RMSE).

Natural Language Processing (NLP). We evaluate the Pointer Sentinel Mixture Model (Merity2017PointerSM) on the PTB language modeling dataset (PTBDATASET). We compare EarlyCroPS and EarlyCroPU to all unstructured baselines, since they are stronger than the structured baselines. We train the pruned models for 30 epochs and report the achieved log perplexity.
Reinforcement Learning (RL). We use the FLARE framework (akbik2019flair) to evaluate a simple 3-layer FCNN with layer size 256, trained with the A2C algorithm on the classic control game CartPole-v0 (openaigym). We run 20 agents with 640 games each. We compare EarlyCroPS and EarlyCroPU against the LTR and random baselines. All pruned models are trained for 30 epochs. We report the performance of the pruned models using the average returned environment reward.
5.1 Image Classification
Accuracy. We present the accuracy over different sparsity levels for the model-dataset combinations ResNet18/CIFAR10, ResNet18/TinyImagenet, VGG16/CIFAR10, and VGG16/CIFAR100 in Figure 3 a-d, respectively. Our methods EarlyCroPS and EarlyCroPU consistently outperform all other methods except the LTR, with which we perform on par. However, as will be discussed later, the LTR comes with a 3-5 times higher training time than the dense model, while our methods reduce training time. There are two further exceptions when it comes to the best accuracy on ResNet18/TinyImageNet. First, GateDecorators performs as well as EarlyCroPS. Second, for lower sparsity rates, EarlyCroPU is outperformed by some methods that prune before training.
Structured vs. Unstructured. EarlyCroPS closes the accuracy gap between unstructured and structured approaches on the CIFAR10 dataset. However, a gap remains for the larger and more complex datasets CIFAR100 and TinyImagenet. Nevertheless, structured pruning can be used to reduce training time and memory requirements. This also implies that with EarlyCroPS we can use a larger model while saving compute (see Sec. 5.2).
Table 1: Results for ResNet18/CIFAR10 at 95% weight sparsity.

| Method | Test accuracy ↑ | Weight sparsity | Node sparsity | Batch time [s] ↓ | Train time [s] ↓ | GPU RAM [GB] ↓ | Disk [MB] ↓ | CO₂ [g] ↓ |
|---|---|---|---|---|---|---|---|---|
| Dense | 91.5% ± 0.12 | – | – | 0.78 | 109 | 2.38 | 398 | 83 |
| RandomS | 86.3% ± 0.06 | 93.7% | 75.0% | 0.68 | 82 | 0.62 | 24.9 | 38 |
| SNAP | 87.6% ± 0.94 | 93.6% | 72.6% | 0.70 | 81 | 0.63 | 25.4 | 39 |
| CroPS | 87.5% ± 0.36 | 93.6% | 72.3% | 0.67 | 91 | 0.63 | 25.4 | 43 |
| CroPitS | 87.8% ± 0.33 | 95.0% | 74.5% | 0.52 | 64 | 0.59 | 19.6 | 35 |
| EarlyBird | 84.3% ± 0.32 | 95.3% | 65.0% | 0.48 | 72 | 0.58 | 19.1 | 55 |
| EarlyCroPS | 91.0% ± 0.52 | 95.1% | 65.8% | 0.52 | 66 | 0.56 | 19.2 | 68 |
| GateDecorators | 87.3% ± 0.09 | 95.7% | 73.7% | 0.72 | 83 | 0.58 | 17.2 | 54 |
| EfficientConvNets | 70.5% ± 0.53 | 95.9% | 79.7% | 0.77 | 83 | 0.76 | 25.4 | 63 |
| RandomU | 84.9% ± 0.24 | 95.0% | – | 0.78 | 102 | 2.86 | 12.0 | 79 |
| SNIP | 88.2% ± 0.57 | 95.0% | – | 0.79 | 105 | 2.86 | 12.0 | 80 |
| GRASP | 88.4% ± 0.13 | 95.0% | – | 0.79 | 106 | 2.84 | 12.0 | 81 |
| CroPU | 87.9% ± 0.16 | 95.0% | – | 0.75 | 107 | 2.88 | 12.0 | 79 |
| CroPitU | 89.1% ± 0.24 | 95.0% | – | 0.80 | 113 | 2.88 | 12.0 | 87 |
| EarlyCroPU | 91.1% ± 0.23 | 95.0% | – | 0.74 | 97 | 2.86 | 12.0 | 83 |
| LTR | 91.5% ± 0.26 | 95.0% | – | 1.94 | 111 | 2.51 | 12.0 | 202 |
± denotes standard deviation, and ↑/↓ indicate metrics where higher/lower is better. Bold/underline indicate best/second-best results. GPU RAM and disk correspond to those of the final pruned model.

Table 2: Results for VGG16/CIFAR10 at 98% weight sparsity.

| Method | Test accuracy ↑ | Weight sparsity | Node sparsity | Batch time [s] ↓ | Train time [s] ↓ | GPU RAM [GB] ↓ | Disk [MB] ↓ | CO₂ [g] ↓ |
|---|---|---|---|---|---|---|---|---|
| Dense | 90.2% | – | – | 1.82 | 290 | 1.02 | 1720 | 246 |
| RandomS | 89.3% | 98.0% | 86.1% | 0.67 | 82 | 0.23 | 33.6 | 43 |
| SNAP | 89.8% | 98.2% | 89.0% | 0.68 | 89 | 0.22 | 30 | 55 |
| CroPS | 91.1% | 98.0% | 88.0% | 0.71 | 91 | 0.23 | 33.6 | 83 |
| CroPitS | 92.4% | 98.0% | 88.0% | 0.81 | 112 | 0.23 | 30.4 | 100 |
| EarlyBird | 85.9% | 98% | 89% | 0.52 | 110 | 0.32 | 36.2 | 160 |
| EarlyCroPS | 93.0% | 98.0% | 89.0% | 1.16 | 112 | 0.63 | 36.0 | 156 |
| GateDecorators | 90.0% | 98.0% | 87.0% | 1.07 | 111 | 0.23 | 37.8 | 143 |
| EfficientConvNets | 84.2% | 98.0% | 86.0% | 1.66 | 89 | 0.64 | 34.2 | 209 |
| RandomU | 88.5% | 98.0% | – | 2.03 | 159 | 1.22 | 35.0 | 247 |
| SNIP | 90.1% | 98.0% | – | 2.02 | 157 | 1.22 | 35.0 | 248 |
| GRASP | 92.0% | 98.0% | – | 2.03 | 157 | 1.23 | 35.0 | 249 |
| CroPU | 91.8% | 98.0% | – | 2.02 | 157 | 1.22 | 35.0 | 248 |
| CroPitU | 91.6% | 98.0% | – | 2.02 | 157 | 1.22 | 35.0 | 249 |
| EarlyCroPU | 93.0% | 98.0% | – | 2.01 | 157 | 1.22 | 35.0 | 250 |
| LTR | 93.6% | 98.0% | – | 4.07 | 158 | 1.22 | 35.0 | 592 |
Training cost. In Tables 1 & 2, we complement the accuracy with the training time, batch time, GPU RAM, disk space, and CO₂ emissions for a sparsity of 95% on ResNet18/CIFAR10 and 98% on VGG16/CIFAR10, respectively. Our EarlyCroPS not only preserves the high accuracy but also comes with significant improvements in training time (33% and 36%, resp.) and time per batch (39% and 61%, resp.). It is as efficient as the other structured pruning methods or outperforms them. When considering only the CO₂ footprint, CroPitS and SNAP outperform methods that prune later in training. In the appendix, we give details about further model-dataset combinations; in summary, the stated observations also hold for the other evaluated model-dataset combinations.
Pruning early vs. before. Pruning early in training (i.e., when we enter the lazy kernel regime) outperforms pruning before training: EarlyCroPS and EarlyCroPU have a clear edge over the methods that prune before training, even when those use the same pruning criterion (CroP) and additionally prune iteratively (CroPit). The only drawback of pruning early in training vs. before is that for the first epochs we either require more GPU RAM or need to reduce the batch size. For the model size on disk, we do not find a significant difference among pruning methods.
5.2 Pruning a Large Model
The goal of this experiment is twofold: we show that (1) our criterion can be used to prune large models that do not fit on commodity GPUs, and (2) the resulting sparse model matches the performance of the dense one and outperforms a dense model of the same size. To this end, we introduce ResNext101_32x48d as our large model, a network with 829 million parameters that requires 15.5 GB to be loaded into GPU memory, exceeding the memory of common GPUs such as the RTX 3080 Ti. Nevertheless, with our method we can still efficiently train such a large model: we perform one initial pruning step before training using the CPU and then continue on a commodity GPU as usual. We also introduce ResNext101_32x16d as our smaller dense model, which has 193 million parameters and requires 3.9 GB to be loaded into GPU memory. The results of the experiment are shown in Table 3.
First, we observe that the large ResNext101_32x48d pruned to 98.5% weight sparsity matches the test accuracy of its dense counterpart. Moreover, training the sparse subnetwork has a 14 times smaller carbon footprint, is 7 times faster, produces a model 192 times smaller on disk, and takes 4.9 times less GPU memory than training the large dense model. Interestingly, the pruned model also outperforms the ResNext101_32x16d model of the same size, while training 6.2 times faster and emitting 9.5 times less carbon. Finally, we show that when training for more epochs, the sparse model achieves an even bigger performance gap over both dense models while still taking less total training time. This experiment shows not only that CroPS makes training large models on commodity machines possible, but also that it can extract sparse models that are more efficient and more accurate than dense models of the same size.
Table 3: Pruning the large ResNext101_32x48d (RN48) with CroPS. RN48S denotes the pruned model and RN16 the smaller dense ResNext101_32x16d.

| Model | Test accuracy ↑ | Weight sparsity | Node sparsity | Epochs | Train time ↓ | GPU RAM [GB] ↓ | CO₂ [g] ↓ |
|---|---|---|---|---|---|---|---|
| RN48 | 92.4% | – | – | 30 | 4.60 | 18.84 | 634 |
| RN16 | 92.1% | – | – | 30 | 4.02 | 3.89 | 445 |
| RN48S | 92.5% | 98.5% | 89.9% | 30 | 0.64 | 3.56 | 47 |
| RN48S | 93.2% | 98.5% | 89.9% | 80 | 2.60 | 3.56 | 194 |
5.3 Regression
For regression, we can see from Figure 5 that both variants of EarlyCroP preserve the dense model's RMSE even at 99.9% weight sparsity. All before-training methods except GraSP show an immediate increase to 0.20 RMSE, with further degradation at higher sparsities. Surprisingly, random pruning outperforms GraSP at all pruning ratios. This is because GraSP prunes entire layers, limiting the network's learning capabilities.
5.4 Natural Language Processing
NLP is the most challenging of all evaluated tasks. Nevertheless, both versions of EarlyCroP outperform all other baselines up to 89% sparsity (see Figure 6). Beyond that, the unstructured version is on par with or slightly better than other unstructured baselines, whereas the structured version continues to outperform all compared baselines. In this task, the importance of early pruning is accentuated by the large gap between the early and before-training versions of CroP. Interestingly, LTR performs very poorly compared to all other baselines at all reported pruning ratios. Indeed, certain layers in the PSMM network converge to small weight magnitudes during training compared to the rest of the network, which means that any pruning method relying solely on weight magnitudes and operating at a global scale would prune these layers entirely, leading to an untrainable network. This experiment thus highlights the importance of gradient-based information when evaluating the importance of model parameters. We show additional NLP results by evaluating BERT on multiple language tasks (see Appendix E.2).
5.5 Reinforcement Learning
We can observe from Figure 4 that EarlyCroP outperforms LTR in both the structured and unstructured settings. Note that EarlyCroPS once again outperforms its unstructured counterpart. However, if we allow the unstructured models to train for longer, they reach similar performance to the structured version. This can be explained by the ease of training structured models, which remain fully-connected models in which all computed gradients contribute to weight updates, whereas unstructured models compute gradients for pruned weights that are never used, slowing training.
6 Conclusion
We have demonstrated that, for vision, NLP, and RL tasks, EarlyCroPU extracts winning tickets matching and often outperforming those found by LTR, by pruning early in training when the model enters the lazy kernel training regime. Additionally, we showed that EarlyCroPS outperforms other structured methods, providing the best tradeoff between final test accuracy and efficiency in terms of time, space, and carbon emissions. Finally, we showed that we can use CroPS to train models that do not fit on commodity GPUs by extracting sparse models that preserve the initial model's performance and outperform a similarly sized dense model trained for the same number of epochs. Thus, our methods bring tangible real-world benefits for researchers and practitioners. We hope that the results in this paper motivate further research on structured pruning in the early phase of DNN training.
References
Appendix A Training and Inference Cost Computation
This section details the computations in Figure 1. We consider two GPUs: the V100 16GB ($2.48/h) and the V100 32GB ($4.96/h). We use the total training time needed to train RN48 and RN48S for 30 epochs each (see Table 3). Since RN48 requires 18.84 GB of GPU memory, it must be trained on the V100 32GB, whereas RN48S (3.56 GB) fits on the V100 16GB; the total cost of each run is then its training time multiplied by the hourly price.
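A minimal sketch of this cost computation, assuming cost = training time × hourly GPU price, with times and memory footprints taken from Table 3 (the GPU choice follows because RN48's 18.84 GB exceeds the 16 GB card):

```python
# Hourly prices from the text; training times (30 epochs) from Table 3.
PRICE_V100_16GB = 2.48  # $/h, sufficient for RN48S (3.56 GB)
PRICE_V100_32GB = 4.96  # $/h, required for RN48 (18.84 GB > 16 GB)

cost_rn48 = 4.60 * PRICE_V100_32GB   # dense model: ~$22.82
cost_rn48s = 0.64 * PRICE_V100_16GB  # sparse model: ~$1.59
print(f"RN48: ${cost_rn48:.2f}, RN48S: ${cost_rn48s:.2f}")
```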
Appendix B Algorithm
Appendix C Experimental Setup
C.1 Optimization
Image Classification. For all experiments, we use the ADAM optimizer (adamoptimizer) and a learning rate of . The One Cycle learning rate scheduler is used to train all models except VGG16. The batch size is 256 for the CIFAR10 and CIFAR100 experiments and 128 for TinyImagenet. All sparse models are trained for the same number of epochs (80), which, except for LTR, includes the epochs required to extract the sparse model. In the case of LTR, the final sparse model is trained for 80 epochs.
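The optimizer and scheduler setup described above can be sketched in PyTorch as follows. The peak learning rate and steps-per-epoch below are illustrative placeholders (the exact learning rate is elided in the text); 196 steps/epoch corresponds to 50,000 CIFAR10 images at batch size 256.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(3072, 10)  # stand-in for the actual network
optimizer = Adam(model.parameters(), lr=1e-3)  # peak LR is governed by max_lr
# One Cycle schedule, stepped once per batch over the full 80-epoch budget.
scheduler = OneCycleLR(optimizer, max_lr=1e-3, epochs=80, steps_per_epoch=196)

# Inside the training loop, per batch:
#   loss.backward()
#   optimizer.step()
#   scheduler.step()
```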
Regression. For all experiments, we use a batch size of 8 and the ADAM optimizer with a learning rate of . All pruned models are trained for 10 epochs.
Natural Language Processing. For all experiments, we use a batch size of 128 and the ADAM optimizer with a learning rate of . All pruned models are trained for 30 epochs.
Reinforcement Learning. A description of the models used and number of runs used for each environment can be found in Table 4.
Name  Network  Algorithm  Agents  Games 
CartPole-v0  MLP(128-128-128-out)  A2C  16  8000 
Acrobot-v1  MLP(256-256-256-out)  A2C  16  8000 
LunarLander-v2  MLP(256-256-256-out)  A2C  16  8000 
C.2 Dataset Pre-Processing
CIFAR10 (Krizhevsky09)
We augment the normalized CIFAR10 with Random Crop and Random Horizontal Flip. Images are additionally resized to .
CIFAR100 (Krizhevsky09)
We augment the normalized CIFAR100 with Random Crop, Random Horizontal Flip, and Random Rotation.
TinyImagenet (5206848)
We normalize the dataset and resize each image to .
Appendix D Evaluation Metrics
In this section we describe how specific metrics are calculated.
Time
We report time in two ways. First, we report the total time required (Training time), measured from the start of the experiment until the sparse model finishes training. Second, we report the time of a full forward and backward pass on a given batch (Batch time), using the CUDA time measurement tool (NEURIPS2019_9015).
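A minimal sketch of such a batch-time measurement in PyTorch (the helper below is illustrative, not the paper's exact code): on GPU, CUDA events are required because kernel launches are asynchronous, so plain wall-clock timing would miss in-flight work; on CPU, wall-clock timing suffices.

```python
import time
import torch

def batch_time_ms(model, batch, target, loss_fn, n_warmup=3, n_iters=10):
    """Average time of one full forward + backward pass, in milliseconds."""
    for _ in range(n_warmup):  # warm-up iterations, excluded from timing
        loss_fn(model(batch), target).backward()
        model.zero_grad()
    if torch.cuda.is_available():
        # Bracket the timed loop with CUDA events and synchronize before reading.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            loss_fn(model(batch), target).backward()
            model.zero_grad()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_iters
    # CPU fallback: plain wall-clock timing is accurate here.
    t0 = time.perf_counter()
    for _ in range(n_iters):
        loss_fn(model(batch), target).backward()
        model.zero_grad()
    return (time.perf_counter() - t0) * 1000 / n_iters
```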
GPU RAM
The RAM footprint of a process refers to how much memory it consumes on the GPU. This effectively includes the cost of loading the model and performing a training step on it. We use the CUDA memory measurement tool (NEURIPS2019_9015) to report this metric.
Disk Storage
We estimate the storage needed to store a model on disk using the CSR sparse matrix format (10.1145/1583991.1584053). Similarly to (verdenius2020pruning), we used a ratio of 16:1 float precision on all vectors of the CSR format.
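To make this estimate concrete, here is a rough sketch. The 2-byte value / 4-byte index split is an assumption of this sketch (one plausible reading of the 16:1 precision convention); adjust the byte widths to match the actual format.

```python
def csr_size_bytes(nnz, n_rows, value_bytes=2, index_bytes=4):
    """Estimate on-disk size of a CSR matrix: values (nnz entries),
    column indices (nnz entries), and row pointers (n_rows + 1 entries)."""
    return nnz * value_bytes + nnz * index_bytes + (n_rows + 1) * index_bytes

# Example: a hypothetical 98%-sparse 4096x4096 weight matrix.
nnz = int(4096 * 4096 * 0.02)
print(csr_size_bytes(nnz, 4096) / 1e6, "MB")  # ~2 MB
```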
Energy Emissions
We estimate CO2 emissions in grams using the CodeCarbon emissions tracker (codecarbon). These estimates cover all emissions from the start of the experiments until the end of training.
Appendix E Additional Results
E.1 Reinforcement Learning
See Figure 7 for experiments on the Acrobot-v1 and LunarLander-v2 environments.
E.2 Natural Language Processing
We evaluated BERT on multiple language tasks (see Table 8). At the same pruning sparsity, EarlyCroPU outperforms LTR on 5 out of 8 tasks while training 10× faster.
E.3 Pruning Point Experiments
In Figure 8 we present more experiments on pruning models at different points in training. We can clearly observe a correlation between the desired pruning rate and the optimal time to prune. The higher the desired final sparsity, the longer the network should be trained before being pruned.
E.4 VGG16/CIFAR100
See Table 5 for a comparison between different pruning criteria at the same pruning level on VGG16/CIFAR100.
Method  Test accuracy  Weight sparsity  Node sparsity  Training time (h)  Batch time (ms)  GPU RAM (GB)  Disk (MB)  Emissions (g)  
  Dense  62.1%      0.77  114  1.03  1745  88 
Structured  RandomS  53.9%  98.0%  86.0%  0.59  53  0.23  35  29 
SNAP  49.3%  98.0%  89.0%  0.67  54  0.16  36  33  
CroPS  57.4%  98.0%  89.0%  0.61  46  0.23  36  35  
CroPitS  56.5%  98.1%  89.0%  0.62  44  0.23  33  30  
EarlyBird  60.7%  98.0%  89.0%  0.56  68  0.20  36  62  
EarlyCroPS  62.2%  97.9%  88.0%  0.64  69  0.23  36  58  
GateDecorators  55.0%  97.9%  87.0%  0.61  78  0.23  36  68  
EfficientConvNets  29.5%  98.0%  86.0%  0.72  55  0.24  36  83  
Unstructured  RandomU  55.8%  98.0%    0.74  118  1.23  35  99 
SNIP  61.9%  98.0%    0.79  109  1.24  35  90  
GRASP  63.4%  98.0%    0.79  113  1.24  35  91  
CroPU  63.8%  98.0%    0.74  109  1.23  35  94  
CroPitU  56.3%  98.0%    0.74  111  1.23  35  91  
EarlyCroPU  65.1%  98.0%    0.74  109  1.23  35  91  
LTR  64.7%  98.0%    3.44  109  1.28  35  301 
E.5 ResNet18/TinyImageNet
See Table 6 for a comparison between different pruning criteria at the same pruning level on ResNet18/TinyImagenet.
Method  Test accuracy  Weight sparsity  Node sparsity  Training time (h)  Batch time (ms)  GPU RAM (GB)  Disk (MB)  Emissions (g)  
  Dense  51.3%      7.26  320  3.53  569  882 
Structured  RandomS  37.3%  91.2%  80.0%  6.23  289  1.08  51  464 
SNAP  38.3%  90.4%  82.6%  6.06  268  0.84  55  514  
CroPS  39.1%  90.1%  77.7%  6.72  237  1.11  54  615  
CroPitS  39.1%  91.4%  79.3%  6.66  236  1.08  49  591  
EarlyCroPS  39.2%  90.8%  84.1%  7.03  202  0.25  49  676  
GateDecorators  30.1%  89.2%  91.2%  6.20  193  0.87  61  930  
EfficientConvNets  27.7%  91.0%  79.8%  6.60  226  0.22  52  769  
Unstructured  RandomU  49.3%  90.0%    7.25  351  4.20  57  932 
SNIP  46.2%  90.0%    7.27  314  4.18  57  854  
GRASP  43.7%  90.0%    7.27  315  4.22  57  881  
CroPU  46.7%  90.0%    7.26  314  4.22  57  877  
CroPitU  19.1%  90.0%    7.26  313  4.22  57  890  
EarlyCroPU  49.8%  90.0%    7.26  314  4.23  57  880  
LTR  46.3%  90.0%    44.7  603  3.68  57  5540 
E.6 VGG16/ImageNet
In Table 7 we present a comparison between a dense VGG16 and a sparse VGG16 pruned using EarlyCroPS on the ImageNet2012 (ILSVRC2012) classification dataset (5206848). Given 62 hours of training on a single V100 GPU, EarlyCroPS on VGG16 (50% of weights pruned) achieves an accuracy of 61.43%, while the dense model achieves only 58.78%. Moreover, for the same number of training epochs, EarlyCroPS achieves 60.01% in 51 hours while the dense model achieves 58.78% in 62 hours.
Method  Top1 Accuracy  Top5 Accuracy  Train Time (hours)  Epochs  Batch Time (seconds)  GPU Memory (GB)  
Dense  58.78%  82.55%  62  18  1.01  12.15  
EarlyCroPS  61.43%  87.01%  62  26  0.66  10.63  
EarlyCroPS  60.01%  83.38%  51  18  0.66  10.63 
MNLI  QQP  STSB  WNLI  QNLI  RTE  SST2  CoLA  Training time  
Dense BERT  82.39  90.19  88.44  54.93  89.14  63.30  92.12  54.51  1x  
Sparsity  70%  90%  50%  90%  70%  60%  60%  50%  
LTR (Rewind 0%)  82.45  89.20  88.12  54.93  88.05  63.06  91.74  52.05  10x  
LTR (Rewind 50%)  82.94  89.54  88.41  53.32  88.72  62.45  92.66  52.00  10x  
EarlyCroPU  82.11  89.99  88.02  56.33  89.12  62.1  92.03  52.2  1x 