Single Shot Structured Pruning Before Training

07/01/2020 ∙ by Joost van Amersfoort, et al. ∙ University of Oxford

We introduce a method to speed up training by 2x and inference by 3x in deep neural networks using structured pruning applied before training. Unlike previous works on pruning before training which prune individual weights, our work develops a methodology to remove entire channels and hidden units with the explicit aim of speeding up training and inference. We introduce a compute-aware scoring mechanism which enables pruning in units of sensitivity per FLOP removed, allowing even greater speed ups. Our method is fast, easy to implement, and needs just one forward/backward pass on a single batch of data to complete pruning before training begins.




1 Introduction

Pruning techniques (LeCun et al., 1990; Hassibi and Stork, 1993; Wang et al., 2019a) are able to successfully compress and speed up inference in trained neural networks. However, they do nothing to address the speed and computational cost of training the initial model, which can have a significant CO2 footprint (Strubell et al., 2019). Recently, it has been shown that it is possible to prune a deep network before training (Lee et al., 2019; Wang et al., 2020). These methods are unstructured pruning methods, i.e., they prune individual weights from convolutional or linear layers. Unstructured pruning methods only lead to speed improvements with specialized hardware (Sze et al., 2017), and because sparse weights do not induce sparse activations, they do not reduce run-time memory footprint either.

In this work, we propose a structured and compute-aware Pruning Before Training (PBT) method. Structured pruning methods remove entire channels in convolutional layers and hidden units in linear layers, leading to speed ups and reduced memory consumption on standard compute devices. Our method, Single Shot Structured Pruning (3SP), is easy to implement and has few hyper-parameters to tune. The pruned model trains 2x faster (on a GPU) and performs inference 3x faster (on a CPU), with only a 0.5% loss in accuracy on CIFAR-10. Our method needs just one forward-backward pass through the model with a single minibatch of data. To the best of our knowledge, structured pruning has never been attempted before training.

3SP is based on the SNIP objective (Lee et al., 2019), and we give an extensive empirical analysis of extending this objective to the structured setting. We further introduce a compute-aware weighting of the pruning score which measures the impact on the loss per unit compute removed. This actively biases pruning to remove more compute-intensive channels: a channel could be removed if either it has a small effect on the loss, or if it has a significant computational cost. Using this additional term, we are able to increase compute reduction from 60% to 85%.

Speeding up the training of large neural networks is useful when the training data changes quickly and models need to adjust rapidly. One example of this is active learning (Settles, 2010), where trained models are used to identify the most informative datapoints to label for inclusion in the training data. In §4.1 we show how 3SP can be used to identify valuable data to acquire faster than a full model, allowing us to achieve better accuracy within a time-budget than an un-pruned model could.

Our main contributions are:

  • We introduce 3SP, a structured pruning before training (PBT) method that speeds up training by 2x and inference by 3x with only a small loss in accuracy.

  • We show how to prune explicitly in units of compute cost, rather than number of weights, allowing even greater reduction in compute.

  • We empirically study different objectives for unstructured pruning before training (SNIP and GraSP) and the impact of moving to the structured pruning domain.

2 Background

Figure 1: Binary mask tensor (orange) in structured vs. unstructured pruning for a fully-connected layer’s weight matrix (green). There is one mask per weight in the unstructured pruning mask, while in structured pruning, the binary mask has one entry per hidden unit (a column of the weight tensor).

In this section, we review key concepts that 3SP builds upon, leaving a comparison to other model compression techniques in the literature to §6.

Pruning Before Training. SNIP (Lee et al., 2019) introduced an effective method to prune individual weights before training. SNIP aims to remove weights from the neural network such that the difference between the loss of the full model and the loss of the pruned model is as small as possible. This is the same goal as the classic post-training pruning approaches in Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi and Stork, 1993), but applied before training. To approximate the effect on the loss of removing a weight, SNIP attaches a multiplicative binary mask to each weight. The mask value is one when a weight is kept as part of the model and zero when it is pruned. Their method can therefore be thought of as finding the value of the binary mask tensor $m$ which minimizes the change in loss:

$$\min_{m} \left| L\big(f(x; \theta)\big) - L\big(f(x; \theta \odot m)\big) \right|,$$

with $L$ the loss and $f(x; \theta \odot m)$ the neural network evaluated with weights $\theta$ masked by mask $m$. This discrete optimization problem is intractable. Instead, Lee et al. (2019) solve a continuous relaxation by differentiating the loss with respect to the mask parameters on a single batch of training data and using the first-order approximation $\Delta L(m_j) \approx \left| \partial L / \partial m_j \right|$. Using the tensor of scores, $s$, a threshold is computed given the desired prune ratio. All mask entries whose score falls below the threshold are set to zero. SNIP relies on the strong assumption that the importance of the weights does not depend on which other weights are removed, i.e. no higher-order terms in the Taylor expansion are considered.
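The scoring step described above can be illustrated on a toy problem. The following is a minimal sketch, not the SNIP implementation: it assumes a linear model with a squared-error loss so that the gradient with respect to each mask entry can be written by hand, and the names `snip_scores` and `prune_mask` are hypothetical.

```python
# Toy model (an assumption for illustration): f(x) = sum_j m_j * theta_j * x_j,
# with loss L = 0.5 * (f(x) - y)^2 on a single example.


def snip_scores(theta, x, y):
    """Return |dL/dm_j| evaluated at m = 1 for every weight j."""
    prediction = sum(t * xi for t, xi in zip(theta, x))
    residual = prediction - y
    # Chain rule: dL/dm_j = (f(x) - y) * theta_j * x_j when every m_j = 1.
    return [abs(residual * t * xi) for t, xi in zip(theta, x)]


def prune_mask(scores, prune_ratio):
    """Set the lowest-scoring fraction of mask entries to zero."""
    num_pruned = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    mask = [1] * len(scores)
    for j in order[:num_pruned]:
        mask[j] = 0
    return mask


theta = [0.5, -2.0, 0.1, 1.5]
x = [1.0, 1.0, 1.0, 1.0]
y = 1.0
scores = snip_scores(theta, x, y)
mask = prune_mask(scores, prune_ratio=0.5)
```

In a real network the same gradient would come from one backward pass through an automatic differentiation library rather than from a hand-derived formula.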

GraSP (Wang et al., 2020), an alternative PBT method, uses a different criterion based on preserving gradient flow. GraSP keeps weights that contribute to the norm of the gradient; removing weights can potentially even improve gradient flow. While it works well in unstructured pruning, especially at high levels of sparsity, we show in §5 that GraSP’s approximation has shortcomings which are particularly problematic for structured pruning. In the rest of this paper, we therefore focus on the SNIP criterion.

Structured Pruning. In structured pruning, only entire channels in convolutional layers and columns of linear layers can be removed (Figure 1). This is a significant restriction compared to unstructured pruning where any individual weight can be removed. However, structured pruning reduces the computational cost of training and evaluating a model, because when entire channels are removed the size of the activations is also reduced, leading to a smaller model. Removing single weights does not lead to reduced (or even sparse) activations, and speeding up the computation of sparse convolutions and linear layers requires specialized hardware (Sze et al., 2017). In contrast, computational benefits from structured pruning are straightforward: the architecture is smaller.

The assumption that the importance of each weight is independent of the others, as described in the previous section, is even stronger in the structured case. Removing a single (output) channel can, for example, lead to the removal of 512 x 3 x 3 weights, such as in the later layers of VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016). Extending any unstructured method to a structured method is therefore not trivial, especially when the original score is noisy as in pruning before training.

(a) Unstructured
(b) Structured
Figure 2: Approximation accuracy of scores in unstructured and structured pruning. We compare the predicted change in loss from the Taylor expansion (scores) with the actual change in loss (normalized). The X-axis is sorted by the scores. While scores in structured pruning are noisy, we can still successfully identify parameters that contribute the most to the change in the loss.

3 Method

Our method, 3SP, is a structured PBT method based on the unstructured SNIP objective. It enables a significant improvement in training and inference speed using only a single forward-backward pass on one minibatch of data. We further provide a compute-aware extension to 3SP which aims to optimize the mask tensor in units of compute cost rather than numbers of weights. We provide a step-by-step description of how 3SP and its compute-aware extension work in Algorithm 1.

Data: minibatch of data from the dataset.
Result: best model mask for the pruned model.
Initialization: neural network with, for each layer, a number of in channels, a spatial height and width, and a kernel size; weights initialized following He et al. (2015); mask initialized at 1 for every entry; target pruning ratio; compute smoothing term.
Compute the scores: first-order approximation of the change in loss for each mask entry, using one minibatch.
if compute-aware then
    Calculate cost-per-layer.
    Apply compute cost smoothing.
    Normalize compute cost.
    for each score and associated layer do
        Convert 3SP score to cost-space.
    end for
else
    Use the non-compute-aware 3SP score.
end if
Set the threshold to the percentile of the scores given by the target pruning ratio.
Keep channels/units with a score above the threshold.
Algorithm 1 Structured Compute-Aware Pruning - 3SP
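The control flow of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name `three_sp_mask` and the argument names are hypothetical, the scores are taken as given, and the smoothing is assumed to be additive and applied before normalization, following the order of the steps in the algorithm.

```python
def three_sp_mask(scores, layer_costs, prune_ratio, compute_aware=False, alpha=0.0):
    """Return per-layer binary masks that keep the highest-scoring channels.

    scores: one list per layer holding |dL/dm| for each channel.
    layer_costs: compute cost of one channel in each layer (e.g. FLOPs).
    """
    if compute_aware:
        # ASSUMPTION: additive smoothing first, then normalization to sum to 1.
        smoothed = [cost + alpha for cost in layer_costs]
        total = sum(smoothed)
        normalized = [cost / total for cost in smoothed]
        # Convert each score to "loss impact per unit of compute".
        scores = [[s / cost for s in layer]
                  for layer, cost in zip(scores, normalized)]
    flat = sorted(s for layer in scores for s in layer)
    num_pruned = int(len(flat) * prune_ratio)
    if num_pruned == 0:
        return [[1] * len(layer) for layer in scores]
    threshold = flat[num_pruned - 1]
    # Keep entries strictly above the threshold (tied scores are pruned together).
    return [[1 if s > threshold else 0 for s in layer] for layer in scores]


# Hypothetical scores for two layers; channels in layer 0 cost 4x more compute.
scores = [[0.9, 0.2, 0.6], [0.5, 0.4, 0.1]]
layer_costs = [4.0, 1.0]
plain_masks = three_sp_mask(scores, layer_costs, prune_ratio=0.5)
aware_masks = three_sp_mask(scores, layer_costs, prune_ratio=0.5, compute_aware=True)
```

With these toy numbers, the compute-aware variant removes one more channel from the expensive layer and one fewer from the cheap layer than the plain variant does.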

3.1 Structured Pruning Before Training

SNIP defines a mask $m$ with the same shape as the weights, allowing it to turn off individual weights. We instead define $m$ to remove entire operations. In particular, for convolutional layers, each output channel gets one binary mask variable governing its entire spatial extent. Linear layers have a binary mask per hidden unit; we visualize this in Figure 1. Masking, therefore, can be implemented by changing the shape of the weight tensors and is equivalent to using a smaller model, unlike unstructured methods. We minimize a similar objective to SNIP:

$$\min_{m} \left| L\big(f(x; \theta)\big) - L\big(f_{m}(x; \theta)\big) \right|,$$

where $L$ is the loss, $f(x; \theta)$ is the full model with weights $\theta$, and $f_{m}$ is the model given by masking channels using $m$. In practice, this is done by removing all the weights associated with masked operations from the model, leading to a smaller, faster model. This is in contrast to unstructured pruning, where the mask is applied using an element-wise product.
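The claim that structured masking is equivalent to using a smaller model can be checked on a toy two-layer network. The sketch below is illustrative only: the helper names are hypothetical, and it stores each hidden unit as a row of `W1`, whereas Figure 1 draws hidden units as columns.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def forward(W1, w2, x, mask=None):
    """Two-layer MLP: out = w2 . (mask * relu(W1 x))."""
    hidden = relu(matvec(W1, x))
    if mask is not None:
        hidden = [m * h for m, h in zip(mask, hidden)]
    return sum(w * h for w, h in zip(w2, hidden))


def remove_units(W1, w2, mask):
    """Physically drop masked hidden units: rows of W1 and entries of w2."""
    keep = [i for i, m in enumerate(mask) if m == 1]
    return [W1[i] for i in keep], [w2[i] for i in keep]


W1 = [[1.0, 1.0], [0.5, 2.0], [-1.0, 1.5]]
w2 = [2.0, -1.0, 3.0]
x = [1.0, 2.0]
mask = [1, 0, 1]

masked_out = forward(W1, w2, x, mask)
small_W1, small_w2 = remove_units(W1, w2, mask)
small_out = forward(small_W1, small_w2, x)
```

Both paths give the same output, but the second one runs on weight tensors with fewer rows and produces fewer activations, which is where the speed and memory savings come from.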

3SP assumptions. We approximate this change in loss with three assumptions: 1. We approximate the binary mask as a continuous variable; 2. We use a first-order Taylor expansion, approximated with just one minibatch of data; 3. (Given 2.) We approximate the change implied by changing a set of entries of the mask as being the sum of individual changes.

These assumptions are similar to the unstructured SNIP assumptions, but note that in the structured setting these approximations are stronger and require additional justification. Unstructured models have a mask variable with potentially millions of entries (the number of weights), while a structurally pruned VGG-19 has a mask with roughly 5,000 entries. This means that the interaction between any two entries will often be much larger, making assumptions 2 and 3 less plausible. In Figure 2, we assess how appropriate the first-order Taylor approximation is for predicting the change in loss when removing individual weights. We show on the left that the unstructured SNIP approximation is close to the actual change in the loss when a single weight is removed. (For the unstructured curve, we compute the actual change in loss for every five thousandth weight, rather than computing all of the millions of scores.) On the right, we show that in the structured case, there is significantly more noise, but the predicted change in loss correlates strongly with the actual change. We discuss this effect and a comparison to GraSP in more detail in §5.
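The comparison behind Figure 2 can be reproduced in miniature. The sketch below assumes a toy two-layer network with one mask entry per hidden unit and a squared-error loss; all names and numbers are hypothetical and chosen only to show how a first-order score is compared with the actual change in loss.

```python
def hidden_activations(W1, x):
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def loss(W1, w2, x, y, mask):
    """Squared-error loss of a two-layer MLP with one mask entry per hidden unit."""
    hidden = hidden_activations(W1, x)
    out = sum(m * w * h for m, w, h in zip(mask, w2, hidden))
    return 0.5 * (out - y) ** 2


def predicted_change(W1, w2, x, y):
    """First-order scores |dL/dm_i| at m = 1, one per hidden unit."""
    hidden = hidden_activations(W1, x)
    out = sum(w * h for w, h in zip(w2, hidden))
    return [abs((out - y) * w * h) for w, h in zip(w2, hidden)]


def actual_change(W1, w2, x, y):
    """|L(unit i removed) - L(full model)| for every hidden unit i."""
    n = len(w2)
    base = loss(W1, w2, x, y, [1] * n)
    changes = []
    for i in range(n):
        mask = [0 if j == i else 1 for j in range(n)]
        changes.append(abs(loss(W1, w2, x, y, mask) - base))
    return changes


W1 = [[1.0, 1.0], [0.5, 2.0], [-1.0, 1.5]]
w2 = [0.2, -0.1, 0.05]
x, y = [1.0, 2.0], 0.0
predicted = predicted_change(W1, w2, x, y)
actual = actual_change(W1, w2, x, y)
```

In this toy case the first-order score swaps the order of the two most important units, yet it still singles out the same least important unit as the exact computation. This mirrors the noisy but usable signal reported for the structured setting.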

Rescaling Initializations.

Pruning before training changes the variance of activations at initialization and in the structured case even changes the number of activations. Previous research into initialization schemes for neural networks (He et al., 2015) shows the importance of the variance and the number of activations for model training. By pruning a model, we reduce the number of output channels of most layers, which means the variance no longer has the right value relative to the number of output activations. We attempt to correct for this by studying the effects of rescaling weights by a ratio that amounts to recovering the variance scaling suggested by He et al. (2015). However, we note that because large and small weights are not pruned uniformly, the resulting variance of the model weights of the pruned model is not exactly what one would get from initializing the model from scratch. The authors of SNIP reinitialize their models after pruning, which may address the same issue.
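As a rough illustration of such a rescaling, the sketch below uses the standard deviation sqrt(2 / n) from He et al. (2015) and computes the factor that maps weights drawn for a layer with n units to the scale expected for a layer with fewer units. The function names are hypothetical, and whether n should count input or output connections in 3SP is an assumption left open here.

```python
import math


def he_std(num_units):
    """Standard deviation used by He et al. (2015) initialization: sqrt(2 / n)."""
    return math.sqrt(2.0 / num_units)


def rescale_factor(original_units, remaining_units):
    """Factor mapping std sqrt(2 / original) to std sqrt(2 / remaining)."""
    return math.sqrt(original_units / remaining_units)


# Hypothetical layer: 512 units before pruning, 128 units kept.
factor = rescale_factor(512, 128)
rescaled_std = he_std(512) * factor
```

Multiplying the surviving weights by this factor restores the nominal standard deviation for the smaller layer, although, as noted above, the surviving weights are not a uniform sample of the original ones.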

3.2 Compute-Aware 3SP (3SP + CA)

Different layers in a model have a different computational cost. For example, the computational cost in FLOPs of a convolution operation is proportional to $H \cdot W \cdot C_{out} \cdot C_{in} \cdot k^2$, where $H$ and $W$ are the height and width of the output, $C_{out}$ and $C_{in}$ are the number of out and in channels, and $k$ is the size of the kernel. In earlier layers of the model, the spatial dimensions tend to be high, while in later layers the number of input channels increases. Since our aim is to remove as much compute from the model as possible while preserving model accuracy, we extend 3SP to be compute-aware, by dividing the score by the normalized compute cost per channel:


We calculate a retention score, $R_{l,c}$, for layer $l$, channel $c$, corresponding to each mask entry, which measures the impact on the loss per unit of compute removed:


This retention score can be high either if a channel has a big effect on the loss, or if the channel has a negligible effect on the compute cost. A low retention score means that the channel either has little effect on the loss or is expensive to compute, so removing it costs little accuracy or offers a large speed-up. In practice, we would like to trade off the importance of compute and change in the loss, so we introduce a Laplace smoothing parameter $\alpha$ (Manning et al., 2009): a larger value of $\alpha$ makes the pruning depend more on the predicted change in loss and pay less attention to the compute costs of different layers.
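One way to write the compute-aware score that is consistent with the description above is sketched here. It is a reconstruction under assumptions, not necessarily the original notation or exact form: $s_{l,c}$ denotes the first-order score of channel $c$ in layer $l$, $\mathrm{cost}_l$ the compute cost of one channel in layer $l$, and the smoothing is assumed to be additive and applied before normalization, as the order of steps in Algorithm 1 suggests.

```latex
\widehat{\mathrm{cost}}_l
  = \frac{\mathrm{cost}_l + \alpha}{\sum_{l'} \left( \mathrm{cost}_{l'} + \alpha \right)},
\qquad
R_{l,c} = \frac{s_{l,c}}{\widehat{\mathrm{cost}}_l}
```

Under this form, letting $\alpha$ grow makes the normalized costs of all layers approach the same value, so $R_{l,c}$ becomes proportional to $s_{l,c}$ and the ranking falls back to the loss-only 3SP score, which matches the stated effect of a larger smoothing value.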

Figure 3: The relation between pruning parameters and the effect on compute on CIFAR-10 with VGG-19. Uniform is a naive baseline which prunes all channels with uniform probability, and therefore ignores both model loss and compute cost. 3SP removes channels least important to the loss, which happen to also use less compute on average. 3SP + CA, without compute smoothing, actively prunes FLOPs while taking model performance into account (see Figure

Figure 4: The performance of 3SP and 3SP + CA (compute-aware) against a baseline of uniformly removing channels from the model on CIFAR-10 with VGG-19. At every amount of compute pruned, 3SP outperforms uniform, with the difference increasing as more FLOPs are pruned. After removing 75% of FLOPs, 3SP becomes unable to prune more without pruning entire layers. Our compute-aware extension, however, is able to remove even more compute without failing.

4 Experiments

In this section, we show that models pruned with 3SP are substantially faster than unpruned models and models pruned using baseline PBT methods without sacrificing much in accuracy. We show that a compute-aware version (3SP + CA) can be even faster, but with a more significant reduction in performance. Finally, we demonstrate how this capability might be beneficial in a practical setting where training speed is a bottleneck: we show that a 3SP model can achieve higher accuracy than a full model with a time-budget in active learning.

Baselines. We compare to previously published unstructured PBT methods SNIP (Lee et al., 2019) and GraSP (Wang et al., 2020), though we note that these methods do not provide a speed-up. We do not compare to methods that prune after or during training, because these incur a significant upfront computational cost, while this paper is focussed on constraining the training cost. 3SP and its compute-aware extension are the first methods to structurally prune a model before training, showing that this approach is feasible and a ground for further research.

We also compare to a naive structured pruning method which uniformly prunes over all mask entries and layers. This “Uniform” baseline samples each entry of the binary mask tensors as 0 or 1 from a Bernoulli distribution so as to prune a fraction $p$ of the weights. In expectation, the model width is pruned by a uniform ratio in each layer. This is similar to using a narrower model. Liu et al. (2018) showed that these sorts of untargeted pruning methods are effective baselines that are able to obtain strong performance.
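A baseline of this kind takes only a few lines. The sketch below is a hypothetical illustration (the function name, layer widths, and seed handling are assumptions): every mask entry is dropped independently with the same probability, so each layer is narrowed by the same ratio in expectation.

```python
import random


def uniform_masks(layer_widths, prune_ratio, seed=0):
    """Sample one binary mask per layer; each entry is pruned with prob. prune_ratio."""
    rng = random.Random(seed)
    return [
        [0 if rng.random() < prune_ratio else 1 for _ in range(width)]
        for width in layer_widths
    ]


# Hypothetical layer widths, not the architecture from Appendix A.
masks = uniform_masks([64, 128, 10000], prune_ratio=0.8)
kept_fraction = sum(masks[2]) / len(masks[2])
```

Because the entries are sampled independently, the realized fraction kept in a narrow layer can deviate noticeably from the target, while wide layers stay close to it.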

Architecture and random seeds. Our experiments are executed using VGG-19; we describe the exact architecture in Appendix A. We run all our experiments 5 times using different random seeds, reporting the mean and standard error of each experiment in tables (standard deviation in figures, as s.e. was not visible).
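For reference, the mean and standard error over a handful of seeds can be computed as follows. This is a generic sketch using the sample standard deviation divided by the square root of the number of runs; the accuracy values are made up for illustration and are not results from the experiments.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error), using the sample (n - 1) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean, std / math.sqrt(len(values))


# Hypothetical accuracies from five seeds (illustrative numbers only).
accuracies = [93.3, 93.4, 93.5, 93.4, 93.4]
mean_acc, se_acc = mean_and_standard_error(accuracies)
```

With five runs the standard error is the standard deviation divided by roughly 2.24, which is why error bars based on it can be too small to see in a figure.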

Measuring compute. We report model compute cost in Floating Point operations (FLOPs). The exact number of FLOPs used to perform a calculation depends on the hardware, so we make the common assumption that roughly two FLOPs are required for each multiply-accumulate operation.
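The FLOP accounting described here can be sketched as a small helper. The functions below are illustrative assumptions (hypothetical names, biases ignored) that count multiply-accumulates for a convolution from its output size, channel counts, and kernel size, then convert them with the two-FLOPs-per-multiply-accumulate convention. The layer shapes are examples and not the architecture from Appendix A.

```python
def conv_macs(out_height, out_width, out_channels, in_channels, kernel_size):
    """Multiply-accumulate operations of one convolution layer (bias ignored)."""
    return out_height * out_width * out_channels * in_channels * kernel_size ** 2


def conv_flops(out_height, out_width, out_channels, in_channels, kernel_size,
               flops_per_mac=2):
    """FLOPs of one convolution layer, assuming two FLOPs per multiply-accumulate."""
    macs = conv_macs(out_height, out_width, out_channels, in_channels, kernel_size)
    return flops_per_mac * macs


def flops_per_output_channel(out_height, out_width, in_channels, kernel_size,
                             flops_per_mac=2):
    """FLOPs saved in this layer by removing a single output channel."""
    return conv_flops(out_height, out_width, 1, in_channels, kernel_size,
                      flops_per_mac)


early_layer = conv_flops(32, 32, 64, 3, 3)     # large spatial size, few channels
late_layer = conv_flops(2, 2, 512, 512, 3)     # small spatial size, many channels
early_channel = flops_per_output_channel(32, 32, 3, 3)
late_channel = flops_per_output_channel(2, 2, 512, 3)
```

Note that removing an output channel also shrinks the input of the following layer, a saving that this per-layer helper does not count.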

Compute-cost vs. Accuracy Trade-off. In Figure 3 we show that our approach is able to reduce the FLOPs required for a forward pass in the model. 3SP tries to preserve model performance, so it removes compute less aggressively than the Uniform baseline. However, 3SP + CA (compute-aware) is able to remove compute very aggressively. In Figure 4 we show the trade-off between the compute cost of the model and accuracy for 3SP, 3SP + CA and Uniform pruning on CIFAR-10. We show that there is a significant gap between 3SP and Uniform, indicating that we are able to successfully prune our model. When trying to prune very large amounts of compute, having ignored compute costs during pruning, 3SP is forced to remove entire layers. Our compute-aware extension, however, is able to continue pruning to very high levels of compute sparsity.

Accuracy ± s.e. (%) by fraction of parameters pruned:

Group | Method | CIFAR-10 80% | CIFAR-10 90% | CIFAR-10 95% | CIFAR-100 80% | CIFAR-100 90% | CIFAR-100 95%
Unstructured | Uniform | 92.6 ± 0.04 | 91.4 ± 0.04 | 89.8 ± 0.04 | 70.3 ± 0.16 | 68.1 ± 0.11 | 64.7 ± 0.27
Unstructured | SNIP | 93.6 ± 0.12 | 93.6 ± 0.05 | 93.4 ± 0.04 | 72.8 ± 0.10 | 72.4 ± 0.08 | 70.7 ± 0.11
Unstructured | GraSP | 93.2 ± 0.09 | 93.0 ± 0.03 | 92.8 ± 0.08 | 71.2 ± 0.08 | 70.6 ± 0.15 | 69.5 ± 0.07
Structured | Uniform | 92.0 ± 0.08 | 90.4 ± 0.12 | 89.0 ± 0.15 | 67.5 ± 0.16 | 63.8 ± 0.13 | 60.1 ± 0.29
Structured | 3SP | 93.4 ± 0.03 | 93.1 ± 0.04 | 92.5 ± 0.12 | 69.9 ± 0.14 | 68.3 ± 0.12 | 63.2 ± 0.52
Structured | 3SP + re-init | 93.4 ± 0.04 | 93.0 ± 0.02 | 92.6 ± 0.09 | 70.3 ± 0.16 | 69.0 ± 0.08 | 64.2 ± 0.35
Structured | 3SP + re-scale | 93.3 ± 0.03 | 93.0 ± 0.06 | 92.5 ± 0.06 | 70.5 ± 0.13 | 69.2 ± 0.11 | 63.5 ± 0.63
Table 1: Accuracy of 3SP (grey), SNIP and GraSP for VGG19 on CIFAR-10 and CIFAR-100. We evaluate on the basis of pruned parameters because prior PBT methods are unstructured and do not reduce compute cost. However, the speed-based evaluation of Figure 4 is a much better view of our method’s performance. Even on the basis of parameter sparsity, 3SP performs comparably with unstructured methods on CIFAR-10 though a gap forms on the more complex CIFAR-100. Rescaling/reinitializing has little effect on the smaller dataset, but appears to matter for CIFAR-100. The original (unpruned) model obtains 93.6% accuracy on CIFAR-10 and 72.5% on CIFAR-100.

Parameter vs. Accuracy Trade-off. In Table 1 we consider accuracy when pruning different proportions of parameters using VGG-19 (Simonyan and Zisserman, 2014) on CIFAR-10 and CIFAR-100. As we have discussed, removing parameters is not as important as reducing compute cost, which is why the best evaluation of our method is that provided in Figure 4. But the only prior PBT methods are unstructured and lead to no compute improvement at all. Therefore, in order to compare to prior methods, we show in Table 1 that even when looking at parameter-sparsity, 3SP performs nearly as well as unstructured methods for CIFAR-10. At 80% parameter sparsity 3SP has similar accuracy, and even at 95% sparsity 3SP is less than a percentage point less accurate than the unstructured SNIP and GraSP methods. For CIFAR-100, the structured constraint has a bigger effect on performance, though 3SP still performs significantly better than Uniform pruning. This can be explained by the fact that CIFAR-100 is a more difficult dataset, and VGG-19 is therefore less overparameterized. Re-initializing the weights has a very small effect on the accuracy on CIFAR-10 and helps with CIFAR-100. This suggests that these pruning before training methods predominantly identify an important model architecture rather than individual weights.

Wall Clock Time and Model Size. In Table 2, we show the tangible benefits of using structured pruning. The 50% reduction model is pruned without the compute-aware extension, while the 80% model is pruned with the compute-aware extension and smoothing set to

. On a GPU, reducing the FLOPs by 80% or 50% leads to a 4x or 2x speed-up in training time per epoch. On a CPU the benefit is even greater. Because structured pruning results in a much smaller model, and induces a smaller activation map, the forward pass can be kept almost entirely in CPU cache, leading to a 5x or 3x speedup with the same levels of pruning instead. The smaller memory footprint of our method would allow pruned models to be used in settings that unstructured pruning might not help with. Adopting our method can therefore help researchers and organizations reduce run-time and power consumption of training, evaluating, and deploying their models, leading to a reduced carbon footprint. Full details are provided in Appendix


Figure 5: Active Learning CIFAR-10. 3SP model at 50% and 3SP-CA at 80% FLOPs reduction have better accuracy than a full model within a time-budget because they train, and therefore acquire data, more quickly.

4.1 Active Learning with 3SP Pruning

In some settings, we have a large pool of unlabelled data for which we can request labels, but the labelling process is costly (for example, requiring highly-trained experts). Active learning (Settles, 2010) attempts to pick the most informative datapoints out of the pool to reduce the labelling cost. Unfortunately, most active learning approaches require retraining the model on the labelled set before acquiring new points. This can mean that the model must be trained tens or hundreds of times during the acquisition process. We show that a 3SP model can reach a higher accuracy within a time-budget than an unpruned model (Figure 5). Even though pruning hurts model performance, the fact that we prune before training in a structured way lets us train much more quickly. This lets the 3SP model acquire more data within the same time-budget, which allows higher accuracy. Full experimental details are provided in Appendix C.

Metric | Original | 50% FLOPs Reduction | 80% FLOPs Reduction
Training Time per Epoch (GPU) | 15 s | 8 s (~2x) | 4 s (~4x)
Prediction Time (CPU) | 12.8 ms | 4.2 ms (~3x) | 2.6 ms (~5x)
Model Size | 87 MB | 12 MB (~7x) | 3.3 MB (~26x)
Pruning Time | - | 41 ms | 41 ms
Accuracy (CIFAR-10) | 93.6 ± 0.04% | 93.1 ± 0.09% | 90.9 ± 0.09%
Table 2: Effect of the computational cost reduction on training/inference time. Reducing FLOPs directly reduces training time and prediction time, and indirectly reduces the model size. These results are obtained using the VGG-19 model, the CIFAR-10 dataset, and a GTX 1080 Ti GPU. The prediction time is measured using a batch with a single element on an i7 8700K CPU. The pruning time is equal to one step in the original model.
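The approximate factors quoted in Table 2 follow directly from the raw measurements. The short check below recomputes them from the table's own numbers; the dictionary layout and function name are illustrative choices.

```python
# Measurements copied from Table 2: (original, 50% FLOPs reduction, 80% FLOPs reduction).
table2 = {
    "training_time_s": (15.0, 8.0, 4.0),
    "prediction_time_ms": (12.8, 4.2, 2.6),
    "model_size_mb": (87.0, 12.0, 3.3),
}


def reduction_factors(original, *pruned):
    """Ratio of the original measurement to each pruned measurement."""
    return [original / value for value in pruned]


factors = {name: reduction_factors(*values) for name, values in table2.items()}
rounded = {name: [round(f) for f in values] for name, values in factors.items()}
```

Rounding each ratio to the nearest integer reproduces the factors shown in parentheses in the table.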

5 Alternative Pruning Criterion - GraSP

Wang et al. (2020) propose to keep network parameters that contribute the most to loss reduction during optimization. Instead of using the change in loss as a pruning score, like SNIP, they use the predicted change in the magnitude of the gradient with respect to the weights, approximated with a first-order Taylor expansion. In order to avoid finding a mask that satisfies their objective trivially by creating a very large loss, they strongly smooth outputs so that removing weights cannot greatly change the loss. Although Wang et al. (2020) motivate their method by the fact that their score measures the interaction between weights, we observe that it is still a first-order method and does not assess the interaction between the decision to prune multiple weights.

The naive form of GraSP cannot be directly applied to structured models because GraSP computes gradients with respect to weights directly, not mask variables. But we consider a structured method inspired by GraSP which instead computes gradients with respect to mask tensor entries. We found in our experiments that this objective performed substantially worse than one based on SNIP. On CIFAR-10, structured GraSP achieves accuracies of 91.96%, 91.6%, and 90.92% at 80%, 90%, and 95% prune ratios respectively, below 3SP at every ratio and no better than structured Uniform pruning at 80% (compare to Table 1).

We believe that this is because estimates of the GraSP-style objective are too noisy in the structured setting, and so assumption 2 considered in §3 is violated. Similar to the experiment in Figure 2, we compare the predictions implied by the gradient to the actual effect on the objective. Figure 6 shows that in the unstructured case, there is a correlation between the approximated change in the objective and the actual change. However, the structured prediction is uncorrelated with the actual change.

(a) Unstructured
(b) Structured
Figure 6: Evaluating approximation accuracy of GraSP objective for structured and unstructured pruning of VGG-19. We compare the predicted change in the gradient-norm given by the Taylor expansion (scores) with the actual change in the gradient-norm (normalized). The unstructured predictions have signal, but in the structured setting they are too noisy.

6 Related Work

Large neural networks are overparameterized, resulting in extensive efforts to reduce model size through distillation (Hinton et al., 2015), pruning (Reed, 1993) or quantization (Gong et al., 2014; Rastegari et al., 2016). All of these methods aim to make networks cheaper to evaluate by reducing compute cost, memory usage, or enabling hardware-efficient implementations. However, none of these works focusses on reducing computation cost both during and after training. SNIP (Lee et al., 2019) and GraSP (Wang et al., 2020) are the first to look at reducing the number of parameters before training but are unable to substantially reduce compute cost due to being unstructured methods. We discuss these methods extensively in §2.

Wang et al. (2019b) similarly propose a channel-wise mask as in 3SP, and directly optimize it using a sparsity penalty similar to Liu et al. (2017). The proposed algorithm requires 10 epochs of training with a computational cost similar to the original model in order to do pruning on CIFAR-10, whereas we compute pruning scores with just one batch. We consider this a prohibitive additional cost and do not explicitly compare.

Within the pruning after training literature, there are a number of methods that explicitly consider computational cost and others that apply structured pruning. Gordon et al. (2018) explicitly regularize FLOPs while training with a sparsity penalty, while Veniat and Denoyer (2018) and Theis et al. (2018) adopt a similar method motivated as approximate constrained optimization. He et al. (2019) frame model crafting as a reinforcement learning problem where the reward is based on compute usage. Many methods for pruning after training impose a structured pruning mask, which results in compute savings (Li et al., 2016; He et al., 2017; Liu et al., 2017; Luo et al., 2017). In particular, Wang et al. (2019a) obtain very strong results but need 160 epochs of training a full-sized network and another 160 epochs of fine-tuning.

Neural Architecture Search (NAS) is related to our work insofar as it optimizes a network architecture before training. However, our work requires only one forwards-backwards pass on a single batch to prune, rather than requiring extensive retraining. In general, NAS methods have been accused of consuming enormous amounts of energy (Strubell et al., 2019), counter to the goal of this paper.

7 Conclusion

We introduced 3SP, a single shot structured pruning technique that is applied before training. Our results show that it is possible to speed up training by 2x and inference by 3x while losing only 0.5% accuracy. Aside from the reduced time and energy needed to obtain a converged model, this is important in settings where training speed is a bottleneck on performance, such as active learning. We have further shown that the approximations in SNIP and our structured extension are reasonable and that making 3SP compute-aware leads to even greater speed ups, at the expense of accuracy. In general, our work has shown for the first time that structured pruning before training is feasible and can be an exciting way to reduce wasteful training cost and practitioner architecture selection time.


We thank Aidan Gomez, Bas Veeling and Yee Whye Teh for helpful discussions and feedback. We also thank others in OATML and OxCSML for feedback at several stages in the project. JvA/MA are grateful for funding by the EPSRC (grant reference EP/N509711/1 and EP/R512333/1, respectively). JvA is also grateful for funding by Google-DeepMind, while MA is supported by Arm Inc. SF gratefully acknowledges the Engineering and Physical Sciences Research Council for their grant administered by the Centre for Doctoral Training in Cyber Security at the University of Oxford.

Broader Impact

Our work’s key impact is the development of new compute-aware pruning techniques. We think this is broadly beneficial because of the reduction in energy consumption and carbon dioxide emissions it makes possible. We can separate the impacts into immediate beneficial applications, medium-term extensions, and possible unintended consequences.

Immediate benefits.

Modern machine learning methods consume huge amounts of energy. Strubell et al. (2019) have noted that this is not just a problem at production scale: large amounts of energy are spent on training models and especially on architecture search. We provide a mechanism for creating compute-efficient models before training, which reduces energy consumption in training models. Moreover, unlike neural architecture search methods, we require just one batch of backpropagation on the target dataset before training, rather than huge numbers of model evaluations. The immediate impact of our paper is therefore to help researchers and labs reduce their energy consumption and carbon dioxide emissions.

In addition, because of how expensive neural architecture search is, it is mostly only available to large companies. Our method will be especially useful in resource-poor contexts where teams do not have large datacenters and need to be as efficient as possible, for example in some less economically developed countries.

Medium-term extensions.

Beyond our direct method, we introduce the framing of optimizing pruning in compute-space. We hope very much that future work can adopt this mindset and create further reductions in energy consumption. It may also be useful for some groups to consider pruning in energy-space; however, this is too hardware-dependent to be appropriate for a paper of our generality.

Unintended Consequences

We do not believe there are adverse distributional consequences of our work. In general, it should make it more possible for less economically developed actors to achieve results closer to those that only the largest companies can currently achieve.

However, it is possible that improving the computational efficiency of deep learning will increase overall demand for deep learning, thereby increasing overall energy consumption (an instance of the Jevons Paradox (Jevons, 1865)). We hope that in such a situation, the benefits to humanity from the increased use of deep learning are worth the greater overall cost.


  • Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian Active Learning with Image Data. Proceedings of The 34th International Conference on Machine Learning, 2017.
  • Gong et al. [2014] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing Deep Convolutional Networks using Vector Quantization. Computer Vision and Pattern Recognition, December 2014.
  • Gordon et al. [2018] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1586–1595, Salt Lake City, UT, June 2018. IEEE.
  • Hassibi and Stork [1993] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pages 164–171, 1993.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. CVPR, 7(3):171–180, 2016.
  • He et al. [2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel Pruning for Accelerating Very Deep Neural Networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1398–1406, Venice, October 2017. IEEE.
  • He et al. [2019] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. Computer Vision and Pattern Recognition, January 2019.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Jevons [1865] William Jevons. The Coal Question. Macmillan and Co., 1865.
  • LeCun et al. [1990] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
  • Lee et al. [2019] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
  • Li et al. [2016] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for Efficient ConvNets. International Conference on Learning Representations, November 2016.
  • Liu et al. [2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
  • Liu et al. [2018] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
  • Luo et al. [2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
  • Manning et al. [2009] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2009.
  • Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
  • Reed [1993] Russell Reed. Pruning Algorithms: A Survey. In IEEE Transactions on Neural Networks, volume 4, pages 740–747, 1993.
  • Settles [2010] Burr Settles. Active Learning Literature Survey. Machine Learning, 15(2):201–221, 2010. ISSN 00483931. doi:
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Strubell et al. [2019] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019.
  • Sze et al. [2017] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.
  • Theis et al. [2018] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and Fisher pruning. arXiv:1801.05787 [cs, stat], July 2018.
  • Veniat and Denoyer [2018] Tom Veniat and Ludovic Denoyer. Learning Time/Memory-Efficient Deep Architectures with Budgeted Super Networks. Computer Vision and Pattern Recognition, May 2018.
  • Wang et al. [2019a] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. Eigendamage: Structured pruning in the kronecker-factored eigenbasis. In International Conference on Machine Learning, pages 6566–6575, 2019a.
  • Wang et al. [2020] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.
  • Wang et al. [2019b] Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from scratch. arXiv preprint arXiv:1909.12579, 2019b.

Appendix A Experimental setup

For VGG-19 we use an architecture of five blocks, with 2x2 max-pooling layers in between the blocks. The first two blocks each consist of two conv-BN-ReLU operations, with 64 and 128 channels, respectively. The last three blocks each consist of three conv-BN-ReLU operations and are of width 256, 512 and 512. After these blocks we use average pooling to reduce the spatial dimensions to 2 by 2. This is followed by three linear layers with 1024, 512, and number-of-classes output units, respectively. We use ReLU activation functions throughout the model. During structured pruning, we consider all elements of the model prunable except the output of the final linear layer and the input of the first convolution layer. We train our model for a fixed 160 epochs, with an initial learning rate of 0.1, which is halved at epoch 80 and again at epoch 120. We preprocess CIFAR-10 with mean and standard deviation normalization, random horizontal flips, and random 4-pixel padding followed by a 32x32 crop. We report results on the CIFAR-10 test set. This is the same experimental setup as used in Wang et al. [2020]; we did not consider other hyperparameters. We use the default initialization method of PyTorch 1.4 for the weights, which is based on He et al. [2015]. All results are based on 5 training/evaluation runs with different random seeds.
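The learning rate schedule described above can be written down as a small function. This is a minimal sketch, not the authors' training code: the names `lr_at_epoch`, `BASE_LR`, `MILESTONES` and `TOTAL_EPOCHS` are hypothetical, and it assumes 0-indexed epochs with each halving taking effect at the start of the milestone epoch.

```python
# Illustrative sketch of the step schedule (all names are hypothetical).
BASE_LR = 0.1            # initial learning rate stated in the text
MILESTONES = (80, 120)   # epochs at which the rate is halved
TOTAL_EPOCHS = 160       # fixed training length stated in the text


def lr_at_epoch(epoch: int) -> float:
    """Return the learning rate in effect during `epoch` (0-indexed, assumed)."""
    # Count how many milestones have already been reached.
    halvings = sum(1 for milestone in MILESTONES if epoch >= milestone)
    return BASE_LR * 0.5 ** halvings


# Full per-epoch schedule for one training run.
schedule = [lr_at_epoch(epoch) for epoch in range(TOTAL_EPOCHS)]
```

In PyTorch, a schedule of this shape would typically be expressed with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[80, 120]` and `gamma=0.5`; whether the original experiments used that scheduler is not stated in the text.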

For all full-model measurements of computational cost in this paper, we utilize the pytorch-OpCounter package, which takes into account all operations of the model, excluding preprocessing.

Appendix B Wall Clock Time and Model Size Experiments

We performed the experiments in Table 2 using the VGG-19 described in Appendix A. The training time per epoch was measured as the average of 5 epochs, excluding the first epoch. The prediction time is the average prediction time per data point after going through an epoch of data one by one. The model size is determined by saving the parameters to disk. Pruning time is determined by doing a forward pass with a single batch of data in the original model. The time necessary for adjusting the architecture is excluded but would add some overhead (under 10 ms) in practice. The 50% FLOPs reduction model is 3SP without compute awareness. The 80% FLOPs reduction model is 3SP with compute awareness.
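The two timing measurements above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the measurement code used for Table 2: the helper names `mean_epoch_time` and `mean_prediction_time` are hypothetical, `predict` stands in for any callable that runs the model on one data point, and the first epoch is assumed to be discarded as warm-up before averaging the rest.

```python
import time
from statistics import mean


def mean_epoch_time(epoch_times, skip_first=1):
    """Average epoch duration, ignoring the first `skip_first` warm-up epochs."""
    kept = list(epoch_times)[skip_first:]
    if not kept:
        raise ValueError("need at least one epoch after the skipped ones")
    return mean(kept)


def mean_prediction_time(predict, data_points):
    """Average wall-clock seconds per data point, predicting one point at a time."""
    data_points = list(data_points)
    if not data_points:
        raise ValueError("need at least one data point")
    start = time.perf_counter()
    for x in data_points:
        predict(x)  # hypothetical single-example inference call
    return (time.perf_counter() - start) / len(data_points)
```

When timing on a GPU, kernels run asynchronously, so a synchronization call (for example `torch.cuda.synchronize()` in PyTorch) is usually needed before reading the clock; the sketch above omits this because it is framework-specific.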

Figure 7: Active Learning CIFAR-10. 3SP model at 50% and 3SP-CA at 80% FLOPs reduction get to acquire more data within a time-budget because they train faster.

Appendix C Active Learning Experiments

In active learning [Settles, 2010], we want to be efficient about manually labeling data and only acquire labels for the most informative data points. It is also referred to as a ‘human in the loop’ approach to data labeling. One begins with a small training dataset, and obtains labels only for the most informative datapoints in the unlabeled set. In our case, we start with 100 examples per class, leading to a starting dataset of size 1000 for CIFAR-10. We train the model on the current labeled dataset and subsequently use it to estimate which datapoints in the original CIFAR-10 training set (the ‘unlabeled set’) would be most informative to acquire a label for. We use a common proxy for informativeness: the entropy of the softmax distribution $\mathbb{H}[p(y \mid x)]$, where $p(y \mid x)$ represents the output distribution for a particular data point $x$ [Gal et al., 2017]. We select 50 datapoints from the unlabeled pool in each acquisition step, add these (including labels) to the training set, and restart training (using the final state of the model before acquiring more data). We repeat this process until the time runs out. We obtain data from the CIFAR-10 training set and report results on the CIFAR-10 test set. We prune the models once at the beginning of the active learning process. We start the timer after we have acquired the 1000 initial points and are done with pruning. The experiments were done using the VGG-19 architecture and training settings described in Appendix A. For 3SP - 50%, we pruned 50% of the FLOPs of the model, without compute awareness. For 3SP + CA - 80%, we pruned 80% of the FLOPs, with compute awareness. We repeat the experiments 5 times, and show error bars of one standard deviation.
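The entropy-based acquisition step can be illustrated with a short, dependency-free sketch. It is not the authors' implementation: the function names `softmax`, `entropy` and `select_most_uncertain` are hypothetical, and the model's outputs for the unlabeled pool are assumed to be available as plain lists of logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_logits, k=50):
    """Indices of the k pool points with the highest predictive entropy."""
    scores = [entropy(softmax(logits)) for logits in pool_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Each acquisition step would call `select_most_uncertain` on the current model's outputs for the unlabeled pool with `k=50`, move the selected points and their labels into the training set, and continue training from the current model state.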

Throughout active learning, the pruned models train more quickly. This means that they are able to select points for inclusion more quickly (see Figure 7). 3SP + CA sees almost 50% more data than the un-pruned model within the time-budget.