1 Introduction
Current neural networks are growing ever deeper, with many fully connected layers. Since every fully connected layer contains large weight matrices, models often comprise millions of parameters. This is commonly seen as over-parameterization (Dauphin and Bengio, 2013; Denil et al., 2013). Different techniques have been proposed to decide which weights can be pruned. Structured pruning techniques (Voita et al., 2019) remove whole neurons or even complete layers from the network. Unstructured pruning only removes individual connections between neurons of succeeding layers, keeping the global network architecture intact. The first technique directly results in smaller model sizes and faster inference, while the second offers more flexibility in the selection of which parameters to prune. Although the reduction in required storage space can be realized using sparse matrix representations (Stanimirović and Tasic, 2009), most popular frameworks currently lack sufficient support for sparse operations; however, there is active development toward possible solutions (Liu et al., 2015; Han et al., 2016; Elsen et al., 2019). This paper compares and improves several unstructured pruning techniques. The main contributions of this paper are to:
- demonstrate significant improvements for high sparsity levels over magnitude pruning by using it in combination with the lottery ticket hypothesis.
- confirm that the signs of the initial parameters are more important than the specific values to which they are reset, even for large networks like the transformer.
- show that magnitude pruning cannot be used to find winning lottery tickets, i.e., the final mask reached using magnitude pruning is no indicator of which initial weights are most important.
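To make the distinction between structured and unstructured pruning concrete, the following is a minimal NumPy sketch with random stand-in weights; the matrix size and the 50% pruning ratio are arbitrary illustrations, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # one fully connected layer: 4 neurons, 6 inputs

# Unstructured pruning: zero out individual low-magnitude connections,
# keeping the matrix shape (and thus the architecture) intact.
threshold = np.quantile(np.abs(W), 0.5)            # prune 50% of the weights
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: remove whole neurons, i.e. entire rows,
# which directly shrinks the matrix.
row_norms = np.linalg.norm(W, axis=1)
keep_rows = np.argsort(row_norms)[2:]              # drop the 2 weakest neurons
W_structured = W[np.sort(keep_rows)]

print(W_unstructured.shape)  # (4, 6) -- same shape, half the entries zero
print(W_structured.shape)    # (2, 6) -- smaller matrix
```

The unstructured variant only pays off in storage once the zeros are exploited by a sparse matrix format, which is exactly the support gap mentioned above.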
2 Related Work
Han et al. (2015) propose the idea of pruning weights with a low magnitude to remove connections that have little impact on the trained model. Narang et al. (2017) incorporate the pruning into the main training phase by slowly pruning parameters during the training, instead of performing one big pruning step at the end. Zhu and Gupta (2018) provide an implementation for magnitude pruning in networks designed using the tensor2tensor software (Vaswani et al., 2018).
Frankle and Carbin (2018) propose the lottery ticket hypothesis, which states that dense networks contain sparse sub-networks that can be trained to perform as well as the original dense model. They find such sparse sub-networks in small architectures and simple image recognition tasks and show that these sub-networks might train faster and even outperform the original network. For larger models, Frankle et al. (2019) propose to search for the sparse sub-network not directly after the initialization phase, but after only a few training iterations. Using this adapted setup, they are able to successfully prune networks having up to 20M parameters. They also relax the requirement for lottery tickets so that they only have to beat randomly initialized models with the same sparsity level.
Zhou et al. (2019) show that the signs of the weights in the initial model are more important than their specific values. Once the least important weights are pruned, they set all remaining parameters to fixed values, while keeping their original sign intact. They show that as long as the original sign remains the same, the sparse model can still train more successfully than one with a random sign assignment. Frankle et al. (2020) reach contradicting results for larger architectures, showing that random initialization with original signs hurts the performance.
Gale et al. (2019) compare different pruning techniques on challenging image recognition and machine translation tasks and show that magnitude pruning achieves the best sparsity-accuracy trade-off while being easy to implement.
In concurrent work, Yu et al. (2020) test the stabilized lottery ticket on the transformer architecture and the WMT 2014 English→German task, as well as on other architectures and fields.
This paper extends the related works by demonstrating and comparing the applicability of different pruning techniques on a deep architecture for two translation tasks, as well as proposing a new combination of pruning techniques for improved performance.
3 Pruning Techniques
In this section, we give a brief formal definition of each pruning technique. For a more detailed description, refer to the respective original papers.
In the given formulas, a network is assumed to be specified by its parameters θ. When training the network for T iterations, θ_t for 0 ≤ t ≤ T represents the parameters at timestep t.
Magnitude Pruning (MP)
relies on the magnitude of parameters to decide which weights can be pruned from the network. Different criteria for selecting which parameters to prune have been proposed (Collins and Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu and Gupta, 2018). In this work, we rely on the implementation from Zhu and Gupta (2018), where the parameters of each layer are sorted by magnitude, and during training, an increasing percentage of the weights is pruned. It is important to highlight that MP is the only pruning technique not requiring multiple training runs.
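The per-layer selection step can be sketched as follows (a single pruning step at a fixed sparsity, applied to random stand-in weights; the gradual schedule used during training is omitted, and `magnitude_prune` is a hypothetical helper name):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights,
    independently for each layer (per-layer sorting as in Zhu and Gupta, 2018)."""
    pruned = {}
    for name, w in weights.items():
        k = int(round(sparsity * w.size))
        if k == 0:
            pruned[name] = w.copy()
            continue
        # magnitude of the k-th smallest weight in this layer
        threshold = np.sort(np.abs(w), axis=None)[k - 1]
        pruned[name] = np.where(np.abs(w) > threshold, w, 0.0)
    return pruned

rng = np.random.default_rng(1)
layers = {"ff1": rng.normal(size=(8, 8)), "ff2": rng.normal(size=(8, 4))}
sparse = magnitude_prune(layers, 0.75)
for name, w in sparse.items():
    print(name, (w == 0).mean())  # each layer ends up 75% zeros
```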
Lottery Ticket (LT)
pruning assumes that for a given mask m, the initial network θ_0 ⊙ m already contains a sparse sub-network that can be trained to the same accuracy as θ_0. To determine m, the parameters of each layer in the converged model θ_T are sorted by magnitude, and m is chosen to mask the smallest ones such that the target sparsity is reached. We highlight that even though m is determined using θ_T, it is then applied to θ_0 before the sparse network is trained. To reach high sparsity without a big loss in accuracy, Frankle and Carbin (2018) recommend pruning iteratively, by training and resetting multiple times.
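The key point, that the mask is computed from the converged model θ_T but applied to the initialization θ_0, can be sketched as follows (random stand-in tensors replace actual training; `lottery_ticket_mask` is a hypothetical helper name):

```python
import numpy as np

def lottery_ticket_mask(theta_T, sparsity):
    """Per-layer binary mask keeping the largest-magnitude weights of the
    converged model theta_T (one-shot sketch; the paper prunes iteratively)."""
    masks = {}
    for name, w in theta_T.items():
        k = int(round(sparsity * w.size))
        threshold = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -1.0
        masks[name] = (np.abs(w) > threshold).astype(w.dtype)
    return masks

rng = np.random.default_rng(2)
theta_0 = {"layer": rng.normal(size=(16, 16))}   # initialization
theta_T = {"layer": rng.normal(size=(16, 16))}   # stand-in for the converged model

m = lottery_ticket_mask(theta_T, sparsity=0.9)
# m is determined from theta_T but applied to theta_0 before retraining.
sparse_init = {name: theta_0[name] * m[name] for name in theta_0}
print((sparse_init["layer"] == 0).mean())  # close to 0.9
```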
Stabilized Lottery Ticket (SLT)
pruning is an adaptation of LT pruning for larger models. Frankle et al. (2019) propose to apply the computed mask m not to the initial model θ_0, but to an intermediate checkpoint θ_t, where t is chosen to be early during the training. They recommend choosing t within the first 0.1-7% of training and refer to this procedure as iterative magnitude pruning with rewinding. We highlight that Frankle et al. (2019) always choose θ_t from the first, dense model, while this work chooses θ_t from the last pruning iteration.
Constant Lottery Ticket (CLT)
pruning assumes that the specific random initialization is not important. Instead, only the corresponding choice of signs affects successful training. To show this, Zhou et al. (2019) propose to compute m as in SLT pruning, but then to train c · sign(θ_t) ⊙ m as the sparse model. Here, c sets all remaining parameters in each layer to ±c, i.e., all parameters in each layer have the same absolute value, but their original sign. In all of our experiments, c is chosen to be √(6/(n_in + n_out)), where n_in and n_out are the respective numbers of incoming and outgoing connections of the layer.
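A sketch of this re-initialization for 2-D weight matrices (the checkpoint and mask are random stand-ins, and `clt_reinit` is a hypothetical helper name; the Glorot-style constant follows the choice of c described above):

```python
import numpy as np

def clt_reinit(theta_t, mask):
    """Replace every surviving weight by +/- c, keeping its original sign.
    Per layer, c = sqrt(6 / (n_in + n_out)), as in the setup above."""
    out = {}
    for name, w in theta_t.items():
        n_in, n_out = w.shape
        c = np.sqrt(6.0 / (n_in + n_out))
        out[name] = np.sign(w) * c * mask[name]
    return out

rng = np.random.default_rng(4)
theta_t = {"layer": rng.normal(size=(6, 9))}                  # early checkpoint
mask = {"layer": (rng.random((6, 9)) > 0.5).astype(float)}    # stand-in mask

theta_clt = clt_reinit(theta_t, mask)
# The only magnitudes left are 0 (pruned) and the per-layer constant c.
print(np.unique(np.abs(theta_clt["layer"])))
```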
SLT-MP
is a new pruning technique, proposed in this work. It combines SLT pruning and MP in the following way: First, SLT pruning is used to find a mask m with intermediate sparsity s_1. This may be done iteratively. θ_t ⊙ m with sparsity s_1 is then used as the initial model for MP (i.e., θ_0 := θ_t ⊙ m). Here, the initial sparsity of the MP schedule is set to s_i = s_1. We argue that this combination is beneficial because, in the first phase, SLT pruning removes the most unneeded parameters, and in the second phase, MP can then slowly adapt the model to a higher sparsity.
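The two phases can be sketched as follows (training between pruning steps is stubbed out; `prune_smallest` is a hypothetical helper, s_1 = 60% mirrors the setup used later, and the 90% final sparsity is an arbitrary illustration):

```python
import numpy as np

def prune_smallest(w, sparsity):
    """Zero out the smallest-magnitude fraction of w (already-zero entries count)."""
    k = int(round(sparsity * w.size))
    thr = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -1.0
    return np.where(np.abs(w) > thr, w, 0.0)

rng = np.random.default_rng(5)
theta_t = rng.normal(size=(10, 10))   # early checkpoint used for rewinding

s1, s_final = 0.6, 0.9
# Phase 1: SLT pruning reaches intermediate sparsity s1 (mask applied to theta_t).
theta_slt = prune_smallest(theta_t, s1)
# Phase 2: MP starts from the sparse model (theta_0 := theta_t * m) and
# gradually raises the sparsity from s1 toward s_final during further training.
theta_mp = prune_smallest(theta_slt, s_final)

print((theta_slt == 0).mean(), (theta_mp == 0).mean())  # 0.6 then 0.9
```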
MP-SLT
is analogous to SLT-MP: First, MP is applied to compute a trained sparse network θ_T with sparsity s_1. This trained network directly provides the corresponding mask m. θ_t ⊙ m is then used for SLT pruning until the target sparsity is reached. This pruning technique tests whether MP can be used to find winning lottery tickets.
4 Experimental Setup
We train the models on the WMT 2014 English→German and English→French datasets, consisting of about 4.5M and 36M sentence pairs, respectively. newstest2013 and newstest2014 are chosen as the development and test sets.
All experiments have been performed using the base transformer architecture as described by Vaswani et al. (2017), with the hyperparameters transformer_base_v3 from tensor2tensor/models/transformer.py in https://github.com/tensorflow/tensor2tensor/ and the corresponding adaptations for TPUs. The models are trained for 500k iterations on a single v3-8 TPU, saving checkpoints every 25k iterations. For all experiments, we select the best model based on the Bleu score on the development set. For MP, we only evaluate the last 4 checkpoints, as earlier checkpoints do not have the targeted sparsity. Intermediate MP sparsity levels are computed with the cubic schedule of Zhu and Gupta (2018), s_t = s_f + (s_i − s_f) · (1 − (t − t_0)/(t_end − t_0))³. For efficiency reasons, weights are only pruned every 10k iterations. Unless stated otherwise, we start with initial sparsity s_i = 0%. The final sparsity s_f is individually given for each experiment.
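The cubic sparsity schedule can be written down directly (a sketch following Zhu and Gupta (2018); `t_0` and `t_end` denote the assumed start and end of the pruning phase, and the 90% target is an arbitrary illustration):

```python
def mp_sparsity(t, t_0, t_end, s_i, s_f):
    """Cubic sparsity schedule of Zhu and Gupta (2018):
    s_t = s_f + (s_i - s_f) * (1 - (t - t_0)/(t_end - t_0))**3."""
    frac = min(max((t - t_0) / (t_end - t_0), 0.0), 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

# Gradually pruning toward 90% sparsity over 500k iterations:
for t in (0, 250_000, 500_000):
    print(t, mp_sparsity(t, 0, 500_000, 0.0, 0.9))
```

The schedule prunes aggressively early on, when the model can still recover easily, and slows down toward the end.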
We prune only the matrices, not biases. We report the approximate memory consumption of all trained models using the Compressed Sparse Column (CSC) format (Stanimirović and Tasic, 2009), which is the default for sparse data storage in the SciPy toolkit (Virtanen et al., 2020).
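A back-of-the-envelope estimate of the CSC cost (assuming 4-byte float values and 4-byte indices; this mirrors the CSC layout of values, row indices, and column pointers, not SciPy's exact implementation, and `csc_bytes` is a hypothetical helper):

```python
import numpy as np

def csc_bytes(matrix, index_bytes=4, value_bytes=4):
    """Approximate CSC storage cost: one value and one row index per
    non-zero entry, plus one column pointer per column (+1)."""
    nnz = np.count_nonzero(matrix)
    n_cols = matrix.shape[1]
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes

dense = np.zeros((512, 512), dtype=np.float32)
dense[:26, :] = 1.0                       # ~5% non-zeros, i.e. ~95% sparsity

ratio = csc_bytes(dense) / dense.nbytes
print(ratio)  # well below 1: CSC pays off at high sparsity
```

Because CSC stores an index per non-zero, it only saves memory once the matrix is sparse enough that the per-entry overhead is amortized (roughly below 50% density for this size assumption).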
Our initial experiments have shown that Adafactor leads to an improvement of 0.5 Bleu compared to Adam. Hence, we select it as our optimizer, using a learning rate schedule with warmup. We note that this differs from the implementation by Gale et al. (2019), in which Adam has been used. We highlight that for all experiments that require a reset of parameter values (i.e., LT, SLT, CLT, SLT-MP, and MP-SLT), we also reset the training step counter to 0, to include the warmup phase in every training run.
A shared vocabulary of 33k tokens based on word-pieces (Wu et al., 2016) is used. The reported case-sensitive, tokenized Bleu scores are computed using SacreBLEU (Post, 2018), Ter scores are computed using MultEval (Clark et al., 2011). All results are averaged over two separate training runs. For all experiments that require models to be reset to an early point during training, we select a checkpoint after 25k iterations.
All iterative pruning techniques except SLT-MP prune in increments of 10 percentage points up to 80% sparsity, then in increments of 5 percentage points, and finally prune to 98% sparsity. SLT-MP is directly pruned to 50% using SLT pruning, further reduced to 60% by SLT, and then switched to MP.
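The resulting sequence of target sparsity levels, written out explicitly (in percent):

```python
# Iterative pruning schedule: 10-point steps up to 80%, then 5-point steps,
# and a final jump to 98% sparsity.
levels = list(range(10, 81, 10)) + [85, 90, 95, 98]
print(levels)  # [10, 20, 30, 40, 50, 60, 70, 80, 85, 90, 95, 98]
```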
5 Experimental Results
In this section, we evaluate the experimental results for English→German and English→French translation given in Tables 1 and 2 to provide a comparison between the different pruning techniques described in Section 3.
Tables 1 and 2 clearly show a trade-off between sparsity and translation quality: every increase in sparsity degrades the performance accordingly. We especially note that even at a sparsity of 50%, the baseline performance cannot be reached. In contrast to all other techniques in this paper, MP does not require any reset of parameter values; therefore, the training duration is not increased.
Previous work applies LT pruning mainly to image recognition models such as ResNet (He et al., 2016) on ImageNet (Russakovsky et al., 2015). Gale et al. (2019) apply LT pruning to the larger transformer architecture and the WMT 2014 English→German translation task, noting that it is outperformed by MP. As seen in Table 1, simple LT pruning is outperformed by MP at all sparsity levels. Because LT pruning is an iterative process, training a network to a sparsity of 98% requires training and resetting the model 13 times, causing a big training overhead without any gain in performance. Therefore, simple LT pruning cannot be recommended for complex architectures.
The authors of the SLT hypothesis (Frankle et al., 2019) state that after 0.1-7% of the training, the intermediate model can be pruned to a sparsity of 50-99% without serious impact on the accuracy. As listed in Tables 1 and 2, this allows the network to be pruned up to 60% sparsity without a significant drop in Bleu, and is on par with MP up to 85% sparsity.
As described in Section 4, for resetting the models, a checkpoint after 25k iterations is used. For a total training duration of 500k iterations, this amounts to 5% of the training and is therefore within the 0.1-7% bracket given by Frankle et al. (2019). For individual experiments, we have also tried other values of t and have gotten similar results to those listed in this paper. It should be noted that for the case t = T, SLT pruning becomes a form of MP, as no reset happens anymore. We propose a more thorough hyperparameter search for the optimal value of t as future work.
Importantly, we note that the magnitude of the parameters in both the initial and the final models increases with every pruning step. This causes the model with 98% sparsity to have weights greater than 100, making it unsuitable for checkpoint averaging, as the weights become too sensitive to minor changes. Yu et al. (2020) report that they successfully apply checkpoint averaging. This might be because they choose θ_t from the dense training run for resetting, while we choose θ_t from the most recent sparse training run.
The underlying idea of the LT hypothesis is that the untrained network already contains a sparse sub-network which can be trained individually. Zhou et al. (2019) show that only the signs of the remaining parameters are important, not their specific random values. While Zhou et al. (2019) perform their experiments on MNIST and CIFAR-10, we test this hypothesis on the WMT 2014 English→German translation task using a deep transformer architecture.
Surprisingly, CLT pruning outperforms SLT pruning at most sparsity levels (see Table 1). By shuffling or re-initializing the remaining parameters, Frankle and Carbin (2018) have already shown that LT pruning does not just learn a sparse topology, but that the actual parameter values matter. As the good performance of the CLT experiments indicates that changing the parameter values has little impact as long as the sign is kept the same, we verify that keeping the original signs is indeed necessary. To this end, we randomly assign signs to the parameters after pruning to 50% sparsity. After training, this model scores 24.6% Bleu and 67.5% Ter, a clear performance degradation from the 26.7% Bleu and 65.2% Ter given in Table 1. Notably, this differs from the results by Frankle et al. (2020), as their results indicate that the signs alone are not enough to guarantee good performance.
Across all sparsity levels, the combination of SLT pruning and MP outperforms all other pruning techniques. For high sparsity values, SLT-MP models are also superior to the SLT models by Yu et al. (2020), even though the latter start off from a better performing baseline. We hypothesize that by first discarding 60% of all parameters using SLT pruning, MP is able to fine-tune the model more easily, because the least useful parameters are already removed.
We note that the high weight magnitude for sparse SLT models prevents successful MP training. Therefore, we have to reduce the number of SLT pruning steps by directly pruning to 50% in the first pruning iteration. However, as seen by comparing the scores for 50% and 60% sparsity on SLT and SLT-MP, this does not hurt the SLT performance.
For future work, we suggest trying different sparsity values for the switch between SLT and MP.
Switching from MP to SLT pruning causes the models to perform worse than for pure MP or SLT pruning. This indicates that MP cannot be used to find winning lottery tickets.
6 Conclusion
In conclusion, we have shown that the stabilized lottery ticket (SLT) hypothesis performs similarly to magnitude pruning (MP) on the complex transformer architecture up to a sparsity of about 85%. Especially for very high sparsities of 90% or more, MP has proven to perform reasonably well while being easy to implement and having no additional training overhead. We have also successfully verified that even for the transformer architecture, only the signs of the parameters are important when applying the SLT pruning technique; the specific initial parameter values do not significantly influence the training. By combining SLT pruning and MP, we can improve the sparsity-accuracy trade-off: in SLT-MP, SLT pruning first discards 60% of all parameters, so MP can focus on fine-tuning the model for maximum accuracy. Finally, we show that MP cannot be used to determine winning lottery tickets.
In future work, we suggest performing a hyperparameter search over possible values for t in SLT pruning (i.e., the number of training steps that are not discarded during model reset), and over s_1 for the switch from SLT to MP in SLT-MP. We also recommend looking into why CLT pruning works in our setup, while Frankle et al. (2020) present opposing results.
Acknowledgments
We would like to thank the anonymous reviewers for their valuable feedback.
This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS") and the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project "CoreTec"). Research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). The work reflects only the authors' views and none of the funding parties is responsible for any use that may be made of the information it contains.
References
- Clark et al. (2011). Better hypothesis testing for statistical machine translation: controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 176-181.
- Collins and Kohli (2014). Memory bounded deep convolutional networks. CoRR abs/1412.1442.
- Dauphin and Bengio (2013). Big neural networks waste capacity. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, Workshop Track Proceedings.
- Denil et al. (2013). Predicting parameters in deep learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Red Hook, NY, USA, pp. 2148-2156.
- Elsen et al. (2019). Fast sparse convnets.
- Frankle and Carbin (2018). The lottery ticket hypothesis: training pruned neural networks. CoRR abs/1803.03635.
- Frankle et al. (2019). The lottery ticket hypothesis at scale. CoRR abs/1903.01611.
- Frankle et al. (2020). The early phase of neural network training. In International Conference on Learning Representations.
- Gale et al. (2019). The state of sparsity in deep neural networks. CoRR abs/1902.09574.
- Guo et al. (2016). Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems 29, pp. 1379-1387.
- Han et al. (2016). EIE: efficient inference engine on compressed deep neural network. CoRR abs/1602.01528.
- Han et al. (2015). Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28, pp. 1135-1143.
- He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
- Liu et al. (2015). In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Narang et al. (2017). Exploring sparsity in recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, Conference Track Proceedings.
- Post (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186-191.
- Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211-252.
- Stanimirović and Tasic (2009). Performance comparison of storage formats for sparse matrices. Facta Universitatis, Series Mathematics and Informatics 24.
- Vaswani et al. (2018). Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), Boston, MA, pp. 193-199.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998-6008.
- Virtanen et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods.
- Voita et al. (2019). Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5797-5808.
- Wu et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144.
- Yu et al. (2020). Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP. In International Conference on Learning Representations.
- Zhou et al. (2019). Deconstructing lottery tickets: zeros, signs, and the supermask. In Advances in Neural Information Processing Systems 32, pp. 3597-3607.
- Zhu and Gupta (2018). To prune, or not to prune: exploring the efficacy of pruning for model compression. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, Workshop Track Proceedings.