1 Introduction
Weight pruning has been widely studied and used to remove redundant weights in overparameterized deep neural networks (DNNs) while maintaining accuracy (Han et al., 2015a, 2016; Wen et al., 2016; He et al., 2017; Min et al., 2018; He et al., 2019; Dai et al., 2019; Lin et al., 2020; He et al., 2020). The typical pruning pipeline has three main stages: 1) train an overparameterized DNN, 2) prune the unimportant weights in the original DNN, and 3) finetune the pruned DNN to restore accuracy. Many works have investigated the behavior of weight pruning (Tanaka et al., 2020; Ye et al., 2020; Renda et al., 2020; Malach et al., 2020). The Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2018) reveals that, inside a dense network with randomly initialized weights, there is a small sparse subnetwork that, when trained in isolation from the identical initial weights, can reach accuracy similar to that of the dense network. Such a sparse subnetwork with the initial weights is called the winning ticket.
For a more rigorous definition, let $f(x; \theta_0)$ be a given network initialization, where $\theta_0$ denotes the initial weights. We then formalize pretraining, pruning and sparse training below. Pretraining: The network is trained for $T$ epochs, arriving at weights $\theta_T$ and network function $f(x; \theta_T)$. Pruning: Based on the pretrained weights $\theta_T$, a pruning algorithm is used to generate a sparse mask $m$. Sparse Training: The LTH paper considers two cases of sparse training. The first (“winning ticket”) is the direct application of mask $m$ to the initial weights $\theta_0$, resulting in weights $m \odot \theta_0$ and network function $f(x; m \odot \theta_0)$. The second (random reinitialization) is the application of mask $m$ to a random initialization of weights $\theta_0'$, resulting in weights $m \odot \theta_0'$ and network function $f(x; m \odot \theta_0')$. The winning property has two aspects, ① and ②, for identification: ① Training $f(x; m \odot \theta_0)$ for $T$ epochs (or fewer) will result in accuracy similar to that of the dense pretrained network $f(x; \theta_T)$. ② There should be a notable accuracy gap between training $f(x; m \odot \theta_0)$ for $T$ epochs and training $f(x; m \odot \theta_0')$, and the former shall be higher.
In the standard LTH setup (Frankle and Carbin, 2018), the winning property can be observed at a low learning rate via the simple iterative magnitude pruning algorithm, but fails to occur at higher initial learning rates, especially in deeper neural networks. For instance, the LTH work identifies the winning tickets on the CIFAR10 dataset for the CONV2/4/6 architectures (the downscaled variants of VGG (Simonyan and Zisserman, 2014)), with an initial learning rate as low as 0.0001. For deeper networks such as ResNet20 and VGG19 on CIFAR10, the winning tickets can be identified only at a low learning rate. At higher learning rates, additional warm-up is needed to find the winning tickets. Liu et al. (2018) (the latest arXiv version) revisit LTH and find that, with a widely adopted learning rate, the winning ticket has no accuracy advantage over the random reinitialization. This questions the second aspect of the winning property, namely the accuracy gap between training $f(x; m \odot \theta_0)$ and training $f(x; m \odot \theta_0')$. Further, the follow-up work of Frankle et al. (2019) proposes iterative pruning with rewinding to stabilize the identification of winning tickets. Specifically, it resets the weights to $\theta_k$ in each pruning iteration, where $\theta_k$ denotes the weights trained from $\theta_0$ for a small number $k$ of epochs.
In this paper, we investigate the underlying condition and rationale behind the winning property. We ask whether such a property is a natural characteristic of DNNs across their architectures and/or applications. We revisit LTH via extensive experiments on various representative DNN models and datasets, and confirm that the winning property only exists at a low learning rate. In fact, such a “low learning rate” (e.g., 0.01 for ResNet20 and 0.0001 for CONV2/4/6 architectures on CIFAR10) already deviates significantly from the standard learning rate, and results in notable accuracy degradation in the pretrained DNN. Besides, training from the “winning ticket” at such a low learning rate can only restore the accuracy of the DNN pretrained under the same insufficient learning rate, not that of the DNN pretrained under the desirable learning rate. By introducing a correlation indicator for quantitative analysis, we find that the underlying reason is largely the correlation between the initialized weights and the final trained weights when the learning rate is not sufficiently large. We draw the following conclusions:

As a result of low learning rate, such weight correlation results in low accuracy in DNN pretraining.

Such weight correlation is also a key condition of winning property, concluded through a detailed analysis of the cause of winning property.

Thus, the existence of the winning property is correlated with insufficient DNN pretraining, i.e., it is unlikely to occur for a well-trained DNN.
Different from sparse training under the lottery ticket setting, we propose the “pruning & finetuning” method, i.e., apply mask $m$ to the pretrained weights $\theta_T$ and perform finetuning for $T$ epochs. The generated sparse subnetwork can largely achieve the accuracy of the pretrained dense DNN. Through comprehensive experiments and analysis we draw the following conclusions:

“Pruning & finetuning” consistently outperforms lottery ticket setting under the same pruning algorithm for mask generation, and the same total training epochs.

The pruning algorithm responsible for mask generation plays an important role in the quality of generated sparse subnetwork.

Thus, if one wants to optimize the accuracy of sparse subnetwork and restore the accuracy of the pretrained dense DNN, we suggest adopting the pruning & finetuning method instead of lottery ticket setting.
2 Related Work
2.1 DNN Weight Pruning
DNN weight pruning, as a model compression technique, can effectively remove the redundant weights in DNN models and hence reduce both storage and computation costs. The general flow of weight pruning consists of three steps: (1) train the neural network; (2) derive a subnetwork structure (i.e., remove unimportant weights) using a certain pruning algorithm; and (3) finetune the remaining weights in the subnetwork to restore accuracy. Different pruning algorithms differ in their ability to find the best-suited sparse subnetwork and therefore lead to different final accuracies.
The most straightforward method is magnitude-based one-shot pruning, which directly zeroes out a given percentage of the trained weights with the smallest magnitudes. However, this method usually leads to a severe accuracy drop at a relatively high pruning rate. Iterative magnitude pruning, proposed in (Han et al., 2015b), removes the weights with the smallest magnitudes in an iterative manner. It repeats step (1) and step (2) multiple times until reaching the target pruning rate. In (Frankle and Carbin, 2018), iterative pruning is adopted to find the sparse subnetwork (i.e., the winning ticket). The iterative pruning process is still a greedy search, and has been extended in (Zhu and Gupta, 2017; Tan and Motani, 2020; Liu et al., 2020) to derive better subnetwork structures.
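As an illustration of the one-shot case described above, the following is a minimal sketch in plain Python. The function names `magnitude_mask` and `apply_mask`, the flat-list weight representation, and the toy numbers are assumptions made for this example only; they are not taken from any of the cited implementations.

```python
def magnitude_mask(weights, prune_ratio):
    """Return a 0/1 mask that removes the `prune_ratio` fraction of
    smallest-magnitude entries from a flat list of weights."""
    n_prune = int(len(weights) * prune_ratio)
    # Indices sorted by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Element-wise product of a mask and a weight list."""
    return [w * m for w, m in zip(weights, mask)]


# Toy "trained" weights (made-up values for illustration).
trained = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4]
mask = magnitude_mask(trained, prune_ratio=0.5)
sparse = apply_mask(trained, mask)
```

In the iterative variant, the same selection step would be applied repeatedly with a smaller ratio per round, with training in between.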
To overcome the greedy nature of heuristic pruning methods, more mathematics-oriented regularization-based algorithms (Wen et al., 2016; He et al., 2017) have been proposed, which generate sparsity by incorporating $\ell_1$ or $\ell_2$ structured regularization in the loss function. However, this method directly applies fixed regularization terms that penalize all weights equally, which can lead to an accuracy drop. Later works (Zhang et al., 2018; Ren et al., 2019) incorporate the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011; Ouyang et al., 2013) to solve the pruning problem as an optimization problem, which adopts dynamic regularization penalties and maintains high accuracy.
2.2 Lottery Ticket Hypothesis
2.2.1 The Origin and Controversy of Lottery Ticket Hypothesis
The recent work (Frankle and Carbin, 2018) reveals that, inside a dense network with randomly initialized weights, a small sparse subnetwork, when trained in isolation from the identical initial weights, can reach a test accuracy similar to that of the trained dense network. Such a sparse subnetwork is called the winning ticket and can be found by pruning the pretrained dense network at a non-trivial pruning ratio.
As demonstrated in (Frankle and Carbin, 2018), winning tickets can be found in small networks on small datasets when using relatively low learning rates (e.g., 0.01 for SGD). The concurrent work (Liu et al., 2018) finds that, when using a relatively large learning rate (e.g., 0.1 for SGD), training a “winning ticket” with identical initialized weights does not provide any unique advantage in accuracy over training with randomly initialized weights. The follow-up works (Frankle et al., 2019; Renda et al., 2020) also confirm that, for deeper networks and relatively large learning rates, the winning property can hardly be observed. They propose a weight rewinding technique to identify small subnetworks that can be trained in isolation to achieve accuracy competitive with the dense pretrained network.
2.2.2 Other Aspects and Applications
Later work (Chen et al., 2020) extends the lottery ticket hypothesis to a pretrained BERT model to evaluate the transferability of the sparse subnetworks among different downstream NLP tasks. Recent works (Morcos et al., 2019; Chen et al., 2020) have studied the lottery ticket hypothesis in computer vision tasks and in unsupervised learning.
The potential of sparse training suggested by the lottery ticket hypothesis has motivated the study of deriving the “winning tickets” at an early stage of training, thereby accelerating the training process. A number of works follow this direction (Frankle et al., 2020a; You et al., 2019; Frankle et al., 2020b); they are orthogonal to the discussions in this paper.
3 Notations in this Paper
In this paper, we follow the notations from (Frankle and Carbin, 2018) and generalize to the “pruning & finetuning” setup. Detailed notations (as shown in Figure 1) are illustrated as follows:

Initialization: Given a network $f(x; \theta_0)$, where $\theta_0$ denotes the initial weights.

Pretraining: Train the network for $T$ epochs, arriving at weights $\theta_T$ and network function $f(x; \theta_T)$.

Pruning: Based on the trained weights $\theta_T$, adopt a certain algorithm to generate a pruning mask $m$. The LTH paper (Frankle and Carbin, 2018) uses the iterative pruning algorithm. We start from this algorithm for a fair evaluation, but are not restricted to it. Other algorithms, e.g., one-shot pruning and ADMM-based pruning, are also employed to evaluate the impact on sparse training and pruning & finetuning, as shown in Section 5.

Sparse Training (Lottery Ticket Setting): The LTH paper considers two cases in the sparse training setup. The first is the direct application of mask $m$ to the initial weights $\theta_0$, resulting in weights $m \odot \theta_0$ and network function $f(x; m \odot \theta_0)$. The LTH paper terms this case the “winning ticket” (we inherit this terminology, although it does not result in the winning property in many of our testing results). The second is the application of mask $m$ to a random initialization of weights $\theta_0'$, resulting in weights $m \odot \theta_0'$ (network function $f(x; m \odot \theta_0')$). This case is termed “random reinitialization” in the LTH paper. In each case, the weights obtained after training for $T$ epochs are referred to as the trained weights of that case. Please note that the mask $m$ is kept throughout the training process.

Pruning & finetuning: After generating the mask $m$, we directly apply it to the trained weights $\theta_T$, resulting in weights $m \odot \theta_T$, and perform finetuning (retraining) for additional epochs. We refer to the resulting final weights as the finetuned weights. To maintain the same number of total epochs as the lottery ticket setting, the number of finetuning epochs is set equal to the number of sparse training epochs. Please note that the mask $m$ is kept throughout the finetuning process.
The winning property has a twofold meaning: First, training $f(x; m \odot \theta_0)$ for $T$ epochs (or fewer) will result in accuracy similar to that of $f(x; \theta_T)$ (the pretraining result of the dense network), at a non-trivial pruning rate. Second, there should be a notable accuracy gap between training $f(x; m \odot \theta_0)$ for $T$ epochs and training $f(x; m \odot \theta_0')$, and the former shall be higher.
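The two criteria of the winning property can be written as a simple check on three measured accuracies. The sketch below is only illustrative: the function name `has_winning_property`, the tolerance `tol` for "similar accuracy", and the threshold `gap` for a "notable" difference are assumptions of this example, since the text does not fix numeric thresholds, and the accuracies passed in are made-up values.

```python
def has_winning_property(acc_dense, acc_ticket, acc_reinit, tol=0.5, gap=0.5):
    """Check the two aspects of the winning property (accuracies in percent).

    acc_dense:  accuracy of the pretrained dense network
    acc_ticket: accuracy after sparse training from the "winning ticket"
    acc_reinit: accuracy after sparse training from random reinitialization
    tol, gap:   hypothetical thresholds chosen for this sketch
    """
    # Aspect 1: the ticket roughly restores the dense accuracy.
    restores_dense = acc_ticket >= acc_dense - tol
    # Aspect 2: the ticket is notably better than random reinitialization.
    beats_reinit = (acc_ticket - acc_reinit) >= gap
    return restores_dense and beats_reinit


# Made-up numbers: the first case satisfies both aspects, the second neither.
case_a = has_winning_property(acc_dense=90.0, acc_ticket=90.1, acc_reinit=88.0)
case_b = has_winning_property(acc_dense=92.0, acc_ticket=90.0, acc_reinit=90.0)
```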
4 Why Lottery Ticket Exists? An Analysis from the Weight Correlation Perspective
4.1 Revisiting Lottery Ticket: When does this winning property exist?
We revisit the lottery ticket experiments on various DNN architectures, including VGG11, ResNet20, and MobileNetV2, on the CIFAR10 and CIFAR100 datasets. Our goal is to investigate the precise condition under which the winning property exists. We explore two different initial learning rates. The pruning approach for deriving masks follows the iterative pruning in Frankle and Carbin (2018): iteratively remove a percentage of the weights with the smallest magnitudes in each layer and, in each iterative pruning round, reset the weights to the initial weights $\theta_0$. We use a uniform per-layer pruning rate. Note that the first convolutional layer is not pruned for any DNN in this work.
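The iterative procedure described above (train, prune a fixed fraction of the surviving weights per layer, reset to the initial weights, repeat) can be sketched as follows. This is a toy sketch under stated assumptions: `train` is a hypothetical callable standing in for a full training run, layers are plain Python lists, and the exemption of the first convolutional layer is omitted for brevity.

```python
def iterative_magnitude_pruning(theta0, train, rounds, rate_per_round):
    """Return a per-layer 0/1 mask found by iterative magnitude pruning.

    theta0: list of layers, each a list of initial weights
    train:  hypothetical callable (weights, mask) -> trained weights
    rounds: number of prune rounds
    rate_per_round: fraction of surviving weights removed per layer per round
    """
    mask = [[1] * len(layer) for layer in theta0]
    for _ in range(rounds):
        # Every round restarts from the masked initial weights.
        start = [[w * m for w, m in zip(layer, lm)]
                 for layer, lm in zip(theta0, mask)]
        trained = train(start, mask)
        for l, layer in enumerate(trained):
            alive = [i for i, m in enumerate(mask[l]) if m == 1]
            n_prune = int(len(alive) * rate_per_round)
            # Remove the surviving weights with the smallest magnitudes.
            alive.sort(key=lambda i: abs(layer[i]))
            for i in alive[:n_prune]:
                mask[l][i] = 0
    return mask


def toy_train(weights, mask):
    """Stand-in for training: simply doubles every weight."""
    return [[2 * w for w in layer] for layer in weights]


theta0 = [[0.1, -0.5, 0.3, 0.9], [0.2, -0.7, 0.05, 0.6]]
final_mask = iterative_magnitude_pruning(theta0, toy_train, rounds=2,
                                         rate_per_round=0.5)
```

With two rounds at 50% per round, each four-weight layer keeps a single weight, i.e., 75% sparsity per layer.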
Figure 2 compares the accuracy of random reinitialization and the “winning ticket” (both sparse training) for ResNet20 on CIFAR10 at learning rates 0.01 and 0.1 over a range of sparsity ratios (Frankle and Carbin (2018) uses the low learning rate 0.01). We conduct each experiment five times (result variation shown in the figure). We use the same number of training epochs, 150, for training the original DNN with initial weights $\theta_0$ (i.e., pretraining), training from randomly reinitialized weights with the mask $m$ (random reinitialization), and training from the initial weights with the mask $m$ (“winning ticket”). The hyperparameters are the same as in (Frankle and Carbin, 2018): SGD with momentum (0.9), and the learning rate decreases by a factor of 10 after 80 and 120 epochs. The batch size is 128. No additional training tricks are used throughout the paper, for fairness in comparison.
In the case of the initial learning rate of 0.01, the pretrained DNN’s accuracy is 89.62%. The “winning tickets” consistently outperform the random reinitialization over different sparsity ratios. The “winning ticket” achieves its highest accuracy, 90.04% (higher than the pretrained DNN), at a sparsity ratio of 62%. This is similar to the observations in (Frankle and Carbin, 2018) on the same network and dataset. On the other hand, in the case of the initial learning rate of 0.1, the pretrained DNN’s accuracy is 91.7%. In this case, the “winning ticket” has accuracy similar to that of the random reinitialization, and cannot reach accuracy close to that of the pretrained DNN at a reasonable sparsity ratio (say 50% or beyond). Thus the winning property is not satisfied. Similar results are found in the experiments with ResNet20, VGG11 and MobileNetV2 on both CIFAR10 and CIFAR100; the results of the remaining experiments are detailed in Appendix A.
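The learning-rate schedule stated above (initial rate 0.01 or 0.1, divided by 10 after 80 and 120 epochs, 150 epochs in total) can be written as a small function. The name `step_lr` and the zero-based epoch indexing are assumptions of this sketch.

```python
def step_lr(epoch, base_lr, milestones=(80, 120), factor=0.1):
    """Learning rate at a zero-based epoch index under step decay.

    The rate is multiplied by `factor` once for every milestone
    that has already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Schedule for the "desirable" initial rate over 150 epochs.
schedule = [step_lr(e, base_lr=0.1) for e in range(150)]
```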
From these experiments, the winning property exists at a low learning rate but does not exist at a relatively high learning rate, which is also observed in (Liu et al., 2018). However, we would like to point out that the relatively high learning rate 0.1 (which is in fact the standard learning rate on these datasets) results in a notably higher accuracy of the pretrained DNN (91.7% vs. 89.6%) than the low learning rate. (As CIFAR10 is a relatively small dataset, a 2% accuracy difference is notable and should not be ignored.) The associated sparse training results (“winning ticket”, random reinitialization) in the lottery ticket setting are also higher with the learning rate 0.1. This point is largely missing in the previous discussions. Now the key question is: Are the above two observations correlated? If the answer is yes, it means that the winning property is not universal to DNNs, nor is it a natural characteristic of DNN architecture or application. Rather, it indicates that the learning rate is not sufficiently large, and the original, pretrained DNN is not well-trained.
Our hypothesis is that the above observations are correlated, and that this is largely attributed to the correlation between the initialized weights and the final trained weights when the learning rate is not sufficiently large. Before validating our hypothesis, we introduce a correlation indicator (CI) for quantitative analysis.
4.2 Weight Correlation Indicator
Consider a DNN with two collections of weights $\theta$ and $\theta'$. Note that this is a general definition that applies to both the original DNN and the sparse DNN (when the mask is applied and a portion of the weights is eliminated). We define the correlation indicator to quantify the amount of overlap between the indices of the large-magnitude weights in $\theta$ and $\theta'$. More specifically, given a DNN with $L$ layers, where the $l$-th layer has $N_l$ weights, the weight index set $T_p(\theta^{l})$ ($0 < p < 1$) contains the top $p \cdot 100\%$ largest-magnitude weights in the $l$-th layer. Similarly, we define $T_p(\theta'^{l})$. Please note that for a sparse DNN, the portion $p$ is defined with respect to the number of remaining weights in the sparse (sub)network, so that the formula is unified for dense and sparse DNNs. The intersection of these two sets includes those weights that are large (top $p \cdot 100\%$) in magnitude in both $\theta$ and $\theta'$, and $|T_p(\theta^{l}) \cap T_p(\theta'^{l})|$ denotes the number of such weights in layer $l$. The correlation indicator (overlap ratio) between $\theta$ and $\theta'$ is finally defined as:
$$R_p(\theta, \theta') = \frac{\sum_{l=1}^{L} |T_p(\theta^{l}) \cap T_p(\theta'^{l})|}{p \cdot \sum_{l=1}^{L} N_l} \qquad (1)$$
When $R_p(\theta, \theta') \approx p$, the top $p \cdot 100\%$ largest-magnitude weights in $\theta$ and $\theta'$ are largely independent. In this case the correlation is relatively weak (we cannot say that there is no correlation here, because $R_p(\theta, \theta') \approx p$ is only a necessary condition). On the other hand, if there is a large deviation of $R_p(\theta, \theta')$ from $p$, then there is a strong correlation. In particular, when $R_p(\theta, \theta') > p$, the weights that are large in magnitude in $\theta$ are likely to also be large in $\theta'$, indicating a positive correlation. Otherwise, when $R_p(\theta, \theta') < p$, it implies a negative correlation.
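The overlap ratio described above can be computed directly. The sketch below takes, per layer, the indices of the top-`p` fraction of largest-magnitude weights in each of two weight collections, counts the shared indices, and divides by the total size of the top sets. The function names, the list-of-lists weight layout, the optional `masks` argument for sparse networks, and the choice of normalizing by the rounded top-set sizes are assumptions of this sketch rather than details confirmed by the text.

```python
def top_set(layer, p, mask=None):
    """Indices of the top-p fraction of largest-magnitude weights in a layer.

    For a sparse layer, only positions kept by `mask` are considered, so
    the fraction p refers to the number of remaining weights.
    """
    alive = [i for i in range(len(layer)) if mask is None or mask[i] == 1]
    k = int(p * len(alive))
    alive.sort(key=lambda i: abs(layer[i]), reverse=True)
    return set(alive[:k]), k


def correlation_indicator(theta_a, theta_b, p, masks=None):
    """Overlap ratio between the large-magnitude weights of two collections."""
    overlap, total = 0, 0
    for l, (a, b) in enumerate(zip(theta_a, theta_b)):
        m = None if masks is None else masks[l]
        set_a, k = top_set(a, p, m)
        set_b, _ = top_set(b, p, m)
        overlap += len(set_a & set_b)
        total += k
    if total == 0:
        raise ValueError("p is too small for these layer sizes")
    return overlap / total


# Toy single-layer example (made-up values).
a = [[0.9, -0.8, 0.1, 0.2]]
b = [[1.0, 0.05, -0.7, 0.1]]
ci_ab = correlation_indicator(a, b, p=0.5)
ci_aa = correlation_indicator(a, a, p=0.5)
```

A value near `p` is what independent orderings would give on average, while a value near 1 means the same weights are large in both collections.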
As shown in Figure 3, the above correlation indicator is used to quantify the correlation between two dense DNNs, i.e., between $\theta_0$ and $\theta_T$ for DNN pretraining, and between two sparse DNNs, i.e., between each sparse training starting point ($m \odot \theta_0$ for the “winning ticket”, $m \odot \theta_0'$ for random reinitialization) and the finetuned weights obtained by pruning & finetuning. Next, we use the former to demonstrate the effect of different learning rates in DNN pretraining and the latter to demonstrate the rationale and condition of the winning property.
4.3 Weight Correlation in DNN Pretraining
Intuitively, the weight correlation means that if a weight is large in magnitude at initialization, it is likely to be large after training. The reason for such correlation is that the learning rate is too low and weight updating is slow. Such weight correlation is not desirable for DNN training and typically results in lower accuracy, as weights in a well-trained DNN should depend more on the location of those weights than on their initialization (Liu et al., 2018). Thus, when such weight correlation is strong, the DNN accuracy will be lower, i.e., the DNN is not well-trained.
To validate the above statement, we have performed experiments to derive the correlation indicator between $\theta_0$ and $\theta_T$ for DNN pretraining with different initial learning rates. We use ResNet20 on the CIFAR10 dataset as an illustrative example. Figure 4 illustrates the correlation indicator between the initial weights $\theta_0$ and the trained weights $\theta_T$ from DNN pretraining at learning rates of 0.01 and 0.1, respectively. We use the same number of training epochs as in Section 4.1, the same other hyperparameters, and no additional training tricks. We can observe that pretraining at learning rate 0.01 yields a notably higher correlation than pretraining at learning rate 0.1. This observation indicates that the large-magnitude weights of $\theta_0$ are not fully updated at the low learning rate 0.01, indicating that the pretrained DNN is not well-trained. In the case of learning rate 0.1, the weights are sufficiently updated and are thus largely independent of the initial weights (the correlation indicator is close to the value expected for independent weights, as defined in Section 4.2), indicating a well-trained DNN. Results on other DNN models and datasets are provided in Appendix B, and a similar conclusion can be drawn.
As shown in the result discussions, learning rates 0.1 and 0.01 (for ResNet20) are not merely two candidate hyperparameter values. Rather, they result in a well-trained DNN (so a desirable learning rate) and a not well-trained DNN (so a not-so-good learning rate), respectively. We shall not rely on conclusions drawn from the latter, which results in insufficient DNN pretraining.
4.4 Cause and Condition of the Winning Property
Weight Correlation under Lottery Ticket Setting: In this subsection, our goal is to understand the difference in accuracy between training $f(x; m \odot \theta_0)$ (“winning ticket”) and training $f(x; m \odot \theta_0')$ (random reinitialization) when the learning rate is low, thereby revealing the cause and condition of the winning property. We achieve this goal by studying the weight correlation.
Consider the “pruning & finetuning” case formally defined in Section 3, in which we apply mask $m$ to the trained weights $\theta_T$ from DNN pretraining, and then perform finetuning for additional epochs, yielding the finetuned weights. We use ResNet20 on CIFAR10 as an illustrative example. Figures 5(a) and 5(b) show the accuracy of the “pruning & finetuning” result at different sparsity ratios, with learning rates 0.01 and 0.1, respectively. Again we use the same number of epochs and the same other hyperparameters. The accuracies of the pretrained DNN with the corresponding learning rates are also provided. One can observe that pruning & finetuning achieves relatively high accuracy, close to or higher than the accuracy of the pretrained DNN at the same learning rate (even at the desirable learning rate 0.1). (In fact, the relatively high accuracy of the finetuned network is one major reason for us to explore the correlation between $m \odot \theta_0$ ($m \odot \theta_0'$) and the finetuned weights. In Section 5 we generalize to the conclusion that “pruning & finetuning” results in higher accuracy in general than sparse training, i.e., the lottery ticket setting.) Results on other DNN models and datasets are provided in Appendix C, and a similar conclusion can be drawn.
We study the correlation between $m \odot \theta_0$ ($m \odot \theta_0'$) and the finetuned weights to shed some light on the cause of the winning property. Again we use ResNet20 on CIFAR10 as an illustrative example, while the results on other DNN models and datasets are provided in Appendix D with similar conclusions. Figure 5(c) shows the correlation indicator between $m \odot \theta_0$ (“winning ticket”) and the finetuned weights, and between $m \odot \theta_0'$ (random reinitialization) and the finetuned weights, under the insufficient learning rate 0.01, while Figure 5(d) shows the correlation indicator results under the desirable learning rate 0.1. One can observe a positive correlation between $m \odot \theta_0$ and the finetuned weights at the low learning rate, when the winning property exists. Such correlation is minor in the other cases.
Analysis of Weight Correlation and Condition of Winning Property: Let us investigate the cause of the correlation between $m \odot \theta_0$ and the finetuned weights at a low learning rate. As shown in Section 4.3, there is a correlation between $\theta_0$ and $\theta_T$ at the insufficient learning rate. Then there is also a correlation between $m \odot \theta_0$ and $m \odot \theta_T$ (both with the same mask applied). As finetuning starts from the pretrained weights $m \odot \theta_T$ and only applies additional finetuning, there will be a positive correlation between $m \odot \theta_T$ and the finetuned weights. Combining the above two statements yields the correlation between $m \odot \theta_0$ and the finetuned weights. When we consider random reinitialization, there is no correlation between $\theta_0'$ and $\theta_T$, as a reinitialization is applied. So there is no correlation between $m \odot \theta_0'$ and $m \odot \theta_T$, or between $m \odot \theta_0'$ and the finetuned weights.
At the desirable learning rate 0.1, there is a minor (or no) correlation between $\theta_0$ and $\theta_T$, as shown in Section 4.3. As a result, there is minor (or no) correlation between $m \odot \theta_0$ and the finetuned weights, or between $m \odot \theta_0'$ and the finetuned weights. From the above analysis, one can observe that the correlation between $\theta_0$ and $\theta_T$ is the key condition in the weight correlation analysis.
The positive correlation between $m \odot \theta_0$ and the finetuned weights helps to explain the winning property at a low learning rate. Compared with the random reinitialization $m \odot \theta_0'$, the “winning ticket” $m \odot \theta_0$ is “closer” to a reasonably accurate solution. As weight updating is slow (the learning rate is insufficient), it takes less effort to reach a higher accuracy starting from $m \odot \theta_0$ than starting from $m \odot \theta_0'$. Besides, as pointed out in Section 4.3, the insufficient learning rate (and the correlation between $\theta_0$ and $\theta_T$) results in a low accuracy of the pretrained DNN, which makes it easier for sparse training to reach that accuracy. On the other hand, at a sufficient learning rate, such correlations do not exist (or are very minor), and then the winning property does not exist.
Remarks: From the above analysis, we conclude that a key condition of the winning property is the correlation between $\theta_0$ and $\theta_T$. However, as already demonstrated in Section 4.3, under the same condition the pretrained network will not be well-trained, as weights in a well-trained DNN should depend more on the location of those weights than on their initialization. In fact, as shown in Figure 2 and Appendix A, the “winning ticket” can only restore the accuracy of the pretrained DNN under the same insufficient learning rate, instead of reaching the pretrained DNN accuracy at a desirable learning rate. This makes the value of the winning property questionable.
4.5 Takeaway
As discussed above, the existence of the winning property is correlated with insufficient DNN pretraining. It seems that the winning property is not a natural characteristic of DNN architecture or application, and is unlikely to occur for a well-trained DNN (with a desirable learning rate). As a result, we do not suggest investigating the winning property under an insufficient learning rate.
5 Pruning & Finetuning – A Better Way to Restore Accuracy under Sparsity
As concluded from the above discussions, it is difficult for sparse training to restore the accuracy of the pretrained dense DNN when a desirable learning rate is applied. On the other hand, as already hinted in Section 4.4, “pruning & finetuning” (i.e., finetuning from $m \odot \theta_T$) exhibits a higher capability of achieving the accuracy of a pretrained DNN, no matter what the learning rate is. Compared with the lottery ticket setting, the only difference in “pruning & finetuning” is that the mask $m$ is applied to the pretrained weights $\theta_T$. Is this the key reason for the high accuracy? Is this a universal property for different DNN architectures and applications? If the answer is yes, what is the underlying reason? We aim to answer these questions.
In this section, we only consider the desirable learning rate (e.g., 0.1 for ResNet20 on CIFAR10 dataset) and sufficient DNN pretraining, as the associated conclusions will be more meaningful.
Fair Comparison with Lottery Ticket Setting: We claim that the comparison is fair when finetuning and sparse training use the same number of epochs. The generation of mask $m$ is the same. The only difference is that pruning & finetuning applies mask $m$ to the pretrained weights $\theta_T$, while sparse training applies $m$ to $\theta_0$. Note that $\theta_T$ is available before $m$, as the latter is derived from $\theta_T$ using the pruning algorithm. Thus pruning & finetuning requires no additional training epochs compared with sparse training.
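To make the comparison concrete, the sketch below builds the three starting points that share one mask: the initial weights, a random reinitialization, and the pretrained weights. The function names, the flat-list weights, and the Gaussian draw used for the reinitialization are assumptions of this example; the text does not specify the initialization distribution.

```python
import random


def hadamard(weights, mask):
    """Element-wise product of a mask and a flat weight list."""
    return [w * m for w, m in zip(weights, mask)]


def starting_points(theta0, theta_T, mask, seed=0):
    """Return the masked starting weights for the three training settings."""
    rng = random.Random(seed)
    # Hypothetical reinitialization: fresh Gaussian draws replace theta0.
    theta0_reinit = [rng.gauss(0.0, 0.1) for _ in theta0]
    return {
        "winning_ticket": hadamard(theta0, mask),        # sparse training
        "random_reinit": hadamard(theta0_reinit, mask),  # sparse training
        "prune_finetune": hadamard(theta_T, mask),       # finetuning
    }


# Made-up weights; the mask is assumed to come from pruning theta_T.
theta0 = [0.1, -0.2, 0.05, 0.3]
theta_T = [0.9, -0.1, 0.02, -0.7]
mask = [1, 0, 0, 1]
points = starting_points(theta0, theta_T, mask)
```

All three starting points require the pretrained weights to exist first (the mask is derived from them), which is why the three settings have the same total training cost.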
Comparison between Pruning & Finetuning and Sparse Training: We use ResNet20 on the CIFAR10 dataset as an illustrative example, and the rest of the results are provided in Appendix E (with similar conclusions). We use the desirable learning rate 0.1, $T$ epochs, and the same remaining hyperparameters as in Section 4.1. Figure 6(a) shows the accuracy comparison between pruning & finetuning (i.e., finetuning from $m \odot \theta_T$) and the two sparse training scenarios, “winning ticket” (i.e., training from $m \odot \theta_0$) and random reinitialization (i.e., training from $m \odot \theta_0'$), at different sparsity ratios. The iterative pruning algorithm is used to derive mask $m$ here. One can clearly observe the accuracy gap between pruning & finetuning and the two sparse training cases (lottery ticket setting). In fact, the pruning & finetuning scheme consistently outperforms the pretrained dense DNN up to a sparsity ratio of 70%. Again, there is no accuracy difference between the two sparse training cases.
Furthermore, we consider two other candidate pruning algorithms, ADMM-based pruning (Zhang et al., 2018) and one-shot pruning, for pruning mask generation. Figure 6(b) and Figure 6(c) show the corresponding accuracy comparisons between pruning & finetuning and the two sparse training scenarios. Again one can observe the notable advantage of pruning & finetuning over the lottery ticket setting, even with a weak one-shot pruning algorithm for mask generation. In fact, pruning & finetuning under ADMM-based pruning can restore the accuracy of the pretrained DNN at 80% sparsity. The winning property is not found under any of these pruning algorithms. Clearly, the consistent advantage of pruning & finetuning is attributed to the fact that mask $m$ is applied to the pretrained weights $\theta_T$ instead of the initialized weights $\theta_0$. In fact, the information in $\theta_T$ is important for the sparse subnetwork to maintain the accuracy of the pretrained dense network. In other words, weights in the desirable sparse subnetwork should be correlated with $\theta_T$ instead of $\theta_0$.
Effect of Different Pruning Algorithms – Towards a Better Mask Generation: We have tested three pruning algorithms for mask generation. How should their relative performance be evaluated? Figure 7 combines the above results and shows the accuracy of pruning & finetuning and sparse training (“winning ticket” case) under all three pruning algorithms. The rest of the results are in Appendix E. One can observe the following order in accuracy: ADMM-based pruning on top, iterative pruning in the middle, and one-shot pruning the lowest. This order is the same for pruning & finetuning and sparse training. Note that the pruning algorithm is used to generate mask $m$, while the other conditions are the same (i.e., the same number of epochs, whether finetuning on $m \odot \theta_T$ or sparse training on $m \odot \theta_0$). Hence, the relative performance is attributed to the quality of mask generation. One can conclude that the choice of pruning algorithm is critical in generating the sparse subnetwork, as the quality of mask generation plays a key role here.
An Analysis from Weight Correlation Perspective: We conduct a weight correlation analysis of the pruning & finetuning results that can largely restore the accuracy of the pretrained dense DNN, between the final finetuned weights and the initialization $\theta_0$. Detailed results and discussions are provided in Appendix F. The major conclusion is that there is a lack of correlation between the final weights and $\theta_0$, but there is a correlation between the final weights and the pretrained weights $\theta_T$. This further strengthens the conclusion that weight correlation between the final trained weights and the weight initialization is not desirable.
Comparison with Frankle et al. (2019): The work of Frankle et al. (2019) suggests applying mask $m$ to $\theta_k$ and then performing sparse training, where $\theta_k$ denotes the weights trained from $\theta_0$ for a small number $k$ of epochs. This technique trains from $m \odot \theta_k$, and lies in between sparse training (training from $m \odot \theta_0$) and pruning & finetuning (training from $m \odot \theta_T$). We point out that these three cases require the same number of total epochs under the same pruning algorithm, as mask $m$ is generated later than $\theta_k$ or $\theta_T$. We conduct a comprehensive comparison of the relative performance, with detailed results and discussions in Appendix G. The major conclusion is that pruning & finetuning consistently outperforms the method of Frankle et al. (2019) over different sparsity ratios, DNN models, and datasets. As the methods use the same number of training epochs, we suggest directly applying the mask to $\theta_T$ and performing finetuning, instead of applying it to $\theta_k$.
Remarks: If one wants to optimize the accuracy of sparse subnetwork and restore the accuracy of the pretrained dense DNN, we suggest adopting the pruning & finetuning method instead of lottery ticket setting.
6 Conclusion
In this work, we investigate the underlying condition and rationale behind the lottery ticket property. We introduce a correlation indicator for quantitative analysis. Extensive experiments over multiple deep models on different datasets show that the existence of the winning property is correlated with insufficient DNN pretraining, and that it is unlikely to occur for a well-trained DNN. Meanwhile, it is difficult for sparse training in the lottery ticket setting to restore the accuracy of the pretrained dense DNN. To overcome this limitation, we propose the “pruning & finetuning” method, which consistently outperforms lottery ticket sparse training under the same pruning algorithm and total training epochs over various DNNs on different datasets.
References

 Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), pp. 1–122. Cited by: §2.1.
 The lottery ticket hypothesis for pre-trained bert networks. arXiv preprint arXiv:2007.12223. Cited by: §2.2.2.
 NeST: a neural network synthesis tool based on a grow-and-prune paradigm. IEEE Transactions on Computers. Cited by: §1.
 The lottery ticket hypothesis: finding sparse, trainable neural networks. ICLR. Cited by: Appendix A, Lottery Ticket Implies Accuracy Degradation, Is It a Desirable Phenomenon?, §1, §1, §2.1, §2.2.1, §2.2.1, 3rd item, §3, §4.1, §4.1, §4.1.
 Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611. Cited by: Appendix G, Appendix G, §1, §2.2.1, §5.
 Pruning neural networks at initialization: why are we missing the mark?. arXiv preprint arXiv:2009.08576. Cited by: §2.2.2.
 The early phase of neural network training. arXiv preprint arXiv:2002.10365. Cited by: §2.2.2.
 Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR). Cited by: §1.
 Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143. Cited by: §1.
 Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (NeurIPS), pp. 1135–1143. Cited by: §2.1.

 Learning filter pruning criteria for deep convolutional neural networks acceleration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4340–4349. Cited by: §1.
 Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1389–1397. Cited by: §1, §2.1.
 HRank: filter pruning using high-rank feature map. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 AutoCompress: an automatic dnn structured pruning framework for ultra-high compression rates. In AAAI, Cited by: §2.1.
 Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: §1, §2.2.1, §4.1, §4.3.
 Proving the lottery ticket hypothesis: pruning is all you need. In International Conference on Machine Learning, pp. 6682–6691. Cited by: §1.
 2PFPCE: two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220. Cited by: §1.
 One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. arXiv preprint arXiv:1906.02773. Cited by: §2.2.2.
 Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pp. 80–88. Cited by: §2.1.
 ADMM-nn: an algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers. In ASPLOS, Cited by: §2.1.
 Comparing rewinding and finetuning in neural network pruning. arXiv preprint arXiv:2003.02389. Cited by: §1, §2.2.1.
 Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
 DropNet: reducing neural network complexity via iterative pruning. In International Conference on Machine Learning, pp. 9356–9366. Cited by: §2.1.
 Pruning neural networks without any data by iteratively conserving synaptic flow. arXiv preprint arXiv:2006.05467. Cited by: §1.
 Learning structured sparsity in deep neural networks. In Advances in neural information processing systems (NeurIPS), pp. 2074–2082. Cited by: §1, §2.1.
 Good subnetworks provably exist: pruning via greedy forward selection. In International Conference on Machine Learning, pp. 10820–10830. Cited by: §1.
 Drawing early-bird tickets: towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957. Cited by: §2.2.2.
 Systematic weight pruning of dnns using alternating direction method of multipliers. arXiv preprint arXiv:1802.05747. Cited by: Appendix E, §2.1, Figure 6, §5.
 To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §2.1.
Appendix A Revisit Lottery Tickets
We show the experimental results of MobileNetV2 on CIFAR10, and of ResNet20, VGG11, and MobileNetV2 on CIFAR100, over a range of sparsity ratios, with the masks generated from iterative pruning (Frankle and Carbin, 2018) at learning rates 0.01 and 0.1, respectively. We conduct each experiment five times (result variation shown in figures). We use the same number of training epochs (i.e., 150 epochs) for training the original DNNs from the initial weights (i.e., pretraining), training from randomly reinitialized weights with the mask (random reinitialization), and training from the initial weights with the mask (“winning ticket”).
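The three training setups above differ only in their starting weights. A minimal sketch of these starting points is given below; it is an illustration under stated assumptions rather than the experimental code. The names `theta_0`, `theta_0_prime`, `init_weights`, and `apply_mask` are hypothetical, the weights are a flat list instead of a DNN, and the mask is a hand-written placeholder (in the experiments it comes from iterative pruning of the pretrained network).

```python
import random


def init_weights(n, seed):
    """Draw n weights from a zero-mean Gaussian (hypothetical initializer)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(n)]


def apply_mask(mask, weights):
    """Elementwise product of a binary mask with a weight vector."""
    return [m * w for m, w in zip(mask, weights)]


n = 10
theta_0 = init_weights(n, seed=0)        # initialization used for pretraining
theta_0_prime = init_weights(n, seed=1)  # independent random reinitialization
mask = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]    # placeholder mask, not a real pruning result

# Each setup is then trained for the same number of epochs.
starts = {
    "pretraining": theta_0,                            # dense, no mask
    "winning_ticket": apply_mask(mask, theta_0),       # mask on the original init
    "random_reinit": apply_mask(mask, theta_0_prime),  # mask on a fresh init
}
```

The only difference between the “winning ticket” and the random reinitialization is which initialization the mask is applied to; the mask and the training budget are shared.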
CIFAR10 Results: Figure 1(a) and 1(b) illustrate the result on MobileNetV2 using CIFAR10. The pretrained MobileNetV2’s accuracy on CIFAR10 is 92.20% at initial learning rate 0.01, and 93.86% at initial learning rate 0.1.
CIFAR100 Results: Figure 1(c) and 1(d) show the result on MobileNetV2 using CIFAR100. The pretrained MobileNetV2’s accuracy on CIFAR100 is 73.10% at initial learning rate 0.01, and 74.76% at initial learning rate 0.1. Figure 1(e) and 1(f) show the result on ResNet20 for CIFAR100. The pretrained ResNet20’s accuracy on CIFAR100 is 63.10% at initial learning rate 0.01, and 67.15% at initial learning rate 0.1 (see the significant gap here). Figure 1(g) and 1(h) show the result on VGG11 for CIFAR100. The pretrained VGG11’s accuracy on CIFAR100 is 67.74% at initial learning rate 0.01, and 69.83% at initial learning rate 0.1. In the case of MobileNetV2 on CIFAR100 at the low learning rate, we observe that the “winning ticket” can outperform the random reinitialization but fails to restore the baseline accuracy (73.10%). This indicates that the low learning rate is not desirable. For all illustrated cases, the accuracy of the “winning ticket” is close to that of the random reinitialization at the initial learning rate 0.1, whereas at learning rate 0.01 the “winning ticket” can outperform the random reinitialization over different sparsity ratios. Note that there is a clear accuracy gap between the DNNs pretrained with the initial learning rate 0.1 and those pretrained with the initial learning rate 0.01.
From these experiments, the winning property exists at a low learning rate but does not exist at a relatively high learning rate. However, we would like to point out that the relatively high learning rate of 0.1 (which is, in fact, the standard learning rate on these datasets) results in notably higher accuracy in the pretrained DNNs than the low learning rate (MobileNetV2 on CIFAR10 93.86% vs. 92.20%, MobileNetV2 on CIFAR100 74.76% vs. 73.10%, VGG11 on CIFAR100 69.83% vs. 67.74%, ResNet20 on CIFAR100 67.15% vs. 63.10%). We should not draw conclusions based on the low (insufficient) learning rate in general.
Appendix B Weight Correlation in DNN Pretraining
We investigate the correlation indicator between the initial weights and the trained weights from DNN pretraining on VGG11, ResNet20, and MobileNetV2 on CIFAR10 and CIFAR100 under learning rates of 0.01 and 0.1, respectively.
We have performed experiments to derive the correlation indicator for DNN pretraining with different initial learning rates. Figure 2 illustrates the correlation indicator between the initial weights and the trained weights from DNN pretraining at learning rates of 0.01 and 0.1 on VGG11 and MobileNetV2 using CIFAR10/100, respectively. We use the same hyperparameters mentioned in the setup without additional training tricks. Figure 2(a) and 2(b) illustrate the result on VGG11 for CIFAR10/100. Figure 2(c) and 2(d) illustrate the result on MobileNetV2 for CIFAR10/100.
We can observe that the correlation indicator at a learning rate of 0.01 is notably higher than in the case of learning rate 0.1. This observation indicates that the large-magnitude initial weights are not fully updated at a low learning rate of 0.01, which means that the pretrained DNN is not well-trained. In the case of learning rate 0.1, the weights are sufficiently updated and thus largely independent of the initial weights (, where ), indicating a well-trained DNN.
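To make the notion of “large-magnitude weights not being fully updated” concrete, the following is a minimal sketch of an overlap-style indicator. It assumes the indicator measures the fraction of indices shared by the top-p largest-magnitude entries of two weight vectors; this is our reading of the discussion above, and the formal definition in the main text takes precedence. The function names and the synthetic weight vectors are hypothetical.

```python
import random


def top_p_indices(weights, p):
    """Indices of the top-p fraction of weights ranked by magnitude."""
    k = max(1, int(p * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return set(order[:k])


def overlap_indicator(w_a, w_b, p):
    """Fraction of top-p large-magnitude indices shared by w_a and w_b."""
    top_a = top_p_indices(w_a, p)
    top_b = top_p_indices(w_b, p)
    return len(top_a & top_b) / len(top_a)


rng = random.Random(0)
theta_0 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Stand-in for insufficient training: weights stay close to the initialization.
barely_updated = [w + rng.gauss(0.0, 0.1) for w in theta_0]
# Stand-in for sufficient training: weights are independent of the initialization.
fully_updated = [rng.gauss(0.0, 1.0) for _ in range(5000)]

high = overlap_indicator(theta_0, barely_updated, p=0.2)
low = overlap_indicator(theta_0, fully_updated, p=0.2)
```

Under this assumed definition, two independent weight vectors share roughly a fraction p of their top-p indices, so a value near p signals a lack of correlation, while a value well above p signals that large-magnitude initial weights have largely kept their rank.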
Appendix C Pruning & Finetuning
Consider the “pruning & finetuning” case formally defined in Section 3, in which we apply the mask to the trained weights from DNN pretraining, and then perform finetuning for another epochs. The final weights are denoted by . We study the accuracy of the “pruning & finetuning” result at different sparsity ratios, with learning rates of 0.01 and 0.1 on different DNNs using CIFAR10 and CIFAR100. We use the same hyperparameters as mentioned in the setup (). The accuracies of the pretrained DNNs with corresponding learning rates are also provided. Figure 3(a) and 3(b) illustrate the “pruning & finetuning” result on MobileNetV2 for CIFAR10 using learning rates of 0.01 and 0.1, respectively. Figure 3(c) and 3(d) illustrate the “pruning & finetuning” result on MobileNetV2 for CIFAR100 with learning rates of 0.01 and 0.1, respectively. In the case of MobileNetV2 for CIFAR100 with the initial learning rate 0.1, the “pruning & finetuning” scheme consistently performs better than the pretrained dense DNN (74.76%).
We can observe that the pruned and finetuned network achieves relatively high accuracy, close to or higher than the accuracy of the pretrained DNN at the same learning rate (even at the desirable learning rate 0.1).
Appendix D Sparse Correlation
We study the correlation between the sparse-trained weights and the pruned & finetuned weights to shed some light on the cause of the winning property. We illustrate the correlation on ResNet20, VGG11 and MobileNetV2 for CIFAR10/100 at learning rates 0.01 and 0.1, respectively. We show the correlation indicator between the “winning ticket” weights and the pruned & finetuned weights, and between the random reinitialization weights and the pruned & finetuned weights, at learning rates 0.01 and 0.1. Figure 4 illustrates the result of ResNet20 for CIFAR100 at learning rates 0.01 and 0.1. Figure 5 shows the result of VGG11 for CIFAR10/100 and Figure 6 shows the result of MobileNetV2 for CIFAR10/100 at learning rates 0.01 and 0.1, respectively. In the case of the high learning rate 0.1, the weight correlation between the “winning ticket” weights and the pruned & finetuned weights, and that between the random reinitialization weights and the pruned & finetuned weights, are similar (and minor) under different sparsity ratios.
From these results we can observe the positive correlation between and at the low learning rate, when the winning property exists. Such correlation is minor in the other cases.
Appendix E Different Pruning Algorithms
We explore different pruning algorithms on ResNet20, MobileNetV2 and VGG11 using CIFAR10/100. We use the desirable learning rate 0.1, epochs, and the same hyperparameters introduced in Section 4.1. We compare accuracy between pruning & finetuning (i.e., training (finetuning) from ) and the two sparse training scenarios, “winning ticket” (i.e., training from ) and random reinitialization (i.e., training from ), at different sparsity ratios. We investigate three pruning algorithms to derive mask : iterative pruning, ADMM-based pruning (Zhang et al., 2018) and one-shot pruning. Figure 10 and 10 illustrate the accuracy comparison on MobileNetV2 using CIFAR10 and CIFAR100, respectively. Figure 10 shows the result on ResNet20 using CIFAR100. Figure 10 illustrates the result on VGG11 for CIFAR100.
From these results, we can clearly observe the accuracy gap between pruning & finetuning and the two sparse training cases (lottery ticket setting). For MobileNetV2 on CIFAR100, with the masks generated from iterative pruning and ADMM-based pruning, the pruning & finetuning scheme can consistently outperform the pretrained dense DNN up to sparsity ratio 85%. Similar results can be observed on VGG11 using CIFAR100. Meanwhile, at sparsity ratio 0.39 (39%), the pruning & finetuning scheme with the mask generated from ADMM-based pruning achieves accuracy 76.04%, while the pretrained DNN’s accuracy is only 74.76% (under the desirable learning rate 0.1).
We observe the notable advantage of pruning & finetuning over the lottery ticket setting, even with a weak one-shot pruning algorithm for mask generation. Note that there is no accuracy difference between the two sparse training cases. Pruning & finetuning under ADMM-based pruning can restore the accuracy of the pretrained DNN at the highest sparsity ratio among the three pruning algorithms. Clearly, the consistent advantage of pruning & finetuning is attributed to the fact that mask is applied to pretrained weights instead of the initialized weights . In fact, information in is important for the sparse subnetwork to maintain the accuracy of the pretrained dense network. In other words, weights in the desirable sparse subnetwork should have correlation with instead of .
Further we evaluate the relative performance (accuracy) of these three pruning algorithms. We combine the above results and demonstrate the accuracy performances of pruning & finetuning and sparse training (“winning ticket” case), under all three pruning algorithms. Figure 11(a) and 11(b) show the overall accuracy performance comparison on MobileNetV2 using CIFAR10 and CIFAR100, respectively. Figure 12(a) shows the result on ResNet20 using CIFAR100 and Figure 12(b) shows the result on VGG11 using CIFAR100.
We observe the following order in accuracy: ADMM-based pruning on top, iterative pruning in the middle, and one-shot pruning the lowest. This order is the same for pruning & finetuning and sparse training. Note that the pruning algorithm is utilized to generate mask , while the other conditions are the same (i.e., , finetuning epochs on , or sparse training on ). Hence, the relative performance is attributed to the quality of mask generation. We can conclude that the selection of the pruning algorithm is critical in generating the sparse subnetwork, as the quality of mask generation plays a key role in the pruning scenario.
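To illustrate what “mask generation” means in its simplest form, the following sketch derives a mask by one-shot magnitude pruning. It assumes that one-shot pruning removes the smallest-magnitude fraction of the pretrained weights in a single step; the exact criterion used in the experiments is not restated here, and iterative and ADMM-based pruning are not shown. The function name and the example weights are made up for illustration.

```python
def one_shot_magnitude_mask(weights, sparsity):
    """Binary mask that zeroes the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


# Made-up stand-in for pretrained weights (not experimental values).
theta_T = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.02, 0.6]
mask = one_shot_magnitude_mask(theta_T, sparsity=0.5)

# Pruning & finetuning starts from the masked pretrained weights.
pruned = [m * w for m, w in zip(mask, theta_T)]
```

Whatever algorithm produces it, the mask is the only quantity the pruning algorithm contributes; the starting weights and the training budget are fixed by the scenario (pruning & finetuning or sparse training).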
Appendix F An Analysis from Weight Correlation Perspective
We provide the correlation between the final weights and the initialization, and between the final weights and the pretrained weights, for VGG11, ResNet20 and MobileNetV2 using CIFAR10/100. The training epoch and the initial learning rate is 0.1. The masks are generated by the ADMM-based pruning algorithm. Note that the initialization and the pretrained weights are dense models, while the final (pruned & finetuned) weights form a sparse model. To utilize the correlation indicator, we extend the correlation scenario of dense DNNs vs. dense DNNs to sparse DNNs vs. dense DNNs by restricting less than sparsity ratio of sparse DNNs. In this experiment, we consider weight correlation at and the sparsity ratio is 0.50 (50%) for the DNNs. The results are illustrated in Table 1. The results indicate that there is a lack of correlation between the final weights and the initialization, but there is a correlation between the final weights and the pretrained weights. This further strengthens the conclusion that weight correlation between the final trained weights and the weight initialization is not desirable.
Model  Dataset  Correlation with initialization  Correlation with pretrained weights
ResNet20  CIFAR100  20.36%  63.97%
MobileNetV2  CIFAR100  20.11%  64.71%
VGG11  CIFAR100  20.41%  49.32%
MobileNetV2  CIFAR10  20.26%  49.36%
VGG11  CIFAR10  20.21%  48.08%
Appendix G Comparison with Frankle et al. (2019)
The work of Frankle et al. (2019) suggests applying the mask to early-epoch weights, i.e., the weights obtained by training from the initialization for a small number of epochs, and then performing sparse training. This technique lies in between sparse training (training from the masked initialization) and pruning & finetuning (training from the masked pretrained weights). We evaluate the relative sparse training performance among the “winning ticket”, pruned & finetuned, and “rewind” cases under a desirable learning rate. We set , and the initial learning rate is 0.1. The same hyperparameters are adopted as introduced in Section 4.1. We study the accuracy comparison on MobileNetV2, ResNet20 and VGG11 on CIFAR100. We use the masks generated from the ADMM-based pruning algorithm. Figure 13 illustrates the accuracy comparison results of MobileNetV2, ResNet20 and VGG11 on CIFAR100. We can observe the following order in accuracy: pruned & finetuned on top, “rewind” in the middle, and “winning ticket” the lowest. As the three cases use the same number of training epochs (please note that the mask is generated later than both the initialization and the early-epoch weights), we suggest directly applying the mask to the pretrained weights and performing finetuning, instead of applying it to the early-epoch weights.
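The relation among the three starting points can be sketched on a toy problem. The example below is only an illustration under stated assumptions: a quadratic toy loss replaces DNN training, a simple magnitude criterion replaces ADMM-based pruning, and the names `theta_0`, `theta_k`, `theta_T`, `k`, and `T` as well as all numeric values are hypothetical. The point it shows is that the mask depends on the fully trained weights, so all three cases pay for pretraining before sparse training or finetuning can begin.

```python
def train(weights, target, epochs, lr=0.1):
    """Run `epochs` gradient steps on the toy loss 0.5 * sum((w - t)^2)."""
    w = list(weights)
    for _ in range(epochs):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def magnitude_mask(weights, sparsity):
    """Binary mask that zeroes the smallest-magnitude fraction of the weights."""
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(mask, weights):
    """Elementwise product of a binary mask with a weight vector."""
    return [m * w for m, w in zip(mask, weights)]


target = [1.0, -0.02, 0.5, 0.01, -0.8, 0.03]   # minimizer of the toy loss
theta_0 = [0.1, -0.2, 0.05, 0.3, 0.15, -0.25]  # toy initialization
T, k = 30, 3                                   # full pretraining vs. a few early epochs

theta_k = train(theta_0, target, epochs=k)     # early-epoch ("rewind") weights
theta_T = train(theta_0, target, epochs=T)     # pretrained weights
mask = magnitude_mask(theta_T, sparsity=0.5)   # mask needs the pretrained weights

starts = {
    "winning_ticket": apply_mask(mask, theta_0),
    "rewind": apply_mask(mask, theta_k),
    "pruned_finetuned": apply_mask(mask, theta_T),
}
```

In this toy setting the early-epoch weights sit between the initialization and the pretrained weights in distance to the minimizer, which mirrors the “in between” position of the rewinding technique described above.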