Scope of Reproducibility
We are reproducing Comparing Rewinding and Fine-tuning in Neural Networks, by Renda. In this work the authors compare three different approaches to retraining neural networks after pruning: 1) fine-tuning, 2) rewinding weights as in Frankle and 3) a new, original method involving learning rate rewinding, building upon Frankle. We reproduce the results of all three approaches, but we focus on verifying their approach, learning rate rewinding, since it is newly proposed and is described as a universal alternative to other methods.
We used CIFAR10 for most reproductions along with additional experiments on the larger CIFAR100, which extends the results originally provided by the authors. We have also extended the list of tested network architectures to include Wide ResNets (wrn). The new experiments led us to discover the limitations of learning rate rewinding which can worsen pruning results on large architectures.
We implemented the code ourselves in Python with TensorFlow 2, basing our implementation of the paper alone and without consulting the source code provided by the authors. We ran two sets of experiments. In the reproduction set, we have striven to exactly reproduce the experimental conditions ofRenda. We have also conducted additional experiments, which use other network architectures, effectively showing results previously unreported by the authors. We did not cover all originally reported experiments – we covered as many as needed to state the validity of claims. We used Google Cloud resources and a local machine with 2x RTX 3080 GPUs.
We were able to reproduce the exact results reported by the authors in all originally reported scenarios. However, extended results on larger Wide Residual Networks have demonstrated the limitations of the newly proposed learning rate rewinding – we observed a previously unreported accuracy degradation for low sparsity ranges. Nevertheless, the general conclusion of the paper still holds and was indeed reproduced.
What was easy
Re-implementation of the pruning and retraining methods was technically easy, as it is based on a popular and simple pruning criterion – magnitude pruning. Original work was descriptive enough to reproduce the results with satisfying results without consulting the code.
What was difficult
Not every design choice was mentioned in the paper, thus reproducing the exact results was rather difficult and required a meticulous choice of hyper-parameters. Experiments on ImageNet and WMT16 datasets were time consuming and required extensive resources, thus we did not verify them.
Communication with original authors
We did not consult the original authors, as there was no need to.
Neural network pruning is an algorithm leading to decrease the size of a network, usually by removing its connections or setting their weights to 0. This procedure generally allows obtaining smaller and more efficient models. It often turns out that these smaller networks are as accurate as their bigger counterparts or the accuracy loss is negligible. A common way to obtain such high quality sparse network is to prune it after the training has finished (rethinking, Frankle). Networks that have already converged are easier to prune than randomly initialized networks (rethinking, snip). After pruning, more training is usually required to restore the lost accuracy. Although there are a few ways to retrain the network, finetuning might be the easiest and most often chosen by researchers and practitioners (rethinking, Renda).
Lottery Ticket Hypothesis from Frankle formulates a hypothesis that for every dense neural network, there exists a smaller subnetwork that matches or exceeds results of the original. The algorithm originally used to obtain examples of such networks is iterative magnitude pruning with weight rewinding, and it is one of the methods of retraining after pruning compared in this work.
2 Scope of reproducibility
Renda formulated the following claims:
We aimed to compare three retraining approaches: 1) finetuning, 2) weight rewinding and 3) learning rate rewinding. Our general strategy that repeated across all experiments was as follows:
train a dense network to convergence,
prune the network using magnitude criterion: remove weights with smallest L1 norm,
retrain the network using selected retraining approach.
In the case of structured pruning: in step 2, we removed structures (rows or convolutional channels) with the smallest average L1 norm (structured), rather than removing separate connections.
In the case of iterative pruning: the network in step 1 was not randomly initialized, but instead: weights from a model from a previous iterative pruning step were loaded as the starting point.
We trained all our networks using Stochastic Gradient Descent with Nesterov Momentum. The learning rate was decreased in a piecewise manner during the training, but momentum coefficient was constant and equal to.
3.1 Model descriptions
In this report, we were focusing on an image recognition task using convolutional neural networks (lecun_convnet). For most of our experiments, we chose to use identical architectures as Renda to better validate their claims and double-check their results, rather than provide additional ones. Therefore, most of the used networks are residual networks, which were originally proposed in resnet. Additionally, to verify the general usefulness of pruning and retraining methods proposed in Renda we extend the list of tested network architectures to much larger wide residual networks from wrn.
3.1.1 Residual networks (ResNet)
Just as Renda, we chose to use the original version of ResNet as described in resnet rather than the more widely used, improved version (with preactivated blocks) from resnetv2. We created the models ourselves, using TensorFlow (tensorflow
) and Keras. We strove to replicate the exact architectures used byRenda and resnet and train them from scratch.
|Model||Trainable parameters||Kernel parameters||CIFAR-10||CIFAR-100|
|ResNet-20||272 282||270 896||92.46%||–|
|ResNet-56||855 578||851 504||93.71%||71.90%|
|ResNet-110||1 730 522||1 722 416||94.29%||72.21%|
Learning rate started with and was multiplied by twice, after 36 000 and 54 000 iterations. One training cycle was 72 000
iterations in total. For all batch normalization layers, we set the batch norm decay to 0.997, followingRenda which was also used in the original TensorFlow implementation111https://github.com/tensorflow/models/blob/r1.13.0/official/resnet/resnet_model.py. We initialize network’s weights with what is known as He uniform initialization from he_uniform. We regularize ResNets, during both training and finetuning, using penalty with
coefficient. In other words, the loss function (from which we calculate the gradients) looks as follows:
3.1.2 Wide Residual Networks (Wide ResNet, WRN)
WRN networks were introduced in wrn. They are networks created by simply increasing the number of filters in preactivated ResNet networks (resnetv2).
|Model||Trainable parameters||Kernel parameters||CIFAR-10|
|WRN-16-8||10 961 370||10 954 160||95.72%|
As Wide ResNets are newer and much larger than ResNets, hyper-parameters are slightly different. To choose them, we follow wrn. Learning rate starts with and multiplied by thrice: after 32 000, 48 000 and 64 000 iterations. Training lasts for 80 000 iterations. For all batch normalization layers, we use hyper-parameters from the newer TensorFlow implementation222https://github.com/tensorflow/models/blob/r2.5.0/official/vision/image_classification/resnet/resnet_model.py with batch norm decay set to 0.9. Following wrn, we use larger penalty for this network: . Finally, the loss function is as follows:
CIFAR-10 and CIFAR-100 are image classification datasets introduced in cifar10. Following Renda, we use all (50 000) training examples to train the model.
|Dataset||Training examples||Validation examples||Classes||Resolution|
|CIFAR-10||50 000||10 000||10||3232|
|CIFAR-100||50 000||10 000||100||3232|
We used a standard postprocessing for both CIFAR-10 and CIFAR-100 datasets (Renda, Frankle, wrn). During training and just before passing data to the model, we:
standardized the input by subtracting the mean and dividing by the std of RGB channels (calculated on training dataset),
randomly flipped in horizontal axis,
added a four pixel reflection padding,
randomly cropped the image to its original size.
During the validation, we did only the first step of the above.
3.3 Experimental setup and code
Our ready-to-use code, which includes experiment definitions, can be found at https://github.com/gahaalt/reproducing-comparing-rewinding-and-finetuning. It’s written using TensorFlow (tensorflow) version 2.4.2 in Python. More details are included in the repository.
3.4 Computational requirements
Recreating the experiments required a modern GPU, training all models on CPU was virtually impossible. Training time varies depending on a lot of factors: network version and size, exact version of the deep learning library, and even the operating system. In our case, using TensorFlow 2.4.2 on Ubuntu and a single RTX 3080 GPU, the smallest of the used models, ResNet-20, takes about 20 minutes to train on CIFAR-10 dataset. To replicate our experiments, training at least a single baseline network and then, separately, a single pruned network, is required. To reduce computational requirements, we reused one dense baseline for multiple compression ratios. Approximated training time requirements can be seen in the table below.
|Model||Dataset||Number of iterations||Iterations per second||Time for training cycle|
|ResNet-20||CIFAR-10||72 000||59.0||22 min|
|ResNet-56||CIFAR-10||72 000||28.6||43 min|
|ResNet-110||CIFAR-10||72 000||15.9||77 min|
|WRN-16-8||CIFAR-10||80 000||17.4||78 min|
For all our experiments in total, we used around 536 GPU hours.
4 Method description
We compare three methods of retraining after pruning. For all of them, the starting point is a network that was already trained to convergence, then pruned to a desired sparsity. The difference between the three retraining methods is what follows after it.
Fine-tuning is retraining with a small, constant learning rate – in our case, whenever fine-tuning was used, the learning rate was set to 0.001 as in Renda. We finetune the network for the same number of iterations as the baseline – 72 000 iterations in the case of the original ResNet architecture. In this method, such long retraining would not be necessary in practical applications, since the network converges much faster.
4.2 Weight rewinding
Weight rewinding restores the network’s weights from a previous point (possibly beginning) in the training history and then continues training from this point using the original training schedule – in our case a piecewise constant decaying learning rate schedule. When rewinding a network to iteration that originally trained for iterations: first prune the dense network that was trained for iterations. Then, for connections that survived, restore their values to -th iteration from the training history. Then train to the convergence for the remaining iterations.
4.3 Learning rate rewinding
Learning rate rewinding continues training with weights that have already converged, but restores the learning rate schedule to the beginning, just as if we were training from scratch, and then trains to the convergence once again. This reminds the cyclical learning rates from cyclical. Learning rate rewinding really is weight rewinding for , but the final retraining is always for iterations.
In most of our experiment, just as Renda, we investigate how does the trade-off between prediction accuracy and compression ratio look like. In one of the experiments (Table 5
) we verify only one compression ratio, but for the rest, we verify multiple. We report a median result out of 2 up to 12 trials for each compression ratio. To better utilize our compute capabilities, we decided to spend more training cycles in situations where there is no clear winner between the compared methods. On each plot, we include error bars showing 80% confidence intervals.
5.1 Results reproducing original paper
In this section, we include experiments that we successfully reproduced. They match the original ones within 1% error margin.
Across all scenarios where finetuning was tested, it was by far the worst of the three methods, which directly supports claim 1 (Section 2). Weight rewinding and learning rate rewinding most often are equally matched, but in some cases learning rate rewinding works a little better.
ResNets on CIFAR-10 dataset
5.2 Results beyond original paper
ResNets on CIFAR-100 dataset
WRN-16-8 on CIFAR-10 dataset
WRN-16-8 shows consistent behaviour – accuracy in the low sparsity regime is reduced in comparison to the baseline. In the case of iterative pruning, where each step is another pruning in the low sparsity regime, it leads to a large difference between the two retraining methods. Since for WRN-16-8 one-shot, low sparsity pruning shows a small regression in comparison to the baseline, this regression accumulates when pruning multiple times, as we do in iterative pruning. This can be seen in Figure 5.
For iterative pruning (figures 1, 2) we used a nonstandard step size of 30% per iterative pruning iteration, which was a way to reduce the computational requirements. We provide a comparison of our step size to the more commonly used 20%. We show that there is virtually no difference between both versions and the aforementioned catastrophic degradation occurs in both cases, as long as the step size is in the low sparsity regime.
We were able to confirm the general conclusion of Renda. Fine-tuning can mostly be replaced by other retraining techniques, e.g., by weight rewinding as done by Frankle. However, we have also shown in Figure 5 that the newly proposed learning rate rewinding is a poor choice when we are pruning larger networks – in our case that is WRN-16-8. We believe this should be further examined as there might exist a simple workaround to this problem – a retraining procedure in between weight rewinding and learning rate rewinding, which works in all cases. Furthermore, it would be interesting to see where exactly learning rate rewinding starts losing accuracy in comparison to weight rewinding and why this catastrophic accuracy degradation occurs. Perhaps, the reason for it not occurring with the original ResNet architecture is the degree to which the larger networks overtrain – larger networks tend to overtrain more. Such an overtrained network might not be a good starting point for the retraining.
The authors thank Polish National Science Center for funding under the OPUS-18 2019/35/B/ST6/04379 grant and the PlGrid consortium for computational resources.