Modern deep neural networks have a large number of parameters, making them very hard to train. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network. Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top-1 accuracy of GoogLeNet by 1.1%. DSD also improved DeepSpeech and DeepSpeech2 WER by up to 2.0 and the NeuralTalk BLEU score by over 1.7. DSD is easy to use in practice: at training time, DSD incurs only one extra hyper-parameter: the sparsity ratio in the S step. At testing time, DSD doesn't change the network architecture or incur any inference overhead. The consistent and significant performance gain of DSD experiments shows the inadequacy of the current training methods for finding the best local optimum, while DSD effectively achieves superior optimization performance for finding a better solution. DSD models are available to download at https://songhan.github.io/DSD.
Deep neural networks (DNNs) have shown significant improvements in many application domains, ranging from computer vision (He et al. (2015)) to natural language processing (Luong et al. (2015)) and speech recognition (Amodei et al. (2015)). The abundance of more powerful hardware makes it easier to train complicated DNN models with large capacities. The upside of complicated models is that they are very expressive and can capture the highly non-linear relationship between features and output. The downside of such large models is that they are prone to capturing the noise, rather than the intended pattern, in the training dataset. This noise does not generalize to new datasets, leading to over-fitting and a high variance.
In contrast, simply reducing the model capacity would lead to the other extreme, causing a machine learning system to miss the relevant relationships between features and target outputs, leading to under-fitting and a high bias. Bias and variance are hard to optimize at the same time.
To solve this problem, we propose a dense-sparse-dense training flow (DSD), a novel training strategy that starts from a dense model from conventional training, then regularizes the model with sparsity-constrained optimization, and finally increases the model capacity by restoring and retraining the pruned weights. At testing time, the final model produced by DSD still has the same architecture and dimension as the original dense model, and DSD training doesn’t incur any inference overhead. We evaluated DSD training on 7 mainstream CNNs, RNNs and LSTMs and found consistent performance gains over their comparable counterparts for image classification, image captioning and speech recognition.
Initial Dense Training: The first D step learns the connection weights and importance via normal network training on the dense network. Unlike conventional training, however, the goal of this D step is not only to learn the values of the weights; we are also learning which connections are important. We use a simple heuristic to quantify the importance of the weights: their absolute value.
Sparse Training: The S step prunes the low-weight connections and trains a sparse network. We applied the same sparsity to all the layers, so there is a single hyper-parameter: the sparsity, the percentage of weights that are pruned to 0. For each layer $W$ with $N$ parameters, we sorted the parameters, picked the $k$-th largest one as the threshold $\lambda$, where $k = N \cdot (1 - \text{sparsity})$, and generated a binary mask to remove all the weights smaller than $\lambda$. Details are shown in Algorithm 1.
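The thresholding rule in this step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' Algorithm 1: the function name `prune_mask`, the toy weight list, ranking by absolute value, and reading the threshold index as k = N × (1 − sparsity) rounded to an integer are all assumptions made for the example.

```python
def prune_mask(weights, sparsity):
    """Return a 0/1 mask that keeps weights whose magnitude reaches the threshold.

    Assumption: the threshold is the k-th largest |w| with
    k = N * (1 - sparsity), so `sparsity` is the fraction pruned to zero.
    """
    n = len(weights)
    k = round(n * (1.0 - sparsity))  # number of weights that survive
    if k <= 0:
        return [0] * n
    magnitudes = sorted((abs(w) for w in weights), reverse=True)
    threshold = magnitudes[k - 1]  # k-th largest magnitude
    return [1 if abs(w) >= threshold else 0 for w in weights]


# Hypothetical layer with 8 weights, pruned to 50% sparsity.
layer = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08]
mask = prune_mask(layer, sparsity=0.5)
pruned = [w * m for w, m in zip(layer, mask)]
print(mask)
```

With tied magnitudes at the threshold, this sketch keeps every tied weight, so slightly more than k weights can survive; a real implementation would need to decide how to break such ties.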
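We remove small weights because of the Taylor expansion. The loss function and its Taylor expansion are shown in Equations (1) and (2).

$$Loss = f(x, W_1, W_2, W_3, \dots) \tag{1}$$

$$\Delta Loss = \frac{\partial Loss}{\partial W_i}\,\Delta W_i + \frac{1}{2}\,\frac{\partial^2 Loss}{\partial W_i^2}\,\Delta W_i^2 + \dots \tag{2}$$

We want to minimize the increase in $Loss$ when conducting hard thresholding on the weights, so we need to minimize the first and second terms in Equation (2). Since we are zeroing out parameters, $\Delta W_i$ is actually $W_i - 0 = W_i$. At a local minimum, where $\partial Loss / \partial W_i \approx 0$ and $\partial^2 Loss / \partial W_i^2 > 0$, only the second-order term matters. Since the second-order gradient $\partial^2 Loss / \partial W_i^2$ is expensive to calculate and $W_i$ has a power of 2, we use $|W_i|$ as the metric for pruning. A smaller $|W_i|$ means a smaller increase to the loss function.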
By retraining while enforcing the binary mask in each iteration, we converted a dense network into a sparse network that has a known sparsity support and can fully recover or even increase the original accuracy of the initial dense model under the sparsity constraint. The sparsity is the same for all the layers and can be tuned using validation. We find that a sparsity value between 25% and 50% generally works well in our experiments.
Final Dense Training: The final D step recovers the pruned connections, making the network dense again. These previously pruned connections are initialized to zero and the entire network is retrained with 1/10 of the original learning rate (since the sparse network is already at a good local minimum). Hyper-parameters like dropout ratios and weight decay remained unchanged. By restoring the pruned connections, the final D step increases the model capacity of the network and makes it possible to arrive at a better local minimum compared with the sparse model from the S step.
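To make the three phases concrete, the sketch below runs the whole flow on a tiny linear least-squares problem in plain Python. It only illustrates the mechanics (mask enforced after every update in the S step, pruned weights restarting from zero, and a 1/10 learning rate in the final D step); the toy data, the helper names `mse_grad`, `train` and `dsd`, and the use of full-batch gradient descent are assumptions for the example, not the authors' training setup.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of a linear model y = w . x."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return grad


def train(w, xs, ys, lr, steps, mask=None):
    """Gradient descent; when a mask is given it is re-applied every step."""
    for _ in range(steps):
        grad = mse_grad(w, xs, ys)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        if mask is not None:
            w = [wi * mi for wi, mi in zip(w, mask)]
    return w


def dsd(w0, xs, ys, sparsity, lr, steps):
    # D: ordinary dense training.
    dense = train(w0, xs, ys, lr, steps)
    # S: keep the k largest |w|, then retrain under the mask.
    k = max(1, round(len(dense) * (1.0 - sparsity)))
    threshold = sorted((abs(v) for v in dense), reverse=True)[k - 1]
    mask = [1 if abs(v) >= threshold else 0 for v in dense]
    start = [v * m for v, m in zip(dense, mask)]
    sparse = train(start, xs, ys, lr, steps, mask)
    # D: drop the mask; pruned weights restart from zero; lr is cut to 1/10.
    redense = train(sparse, xs, ys, lr / 10.0, steps)
    return mask, sparse, redense


# Hypothetical toy data: targets generated by the weights [2.0, 0.1, -1.5, 0.05].
xs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
      [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
ys = [2.0, 0.1, -1.5, 0.05, 2.1, -1.45]
mask, sparse, redense = dsd([0.0] * 4, xs, ys, sparsity=0.5, lr=0.5, steps=200)
print(mask, sparse, redense)
```

On this toy problem the two small weights are pruned in the S step and become non-zero again in the final D step. The example says nothing about generalization, which is what the experiments in the paper measure.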
To visualize the DSD training flow, we plotted the progression of weight distribution in Figure 2. The figure is plotted using GoogLeNet inception_5b3x3 layer, and we found this progression of weight distribution very representative for VGGNet and ResNet as well. The original distribution of weight is centered on zero with tails dropping off quickly. Pruning is based on absolute value so after pruning the large center region is truncated away. The un-pruned network parameters adjust themselves during the retraining phase, so in (c), the boundary becomes soft and forms a bimodal distribution. In (d), at the beginning of the re-dense training step, all the pruned weights come back again and are reinitialized to zero. Finally, in (e), the pruned weights are retrained together with the un-pruned weights. In this step, we kept the same learning hyper-parameters (weight decay, learning rate, etc.) for pruned weights and un-pruned weights. Comparing Figure (d) and (e), the un-pruned weights’ distribution almost remained the same, while the pruned weights became distributed further around zero. The overall mean absolute value of the weight distribution is much smaller. This is a good phenomenon: choosing the smallest vector that solves the learning problem suppresses irrelevant components of the weight vector (Moody et al. (1995)).
Dropout and DropConnect: DSD, Dropout (Srivastava et al. (2014)) and DropConnect (Wan et al. (2013)) can all regularize neural networks and prevent over-fitting. The difference is that Dropout and DropConnect use a random sparsity pattern at each SGD iteration, while DSD training learns with a deterministic, data-driven sparsity pattern throughout sparse training. Our experiments on VGG16, GoogLeNet and NeuralTalk show that DSD training can work together with Dropout.
Distillation: Model distillation (Hinton et al. (2015)) is a method that can transfer the learned knowledge from a large model to a small model, which is more efficient for deployment. This is another method that allows for performance improvements in neural networks without architectural changes.
Model Compression: Both model compression (Han et al. (2016, 2015)) and DSD training use network pruning (LeCun et al. (1990); Hassibi et al. (1993)). The difference is that the focus of DSD training goes beyond maintaining the accuracy: DSD is able to further improve the accuracy by considerable margins. Another difference is that DSD training doesn’t require aggressive pruning. A modestly pruned network (50%-60% sparse) can work well, whereas model compression requires aggressively pruning the network to achieve a high compression rate.
Sparsity Regularization and Hard Thresholding: The truncation-based sparse network has been theoretically analyzed for learning a broad range of statistical models in high dimensions (Langford et al. (2009); Yuan & Zhang (2013); Wang et al. (2014)). A similar training strategy with iterative hard thresholding and connection restoration was proposed by Jin et al. (2016) as concurrent but independent work. Sparsity-regularized optimization is heavily applied in Compressed Sensing (Candes & Romberg (2007)) to find optimal solutions to the inverse problems in highly under-determined systems based on the sparsity assumption.
We applied DSD training to different kinds of neural networks in different domains. We found that DSD training improved the accuracy for ALL these networks compared to the baseline networks that were not trained with DSD. The neural networks are chosen from CNNs, RNNs and LSTMs; the datasets covered image classification, speech recognition, and caption generation. For networks trained for ImageNet, we focus on GoogLeNet, VGG and ResNet, which are widely used in research and production. An overview of the networks, datasets and accuracy results is shown in Table 1. For the convolutional networks, we do not prune the first layer during the sparse phase, since it has only 3 channels and is very sensitive to pruning. The sparsity is the same for all the other layers, including convolutional and fully-connected layers. We do not change any other training hyper-parameters, and the initial learning rate at each stage is decayed the same as conventional training. The epochs are decided by when the loss converges. When the loss no longer decreases, we stop the training.
|Neural Network||Domain||Dataset||Type||Baseline||DSD||Abs. Imp.||Rel. Imp.|
Top-1 error. VGG/GoogLeNet baselines from Caffe model zoo, ResNet from Facebook.
BLEU score baseline from the NeuralTalk model zoo; higher is better.
Word error rate: DeepSpeech2 is trained with a portion of Baidu internal dataset with only max decoding to show the effect of DNN improvement.
We experimented with the BVLC GoogLeNet (Szegedy et al. (2015)) model obtained from the Caffe Model Zoo (Jia (2013)). It has 13 million parameters and 57 convolutional layers. We pruned each layer (except the first) to 30% sparsity. Retraining the sparse network gave some improvement in accuracy due to regularization, as shown in Table 2. After the final dense training step, GoogLeNet’s error rates were reduced by 1.12% (Top-1) and 0.62% (Top-5) over the baseline.
We compared DSD vs. conventional training for the same number of epochs by dropping the learning rate upon "convergence" and continuing to learn. The result is shown as LLR (lower the learning rate). The number of training epochs for LLR is equal to that of Sparse+re-Dense for a fair comparison. LLR cannot achieve the same accuracy as DSD.
|GoogLeNet||Top-1 Err||Top-5 Err||Sparsity||Epochs||LR|
We explored DSD training on VGG-16 (Simonyan & Zisserman (2014)), which is widely used in detection, segmentation and transfer learning. The baseline model is obtained from the Caffe Model Zoo (Jia (2013)). Similar to GoogLeNet, each layer is pruned to 30% sparsity. DSD training greatly reduced the error by 4.31% (Top-1) and 2.65% (Top-5), detailed in Table 3. DSD also wins over the LLR result by a large margin.
|VGG-16||Top-1 Err||Top-5 Err||Sparsity||Epochs||LR|
Deep Residual Networks (ResNets, He et al. (2015)) were the top performer in the 2015 ImageNet challenge. The baseline ResNet-18 and ResNet-50 models are provided by Facebook (2016). We prune to 30% sparsity uniformly, and a single DSD pass for these networks reduced top-1 error by 1.13% (ResNet-18) and 0.85% (ResNet-50), shown in Table 4. A second DSD iteration can further improve the accuracy. As a fair comparison, we continued training the original model by lowering the learning rate by another decade, but it could not reach the same accuracy as DSD, as shown in the LLR row.
|Top-1 Err||Top-5 Err||Top-1 Err||Top-5 Err||Sparsity||Epochs||LR|
Beyond CNNs, we evaluated DSD training on RNNs and LSTMs. We applied DSD to NeuralTalk (Karpathy & Fei-Fei (2015)), an LSTM for generating image descriptions. It uses a CNN as an image feature extractor and an LSTM to generate captions. To verify DSD training on LSTMs, we fixed the CNN weights and trained only the LSTM weights. The baseline NeuralTalk model we used is the flickr8k_cnn_lstm_v1.p downloaded from the NeuralTalk Model Zoo.
In the pruning step, we pruned all layers except the word embedding lookup table to 80% sparsity. We used a higher sparsity than in the CNN experiments, chosen based on the validation set of flickr8k. We retrained the remaining sparse network using the same weight decay and batch size as the original paper. The learning rate is tuned based on the validation set, shown in Table 5. Retraining the sparse network improved the BLEU score by [1.2, 1.1, 0.9, 0.7]. After getting rid of the sparsity constraint and retraining the dense network, the final results of DSD further improved the BLEU score by [2.0, 2.1, 2.0, 1.7] over baseline.
The BLEU score is not the sole criterion for measuring an auto-captioning system. We visualized the captions generated by DSD training in Figure 3. In the first image, the baseline model mistakes the girl for a boy and the girl’s hair for a rock wall; the sparse model can tell that it’s a girl; and the DSD model can further identify the swing. In the second image, DSD training can more accurately tell that the player is in a white uniform and trying to make a shot, rather than the baseline just saying he’s in a red uniform and playing with a ball. The performance of DSD training generalizes beyond these examples; more image caption results generated by DSD training are provided in the Appendix.
The DS1 model is a 5 layer network with 1 Bidirectional Recurrent layer, as described in Table 6. The training dataset used for this model is the Wall Street Journal (WSJ), which contains 81 hours of speech. The validation set consists of 1 hour of speech. The test sets are from WSJ’92 and WSJ’93 and contain 1 hour of speech combined. The Word Error Rate (WER) reported on the test sets for the baseline models is different from Amodei et al. (2015) due to two factors. First, in DeepSpeech2, the models were trained using much larger data sets containing approximately 12,000 hours of multi-speaker speech data. Secondly, WER was evaluated with beam search and a language model in DeepSpeech2; here the network output is obtained using only max decoding to show improvement in the neural network accuracy, and filtering out the other parts.
The first dense phase was trained for 50 epochs. In the sparse phase, weights are pruned in the Fully Connected layers and the Bidirectional Recurrent layer only (they hold the majority of the weights). Each layer is pruned to the same 50% sparsity and trained for 50 epochs. In the final dense phase, the pruned weights are initialized to zero and trained for another 50 epochs. For a fair comparison with the baseline, we used Nesterov SGD for training, reduced the learning rate with each re-training, and kept all other hyper-parameters unchanged. The learning rate is picked using our validation set.
We first wanted to compare the DSD results with a baseline model trained for the same number of epochs. The first 3 rows of Table 7 show the WER when the DSD model is trained for 50+50+50=150 epochs, and the 6th row shows the baseline model trained for 150 epochs (the same number of epochs as DSD). DSD training improves WER by 0.13 (WSJ ’92) and 1.35 (WSJ ’93) given the same number of epochs as conventional training.
Given a second DSD iteration, accuracy can be further improved. In the second DSD iteration, 25% of the weights in each layer are pruned away. Similar to the first iteration, the sparse model and subsequent dense model are further retrained for 50 epochs. The learning rate is scaled down for each re-training step. The results are shown in Table 7. Compared with the fully trained and converged baseline, the second DSD iteration improves WER by 0.58 (WSJ ’92) and 1.96 (WSJ ’93), a relative improvement of 2.07% (WSJ ’92) and 5.84% (WSJ ’93). So, we can do more DSD iterations (DSDSD) to further improve the performance, although adding more DSD iterations has diminishing returns.
|DeepSpeech 1||WSJ ’92||WSJ ’93||Sparsity||Epochs||LR|
|Dense Iter 0||29.82||34.57||0%||50||8e-4|
|Sparse Iter 1||27.90||32.99||50%||50||5e-4|
|Dense Iter 1||27.90||32.20||0%||50||3e-4|
|Sparse Iter 2||27.45||32.99||25%||50||1e-4|
|Dense Iter 2||27.45||31.59||0%||50||3e-5|
To show how DSD works on deeper networks, we evaluated DSD on the Deep Speech 2 (DS2) network, described in Table 8. This network has 7 Bidirectional Recurrent layers with approximately 67 million parameters, around 8 times larger than the DS1 model. A subset of the internal English training set is used. The training set is comprised of 2,100 hours of speech. The validation set is comprised of 3.46 hours of speech. The test sets are from WSJ’92 and WSJ’93, which contain 1 hour of speech combined.
Table 9 shows the results of the two iterations of DSD training. For the first sparse re-training, similar to DS1, 50% of the parameters from the Bidirectional Recurrent Layers and Fully Connected Layers are pruned. The Baseline model is trained for 60 epochs to provide a fair comparison with DSD training. The baseline model shows no improvement after 40 epochs. With one iteration of DSD training, WER improves by 0.44 (WSJ ’92) and 0.56 (WSJ ’93) compared to the fully trained baseline.
|Layer ID||0||1||2||3 - 8||9||10|
|DeepSpeech 2||WSJ ’92||WSJ ’93||Sparsity||Epochs||LR|
|Dense Iter 0||11.83||17.42||0%||20||3e-4|
|Sparse Iter 1||10.65||14.84||50%||20||3e-4|
|Dense Iter 1||9.11||13.96||0%||20||3e-5|
|Sparse Iter 2||8.94||14.02||25%||20||3e-5|
|Dense Iter 2||9.02||13.44||0%||20||6e-6|
Here we show again that DSD can be applied multiple times or iteratively for further performance gain. A second iteration of DSD training achieves better accuracy as shown in Table 9. For the second sparse iteration, 25% of parameters in the Fully Connected layer and Bidirectional Recurrent layers are pruned. Overall DSD training achieves relative improvement of 5.55% (WSJ ’92) and 7.44% (WSJ ’93) on the DS2 architecture. These results are in line with DSD experiments on the smaller DS1 network. We can conclude that DSD re-training continues to show improvement in accuracy with larger layers and deeper networks.
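The relative improvements quoted for DS1 and DS2 can be checked with a short calculation. The fully trained baseline WERs are not listed in the tables shown here, so the sketch below reconstructs them by adding the reported absolute gains to the corresponding DSD results; that reconstruction, and the helper name `relative_improvement`, are assumptions for the example.

```python
def relative_improvement(baseline_wer, dsd_wer):
    """Relative WER reduction, in percent of the baseline WER."""
    return 100.0 * (baseline_wer - dsd_wer) / baseline_wer


# (assumed baseline, final DSD WER). Baselines are derived, not reported:
# DS1: Dense Iter 2 WER + gain quoted for the second DSD iteration.
# DS2: Dense Iter 1 WER + gain quoted for one DSD iteration.
results = {
    "DS1 WSJ '92": (27.45 + 0.58, 27.45),
    "DS1 WSJ '93": (31.59 + 1.96, 31.59),
    "DS2 WSJ '92": (9.11 + 0.44, 9.02),
    "DS2 WSJ '93": (13.96 + 0.56, 13.44),
}

rel = {name: round(relative_improvement(b, d), 2) for name, (b, d) in results.items()}
for name, value in rel.items():
    print(f"{name}: {value}% relative WER improvement")
```

Under these assumptions the calculation gives 2.07% and 5.84% for DS1 and 5.55% and 7.44% for DS2, matching the figures quoted in the text.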
Dense-Sparse-Dense training changes the optimization process and improves the optimization performance with significant margins by nudging the network with pruning and re-densing. We conjecture that the following aspects contribute to the efficacy of DSD training.
Escape Saddle Point: Based on previous studies, one of the most profound difficulties of optimizing deep networks is the proliferation of saddle points (Dauphin et al. (2014)). Advanced optimization methods have been proposed to overcome saddle points. For a similar purpose but with a different approach, the proposed DSD method overcomes the saddle points through a pruning and re-densing framework. Pruning the converged model perturbs the learning dynamics and allows the network to jump away from saddle points, which gives the network a chance to converge at a better local or global minimum. This idea is also similar to Simulated Annealing (Hwang (1988)). While Simulated Annealing randomly jumps with decreasing probability on the search graph, DSD deterministically deviates from the converged solution achieved in the first dense training phase by removing the small weights and enforcing a sparsity support. Similar to Simulated Annealing, which can escape sub-optimal solutions multiple times in the entire optimization process, DSD can also be applied iteratively to achieve further performance gains, as shown in the Deep Speech results.
Significantly Better Minima: After escaping saddle points, DSD achieved better minima. We measured both the training loss and the validation loss: DSD training decreased the loss and error on both the training and the validation sets on ImageNet. We have also validated the significance of the improvements compared with conventional fine-tuning by t-test, shown in the appendix.
Regularized and Sparse Training: The sparsity regularization in the sparse training step moves the optimization to a lower-dimensional space where the loss surface is smoother and tends to be more robust to noise. More numerical experiments verified that both sparse training and the final DSD reduce the variance and lead to lower error (shown in the appendix).
Weight initialization plays a big role in deep learning (Mishkin & Matas (2015)). Conventional training has only one chance of initialization. DSD gives the optimization a second (or further) chance during the training process to re-initialize from a more robust sparse-training solution. We re-dense the network from the sparse solution, which can be seen as a zero initialization for the pruned weights. Other initialization methods are also worth trying.
Break Symmetry: The permutation symmetry of the hidden units makes the weights symmetrical, thus prone to co-adaptation in training. In DSD, pruning the weights breaks the symmetry of the hidden units associated with the weights, and the weights are asymmetrical in the final dense phase.
We introduce DSD, a dense-sparse-dense training framework that regularizes neural networks by pruning and then restoring connections. Our method learns which connections are important during the initial dense training. Then it regularizes the network by pruning the unimportant connections and retraining to a sparser and more robust solution with the same or better accuracy. Finally, the pruned connections are restored and the entire network is retrained again. This increases the dimensionality of parameters, and thus model capacity, from the sparser model.
DSD training achieves superior optimization performance. We highlight our experiments using GoogLeNet, VGGNet, and ResNet on ImageNet; NeuralTalk on Flickr-8K; and DeepSpeech-1&2 on the WSJ dataset. This shows that the accuracy of CNNs, RNNs, and LSTMs can significantly benefit from DSD training. Our numerical results and empirical tests show the inadequacy of current training methods, for which we have provided an effective solution.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899–925, 2013.
DSD training improves the baseline model performance by consecutively pruning and re-densing the network weights. We conducted more intensive experiments to validate that the improvements are significant and not due to any randomness in the optimization. In order to evaluate the significance, we repeated the baseline training, DSD training (retraining on the baseline) and conventional fine-tuning (retraining on the same baseline) multiple times. The statistical significance of the DSD improvements is quantified on the Cifar-10 dataset using ResNet.
Cifar-10 is a smaller image recognition benchmark with 50,000 32x32 color images for training and 10,000 for testing. Training on Cifar-10 is fast enough that it is feasible to conduct intensive experiments within a reasonable time to evaluate DSD performance. The baseline models were trained with the standard 164 epochs and an initial LR of 0.1, as recommended in the released code (Facebook, 2016). After 164 epochs, we obtained a model with an 8.26% top-1 testing error, which is consistent with the Facebook result. Initialized from this baseline model, we repeated re-training 16 times using DSD training and 16 times using conventional fine-tuning. DSD used a sparsity of 50% and 90 epochs (45 for sparse training and 45 for re-densing training). As a fair comparison, the conventional fine-tuning is also based on the same baseline model with the same hyper-parameters and settings (90 epochs: 45 at an LR of 0.001 and 45 at an LR of 0.0001).
Detailed results are listed below. On Cifar-10 with the ResNet-20 architecture, DSD training achieved an average Top-1 testing error of 7.89%, which is a 0.37% absolute improvement (4.5% relative improvement) over the baseline model and relatively 1.1% better than conventional fine-tuning. The experiment also shows that DSD training can reduce the variance of learning: the trained models after the sparse training and the final DSD training both have a lower standard deviation of errors compared with their counterparts using conventional fine-tuning.
|ResNet-20||Avg. Top-1 Err||SD. Top-1 Err||Sparsity||Epochs||LR|
|Direct Finetune (First half)||8.16%||0.08%||0%||45||1e-3|
|Direct Finetune (Second half)||7.97%||0.04%||0%||45||1e-4|
|DSD (First half, Sparse)||8.12%||0.05%||50%||45||1e-3|
|DSD (Second half, Dense)||7.89%||0.03%||0%||45||1e-4|
|Improve from baseline(abs)||0.37%||-||-||-||-|
|Improve from baseline(rel)||4.5%||-||-||-||-|
We used an unpaired t-test to compare the top-1 testing error rates of the models trained using DSD and conventional methods. The results demonstrate that DSD training achieves significant improvements over both the baseline model (p<0.001) and conventional fine-tuning (p<0.001).
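As a rough plausibility check of the comparison against conventional fine-tuning, the t statistic can be recomputed from the rounded means and standard deviations in the table above, taking 16 runs per method. The sketch assumes Welch's unequal-variance form of the unpaired test, since the exact variant the authors used is not stated (with equal group sizes, the pooled-variance statistic has the same value). The function name `welch_t` is hypothetical, and the result only approximates what the unrounded per-run errors would give.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group a
    var_b = sd_b ** 2 / n_b  # squared standard error of group b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Direct fine-tune (second half) vs. DSD (second half, dense); errors in percent.
t, df = welch_t(7.97, 0.04, 16, 7.89, 0.03, 16)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With these inputs the statistic is about 6.4 with roughly 28 degrees of freedom, comfortably above the two-sided critical value of about 3.7 for p < 0.001, which is consistent with the reported significance.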
Based on the results above, DSD significantly improves conventional baseline training and is also significantly better and more robust than conventional fine-tuning.