DSD: Dense-Sparse-Dense Training for Deep Neural Networks

by   Song Han, et al.

Modern deep neural networks have a large number of parameters, making them very hard to train. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network. Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top1 accuracy of GoogLeNet by 1.1 by 1.1 DeepSpeech2 WER by 2.0 NeuralTalk BLEU score by over 1.7. DSD is easy to use in practice: at training time, DSD incurs only one extra hyper-parameter: the sparsity ratio in the S step. At testing time, DSD doesn't change the network architecture or incur any inference overhead. The consistent and significant performance gain of DSD experiments shows the inadequacy of the current training methods for finding the best local optimum, while DSD effectively achieves superior optimization performance for finding a better solution. DSD models are available to download at https://songhan.github.io/DSD.


page 6

page 12

page 13


SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech

Text-to-speech (TTS) models have achieved remarkable naturalness in rece...

Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization

Deep neural networks are typically highly over-parameterized with prunin...

Training Sparse Neural Network by Constraining Synaptic Weight on Unit Lp Sphere

Sparse deep neural networks have shown their advantages over dense model...

Sparse Iso-FLOP Transformations for Maximizing Training Efficiency

Recent works have explored the use of weight sparsity to improve the tra...

Sparse Networks from Scratch: Faster Training without Losing Performance

We demonstrate the possibility of what we call sparse learning: accelera...

Learning both Weights and Connections for Efficient Neural Networks

Neural networks are both computationally intensive and memory intensive,...

Top-KAST: Top-K Always Sparse Training

Sparse neural networks are becoming increasingly important as the field ...

Please sign up or login with your details

Forgot password? Click here to reset