Sparse Networks from Scratch: Faster Training without Losing Performance

by Tim Dettmers, et al.
University of Washington

We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving performance levels competitive with dense networks. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8% [...] show that our algorithm can reliably find the equivalent of winning lottery tickets from random initialization: our algorithm finds sparse configurations with 20% [...] counterparts. Sparse momentum also decreases the training time: it requires a single training run, with no re-training, and increases training speed up to 11.85x. In our analysis, we show that our sparse networks might be able to reach dense performance levels by learning more general features which are useful to a broader range of classes than dense networks.





1 Introduction

Current state-of-the-art neural networks need extensive computational resources to be trained and can have capacities of close to one billion connections between neurons (Vaswani et al., 2017; Devlin et al., 2018; Child et al., 2019). One solution that nature found to improve neural network scaling is to use sparsity: the more neurons a brain has, the fewer connections neurons make with each other (Herculano-Houzel et al., 2010). Similarly, for deep neural networks, it has been shown that sparse weight configurations exist which train faster and achieve the same errors as dense networks (Frankle and Carbin, 2019). However, currently, these sparse configurations are found by starting from a dense network, which is pruned and re-trained repeatedly – an expensive procedure.

In this work, we demonstrate the possibility of training sparse networks that rival the performance of their dense counterparts with a single training run – no re-training is required. We train from random initializations and maintain sparse weights throughout training while also speeding up the overall training time. We achieve this by developing sparse momentum, an algorithm which uses the exponentially smoothed gradient of network weights (momentum) as a measure of persistent errors to identify which layers are most efficient at reducing the error and which missing connections between neurons would reduce the error the most. Sparse momentum follows a cycle of (1) pruning weights with small magnitude, (2) redistributing weights across layers according to the mean momentum magnitude of existing weights, and (3) growing new weights to fill in missing connections which have the highest momentum magnitude.

We compare the performance of sparse momentum to compression algorithms and recent methods that maintain sparse weights throughout training. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet-2012. Sparse momentum also matches the performance of several dense baselines on MNIST and CIFAR-10. We estimate mean speedups of our sparse convolutional networks on CIFAR-10 for optimal sparse convolution algorithms and naive dense convolution algorithms compared to dense baselines. For sparse convolution, we estimate speedups between 3.50x and 11.85x and for dense convolution speedups between 1.16x and 1.45x. Finally, we present an analysis of the feature representations of our sparse networks. We find that networks trained by sparse momentum learn more general features which are useful to a broader range of classes than dense features which might explain why sparse networks can compete with dense networks.

2 Related Work

From Dense to Sparse Neural Networks: Work that focuses on creating sparse from dense neural networks has an extensive history. Earlier work focused on pruning via second-order derivatives (LeCun et al., 1989; Karnin, 1990; Hassibi and Stork, 1992) and heuristics which ensure efficient training of networks after pruning (Chauvin, 1988; Mozer and Smolensky, 1988; Ishikawa, 1996). Recent work is often motivated by the memory and computational benefits of sparse models that enable the deployment of deep neural networks on mobile and low-energy devices. A very influential paradigm has been the iterative (1) train-dense, (2) prune, (3) re-train cycle introduced by Han et al. (2015). Extensions to this work include: compressing recurrent neural networks and other models (Narang et al., 2017; Zhu and Gupta, 2018; Dai et al., 2018), continuous pruning and re-training (Guo et al., 2016), joint loss/pruning-cost optimization (Carreira-Perpinán and Idelbayev, 2018), layer-by-layer pruning (Dong et al., 2017), fast-switching growth-pruning cycles (Dai et al., 2017), and soft weight-sharing (Ullrich et al., 2017). These approaches often involve re-training phases which increase the training time. However, since the main goal of this line of work is a compressed model for mobile devices, reducing the run-time of these procedures is desirable but not a primary goal. This is contrary to our motivation. Despite the difference in motivation, we include many of these dense-to-sparse compression methods in our comparisons. Other compression algorithms include regularization (Louizos et al., 2018) and Bayesian methods (Louizos et al., 2017; Molchanov et al., 2017). For further details, see the survey of Gale et al. (2019).

Interpretation and Analysis of Sparse Neural Networks: Frankle and Carbin (2019) show that "winning lottery tickets" exist for deep neural networks – sparse initializations which reach predictive performance similar to that of dense networks and train just as fast. However, finding these winning lottery tickets is computationally expensive and involves multiple prune and re-train cycles starting from a dense network. Follow-up work concentrated on finding these configurations faster (Frankle et al., 2019; Zhou et al., 2019). In contrast, we reach dense performance levels with a sparse network from random initialization with a single training run while accelerating training.

Sparse Neural Networks Throughout Training: Methods that maintain sparse weights throughout training through a prune-redistribute-regrowth cycle are most closely related to our work. Bellec et al. (2018) introduce DEEP-R, which takes a Bayesian perspective and performs sampling for prune and regrowth decisions – sampling sparse network configurations from a posterior. While theoretically rigorous, this approach is computationally expensive and challenging to apply to large networks and datasets. Sparse evolutionary training (SET) (Mocanu et al., 2018) simplifies prune-regrowth cycles by using heuristics: (1) prune the smallest and most negative weights, (2) grow new weights in random locations. Unlike our work, where many convolutional channels are empty and can be excluded from computation, growing weights randomly fills most convolutional channels and makes it challenging to harness computational speedups during training without specialized sparse algorithms. SET also does not include the cross-layer redistribution of weights which we find to be critical for good performance, as shown in our ablation study. The most closely related work to ours is Dynamic Sparse Reparameterization (DSR) by Mostafa and Wang (2019), which includes the full prune-redistribute-regrowth cycle. However, similar to SET, DSR includes random regrowth which hampers the possibilities of speedups during training. More distantly related is Single-shot Network Pruning (SNIP) (Lee et al., 2019), which aims to find the best sparse network from a single pruning decision. The goal of SNIP is simplicity, while our goal is maximizing predictive and run-time performance. In our work, we compare against all four methods: DEEP-R, SET, DSR, and SNIP.

3 Method

3.1 Sparse Learning

We define sparse learning to be the training of deep neural networks which maintain sparsity throughout training while matching the predictive performance of dense neural networks. To achieve this, intuitively, we want to find the weights that reduce the error most effectively. This is challenging since most deep neural networks can hold trillions of different combinations of sparse weights. Additionally, during training, as feature hierarchies are learned, efficient weights might change gradually from shallow to deep layers. How can we find good sparse configurations? In this work, we follow a divide-and-conquer strategy that is guided by computationally efficient heuristics. We divide sparse learning into the following sub-problems which can be tackled independently: (1) pruning weights, (2) redistributing weights across layers, and (3) regrowing weights, as defined in more detail below.

Figure 1: Sparse momentum is applied at the end of each epoch: (1) take the magnitude of the exponentially smoothed gradient (momentum) of each layer and normalize to 1; (2) for each layer, remove $p$% of the weights with the smallest magnitude, where $p$ is the pruning rate; (3) across layers, redistribute the removed weights by adding weights to each layer proportionate to the momentum of each layer; within a layer, add weights starting from those with the largest momentum magnitude. Decay $p$.

3.2 Sparse Momentum

We use the mean magnitude of momentum of existing weights in each layer to estimate how efficient the average weight in each layer is at reducing the overall error. Intuitively, we want to take weights from less efficient layers and redistribute them to weight-efficient layers. The sparse momentum algorithm is depicted in Figure 1. In this section, we first describe the intuition behind sparse momentum and then present a more detailed description of the algorithm.

The gradient of the error with respect to a weight, $\frac{\partial E}{\partial W}$, yields the directions which reduce the error at the highest rate. However, if we use stochastic gradient descent, most weights of $\frac{\partial E}{\partial W}$ oscillate between small/large and negative/positive gradients with each mini-batch (Qian, 1999) – a good change for one mini-batch might be a bad change for another. We can reduce oscillations if we take the average gradient over time, thereby finding weights which reduce the error consistently. However, we want to value recent gradients, which are closer to the local minimum, more highly than the distant past. This can be achieved by exponentially smoothing $\frac{\partial E}{\partial W}$ – the momentum $M_i$:

$$M_i^{t+1} = \alpha M_i^t + (1 - \alpha) \frac{\partial E}{\partial W_i^t}$$

where $\alpha$ is a smoothing factor, $M_i^t$ is the momentum for the weight $W_i^t$ in layer $i$; $M$ is initialized with $M_i^0 = 0$.
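
To illustrate why exponential smoothing favors consistent gradients, the sketch below tracks two toy weights. It assumes the common update form (smoothing factor times old momentum, plus one minus the smoothing factor times the new gradient); the function name `update_momentum`, the value 0.9 and the gradient sequences are illustrative assumptions, not values from the experiments.

```python
def update_momentum(momentum, gradient, alpha=0.9):
    """One exponential-smoothing step (assumed form):
    alpha * old momentum + (1 - alpha) * new gradient."""
    return alpha * momentum + (1.0 - alpha) * gradient


# Toy gradients (assumed values): one weight whose gradient flips sign on
# every mini-batch, and one with a smaller but consistent gradient.
oscillating = [1.0, -1.0] * 10
consistent = [0.3] * 20

m_osc = 0.0  # momentum is initialized with zero
m_con = 0.0
for g_osc, g_con in zip(oscillating, consistent):
    m_osc = update_momentum(m_osc, g_osc)
    m_con = update_momentum(m_con, g_con)

# The consistent weight ends with the larger momentum magnitude even though
# its raw gradients are smaller in magnitude.
print(abs(m_osc) < abs(m_con))
```

Under these assumptions the oscillating gradient largely cancels itself out, which is the property the momentum magnitude is used for here.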

Momentum is efficient at accelerating the optimization of deep neural networks by identifying weights which reduce the error consistently. Similarly, the aggregated momentum of weights in each layer should reflect how good each layer is at reducing the error consistently. Additionally, the momentum of zero-valued weights – equivalent to missing weights in sparse networks – can be used to estimate how quickly the error would change if these weights were included in a sparse network.

The details of the algorithm are shown in Algorithm 1. Before training, we initialize the network with a certain sparsity $s$: we initialize the network as usual and then remove a fraction $s$ of the weights for each layer. During training, we apply sparse momentum after each epoch. We can break the sparse momentum algorithm itself into three major parts: (a) redistribution of weights, (b) pruning weights, (c) regrowing weights. In step (a), we calculate the weight redistribution proportions and in turn how many weights to regrow in each layer: for each layer, we take the mean of the element-wise momentum magnitude that belongs to all nonzero weights. We then sum-normalize these means across all layers to get the momentum contribution of each layer. Finally, we take this momentum contribution for each layer and multiply it by the overall number of removed weights to get the number of weights which we will regrow in each layer. In step (b), we prune a proportion $p$ (the pruning rate) of the weights with the lowest magnitude for each layer. In step (c), we regrow weights by enabling the gradient flow of zero-valued (missing) weights which have the largest momentum magnitude.

Additionally, there are two edge cases which we did not include in Algorithm 1 for clarity: (1) If we allocate more weights to be regrown than is possible for a specific layer, for example regrowing 100 weights for a layer of maximum 10 weights, we redistribute the excess number of weights equally among all other layers. (2) If a layer is dense and still growing, we reduce the pruning rate for that layer proportional to the sparsity: [...].
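
Edge case (1) can be sketched as follows. The helper `redistribute_excess` is hypothetical, and it makes two assumptions the text does not spell out: the excess is shared only among layers that still have free capacity, and the sharing is repeated if it causes another layer to overflow.

```python
def redistribute_excess(regrow, capacity):
    """Cap each layer's regrowth count at its free capacity and share the
    excess equally among layers that still have room (hypothetical helper)."""
    regrow = list(regrow)
    while True:
        excess = sum(max(r - c, 0) for r, c in zip(regrow, capacity))
        if excess == 0:
            return regrow
        # Clip overflowing layers to their capacity.
        regrow = [min(r, c) for r, c in zip(regrow, capacity)]
        open_layers = [i for i, (r, c) in enumerate(zip(regrow, capacity)) if r < c]
        if not open_layers:
            return regrow  # every layer is full; the excess cannot be placed
        share, remainder = divmod(excess, len(open_layers))
        for rank, i in enumerate(open_layers):
            regrow[i] += share + (1 if rank < remainder else 0)


# The example from the text: 100 weights allocated to a layer that can hold 10.
print(redistribute_excess([100, 5, 5], [10, 200, 200]))
```

The total number of regrown weights is preserved whenever the network has enough free capacity overall.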

After each epoch, we decay the pruning rate in Algorithm 1 in the same way learning rates are decayed. We find that a cosine decay schedule that anneals the pruning rate to zero on the last epoch yields the best validation error and we use this procedure for all experiments.
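
A cosine schedule of this kind can be sketched as below. The function name `cosine_pruning_rate`, the initial rate of 0.5 and the exact parameterization (half a cosine period over the whole run) are assumptions for illustration; the text only states that the rate is annealed to zero on the last epoch.

```python
import math


def cosine_pruning_rate(initial_rate, epoch, total_epochs):
    """Pruning rate after `epoch` of `total_epochs` epochs, annealed from
    `initial_rate` to zero along half a cosine period (assumed form)."""
    return 0.5 * initial_rate * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Rate at the start, the midpoint, and the end of a 100-epoch run.
rates = [cosine_pruning_rate(0.5, epoch, 100) for epoch in (0, 50, 100)]
print(rates)
```

With this form the rate stays close to its initial value early in training and falls off fastest around the midpoint.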

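Data: Layers i to k, each with momentum M, weight W, and binary Mask; pruning rate p

/* (a) Calculate mean momentum contributions of all layers. */
for each layer do
    take the mean momentum magnitude over the nonzero weights
end for
for each layer do
    sum-normalize the mean across layers to get the momentum contribution;
    multiply it by the total number of removed weights to get the regrowth count
end for
/* (b) Prune weights by finding the NumRemove-th smallest weight. */
for each layer do
    set Mask to zero for weights below that threshold    // Stop gradient flow.
end for
/* (c) Enable gradient flow of weights with largest momentum magnitude. */
for each layer do
    // Only consider the momentum of missing weights.
    set Mask to Mask | (missing weights with the largest momentum magnitude)    // | is the boolean OR operator
end for

Algorithm 1: Sparse momentum algorithm in NumPy notation.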
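
The three parts of the algorithm can be sketched in runnable form as below. This is a minimal illustration that follows the prose description, not the released implementation: plain Python lists stand in for NumPy arrays so the sketch has no dependencies, each layer is a flat list, the name `sparse_momentum_step` and the toy values are assumptions, and the two edge cases mentioned in the text are ignored.

```python
def sparse_momentum_step(weights, momenta, masks, prune_rate):
    """One prune / redistribute / regrow step over per-layer flat lists.
    `weights` and `masks` are modified in place. Assumes at least one layer
    has nonzero momentum on its nonzero weights."""
    # (a) Mean momentum magnitude of nonzero weights, sum-normalized over layers.
    means = []
    for w, m in zip(weights, momenta):
        mags = [abs(mj) for wj, mj in zip(w, m) if wj != 0.0]
        means.append(sum(mags) / len(mags) if mags else 0.0)
    total = sum(means)
    contributions = [x / total for x in means]

    # (b) Per layer, prune the `prune_rate` fraction of nonzero weights with
    # the smallest magnitude and stop their gradient flow via the mask.
    removed = 0
    for w, mask in zip(weights, masks):
        nonzero = sorted((j for j, wj in enumerate(w) if wj != 0.0),
                         key=lambda j: abs(w[j]))
        for j in nonzero[: int(prune_rate * len(nonzero))]:
            w[j] = 0.0
            mask[j] = False
            removed += 1

    # (c) Per layer, re-enable the missing weights with the largest momentum
    # magnitude; the layer's share of the removed weights sets the count.
    for m, mask, share in zip(momenta, masks, contributions):
        missing = sorted((j for j, on in enumerate(mask) if not on),
                         key=lambda j: abs(m[j]), reverse=True)
        for j in missing[: int(round(share * removed))]:
            mask[j] = True
    return contributions


# Toy two-layer network (assumed values); the second layer has three times
# the mean momentum magnitude of the first.
weights = [[0.5, -0.1, 0.0, 0.8, 0.0, 0.3, 0.0, 0.0],
           [0.2, 0.0, -0.9, 0.05, 0.0, 0.0, 0.6, 0.0]]
momenta = [[0.1, 0.1, 0.4, 0.1, 0.05, 0.1, 0.02, 0.01],
           [0.3, 0.9, 0.3, 0.3, 0.7, 0.01, 0.3, 0.5]]
masks = [[wj != 0.0 for wj in w] for w in weights]

contributions = sparse_momentum_step(weights, momenta, masks, prune_rate=0.5)
print(contributions)
print([sum(mask) for mask in masks])
```

In this toy run each layer starts with four active weights and loses two to pruning; the four removed weights are then regrown one in the first layer and three in the second, so the total number of active weights stays the same while capacity shifts toward the layer with the larger mean momentum.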

3.3 Experimental Setup

For comparison, we follow two different experimental settings from Lee et al. (2019) and Mostafa and Wang (2019): For MNIST (LeCun, 1998), we use a batch size of 100 and decay the learning rate by a factor of 0.1 every 25000 mini-batches. For CIFAR-10 (Krizhevsky and Hinton, 2009), we use standard data augmentations (horizontal flip, and random crop with reflective padding), a batch size of 128, and decay the learning rate every 30000 mini-batches. We train for 100 and 250 epochs on MNIST and CIFAR-10, respectively, use a learning rate of 0.1, stochastic gradient descent with Nesterov momentum of 0.9, and a weight decay of [...]. We use a fixed 10% of the training data as the validation set and train on the remaining 90%. We evaluate the test set performance of our models on the last epoch. For all experiments on MNIST and CIFAR-10, we report the standard errors. Our sample size is generally between 10 and 12 experiments per method/architecture/sparsity level with different random seeds for each experiment.

We use the modified network architectures of AlexNet, VGG16, and LeNet-5 as introduced by Lee et al. (2019). For the setup of Mostafa and Wang (2019) we use no validation set and for Wide Residual Networks (WRN) 28-2 (Zagoruyko and Komodakis, 2016) experiments on CIFAR-10 we start with the following layers as dense: First convolutional layer, last fully connected layer, and all downsample residual convolutional layers.

On ImageNet (Deng et al., 2009), we use ResNet-50 (He et al., 2016) with a stride of 2 for the 3x3 convolution in the bottleneck layers. We use a batch size of 256, input size of 224, momentum of 0.9, and weight decay of [...]. We train for 100 epochs and report validation set performance after the last epoch.

For all experiments, we keep biases and batch normalization weights dense. We additionally tune a single parameter: the initial pruning rate $p$. We search in the space {0.2, 0.3, 0.4, 0.5, 0.6, 0.7} and find that for most networks on MNIST and CIFAR-10 a pruning rate of [...] works best. We use this pruning rate throughout all experiments.

ImageNet experiments were run on 4x RTX 2080 Ti and all other experiments on individual GPUs.

Our software builds on PyTorch (Paszke et al., 2017) and is a wrapper for PyTorch neural networks with a modular architecture for growth, redistribution, and pruning algorithms. Using our software, any PyTorch neural network can be adapted to be a sparse momentum network with 5 lines of code. We will open-source our software along with trained models and individual experimental results.

Figure 2: Test set accuracy with 95% confidence intervals on MNIST and CIFAR at varying sparsity levels for LeNet 300-100 and WRN 28-2.

4 Results

Results in Table 1 and Table 2 follow the procedure of Lee et al. (2019). On MNIST, sparse momentum does very well for the LeNet-5 Caffe model, achieving equal performance to the dense baseline with 20% of the weights. For LeNet 300-100, sparse momentum outperforms baselines when using a moderate amount of weights, and with 20% of the weights it exceeds dense baseline performance. However, for 1-2% of weights, variational dropout is more effective.

On CIFAR-10 in Table 2, we can see that sparse momentum outperforms Single-shot Network Pruning (SNIP) for all models and can achieve the same performance level as dense models for VGG16-D and WRN 16-10 with just 5% of weights.

Figure 2 shows the results on MNIST and CIFAR which follow the experimental procedure of Mostafa and Wang (2019). For LeNet 300-100 on MNIST, we can see that sparse momentum outperforms all other methods. For CIFAR-10, sparse momentum is better than dynamic sparse in 4 out of 5 cases. However, in general, the confidence intervals for most methods overlap – this particular setup for CIFAR-10 with specifically selected dense weights seems to be too easy to differentiate performance between methods and we do not recommend this setup for future work. Sparse momentum outperforms all other methods on ImageNet (ILSVRC2012) as shown in Table 3.

                                              LeNet 300-100          LeNet-5 Caffe
Method                                        W (%)  Error (%)       W (%)  Error (%)
Dense                                         100.0  1.34 ± 0.011    100.0  0.58 ± 0.010
Opt. Brain Damage (LeCun et al., 1989)          8.0  2.0               8.0  2.7
Layer-wise Brain Damage (Dong et al., 2017)     1.5  2.0               1.0  2.1
Compression via optimization**                  1.0  3.2               1.0  1.1
Single-shot Net. Pruning (Lee et al., 2019)     2.0  2.4               1.0  1.1
Soft weight-sharing (Ullrich et al., 2017)      4.4  1.9               0.5  1.0
Dyn. Network Surgery (Guo et al., 2016)         1.8  2.0               0.9  0.9
Learn weights&connections (Han et al., 2015)    8.3  1.6               9.3  0.8
Single-shot Net. Pruning (Lee et al., 2019)     5.0  1.6               2.0  0.8
Variational Dropout (Molchanov et al., 2017)    1.5  1.9               0.4  0.8
Sparse Momentum                                 1.0  2.36 ± 0.044      1.0  0.83 ± 0.040
                                                2.0  1.99 ± 0.019      2.0  0.76 ± 0.022
                                                5.0  1.53 ± 0.020      5.0  0.69 ± 0.021
                                               20.0  1.26 ± 0.017*    20.0  0.60 ± 0.013*
* 95% confidence intervals overlap with or exceed dense model.
** (Carreira-Perpinán and Idelbayev, 2018).
Table 1: MNIST test set performance (± standard error). W indicates the density (%) of the weights.

Model       Dense Error (%)  SNIP Error (%)  Sparse Momentum Error (%)  Weights (%)
AlexNet-s   12.95 ± 0.056    14.99           14.35 ± 0.057              10
AlexNet-b   12.85 ± 0.068    14.50           13.93 ± 0.048              10
VGG16-C      6.49 ± 0.038     7.27            6.77 ± 0.056               5
VGG16-D      6.59 ± 0.050     7.09            6.49 ± 0.045*              5
VGG16-like   6.50 ± 0.054     8.00            6.71 ± 0.046               3
WRN-16-8     4.57 ± 0.022     6.63            5.66 ± 0.054               5
WRN-16-10    4.45 ± 0.040     6.43            4.59 ± 0.043*              5
WRN-22-8     4.26 ± 0.032     5.85            4.96 ± 0.042               5
* 95% confidence intervals overlap with dense model.
Table 2: CIFAR-10 test set error (± standard error) for dense baselines, Sparse Momentum and SNIP.
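
The starred entries can be checked with a few lines of arithmetic. The sketch below reads each Table 2 entry as a mean followed by its standard error and assumes a normal-approximation interval of 1.96 standard errors; the text does not state how the 95% intervals were computed, so the multiplier and the helper names `ci95` and `matches_dense` are assumptions.

```python
def ci95(mean, standard_error):
    """Normal-approximation 95% confidence interval (assumed: 1.96 * SE)."""
    half_width = 1.96 * standard_error
    return mean - half_width, mean + half_width


def matches_dense(dense, sparse):
    """True if the sparse error interval overlaps the dense interval or lies
    entirely below it (lower error is better)."""
    _, dense_high = ci95(*dense)
    sparse_low, _ = ci95(*sparse)
    return sparse_low <= dense_high


# (mean error, standard error) pairs for dense and sparse momentum, Table 2.
results = {
    "VGG16-C": ((6.49, 0.038), (6.77, 0.056)),
    "VGG16-D": ((6.59, 0.050), (6.49, 0.045)),
    "WRN-16-10": ((4.45, 0.040), (4.59, 0.043)),
}
flags = {name: matches_dense(dense, sparse) for name, (dense, sparse) in results.items()}
print(flags)
```

Under this assumption the check flags VGG16-D and WRN-16-10 but not VGG16-C, which agrees with the asterisks in the table.
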

                                                    Accuracy (%)
                                                    10% weights     20% weights
Model                                               Top-1  Top-5    Top-1  Top-5
Dense baseline (He et al., 2016)                    79.3   94.8     79.3   94.8
Static sparse (Mostafa and Wang, 2019)              67.8   88.4     71.6   90.4
Thin Dense (Mostafa and Wang, 2019)                 70.7   89.9     72.4   90.9
DeepR (Bellec et al., 2018)                         70.2   90.0     71.7   90.6
Compressed sparse (Mostafa and Wang, 2019)          70.3   90.0     73.2   91.5
Sparse Evolutionary Training (Mocanu et al., 2018)  70.4   90.1     72.6   91.2
Dynamic Sparse (Mostafa and Wang, 2019)             71.6   90.5     73.3   92.4
Sparse momentum                                     73.1   91.5     74.9   92.5
Table 3: ImageNet results for sparse momentum. Other results are from Mostafa and Wang (2019).

4.1 Speedups and Overhead

We estimated the speedups that could be obtained using sparse momentum in two ways: theoretical speedups for sparse convolution algorithms and practical speedups using dense convolutional algorithms. For our sparse convolution estimates, we first benchmark the time taken for each dense convolutional layer for a training run and scale it by the sparsity to estimate the speedups gained (equivalent to FLOPs saved). This reflects the maximum speedup for our sparse networks, which can be obtained if optimized sparse convolution algorithms are used. While a fast sparse convolution algorithm for coarse block structures exists for GPUs (Gray et al., 2017), optimal sparse convolution algorithms for fine-grained patterns do not and need to be developed to enable these speedups.

The second method measures practical speedups that can be obtained with naive, dense convolution algorithms which are available today. For dense convolution algorithms, we estimate speedups as follows: if a convolutional channel contains only zero-valued weights, we can remove this channel from the computation without any consequences and obtain speedups. We assume a linear speedup with an increasing number of empty convolutional channels. We use an RTX Titan and measure the run-time of a dense convolution in 32-bit. We then scale these measurements by the proportion of empty convolutional channels. Using this measure, we estimated the speedups for our models on CIFAR-10. The resulting speedups can be seen in Table 4. We see that dense convolution speedups are mostly dependent on width, with wider networks receiving larger speedups. Sparse convolution speedups are particularly pronounced for Wide Residual Networks (WRN). These results highlight the importance of developing optimized algorithms for sparse convolution.
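
Both estimates come down to the same arithmetic: each layer's measured dense time is scaled by the fraction of work that remains. The sketch below assumes the linear scaling described above; the function name `estimated_speedup`, the per-layer timings and the per-layer fractions are made-up illustrative values, not measurements from our experiments.

```python
def estimated_speedup(dense_times, kept_fraction):
    """Speedup over the dense network if each layer's dense run-time shrinks
    linearly to `kept_fraction` of itself."""
    dense_total = sum(dense_times)
    reduced_total = sum(t * k for t, k in zip(dense_times, kept_fraction))
    return dense_total / reduced_total


# Hypothetical three-layer network: measured dense time per layer (ms),
# fraction of nonzero weights, and fraction of non-empty channels.
layer_times = [4.0, 10.0, 6.0]
weight_density = [1.0, 0.1, 0.05]
nonempty_channels = [1.0, 0.8, 0.7]

# Sparse convolution estimate: time scales with the share of nonzero weights.
sparse_conv = estimated_speedup(layer_times, weight_density)
# Dense convolution estimate: only completely empty channels are skipped.
dense_conv = estimated_speedup(layer_times, nonempty_channels)
print(round(sparse_conv, 2), round(dense_conv, 2))
```

The gap between the two numbers reflects that a channel only drops out of a dense convolution once every one of its weights is zero, whereas a sparse kernel could skip each zero weight individually.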

Beyond speedups, we also measured the overhead of our sparse momentum procedure to be equivalent to a slowdown to 0.973x ± 0.029x compared to a dense baseline.

            Speedups
Model       Dense Convolution  Sparse Convolution  Weights (%)
AlexNet-s   1.45x               4.00x              10
VGG16-D     1.36x               3.51x               5
WRN 28-2    1.19x               5.82x               5
WRN 16-10   1.16x              11.85x               5
Table 4: Speedups for sparse networks on CIFAR-10 compared to dense baselines.

5 Analysis

5.1 Ablation Analysis

Our method differs from previous methods like Sparse Evolutionary Training and Dynamic Sparse Reparameterization in two ways: (1) redistribution of weights and (2) growth of weights. To better understand how these components contribute to the overall performance, we ablate these components on CIFAR-10 for VGG16-D and MNIST for LeNet 300-100 and LeNet-5 Caffe with 5% weights for all experiments. The results can be seen in Table 5.

Redistribution according to the magnitude of momentum increases the performance the most for the deeper networks VGG16-D and LeNet-5 Caffe. We hypothesize that the benefit of redistribution algorithms is proportional to the depth of the network: the deeper a network is, the more it relies on learning a hierarchy of features across layers – redistribution facilitates the learning of hierarchies by moving parameters from shallow layers to deeper layers as training progresses.

Momentum growth increases performance for LeNet 300-100 reliably. There is some evidence that random growth improves performance slightly for VGG16-D and LeNet-5 Caffe, but the confidence intervals overlap, and this observation might be a statistical anomaly. Furthermore, the use of random growth distributes parameters across all convolutional channels, and thus it is no longer possible to achieve speedups with dense convolutional algorithms – this is contrary to the main goal of our work. If one is interested in predictive performance, it is more reasonable to increase the number of parameters and use momentum growth, which would yield both better performance and provide speedups compared to random growth.

                              Test error (%)
Redistribution  Growth        VGG16-D        LeNet 300-100  LeNet-5 Caffe
momentum        momentum      6.49 ± 0.045   1.53 ± 0.020   0.69 ± 0.021
momentum        random        0.15 ± 0.054   0.07 ± 0.022   0.05 ± 0.011
None            momentum      0.79 ± 0.082   0.01 ± 0.018   0.32 ± 0.071
None            random        0.49 ± 0.060   0.11 ± 0.020   0.13 ± 0.013
Table 5: Ablation analysis for different growth and redistribution algorithm combinations.

Figure 3: Dense vs sparse histograms of class-specialization for convolutional channels on CIFAR-10. A class-specialization of 0.5 indicates that 50% of the overall activity comes from a single class.

5.2 Dense vs Sparse Features

Sparse networks need to use every weight effectively to build feature representations which are competitive with dense networks. In this section, we study the difference between sparse and dense features to further our understanding of what features look like that enable sparse learning.

For feature visualization, it is common to backpropagate activity to the inputs to be able to visualize what these activities represent (Simonyan et al., 2013; Zeiler and Fergus, 2014; Springenberg et al., 2014). However, in our case, we are more interested in the overall distribution of features for each layer within our network, and as such we want to look at the magnitude of the activity in a channel since – unlike feature visualization – we are not just interested in feature detectors but also discriminators. For example, a face detector would induce positive activity for a ‘person’ class but might produce negative activity for a ‘mushroom’ class. Both kinds of activity are useful.

With this reasoning, we develop the following convolutional channel-activation analysis: (1) pass the entire training set through the network and aggregate the magnitude of the activation in each convolutional channel separately for each class; (2) normalize across classes to receive for each channel the proportion of activation which is due to each class; (3) look at the maximum proportion of each channel as a measure of class specialization: a maximum proportion of $1/N_c$, where $N_c$ is the number of classes, indicates that the channel is equally active for all classes in the training set. The more the proportion deviates from this value, the more specialized the channel is for a particular class.
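
The three steps of this analysis can be sketched as follows. The function name `class_specialization` and the toy inputs are assumptions; the sketch also assumes that each example has already been reduced to one activation magnitude per channel (for instance by summing absolute values over spatial positions) and that every channel is active for at least one example.

```python
def class_specialization(activations, labels, num_classes):
    """activations[n][c]: activation of channel c for example n.
    Returns, per channel, the largest share of the total activation
    magnitude that is due to a single class."""
    num_channels = len(activations[0])
    # (1) Aggregate activation magnitude per channel, separately per class.
    totals = [[0.0] * num_classes for _ in range(num_channels)]
    for row, label in zip(activations, labels):
        for channel, value in enumerate(row):
            totals[channel][label] += abs(value)
    # (2) Normalize across classes and (3) keep the maximum proportion.
    return [max(per_class) / sum(per_class) for per_class in totals]


# Toy data: four examples, two channels, two classes.
activations = [[1.0, 3.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
labels = [0, 0, 1, 1]

specialization = class_specialization(activations, labels, num_classes=2)
print(specialization)
```

In this toy case the first channel is equally active for both classes and sits at the lower bound of one over the number of classes, while the second channel responds to a single class only and reaches the maximum of 1.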

Results of this method can be seen for AlexNet-s, VGG16-D, and WRN 28-2 on CIFAR-10 in Figure 3. We see the convolutional channels in sparse networks have lower class-specialization indicating they learn features which are useful for a broader range of classes compared to dense networks. This trend intensifies with depth. This suggests that sparse networks might be able to rival dense networks by learning more general features.

6 Conclusion and Future Work

We presented our sparse learning algorithm, sparse momentum, which uses the mean magnitude of momentum to grow and redistribute weights. We showed that sparse momentum outperforms other sparse algorithms on MNIST, CIFAR-10, and ImageNet. Additionally, sparse momentum can rival dense neural network performance while yielding speedups. In our analysis, we showed that sparse networks might be able to rival dense networks by learning more general features compared to dense models. We believe that further study of sparse networks and their representations can inform the design of architectures and deep feature learning algorithms. To fully utilize the improved run-time performance of sparse learning algorithms, future research should focus on specialized sparse convolution and sparse matrix multiplication algorithms.

7 Acknowledgements

This work was funded by a Jeff Dean – Heidi Hopper Endowed Regental Fellowship. We thank Ofir Press, Jungo Kasai, Omer Levy, Sebastian Riedel and Yejin Choi for helpful discussions. We thank Ofir Press, Jungo Kasai, Judit Acs, Zoey Chen, Ethan Perez, and Mohit Shridhar for their helpful reviews and comments.