Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization

02/15/2019 ∙ by Hesham Mostafa, et al. ∙ Cerebras Systems

Deep neural networks are typically highly over-parameterized with pruning techniques able to remove a significant fraction of network parameters with little loss in accuracy. Recently, techniques based on dynamic re-allocation of non-zero parameters have emerged for training sparse networks directly without having to train a large dense model beforehand. We present a parameter re-allocation scheme that addresses the limitations of previous methods such as their high computational cost and the fixed number of parameters they allocate to each layer. We investigate the performance of these dynamic re-allocation methods in deep convolutional networks and show that our method outperforms previous static and dynamic parameterization methods, yielding the best accuracy for a given number of training parameters, and performing on par with networks obtained by iteratively pruning a trained dense model. We further investigated the mechanisms underlying the superior performance of the resulting sparse networks. We found that neither the structure, nor the initialization of the sparse networks discovered by our parameter reallocation scheme are sufficient to explain their superior generalization performance. Rather, it is the continuous exploration of different sparse network structures during training that is critical to effective learning. We show that it is more fruitful to explore these structural degrees of freedom than to add extra parameters to the network.




1 Introduction

The ability of deep neural networks to effectively learn complex transformations by example and their superior generalization ability have been key to their success in a wide range of domains, from computer vision to machine translation to automatic speech recognition. Even though they are able to generalize well, deep networks learn more effectively when they are highly overparameterized (Brutzkus et al., 2017; Zhang et al., 2016). Emerging evidence has attributed this need for over-parameterization to the geometry of the high-dimensional loss landscapes of overparameterized deep neural networks (Dauphin et al., 2014; Choromanska et al., 2014; Goodfellow et al., 2014; Im et al., 2016; Wu et al., 2017; Liao & Poggio, 2017; Cooper, 2018; Novak et al., 2018), and to the implicit regularization properties of SGD (Brutzkus et al., 2017; Zhang et al., 2018a; Poggio et al., 2017), though a thorough theoretical understanding is not yet complete.

Several techniques are able to trim down the post-training model size, such as distillation methods (Bucilua et al., 2006; Hinton et al., 2015), reduced bit-precision methods (Hubara et al., 2016; McDonnell, 2018), low-rank decomposition methods (Jaderberg et al., 2014; Denil et al., 2013), and pruning methods (Han et al., 2015a; Zhang et al., 2018b). While these methods are highly effective in reducing the number of network parameters with little to no degradation in accuracy, they either operate on a pre-trained model or require the full over-parameterized model to be maintained during training. The success of these compression methods indicates that shallow and/or small networks contain parameter configurations that allow these networks to reach accuracies on par with the accuracy of bigger and deeper networks. This gives a tantalizing hint that over-parameterization is not a strict necessity and that alternative training or parameterization methods might be able to find these compact networks directly.

The problem of achieving training-time parameter efficiency (footnote 1) can be approached in a number of ways. Innovations in this direction for deep convolutional neural networks (CNNs) include the development of skip connections (He et al., 2015), the elimination of fully-connected layers in favor of global average pooling layers followed directly by the classifier layer (Lin et al., 2013), and depth-wise separable convolutions (Sifre & Mallat, 2014; Howard et al., 2017). These architectural innovations drastically improved the accuracy of CNNs at reduced parameter budgets.

Footnote 1: If model family $\mathcal{A}$ achieves a specific level of generalization performance with fewer parameters than model family $\mathcal{B}$, we say $\mathcal{A}$ is more parameter efficient than $\mathcal{B}$ at that performance level.

An alternative approach is reparameterizing an existing model architecture. In general, any differentiable reparameterization can be used to augment training of a given model. Let an original network (or a layer therein) be denoted by $y = f(x; \theta)$, parameterized by $\theta$. Reparameterize it by $\phi$ and $\psi$ through $\theta = g(\phi; \psi)$, where $g$ is differentiable w.r.t. $\phi$ but not necessarily w.r.t. $\psi$. Denote the reparameterized network by $f_\psi$, considering $\psi$ as metaparameters (footnote 2), so that $y = f_\psi(x; \phi) = f(x; g(\phi; \psi))$.

Footnote 2: We use the term metaparameter to refer to the parameters $\psi$ of the reparameterization function $g$. They differ from parameters $\phi$ in that they are not optimized through gradient descent, and they differ from hyperparameters in that they define meaningful features of the model which are required for inference.

We can train $f_\psi$ using gradient descent. If $\dim(\phi) + \dim(\psi) < \dim(\theta)$ and $f_\psi$ can be trained to match the generalization performance of $f$, then $f_\psi$ is a more efficient parameterization of the network.

Sparse reparameterization is a special case where $g$ is a linear projection; $\phi$ is the non-zero entries (i.e. “weights”) and $\psi$ their indices (i.e. “connectivity”) in the original parameter tensor $\theta$. Likewise, parameter sharing is a similar special case of linear reparameterization where $\phi$ is the tied parameters and $\psi$ is the indices at which each parameter is placed (with repetition) in the original parameter tensor $\theta$. Furthermore, if the metaparameters $\psi$ are fixed during the course of training, the reparameterization is static, whereas if $\psi$ is adjusted adaptively during training, we call it dynamic reparameterization.
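The two linear special cases above can be illustrated with a minimal sketch that rebuilds the original parameter tensor from the weights and the metaparameters. The function names, the flat-list tensor layout, and the way the sharing indices are encoded (one entry per position naming the tied parameter placed there) are assumptions made for this example only.

```python
def sparse_reparam(phi, psi, size):
    """theta = g(phi; psi): put weight phi[j] at index psi[j], zeros elsewhere."""
    if len(phi) != len(psi):
        raise ValueError("one index per non-zero weight is required")
    if len(set(psi)) != len(psi):
        raise ValueError("sparse connectivity indices must be unique")
    theta = [0.0] * size
    for weight, index in zip(phi, psi):
        theta[index] = weight
    return theta


def shared_reparam(phi, psi):
    """Parameter sharing: psi[i] names the tied parameter placed at position i."""
    return [phi[j] for j in psi]


# Sparse case: 3 weights living in a tensor with 8 positions.
theta_sparse = sparse_reparam(phi=[0.5, -1.2, 0.3], psi=[1, 4, 6], size=8)

# Sharing case: 2 tied parameters repeated across 6 positions.
theta_shared = shared_reparam(phi=[0.7, -0.1], psi=[0, 1, 1, 0, 0, 1])

# A dynamic scheme edits psi during training: here the connection at
# index 6 is dropped and a zero-initialized one appears at index 7.
theta_moved = sparse_reparam(phi=[0.5, -1.2, 0.0], psi=[1, 4, 7], size=8)
```

In this picture gradient descent only ever updates the entries of `phi`, while a static scheme keeps `psi` fixed and a dynamic scheme rewrites it between gradient steps.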

In this paper, we look at multiple parameterizations of deep residual CNNs (both static and dynamic). We build upon previous sparse dynamic parameterization schemes to develop a novel dynamic parameterization method that yields the highest parameter efficiency while training deep residual CNNs, outperforming previous static and dynamic parameterization methods. Our method dynamically changes the sparse network structure during learning and its superior performance implies that given a certain storage and computational budget to train a residual CNN, we are better off allocating part of the storage budget to describing and evolving the structure of the network, rather than spending it all on the parameters of a conventional dense network.

We show that the success of our dynamic parameterization method is not solely due to the final structure of the resultant sparse networks or a combination of final structure and initial weight values. Rather, training-time structural exploration is needed to reach the best accuracies, even if a high-performance structure and its initial values are known a priori. This implies that optimizing structure in tandem with weight optimization through gradient descent helps the latter find better-performing weights. Structure exploration thus improves the trainability of sparse deep residual CNNs.

2 Related work

Training of differentiably reparameterized networks has been proposed in numerous studies before.

Dense reparameterization  Several dense reparameterization techniques sought to reduce the size of fully connected layers. These include low-rank decomposition (Denil et al., 2013), fastfood transform (Yang et al., 2014), ACDC transform (Moczulski et al., 2015), HashedNet (Chen et al., 2015), low displacement rank (Sindhwani et al., 2015) and block-circulant matrix parameterization (Treister et al., 2018).

Note that similar reparameterizations were also used to introduce certain algebraic properties to the parameters for purposes other than reducing model sizes, e.g. to make training more stable as in unitary evolution RNNs (Arjovsky et al., 2015) and in weight normalization (Salimans & Kingma, 2016), to inject inductive biases (Thomas et al., 2018), and to alter (Dinh et al., 2017) or to measure (Li et al., 2018) properties of the loss landscape. All dense reparameterization methods to date are static.

Sparse reparameterization  Successful training of sparse reparameterized networks usually employs iterative pruning and retraining, e.g. Han et al. (2015b); Narang et al. (2017); Zhu & Gupta (2017) (footnote 3). Training typically starts with a large pre-trained model and sparsity is gradually increased during the course of fine-tuning. Training a small, static, and sparse model de novo typically fares worse than obtaining the sparse model through pruning a large dense model (Zhu & Gupta, 2017).

Footnote 3: Note that these, as well as all other techniques we benchmark against in this paper, impose non-structured sparsification on parameter tensors, yielding sparse models. There also exists a class of structured pruning methods that “sparsify” at channel or layer granularity, e.g. Luo et al. (2017) and Huang & Wang (2017), generating essentially small dense models. We describe a full landscape of existing methods in Appendix C.

Frankle & Carbin (2018) successfully identified small and sparse subnetworks post-training which, when trained in isolation, reached a similar accuracy as the enclosing big network. They further showed that these subnetworks were sensitive to initialization, and hypothesized that the role of overparameterization is to provide a large number of candidate subnetworks, thereby increasing the likelihood that one of these subnetworks will have the necessary structure and initialization needed for effective learning.

Most closely related to our work are dynamic sparse reparameterization techniques that emerged only recently. Like ours, these methods adaptively alter, by certain heuristic rules, the location of non-zero parameters during training. Sparse evolutionary training (SET) (Mocanu et al., 2018) used magnitude-based pruning and random growth at the end of each training epoch. NeST (Dai et al., 2017, 2018) iteratively grew and pruned parameters and neurons during training; parameter growth was guided by parameter gradient and pruning by parameter magnitude. Deep rewiring (Bellec et al., 2017) combined dynamic sparse parameterization with stochastic parameter updates for training. These methods were mostly concerned with sparsifying fully connected layers and applied to relatively small and shallow networks. We show that the method we propose in this paper is more scalable and computationally efficient than these previous approaches, while achieving better performance on deep convolutional networks.

3 Methods

We train deep CNNs where the majority of layers have sparse weight tensors. All sparse weight tensors are initialized at the same sparsity (percentage of zeros) level. We use a full (non-sparse) parameterization for all bias parameters and the parameters of batch normalization layers. Throughout training, we always maintain the same total number of non-zero parameters in the network. Parameters are moved within and across tensors in two phases: a pruning phase, followed immediately by a growth phase, as shown in Algorithm 1. We carry out the parameter re-allocation step described by Algorithm 1 every few hundred training iterations.

We use magnitude-based pruning based on an adaptive global threshold $H$, where all network weights with magnitude smaller than $H$ are pruned. $H$ adapts to roughly maintain a fixed number of pruned/grown parameters ($N_p$) during each re-allocation step. This makes pruning particularly efficient as no sorting operations are needed and only a single global threshold is used. After removing parameters during the pruning phase, the same number of zero-initialized parameters are re-distributed among the network tensors in the growth phase.

Intuitively, we should allocate more parameters to layers where they can more quickly reduce the training classification loss. To first order, we should allocate more parameters to layers whose parameters receive larger classification loss gradients. If a layer has been heavily pruned, this indicates that for a large portion of its parameters, the training loss gradients were not large enough or consistent enough to counteract the pull towards zero arising from weight regularization. We thus use a simple heuristic in which the available parameters to grow are allocated preferentially to layers having a higher percentage of non-zero weights, as shown in Algorithm 1. The parameters allocated to a layer are randomly placed in the non-active (zero) positions of its weight tensor. See Appendix F for a more detailed description of the algorithm.

To simplify exposition, Algorithm 1 does not include guards against rounding errors that can introduce a discrepancy between the number of pruned parameters and grown parameters. It also omits the special case where more parameters are allocated to a tensor than there are non-active positions. In that case, the extra parameters that do not fit in the now fully dense tensor are re-distributed among the other sparse tensors.
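The two omitted guards can be sketched as follows. This is one possible way to handle them, not necessarily the one used in our implementation: the function name, the round-down-then-top-up strategy, and the even split used when no non-zero weights remain are all assumptions of the example.

```python
def allocate_growth(nonzero_counts, free_slots, num_to_grow):
    """Split num_to_grow new weights across tensors.

    nonzero_counts[i]: non-zero weights left in tensor i after pruning.
    free_slots[i]:     inactive (zero) positions available in tensor i.
    The returned counts sum to num_to_grow, or to the total free capacity
    if that is smaller.
    """
    n = len(nonzero_counts)
    total_nonzero = sum(nonzero_counts)
    if total_nonzero == 0:
        shares = [1.0 / n] * n  # assumption: fall back to an even split
    else:
        shares = [r / total_nonzero for r in nonzero_counts]

    # Proportional allocation, rounded down and capped by free capacity.
    grow = [min(int(num_to_grow * s), cap) for s, cap in zip(shares, free_slots)]

    # Whatever is left (rounding loss plus overflow from full tensors) is
    # handed out one weight at a time, largest share first.
    remaining = min(num_to_grow, sum(free_slots)) - sum(grow)
    order = sorted(range(n), key=lambda i: shares[i], reverse=True)
    while remaining > 0:
        for i in order:
            if remaining == 0:
                break
            if grow[i] < free_slots[i]:
                grow[i] += 1
                remaining -= 1
    return grow


# Tensor 0 earns most of the budget but has only 5 free positions,
# so the overflow spills into tensor 1.
overflow_case = allocate_growth([90, 10], [5, 200], num_to_grow=50)

# Three equal tensors: rounding down gives 3 + 3 + 3, the guard adds the 10th.
rounding_case = allocate_growth([1, 1, 1], [10, 10, 10], num_to_grow=10)
```

With these guards the number of grown weights equals the number of pruned weights whenever enough inactive positions exist, which keeps the total count of non-zero parameters constant across re-allocation steps.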

The most closely related algorithm to ours is SET (Mocanu et al., 2018). Our algorithm differs from SET in two respects: we use an adaptive threshold for pruning instead of pruning a fixed fraction of weights at each re-allocation step, and we re-allocate parameters across layers during training and do not impose a fixed sparsity level on each layer. The first difference leads to reduced computational overhead as it obviates the need for sorting operations, while the second difference leads to better performing networks, as shown in the next section, and to the ability to train extremely sparse networks, as shown in Appendix E.

Input:
  $N_p$: Target number of parameters to prune (fixed)
  $\delta$: Fractional tolerance of $N_p$ (fixed)
  $H$: Pruning threshold (initialized at $10^{-3}$)
  $\{W_i\}_{i=1}^{N}$: All sparse weight tensors in network

for $i = 1$ to $N$ do
    $(W_i, k_i) \leftarrow$ prune $W_i$ at threshold $H$      (Prune all weights in $W_i$ with magnitude less than $H$. $k_i$ is the number of weights that have just been pruned.)
    $R_i \leftarrow$ number of non-zero weights in $W_i$ after pruning
end for
$K \leftarrow \sum_i k_i$      (Total number of pruned weights)
$R \leftarrow \sum_i R_i$      (Total number of non-zero weights)
if $K < (1 - \delta) N_p$ then
    $H \leftarrow 2H$
else if $K > (1 + \delta) N_p$ then
    $H \leftarrow H / 2$
end if
for $i = 1$ to $N$ do
    Grow $(R_i / R) K$ zero-initialized weights at random in $W_i$
end for

Algorithm 1: Reallocate non-zero parameters within and across weight tensors
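For concreteness, one re-allocation step can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than our actual implementation: each tensor is represented as a flat list of values plus a boolean mask, the function name is hypothetical, and the factor of two by which the threshold is raised or lowered is a choice made for this example. Like Algorithm 1, the sketch rounds allocations down and omits the guards against rounding errors and overflow.

```python
import random


def reallocate(tensors, threshold, target_pruned, tolerance, rng):
    """Run one prune-then-grow step in place and return the new threshold.

    tensors: list of (values, mask) pairs; mask[j] is True where position j
    holds an active weight. target_pruned plays the role of N_p and
    tolerance the role of delta.
    """
    pruned_counts, nonzero_counts = [], []
    for values, mask in tensors:
        pruned = 0
        for j in range(len(values)):
            if mask[j] and abs(values[j]) < threshold:
                mask[j], values[j] = False, 0.0
                pruned += 1
        pruned_counts.append(pruned)       # k_i
        nonzero_counts.append(sum(mask))   # R_i

    total_pruned = sum(pruned_counts)      # K
    total_nonzero = sum(nonzero_counts)    # R

    # Adapt the global threshold so K tracks N_p within the tolerance band.
    if total_pruned < (1 - tolerance) * target_pruned:
        threshold *= 2.0
    elif total_pruned > (1 + tolerance) * target_pruned:
        threshold /= 2.0

    # Growth phase: tensor i receives about (R_i / R) * K new weights,
    # zero-initialized and placed at random inactive positions.
    for (values, mask), r_i in zip(tensors, nonzero_counts):
        if total_nonzero == 0:
            break
        inactive = [j for j, active in enumerate(mask) if not active]
        n_grow = min(int(total_pruned * r_i / total_nonzero), len(inactive))
        for j in rng.sample(inactive, n_grow):
            mask[j], values[j] = True, 0.0
    return threshold


# Toy usage: two tensors at roughly 50% sparsity.
rng = random.Random(0)
tensors = []
for size in (200, 50):
    mask = [rng.random() < 0.5 for _ in range(size)]
    values = [rng.gauss(0.0, 0.1) if m else 0.0 for m in mask]
    tensors.append((values, mask))

before = sum(sum(mask) for _, mask in tensors)
new_threshold = reallocate(tensors, threshold=0.01, target_pruned=10,
                           tolerance=0.1, rng=rng)
after = sum(sum(mask) for _, mask in tensors)
```

Note that the step needs only one pass over the weights and a handful of scalar comparisons; no sorting of weight magnitudes is involved. An explicit mask is kept because newly grown weights are zero-valued and would otherwise be indistinguishable from inactive positions.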

We evaluate our method, together with other static and dynamic parameterization methods, on the deep residual CNNs shown in Table 1. We did not include AlexNet (Krizhevsky et al., 2012) and VGG-style networks (Simonyan & Zisserman, 2014) as their parameter efficiency is inferior to that of residual nets. Such a setup makes the improvement in parameter efficiency achieved by our dynamic parameterization method more relevant. Dynamic sparse parameterization was applied to all weight tensors of convolutional layers (with the exception of downsampling convolutions and the first convolutional layer acting on the input image), while all biases and parameters of normalization layers were kept dense. Global sparsity is defined in relation to the sparse tensors only, i.e., it is the number of non-active (zero) positions in all sparse tensors as a fraction of the number of parameters in dense tensors having the same dimensions.
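The definition of global sparsity can be made concrete with a short sketch. The function name and the mask-based representation are assumptions of this example; dense tensors are excluded simply by not being passed in.

```python
def global_sparsity(sparse_masks):
    """Fraction of inactive positions across the sparse tensors only.

    sparse_masks: one boolean list per sparse tensor, True = active weight.
    Biases, normalization parameters and other dense tensors are left out.
    """
    total = sum(len(mask) for mask in sparse_masks)
    active = sum(sum(mask) for mask in sparse_masks)
    return (total - active) / total


masks = [
    [True, False, False, False],                # 1 of 4 positions active
    [True, True, False, False, False, False],   # 2 of 6 positions active
]
s = global_sparsity(masks)  # 7 inactive positions out of 10
```

The individual tensors in this toy example have sparsities 0.75 and about 0.67, which shows how per-layer sparsity can vary freely while the global value stays fixed.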

At a specific global sparsity $s$, we compared our method (dynamic sparse) against six baselines:

  1. Full dense: original large and dense model, with $N$ parameters;

  2. Thin dense: original model with thinner layers, such that it had the same size as dynamic sparse;

  3. Static sparse: original model initialized at sparsity level $s$ with a random sparsity pattern, then trained with connectivity (sparsity pattern) fixed;

  4. Compressed sparse: compression of the original model by iterative pruning and retraining to target sparsity $s$ (Zhu & Gupta, 2017);

  5. DeepR: sparse model trained by using Deep Rewiring (Bellec et al., 2017);

  6. SET: sparse model trained by using Sparse Evolutionary Training (SET) (Mocanu et al., 2018).

Appendix B compares our method against an additional static parameterization method based on weight tying: hash nets (Chen et al., 2015).

Using the number of parameters to compare network sizes across sparse and non-sparse models can be misleading if the extra information needed to specify the connectivity structure in the sparse models is not taken into account. We thus compare models that have the same size in bits, instead of the same number of weights. While the number of bits needed to specify the connectivity is implementation dependent, we assume a simple scheme where one bit is used for each position in the weight tensors to indicate whether this position is active (contains a non-zero weight) or not. A sparse tensor is fully defined by this bit-mask, together with the non-zero weights. This scheme was previously used in CNN accelerators that natively operate on sparse structures (Aimar et al., 2018). For a network with $N$ 32-bit weights in dense tensors, a sparse version with sparsity $s$ would have a size of $32(1-s)N + N$ bits and would thus be equivalent to a thinner dense network with $(1-s)N + N/32$ weights. We use this formula to obtain the size of the only non-sparse baseline we have, thin dense, which will thus have more weights than the equivalently-sized sparse models.
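As a worked example of this size accounting, the sketch below computes the storage of a sparse tensor under the one-bit-per-position mask scheme and the number of dense 32-bit weights that fit in the same budget. The function names and the tensor size of one million positions are hypothetical and chosen only for illustration.

```python
def sparse_size_bits(num_positions, sparsity, bits_per_weight=32):
    """Bits for the non-zero weights plus a one-bit-per-position mask."""
    nonzero = round((1.0 - sparsity) * num_positions)
    return bits_per_weight * nonzero + num_positions


def equivalent_dense_weights(num_positions, sparsity, bits_per_weight=32):
    """Number of dense weights that occupy the same number of bits."""
    return sparse_size_bits(num_positions, sparsity, bits_per_weight) / bits_per_weight


n = 1_000_000
bits = sparse_size_bits(n, sparsity=0.9)               # 32 * 100000 + 1000000
dense_equiv = equivalent_dense_weights(n, sparsity=0.9)
```

At sparsity 0.9 the sparse tensor holds 100,000 trainable weights, while the equally sized thin dense model gets 131,250, which is why the thin dense baseline has more SGD-trainable weights than the sparse models it is compared against.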

Recent work (Liu et al., 2018) shows that training small networks from scratch can match the accuracy of networks obtained through post-training pruning of larger networks. The authors show this is almost always the case if the small networks are trained long enough. To address potential concerns that the performance of our dynamic parameterization scheme can be matched by networks with static parameterization if the latter were trained for more epochs, we always train the thin dense and static sparse baselines for double the number of epochs used to train our dynamic sparse models. This ensures that any superior accuracy achieved by our method cannot merely be due to its ability to converge faster during training. As we show in the results section, our dynamic parameterization scheme incurs minimal computational overhead, which means the thin dense and static sparse baselines are trained using significantly more computational resources than dynamic sparse.

Note that compressed sparse is a compression method that initially trains a large dense model and iteratively prunes it down, whereas all other baselines maintain the same model size throughout training. For compressed sparse, we train the large dense model for the same number of epochs used for our dynamic sparse, and then iteratively and gradually prune it down across many additional training epochs. Compressed sparse thus trains for more epochs than dynamic sparse. See Appendix A for the hyperparameters used in the experiments.

Dataset | Model | Architecture | # Parameters
CIFAR10 | WRN-28-2 (Zagoruyko & Komodakis, 2016) | …, GlobalAvgPool, F10 | 1.5M
Imagenet | Resnet-50 (He et al., 2015) | C64/7×7-2, MaxPool/3×3-2, [C64/1×1, C64/3×3, C256/1×1]×3, [C128/1×1, C128/3×3, C512/1×1]×4, [C256/1×1, C256/3×3, C1024/1×1]×6, [C512/1×1, C512/3×3, C2048/1×1]×3, GlobalAvgPool, F1000 | 25.6M

For brevity, architecture specifications omit batch normalization and activations. Pre-activation batch normalization was used in all cases. Convolutional (C) layers are specified with output size and kernel size, and max pooling (MaxPool) layers with kernel size. Brackets enclose residual blocks postfixed with repetition numbers; the downsampling convolution in the first block of a scale group is implied.

Table 1: Datasets and models used in experiments

4 Experimental results

Figure 1: WRN-28-2 on CIFAR10. (a) Test accuracy plotted against number of trainable parameters in the sparse models for different methods. Dashed lines are used for the full dense model and for models obtained through compression, whereas all methods that maintain a constant parameter count throughout training and inference are represented by solid lines. Circular symbols mark the median of 5 runs, and error bars are the standard deviation. Parameter counts include all trainable parameters, i.e., parameters in sparse tensors plus all other dense tensors, such as those of batch normalization layers. (b) Breakdown of the final sparsities of the parameter tensors in the three residual blocks that emerged from our dynamic sparse parameterization algorithm (Algorithm 1) at different levels of global sparsity.

WRN-28-2 on CIFAR10:  We ran experiments using a Wide Resnet model WRN-28-2 (Zagoruyko & Komodakis, 2016) trained to classify CIFAR10 images (see Appendix A for details of implementation). We varied the level of global sparsity and evaluated the accuracy of different dynamic and static parameterization training methods. As shown in Figure 1, static sparse and thin dense significantly underperformed the compressed sparse model as expected, whereas our method, dynamic sparse, performed slightly better than compressed sparse on average. Deep rewiring lagged significantly behind all other methods. While the performance of SET was on par with compressed sparse, it lagged behind dynamic sparse at high sparsity levels. At low sparsity levels SET largely closed the gap to compressed sparse. Even though the statically parameterized models static sparse and thin dense were trained for twice the number of epochs, they still failed to match the performance of our method or SET. Keep in mind that thin dense even had more SGD-trainable weights than all the sparse models, as described in the methods section.

Our dynamic parameterization method automatically adjusts the sparsity of the parameter tensors in different layers by moving parameters across layers. We looked at the sparsity patterns that emerged at different sparsity levels and observed consistent patterns: (a) larger parameter tensors tended to be sparser than smaller ones, and (b) deeper layers tended to be sparser than shallower ones. Figure 1 shows a breakdown of the sparsity levels in the different residual blocks at different global sparsity levels.

Resnet-50 on Imagenet:  We also experimented with the Resnet-50 bottleneck architecture (He et al., 2015) trained on Imagenet (see Appendix A for details of implementation). We tested two global sparsity levels, 0.8 and 0.9 (Table 2). Models obtained using our method (dynamic sparse) outperformed models obtained using all dynamic and static parameterization methods, and even slightly outperformed models obtained through post-training compression of a large dense model. We also list in Table 2 two representative methods of structured pruning (see Appendix C), ThiNet (Luo et al., 2017) and Sparse Structure Selection (Huang & Wang, 2017), which, consistent with recent criticisms (Liu et al., 2018), underperformed static dense baselines. As with the previous experiments on WRN-28-2, reliable sparsity patterns across the parameter tensors in different layers emerged from dynamic parameter reallocation during training, displaying the same empirical trends described above (Figure 2).

Columns, by final overall sparsity (# parameters): 0.8 (7.3M) | 0.9 (5.1M) | 0.0 (25.6M, full dense)
Rows:
  Thin dense
  Static sparse
  DeepR (Bellec et al., 2017)
  SET (Mocanu et al., 2018)
  Dynamic sparse
  Compressed sparse (Zhu & Gupta, 2017)
  ThiNet (Luo et al., 2017), at 8.7M parameter count
  SSS (Huang & Wang, 2017), at 15.6M parameter count

Numbers in square brackets are differences from the full dense baseline. Numbers in roman type are results of our experiments, and italicized ones are taken directly from the original papers. Performance of two structured pruning methods, ThiNet and Sparse Structure Selection (SSS), is also listed for comparison (below the double line, see Appendix C for discussion of their relevance); note the difference in parameter counts.

Table 2: Test accuracy% (top-1, top-5) of Resnet-50 trained on Imagenet

Figure 2: Layer-wise breakdown of the final parameter tensor sparsities of Resnet-50 trained on Imagenet. (a) At overall sparsity of 0.8. (b) At overall sparsity of 0.9.

Method | WRN-28-2 on CIFAR10 | Resnet-50 on Imagenet
DeepR | 4.466 ± 0.358 | 5.636 ± 0.218
SET | 1.087 ± 0.049 | 1.009 ± 0.002
Dynamic sparse | 1.083 ± 0.051 | 1.005 ± 0.004

Table 3: Median wall-clock training epoch times for WRN-28-2 and Resnet-50 for different dynamic re-parameterization schemes (from 25 epochs). Results are relative to the epoch time of a sparse network trained without dynamic parameter re-allocation. WRN-28-2 runs were on a single Titan Xp GPU, while Resnet-50 runs used four Titan Xp GPUs.

Computational overhead of dynamic reparameterization:  We assessed the additional computational cost incurred by reparameterization steps (Algorithm 1) during training, and compared our method with existing dynamic sparse reparameterization techniques, DeepR and SET (Table 3). Because both SET and our method reallocate parameters only intermittently (every few hundred training iterations), the computational overhead was negligible for the experiments presented here (footnote 4). DeepR, however, requires adding noise to gradient updates as well as reallocating parameters every training iteration, which led to a significantly larger overhead.

Footnote 4: Because of the rather negligible overhead, the reduced operation count thanks to the elimination of sorting operations did not amount to a substantial improvement over SET on GPUs. Our method’s advantage over SET lies in its ability to produce better sparse models and to reallocate free parameters automatically (see Appendix E).

Disentangling the effects of dynamic re-parameterization:  Our dynamic parameter re-allocation method consistently yields better accuracy than static parameterization methods even though the latter were trained for more epochs and, in the case of thin dense, had more SGD-trainable parameters. The most immediate hypothesis for explaining this phenomenon is that our method is able to discover suitable sparse network structures that can be trained to reach high accuracies. To investigate whether the high performance of networks discovered by our method can be solely attributed to their sparse structure, we did the following experiment using WRN-28-2 trained on CIFAR10: after training with our dynamic re-allocation method, the structure (i.e. positions of non-zero entries in sparse parameter tensors) of the final network was retained, and this network was randomly re-initialized and re-trained with the structure fixed (green bars in Fig. 3). Even though the network has the same structure as the final network found by our method, its training failed to reach the same accuracy.

One might argue that it is not just the network structure, but also its initialization that allows it to reach high accuracies (Frankle & Carbin, 2018). To assess this argument, we used the final network structure found by our method as described above, and initialized it with the same initial values used when training using our method. As shown in Fig. 3 (blue bars), the combination of final structure and original initialization still fell significantly short of the level of accuracy achieved by our dynamic parameter re-allocation method, and the performance was not significantly different from training the same network with random initialization (green bars). As a control, we also show the static sparse case where the sparse network structure and its initialization were both random (red bars in Fig. 3). Unsurprisingly, these networks performed the worst. Similar trends are observed for Resnet-50 trained on Imagenet, as shown in Fig. 3. All static networks, whether their structure and initialization were random or copied from networks trained using our dynamic parameterization method, were trained for double the number of epochs compared to our method.

These results indicate that the dynamics of parameter re-allocation are themselves important for learning, as the success of the networks our method discovers cannot be solely attributed to their structure or initialization. For WRN-28-2, we experimented with stopping the parameter re-allocation mechanism (i.e., fixing the network structure) at various points during training. As shown in Fig. 4, dynamic parameter re-allocation does not need to be active for the entire course of training, but only for some initial epochs.

Figure 3: Comparison of training using our dynamic parameterization method against training a number of related statically parameterized networks. All statically parameterized networks were trained for double the number of epochs used by our method. (a) WRN-28-2 on CIFAR10. Mean and standard deviation from 5 runs. (b) Resnet-50 on Imagenet. Single runs.
Figure 4: Test accuracies of sparse WRN-28-2 trained on CIFAR10 when dynamic parameter re-allocation was switched off at different epochs. Results are shown for two global sparsity levels: 0.8 and 0.9. Horizontal bands indicate the accuracy of the compressed sparse baselines where the band widths represent the standard deviation. For all data points, we ran training for 200 epochs (independently of when dynamic parameter re-allocation was stopped). Mean and standard deviation from 5 trials.

5 Discussion

In this work, we investigated the following problem: given a small, fixed budget of parameters for a deep residual CNN throughout training time, how should it be trained to yield the best generalization performance? While this is an open-ended question, we showed that dynamic parameterization methods can achieve significantly better accuracies than static methods for the same model size. Dynamic parameterization methods have received relatively little attention, with the two principal techniques so far (SET and DeepR) applied only to relatively small and shallow networks. We showed that these techniques are indeed applicable to deep CNNs, with SET consistently outperforming DeepR while incurring a lower computational cost. We proposed a dynamic parameterization method that adaptively allocates free parameters across the network based on a simple heuristic. This method yields better accuracies than previous dynamic parameterization methods and it outperforms all the static parameterization methods we tested. In Appendix B, we show that our method outperforms another static parameterization method based on hash nets (Chen et al., 2015). As we show in Appendix E, our method is also able to train networks at extreme sparsity levels where previous static and dynamic parameterization methods often fail catastrophically.

High-performance sparse networks are often obtained through post-training pruning of dense networks. Recent work looks into how sparse networks can be trained directly using post-hoc information obtained from a pruned model. Liu et al. (2018) argue that it is the structure alone of the pruned model that matters, i.e., training a model of the same structure, starting with random weights, can reach the same level of accuracy as the pruned model. Yet other results (Frankle & Carbin, 2018) suggest that a standalone sparse network can only be trained effectively if it copies both the pruned network structure and the pruned network’s initial weights when it was part of the dense model. We performed experiments in the same spirit: we trained statically parameterized networks that copy only the structure, and networks that copy both the structure and the initial weight values, of the sparse networks trained using our scheme. Interestingly, neither managed to match the performance of sparse networks trained using our dynamic parameterization scheme. The value of our dynamic parameter re-allocation scheme thus goes beyond discovering the correct sparse network structure; the dynamics of the structure exploration process itself help gradient descent converge to better weights. Further work is needed to explain the mechanism underlying this phenomenon. One hypothesis is that the discontinuous jumps in network response when the structure changes provide the ‘jolts’ necessary to pull the network from a sharp minimum that generalizes badly (Keskar et al., 2016).

Structural degrees of freedom are qualitatively different from the degrees of freedom introduced by over-parameterization. The latter can be directly exploited using gradient descent. Structural degrees of freedom, however, are explored using non-differentiable heuristics that only interact indirectly with the dynamics of gradient descent, for example when gradient descent pulls weights towards zero, causing the associated connections to be removed. Our results indicate that for residual CNNs, and as far as model size is concerned, we are better off allocating bits to describe and explore structural degrees of freedom in a reasonably sparse network than allocating them to conventional weights.

Besides model size, computational efficiency is also a primary concern. Current mainstream compute architectures such as CPUs and GPUs have trouble efficiently handling unstructured sparsity patterns. To maintain the standard CNN structure, various pruning techniques prune a trained model at the level of entire feature maps. Recent evidence suggests the resulting networks perform no better than conventionally-trained thin networks (Liu et al., 2018), calling into question the value of such coarse pruning. In appendix D, we show that our method can easily extend to operate at an intermediate level of structured sparsity, that of kernel slices. Imposing this sparsity structure causes performance to degrade, and the resulting networks perform on par with statically parameterized thin dense networks when the latter are trained for double the number of epochs.

In summary, our results indicate that for deep residual CNNs, it is possible to train sparse models directly to reach generalization performance comparable to sparse networks produced by iterative pruning of large dense models. Moreover, our dynamic parameterization method results in models that significantly outperform equivalent-sized dense models. Exploring structural degrees of freedom during training is key and our method is the first that is able to fully explore these degrees of freedom using its ability to move parameters within and across layers. Our results do not contradict the common wisdom that extra degrees of freedom are needed while training deep networks, but they point to structural degrees of freedom as an alternative to the degrees of freedom introduced by over-parameterization.


Appendix A Details of implementation

We implemented all models and reparameterization mechanisms using PyTorch. Experiments were run on GPUs, and all sparse tensors were represented as dense tensors filtered by a binary mask. (Footnote 5: This is merely an implementation choice for ease of experimentation given available hardware and software; it did not save memory through sparsity. On a computing substrate optimized for sparse linear algebra, our method is expected to realize the promised memory efficiency.)
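The masked-dense representation described above can be illustrated with a minimal sketch. It uses plain Python lists rather than PyTorch tensors, and the function name `masked_sgd_step`, the learning rate and the example values are illustrative assumptions, not the authors' code.

```python
def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One plain SGD step on a flat weight list.

    Positions where mask == 0 are forced back to zero after the update,
    so the dense storage behaves like a sparse tensor.
    """
    return [m * (w - lr * g) for w, g, m in zip(weights, grads, mask)]


weights = [0.5, 0.0, -0.3, 0.0]
mask = [1, 0, 1, 0]             # 1 = free parameter, 0 = pruned position
grads = [0.2, 0.7, -0.1, 0.4]   # dense gradients, also defined at pruned positions
weights = masked_sgd_step(weights, grads, mask)
```

Note that the full dense list is still stored, which is why this representation by itself saves no memory.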

Hyperparameter | LeNet-300-100 on MNIST | WRN-28-2 on CIFAR10 | Resnet-50 on Imagenet

Hyperparameters for training
Number of training epochs | 100 | 200 | 100
Mini-batch size | 100 | 100 | 256
Learning rate schedule (epoch range: learning rate) | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100: | 1 - 60: ; 61 - 120: ; 121 - 160: ; 161 - 200: | 1 - 30: ; 31 - 60: ; 61 - 90: ; 91 - 100:
Momentum (Nesterov) | 0.9 | 0.9 | 0.9
regularization multiplier | 0.0001 | 0.0 | 0.0
regularization multiplier | 0.0 | 0.0005 | 0.0001

Hyperparameters for sparse compression (compressed sparse) (Zhu & Gupta, 2017)
Number of pruning iterations () | 10 | 20 | 20
Number of training epochs between pruning iterations | 2 | 2 | 2
Number of training epochs post-pruning | 20 | 10 | 10
Total number of pruning epochs | 40 | 50 | 50
Learning rate schedule during pruning (epoch range: learning rate) | 1 - 20: ; 21 - 30: ; 31 - 40: | 1 - 25: ; 25 - 35: ; 36 - 50: | 1 - 25: ; 26 - 35: ; 36 - 50:

Hyperparameters for dynamic sparse reparameterization (dynamic sparse) (ours)
Number of parameters to prune () | 600 | 20,000 | 200,000
Fractional tolerance of () | 0.1 | 0.1 | 0.1
Initial pruning threshold () | 0.001 | 0.001 | 0.001
Reparameterization period () schedule (epoch range: ) | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100: | 1 - 25: ; 26 - 80: ; 81 - 140: ; 141 - 200: | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100:

Hyperparameters for Sparse Evolutionary Training (SET) (Mocanu et al., 2018)
Number of parameters to prune at each re-parameterization step | 600 | 20,000 | 200,000
Reparameterization period () schedule (epoch range: ) | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100: | 1 - 25: ; 26 - 80: ; 81 - 140: ; 141 - 200: | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100:

Hyperparameters for Deep Rewiring (DeepR) (Bellec et al., 2017)
regularization multiplier () | | |
Temperature () schedule (epoch range: ) | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100: | 1 - 25: ; 26 - 80: ; 81 - 140: ; 141 - 200: | 1 - 25: ; 26 - 50: ; 51 - 75: ; 76 - 100:
Table 4: Hyperparameters for all experiments presented in the paper

Training  Hyperparameter settings for training are listed in the first block of Table 4. Standard mild data augmentation was used in all experiments for CIFAR10 (random translation, cropping and horizontal flipping) and for Imagenet (random cropping and horizontal flipping). The last linear layer of WRN-28-2 was always kept dense as it has a negligible number of parameters. The number of training epochs for the thin dense and static sparse baselines is double the number of training epochs shown in Table 4.

Sparse compression baseline  We compared our method against iterative pruning methods (Han et al., 2015b; Zhu & Gupta, 2017). We start from a full dense model trained with hyperparameters provided in the first block of Table 4 and then gradually prune the network to a target sparsity in steps. As in Zhu & Gupta (2017), the pruning schedule we used was


where indexes pruning steps, and the target sparsity reached at the end of training. Thus, this baseline (labeled as compressed sparse in the paper) was effectively trained for more iterations (original training phase plus compression phase) than our dynamic sparse method. Hyperparameter settings for sparse compression are listed in the second block of Table 4.
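For orientation, the gradual pruning schedule of Zhu & Gupta (2017) has the general cubic form $s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n \Delta t}\right)^3$. The sketch below implements that general form under the assumptions that the initial sparsity is zero and that the pruning steps are evenly spaced and indexed from 0; the function and argument names are illustrative, and the exact variant used for this baseline may differ in notation.

```python
def cubic_sparsity_schedule(step, num_steps, final_sparsity, initial_sparsity=0.0):
    """Target sparsity after `step` of `num_steps` pruning steps (cubic ramp)."""
    if not 0 <= step <= num_steps:
        raise ValueError("step must lie in [0, num_steps]")
    remaining = 1.0 - step / num_steps  # fraction of the schedule still ahead
    return final_sparsity + (initial_sparsity - final_sparsity) * remaining ** 3


# Example: ramp from dense (sparsity 0) to 90% sparsity over 10 pruning steps.
schedule = [cubic_sparsity_schedule(t, 10, 0.9) for t in range(11)]
```

The cubic shape prunes aggressively at first, when many redundant weights remain, and more gently as the target sparsity is approached.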

Dynamic reparameterization (ours)  Hyperparameter settings for dynamic sparse reparameterization (Algorithm 1) are listed in the third block of Table 4.

Sparse Evolutionary Training (SET)  Because the larger-scale experiments here (WRN-28-2 on CIFAR10 and Resnet-50 on Imagenet) were not attempted by Mocanu et al. (2018), no specific settings for reparameterization in these cases were available in the original paper. In order to make a fair comparison, we used the same hyperparameters as those used in our dynamic reparameterization scheme (third block in Table 4). At each reparameterization step, the weights in each layer were sorted by magnitude and the smallest fraction was pruned. An equal number of parameters were then randomly allocated in the same layer and initialized to zero. For control, the total number of reallocated weights at each step was chosen to be the same as our dynamic reparameterization method, as was the schedule for reparameterization.
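The SET-style step described above (prune the smallest-magnitude weights of a layer, then regrow the same number at random positions in that layer, initialized to zero) can be sketched as follows. This is a simplified illustration on flat Python lists; the function name, the explicit `mask` list and the choice to let just-pruned positions be regrown are assumptions of the sketch.

```python
import random


def set_reallocate(weights, mask, fraction, rng):
    """Prune the smallest-magnitude `fraction` of active weights in one layer,
    then regrow the same number at random inactive positions with value zero."""
    active = [i for i, m in enumerate(mask) if m]
    n_prune = int(fraction * len(active))
    # Sort active positions by magnitude and prune the smallest ones.
    pruned = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    for i in pruned:
        mask[i] = 0
        weights[i] = 0.0
    # Regrow within the same layer, so the layer's parameter count is unchanged.
    inactive = [i for i, m in enumerate(mask) if not m]
    grown = rng.sample(inactive, n_prune)
    for i in grown:
        mask[i] = 1
        weights[i] = 0.0
    return pruned, grown


weights = [0.9, 0.0, -0.05, 0.0, 0.4, 0.0, 0.01, 0.0]
mask = [1, 0, 1, 0, 1, 0, 1, 0]
pruned, grown = set_reallocate(weights, mask, fraction=0.5, rng=random.Random(0))
```

Because growth happens inside the layer that was pruned, the number of parameters per layer stays fixed, which is the limitation the dynamic reparameterization method removes.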

Deep Rewiring (DeepR)  The last block in Table 4 contains hyperparameters for the DeepR experiments. We refer the reader to Bellec et al. (2017) for details of the deep rewiring algorithm and for an explanation of the hyperparameters. We chose the DeepR hyperparameters for the different networks based on a parameter sweep.

Appendix B Comparison to hash nets

We also compared our dynamic sparse reparameterization method to a number of static dense reparameterization techniques, e.g. Denil et al. (2013); Yang et al. (2014); Moczulski et al. (2015); Sindhwani et al. (2015); Chen et al. (2015); Treister et al. (2018). Instead of sparsification, these methods impose structure on large parameter tensors by parameter sharing. Most of these methods have not been used for convolutional layers, except for recent ones (Chen et al., 2015; Treister et al., 2018). We found that HashedNet (Chen et al., 2015) had the best performance among static dense reparameterization methods, and therefore benchmarked our method against it. Instead of reparameterizing a dense parameter tensor as a sparse one with fewer non-zero components, HashedNet places a small number of free parameters into all positions of the parameter tensor through a random mapping from positions to free parameters, computed by cheap hashing. The result is a dense parameter tensor with shared components.
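The weight-sharing idea behind HashedNet can be sketched in a few lines. This is a minimal illustration only: it uses `hashlib.md5` as a stand-in hash, omits details of the original method such as any sign factor, and all names (`hashed_index`, `expand_hashed`, `seed`) are hypothetical.

```python
import hashlib


def hashed_index(position, num_free, seed=0):
    """Map a tensor position to one of `num_free` shared parameters via hashing."""
    digest = hashlib.md5(f"{seed}:{position}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_free


def expand_hashed(free_params, num_positions, seed=0):
    """Build the dense (virtual) parameter vector; positions in the same
    hash bucket share one free parameter."""
    return [free_params[hashed_index(p, len(free_params), seed)]
            for p in range(num_positions)]


free_params = [0.1, -0.2, 0.3]       # 3 free parameters ...
dense = expand_hashed(free_params, num_positions=12)  # ... fill 12 positions
```

The mapping is recomputed from the hash rather than stored, so only the free parameters count towards the model size, while every position of the dense tensor is non-zero.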

Results of LeNet-300-100-10 on MNIST and of WRN-28-2 on CIFAR10 are presented in Figure 5, and those of Resnet-50 on Imagenet in Table 5. For a given global sparsity of our method, we compared it against a HashedNet in which all reparameterized tensors were hashed such that the fraction of unique parameters in each equaled the fraction of non-zero weights in our sparse model. We found that our method (dynamic sparse) significantly outperformed HashedNet.

Figure 5: Comparison to HashedNet. (a) Test accuracy for LeNet-300-100-10 trained on MNIST. (b) Test accuracy for WRN-28-2 trained on CIFAR10. Conventions same as in Figure 7.
Final global sparsity (# Parameters) | (7.3M) top-1 | (7.3M) top-5 | (5.1M) top-1 | (5.1M) top-5
HashedNet | 70.0 [-4.9] | 89.6 [-2.8] | 66.9 [-8.0] | 87.4 [-5.0]
Dynamic sparse (ours) | 73.3 [-1.6] | 92.4 [ 0.0] | 71.6 [-3.3] | 90.5 [-1.9]
Table 5: Test accuracy% (top-1, top-5) of Resnet-50 on Imagenet for dynamic sparse vs. HashedNet. Numbers in square brackets are differences from the full dense baseline.

Appendix C A taxonomy of training methods that yield “sparse” deep CNNs

As an extension to Section 2 of the main text, here we elaborate on existing methods related to ours, how they compare and contrast with each other, and what features, apart from effectiveness, distinguish our approach from all previous ones. We confine the scope of comparison to training methods that produce smaller versions (i.e. ones with fewer parameters) of a given modern (i.e. post-AlexNet) deep convolutional neural network model. We list representative methods in Table 6. We classify these methods by three key features.

Method | Strict parameter budget throughout training and inference | Granularity of sparsity | Automatic layer sparsity

Dynamic Sparse Reparameterization | yes | non-structured | yes
Sparse Evolutionary Training (SET) (Mocanu et al., 2018) | yes | non-structured | no
Deep Rewiring (DeepR) (Bellec et al., 2017) | yes | non-structured | no

NN Synthesis Tool (NeST) (Dai et al., 2017, 2018) | no | non-structured | yes
(Zhu & Gupta, 2017) | no | non-structured | no
RNN Pruning (Narang et al., 2017) | no | non-structured | no
Deep Compression (Han et al., 2015b) | no | non-structured | no

Group-wise Brain Damage (Lebedev & Lempitsky, 2015) | no | channel | no
L1-norm Channel Pruning (Li et al., 2016) | no | channel | no
Structured Sparsity Learning (SSL) (Wen et al., 2016) | no | channel/kernel/layer | yes
ThiNet (Luo et al., 2017) | no | channel | no
LASSO-regression Channel Pruning (He et al., 2017) | no | channel | no
Network Slimming (Liu et al., 2017) | no | channel | yes
Sparse Structure Selection (SSS) (Huang & Wang, 2017) | no | layer | yes
Principal Filter Analysis (PFA) (Suau et al., 2018) | no | channel | yes/no

We provide examples of different categories of methods. This is not a complete list of methods.
Table 6: Representative examples of training methods that yield “sparse” deep CNNs

Strict parameter budget throughout training and inference  This feature was discussed in depth in the main text. Most of the methods to date are compression techniques, i.e. they start training with a fully parameterized, dense model, and then reduce parameter counts. To the best of our knowledge, only three methods, namely DeepR (Bellec et al., 2017), SET (Mocanu et al., 2018) and ours, strictly impose, throughout the entire course of training, a fixed small parameter budget, one that is equal to the size of the final sparse model for inference. We make a distinction between these direct training methods (first block) and compression methods (second and third blocks of Table 6). (Footnote 6: Note that an intermediate case is NeST (Dai et al., 2017, 2018), which starts training with a small network, grows it to a large size, and finally prunes it down again. Thus, a fixed parameter footprint is not strictly imposed throughout training, so we list NeST in the second block of Table 6.)

This distinction is meaningful in two ways: (a) practically, direct training methods are more memory-efficient on appropriate computing substrate by requiring parameter storage of no more than the final compressed model size; (b) theoretically, these methods, if performing on par with or better than compression methods (as this work suggests), shed light on an important question: whether gross overparameterization during training is necessary for good generalization performance?

Granularity of sparsity  The granularity of sparsity refers to the additional structure imposed on the placement of the non-zero entries of a sparsified parameter tensor. The finest-grained case, namely non-structured, allows each individual weight in a parameter tensor to be zero or non-zero independently. Early compression techniques, e.g. Han et al. (2015b), and more recent pruning-based compression methods based thereon, e.g. Zhu & Gupta (2017), are non-structured (second block of Table 6). So are all direct training methods like ours (first block of Table 6).

Non-structured sparsity cannot be fully exploited by mainstream compute devices such as GPUs. To tackle this problem, a class of compression methods, structured pruning methods (third block in Table 6), constrain “sparsity” to a much coarser granularity. Typically, pruning is performed at the level of an entire feature map, e.g. ThiNet (Luo et al., 2017), whole layers, or even entire residual blocks (Huang & Wang, 2017). This way, the compressed “sparse” model has essentially smaller and/or fewer dense parameter tensors, and computation can thus be accelerated on GPUs the same way as dense neural networks.

These structured compression methods, however, did not make a useful baseline in this work, for the following reasons. First, because they produce dense models, they are far more remotely related to our method (non-structured, non-compression) than non-structured compression techniques that yield sparse models, which makes a comparison less meaningful. Second, typical structured pruning methods substantially underperformed non-structured ones (see Table 2 in the main text for two examples, ThiNet and SSS), and emerging evidence has called into question the fundamental value of structured pruning: Mittal et al. (2018) found that the channel pruning criteria used in a number of state-of-the-art structured pruning methods performed no better than random channel elimination, and Liu et al. (2018) found that fine-tuning in a number of state-of-the-art pruning methods fared no better than direct training of a randomly initialized pruned model which, in the case of channel/layer pruning, is simply a less wide and/or less deep dense model (see Table 2 in the main text for comparison of ThiNet and SSS against thin dense).

In addition, we performed extra experiments in which we constrained our method to operate on networks with structured sparsity and obtained significantly worse results, see Appendix D.

Predefined versus automatically discovered sparsity levels across layers  The last key feature (rightmost column of Table 6) for our classification of methods is whether the sparsity levels of different layers of the network are automatically discovered during training or predefined by manual configuration. The value of automatic sparsification, e.g. ours, is twofold. First, it is conceptually more general because parameter reallocation heuristics can be applied to diverse model architectures, whereas layer-specific configuration has to be cognizant of network architecture, and at times also of the task to learn. Second, it is practically more scalable because it obviates the need for manual configuration of layer-wise sparsity, keeping the overhead of hyperparameter tuning constant rather than scaling with model depth/size. In addition to efficiency, we also show in Appendix E extra experiments on how automatic parameter reallocation across layers contributed to its effectiveness.

In conclusion, our method is unique in that it:

  1. strictly maintains a fixed parameter footprint throughout the entire course of training.

  2. automatically discovers layer-wise sparsity levels during training.

Appendix D Structured versus non-structured sparsity

Final overall sparsity (# Parameters) | (7.3M) top-1 | (7.3M) top-5 | (5.1M) top-1 | (5.1M) top-5
Thin dense | 72.4 [-2.5] | 90.9 [-1.5] | 70.7 [-4.2] | 89.9 [-2.5]
Dynamic sparse (kernel granularity) | 72.6 [-2.3] | 91.0 [-1.4] | 70.2 [-4.7] | 89.8 [-2.6]
Dynamic sparse (non-structured) | 73.3 [-1.6] | 92.4 [ 0.0] | 71.6 [-3.3] | 90.5 [-1.9]
Table 7: Test accuracy% (top-1, top-5) of Resnet-50 on Imagenet for different levels of granularity of sparsity. Numbers in square brackets are differences from the full dense baseline.

We investigated how our method performs when it is constrained to train sparse models at a coarser granularity. Consider the weight tensor of a convolution layer, a 4-dimensional tensor whose size is set by the numbers of output and input channels and by the spatial extent of the kernels. Our method performed dynamic sparse reparameterization by pruning and reallocating individual weights of this 4-dimensional parameter tensor, which is the finest granularity. To adapt our procedure to coarse-grained sparsity on groups of parameters, we modified our algorithm (Algorithm 1 in the main text) in the following ways:

  1. the pruning step now removed entire groups of weights by comparing their norms with the adaptive threshold.

  2. the adaptive threshold was updated based on the difference between the target number and the actual number of groups to prune/grow at each step.

  3. the growth step reallocated groups of weights within and across parameter tensors using the heuristic in Line 17 of Algorithm 1.
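The group-wise pruning step in the list above can be sketched at kernel granularity as follows. This is a minimal illustration under stated assumptions: the norm is taken to be the L2 norm, kernels are stored as flat lists keyed by hypothetical `(out_channel, in_channel)` pairs, and a kernel that is already all-zero is treated as inactive.

```python
import math


def prune_kernels(kernels, threshold):
    """Zero every active kernel whose L2 norm falls below `threshold`.

    `kernels` maps (out_channel, in_channel) -> flat list of kernel weights.
    Returns the keys of the kernels pruned in this call.
    """
    pruned = []
    for key, w in kernels.items():
        is_active = any(w)  # an all-zero kernel counts as already pruned
        if is_active and math.sqrt(sum(x * x for x in w)) < threshold:
            kernels[key] = [0.0] * len(w)
            pruned.append(key)
    return pruned


kernels = {
    (0, 0): [0.5, -0.5, 0.1, 0.0],
    (0, 1): [0.01, 0.0, -0.02, 0.0],   # small norm, will be pruned
    (1, 0): [0.0, 0.0, 0.0, 0.0],      # already inactive
    (1, 1): [0.3, 0.3, 0.3, 0.3],
}
pruned = prune_kernels(kernels, threshold=0.1)
```

The number of pruned groups, rather than the number of pruned weights, would then drive the threshold update and the growth step.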

We show results at kernel-level granularity (i.e. groups are kernels) in Figure 6 and Table 7, for WRN-28-2 on CIFAR10 and Resnet-50 on Imagenet, respectively. We observe that enforcing kernel-level sparsity leads to significantly worse accuracy compared to unstructured sparsity. For WRN-28-2, kernel-level parameter re-allocation still outperforms the thin dense baseline, though the performance advantage disappears as the level of sparsity decreases. Note that the thin dense baseline was always trained for double the number of epochs used to train the models with dynamic parameter re-allocation.

Figure 6: Test accuracy for WRN-28-2 trained on CIFAR10 for two variants of dynamic sparse, i.e. kernel-level granularity of sparsity and non-structured (same as dynamic sparse in the main text), as well as the thin dense baseline. Conventions same as in Figure 7.

When we further coarsened the granularity of sparsity to channel level (i.e. groups are slices that generate output feature maps), our method failed to produce performant models.

Appendix E Multi-layer perceptrons and training at extreme sparsity levels

We carried out experiments on small multi-layer perceptrons to assess whether our dynamic parameter re-allocation method can effectively distribute parameters in small networks at extreme sparsity levels. We experimented with a simple LeNet-300-100 trained on MNIST. Hyperparameters for the experiments are reported in appendix A. The results are shown in Fig. 7. Our method is the only method, other than pruning from a large dense model, that is capable of effectively training the network at the highest sparsity setting, which it does by automatically moving parameters between layers to realize layer sparsities that can be effectively trained. The per-layer sparsities discovered by our method are shown in Fig. 7. Our method automatically leads to a top layer with much lower sparsity than the two hidden layers. Similar sparsity patterns were found through hand-tuning to improve the performance of DeepR (Bellec et al., 2017). All layers were initialized at the same sparsity level (equal to the global sparsity level). While hand-tuning the per-layer sparsities should allow SET and DeepR to learn at the highest sparsity setting, our method automatically discovers the per-layer sparsities and allows us to dispense with such a tuning step.

Figure 7: Test accuracy for LeNet-300-100-10 on MNIST for different training methods. Circular symbols mark the median of 5 runs, and error bars are the standard deviation. Parameter counts include all trainable parameters, i.e., parameters in sparse tensors plus all other dense tensors, such as those of batch normalization layers. Notice the failure of training at the highest sparsity level for static sparse, SET, and DeepR.

Appendix F Full description of the dynamic parameter re-allocation algorithm

Algorithm 1 in the main text informally describes our parameter re-allocation scheme. In this appendix, we present a more rigorous description of the algorithm. Let all reparameterized weight tensors in the original network be denoted by $\{W_l\}$, where $l = 1, \dots, L$ indexes layers. Let $N_l$ be the number of parameters in $W_l$, and $N = \sum_l N_l$ the total parameter count.

We sparsely reparameterize $W_l = g(\phi_l; \psi_l)$, where the function $g$ places the components of parameter $\phi_l$ into the positions in $W_l$ indexed by $\psi_l \in \Pi_{M_l}(\{1, \dots, N_l\})$ (Footnote 7: By $\Pi_M(S)$ we denote the set of all cardinality-$M$ ordered subsets of a finite set $S$.), such that $W_{l, \psi_{l,i}} = \phi_{l,i}$, with $i = 1, \dots, M_l$ indexing components. Let $M_l$ be the dimensionality of $\phi_l$ and $\psi_l$, i.e. the number of non-zero weights in $W_l$. Define $s_l = 1 - M_l / N_l$ as the sparsity of $W_l$. Global sparsity is then defined as $s = 1 - M / N$, where $M = \sum_l M_l$.

During the whole course of training, we kept the global sparsity constant, specified by hyperparameter $s$. Reparameterization was initialized by uniformly sampling positions in each weight tensor at the global sparsity $s$, i.e. $\psi_l$ was sampled uniformly from $\Pi_{M_l}(\{1, \dots, N_l\})$, where $M_l$ is $(1 - s) N_l$ rounded to an integer. Associated parameters $\phi_l$ were randomly initialized.
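As a small worked example of these quantities, the sketch below computes per-layer and global sparsity for the three weight matrices of LeNet-300-100 (784×300, 300×100, 100×10), assuming the usual definition of sparsity as one minus the fraction of non-zero weights. The function names and the number of moved parameters are illustrative assumptions.

```python
def layer_sparsity(num_nonzero, num_total):
    """Sparsity of one tensor: fraction of its entries that are zero."""
    return 1.0 - num_nonzero / num_total


def global_sparsity(nonzeros, totals):
    """Sparsity of the whole network, pooling all reparameterized tensors."""
    return 1.0 - sum(nonzeros) / sum(totals)


totals = [784 * 300, 300 * 100, 100 * 10]
target = 0.9
# Initialization: every layer starts at the global sparsity level.
nonzeros = [round((1 - target) * n) for n in totals]

# Moving 500 free parameters from the first layer to the last one changes
# the per-layer sparsities but leaves the global sparsity untouched.
moved = [nonzeros[0] - 500, nonzeros[1], nonzeros[2] + 500]
```

This is the degree of freedom that reallocation across layers exploits: the global budget is fixed, while its split among layers is free to change during training.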

Dynamic reparameterization was done periodically by repeating the following steps during training:

  1. Train the model under its current reparameterization for a fixed number of batch iterations (the reparameterization period);

  2. Reallocate free parameters within and across weight tensors following Algorithm 2 to arrive at a new reparameterization.
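The alternation between the two steps above can be sketched as a simple loop. The callbacks `train_step` and `reallocate_step` are hypothetical placeholders for one batch iteration and one reallocation step, and the sketch uses a constant period, whereas Table 4 lists a period schedule that varies over epochs.

```python
def run_training(num_iterations, period, train_step, reallocate_step):
    """Alternate `period` training iterations with one reallocation step."""
    for iteration in range(1, num_iterations + 1):
        train_step(iteration)
        if iteration % period == 0:
            reallocate_step(iteration)


log = []
run_training(
    num_iterations=10,
    period=4,
    train_step=lambda it: log.append(("train", it)),
    reallocate_step=lambda it: log.append(("reallocate", it)),
)
```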

Algorithm 2 Reallocate free parameters within and across weight tensors
 1  Input: state from the current step
 2  Output: state for the next step
 3  Require: target number of parameters to be pruned and its fractional tolerance
 4  for each reparameterized weight tensor do
 5      find the indices of subthreshold components to be pruned
 6      count the numbers of pruned and surviving weights
 7  end for
 8  if too few parameters were pruned then
 9      increase the pruning threshold
10  else if too many parameters were pruned then
11      decrease the pruning threshold
12  else (a proper number of parameters was pruned)
13      maintain the pruning threshold
14  end if
15  for each reparameterized weight tensor do
16      redistribute parameters for growth
17      sample zero positions to grow new weights
18      update the parameter count
19      form the new reparameterization
20  end for
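The steps of the reallocation procedure can be sketched in plain Python as below. This is a simplified illustration, not the authors' implementation: each layer is a hypothetical dict holding its dense `size` and a `weights` dict of non-zero entries, the threshold is doubled or halved when the pruned count leaves the tolerance band (the factor 2 is an assumption), new weights start at zero (also an assumption), and leftovers from rounding are ignored here.

```python
import random


def reallocate(layers, threshold, n_target, tol, rng):
    """One prune-and-grow step; returns the updated pruning threshold."""
    # 1) Global magnitude pruning against a single threshold.
    pruned_counts, surviving_counts = [], []
    for layer in layers:
        below = [p for p, w in layer["weights"].items() if abs(w) < threshold]
        for p in below:
            del layer["weights"][p]
        pruned_counts.append(len(below))
        surviving_counts.append(len(layer["weights"]))
    total_pruned = sum(pruned_counts)
    total_surviving = sum(surviving_counts)

    # 2) Setpoint control: keep the pruned count near n_target.
    if total_pruned < (1 - tol) * n_target:
        threshold *= 2.0      # too few pruned: raise the threshold
    elif total_pruned > (1 + tol) * n_target:
        threshold /= 2.0      # too many pruned: lower the threshold

    # 3) Growth: share the freed parameters in proportion to surviving counts.
    for layer, surviving in zip(layers, surviving_counts):
        n_grow = total_pruned * surviving // total_surviving if total_surviving else 0
        free = [p for p in range(layer["size"]) if p not in layer["weights"]]
        for p in rng.sample(free, min(n_grow, len(free))):
            layer["weights"][p] = 0.0
    return threshold


layers = [
    {"size": 10, "weights": {0: 0.5, 1: 0.001, 2: -0.4, 3: 0.002}},
    {"size": 6, "weights": {0: 0.3, 1: -0.0005}},
]
new_threshold = reallocate(layers, threshold=0.01, n_target=3, tol=0.1,
                           rng=random.Random(0))
```

In this example three weights fall below the threshold, which matches the target, so the threshold is kept; two of the freed parameters go to the first layer and one to the second, in proportion to their surviving weights.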

The adaptive reallocation is in essence a two-step procedure: a global pruning followed by a tensor-wise growth. Specifically our algorithm has the following key features:

  1. Pruning was based on the magnitude of weights, by comparing all parameters to a single global threshold, making the algorithm much more scalable than methods relying on layer-specific pruning.

  2. We made the threshold adaptive, subject to simple setpoint control dynamics that ensured that roughly the target number of weights was pruned globally per iteration. This is computationally cheaper than pruning exactly that many of the smallest weights, which requires sorting all weights in the network.

  3. Growth was tensor-specific and proceeded by uniformly sampling zero-valued positions within each weight tensor, thereby achieving a reallocation of parameters across layers. The heuristic guiding growth is

    $$G_l = \left\lfloor \frac{R_l}{\sum_{l'} R_{l'}} \sum_{l'} K_{l'} \right\rfloor \qquad (3)$$

    where $G_l$ is the number of weights grown in the $l$-th weight tensor, and $K_l$ and $R_l$ are its pruned and surviving parameter counts, respectively. This rule allocated more free parameters to weight tensors with more surviving entries, while keeping the global sparsity the same by balancing numbers of parameters pruned and grown. (Footnote 8: Note that an exact match is not guaranteed due to rounding errors in Eq. 3 and the possibility that $R_l + G_l > N_l$, i.e. free parameters in a weight tensor exceeding its dense size $N_l$ after reallocation. We added an extra step to redistribute parameters randomly to other tensors in these cases, thereby assuring an exact global sparsity.)
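The growth rule and the extra redistribution step mentioned in the footnote can be made concrete with a small sketch. It assumes growth counts proportional to surviving counts with flooring, caps each tensor at its dense size, and hands out any leftover parameters one at a time to randomly chosen tensors that still have room; the function name and this particular redistribution scheme are assumptions.

```python
import random


def allocate_growth(pruned, surviving, sizes, rng):
    """Number of weights to grow in each tensor so that the total grown
    equals the total pruned."""
    total_pruned, total_surviving = sum(pruned), sum(surviving)
    if total_surviving == 0:
        raise ValueError("no surviving weights to guide the allocation")
    # Proportional share, rounded down.
    grow = [total_pruned * r // total_surviving for r in surviving]
    # A tensor cannot hold more non-zero weights than its dense size.
    room = [n - r for n, r in zip(sizes, surviving)]
    grow = [min(g, c) for g, c in zip(grow, room)]
    # Redistribute what rounding and capping left over.
    leftover = total_pruned - sum(grow)
    while leftover > 0:
        candidates = [i for i, (g, c) in enumerate(zip(grow, room)) if g < c]
        grow[rng.choice(candidates)] += 1
        leftover -= 1
    return grow


pruned = [5, 3, 2]
surviving = [10, 20, 3]
sizes = [100, 24, 50]
grow = allocate_growth(pruned, surviving, sizes, random.Random(0))
```

Here the proportional shares are 3, 6 and 0. The second tensor only has room for 4 more weights, so 3 parameters remain and are assigned at random to the first and third tensors, giving a total of 10 grown for 10 pruned.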

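The entire procedure can be fully specified by a small set of hyperparameters: the global sparsity, the target number of parameters to prune and its fractional tolerance, the initial pruning threshold, and the reparameterization period (see the third block of Table 4).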