1 Introduction
The ability of deep neural networks to learn complex transformations by example, together with their strong generalization ability, has been key to their success in domains ranging from computer vision to machine translation to automatic speech recognition. Even though they generalize well, deep networks learn more effectively when they are highly overparameterized (Brutzkus et al., 2017; Zhang et al., 2016). Emerging evidence attributes this need for overparameterization to the geometry of the high-dimensional loss landscapes of overparameterized deep neural networks (Dauphin et al., 2014; Choromanska et al., 2014; Goodfellow et al., 2014; Im et al., 2016; Wu et al., 2017; Liao & Poggio, 2017; Cooper, 2018; Novak et al., 2018) and to the implicit regularization properties of SGD (Brutzkus et al., 2017; Zhang et al., 2018a; Poggio et al., 2017), though a thorough theoretical understanding is not yet complete.

Several techniques can trim down the post-training model size, such as distillation methods (Bucilua et al., 2006; Hinton et al., 2015), reduced bit-precision methods (Hubara et al., 2016; McDonnell, 2018), low-rank decomposition methods (Jaderberg et al., 2014; Denil et al., 2013), and pruning methods (Han et al., 2015a; Zhang et al., 2018b). While these methods are highly effective at reducing the number of network parameters with little to no degradation in accuracy, they either operate on a pre-trained model or require the full overparameterized model to be maintained during training. The success of these compression methods indicates that shallow and/or small networks contain parameter configurations that allow them to reach accuracies on par with bigger and deeper networks. This gives a tantalizing hint that overparameterization is not a strict necessity and that alternative training or parameterization methods might be able to find these compact networks directly.
The problem of achieving training-time parameter efficiency^1 can be approached in a number of ways. Innovations in this direction for deep convolutional neural networks (CNNs) include the development of skip connections (He et al., 2015), the elimination of fully-connected layers in favor of global average pooling layers followed directly by the classifier layer (Lin et al., 2013), and depthwise separable convolutions (Sifre & Mallat, 2014; Howard et al., 2017). These architectural innovations drastically improved the accuracy of CNNs at reduced parameter budgets.

^1 If model family $A$ achieves a specific level of generalization performance with fewer parameters than model family $B$, we say $A$ is more parameter efficient than $B$ at that performance level.

An alternative approach is to reparameterize an existing model architecture. In general, any differentiable reparameterization can be used to augment training of a given model. Let an original network (or a layer therein) be denoted by $f(x; \theta)$, parameterized by $\theta$. Reparameterize it by $\phi$ and $\psi$ through $\theta = g(\phi; \psi)$, where $g$ is differentiable w.r.t. $\phi$ but not necessarily w.r.t. $\psi$. Denote the reparameterized network by $\tilde{f}$, considering $\psi$ as metaparameters^2:

$$\tilde{f}(x; \phi, \psi) = f\big(x;\, g(\phi; \psi)\big) \qquad (1)$$

We can train $\tilde{f}$ using gradient descent. If $(\phi, \psi)$ is more compact than $\theta$ and $\tilde{f}$ can be trained to match the generalization performance of $f$, then $g$ provides a more efficient parameterization of the network.

^2 We use the term metaparameter to refer to the parameters $\psi$ of the reparameterization function $g$. They differ from the parameters $\phi$ in that they are not optimized through gradient descent, and they differ from hyperparameters in that they define meaningful features of the model which are required for inference.
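The reparameterized forward pass of Eq. (1) can be sketched numerically. The layer sizes below and the choice of $g$ as a fixed random linear projection are our own illustrative assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_free = 64, 128, 512      # hypothetical layer sizes

# Metaparameters psi: a fixed linear map (not trained).
# Free parameters phi: what SGD would actually optimize.
psi = rng.standard_normal((d_out * d_in, n_free)) / np.sqrt(n_free)
phi = rng.standard_normal(n_free) * 0.01

def f_tilde(x, phi):
    """Reparameterized layer f(x; g(phi; psi)), with g a linear projection.
    g is differentiable in phi, so gradients flow to phi via the chain rule."""
    theta = (psi @ phi).reshape(d_out, d_in)   # generate the dense weights
    return x @ theta.T

x = rng.standard_normal((4, d_in))
y = f_tilde(x, phi)
# 512 trainable parameters stand in for 64 * 128 = 8192 dense weights.
```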
Sparse reparameterization is a special case where $g$ is a linear projection: $\phi$ holds the nonzero entries (i.e. "weights") and $\psi$ their indices (i.e. "connectivity") in the original parameter tensor $\theta$. Likewise, parameter sharing is a similar special case of linear reparameterization where $\phi$ holds the tied parameters and $\psi$ the indices at which each parameter is placed (with repetition) in the original parameter tensor $\theta$. Furthermore, if the metaparameters $\psi$ are fixed during the course of training, the reparameterization is static, whereas if $\psi$ is adjusted adaptively during training, we call it dynamic reparameterization.

In this paper, we examine multiple parameterizations of deep residual CNNs, both static and dynamic. We build upon previous sparse dynamic parameterization schemes to develop a novel dynamic parameterization method that yields the highest parameter efficiency when training deep residual CNNs, outperforming previous static and dynamic parameterization methods. Our method dynamically changes the sparse network structure during learning, and its superior performance implies that, given a certain storage and computational budget to train a residual CNN, we are better off allocating part of the budget to describing and evolving the structure of the network rather than spending it all on the parameters of a conventional dense network.
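The sparse special case can be written as a scatter of the free weights into a dense tensor; the tensor shape and density below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (8, 8)        # hypothetical dense tensor shape
n_active = 10

# psi: flat indices of the active positions (the connectivity metaparameters);
# phi: the nonzero weights placed at those positions.
psi = rng.choice(shape[0] * shape[1], size=n_active, replace=False)
phi = rng.standard_normal(n_active)

def scatter(phi, psi, shape):
    """g(phi; psi) for sparse reparameterization: a linear projection that
    scatters the free weights into an otherwise all-zero dense tensor."""
    theta = np.zeros(shape)
    theta.flat[psi] = phi
    return theta

theta = scatter(phi, psi, shape)
sparsity = 1.0 - n_active / theta.size   # fraction of zero positions
```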
We show that the success of our dynamic parameterization method is not solely due to the final structure of the resultant sparse networks, or to a combination of final structure and initial weight values. Rather, training-time structural exploration is needed to reach the best accuracies, even if a high-performance structure and its initial values are known a priori. This implies that optimizing structure in tandem with weight optimization through gradient descent helps the latter find better-performing weights. Structure exploration thus improves the trainability of sparse deep residual CNNs.
2 Related work
Training of differentiably reparameterized networks has been proposed in numerous previous studies.
Dense reparameterization Several dense reparameterization techniques sought to reduce the size of fully-connected layers. These include low-rank decomposition (Denil et al., 2013), the Fastfood transform (Yang et al., 2014), the ACDC transform (Moczulski et al., 2015), HashedNet (Chen et al., 2015), low displacement rank (Sindhwani et al., 2015), and block-circulant matrix parameterization (Treister et al., 2018).
Note that similar reparameterizations were also used to introduce certain algebraic properties to the parameters for purposes other than reducing model sizes, e.g. to make training more stable as in unitary evolution RNNs (Arjovsky et al., 2015) and in weight normalization (Salimans & Kingma, 2016), to inject inductive biases (Thomas et al., 2018), and to alter (Dinh et al., 2017) or to measure (Li et al., 2018) properties of the loss landscape. All dense reparameterization methods to date are static.
Sparse reparameterization Successful training of sparse reparameterized networks usually employs iterative pruning and retraining, e.g. Han et al. (2015b); Narang et al. (2017); Zhu & Gupta (2017)^3. Training typically starts with a large pre-trained model, and sparsity is gradually increased during the course of fine-tuning. Training a small, static, and sparse model de novo typically fares worse than obtaining the sparse model by pruning a large dense model (Zhu & Gupta, 2017).

^3 Note that these, as well as all other techniques we benchmark against in this paper, impose non-structured sparsification on parameter tensors, yielding sparse models. There also exists a class of structured pruning methods that "sparsify" at channel or layer granularity, e.g. Luo et al. (2017) and Huang & Wang (2017), generating essentially small dense models. We describe the full landscape of existing methods in Appendix C.
Frankle & Carbin (2018) successfully identified small and sparse subnetworks post-training which, when trained in isolation, reached accuracies similar to that of the enclosing big network. They further showed that these subnetworks were sensitive to initialization, and hypothesized that the role of overparameterization is to provide a large number of candidate subnetworks, thereby increasing the likelihood that one of these subnetworks will have the structure and initialization needed for effective learning.
Most closely related to our work are dynamic sparse reparameterization techniques that emerged only recently. Like ours, these methods adaptively alter, by certain heuristic rules, the locations of nonzero parameters during training. Sparse evolutionary training (SET) (Mocanu et al., 2018) used magnitude-based pruning and random growth at the end of each training epoch. NeST (Dai et al., 2017, 2018) iteratively grew and pruned parameters and neurons during training; parameter growth was guided by parameter gradients and pruning by parameter magnitudes. Deep rewiring (Bellec et al., 2017) combined dynamic sparse parameterization with stochastic parameter updates for training. These methods were mostly concerned with sparsifying fully-connected layers and were applied to relatively small and shallow networks. We show that the method we propose in this paper is more scalable and computationally efficient than these previous approaches, while achieving better performance on deep convolutional networks.

3 Methods
We train deep CNNs in which the majority of layers have sparse weight tensors. All sparse weight tensors are initialized at the same sparsity (percentage of zeros) level. We use a full (non-sparse) parameterization for all bias parameters and the parameters of batch normalization layers. Throughout training, we always maintain the same total number of nonzero parameters in the network. Parameters are moved within and across tensors in two phases, a pruning phase followed immediately by a growth phase, as shown in Algorithm 1. We carry out the parameter reallocation step described by Algorithm 1 every few hundred training iterations.

We use magnitude-based pruning with an adaptive global threshold $H$: all network weights with magnitude smaller than $H$ are pruned. $H$ adapts to roughly maintain a fixed number $K$ of pruned/grown parameters during each reallocation step. This makes pruning particularly efficient, as no sorting operations are needed and only a single global threshold is used. After removing parameters during the pruning phase, zero-initialized parameters are redistributed among the network tensors in the growth phase.
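Threshold-based pruning of this kind can be sketched as follows. The multiplicative update of the threshold is only a plausible illustration of "adapt the threshold to roughly maintain a fixed pruning count", not necessarily the authors' exact rule:

```python
import numpy as np

def adaptive_prune(w, H, K):
    """Magnitude-based pruning with a single global threshold H (no sorting).
    Returns the keep-mask, the number of weights pruned, and an updated H.
    The factor-of-two adaptation band for H is an assumption for illustration."""
    nonzero = w != 0
    pruned = nonzero & (np.abs(w) < H)
    k = int(pruned.sum())
    if k < 0.5 * K:       # pruned far too few: raise the threshold
        H *= 2.0
    elif k > 2.0 * K:     # pruned far too many: lower it
        H *= 0.5
    return nonzero & ~pruned, k, H

rng = np.random.default_rng(2)
w = rng.standard_normal(10_000)
keep, k, H = adaptive_prune(w, H=0.01, K=500)
```

Because only a comparison against one scalar is needed, the cost per step is linear in the number of weights, with no sort.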
Intuitively, we should allocate more parameters to layers where they can more quickly reduce the training classification loss. To first order, we should allocate more parameters to layers whose parameters receive larger classification-loss gradients. If a layer has been heavily pruned, this indicates that, for a large portion of its parameters, the training loss gradients were not large or consistent enough to counteract the pull towards zero arising from weight regularization. We thus use a simple heuristic in which the available parameters are allocated preferentially to layers with a higher percentage of nonzero weights, as shown in Algorithm 1. The parameters allocated to a layer are randomly placed in the non-active (zero) positions of its weight tensor. See Appendix F for a more detailed description of the algorithm.
To simplify exposition, we do not include in Algorithm 1 the guards against rounding errors that could introduce a discrepancy between the numbers of pruned and grown parameters. We also do not include the special case where more parameters are allocated to a tensor than it has non-active positions; in that case, the extra parameters that do not fit in the now fully dense tensor are redistributed among the other sparse tensors.
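The growth heuristic, including the rounding and overflow handling just described, can be sketched as follows. The proportional-to-nonzero-count rule follows the text; the round-robin redistribution of spilled parameters is our own simplification:

```python
import numpy as np

def reallocate_growth(masks, n_grow, rng):
    """Grow n_grow parameters across sparse tensors in proportion to each
    tensor's count of nonzero weights, spilling parameters that do not fit
    in a (now fully dense) tensor back to the others. Tie-breaking and the
    spill order are assumptions; assumes enough free positions exist overall."""
    alive = np.array([int(m.sum()) for m in masks], dtype=float)
    free = np.array([m.size - int(m.sum()) for m in masks])
    assert free.sum() >= n_grow
    quota = np.floor(n_grow * alive / alive.sum()).astype(int)
    quota = np.minimum(quota, free)          # cap at available zero positions
    spill = n_grow - int(quota.sum())        # rounding leftovers + overflow
    while spill > 0:
        for i in range(len(masks)):
            if spill > 0 and quota[i] < free[i]:
                quota[i] += 1
                spill -= 1
    for m, q in zip(masks, quota):
        zeros = np.flatnonzero(~m.ravel())   # candidate positions to activate
        m.flat[rng.choice(zeros, size=q, replace=False)] = True
    return masks

rng = np.random.default_rng(3)
masks = [np.zeros((10, 10), dtype=bool) for _ in range(2)]
masks[0].flat[:30] = True                    # a denser tensor...
masks[1].flat[:10] = True                    # ...and a sparser one
before = sum(int(m.sum()) for m in masks)
reallocate_growth(masks, n_grow=21, rng=rng)
after = sum(int(m.sum()) for m in masks)
```

In this example the denser tensor receives roughly three quarters of the new parameters, matching the proportional rule.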
The algorithm most closely related to ours is SET (Mocanu et al., 2018). Our algorithm differs from SET in two respects: we use an adaptive threshold for pruning instead of pruning a fixed fraction of weights at each reallocation step, and we reallocate parameters across layers during training rather than imposing a fixed sparsity level on each layer. The first difference reduces computational overhead, as it obviates the need for sorting operations; the second leads to better-performing networks, as shown in the next section, and to the ability to train extremely sparse networks, as shown in Appendix E.
We evaluate our method, together with other static and dynamic parameterization methods, on the deep residual CNNs shown in Table 1. We did not include AlexNet (Krizhevsky et al., 2012) or VGG-style networks (Simonyan & Zisserman, 2014), as their parameter efficiency is inferior to that of residual nets; such a setup makes the improvement in parameter efficiency achieved by our dynamic parameterization method more relevant. Dynamic sparse parameterization was applied to the weight tensors of all convolutional layers (with the exception of downsampling convolutions and the first convolutional layer acting on the input image), while all biases and parameters of normalization layers were kept dense. Global sparsity is defined in relation to the sparse tensors only, i.e., it is the number of non-active (zero) positions in all sparse tensors as a fraction of the number of parameters in dense tensors of the same dimensions.
At a specific global sparsity, we compared our method (dynamic sparse) against six baselines:

Full dense: the original large and dense model;

Thin dense: the original model with thinner layers, sized to match dynamic sparse;

Static sparse: the original model initialized at the same sparsity level with a random sparsity pattern, then trained with connectivity (sparsity pattern) fixed;

Compressed sparse: compression of the original model by iteratively pruning and retraining it to the target sparsity (Zhu & Gupta, 2017);

DeepR: sparse model trained by using Deep Rewiring (Bellec et al., 2017);

SET: sparse model trained by using Sparse Evolutionary Training (SET) (Mocanu et al., 2018).
Appendix B compares our method against an additional static parameterization method based on weight tying: HashedNet (Chen et al., 2015).
Using the number of parameters to compare network sizes across sparse and non-sparse models can be misleading if the extra information needed to specify the connectivity structure of the sparse models is not taken into account. We therefore compare models that have the same size in bits, instead of the same number of weights. While the number of bits needed to specify the connectivity is implementation dependent, we assume a simple scheme where one bit is used for each position in the weight tensors to indicate whether that position is active (contains a nonzero weight). A sparse tensor is fully defined by this bitmask together with the nonzero weights. This scheme was previously used in CNN accelerators that natively operate on sparse structures (Aimar et al., 2018). For a network whose dense tensors contain $N$ weights at 32 bits each, a sparse version with sparsity $s$ would have a size of $32(1-s)N + N$ bits and would thus be equivalent to a thinner dense network with $(1 - s + 1/32)N$ weights. We use this formula to obtain the size of the only non-sparse baseline we have, thin dense, which will thus have more weights than the equivalently-sized sparse models.
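This size accounting can be made concrete with a short helper; the function name is ours:

```python
def equivalent_dense_width(n_positions, sparsity, bits_per_weight=32):
    """Size accounting from the text: a sparse tensor stores its nonzero
    weights plus a 1-bit-per-position mask, so its size in bits is
    bits_per_weight * (1 - s) * N + N. Dividing by bits_per_weight gives
    the number of weights in an equally-sized dense network."""
    total_bits = bits_per_weight * (1.0 - sparsity) * n_positions + n_positions
    return total_bits / bits_per_weight

# A 90%-sparse network over 1M positions is as big as a dense network with
# (1 - 0.9 + 1/32) * 1e6 ≈ 131250 weights.
width = equivalent_dense_width(1_000_000, 0.9)
```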
A recent work (Liu et al., 2018) shows that training small networks from scratch can match the accuracy of networks obtained through post-training pruning of larger networks; the authors show this is almost always the case if the small networks are trained long enough. To address potential concerns that the performance of our dynamic parameterization scheme could be matched by statically parameterized networks trained for more epochs, we always train the thin dense and static sparse baselines for double the number of epochs used to train our dynamic sparse models. This ensures that any superior accuracy achieved by our method cannot merely be due to its ability to converge faster during training. As we show in the results section, our dynamic parameterization scheme incurs minimal computational overhead, which means the thin dense and static sparse baselines are trained using significantly more computational resources than dynamic sparse.
Note that compressed sparse is a compression method that initially trains a large dense model and iteratively prunes it down, whereas all other baselines maintain the same model size throughout training. For compressed sparse, we train the large dense model for the same number of epochs used for dynamic sparse, and then iteratively and gradually prune it down across many additional training epochs; compressed sparse thus trains for more epochs than dynamic sparse. See Appendix A for the hyperparameters used in the experiments.
Table 1: Networks and datasets used in our experiments.

Dataset         CIFAR10     Imagenet
Model           WRN-28-2    ResNet-50
# Parameters    1.5M        25.6M

For brevity, architecture specifications omit batch normalization and activations. Pre-activation batch normalization was used in all cases. Convolutional (C) layers are specified with output size and kernel size, and max pooling (MaxPool) layers with kernel size. Brackets enclose residual blocks postfixed with repetition numbers; the downsampling convolution in the first block of a scale group is implied.
4 Experimental results
Figure 1: WRN-28-2 on CIFAR10. (a) Test accuracy plotted against the number of trainable parameters in the sparse models for different methods. Dashed lines mark the full dense model and models obtained through compression, whereas methods that maintain a constant parameter count throughout training and inference are shown with solid lines. Circular symbols mark the median of 5 runs, and error bars show the standard deviation. Parameter counts include all trainable parameters, i.e., parameters in sparse tensors plus all other dense tensors, such as those of batch normalization layers. (b) Breakdown of the final sparsities of the parameter tensors in the three residual blocks that emerged from our dynamic sparse parameterization algorithm (Algorithm 1) at different levels of global sparsity.

WRN-28-2 on CIFAR10: We ran experiments using a Wide ResNet model, WRN-28-2 (Zagoruyko & Komodakis, 2016), trained to classify CIFAR10 images (see Appendix A for details of implementation). We varied the level of global sparsity and evaluated the accuracy of different dynamic and static parameterization training methods. As shown in Figure 1, static sparse and thin dense significantly underperformed the compressed sparse model, as expected, whereas our method, dynamic sparse, performed slightly better on average. Deep rewiring significantly lagged all other methods. While the performance of SET was on par with compressed sparse, it lagged behind dynamic sparse at high sparsity levels; at low sparsity levels, SET largely closed the gap to compressed sparse. Even though the statically parameterized models static sparse and thin dense were trained for twice the number of epochs, they still failed to match the performance of our method or of SET. Keep in mind that thin dense even had more SGD-trainable weights than all the sparse models, as described in the methods section.
Our dynamic parameterization method automatically adjusts the sparsity of the parameter tensors in different layers by moving parameters across layers. We examined the sparsity patterns that emerged at different global sparsity levels and observed consistent trends: (a) larger parameter tensors tended to be sparser than smaller ones, and (b) deeper layers tended to be sparser than shallower ones. Figure 1 shows a breakdown of the sparsity levels in the different residual blocks at different global sparsity levels.
ResNet-50 on Imagenet: We also experimented with the ResNet-50 bottleneck architecture (He et al., 2015) trained on Imagenet (see Appendix A for details of implementation). We tested two global sparsity levels (Table 2). Models obtained using our method (dynamic sparse) outperformed models obtained using all other dynamic and static parameterization methods, and even slightly outperformed models obtained through post-training compression of a large dense model. We also list in Table 2 two representative structured pruning methods (see Appendix C), ThiNet (Luo et al., 2017) and Sparse Structure Selection (Huang & Wang, 2017), which, consistent with recent criticism (Liu et al., 2018), underperformed the static dense baselines. As with the previous experiments on WRN-28-2, reliable sparsity patterns across the parameter tensors in different layers emerged from dynamic parameter reallocation during training, displaying the same empirical trends described above (Figure 2).
Table 2: ResNet-50 on Imagenet. Accuracies of models trained by different reparameterization methods at two final overall sparsity levels (7.3M and 5.1M parameters), compared against the full dense model (25.6M parameters). Rows cover thin dense, static sparse, compression (also evaluated at 8.7M and 15.6M parameter counts), DeepR, SET, and dynamic sparse. (Numerical entries are not recoverable here.)
Table 3: Computational overhead of dynamic reparameterization methods, for the two benchmark settings (mean ± std):

DeepR           4.466 ± 0.358    5.636 ± 0.218
SET             1.087 ± 0.049    1.009 ± 0.002
Dynamic sparse  1.083 ± 0.051    1.005 ± 0.004
Computational overhead of dynamic reparameterization: We assessed the additional computational cost incurred by the reparameterization steps (Algorithm 1) during training, and compared our method with the existing dynamic sparse reparameterization techniques DeepR and SET (Table 3). Because both SET and our method reallocate parameters only intermittently (every few hundred training iterations), the computational overhead was negligible for the experiments presented here^4. DeepR, however, requires adding noise to gradient updates as well as reallocating parameters every training iteration, which led to a significantly larger overhead.

^4 Because of the rather negligible overhead, the reduced operation count from eliminating sorting operations did not amount to a substantial improvement over SET on GPUs. Our method's advantage over SET lies in its ability to produce better sparse models and to reallocate free parameters automatically (see Appendix E).
Disentangling the effects of dynamic reparameterization: Our dynamic parameter reallocation method consistently yields better accuracy than static parameterization methods, even though the latter were trained for more epochs and, in the case of thin dense, had more SGD-trainable parameters. The most immediate hypothesis to explain this is that our method discovers suitable sparse network structures that can be trained to reach high accuracies. To investigate whether the high performance of networks discovered by our method can be attributed solely to their sparse structure, we ran the following experiment on WRN-28-2 trained on CIFAR10: after training with our dynamic reallocation method, the structure (i.e. the positions of nonzero entries in the sparse parameter tensors) of the final network was retained, and this network was randomly reinitialized and retrained with the structure fixed (green bars in Fig. 3). Even though this network had the same structure as the final network found by our method, its training failed to reach the same accuracy.
One might argue that it is not just the network structure but also its initialization that allows it to reach high accuracies (Frankle & Carbin, 2018). To assess this argument, we took the final network structure found by our method as described above and initialized it with the same initial values used when training with our method. As shown in Fig. 3 (blue bars), the combination of final structure and original initialization still fell significantly short of the accuracy achieved by our dynamic parameter reallocation method, and its performance was not significantly different from training the same network with random initialization (green bars). As a control, we also show the static sparse case where the sparse network structure and its initialization were both random (red bars in Fig. 3); unsurprisingly, these networks performed worst. Similar trends hold for ResNet-50 trained on Imagenet, as shown in Fig. 3. All static networks, whether their structure and initialization were random or copied from networks trained using our dynamic parameterization method, were trained for double the number of epochs used by our method.
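The three statically parameterized controls described in this experiment can be sketched as follows. The function and dictionary-key names are ours, chosen to mirror the text; each resulting network would then be trained with its sparsity pattern held fixed:

```python
import numpy as np

def make_static_controls(init_theta, final_mask, rng):
    """Build the three controls: final structure with the original
    initialization, final structure with a fresh random initialization,
    and a fully random structure (same density) with a fresh initialization.
    The fresh-init scale (matching init_theta's std) is an assumption."""
    n_active = int(final_mask.sum())
    fresh = rng.standard_normal(init_theta.shape) * init_theta.std()
    rand_mask = np.zeros(final_mask.size, dtype=bool)
    rand_mask[rng.choice(final_mask.size, size=n_active, replace=False)] = True
    rand_mask = rand_mask.reshape(final_mask.shape)
    return {
        "structure + original init": init_theta * final_mask,  # blue bars
        "structure only":            fresh * final_mask,       # green bars
        "static sparse (random)":    fresh * rand_mask,        # red bars
    }

rng = np.random.default_rng(4)
theta0 = rng.standard_normal((16, 16))       # stand-in initial weights
mask = rng.random((16, 16)) < 0.2            # stand-in for a learned structure
controls = make_static_controls(theta0, mask, rng)
```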
These results indicate that the dynamics of parameter reallocation are themselves important for learning, as the success of the networks our method discovers cannot be attributed solely to their structure or initialization. For WRN-28-2, we experimented with stopping the parameter reallocation mechanism (i.e., fixing the network structure) at various points during training. As shown in Fig. 4, dynamic parameter reallocation does not need to be active for the entire course of training, but only for some initial epochs.
5 Discussion
In this work, we investigated the following problem: given a small, fixed budget of parameters for a deep residual CNN throughout training, how should it be trained to yield the best generalization performance? While this is an open-ended question, we showed that dynamic parameterization methods can achieve significantly better accuracies than static methods at the same model size. Dynamic parameterization methods have received relatively little attention, with the two principal techniques so far, SET and DeepR, applied only to relatively small and shallow networks. We showed that these techniques are indeed applicable to deep CNNs, with SET consistently outperforming DeepR while incurring a lower computational cost. We proposed a dynamic parameterization method that adaptively allocates free parameters across the network based on a simple heuristic; it yields better accuracies than previous dynamic parameterization methods and outperforms all the static parameterization methods we tested. In Appendix B, we show that our method also outperforms another static parameterization method based on weight tying, HashedNet (Chen et al., 2015). As we show in Appendix E, our method is additionally able to train networks at extreme sparsity levels where previous static and dynamic parameterization methods often fail catastrophically.
High-performance sparse networks are often obtained through post-training pruning of dense networks. Recent work examines how sparse networks can be trained directly using post-hoc information obtained from a pruned model. Liu et al. (2018) argue that it is the structure alone of the pruned model that matters, i.e., training a model with the same structure, starting from random weights, can reach the same level of accuracy as the pruned model. Other results (Frankle & Carbin, 2018) argue that a standalone sparse network can only be trained effectively if it copies both the pruned network's structure and the initial weights it had when it was part of the dense model. We performed experiments in the same spirit: we trained statically parameterized networks that copy only the structure, and networks that copy both the structure and the initial weight values, of the sparse networks trained using our scheme. Interestingly, neither matched the performance of sparse networks trained with our dynamic parameterization scheme. The value of our dynamic parameter reallocation scheme thus goes beyond discovering a good sparse network structure; the dynamics of the structure exploration process itself help gradient descent converge to better weights. Further work is needed to explain the mechanism underlying this phenomenon. One hypothesis is that the discontinuous jumps in network response when the structure changes provide the 'jolts' necessary to pull the network out of sharp minima that generalize badly (Keskar et al., 2016).
Structural degrees of freedom are qualitatively different from the degrees of freedom introduced by overparameterization. The latter can be directly exploited by gradient descent. Structural degrees of freedom, however, are explored using non-differentiable heuristics that interact only indirectly with the dynamics of gradient descent, for example when gradient descent pulls weights towards zero, causing the associated connections to be removed. Our results indicate that for residual CNNs, and as far as model size is concerned, we are better off allocating bits to describe and explore structural degrees of freedom in a reasonably sparse network than allocating them to conventional weights.
Besides model size, computational efficiency is also a primary concern. Current mainstream compute architectures such as CPUs and GPUs have trouble handling unstructured sparsity patterns efficiently. To maintain the standard CNN structure, various pruning techniques prune a trained model at the level of entire feature maps. Recent evidence suggests the resulting networks perform no better than conventionally trained thin networks (Liu et al., 2018), calling into question the value of such coarse pruning. In Appendix D, we show that our method can easily be extended to operate at an intermediate level of structured sparsity, that of kernel slices. Imposing this sparsity structure causes performance to degrade: the resulting networks perform only on par with statically parameterized thin dense networks trained for double the number of epochs.
In summary, our results indicate that for deep residual CNNs, it is possible to train sparse models directly to reach generalization performance comparable to sparse networks produced by iterative pruning of large dense models. Moreover, our dynamic parameterization method results in models that significantly outperform equivalentsized dense models. Exploring structural degrees of freedom during training is key and our method is the first that is able to fully explore these degrees of freedom using its ability to move parameters within and across layers. Our results do not contradict the common wisdom that extra degrees of freedom are needed while training deep networks, but they point to structural degrees of freedom as an alternative to the degrees of freedom introduced by overparameterization.
References
 Aimar et al. (2018) Aimar, A., Mostafa, H., Calabrese, E., Rios-Navarro, A., Tapiador-Morales, R., Lungu, I.-A., Milde, M. B., Corradi, F., Linares-Barranco, A., Liu, S.-C., et al. NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps. IEEE Transactions on Neural Networks and Learning Systems, (99):1–13, 2018.
 Arjovsky et al. (2015) Arjovsky, M., Shah, A., and Bengio, Y. Unitary Evolution Recurrent Neural Networks. nov 2015. URL http://arxiv.org/abs/1511.06464.
 Bellec et al. (2017) Bellec, G., Kappel, D., Maass, W., and Legenstein, R. Deep Rewiring: Training very sparse deep networks. nov 2017. URL http://arxiv.org/abs/1711.05136.
 Brutzkus et al. (2017) Brutzkus, A., Globerson, A., Malach, E., and Shalev-Shwartz, S. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data. oct 2017. URL http://arxiv.org/abs/1710.10174.
 Bucilua et al. (2006) Bucilua, C., Caruana, R., and Niculescu-Mizil, A. Model Compression. Technical report, 2006. URL https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf.
 Chen et al. (2015) Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., and Chen, Y. Compressing Neural Networks with the Hashing Trick. apr 2015. URL http://arxiv.org/abs/1504.04788.
 Choromanska et al. (2014) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The Loss Surfaces of Multilayer Networks. nov 2014. URL http://arxiv.org/abs/1412.0233.
 Cooper (2018) Cooper, Y. The loss landscape of overparameterized neural networks. apr 2018. URL http://arxiv.org/abs/1804.10200.
 Dai et al. (2017) Dai, X., Yin, H., and Jha, N. K. NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm. pp. 1–15, 2017. URL http://arxiv.org/abs/1711.02017.
 Dai et al. (2018) Dai, X., Yin, H., and Jha, N. K. Grow and Prune Compact, Fast, and Accurate LSTMs. may 2018. URL http://arxiv.org/abs/1805.11797.
 Dauphin et al. (2014) Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pp. 1–14, 2014. URL http://arxiv.org/abs/1406.2572.
 Denil et al. (2013) Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and de Freitas, N. Predicting Parameters in Deep Learning. jun 2013. URL http://arxiv.org/abs/1306.0543.
 Dinh et al. (2017) Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp Minima Can Generalize For Deep Nets. 2017. ISSN 19387228. URL http://arxiv.org/abs/1703.04933.
 Frankle & Carbin (2018) Frankle, J. and Carbin, M. The Lottery Ticket Hypothesis: Finding Small, Trainable Neural Networks. mar 2018. URL http://arxiv.org/abs/1803.03635.
 Goodfellow et al. (2014) Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively characterizing neural network optimization problems. dec 2014. URL http://arxiv.org/abs/1412.6544.
 Han et al. (2015a) Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. pp. 1–14, 2015a. URL http://arxiv.org/abs/1510.00149.
 Han et al. (2015b) Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both Weights and Connections for Efficient Neural Networks. jun 2015b. URL http://arxiv.org/abs/1506.02626.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. 2015. URL http://arxiv.org/abs/1512.03385.
 He et al. (2017) He, Y., Zhang, X., and Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. jul 2017. URL http://arxiv.org/abs/1707.06168.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the Knowledge in a Neural Network. pp. 1–9, 2015. URL http://arxiv.org/abs/1503.02531.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. apr 2017. URL http://arxiv.org/abs/1704.04861.
 Huang & Wang (2017) Huang, Z. and Wang, N. Data-Driven Sparse Structure Selection for Deep Neural Networks. jul 2017. URL https://arxiv.org/abs/1707.01213.
 Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. sep 2016. URL http://arxiv.org/abs/1609.07061.
 Im et al. (2016) Im, D. J., Tao, M., and Branson, K. An empirical analysis of the optimization of deep network loss surfaces. dec 2016. URL http://arxiv.org/abs/1612.04010.
 Jaderberg et al. (2014) Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. may 2014. URL http://arxiv.org/abs/1405.3866.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On LargeBatch Training for Deep Learning: Generalization Gap and Sharp Minima. sep 2016. URL http://arxiv.org/abs/1609.04836.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. Technical report, 2012.
 Lebedev & Lempitsky (2015) Lebedev, V. and Lempitsky, V. Fast ConvNets Using Group-wise Brain Damage. jun 2015. URL https://arxiv.org/abs/1506.02515.
 Li et al. (2018) Li, C., Farkhoor, H., Liu, R., and Yosinski, J. Measuring the Intrinsic Dimension of Objective Landscapes. apr 2018. URL http://arxiv.org/abs/1804.08838.
 Li et al. (2016) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning Filters for Efficient ConvNets. aug 2016. URL http://arxiv.org/abs/1608.08710.
 Liao & Poggio (2017) Liao, Q. and Poggio, T. Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning. arXiv, mar 2017. URL http://arxiv.org/abs/1703.09833.
 Lin et al. (2013) Lin, M., Chen, Q., and Yan, S. Network In Network. arXiv preprint, 2013. URL http://arxiv.org/abs/1312.4400.
 Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. aug 2017. URL https://arxiv.org/abs/1708.06519.
 Liu et al. (2018) Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the Value of Network Pruning. oct 2018. URL http://arxiv.org/abs/1810.05270.
 Luo et al. (2017) Luo, J.H., Wu, J., and Lin, W. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. jul 2017. URL http://arxiv.org/abs/1707.06342.
 McDonnell (2018) McDonnell, M. D. Training wide residual networks for deployment using a single bit for each weight. feb 2018. URL http://arxiv.org/abs/1802.08530.
 Mittal et al. (2018) Mittal, D., Bhardwaj, S., Khapra, M. M., and Ravindran, B. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks. jan 2018. URL http://arxiv.org/abs/1801.10447.
 Mocanu et al. (2018) Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, dec 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-04316-3. URL http://www.nature.com/articles/s41467-018-04316-3.
 Moczulski et al. (2015) Moczulski, M., Denil, M., Appleyard, J., and de Freitas, N. ACDC: A Structured Efficient Linear Layer. nov 2015. URL http://arxiv.org/abs/1511.05946.
 Narang et al. (2017) Narang, S., Elsen, E., Diamos, G., and Sengupta, S. Exploring Sparsity in Recurrent Neural Networks. apr 2017. URL http://arxiv.org/abs/1704.05119.
 Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and SohlDickstein, J. Sensitivity and Generalization in Neural Networks: an Empirical Study. feb 2018. URL http://arxiv.org/abs/1802.08760.
 Poggio et al. (2017) Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., and Mhaskar, H. Theory of Deep Learning III: explaining the non-overfitting puzzle. 2017. URL http://arxiv.org/abs/1801.00173.
 Salimans & Kingma (2016) Salimans, T. and Kingma, D. P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. feb 2016. URL http://arxiv.org/abs/1602.07868.
 Sifre & Mallat (2014) Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD thesis, 2014.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. sep 2014. URL http://arxiv.org/abs/1409.1556.
 Sindhwani et al. (2015) Sindhwani, V., Sainath, T. N., and Kumar, S. Structured Transforms for Small-Footprint Deep Learning. oct 2015. URL http://arxiv.org/abs/1510.01722.
 Suau et al. (2018) Suau, X., Zappella, L., and Apostoloff, N. Network Compression using Correlation Analysis of Layer Responses. jul 2018. URL http://arxiv.org/abs/1807.10585.
 Thomas et al. (2018) Thomas, A. T., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning invariance with compact transforms. pp. 1–7, 2018.
 Treister et al. (2018) Treister, E., Ruthotto, L., Sharoni, M., Zafrani, S., and Haber, E. LowCost Parameterizations of Deep Convolution Neural Networks. may 2018. URL http://arxiv.org/abs/1805.07821.
 Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 Wu et al. (2017) Wu, L., Zhu, Z., and E, W. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes. jun 2017. URL http://arxiv.org/abs/1706.10239.
 Yang et al. (2014) Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep Fried Convnets. dec 2014. URL http://arxiv.org/abs/1412.7149.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide Residual Networks. may 2016. URL http://arxiv.org/abs/1605.07146.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. nov 2016. URL http://arxiv.org/abs/1611.03530.
 Zhang et al. (2018a) Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., and Poggio, T. Theory of Deep Learning IIb: Optimization Properties of SGD. jan 2018a. URL http://arxiv.org/abs/1801.02254.
 Zhang et al. (2018b) Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., and Wang, Y. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers. apr 2018b. URL http://arxiv.org/abs/1804.03294.
 Zhu & Gupta (2017) Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. 2017. URL http://arxiv.org/abs/1710.01878.
Appendix A Details of implementation
We implemented all models and reparameterization mechanisms in PyTorch. Experiments were run on GPUs, and all sparse tensors were represented as dense tensors filtered by a binary mask. This was purely an implementational choice, made for ease of experimentation on available hardware and software; it does not realize the memory savings that sparsity permits. On a computing substrate optimized for sparse linear algebra, our method is expected to deliver the promised memory efficiency.
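As a concrete sketch of this representational choice, a sparse layer can be stored as a dense weight tensor multiplied elementwise by a fixed binary mask. The class and its names below are hypothetical illustrations, not our actual implementation.

```python
import torch


class MaskedLinear(torch.nn.Module):
    """Sparse-in-effect linear layer: a dense weight tensor filtered by a
    binary mask (hypothetical sketch of the representation described above)."""

    def __init__(self, in_features, out_features, sparsity=0.9):
        super().__init__()
        self.weight = torch.nn.Parameter(
            0.01 * torch.randn(out_features, in_features))
        # The mask is a non-trainable buffer; roughly `sparsity` of its
        # entries are zero.
        mask = (torch.rand(out_features, in_features) > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Pruned weights are zeroed at every forward pass, but the dense
        # storage means no memory is actually saved.
        return torch.nn.functional.linear(x, self.weight * self.mask)


layer = MaskedLinear(100, 10, sparsity=0.9)
y = layer(torch.randn(4, 100))
```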
Table 4. Hyperparameters used in the experiments. Columns correspond to LeNet-300-100 on MNIST, WRN-28-2 on CIFAR10, and Resnet-50 on Imagenet, respectively.

| Hyperparameter | LeNet-300-100 (MNIST) | WRN-28-2 (CIFAR10) | Resnet-50 (Imagenet) |
|---|---|---|---|
| Hyperparameters for training | | | |
| Number of training epochs | 100 | 200 | 100 |
| Minibatch size | 100 | 100 | 256 |
| Momentum (Nesterov) | 0.9 | 0.9 | 0.9 |
| L1 regularization multiplier | 0.0001 | 0.0 | 0.0 |
| L2 regularization multiplier | 0.0 | 0.0005 | 0.0001 |
| Hyperparameters for sparse compression (compressed sparse) (Zhu & Gupta, 2017) | | | |
| Number of pruning iterations | 10 | 20 | 20 |
| Training epochs per pruning iteration | 2 | 2 | 2 |
| Number of training epochs post-pruning | 20 | 10 | 10 |
| Total number of pruning epochs | 40 | 50 | 50 |
| Hyperparameters for dynamic sparse reparameterization (dynamic sparse) (ours) | | | |
| Number of parameters to prune per reallocation step | 600 | 20,000 | 200,000 |
| Fractional tolerance of the pruned-parameter count | 0.1 | 0.1 | 0.1 |
| Initial pruning threshold | 0.001 | 0.001 | 0.001 |
| Hyperparameters for Sparse Evolutionary Training (SET) (Mocanu et al., 2018) | | | |
| Number of parameters pruned and regrown per step | 600 | 20,000 | 200,000 |
| Hyperparameters for Deep Rewiring (DeepR) (Bellec et al., 2017) | | | |
| L1 regularization multiplier | | | |
Training Hyperparameter settings for training are listed in the first block of Table 4. Standard mild data augmentation was used in all CIFAR10 experiments (random translation, cropping, and horizontal flipping) and in all Imagenet experiments (random cropping and horizontal flipping). The last linear layer of WRN-28-2 was always kept dense, as it contains a negligible number of parameters. The thin dense and static sparse baselines were trained for double the number of training epochs shown in Table 4.
Sparse compression baseline We compared our method against iterative pruning methods (Han et al., 2015b; Zhu & Gupta, 2017). We started from a full dense model trained with the hyperparameters in the first block of Table 4 and then gradually pruned the network to a target sparsity over a number of pruning iterations. Following Zhu & Gupta (2017), the pruning schedule we used was
$$ s_n = s_f \left( 1 - \left( 1 - \frac{n}{N} \right)^{3} \right), \qquad n = 1, \ldots, N \tag{2} $$
where $n$ indexes pruning steps, $N$ is the total number of pruning iterations, and $s_f$ is the target sparsity reached at the end of training. Thus, this baseline (labeled compressed sparse in the paper) was effectively trained for more iterations (the original training phase plus the compression phase) than our dynamic sparse method. Hyperparameter settings for sparse compression are listed in the second block of Table 4.
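This schedule can be sketched in a few lines, assuming the cubic ramp of Zhu & Gupta (2017) starting from a fully dense model (zero initial sparsity); `pruning_schedule` is a hypothetical helper name.

```python
def pruning_schedule(n, N, s_f):
    """Sparsity after pruning step n of N under a cubic ramp from a dense
    model (sketch): aggressive pruning early, gentle near the target."""
    return s_f * (1.0 - (1.0 - n / N) ** 3)


# With 10 pruning steps and a target sparsity of 0.9, the first step already
# removes about a quarter of the weights, and the ramp flattens at the end.
schedule = [pruning_schedule(n, 10, 0.9) for n in range(1, 11)]
```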
Dynamic reparameterization (ours) Hyperparameter settings for dynamic sparse reparameterization (Algorithm 1) are listed in the third block of Table 4.
Sparse Evolutionary Training (SET) Because the larger-scale experiments here (WRN-28-2 on CIFAR10 and Resnet-50 on Imagenet) were not attempted by Mocanu et al. (2018), the original paper provides no reparameterization settings for these cases. To make a fair comparison, we used the same hyperparameters as in our dynamic reparameterization scheme (third block of Table 4). At each reparameterization step, the weights in each layer were sorted by magnitude and the smallest fraction was pruned; an equal number of parameters were then randomly allocated within the same layer and initialized to zero. As a control, the total number of reallocated weights at each step was chosen to match our dynamic reparameterization method, as was the reparameterization schedule.
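A minimal sketch of one such SET-style rewiring step on a single layer, with hypothetical names and NumPy standing in for the actual training framework:

```python
import numpy as np


def set_step(weights, mask, prune_fraction=0.3, rng=None):
    """One SET-style rewiring step on a layer (sketch; names hypothetical).

    Prunes the smallest-magnitude fraction of currently active weights, then
    regrows the same number at randomly chosen inactive positions, with the
    regrown weights initialized to zero, as described above.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_prune = int(prune_fraction * mask.sum())
    # Prune: deactivate the n_prune smallest-magnitude active weights.
    mags = np.where(mask, np.abs(weights), np.inf)
    drop = np.argsort(mags, axis=None)[:n_prune]
    mask.flat[drop] = False
    weights.flat[drop] = 0.0
    # Grow: activate an equal number of random inactive positions; the new
    # weights start at zero, so the parameter count stays exactly fixed.
    grow = rng.choice(np.flatnonzero(~mask), size=n_prune, replace=False)
    mask.flat[grow] = True
    return weights, mask


rng = np.random.default_rng(1)
mask0 = rng.random((20, 20)) < 0.2           # ~20% of weights active
w0 = rng.standard_normal((20, 20)) * mask0
k0 = int(mask0.sum())
w1, mask1 = set_step(w0.copy(), mask0.copy())
```

The active-weight count is conserved by construction, since every pruned position is matched by a grown one in the same layer.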
Deep Rewiring (DeepR) The fourth block of Table 4 contains the hyperparameters for the DeepR experiments. We refer the reader to Bellec et al. (2017) for details of the deep rewiring algorithm and an explanation of its hyperparameters. We chose the DeepR hyperparameters for the different networks based on a parameter sweep.
Appendix B Comparison to hash nets
We also compared our dynamic sparse reparameterization method to a number of static dense reparameterization techniques, e.g. Denil et al. (2013); Yang et al. (2014); Moczulski et al. (2015); Sindhwani et al. (2015); Chen et al. (2015); Treister et al. (2018). Instead of sparsification, these methods impose structure on large parameter tensors through parameter sharing. Most of these methods have not been applied to convolutional layers, except for recent ones (Chen et al., 2015; Treister et al., 2018). We found that HashedNet (Chen et al., 2015) performed best among these static dense reparameterization methods, and therefore benchmarked our method against it. Instead of reparameterizing a parameter tensor as a sparse tensor with a reduced number of nonzero components, HashedNet scatters a small pool of free parameters across all positions of the parameter tensor through a random mapping computed by cheap hashing, resulting in a dense parameter tensor with shared components.
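To make the contrast concrete, here is a minimal sketch of HashedNet-style weight sharing; a seeded random index array stands in for the cheap hash function, and all names are hypothetical.

```python
import numpy as np


def hashed_weight(shape, free_params, seed=0):
    """Build a dense virtual weight matrix whose every entry is drawn from a
    small pool of shared free parameters via a random position-to-parameter
    mapping (a seeded RNG stands in for HashedNet's cheap hash)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(free_params), size=shape)
    return np.asarray(free_params)[idx]


params = np.array([0.5, -1.0, 2.0])    # only 3 unique free parameters...
W = hashed_weight((64, 32), params)    # ...but a dense 64x32 weight matrix
```

The matrix is dense, so it composes with ordinary dense matrix multiplies, yet gradients for its 2048 entries accumulate onto only three trainable values.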
Results for LeNet-300-100 on MNIST and for WRN-28-2 on CIFAR10 are presented in Figure 5, and those for Resnet-50 on Imagenet in Table 5. At each global sparsity level of our method, we compared against a HashedNet with all reparameterized tensors hashed such that each had a fraction of unique parameters equal to our method's density. Our dynamic sparse method significantly outperformed HashedNet.
Table 5. Top-1 and top-5 accuracy (%) of Resnet-50 on Imagenet for HashedNet and our method at two parameter budgets. Bracketed values are the accuracy deficit relative to the full dense model.

| Final global sparsity (# parameters) | Top-1 (7.3M) | Top-5 (7.3M) | Top-1 (5.1M) | Top-5 (5.1M) |
|---|---|---|---|---|
| HashedNet | 70.0 [4.9] | 89.6 [2.8] | 66.9 [8.0] | 87.4 [5.0] |
| Dynamic sparse (ours) | 73.3 [1.6] | 92.4 [0.0] | 71.6 [3.3] | 90.5 [1.9] |
Appendix C A taxonomy of training methods that yield “sparse” deep CNNs
As an extension to Section 2 of the main text, here we elaborate on existing methods related to ours, how they compare and contrast with each other, and what features, apart from effectiveness, distinguish our approach from all previous ones. We confine the scope of comparison to training methods that produce smaller versions (i.e. ones with fewer parameters) of a given modern (i.e. post-AlexNet) deep convolutional neural network model. We list representative methods in Table 6 and classify them by three key features.
Table 6. Taxonomy of training methods that yield "sparse" deep CNNs, classified by the three key features discussed below.

| Method | Strict parameter budget throughout training | Granularity of sparsity | Automatic discovery of layerwise sparsity |
|---|---|---|---|
| Direct training methods: | | | |
| Dynamic sparse (ours) | yes | non-structured | yes |
| DeepR (Bellec et al., 2017) | yes | non-structured | no |
| SET (Mocanu et al., 2018) | yes | non-structured | no |
| Non-structured compression methods: | | | |
| | no | non-structured | yes |
| | no | non-structured | no |
| | no | non-structured | no |
| | no | non-structured | no |
| Structured compression methods: | | | |
| | no | channel | no |
| | no | channel | no |
| | no | channel/kernel/layer | yes |
| | no | channel | no |
| | no | channel | no |
| | no | channel | yes |
| | no | layer | yes |
| | no | channel | yes/no |
Strict parameter budget throughout training and inference This feature was discussed in depth in the main text. Most methods to date are compression techniques: they start training with a fully parameterized, dense model and then reduce the parameter count. To the best of our knowledge, only three methods, namely DeepR (Bellec et al., 2017), SET (Mocanu et al., 2018), and ours, strictly impose throughout the entire course of training a fixed, small parameter budget equal to the size of the final sparse model used for inference. We make a distinction between these direct training methods (first block of Table 6) and compression methods (second and third blocks of Table 6). An intermediate case is NeST (Dai et al., 2017; 2018), which starts training with a small network, grows it to a large size, and finally prunes it down again; since a fixed parameter footprint is not strictly imposed throughout training, we list NeST among the compression methods in the second block of Table 6.
This distinction is meaningful in two ways: (a) practically, direct training methods are more memory-efficient on an appropriate computing substrate, requiring parameter storage no larger than the final compressed model; (b) theoretically, these methods, if they perform on par with or better than compression methods (as this work suggests), shed light on an important open question: whether gross overparameterization during training is necessary for good generalization performance.
Granularity of sparsity The granularity of sparsity refers to the additional structure imposed on the placement of the nonzero entries of a sparsified parameter tensor. The finest-grained case, namely non-structured, allows each individual weight in a parameter tensor to be zero or nonzero independently. Early compression techniques, e.g. Han et al. (2015b), and more recent pruning-based compression methods built on them, e.g. Zhu & Gupta (2017), are non-structured (second block of Table 6), as are all direct training methods, including ours (first block of Table 6).
Non-structured sparsity cannot be fully exploited by mainstream compute devices such as GPUs. To tackle this problem, a class of compression methods, structured pruning methods (third block of Table 6), constrains "sparsity" to a much coarser granularity. Typically, pruning is performed at the level of an entire feature map, e.g. ThiNet (Luo et al., 2017), of whole layers, or even of entire residual blocks (Huang & Wang, 2017). This way, the compressed "sparse" model simply has smaller and/or fewer dense parameter tensors, and computation can be accelerated on GPUs in the same way as for dense neural networks.
These structured compression methods, however, did not make a useful baseline in this work, for two reasons. First, because they produce dense models, they are far less relevant to our method (non-structured, non-compression) than non-structured compression techniques that yield sparse models. Second, typical structured pruning methods substantially underperform non-structured ones (see Table 2 in the main text for two examples, ThiNet and SSS), and emerging evidence has called the fundamental value of structured pruning into question: Mittal et al. (2018) found that the channel pruning criteria used in a number of state-of-the-art structured pruning methods performed no better than random channel elimination, and Liu et al. (2018) found that fine-tuning in a number of state-of-the-art pruning methods fared no better than directly training a randomly initialized pruned model, which, in the case of channel/layer pruning, is simply a less wide and/or less deep dense model (see Table 2 in the main text for a comparison of ThiNet and SSS against thin dense).
In addition, we performed extra experiments in which we constrained our method to operate on networks with structured sparsity and obtained significantly worse results, see Appendix D.
Predefined versus automatically discovered sparsity levels across layers The last key feature (rightmost column of Table 6) in our classification is whether the sparsity levels of the network's different layers are automatically discovered during training or predefined by manual configuration. The value of automatic sparsification, as in our method, is twofold. First, it is conceptually more general, because parameter reallocation heuristics can be applied to diverse model architectures, whereas layer-specific configuration has to be cognizant of the network architecture, and at times also of the task to be learned. Second, it is practically more scalable, because it obviates manual configuration of layerwise sparsity, keeping the overhead of hyperparameter tuning constant rather than scaling with model depth/size. Beyond efficiency, extra experiments in Appendix E show how automatic parameter reallocation across layers contributes to our method's effectiveness.
In conclusion, our method is unique in that it:

strictly maintains a fixed parameter footprint throughout the entire course of training.

automatically discovers layerwise sparsity levels during training.
Appendix D Structured versus nonstructured sparsity
Table 7. Top-1 and top-5 accuracy (%) of Resnet-50 on Imagenet for structured (kernel-granularity) versus non-structured dynamic sparse reparameterization. Bracketed values are the accuracy deficit relative to the full dense model.

| Final overall sparsity (# parameters) | Top-1 (7.3M) | Top-5 (7.3M) | Top-1 (5.1M) | Top-5 (5.1M) |
|---|---|---|---|---|
| Thin dense | 72.4 [2.5] | 90.9 [1.5] | 70.7 [4.2] | 89.9 [2.5] |
| Dynamic sparse (kernel granularity) | 72.6 [2.3] | 91.0 [1.4] | 70.2 [4.7] | 89.8 [2.6] |
| Dynamic sparse (non-structured) | 73.3 [1.6] | 92.4 [0.0] | 71.6 [3.3] | 90.5 [1.9] |
We investigated how our method performs when constrained to train sparse models at a coarser granularity. Consider the weight tensor of a convolutional layer, of size $c_{out} \times c_{in} \times h \times w$, where $c_{out}$ and $c_{in}$ are the numbers of output and input channels, respectively, and $h \times w$ is the kernel size. Our method performs dynamic sparse reparameterization by pruning and reallocating individual weights of this 4-dimensional parameter tensor, the finest granularity. To adapt our procedure to coarse-grained sparsity over groups of parameters, we modified our algorithm (Algorithm 1 in the main text) in the following ways:

- The pruning step removed entire groups of weights by comparing their norms with the adaptive threshold.
- The adaptive threshold was updated based on the difference between the target number and the actual number of groups pruned/grown at each step.
- The growth step reallocated groups of weights within and across parameter tensors using the heuristic in Line 17 of Algorithm 1.
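The modified pruning step can be sketched as follows, assuming kernel-level groups and an L2 group norm (the helper name is hypothetical):

```python
import numpy as np


def prune_kernels(weight, threshold):
    """Prune whole kernels of a conv weight tensor shaped (c_out, c_in, h, w),
    zeroing every 2D kernel whose L2 norm falls below the adaptive threshold
    (sketch; assumes an L2 group norm)."""
    norms = np.sqrt((weight ** 2).sum(axis=(2, 3)))   # one norm per kernel
    keep = norms >= threshold                          # (c_out, c_in) mask
    return weight * keep[:, :, None, None], keep


w = np.random.default_rng(0).standard_normal((8, 4, 3, 3))
pruned, keep = prune_kernels(w, threshold=3.0)
```

Surviving kernels are left untouched; pruned kernels are zeroed as a unit, which is the coarser granularity evaluated in Table 7.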
We show results at kernel-level granularity (i.e. groups are kernels) in Figure 6 and Table 7, for WRN-28-2 on CIFAR10 and Resnet-50 on Imagenet, respectively. Enforcing kernel-level sparsity leads to significantly worse accuracy than non-structured sparsity. For WRN-28-2, kernel-level parameter reallocation still outperforms the thin dense baseline, though this advantage disappears as the level of sparsity decreases. Note that the thin dense baseline was always trained for double the number of epochs used to train the models with dynamic parameter reallocation.
When we further coarsened the granularity of sparsity to channel level (i.e. groups are slices that generate output feature maps), our method failed to produce performant models.
Appendix E Multilayer perceptrons and training at extreme sparsity levels
We carried out experiments on small multilayer perceptrons to assess whether our dynamic parameter reallocation method can effectively distribute parameters in small networks at extreme sparsity levels. We experimented with a simple LeNet-300-100 trained on MNIST; hyperparameters for the experiments are reported in Appendix A. The results are shown in Fig. 7. Other than pruning from a large dense model, ours is the only method capable of effectively training the network at the highest sparsity setting, as it automatically moves parameters between layers to realize per-layer sparsities that can be trained effectively. The per-layer sparsities discovered by our method are also shown in Fig. 7: our method automatically arrives at a top layer with much lower sparsity than the two hidden layers. Similar sparsity patterns were found through hand-tuning to improve the performance of DeepR (Bellec et al., 2017). All layers were initialized at the same sparsity level (equal to the global sparsity level). While hand-tuning the per-layer sparsities should allow SET and DeepR to learn at the highest sparsity setting, our method discovers the per-layer sparsities automatically and dispenses with such a tuning step.
Appendix F Full description of the dynamic parameter reallocation algorithm
Algorithm 1 in the main text informally describes our parameter reallocation scheme. In this appendix, we present a more rigorous description of the algorithm. Let all reparameterized weight tensors in the original network be denoted by $\{W^\ell\}$, where $\ell$ indexes layers. Let $N^\ell$ be the number of parameters in $W^\ell$, and $N = \sum_\ell N^\ell$ the total parameter count.
We sparsely reparameterize each tensor as $W^\ell = \psi(\theta^\ell; i^\ell)$, where the function $\psi$ places the components of the parameter vector $\theta^\ell$ into the positions of $W^\ell$ indexed by $i^\ell$, an ordered subset of $\{1, \ldots, N^\ell\}$, such that $(W^\ell)_{i^\ell_k} = (\theta^\ell)_k$. Let $M^\ell$ be the dimensionality of both $\theta^\ell$ and $i^\ell$, i.e. the number of nonzero weights in $W^\ell$. Define $s^\ell = 1 - M^\ell / N^\ell$ as the sparsity of $W^\ell$; global sparsity is then defined as $s = 1 - M/N$, where $M = \sum_\ell M^\ell$.
Throughout the whole course of training, we kept the global sparsity constant at a level specified by the hyperparameter $s$. The reparameterization was initialized by uniformly sampling, for each weight tensor, $M^\ell = \mathrm{round}\big((1 - s) N^\ell\big)$ positions at random, so that every tensor started at the global sparsity; the associated parameters were randomly initialized.
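The initialization just described amounts to the following sketch (hypothetical helper name; assumes each tensor's nonzero count is its dense size times the density, rounded):

```python
import numpy as np


def init_sparse_indices(layer_sizes, global_sparsity, seed=0):
    """For each weight tensor, sample uniformly at random the positions of
    its nonzero weights, so that every layer starts at the global sparsity
    (sketch; helper name hypothetical)."""
    rng = np.random.default_rng(seed)
    indices = []
    for n in layer_sizes:
        m = round((1.0 - global_sparsity) * n)  # nonzero count for this tensor
        indices.append(rng.choice(n, size=m, replace=False))
    return indices


# Dense sizes of a LeNet-300-100's three weight matrices, at 90% sparsity.
idx = init_sparse_indices([784 * 300, 300 * 100, 100 * 10], global_sparsity=0.9)
```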
Dynamic reparameterization was done periodically by repeating the following steps during training:

Train the model, under its current sparse reparameterization, for a fixed number of batch iterations;

Reallocate free parameters within and across weight tensors following Algorithm 2 to arrive at a new reparameterization.
The adaptive reallocation is in essence a two-step procedure: a global pruning step followed by tensor-wise growth. Specifically, our algorithm has the following key features:

Pruning was based on the magnitude of weights, comparing all parameters to a single global threshold; this makes the algorithm much more scalable than methods that rely on layer-specific pruning.

We made the threshold adaptive, subject to simple set-point control dynamics that ensured that roughly the target number of weights was pruned globally at each step. This is computationally cheaper than pruning exactly the smallest weights, which would require sorting all weights in the network.

Growth was tensor-specific and performed by uniformly sampling zero weights, thereby achieving a reallocation of parameters across layers. The heuristic guiding growth is
$$ g^\ell = K \, \frac{r^\ell}{\sum_{\ell'} r^{\ell'}} \tag{3} $$
where $g^\ell$ is the number of weights grown in tensor $\ell$, and $K$ and $r^\ell$ are the total pruned parameter count and the surviving (nonzero) parameter count of tensor $\ell$, respectively. This rule allocated more free parameters to weight tensors with more surviving entries, while keeping the global sparsity constant by balancing the numbers of parameters pruned and grown. (An exact match is not guaranteed, due to rounding errors in Eq. 3 and the possibility of a tensor's free parameter count exceeding its dense size after reallocation; in these cases, an extra step redistributed parameters randomly to other tensors, thereby assuring an exact global sparsity.)
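A proportional-allocation sketch of this growth heuristic (hypothetical names; the surplus handling here is a simplification of the random redistribution described above):

```python
def reallocate_growth(pruned_counts, surviving_counts):
    """Distribute the total pruned budget across tensors in proportion to
    each tensor's surviving (nonzero) parameter count (sketch of Eq. 3)."""
    K = sum(pruned_counts)                  # total weights pruned this step
    R = sum(surviving_counts)
    growth = [int(round(K * r / R)) for r in surviving_counts]
    # Rounding can leave a small surplus or deficit; absorb it in the last
    # tensor so the global parameter count stays exactly fixed (the full
    # algorithm instead redistributes randomly across tensors).
    growth[-1] += K - sum(growth)
    return growth


g = reallocate_growth(pruned_counts=[50, 120, 30],
                      surviving_counts=[500, 3000, 1500])
```

Here 200 pruned weights are regrown in proportion 500:3000:1500 across the three tensors, so tensors with more surviving entries receive more of the budget.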
The entire procedure is fully specified by the small set of hyperparameters listed in the third block of Table 4.