While machine learning research has recently focused on massive deep neural networks, practical deployment of these powerful models on edge devices remains a challenge. One approach to making machine learning for edge intelligence more feasible is neural network pruning, where connections in a neural network are removed to produce a smaller final model. Pruning can be used on its own to improve inference performance and compress models, or it can be combined with other techniques like quantization to produce even more efficient neural networks [1].
Neural network pruning methods can be divided into structured and unstructured pruning. In unstructured pruning, any set of connections can be removed. Structured pruning instead requires that the pruned connections follow some structure, such as removing entire rows from weight matrices or entire filters from convolutional layers. Structured pruning can usually improve performance metrics like inference speed at lower sparsity levels than unstructured pruning, since the structure can be taken advantage of in optimization [2]. On the other hand, unstructured pruning can generally remove more connections without significantly decreasing accuracy, even finding architectures that outperform the original model [3].
We introduce a novel pruning method that can be used for both unstructured and structured pruning. Taking inspiration from statistical physics, we propose inducing a Gibbs distribution over the weights of a neural network and sampling from it during training to determine pruning masks. This procedure induces a learned network structure that is resilient to high degrees of pruning. Gibbs distributions are highly flexible in terms of network properties that they can express, and quadratic energy functions, such as Ising models, can capture parameter interactions and induce desired structure in pruning masks. They also naturally allow a temperature parameter to be used for annealing, gradually converging to a final pruning mask during training and improving network resilience to pruning.
Many existing pruning methods start with a trained network and use an iterative procedure that alternates between removing some connections and training for additional epochs to fine-tune the network [1, 4, 5]. We believe that it is preferable to prune the network during the original training phase, since this allows the network and the pruning mask to adapt to each other, in particular encouraging the network to take on a representation that is robust to pruning. We show that our method can take advantage of additional training epochs by stretching the annealing and learning rate schedules rather than adding a separate fine-tuning phase. We compare our proposed method to a number of structured and unstructured pruning methods from the literature, including methods that fine-tune trained networks and methods that prune while training [1, 3, 6, 4, 5]. We generally find that our method outperforms them.
Sample code for our proposed method is available at https://github.com/j201/gibbs-pruning.
II Proposed Method
A Gibbs distribution over a vector $x$ has a density function
$$p(x) = \frac{1}{Z(\beta)} e^{-\beta H(x)}, \tag{1}$$
where $H(x)$ is the Hamiltonian function, or energy, $\beta$ is an inverse temperature parameter, and
$$Z(\beta) = \sum_{x} e^{-\beta H(x)} \tag{2}$$
is a partition function that normalizes the distribution. Gibbs distributions are a very general family of probability distributions, giving considerable flexibility in defining pruning methods. For instance, every Markov random field corresponds to a Gibbs distribution, and vice versa. They also naturally permit annealing, by changing $\beta$ over the course of training.
In our proposed method, $x$ represents a pruning mask for a single neural network layer, with $x \in \{-1, 1\}^N$, where $N$ is the number of weights in the layer. Given the weights of the layer as a flattened vector $w$, a weight $w_i$ is masked (treated as zero) during training if the corresponding $x_i$ is $-1$. The mask $x$ is sampled from $p(x)$ at every training step and once again after training to determine the final mask. Note that weight values are not permanently modified during training, allowing a connection to be masked out in some iterations and active in others.
The goal of our method is to have both the neural network weights and pruning mask converge together from random initial states to a final effective representation over the course of training. Gibbs distributions enable this, since they can be annealed from more random behaviour at high temperatures and converge to a final steady state [7]. The most likely result of sampling, which is also the result we expect to converge to, corresponds to the state that minimizes the Hamiltonian. We select a Hamiltonian and annealing schedule that will converge to an effective pruned network.
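The effect of temperature on a Gibbs distribution can be illustrated on a toy problem that is small enough to enumerate every state. The sketch below is not taken from the paper: the energy function `toy_hamiltonian` and its coefficients are hypothetical, chosen only to show that a low inverse temperature gives a nearly uniform distribution over masks, while a high one concentrates almost all probability on the minimum-energy state.

```python
import itertools
import math


def gibbs_probabilities(hamiltonian, n, beta):
    """Return {state: probability} over all states in {-1, +1}^n."""
    states = list(itertools.product((-1, 1), repeat=n))
    unnormalized = [math.exp(-beta * hamiltonian(x)) for x in states]
    z = sum(unnormalized)  # partition function, feasible only for small n
    return {x: u / z for x, u in zip(states, unnormalized)}


# Hypothetical toy energy: a positive coefficient favours x_i = +1,
# a negative coefficient favours x_i = -1.
COEFFS = (0.5, -0.2, 0.1)


def toy_hamiltonian(x):
    return -sum(c * xi for c, xi in zip(COEFFS, x))


hot = gibbs_probabilities(toy_hamiltonian, 3, beta=0.01)   # high temperature
cold = gibbs_probabilities(toy_hamiltonian, 3, beta=50.0)  # low temperature
ground_state = min(hot, key=toy_hamiltonian)               # minimum-energy state

print(max(hot.values()))   # close to 1/8: nearly uniform
print(cold[ground_state])  # close to 1: concentrated on the ground state
```

Exhaustive enumeration scales as $2^n$, which is why the layer-sized case relies on the factorization and sampling schemes described in the following sections.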
Defining a pruning method within this framework requires:
A Hamiltonian where the minimum energy corresponds to a desired pruned structure, based on the network parameters;
A method for sampling from the distribution;
An approach for annealing during training.
The following sections discuss these elements in more detail.
II-A Unstructured Pruning
A widely used heuristic for pruning neural networks is pruning weights with low magnitudes [1, 4, 8]. We follow this approach in designing a Hamiltonian for unstructured pruning: we want the energy to be minimized when weights with low magnitude are pruned away.
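Let $N$ be the number of weights in the layer and let $\tilde{w}$ represent a vector with the elements of $w$ sorted such that $\tilde{w}_1^2 \le \tilde{w}_2^2 \le \dots \le \tilde{w}_N^2$. Given a pruning rate $p$ between 0 and 1, select $Q_p(w) = \tilde{w}_k^2$, where $k = pN$. If $pN$ is not an integer for the given value of $p$, we linearly interpolate between $\tilde{w}_{p_1 N}^2$ and $\tilde{w}_{p_2 N}^2$, where $p_1$ and $p_2$ are the nearest values below and above $p$ respectively for which $p_1 N$ and $p_2 N$ are integers:
$$Q_p(w) = \tilde{w}_{p_1 N}^2 + \frac{p - p_1}{p_2 - p_1} \left( \tilde{w}_{p_2 N}^2 - \tilde{w}_{p_1 N}^2 \right).$$
In effect, $Q_p(w)$ is the empirical $p$th quantile of the squared weight magnitudes.
We define the Hamiltonian for unstructured pruning as
$$H(x) = -\sum_{i=1}^{N} \left( w_i^2 - Q_p(w) \right) x_i. \tag{3}$$
The value of the Hamiltonian is minimized when the fraction $p$ of weights with lowest magnitude are masked, making that the state that annealing will converge towards.
In general, sampling from the Gibbs distribution is difficult because the partition function cannot be feasibly computed for large vectors. In the following proposition, we show that our proposed Hamiltonian for unstructured pruning factors in terms of elements of $x$. The elements are therefore independent and can easily be sampled individually.
Proposition 1. The Gibbs distribution corresponding to the proposed Hamiltonian (3) has a product form.
Proof. The partition function normalizes the distribution. With the proposed Hamiltonian, it has the form:
$$Z(\beta) = \sum_{x \in \{-1, 1\}^N} \prod_{i=1}^{N} e^{\beta (w_i^2 - Q_p(w)) x_i} = \prod_{i=1}^{N} \sum_{x_i \in \{-1, 1\}} e^{\beta (w_i^2 - Q_p(w)) x_i}.$$
The Gibbs distribution therefore factors as:
$$p(x) = \prod_{i=1}^{N} \frac{e^{\beta (w_i^2 - Q_p(w)) x_i}}{\sum_{x_i' \in \{-1, 1\}} e^{\beta (w_i^2 - Q_p(w)) x_i'}} = \prod_{i=1}^{N} p(x_i).$$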
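Because the distribution factors over mask elements, each element can be drawn independently. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the unstructured Hamiltonian has the form $H(x) = -\sum_i (w_i^2 - Q) x_i$, where $Q$ is the empirical quantile of the squared weights, so that $P(x_i = +1) = 1 / (1 + e^{-2\beta(w_i^2 - Q)})$. The quantile uses a 1-indexed position $pN$ with linear interpolation, which is one possible reading of the definition. All function names are hypothetical.

```python
import math
import random


def squared_quantile(weights, p):
    """Empirical p-th quantile of the squared weights (assumed convention:
    1-indexed position p*N in sorted order, linear interpolation between)."""
    sq = sorted(w * w for w in weights)
    n = len(sq)
    pos = p * n
    lo = max(math.floor(pos), 1)
    hi = max(min(math.ceil(pos), n), 1)
    if lo == hi:
        return sq[lo - 1]
    return sq[lo - 1] + (pos - lo) * (sq[hi - 1] - sq[lo - 1])


def _sigmoid(z):
    """Numerically stable logistic function (large beta makes |z| huge)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def keep_probability(w, q, beta):
    """P(x_i = +1) for one weight under the assumed Hamiltonian."""
    return _sigmoid(2.0 * beta * (w * w - q))


def sample_mask(weights, p, beta, rng):
    """Draw each mask element independently; -1 means the weight is masked."""
    q = squared_quantile(weights, p)
    return [1 if rng.random() < keep_probability(w, q, beta) else -1
            for w in weights]


weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.6, 0.2, 0.8, -0.5]
early_mask = sample_mask(weights, p=0.5, beta=0.01, rng=random.Random(0))
final_mask = sample_mask(weights, p=0.5, beta=1e4, rng=random.Random(0))
```

At small `beta` every keep probability is close to 0.5, so `early_mask` is essentially random. At large `beta` the mask is effectively deterministic, except for a weight that sits exactly at the quantile, whose keep probability stays at 0.5 under this reading.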
II-B Structured Pruning
For structured pruning, we prune blocks of weights together in a way that is advantageous for inference performance. We enforce this by adding a quadratic term to the Hamiltonian that encourages all elements of the pruning mask within given neighbourhoods to take the same value, where the neighbourhoods are chosen to correspond to the blocks of weights that are pruned together. For instance, to do filter-wise pruning on a convolutional network, we would define the neighbourhoods to correspond to the filters of the layer.
We define a neighbourhood as a set of indices corresponding to a subset of weights in the neural network layer. Let $\mathcal{N}_k$ represent the $k$th neighbourhood. Neighbourhoods are disjoint and the union of all neighbourhoods is the set of all weight indices, that is $\mathcal{N}_k \cap \mathcal{N}_l = \emptyset$ for $k \neq l$ and $\bigcup_{k=1}^{K} \mathcal{N}_k = \{1, \dots, N\}$, where $K$ is the number of neighbourhoods. Let us also define a vector $u$ such that
$$u_i = \frac{1}{|\mathcal{N}_k|} \sum_{j \in \mathcal{N}_k} |w_j| \quad \text{for } i \in \mathcal{N}_k.$$
We propose the Hamiltonian
$$H(x) = -c \sum_{k=1}^{K} \sum_{i \in \mathcal{N}_k} \sum_{j \in \mathcal{N}_k,\, j \neq i} x_i x_j - \sum_{i=1}^{N} \left( u_i^2 - Q_p(u) \right) x_i, \tag{4}$$
where $c$ is a hyperparameter controlling the relative influence of the structure term on the Hamiltonian, and $Q_p(u)$ is an empirical quantile function over squared weight magnitudes as described previously, but calculated over the values in $u$ instead of $w$. The state that minimizes the Hamiltonian when $c$ is large enough to enforce the mask structure therefore corresponds to pruning the fraction $p$ of neighbourhoods with lowest average weight magnitude. When $|\mathcal{N}_k| = 1$ for all $k$, the inner summation in the quadratic term does not sum over anything, so the quadratic term does not apply. Hence the unstructured pruning Hamiltonian (3) is a special case of the Hamiltonian (4).
The following proposition shows that the probability distribution still factors in terms of neighbourhoods, so we can sample each neighbourhood separately.
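Proposition 2. The Gibbs distribution corresponding to the Hamiltonian (4) has a block product form.
Proof. Since the first term of the Hamiltonian (4) is zero for elements from different neighbourhoods, the Hamiltonian can be expressed as:
$$H(x) = \sum_{k=1}^{K} H_k(x_{\mathcal{N}_k}),$$
where $x_{\mathcal{N}_k}$ denotes the elements of $x$ whose indices lie in the $k$th neighbourhood $\mathcal{N}_k$ and $H_k$ collects the terms of (4) that involve only those elements, and the overall distribution factors as:
$$p(x) = \prod_{k=1}^{K} \frac{e^{-\beta H_k(x_{\mathcal{N}_k})}}{Z_k(\beta)} = \prod_{k=1}^{K} p_k(x_{\mathcal{N}_k}), \qquad Z_k(\beta) = \sum_{x_{\mathcal{N}_k}} e^{-\beta H_k(x_{\mathcal{N}_k})}.$$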
Sampling from the above distribution is more challenging than in the unstructured case. If neighbourhoods are small enough, the partition function for each can be computed and the distribution can be sampled from directly. But for large neighbourhoods, this is computationally infeasible, and so we use a Markov chain Monte Carlo (MCMC) method to generate samples. A sampling method that parallelizes well and is simple to implement on a GPU is preferable for neural network training to highly iterative methods like standard Gibbs sampling or the Wolff algorithm [9]. We therefore propose using chromatic Gibbs sampling [10].
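For small neighbourhoods, direct sampling amounts to enumerating every joint state of the neighbourhood. The sketch below illustrates this under assumptions that are not confirmed by the text: the per-neighbourhood energy is taken to be $-c \sum_{i \neq j} x_i x_j - \sum_i a_i x_i$, where the linear coefficients $a_i$ (the hypothetical `linear` argument) stand in for the magnitude-based term. A $3 \times 3$ kernel would have $2^9 = 512$ states, which is cheap to enumerate.

```python
import itertools
import math
import random


def neighbourhood_energy(x, linear, c):
    """Assumed energy of one neighbourhood: pairwise agreement plus linear term."""
    s = sum(x)
    pair_sum = s * s - len(x)  # sum of x_i * x_j over ordered pairs with i != j
    return -c * pair_sum - sum(a * xi for a, xi in zip(linear, x))


def sample_neighbourhood(linear, c, beta, rng):
    """Sample one neighbourhood's mask exactly by enumerating all 2^n states."""
    states = list(itertools.product((-1, 1), repeat=len(linear)))
    energies = [neighbourhood_energy(x, linear, c) for x in states]
    e_min = min(energies)  # shift energies so the exponentials cannot overflow
    unnormalized = [math.exp(-beta * (e - e_min)) for e in energies]
    return rng.choices(states, weights=unnormalized, k=1)[0]


# Hypothetical neighbourhood whose weights fall below the pruning threshold.
linear = [-0.3, -0.1, -0.2]
sample = sample_neighbourhood(linear, c=0.5, beta=1000.0, rng=random.Random(0))
print(sample)
```

`random.choices` normalizes the weights internally, so the partition function never needs to be formed explicitly; shifting by the minimum energy only rescales all weights by a common factor.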
Chromatic Gibbs sampling involves making a colouring over the elements in $x$ such that any interacting elements $x_i$ and $x_j$ in the Hamiltonian have different colours. For our Hamiltonians, $x_i$ and $x_j$ interact if there is a term with $x_i x_j$. An iteration of the Markov chain then consists of sampling all elements of one colour simultaneously given the current values of all other elements.
To create a colouring, we arbitrarily divide each neighbourhood into two sets of elements and modify the Hamiltonian to remove all quadratic terms containing elements in the same set. In graphical terms, this transforms the Markov random field for each neighbourhood from a complete graph to a complete bipartite graph that can be coloured with two colours, as shown in Figure 1. This modification still preserves a high degree of connectivity within each neighbourhood and so is still effective at encouraging elements in neighbourhoods to take the same value. For instance, to perform filter-wise pruning of a convolutional network, we can divide all connections into two groups based on their input channel. Pseudocode for this algorithm is shown in Algorithm 1.
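The two-colour scheme can be sketched for a single neighbourhood as follows. This is an illustrative sketch rather than the paper's Algorithm 1. It assumes the modified energy is $-2c \sum_{i \in A} \sum_{j \in B} x_i x_j - \sum_i a_i x_i$ for the two colour sets $A$ and $B$, with hypothetical linear coefficients $a_i$. Under that assumption each element's conditional depends only on the sum of the other colour, so all elements of one colour can be updated at once. One pass of the outer loop here updates both colours, which corresponds to two iterations in the sense used above.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def chromatic_gibbs(linear, first_colour, c, beta, init, sweeps, rng):
    """Two-colour Gibbs sampling for one neighbourhood.

    linear:       assumed per-element linear coefficients a_i
    first_colour: True for elements in colour set A, False for set B
    init:         starting mask, e.g. the known minimum-energy state
    """
    x = list(init)
    for _ in range(sweeps):
        for colour in (True, False):
            # Sum of the opposite colour; fixed while this colour is updated.
            other_sum = sum(xi for xi, f in zip(x, first_colour) if f != colour)
            for i, f in enumerate(first_colour):
                if f == colour:
                    field = 2.0 * c * other_sum + linear[i]
                    x[i] = 1 if rng.random() < _sigmoid(2.0 * beta * field) else -1
    return x


linear = [-0.3, -0.1, -0.2, -0.4]
colours = [True, False, True, False]
mask = chromatic_gibbs(linear, colours, c=1.0, beta=1000.0,
                       init=[-1, -1, -1, -1], sweeps=5, rng=random.Random(0))
print(mask)
```

The inner loop is written sequentially for clarity, but since no element of a colour reads another element of the same colour, it is equivalent to a simultaneous update and maps directly onto a vectorized GPU operation.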
Our Hamiltonians are designed so that we know which value minimizes them and is therefore the mode of the distribution. To minimize burn-in time, we use this value to initialize the Markov chain, as recommended in [11].
We perform annealing by increasing $\beta$ from a low value to a high value while training. At high temperatures (low $\beta$), differences in the Hamiltonian do not affect sampling much, and so roughly 50% of weights are pruned randomly. This phase acts similarly to stochastic methods for regularizing neural networks [12], in particular dropconnect [13], regularizing the network and conditioning it to be robust under weight pruning. Once annealed to a low temperature (high $\beta$), the Gibbs distribution converges towards the value of $x$ that minimizes the Hamiltonian, which is our desired final pruning mask. A significant amount of later training time is therefore spent adapting to the particular structure of the pruning mask. The balance between time spent on adapting to random masks and time spent on adapting to the final mask can be controlled by the particular annealing schedule for $\beta$ used in training.
We compare Gibbs pruning, our proposed method, to several pruning methods that are either well-established or have recently shown exceptional results. We evaluate performance on ResNet neural networks [14]. Nearly all of the parameters of these networks are in convolutional layers, which already are much more sparse than dense layers of similar dimensions, and additionally employ weight sharing, making them more difficult to prune than dense layers. Pruning methods that are effective on networks like AlexNet [15] with large dense layers often do not show the same performance on networks like ResNet, making them a more challenging test.
In particular, we train ResNet-20 and ResNet-56 with linear projection on the CIFAR-10 [16] dataset. The networks are trained for 200 epochs using the Adam optimizer [17], with a learning rate initially set to and reduced by a factor of 10 at epoch 80 and every 40 epochs thereafter. We use data augmentation during training, randomly shifting images horizontally and vertically by up to 10% and flipping images horizontally with 50% probability. This achieves baseline top-1 accuracies of 90.7% for ResNet-20 and 92.2% for ResNet-56. For all methods, we prune all convolutional layers except for the first one, following the recommendation in .
For pruning methods that use additional epochs for fine-tuning, we use a training rate of after the initial training phase. We also test changing the lengths of fine-tuning schedules, to evaluate tradeoffs between the number of additional training epochs and the final accuracy.
III-A Unstructured Pruning
We compare Gibbs pruning to four established unstructured pruning methods. The first is using regularization with a tuned penalty of 0.001 to induce sparsity during training before masking the fraction of weights in each layer with the lowest magnitude. The other pruning methods are more advanced approaches from the literature. We evaluate the method proposed in [1] by Han et al. Once training is complete, a certain percentage of the weights with lowest magnitude are pruned and the network is fine-tuned. This procedure is then repeated several times. We test different training schedules, using pruning percentages of 10%, 20%, 30%, etc. up to 90%, fine-tuning for epochs after each step, leading to a total of epochs. We also compare to iterative magnitude pruning (IMP), proposed by Frankle et al. [3] with rewinding [18]. This method trains the network several times, pruning gradually more each time and then rewinding the network weights to the values they had after the first 500 training steps. To test different training times, we vary the number of times that the network is trained with intermediate pruning rates between the initial rate of 0% and the final rate of 90%. Finally, we test targeted dropout, a method recently proposed by Gomez et al. [6]. We use their most successful hyperparameter settings: .
When using Gibbs pruning, we anneal $\beta$ according to a logarithmic schedule from 0.7 to 10000. With the baseline training schedule of 200 epochs, we anneal over the first 128 epochs. These values were chosen empirically to cover a range from an effective pruning rate of 50% to $p$.
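One way to realize such a schedule is sketched below. It assumes that "logarithmic" means the inverse temperature is interpolated linearly in log space (that is, geometrically) between the two endpoints and then held constant once annealing ends. That interpretation, and the function name `beta_schedule`, are assumptions rather than details confirmed by the text.

```python
import math


def beta_schedule(epoch, beta_start=0.7, beta_end=10000.0, anneal_epochs=128):
    """Inverse temperature at a given epoch, interpolated in log space.

    After anneal_epochs the value is held at beta_end.
    """
    t = min(epoch / anneal_epochs, 1.0)
    return beta_start * (beta_end / beta_start) ** t


for epoch in (0, 32, 64, 96, 128, 199):
    print(epoch, beta_schedule(epoch))
```

Under this reading, the inverse temperature is multiplied by the same factor every epoch, so the same number of epochs is spent on each order of magnitude of the inverse temperature.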
To compare our proposed method to those that use additional training epochs, we also evaluate 'stretching' the training and pruning procedure to take place over a longer training schedule. To do this, we run training for some multiple of 200 epochs, and stretch out the learning rate and annealing schedules by the same multiple. This follows our general approach of training and pruning at the same time, but gives the network longer to both converge to the final pruning mask and to fully adapt to it. In some highly stretched configurations, the pruning mask requires a higher final $\beta$ to converge; in these cases we use a larger final value of $\beta$.
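Stretching can be expressed as a wrapper that rescales the epoch axis of any baseline schedule. In the sketch below, `lr_factor` encodes the baseline step schedule as a multiplier on the initial learning rate (reduced by a factor of 10 at epoch 80 and every 40 epochs thereafter), and `stretch` is a hypothetical helper. The same wrapper could be applied to an annealing schedule.

```python
def lr_factor(epoch):
    """Multiplier on the initial learning rate for the 200-epoch baseline."""
    if epoch < 80:
        return 1.0
    drops = 1 + int((epoch - 80) // 40)  # one drop at 80, then every 40 epochs
    return 0.1 ** drops


def stretch(schedule, multiple):
    """Return a schedule that runs `multiple` times slower than the baseline."""
    def stretched(epoch):
        return schedule(epoch / multiple)
    return stretched


lr_factor_3x = stretch(lr_factor, 3)  # for a 600-epoch run
for epoch in (0, 239, 240, 360, 480):
    print(epoch, lr_factor_3x(epoch))
```

With a multiple of 3, the first learning rate drop moves from epoch 80 to epoch 240, and every later milestone scales in the same way.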
We evaluate pruning 90% of weights in each pruned layer, which is a high degree of compression for the given networks that effectively shows performance differences between different methods. Results are shown in Figure 2. Our proposed method outperforms the other evaluated methods for any number of training epochs. While it is effective at 200 epochs, stretching the learning and annealing schedules provides even better performance.
We evaluate Gibbs pruning using different pruning rates in Figure 3. When additional training epochs are used, the resulting accuracy can come close to or exceed the baseline accuracy even with very high pruning rates. For instance, removing 95% of weights in pruned layers decreases accuracy by only 0.90% for ResNet-20 and 0.45% for ResNet-56. To the best of our knowledge, these are the best results reported for such a high compression rate on CIFAR-10 [19].
III-B Random Masks and Re-initialization
In [20], the authors show that results with many pruning methods can be replicated by training a network with the same structure as the pruned network, but with newly initialized weights. This calls into question the utility of training large networks and then pruning, as well as the relevance of weight values in pruning networks. To assess this phenomenon with our proposed method, we evaluate retraining pruned networks with randomly re-initialized weights. For comparison, we also test training networks from scratch using random pruning masks with the chosen sparsity.
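A random mask baseline of this kind can be generated by choosing a fixed number of positions to prune uniformly at random. The sketch below is one simple way to do so. The helper name `random_mask` and the rounding of the pruned count are assumptions, and the $\pm 1$ encoding follows the mask convention used earlier.

```python
import random


def random_mask(n, sparsity, rng):
    """Mask of length n with round(sparsity * n) entries set to -1 (pruned)."""
    n_pruned = round(sparsity * n)
    pruned = set(rng.sample(range(n), n_pruned))
    return [-1 if i in pruned else 1 for i in range(n)]


mask = random_mask(100, 0.9, random.Random(0))
print(mask.count(-1), mask.count(1))
```

Unlike a mask sampled from the annealed Gibbs distribution, this mask is independent of the weight values, which is what makes it a useful control.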
Results are shown in Table I. We find that random re-initialization does not meet the performance of our proposed method, and is more similar to training with a random mask. These results suggest that Gibbs pruning acts less as a search over possible network architectures, and instead adapts more dynamically along with weight values throughout training, resulting in a final representation that is more tuned to particular weights than other methods are.
III-C Structured Pruning
We also evaluate performance for structured convolutional neural network pruning. We consider pruning individual kernels as well as entire filters of size , where is the number of output channels for the layer being pruned. Generally speaking, kernel-wise pruning can achieve higher accuracies for the same number of parameters, but might be harder to optimize performance for compared to filter-wise pruning. In practice, the desired structure depends on the optimization capabilities of the hardware and software the neural network is implemented on.
We compare Gibbs pruning to three established structured pruning methods. One is targeted dropout [6], implemented as previously described, but dropping entire kernels or filters. For filter-wise pruning, we set its hyperparameters to the most effective values described for unit dropout: . We also evaluate iteratively pruning filters or kernels based on their $\ell_1$-norm and retraining, as proposed by Li et al. in [4]. Finally, we evaluate ThiNet pruning, as proposed by Luo et al. [5]. This method greedily prunes channels to minimize changes in activations over an evaluation dataset. Pruning a channel is equivalent to pruning a filter in the previous layer, making channel-wise and filter-wise pruning comparable. This method does not prune convolution layers, which we also omit for filter-wise pruning with our proposed method.
For kernel-wise Gibbs pruning, each neighbourhood is small enough that the partition function is practically computable, so we sample directly from the Gibbs distribution. For filter-wise pruning, we use chromatic Gibbs sampling as described in Section II-B. We run the Markov chain for 50 iterations, as further iterations did not improve performance in practice. For kernel-wise pruning, we set the structure-term weight $c$ to 0.01 and anneal $\beta$ logarithmically from 0.7 to 10000. For filter-wise pruning, we set $c$ to 1 and anneal $\beta$ logarithmically from 0.003 to 1. As with unstructured pruning, these values were chosen empirically to cover a range from an effective pruning rate of 50% to $p$, while also converging to a mask that has the desired structure. To test different training lengths, we stretch the learning rate and annealing schedules as previously described.
Kernel-wise pruning results are shown in Figure 4 and filter-wise pruning results are shown in Figure 5. Note that a lower pruning rate of 75% is used in evaluating filter-wise pruning methods, since filter-wise pruning at high rates results in much lower accuracy. In both types of structured pruning, Gibbs pruning outperforms other methods, and stretching the learning rate and annealing schedules proves to be effective in additionally improving accuracy.
We introduce a novel method for neural network pruning based on Gibbs distributions that achieves high pruning rates with little reduction in accuracy. It can be used for either unstructured or structured pruning, and the general framework can be adapted to a wide range of pruning structures. Our results show the efficacy of simultaneously training and pruning a network rather than training and pruning as distinct steps at different times. Future work could explore other structures that can be expressed within this framework, such as limiting the number of non-zero values in convolutional kernels or structured pruning of recurrent neural network layers.
The authors would like to thank Fujitsu Laboratories Ltd. and Fujitsu Consulting (Canada) Inc. for providing financial support for this project at the University of Toronto.
- [1] Song Han, Huizi Mao, and William J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [2] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, “Learning structured sparsity in deep neural networks,” in Advances in neural information processing systems, 2016, pp. 2074–2082.
- [3] Jonathan Frankle and Michael Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- [4] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf, “Pruning filters for efficient convnets,” in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- [5] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin, “ThiNet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5058–5066.
- [6] Aidan N Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E Hinton, “Learning sparse networks using targeted dropout,” arXiv preprint arXiv:1905.13678, 2019.
- [7] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
- [8] Trevor Gale, Erich Elsen, and Sara Hooker, “The state of sparsity in deep neural networks,” arXiv preprint arXiv:1902.09574, 2019.
- [9] Ulli Wolff, “Collective Monte Carlo updating for spin systems,” Physical Review Letters, vol. 62, no. 4, pp. 361, 1989.
- [10] Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin, “Parallel Gibbs sampling: From colored fields to thin junction trees,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 324–332.
- [11] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, Handbook of Markov chain Monte Carlo, CRC press, 2011.
- [12] Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee, “Survey of dropout methods for deep neural networks,” arXiv preprint arXiv:1904.13310, 2019.
- [13] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus, “Regularization of neural networks using dropconnect,” in International Conference on Machine Learning, 2013, pp. 1058–1066.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- [16] Alex Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, 05 2012.
- [17] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [18] Jonathan Frankle, Karolina Dziugaite, Daniel M. Roy, and Michael Carbin, “Stabilizing the lottery ticket hypothesis,” arXiv preprint arXiv:1903.01611, 2019.
- [19] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag, “What is the state of neural network pruning?,” arXiv preprint arXiv:2003.03033, 2020.
- [20] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell, “Rethinking the value of network pruning,” in Proceedings of the International Conference on Learning Representations (ICLR), 2019.