The architecture of a deep neural network (DNN) is one of the most important factors affecting model performance. DNNs that perform well are often carefully designed by experienced researchers; examples include AlexNet Krizhevsky et al. (2012), GoogleNet (Inception) Szegedy et al. (2015), VGG16 Simonyan and Zisserman (2014), ResNet He et al. (2016) and DenseNet Huang et al. (2017).
However, even a well-designed DNN has been shown to contain a large amount of parameter redundancy, known as over-parameterization Ba and Caruana (2014); Denton et al. (2014). Neural network pruning is a framework for removing those redundant parameters without sacrificing much accuracy. Existing work on pruning DNNs can be roughly divided into three categories: pruning at initialization, such as Frankle and Carbin (2019); Lee et al. (2019); Verdenius et al. (2020); Malach et al. (2020); pruning during training, such as Zhu and Gupta (2017); Zhang et al. (2018); He et al. (2018); Meng et al. (2020); Lin et al. (2020b); LIU et al. (2020); and pruning after training, such as Han et al. (2015, 2016); Lin et al. (2020); Sehwag et al. (2020); Luo and Wu (2020). In all three categories, the pruning mask is not fixed in advance: obtaining it still requires training on a specific dataset, so the pruned network differs from dataset to dataset. From the perspective of model structure design, however, a good structure often performs well across different datasets. For example, well-designed models such as ResNet and DenseNet often outperform other CNNs of the same complexity. This suggests that it is possible to design better-performing models without relying on a specific dataset. Therefore, in this paper, we first raise a question: can DNN pruning be regarded as a kind of model design rather than a kind of parameter compression? In other words, without using a dataset to determine the model parameters, can we find a sub-structure of a DNN that can stand in for the original structure in terms of performance?
We answer these questions from the perspective of graph theory, using the information flow in the graph structure of neural networks. You et al. (2020) proposed a method to map a DNN to a graph based on a division of neurons and the connections among these divisions, which is called the graph structure of the neural network. They searched over different graph structures to find neural network models with better performance at the same model complexity. In this paper, we further study the graph structure of neural networks and use it for pruning. We first restrict the search space to regular graphs, which significantly reduces the complexity of the search, and then propose a pruning method based on minimizing the average shortest path length of the regular graph. The main contributions of this paper are as follows:
We implement neural network pruning from the perspective of graph structure and propose regular graph based pruning (RGP), which prunes a neural network through its regular graph structure and determines the pruning ratio according to the node degree of the graph.
We find that the average shortest path length of the graph is negatively correlated with the performance of the corresponding neural network. We therefore propose a graph structure search algorithm based on minimizing the average shortest path length, which can be used to find, within the search space, a graph whose corresponding neural network performs better.
Through the analysis of graphs and neural networks, we explain how the average shortest path length affects the performance of neural networks: the average shortest path length is positively related to the back-propagation resistance of the output neurons, and negatively related to the number of parameters that the gradient of each output neuron can reach.
The experimental results on common models and datasets show that the graph-guided pruning framework loses very little accuracy even at very high pruning ratios (more than 90% reduction in both parameters and FLOPs). Moreover, compared with pruning methods based on an iterative framework, our method is one-shot, and the graph structure it finds can be reused across multiple models without re-searching.
Regular Graph Based Pruning of Neural Network
In this section, we introduce the details of regular graph based pruning of neural networks.
Graph pruning to DNN pruning
The concept of the graph structure of a neural network was proposed by You et al. (2020); it maps a DNN containing any number of convolutional layers and any number of fully connected layers into a graph. The core idea is to divide the neurons in each fully connected layer and the convolution channels in each convolutional layer into $n$ equal parts. When information is transmitted from one layer to the next, the parts of the previous layer are connected to the parts of the next layer, and these connections correspond to the edges of an $n$-node undirected homogeneous graph. Conversely, an undirected homogeneous graph can be mapped into a neural network by the same rules. Formally, let $\mathcal{F}$ be the fully connected operation and $\mathcal{C}$ the convolution operation. A traditional fully connected layer can be expressed as:
$$N^{(l+1)} = \mathcal{F}\big(N^{(l)}\big) \tag{1}$$
where $N^{(l)}$ and $N^{(l+1)}$ represent the set of neurons in the $l$th layer and in the $(l+1)$th layer, respectively. The neurons of each layer are then divided into $n$ equal parts: $N^{(l)} = \{N^{(l)}_1, \dots, N^{(l)}_n\}$ and $N^{(l+1)} = \{N^{(l+1)}_1, \dots, N^{(l+1)}_n\}$ (if the number of neurons is not divisible by $n$, the last part contains fewer neurons than the other parts, i.e., $|N^{(l)}_n| < |N^{(l)}_1|$). We then have an undirected graph with $n$ nodes, $G = (V, E)$, where $V = \{v_1, \dots, v_n\}$ and $E$ is the set of edges. According to the edges in $G$, we obtain the pruned fully connected layer:
$$N^{(l+1)}_i = \sum_{j \in \mathcal{N}(i)} \mathcal{F}\big(N^{(l)}_j\big) \tag{2}$$
where $\mathcal{N}(i)$ is the set of neighbors of $v_i$ in the graph. For a convolutional layer, the same operation is performed along the channel dimension:
$$X^{(l+1)}_i = \sum_{j \in \mathcal{N}(i)} \mathcal{C}\big(X^{(l)}_j\big) \tag{3}$$
where $X^{(l)}_j$ and $X^{(l+1)}_i$ respectively represent the $j$th part of the feature map channels of the $l$th convolutional layer and the $i$th part of the feature map channels of the $(l+1)$th convolutional layer.
Based on the graph structure of neural networks, we propose a pruning framework for DNNs, as shown in Figure 1. First, we choose a DNN as the original network to be pruned. Second, an appropriate complete graph is used to represent the original network; here "appropriate" means that the number of nodes in the graph is smaller than the number of neurons or convolution channels in each layer (except the input and output layers), so that each node of the graph represents a part of each layer. Third, a large number of edges are removed from the graph according to the pruning ratio. Finally, the edges remaining in the graph decide which parameters of each convolutional layer and fully connected layer are kept, which completes the pruning. In practice, we do not start the optimization from the complete graph; we only need to generate a sparse graph structure to prune the DNN.
Since the graph structure directly determines which parameters exist in each layer of the neural network, the selection and search of the graph structure are critical to the performance of the pruned network. Next, we introduce the strategy for selecting the graph structure.
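The mapping from graph edges to kept weights described in this subsection can be illustrated with a minimal sketch. The function names `part_index` and `layer_mask`, the use of consecutive index blocks as parts, and the `self_loops` option are assumptions made for illustration, not details taken from the paper.

```python
from math import ceil


def part_index(unit, num_units, num_parts):
    """Map a neuron/channel index to its graph node (hypothetical helper).

    Units are split into consecutive parts of equal size; when num_units is
    not divisible by num_parts, the last part is smaller.
    """
    part_size = ceil(num_units / num_parts)
    return unit // part_size


def layer_mask(edges, num_nodes, in_units, out_units, self_loops=True):
    """Return an out_units x in_units 0/1 mask for one layer.

    A weight is kept when the node of its output unit and the node of its
    input unit are neighbours in the graph (or identical, if self_loops).
    """
    neighbours = {v: set() for v in range(num_nodes)}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    if self_loops:  # assumption: every node is also connected to itself
        for v in neighbours:
            neighbours[v].add(v)

    mask = []
    for o in range(out_units):
        node_o = part_index(o, out_units, num_nodes)
        row = [
            1 if part_index(i, in_units, num_nodes) in neighbours[node_o] else 0
            for i in range(in_units)
        ]
        mask.append(row)
    return mask


# Toy example: a 4-node ring (every node has degree 2) and an 8 x 8 layer.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
mask = layer_mask(ring, num_nodes=4, in_units=8, out_units=8)
kept = sum(map(sum, mask))
print(kept, "of", 8 * 8, "weights kept")
```

Because the toy graph is regular, every row of the mask keeps the same number of weights, which is the property the next subsection relies on.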
Regular graph as the search space
After the number of nodes of the graph is fixed, the potential search space is still huge, since there are many kinds of graphs on the same node set. We therefore choose regular graphs as the target graphs, which greatly reduces the search space. The reasons are as follows:
Firstly, a regular graph is a graph in which each node has the same number of neighbors (the same degree). In the neural network structure, this means that each neuron of a fully connected layer receives values from the same number of neurons in the previous layer and is connected to the same number of neurons in the next layer; the same holds for each channel of a convolutional layer's feature map. If we rewrite Eq.(2) and Eq.(3) in a form with explicit parameters:
$$N^{(l+1)}_i = \sum_{j \in \mathcal{N}(i)} W^{(l)}_{ij} N^{(l)}_j, \qquad X^{(l+1)}_i = \sum_{j \in \mathcal{N}(i)} K^{(l)}_{ij} * X^{(l)}_j \tag{4}$$
where $W^{(l)}_{ij}$ and $K^{(l)}_{ij}$ represent the fully connected parameters and the convolution kernel parameters connecting part $j$ of layer $l$ to part $i$ of layer $l+1$, $N$ and $X$ denote parts of neurons and of feature map channels, and $\mathcal{N}(i)$ is the neighbor set of node $v_i$, then a regular graph means that, for all $i$, $N^{(l+1)}_i$ is obtained by the $\mathcal{F}$ operation over the same number of $N^{(l)}_j$, and $X^{(l+1)}_i$ is obtained by the $\mathcal{C}$ operation over the same number of $X^{(l)}_j$. Such a structure treats every neuron or convolution channel in a layer as equally important from the perspective of model topology. We consider this appropriate because, before the model is initialized and trained, the neurons or convolution channels in each layer should be equally important, so that each of them has the same probability of being trained into an important neuron or channel. This principle is also in line with the topology of traditional dense neural networks, i.e., the original model structure can be mapped to a special regular graph (a complete graph with self-loops).
Secondly, in network science, regular graphs are an important type of network. Their strictly controlled degree distribution allows researchers to study other properties of the network in depth. The space of regular graphs contains some networks with extremely good properties, such as entangled networks Donetti et al. (2005). In fact, the graph structures found by our search are very close to entangled networks; see the Appendix for details.
Finally, for pruning implementation, a regular sparse structure (each neuron uses the same number of weights in a layer) can easily be accelerated through dense computation, whereas sparse structures based on other graphs are harder to accelerate. (A more detailed discussion can be found in the Appendix.)
We also compare experimentally the performance of neural networks corresponding to regular graphs and to random graphs at the same sparsity (see the Appendix). The results show that, at the same sparsity, pruned models based on random graphs perform worse than those based on regular graphs.
Optimization of regular graph via minimizing average shortest path length
In a graph, the average shortest path length (ASPL) is defined as the average number of steps along the shortest paths over all possible pairs of nodes. A small ASPL facilitates the quick transfer of information and reduces costs in real networks such as the Internet.
Here, a graph with a small ASPL yields a neural network structure with more efficient gradient back-propagation. We iteratively minimize the ASPL of a regular graph. As shown in Algorithm 1, we randomly select two edges in the current graph, exchange them, and check whether the ASPL of the entire graph is reduced. If it is, we keep the change; otherwise we select two other edges and repeat the exchange. We repeat this process a sufficiently large number of times (the Appendix reports results after 10,000 searches).
Experimental results show that when the degree and the number of nodes of the graph are fixed, the smaller the ASPL, the better the corresponding neural network performs.
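The search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the helper names, the ring-lattice starting graph, the particular rewiring (a,b),(c,d) -> (a,d),(c,b), the rejection of disconnected graphs, and the iteration count are all choices made here for the example.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """k-nearest-neighbour ring on n nodes (k even): a k-regular start graph."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            adj[v].add((v + step) % n)
            adj[(v + step) % n].add(v)
    return adj


def aspl(adj):
    """Average shortest path length via BFS from every node; None if disconnected."""
    n, total = len(adj), 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < n:
            return None
        total += sum(dist.values())
    return total / (n * (n - 1))


def rewire(adj, old_edges, new_edges):
    """Remove old_edges and add new_edges in an undirected adjacency dict."""
    for u, w in old_edges:
        adj[u].remove(w)
        adj[w].remove(u)
    for u, w in new_edges:
        adj[u].add(w)
        adj[w].add(u)


def minimize_aspl(adj, iterations, seed=0):
    """Greedy degree-preserving edge swaps; keep only swaps that reduce ASPL."""
    rng = random.Random(seed)
    best = aspl(adj)
    for _ in range(iterations):
        edges = [(u, w) for u in adj for w in adj[u] if u < w]
        (a, b), (c, d) = rng.sample(edges, 2)
        # Skip swaps that would create a self-loop or a duplicate edge.
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            continue
        rewire(adj, [(a, b), (c, d)], [(a, d), (c, b)])
        candidate = aspl(adj)
        if candidate is not None and candidate < best:
            best = candidate
        else:  # undo: the swap disconnected the graph or did not help
            rewire(adj, [(a, d), (c, b)], [(a, b), (c, d)])
    return best


graph = ring_lattice(32, 4)
start = aspl(graph)
end = minimize_aspl(graph, iterations=300)
print(f"ASPL before: {start:.3f}, after: {end:.3f}")
```

Each swap removes one edge from and adds one edge to each of the four endpoints, so the graph stays regular throughout the search.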
How can the ASPL affect the performance of neural networks
Here we analyze how the ASPL of the graph structure affects the performance of the neural network. We use a graph with 5 nodes as an example, as shown in Figure 2. For simplicity, the neural network corresponding to the graph is a 5-layer MLP in which each layer contains 5 neurons. We analyze node 3 in two cases: in case 1, the ASPL of node 3 is 1.5; in case 2, it is 1.25. The computation of each layer corresponds to one round of information transmission in the graph. From the graph perspective, each node sends a message to its neighbors in the round after it receives a message from them. At the initial moment, node 3 generates a message and passes it to its neighbors, and the message then continues to spread through the graph. Eventually there is a round in which all nodes receive new information from their neighbors, after which every node keeps sending and receiving; we call this round the steady state, as shown in Figure 2 (left). Case 1 needs 4 rounds to reach the steady state, while case 2 needs only 3.
How does this phenomenon show up in neural networks? We look at the network from the perspective of back-propagation, as shown in Figure 2 (right). In case 1, the gradient information of neuron 3 in the output layer must pass through 4 layers; in the fourth layer (the first layer from the left), it affects all neurons of that layer. In case 2, the gradient information of neuron 3 in the output layer only needs to pass through 3 layers to affect all neurons of some layer. This corresponds to the number of rounds the graph needs to reach the steady state. We call this number the Gradient-Resistance (GR) of output neuron 3 (or node 3); it measures how difficult it is for the gradient starting from neuron 3 in the last layer to reach all neurons of some layer. Furthermore, a smaller GR of node 3 means that more entire layers contribute to the value of neuron 3 in the output layer. This helps output neuron 3 gain stronger prediction or classification ability, because more parameters are used to compute its value. For a whole graph, we take the average GR over all nodes as the GR of the graph.
The above is an analysis of a simple graph. For the regular graphs used in our method, we also calculate the GR value of each graph, as shown in Figure 3. We choose 64-node regular graphs in which every node has degree 4; their ASPL ranges from about 3 to 8, and the corresponding neural network is a 15-layer MLP with 64 neurons in each layer. We also calculate the average output-neuron parameter usage (AOPU), i.e., the average number of parameters used to compute the value of each output neuron. ASPL and GR show an approximately linear positive correlation, while ASPL and AOPU show an approximately linear negative correlation, which statistically supports the analysis above. In summary, a graph with a small ASPL means that, at the same sparsity, more parameters are used to compute the final neuron outputs. In other words, with the total number of model parameters fixed, a graph with a small ASPL improves the reuse rate of the model parameters.
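The round-based definition of GR can be made concrete with a short sketch. The function names and the two 5-node graphs below are hypothetical examples chosen here (a 5-cycle, and the same cycle with one extra chord); they are not claimed to be the graphs of Figure 2, and no self-loops are assumed.

```python
def gradient_resistance(adj, source, max_rounds=1000):
    """Number of rounds until every node receives the message in the same round.

    In each round, every node that received the message in the previous round
    forwards it to all of its neighbours. Returns None if no such round is
    found within max_rounds.
    """
    receivers = {source}
    for round_number in range(1, max_rounds + 1):
        receivers = {w for u in receivers for w in adj[u]}
        if len(receivers) == len(adj):
            return round_number
    return None


def graph_gr(adj):
    """Average GR over all nodes (assumes every node has a finite GR)."""
    values = [gradient_resistance(adj, v) for v in adj]
    return sum(values) / len(values)


# Hypothetical example 1: a 5-cycle. Node 0 is at distance 1, 2, 2, 1 from
# the other nodes, so its average shortest path length is 1.5.
cycle = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}

# Hypothetical example 2: the same cycle plus the chord 0-2. Node 0 is now
# at distance 1, 1, 2, 1 from the other nodes (average 1.25).
chord = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3, 0}}

print(gradient_resistance(cycle, 0), gradient_resistance(chord, 0))
```

In this toy pair, shortening the paths from node 0 also lowers its GR, which is the direction of the relationship discussed above.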
Experiments and Results
In this section, we conduct extensive experiments to verify the effectiveness of our method.
DNN models and datasets
Performance changes during graph structure search
In this paper, we implement our method on multiple common DNN models and public datasets. The DNN models include ResNet18/56 He et al. (2016) and VGG16 Simonyan and Zisserman (2014). The datasets include Cifar10, Cifar100 Krizhevsky et al. (2009) and SVHN Netzer et al. (2011). We set regular graphs of different node degrees according to the pruning ratio, from high to low, and search step by step from the nearest neighbor graph to the target graph by minimizing the ASPL. The target graph is then mapped to sub-networks of different DNN models, and we use multiple datasets for training and performance testing.
In our experiments, ResNet18 and VGG16 are represented as 64-node regular graphs; ResNet56, because of its limited number of convolution channels, is represented as 16-node regular graphs. For evaluation, we use the number of parameters and FLOPs (floating-point operations) to measure model size and computational requirements. We use PyTorch Paszke et al. (2017) to implement all models. Stochastic gradient descent (SGD) with an initial learning rate of 0.1 and a weight decay of 0.0005 is used as the optimization strategy, and the batch size is set to 256. We train for a total of 100 epochs to ensure that the models are fully trained, and for each model we take the highest test accuracy as its classification performance. All models are trained and tested on NVIDIA Tesla V100 GPUs.
We first verify how changes in ASPL affect model performance during the graph search. We start with the nearest neighbor graph and minimize the ASPL of the graph by randomly exchanging edge pairs, as shown in Algorithm 1. More specifically, we use different node degrees, ranging from 4 to 20, to represent different pruning ratios. For example, if we want to keep about 6.25% of the neural network parameters after pruning, we start with a 4-nearest neighbor graph with 64 nodes and minimize its ASPL through Algorithm 1; the resulting graph is then mapped to the specified neural network to complete the pruning. We use three public datasets, Cifar10, Cifar100 and SVHN, to train and test the pruned models, and repeat each experiment several times to reduce the effect of randomness. The original models are VGG16 and ResNet18.
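The relation between node degree and the fraction of parameters kept can be checked with a small calculation. The sketch assumes that each of the n parts of a layer connects to exactly `degree` of the n parts of the previous layer, so roughly degree / n of the weights survive; the function name and the intermediate degrees in the loop are illustrative (the text only states the range 4 to 20).

```python
def kept_fraction(num_nodes, degree):
    """Approximate fraction of layer parameters kept by a regular graph."""
    return degree / num_nodes


# 64-node graphs with degrees across the range used in the experiments.
for degree in range(4, 21, 4):
    print(f"degree {degree:2d}: about {kept_fraction(64, degree):.2%} kept")
```

For a 64-node graph of degree 4 this gives 4 / 64 = 6.25%, matching the example in the text.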
The experimental results are shown in Figure 4, where and represents that degree value of each node is and in the graph, respectively. As we can see, for all the results, when more parameters of a model are pruned, the validation accuracy of the model decreases steadily, from to . Under the same pruning ratio, the classification performance of the model and the ASPL of the graph structure show an obvious negative correlation.
Performance of pruned models
|Dataset||Model||Top1 Acc||Acc Drop||Parameters||
For each pruning ratio, we take the model with the highest classification accuracy as the final pruned model, and record the performance of the sub-network (pruned model) at 4 different pruning ratios, as shown in Table 1. Our method greatly reduces model complexity in terms of parameters and FLOPs at the same time, and the parameter reduction ratio is almost equal to the FLOPs reduction ratio. This is because our graph-guided pruning method applies the same pruning rule to every layer of the neural network. As far as we know, most existing pruning strategies can remove a large share of the parameters (more than 80%), but their FLOPs reduction is much smaller than their parameter reduction (often less than 70%). Our method reduces both the number of parameters and the FLOPs by more than 90%, while the top-1 accuracy drops only slightly (1.69% for VGG16 on Cifar10, 0.46% for VGG16 on SVHN and 0.17% for ResNet18 on SVHN).
Comparison with other neural network structure pruning methods
|Dataset||Model||Method||Top1 ACC||Parameters Reduction||FLOPs Reduction|
|Cifar10||VGG16||GAL-0.05 Lin et al. (2019)||92.03%||77.6%||39.6%|
|Cifar10||VGG16||HRank1 Lin et al. (2020a)||92.34%||82.1%||65.3%|
|Cifar10||VGG16||GAL-0.1 Lin et al. (2019)||90.73%||82.2%||45.2%|
|Cifar10||VGG16||HRank2 Lin et al. (2020a)||91.23%||92.0%||76.5%|
|Cifar10||VGG16||TRP Xu et al. (2020)||91.62%||-||77.82%|
|Cifar10||ResNet56||He et al. (2017)||90.80%||-||50.6%|
|Cifar10||ResNet56||GAL-0.8 Lin et al. (2019)||90.36%||65.9%||60.2%|
|Cifar10||ResNet56||HRank Lin et al. (2020a)||90.72%||68.1%||74.1%|
We also compare our results with other neural network pruning methods. Most published results of other methods are not at a high pruning ratio (less than 70%), whereas the main advantage of our method is maintaining good model performance at high pruning ratios. We therefore choose several pruning methods that report results at high pruning ratios for comparison, as shown in Table 2. We compare these methods on Cifar10. The method we are most concerned with is HRank Lin et al. (2020a), because, as an advanced pruning method, it provides detailed performance records and complete comparisons with other methods. We list the results of HRank at its maximum and second-maximum pruning ratios, denoted HRank1 and HRank2, respectively. Correspondingly, we choose the results of RGP at two different pruning ratios (RGP1 and RGP2) to compare with the two HRank results. Several other methods are also listed. For VGG16, our method (RGP2) achieves an accuracy of 91.45% when both the parameters and the FLOPs are reduced by more than 90%, while HRank2 achieves an accuracy of 91.23% when the parameters are reduced by 92% but the FLOPs are reduced by only 76.5%. RGP1 is also better than HRank1 in both pruning ratio and accuracy. TRP Xu et al. (2020) is slightly higher than RGP2 in accuracy, but its FLOPs reduction is still significantly smaller than ours. For ResNet56, our method shows a greater pruning ratio and better top-1 accuracy than the other methods. In general, the advantage of our method is that it forms a DNN sub-network in one shot through the mapping of the graph structure, that it greatly reduces parameters and FLOPs at the same time, and that it obtains better performance than other methods at high pruning ratios.
Conclusion and Broader Impact
In this paper, we propose RGP, a parameter pruning framework based on the regular graph structure of the neural network. The framework sets the pruning ratio by changing the degree of the regular graph, and finds a sub-network with better performance at a given pruning ratio by minimizing ASPL. We also explain the negative correlation between ASPL and neural network performance through analysis and statistical calculations on the output neurons. Graph-guided pruning is an efficient method that can be applied directly to multiple models without re-searching the graph structure, and it retains accuracy well under extremely high parameter and FLOPs reduction. In the future, we may explore the impact of more graph topologies on model performance, since there are many categories of graphs, and a more suitable mapping from neural networks to graphs may be explored to form an interdisciplinary study of the two fields.
Our work can be used to achieve large-scale compression of neural networks, and can inspire researchers in the graph field and the network pruning field to implement model pruning from the perspective of graph structure. They can look for sub-networks of neural networks through topology rather than through iterative pruning, and such sub-networks may be more interpretable. The disadvantage of our method is that, when the pruning ratio is not high, it may lose more accuracy than iterative pruning methods. In addition, regular graphs may not be the best choice; more graph structures may need to be explored and analyzed to establish a better relationship between neural networks and graphs.
- TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org. Cited by: Appendix A.
- Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: Introductions.
- Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Appendix A.
- Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: Introductions.
- Centripetal sgd for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4943–4953. Cited by: Appendix A, Table 3.
- Entangled networks, super-homogeneity and optimal network topology. Cited by: Appendix A, Regular graph as the search space.
- The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations. Cited by: Introductions.
- DSD: regularizing deep neural networks with dense-sparse-dense training flow. Cited by: Introductions.
- Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Cited by: Introductions.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Appendix A, Introductions, Performance changes during graph structure search.
- Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI International Joint Conference on Artificial Intelligence. Cited by: Appendix A, Table 3.
- Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 2234–2240. Cited by: Introductions.
- Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: Table 2.
- Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: Introductions.
- Learning multiple layers of features from tiny images. Cited by: Performance changes during graph structure search.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: Introductions.
- SNIP: single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations. Cited by: Introductions.
- Provable filter pruning for efficient neural networks. In International Conference on Learning Representations. Cited by: Appendix A, Table 3.
- Hrank: filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538. Cited by: Appendix A, Table 3, Comparison with other neural network structure pruning methods, Table 2.
- Channel pruning via automatic structure search. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 673–679. Note: Main track. Cited by: Introductions.
- 1×N block pattern for network sparsity. arXiv preprint arXiv:2105.14713. Cited by: Appendix A.
- Towards optimal structured cnn pruning via generative adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2790–2799. Cited by: Appendix A, Table 3, Table 2.
- Dynamic model pruning with feedback. In International Conference on Learning Representations. Cited by: Introductions.
- Dynamic sparse training: find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations. Cited by: Introductions.
- ThiNet: a filter level pruning method for deep neural network compression. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5068–5076. Cited by: Appendix A, Table 3.
- Autopruner: an end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition 107, pp. 107461. Cited by: Appendix A, Table 3, Introductions.
- Proving the lottery ticket hypothesis: pruning is all you need. In International Conference on Machine Learning, pp. 6682–6691. Cited by: Introductions.
- Pruning filter in filter. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 17629–17640. Cited by: Introductions.
- Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440. Cited by: Table 3.
- Reading digits in natural images with unsupervised feature learning. Cited by: Performance changes during graph structure search.
- Automatic differentiation in pytorch. Cited by: Appendix A, Performance changes during graph structure search.
- Hydra: pruning adversarially robust neural networks. Advances in Neural Information Processing Systems (NeurIPS) 7. Cited by: Introductions.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Introductions, Performance changes during graph structure search.
- Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: Introductions.
- Pruning via iterative ranking of sensitivity statistics. arXiv preprint arXiv:2006.00896. Cited by: Introductions.
- Accelerate your cnn from three dimensions: a comprehensive pruning framework. arXiv preprint arXiv:2010.04879. Cited by: Appendix A.
- Pruning from scratch. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12273–12280. Cited by: Appendix A, Table 3.
- Trp: trained rank pruning for efficient deep neural networks. arXiv preprint arXiv:2004.14566. Cited by: Comparison with other neural network structure pruning methods, Table 2.
- Graph structure of neural networks. In International Conference on Machine Learning, pp. 10881–10891. Cited by: Introductions, Graph pruning to DNN pruning.
- A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: Introductions.
- Accelerate cnn via recursive bayesian pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3306–3315. Cited by: Table 3.
- To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: Introductions.
Appendix A Appendix
The existence of GR value
Here we discuss the existence of the GR value introduced in Section How can the ASPL affect the performance of neural networks. For some graph structures, the information from a certain node may never reach every node of the graph within the same round, in which case the GR value is infinite.
First, we make a supplement to Algorithm 1. In practice, we test the connectivity of the graph after each edge swap, and disconnected graphs are discarded to ensure that the searched graph is connected. We prove that, for a connected graph, a node has a finite GR value when there is always an even-length path (edges may be repeated) from that node to every node (including the node itself), or there is always an odd-length path from that node to every node. The proof is as follows:
Suppose that is a node of a connected graph , represents the set of all nodes in whose path length (may not be the shortest path length) to is , for example represents the set of all neighbors of . Assuming that the initial information is sent from , the nodes that receive the information in each round can be expressed as follows:
More generally, Eq.(5) can be written as:
where represents the maximum value of the shortest path length between and other nodes. Then, we can see that when , in the odd round, all nodes with an odd path length to will receive the information at the same time. In the even round, all nodes with an even path length to will receive the information at the same time.
In our experiment, we did not encounter a situation where the GR value does not exist. This may be due to the setting of the regular graph so that all nodes can have several neighbors, which makes each node have a very high probability that there are paths with odd lengths and paths with even lengths to other nodes at the same time, especially when the graph is large.
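One way to test for the failure case in advance is a bipartiteness check. This is a restatement introduced here, not a step described in the paper: in a connected graph with at least two nodes, walks of one common length from a node to every node exist exactly when the graph contains an odd cycle, i.e. when it is not bipartite. The sketch below uses BFS two-colouring; the function names are hypothetical.

```python
from collections import deque


def is_bipartite(adj):
    """Two-colour a connected graph by BFS; this fails iff an odd cycle exists."""
    start = next(iter(adj))
    colour = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in colour:
                colour[w] = 1 - colour[u]
                queue.append(w)
            elif colour[w] == colour[u]:
                return False  # an edge inside one colour class closes an odd cycle
    return True


def cycle_graph(n):
    """Ring on n nodes as an adjacency dict (2-regular)."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


# An even ring is bipartite (no finite GR); an odd ring is not.
print(is_bipartite(cycle_graph(6)), is_bipartite(cycle_graph(5)))
```

Under this reading, a searched graph that passes the connectivity test and is not bipartite has a finite GR for every node.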
The convergence of Algorithm 1
We now discuss the search results of Algorithm 1. Note that throughout the search we maintain the connectivity of the graph: disconnected graphs are never used as search results or as intermediate results. For regular graphs, we use their breadth-first search spanning trees (BFSSTs) to calculate the theoretical lower bound of their ASPL. To build a BFSST of a regular graph, we first randomly choose a node as the root of the tree; then, starting from the root, the not-yet-visited neighbors of each node become its children in the next layer, until the entire graph has been searched. The ASPL of a graph can be calculated as:
$$ L(G) = \frac{1}{N} \sum_{i=1}^{N} L_i $$
where $N$ is the number of nodes and $L_i$ is the ASPL between the root node and all other nodes in the BFSST rooted at node $i$. For the 32_4 regular graph, Figure 5 shows the BFSSTs with maximum, intermediate, and minimum ASPL, respectively.
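The averaging over BFS roots described above can be sketched as follows. This is an illustrative implementation under the assumption that the graph is given as an adjacency dictionary; the helper names `bfs_depths` and `aspl` are not from the paper. The depth of a node in the BFS tree equals its shortest-path distance to the root, so averaging the per-root values gives the ASPL of the graph.

```python
from collections import deque


def bfs_depths(adj, root):
    """Depth of every reachable node in the BFS spanning tree rooted at root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in depth:  # unvisited neighbors form the next layer
                depth[nb] = depth[node] + 1
                queue.append(nb)
    return depth


def aspl(adj):
    """Average shortest path length: mean over roots of the per-root average depth."""
    n = len(adj)
    per_root = []
    for root in adj:
        depth = bfs_depths(adj, root)
        if len(depth) != n:
            raise ValueError("graph is disconnected")
        per_root.append(sum(depth.values()) / (n - 1))
    return sum(per_root) / n


square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(aspl(square))
```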
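For a regular graph with $N$ nodes and uniform degree $d$, the graph attains its minimum ASPL when every BFSST has the minimum ASPL. We can therefore calculate the theoretical lower bound of the ASPL of the regular graph as:
$$ L_{\min} = \frac{1}{N-1}\left[\sum_{j=1}^{k} j\, d (d-1)^{j-1} + (k+1)\left(N - 1 - \sum_{j=1}^{k} d (d-1)^{j-1}\right)\right] $$
where $k$ represents the number of layers of the BFSST that are completely filled with nodes (for example, in Figure 5 (right), two layers (circles) around the root node are completely filled with other nodes), and is calculated as:
$$ k = \max\left\{ m : \sum_{j=1}^{m} d (d-1)^{j-1} \le N - 1 \right\} $$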
We calculated this lower bound for 64-node regular graphs of different degrees and compared it with the minimum ASPL obtained by our algorithm after 10,000 searches, as shown in Figure 6. As the degree increases, the theoretical lower bound of the ASPL decreases, but more and more slowly. When the degree changes from 7 to 8, the number of completely filled layers drops from 2 to 1 and then remains at 1; from that point on, the lower bound changes very slowly and gradually coincides with the result of our search algorithm as the degree increases. Therefore, in most cases the ASPL obtained by our search algorithm is close to the theoretical lower bound, and when the degree is large our algorithm reaches the lower bound.
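The layer-filling argument behind the lower bound can be checked numerically. The sketch below assumes the most compact possible BFSST: the root has `degree` children, every other node has `degree - 1` children, and layers are filled in order until all remaining nodes are placed. The function name `aspl_lower_bound` is hypothetical; it returns both the bound and the number of completely filled layers.

```python
def aspl_lower_bound(n_nodes, degree):
    """Lower bound on the ASPL of a regular graph, plus the number of full layers."""
    if degree < 2 or n_nodes <= degree:
        raise ValueError("need degree >= 2 and n_nodes > degree")
    remaining = n_nodes - 1   # nodes other than the root still to be placed
    layer_size = degree       # capacity of layer 1
    depth = 0
    full_layers = 0
    total_depth = 0
    while remaining > 0:
        depth += 1
        placed = min(layer_size, remaining)
        total_depth += depth * placed
        remaining -= placed
        if placed == layer_size:
            full_layers += 1
        layer_size *= degree - 1  # each node in this layer has degree - 1 children
    return total_depth / (n_nodes - 1), full_layers


for d in (7, 8):
    bound, layers = aspl_lower_bound(64, d)
    print(d, layers, round(bound, 3))
```

For 64 nodes, this sketch gives two completely filled layers at degree 7 and one at degree 8, which agrees with the change described above.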
In fact, we find that such an optimal graph is very close to the entangled network Donetti et al. (2005) in physics (or the cage graph in mathematics), which shows good symmetry in terms of short average distances, large loops, and poor modularity, and exhibits excellent properties such as robustness against errors and attacks, minimal first-passage time of random walks, good searchability, and efficient communication.
The dense computation for regular sparse structure speedup
At present, weight pruning is mainly realized by multiplying a sparse mask with the weight matrix. Such methods usually cannot obtain real acceleration in deep learning frameworks such as PyTorch Paszke et al. (2017) or TensorFlow Abadi et al. (2015), because in these frameworks floating-point multiplications by 0 are not skipped. Lin et al. (2021) pointed out that regular parameters can be truly accelerated by encoding. For regular graph based pruning, because each node has the same degree, it is easier to achieve real acceleration through dense computation. One theoretical framework is shown in Figure 7: for the parameters of each layer, we densify the sparse parameters through encoding, then carry out the operation as a multiplication of dense matrices, and finally restore the output through decoding.
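The idea of encoding regular sparse parameters into dense arrays can be illustrated with NumPy. This is only a sketch under stated assumptions, not the framework of Figure 7: it assumes a fully connected layer whose mask keeps exactly the same number of weights in every row, and it replaces the masked multiplication with a gather followed by a dense contraction. The names `encode_regular_sparse` and `dense_forward` are hypothetical.

```python
import numpy as np


def encode_regular_sparse(weight, mask):
    """Pack a masked (n_out, n_in) weight matrix into dense (n_out, d) arrays.

    Assumes every row of the boolean mask keeps exactly d entries.
    """
    d = int(mask[0].sum())
    if not (mask.sum(axis=1) == d).all():
        raise ValueError("mask is not regular: rows keep different numbers of weights")
    idx = np.stack([np.flatnonzero(row) for row in mask])  # kept input indices per row
    vals = np.take_along_axis(weight, idx, axis=1)         # kept weights per row
    return idx, vals


def dense_forward(x, idx, vals):
    """Compute x @ (weight * mask).T using only the n_out * d kept weights."""
    gathered = x[:, idx]                            # (batch, n_out, d)
    return np.einsum("bod,od->bo", gathered, vals)  # (batch, n_out)


# Usage: an 8x8 layer where each output keeps 3 inputs in a circulant pattern.
rng = np.random.default_rng(0)
n, d = 8, 3
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(d):
        mask[i, (i + j) % n] = True
weight = rng.normal(size=(n, n))
x = rng.normal(size=(4, n))
idx, vals = encode_regular_sparse(weight, mask)
print(np.allclose(dense_forward(x, idx, vals), x @ (weight * mask).T))
```

Because every row keeps the same number of weights, `idx` and `vals` are rectangular arrays with no padding, which is what makes the dense formulation straightforward for regular graphs.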
Comparison of regular graph based pruning and random graph based pruning
We test the performance of random graph based pruning through experiments and compare it with the performance of regular graph based pruning at the same sparsity, as shown in Figure 8 and Figure 9. For ResNet18 on CIFAR-10 and CIFAR-100, regular graph based pruning performs better than random graph based pruning at most sparsity levels, which shows that the regular graph is a more suitable choice of optimization space.
Additional experiment on ImageNet
We also evaluate the performance of RGP on the large-scale dataset ImageNet Deng et al. (2009) and compare it with other methods.
We use ResNet50 He et al. (2016) as the original model and set the batch size to 256 and the weight decay to 0.0001. The initial learning rate is 0.1 and is multiplied by 0.1 at epochs 30, 60, and 90, for a total of 100 training epochs, which is the same training strategy as Wang et al. (2020a).
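The step schedule described above can be written as a small function. This is a minimal sketch that assumes zero-indexed epochs and that the decay takes effect at the start of each milestone epoch; the function name `step_lr` and its parameter names are illustrative.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Learning rate at a given epoch under a multi-step decay schedule."""
    passed = sum(1 for m in milestones if epoch >= m)  # milestones already reached
    return base_lr * gamma ** passed


# Learning rate at the start, after each milestone, and in the last epoch.
for epoch in (0, 30, 60, 90, 99):
    print(epoch, step_lr(epoch))
```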
| Method | Baseline top-1 | Pruned top-1 | Top-1 drop | Parameter reduction | FLOPs reduction |
|---|---|---|---|---|---|
| GAL Lin et al. (2019) | 76.15% | 71.95% | 4.20% | 16.90% | 43.00% |
| ThiNet Luo et al. (2017) | 72.88% | 72.04% | 0.84% | 33.72% | 36.80% |
| SFP He et al. (2018) | 76.15% | 74.61% | 1.54% | - | 41.80% |
| Taylor Molchanov et al. (2016) | 76.18% | 74.50% | 1.68% | 44.50% | 44.90% |
| Autopruner Luo and Wu (2020) | 76.15% | 74.76% | 1.39% | - | 48.70% |
| C-SGD Ding et al. (2019) | 75.33% | 74.93% | 0.40% | - | 46.20% |
| HRank-1 Lin et al. (2020a) | 76.15% | 74.98% | 1.17% | 36.70% | 43.70% |
| PFP-A Liebenwein et al. (2020) | 76.13% | 75.91% | 0.22% | 18.10% | 10.80% |
| PFP-B Liebenwein et al. (2020) | 76.13% | 75.21% | 0.92% | 30.10% | 44.00% |
| HRank-2 Lin et al. (2020a) | 76.15% | 71.98% | 4.17% | 62.10% | 46.00% |
| RRBP Zhou et al. (2019) | 76.10% | 73.00% | 3.10% | - | 54.50% |
| HRank-3 Lin et al. (2020a) | 76.15% | 69.10% | 7.05% | 76.04% | 67.57% |
| Prune-From-Scratch Wang et al. (2020b) | 77.20% | 72.80% | 4.40% | - | 75.61% |
We include additional pruning methods in the comparison; the results are shown in Table 3. When both the parameter reduction and the FLOPs reduction are within 50%, our method (RGP (64_36)) is better than GAL Lin et al. (2019), ThiNet Luo et al. (2017), SFP He et al. (2018), HRank-1 Lin et al. (2020a) and PFP-B Liebenwein et al. (2020) in terms of both pruning ratio and top-1 accuracy, and is better than Autopruner Luo and Wu (2020) and C-SGD Ding et al. (2019) in terms of top-1 accuracy at a similar pruning ratio. PFP-A Liebenwein et al. (2020) has the highest top-1 accuracy, but its pruning ratio is also very low. At a pruning ratio greater than 60%, our method (RGP (64_24)) is better than HRank-3 Lin et al. (2020a) in top-1 accuracy, and at a pruning ratio greater than 70%, our method (RGP (64_16)) is close to Prune-From-Scratch Wang et al. (2020b) in top-1 accuracy. Overall, our method prunes effectively and substantially reduces the number of parameters and FLOPs of the network. Compared with other pruning methods, it has a simpler process, the obtained graph can be reused across a large number of neural network structures, and the pruning process does not require a dataset for parameter training, which makes it an efficient, one-shot method.