Multi-objective Pruning for CNNs using Genetic Algorithm

06/02/2019 ∙ by Chuanguang Yang, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

In this work, we propose a heuristic genetic algorithm (GA) for pruning convolutional neural networks (CNNs) according to the multi-objective trade-off among error, computation and sparsity. In our experiments, we apply our approach to prune a pre-trained LeNet on the MNIST dataset, pruning up to 95.42% of parameters or reducing computation to as little as 6.10% of the original with tiny accuracy loss, by laying emphasis on sparsity and computation, respectively. Our empirical study suggests that GA is a viable alternative pruning approach for obtaining competitive compression performance. Additionally, compared with state-of-the-art approaches, GA is capable of automatically pruning CNNs according to the multi-objective importance expressed by a pre-defined fitness function.




1 Introduction

Vision application scenarios often have different requirements in terms of the relative importance of error, computational cost and storage for convolutional neural networks (CNNs), but state-of-the-art pruning approaches do not take this into account. We therefore develop a genetic algorithm (GA) that iteratively prunes redundant parameters according to a multi-objective trade-off, using a two-step procedure. First, we prune the network by taking advantage of swarm intelligence. Next, we retrain the elite network and reinitialize the population from the trained elite. Compared with state-of-the-art approaches, our approach obtains a comparable result on sparsity and a significant improvement on computation reduction. In addition, we detail how to adjust the fitness function to obtain diverse compression performances in practical applications.

2 Proposed Approach

2.1 Evaluation Regulation

Similar to general evolutionary algorithms, we design a fitness function f to evaluate the comprehensive performance of a genome. In our method, f is defined as a weighted combination of the error rate e, the remained-computation rate c and the sparsity s, and our target is to minimize the fitness function as follows:

f = α·e + β·c − γ·s

The coefficients α, β and γ adjust the importance of the three objectives. Here e, c and s denote the percentage of misclassified samples, remained multiplication-addition operations (FLOPs) and zeroed-out parameters, respectively; sparsity enters with a negative sign because higher sparsity is desirable under minimization. From the experimental analyses in Section 3, treating the multi-objective nature of the problem by linear scalarization is indeed effective and consistent with our expectation, although a more sophisticated fitness function may further improve the results.
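As a minimal sketch, the weighted fitness can be written as follows (the parameter names alpha, beta and gamma are our own labels for the three objective weights; all inputs are rates in [0, 1]):

```python
def fitness(error_rate, flops_remained, sparsity,
            alpha=1.0, beta=1.0, gamma=1.0):
    """Multi-objective fitness to be minimized: low error, low remaining
    computation (FLOPs), and high sparsity (subtracted, since more is better)."""
    return alpha * error_rate + beta * flops_remained - gamma * sparsity
```

Raising one coefficient relative to the others biases the search toward that objective, which is how the different compression profiles in Section 3 are obtained.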

(a) Filter-wise pruning.
(b) Connection-wise pruning.
Figure 1: Pruning techniques of a CONV layer (a) and an FC layer (b) in the mutation phase. For the current CONV layer, we carry out filter-wise pruning based on the CONV mutation rate, and a corresponding channel-wise pruning then also takes place in the next CONV layer. For the current FC layer, we carry out connection-wise pruning based on the FC mutation rate.

2.2 Heuristic Pruning Procedure

2.2.1 Genetic Encoding and Initialization.

A CNN is encoded as a genome consisting of parameter genes W₁, W₂, …, W_D, where D denotes the depth of the CNN and W_l denotes the l-th layer's parameters: a 4D tensor of size N×C×H×W in a convolutional (CONV) layer, or a 2D tensor of size M×N in a fully-connected (FC) layer. Here N, C, H and W denote the number of filters, the number of input channels, and the height and width of kernels, while M and N denote the number of output and input features, respectively. We apply P independent mutations to a pre-trained CNN to generate the initial population consisting of P genomes.
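A minimal illustration of this encoding (the layer shapes below are hypothetical; the genome is simply the list of per-layer weight tensors):

```python
import numpy as np

# Hypothetical LeNet-like genome: one gene (weight tensor) per layer.
rng = np.random.default_rng(0)
genome = [
    rng.standard_normal((6, 1, 5, 5)),    # CONV1: filters x channels x H x W
    rng.standard_normal((16, 6, 5, 5)),   # CONV2
    rng.standard_normal((120, 400)),      # FC1: output x input features
    rng.standard_normal((10, 120)),       # FC2
]
depth = len(genome)  # D, the depth of the encoded CNN
```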

2.2.2 Selection.

We straightforwardly select the top genomes with minimum fitness to reproduce the next generation. It is worth mentioning that we have attempted a variety of selection operations, such as tournament selection, roulette-wheel selection and truncation selection. Our empirical results indicate that the different selection operations ultimately obtain similar performance, but the vanilla selection we adopt has the fastest convergence speed.

2.2.3 Crossover.

Crossover operations occur among the selected genomes based on the crossover rate. We employ the classical microbial crossover, first proposed in [1] and inspired by bacterial conjugation. For each crossover, we choose two genomes at random; the one with lower fitness is called the Winner genome, and the other the Loser genome. Each gene in the Loser genome is then overwritten by the corresponding Winner gene with 50% probability. Thus, the Winner genome remains unchanged to preserve its good performance, while the Loser genome is modified, possibly generating better performance through infection by the Winner. One potential strength of microbial crossover is that it implicitly carries the elite genome into the next generation, since the fittest genome wins any tournament against any other genome.
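One microbial-crossover step can be sketched as follows, assuming genomes are lists of genes and fitness_fn maps a genome to a scalar (lower is better):

```python
import random

def microbial_crossover(pop, fitness_fn, rng=None):
    """Pick two random genomes; copy each Winner gene into the Loser
    with 50% probability. The Winner is left untouched."""
    rng = rng or random.Random()
    i, j = rng.sample(range(len(pop)), 2)
    winner, loser = (i, j) if fitness_fn(pop[i]) <= fitness_fn(pop[j]) else (j, i)
    for g in range(len(pop[loser])):
        if rng.random() < 0.5:
            pop[loser][g] = pop[winner][g]
    return winner, loser
```

Because the Winner is never modified, the current elite survives every tournament it enters, which is the implicit elitism noted above.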

2.2.4 Mutation.

Mutation is performed for every genome except the elite, with separate mutation rates for CONV layers and FC layers. Following [2], we employ coarse-grained pruning on CONV layers and fine-grained pruning on FC layers, both of which are sketched in Fig. 1.
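The two pruning granularities can be sketched as follows (function names and rates are illustrative; the channel-wise follow-up pruning of the next CONV layer, shown in Fig. 1, is included for completeness):

```python
import numpy as np

def mutate_conv(w, w_next, p_conv, rng):
    """Coarse-grained mutation: zero out whole filters of a CONV tensor
    (N x C x H x W) with probability p_conv, and zero the corresponding
    input channels of the next CONV layer."""
    w, w_next = w.copy(), w_next.copy()
    killed = rng.random(w.shape[0]) < p_conv
    w[killed] = 0.0
    w_next[:, killed] = 0.0  # channel-wise pruning of the next layer
    return w, w_next

def mutate_fc(w, p_fc, rng):
    """Fine-grained mutation: zero out individual connections of an
    M x N weight matrix with probability p_fc."""
    w = w.copy()
    w[rng.random(w.shape) < p_fc] = 0.0
    return w
```

Zeroing whole filters keeps the sparsity structured, which is why CONV pruning translates directly into FLOP reduction, whereas the unstructured FC zeros mainly reduce parameter count.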

Input: pre-trained CNN parameters, maximum number of iterations, population size, number of selected genomes, crossover rate, CONV and FC mutation rates, number of interval iterations
Output: parameters of the elite genome
1: initialize the population by applying mutations to the pre-trained CNN
2: for each of the maximum number of iterations do
3:     perform selection, crossover and mutation; update the elite genome
4:     if the interval number of iterations has elapsed then
5:         retrain the elite genome
6:         reinitialize the population from the trained elite
7:     end if
8: end for
Algorithm 1 Multi-objective Pruning by GA

2.2.5 Main Procedure.

After each heuristic pruning process including selection, crossover and mutation with iterations, we retrain the elite genome so that the remained weights can compensate for the loss of accuracy, and then reinitialize the population by the trained elite genome. The above procedures are repeated iteratively until the fitness of the elite genome is convergence. Algorithm 1 illustrates the whole procedures of multi-objective pruning by GA.
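The whole procedure can be sketched as a runnable toy under heavy assumptions: a genome is a flat list of weights, "retraining" is stubbed out as the identity, and fitness simply rewards pruning. The helper names (ga_prune, mutate_fn, retrain_fn) are ours, not the paper's:

```python
import random

def ga_prune(init_genome, fitness_fn, mutate_fn, retrain_fn,
             T=50, P=10, k=4, t=10, rng=None):
    """Two-step GA pruning loop: selection/crossover/mutation each iteration,
    then every t iterations retrain the elite and reinitialize the population."""
    rng = rng or random.Random(0)
    pop = [mutate_fn(list(init_genome), rng) for _ in range(P)]
    elite = min(pop, key=fitness_fn)
    for it in range(1, T + 1):
        parents = sorted(pop, key=fitness_fn)[:k]      # truncation selection
        pop = []
        while len(pop) < P:
            a, b = rng.sample(parents, 2)              # uniform gene mixing
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            pop.append(mutate_fn(child, rng))
        cand = min(pop, key=fitness_fn)
        if fitness_fn(cand) < fitness_fn(elite):
            elite = cand
        if it % t == 0:
            elite = retrain_fn(elite)                  # recover accuracy
            pop = [mutate_fn(list(elite), rng) for _ in range(P)]
    return elite

# Toy demo: genome of 8 "weights"; mutation zeroes entries; fitness counts
# surviving weights, so the GA should drive them toward zero.
prune = lambda g, rng: [0 if rng.random() < 0.3 else x for x in g]
best = ga_prune([1] * 8, fitness_fn=sum, mutate_fn=prune,
                retrain_fn=lambda g: g)
```

In the real setting, fitness_fn would evaluate the weighted error/computation/sparsity objective on the validation set, and retrain_fn would fine-tune the surviving weights.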

3 Experimental Results and Analyses

Figure 2: Pruning process of GA under different objective weightings. The blue, orange, green and red curves show the fitness, error, sparsity and FLOPs of the elite, respectively.

The hyper-parameters of GA are the population size, number of selected genomes, crossover rate, CONV and FC mutation rates, and iteration number. We find that further hyper-parameter tuning, such as increasing the population size or decreasing the mutation rates, can obtain better results, but at the cost of more running time.

A comprehensive comparison with state-of-the-art approaches is summarized in Table 1. We highlight in particular that different pruning performances can be obtained by adjusting the objective weights. Meanwhile, we empirically analyze the effectiveness of custom weightings with the corresponding curves exhibited in Fig. 2. Note that CONV layers and FC layers are the main sources of computation and parameter size, respectively, and the error weight cannot be set too small if a low error is to be ensured.

  1. Baseline: approximately equal weights for all three objectives. This reaches the overall optimal compression performance, but with a relatively higher error rate.

  2. Emphasis on computation. This setting aims at high-speed inference for the CNN. In this case, computation achieves its maximum reduction, but sparsity is hard to optimize because GA pays less attention to pruning FC layers, which are not the main source of computation.

  3. Emphasis on sparsity. This setting aims at a CNN with low storage. In this case, we obtain the utmost sparsity and high-level computation reduction simultaneously. Although CONV layers play only a minor role in the overall parameter size, they also reach high sparsity because coarse-grained pruning is tractable; thus, emphasizing sparsity can also indirectly facilitate computation reduction.

  4. Emphasis on error. This setting aims at minimal performance loss. In this case, the error curve always stays at a low level, which makes GA conservative about pruning both CONV and FC layers. Hence, the parameter and FLOPs curves fall more slowly compared with the baseline.

Approach | Error | Computation | Sparsity | Accuracy change
LeNet baseline [4, 5] | 0.8% | 100% | 0% | -
LNA [6] | 0.7% | - | 90.5% | +0.1%
SSL [7] | 0.9% | 25.64% | 75.1% | -0.1%
TSNN [8] | 0.79% | 13% | 95.84% | +0.01%
SparseVD [9] | 0.75% | 45.66% | 92.58% | +0.05%
StructuredBP [10] | 0.86% | 9.53% | 79.8% | -0.06%
L0 Regularization [11] | 1.0% | 23.22% | 99.14% | -0.2%
RA-2-0.1 [12] | 0.9% | - | 97.7% | -0.1%
GA (baseline weights) | 0.93% | 6.22% | 94.30% | -0.13%
GA (computation emphasis) | 0.87% | 6.10% | 71.63% | -0.07%
GA (sparsity emphasis) | 0.89% | 9.00% | 95.42% | -0.09%
GA (error emphasis) | 0.85% | 8.16% | 91.00% | -0.05%
Table 1: Comparison against pruning approaches evaluated on the MNIST dataset [3]. Each GA row is labeled with the objective its weighting emphasizes.

Compared with other approaches, although we do not obtain the highest sparsity, our computation achieves an outstanding reduction because of coarse-grained pruning. Approaches with larger sparsity typically employ fine-grained pruning, which readily increases sparsity but does not essentially reduce the FLOPs of the sparse weight tensors. Furthermore, our approach can perform a multi-objective trade-off according to the actual requirements, whereas state-of-the-art approaches are unable to do so.

4 Conclusion

We propose a heuristic GA to prune CNNs based on a multi-objective trade-off, which can obtain a variety of desirable compression performances. Moreover, we develop a two-step pruning framework for evolutionary algorithms, which may open a door to introducing biologically-inspired methodologies to the field of CNN pruning. As future work, GA will be further investigated and improved to prune larger-scale CNNs.


  • [1] Harvey, I.: The microbial genetic algorithm. In: Proceedings of the 10th European Conference on Advances in Artificial Life, pp. 126–133 (2009)
  • [2] Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., Dally, W.J.: Exploring the granularity of sparsity in convolutional neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, pp. 1927–1934 (2017)

  • [3] MNIST dataset, Last accessed 23 Mar 2019
  • [4] LeNet implementation by TensorFlow. Last accessed 23 Mar 2019
  • [5] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
  • [6] Srinivas, S., Babu, R.V.: Learning neural network architectures using backpropagation. In: Proceedings of the British Machine Vision Conference. BMVA Press (2016)

  • [7] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in neural information processing systems, pp. 2074–2082 (2016)
  • [8] Srinivas, S., Subramanya, A., Babu, R.V.: Training sparse neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, pp. 455–462 (2017)
  • [9] Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: 34th International Conference on Machine Learning, pp. 2498–2507 (2017)

  • [10] Neklyudov, K., Molchanov, D., Ashukha, A., Vetrov, D.: Structured bayesian pruning via log-normal multiplicative noise. In: Advances in Neural Information Processing Systems, pp. 6775-6784 (2017)
  • [11] Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through L0 regularization. In: Proceedings of the International Conference on Learning Representations, ICLR (2018)
  • [12] Dong, X., Liu, L., Li, G., Zhao, P., Feng, X.: Fast CNN pruning via redundancy-aware training. In: 27th International Conference on Artificial Neural Networks, pp. 3–13 (2018). doi: 10.1007/978-3-030-01418-6_1