1 Introduction
Convolutional neural networks (CNNs) have demonstrated superior performance on various computer vision tasks [12, 59, 13]. However, CNNs require large storage space, a high computational budget, and great memory utilization, which can far exceed the resource limits of edge devices such as mobile phones and embedded gadgets. As a result, many methods have been proposed to reduce their cost, such as weight quantization [7, 10, 24], tensor factorization [33, 37], weight pruning [25, 73], and channel pruning [74, 36, 29]. Among them, channel pruning is the preferred approach for learning dense compact models and has received increasing attention from the research community.

Channel pruning is usually achieved in three steps: (1) score channels' importance with a handcrafted pruning function; (2) remove redundant channels based on the scores; (3) retrain the network. The performance of channel pruning largely depends on the pruning function used in step (1). Current scoring metrics are mostly handcrafted to extract crucial statistics from channels' feature maps [31, 72] or kernel parameters [40, 29] in a label-less [48, 28] or label-aware [74, 36] manner. However, the design space of pruning functions is so large that handcrafted metrics are often suboptimal, while enumerating all functions in the space is impossible.
To this end, we propose a novel approach to automatically learn transferable pruning functions, which advances pruning performance, as shown in Fig. 1. In particular, we design a function space and leverage an evolution strategy, genetic programming [3], to discover novel pruning functions. We carry out an end-to-end evolution process where a population of functions is evolved by applying them to pruning tasks on small datasets. The evolved functions are closed-form and explainable, and are later transferred to conduct pruning tasks on larger and more challenging datasets. Our learned functions are transferable and generalizable: (1) they are applicable to pruning tasks on different datasets and networks without any manual modification after evolution; (2) they demonstrate competitive pruning performance on datasets and networks different from those used in the evolution process. Such transferability and generalizability provide a unique advantage to our approach, whereas prior meta-pruning methods like MetaPruning [47] and LFPC [27] are learned and evolved on the same tasks, have no transferability, and perform worse than ours.
More specifically, we adopt an expression tree encoding scheme to represent a pruning function. To ensure the transferability of the evolved functions, the function space of operands and operators is carefully designed. We meta-learn the pruning effectiveness of each function by applying it to pruning tasks on two different networks and datasets, LeNet on MNIST and VGGNet on CIFAR-10. For each task, we keep retraining hyperparameters and other pruning settings the same for every function evaluation, allowing us to optimize solely for the functions' effectiveness. The accuracies from both tasks are combined as the indicator of a function's effectiveness. We observe that evolving on two tasks produces better functions than evolving on only one of them. More surprisingly, we find that our scheme produces more effective pruning functions than directly evolving on a large dataset, e.g., ILSVRC-2012, under the same computational budget. We analyze the merits of an evolved function both mathematically and visually, and transfer it to three larger datasets, CIFAR-100, SVHN, and ILSVRC-2012, where it exceeds the state-of-the-art pruning results on all of them.
Our main contributions are threefold: (i) We propose a new paradigm of channel pruning, which learns transferable pruning functions to further improve pruning efficacy. (ii) We develop a unified co-evolution framework on two small datasets with a novel transferable function design space and leverage an evolution strategy to traverse it. We show this methodology is more cost-effective than evolution on a large dataset, e.g., ILSVRC-2012. (iii) Our evolved functions show strong generalizability to datasets unseen during the evolution process and achieve state-of-the-art pruning results. For example, with 26.9% and 53.4% FLOPs reductions from MobileNetV2, we achieve top-1 accuracies of 71.90% and 69.16% on ILSVRC-2012, outperforming the state of the art.
2 Related Work
Hand-Crafted Channel Pruning. Channel pruning [72, 36, 74, 29] is generally realized by using a handcrafted pruning function to score channels' saliency and remove redundant ones. Based on the scoring procedure, it can be categorized into label-less pruning and label-aware pruning.
Label-less channel pruning typically adopts a norm-based property of a channel's feature maps or associated filters as the pruning criterion [40, 48, 31, 50, 49, 28, 70, 29, 42]. For example, Liu et al. [48] and Ye et al. [70] use the absolute value of the scaling factors in batch norm, while the ℓ1-norm and ℓ2-norm of channels' associated filters are computed in [40, 28, 42] as channels' importance. On the other hand, researchers have designed metrics to evaluate the class discrepancy of channels' feature maps for label-aware pruning [74, 36, 46]. Zhuang et al. [74] insert discriminant losses into the network and remove channels that are least correlated with the losses after iterative optimization. Kung et al. [36] and Liu et al. [46] adopt closed-form discriminant functions to accelerate the scoring process. While these works use handcrafted scoring metrics, we learn transferable and generalizable pruning functions automatically.
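To make the norm-based, label-less criterion concrete, here is a minimal sketch in plain Python; the toy filter values and the 50% pruning ratio are illustrative, not from the paper:

```python
def l1_norm_scores(filters):
    """Score each channel by the l1-norm of its associated filter.

    `filters` is a hypothetical list of per-channel weight tensors,
    each flattened to a list of floats; larger score = more important.
    """
    return [sum(abs(w) for w in f) for f in filters]

def prune_smallest(scores, ratio):
    """Return indices of channels to remove: the lowest-scoring fraction."""
    n_remove = int(len(scores) * ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[:n_remove])

# four toy channels, each with a 2-weight filter
filters = [[0.9, -0.8], [0.01, 0.02], [-0.5, 0.4], [0.03, -0.01]]
scores = l1_norm_scores(filters)
removed = prune_smallest(scores, 0.5)  # prune half of the channels
```

The same skeleton applies to any label-less metric: only `l1_norm_scores` changes, while the threshold-and-remove step stays fixed.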
Meta-Learning. Our work falls into the category of meta-learning, where prior research has attempted to optimize machine learning components, including hyperparameters [5, 62, 17], optimizers [8, 68, 4], and neural network structures [75, 76, 67, 44, 69, 58, 45, 57].

Prior works on neural architecture search (NAS) have leveraged reinforcement learning (RL) to discover high-performing network structures [75, 2, 76, 6, 66, 67]. Recently, NAS algorithms have also been adopted to find efficient network structures [67, 66]. Another line of work adopts evolution strategies (ES) to explore the space of network structures [16, 69, 58, 45, 57, 11, 52, 63], demonstrating performance competitive with RL methods. This notion was pioneered by neuroevolution [65, 18, 64], which evolves the topology of small neural networks. In the era of deep learning, Real et al. [57] leverage ES to find networks that improve over the ones found by RL [76]. Dai et al. [11] apply ES to design efficient and deployable networks for mobile platforms.

Compared to prior works, we propose a new paradigm that leverages meta-learning techniques for efficient network design. More specifically, we learn transferable pruning functions that outperform current handcrafted metrics to improve channel pruning. These evolved functions can also be applied to prune redundant channels in NAS-learned structures to further enhance their efficiency.
Meta-Pruning. Prior works [32, 30, 47, 9, 27] have also adopted a similar notion of learning to prune a CNN. We note that the evolution strategy is used in LeGR [9] and MetaPruning [47] to search for a pair of pruning parameters and network encoding vectors, respectively. However, our evolutionary learning is drastically different from theirs in terms of search space and search candidates: we search for effective combinations of operands and operators to build transferable and powerful pruning functions. He et al. propose LFPC [27] to learn network pruning criteria (functions) across layers by training a differentiable criteria sampler. However, rather than learning new pruning functions, their goal is to search within a pool of existing pruning criteria and find candidates suited to a certain layer's pruning. In contrast, our evolution recombines operands and operators and produces novel pruning criteria that are generally good for all layers.

We also note that MetaPruning [47], LFPC [27], and other methods [32, 30, 9] are all learned on one task (dataset and network) and applied only to the same task, with no transferability. In contrast, we need only one evolutionary learning process, which outputs evolved functions that are transferable across multiple tasks and demonstrate competitive performance on all of them.
3 Methodology
In Fig. 2, we present our evolution framework, which leverages genetic programming [3] to learn high-quality channel pruning functions. We first describe the design space used to encode channel scoring functions. Next, we discuss the pruning tasks used to evaluate the functions' effectiveness. Lastly, genetic operators are defined to traverse the function space for competitive solutions.
3.1 Function Design Space
Filter-based operands | whole layer's filters; channel's incoming filter; channel's batch-normed parameter
Map-based operands | feature maps collection; two partitions of the feature maps collection (label-equal and label-unequal)
Element-wise operators | addition, subtraction, multiplication, division, absolute value, square, square root, adding ridge factor
Matrix operators | matrix trace, matrix multiplication, matrix inversion, inner product, outer product, matrix/vector transpose
Statistics operators | summation, product, mean, standard deviation, variance, counting measure
Specialized operators | rbf kernel matrix getter, geometric median getter, tensor slicer
Expression Tree. In channel pruning, a pruning function scores the channels to determine their importance/redundancy based on the feature maps, filters, and statistics associated with the channels. This scoring process can be viewed as a series of operations with operators (addition, matrix multiplication, etc.) and operands (feature maps, filters, etc.). We thus adopt an expression tree encoding scheme to represent a pruning function, where inner nodes are operators and leaves are operands.
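A minimal sketch of such an encoding, with an illustrative operator set and toy per-channel statistics (not the paper's actual operand names):

```python
class Node:
    """A pruning function encoded as an expression tree:
    inner nodes are operators, leaves are operands."""
    def __init__(self, op, children=(), value=None):
        self.op, self.children, self.value = op, children, value

    def eval(self, env):
        if self.op == "leaf":               # operand, looked up by name
            return env[self.value]
        args = [c.eval(env) for c in self.children]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        if self.op == "abs":
            return abs(args[0])
        raise ValueError(f"unknown operator {self.op}")

# encodes |w0| + w0 * w1, evaluated on toy per-channel statistics
tree = Node("add", [Node("abs", [Node("leaf", value="w0")]),
                    Node("mul", [Node("leaf", value="w0"),
                                 Node("leaf", value="w1")])])
score = tree.eval({"w0": -2.0, "w1": 3.0})  # abs(-2) + (-2 * 3) = -4.0
```

Evolving a pruning function then amounts to rearranging such trees rather than tuning numeric weights.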
As shown in Tab. 1 and 2, our function design space includes two types of operands (6 operands in total) and four types of operators (23 operators in total), via which a vast number of pruning functions can be expressed. The statistics operators can compute the statistics of an operand in two dimensions, namely, the global dimension (subscript 'g') and the sample dimension (subscript 's'). The global-dimension operators flatten operands into a 1-D sequence and extract the corresponding statistics, while the sample-dimension operators compute statistics along the axis of samples. For example, a global summation operator returns the sum of all entries of a kernel tensor, while a sample-dimension mean operator returns the sample average of all feature maps. We also include specialized operators which allow us to build complicated but competitive metrics like maximum mean discrepancy (MMD) [23] and the filter's geometric median [29].
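The global- vs. sample-dimension distinction can be illustrated on a toy operand of 3 samples with 2 features each; the names `sum_g` and `mean_s` follow the subscript convention above, and the implementation is a sketch:

```python
# toy "feature maps collection": 3 samples x 2 features
F = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

def sum_g(x):
    """Global-dimension sum: flatten to 1-D, reduce to a scalar."""
    return sum(v for row in x for v in row)

def mean_s(x):
    """Sample-dimension mean: average over the sample axis, returning
    one value per feature (i.e., the 'average feature map')."""
    n = len(x)
    return [sum(col) / n for col in zip(*x)]

g = sum_g(F)    # 21.0: one scalar over all entries
s = mean_s(F)   # [3.0, 4.0]: one value per feature position
```

The same operand can thus feed either a scalar-valued or a vector-valued subtree, which is what makes the two statistic families distinct building blocks.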
Function Encoding. The channel scoring functions can be categorized into two types: label-less metrics and label-aware metrics. For label-less functions like the filter's norm, we adopt a direct encoding scheme, as the expression tree shown in Fig. 3.
For label-aware metrics such as the one in [36] and MMD [23], which measure the class discrepancy of the feature maps, we observe a common computation graph, as shown in Fig. 3: (1) partition the feature maps in a label-wise manner; (2) apply the same operations to each label partition and to all feature maps; (3) average/sum the scores of all partitions to obtain a single scalar. These metrics can be naively encoded as branch trees with one subtree per class label. However, directly using this naive encoding would result in data-dependent, non-transferable metrics because: (1) the number of class labels varies from dataset to dataset (e.g., a metric for CIFAR-10 is not transferable to CIFAR-100); (2) mutating the subtrees differently could make the metric overfit to a specific label numbering scheme (e.g., for a metric with different subtrees on class 1 and class 2, renumbering the labels would change what the metric computes, which is undesirable).
To combat these issues, we express a label-aware function by a uni-tree which encodes the common operations applied to each label partition, as explained in Fig. 3. Instead of directly encoding operands from a specific label partition, such as the feature maps with labels equal to 1 and the feature maps with labels not equal to 1, we use symbolic operands that generically encode the partition concept. In the actual scoring process, the uni-tree is compiled back into a branch computation graph, with the symbolic operands converted to the specific map partitions. Such uni-tree encoding allows us to evolve label-aware metrics independent of the number of class labels and the label numbering scheme, which ensures their transferability to datasets unseen by the evolution process.
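A sketch of how a uni-tree could be compiled back into its branch form; the partition-scoring callback and the toy data below are hypothetical stand-ins for the shared subtree:

```python
def compile_unitree(score_partition, labels, maps):
    """Compile a uni-tree into its branch form: apply the same
    partition-wise scoring to every class label, then average.

    `score_partition(eq, neq)` plays the role of the shared subtree,
    applied to the maps whose label equals / does not equal a class.
    """
    classes = sorted(set(labels))
    total = 0.0
    for c in classes:
        eq = [m for m, l in zip(maps, labels) if l == c]
        neq = [m for m, l in zip(maps, labels) if l != c]
        total += score_partition(eq, neq)
    return total / len(classes)

# toy shared subtree: distance between the two partition means
mean = lambda xs: sum(xs) / len(xs)
score = compile_unitree(lambda eq, neq: abs(mean(eq) - mean(neq)),
                        labels=[0, 0, 1, 1], maps=[1.0, 2.0, 5.0, 6.0])
```

Because the shared subtree is written once and instantiated per class, the same encoded function runs unchanged on a 10-class or a 100-class dataset.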
Under this scheme, we can implement a broad range of competitive pruning functions: filter's ℓ1-norm [40], filter's ℓ2-norm [28], batch norm's scaling factor [48], filter's geometric median [29], Discriminant Information [36], MMD [23], Absolute SNR [22], Student's T-Test [39], Fisher Discriminant Ratio [56], and Symmetric Divergence [51]. For the last four metrics, we adopt the scheme in [46] for channel scoring. We name this group of functions the state-of-the-art population (SOAP), which helps our evolution in many aspects. For instance, in Sec. 6, we find that initializing the population with SOAP evolves better pruning functions than random initialization. A detailed implementation of SOAP is included in the Supplementary.

3.2 Function Effectiveness Evaluation
The encoded functions are then applied to empirical pruning tasks to evaluate their effectiveness. To avoid overfitting to certain data patterns and to increase the generality of the evolved functions, we co-evolve the population of functions on two different pruning tasks: LeNet-5 [38] on MNIST [38] and VGG-16 [61] on CIFAR-10 [35]. In both pruning tasks, we adopt a one-shot pruning scheme and report the retrained accuracies on the validation sets. For each pruning task, we keep the pruning settings (layers' pruning ratios, target pruning layers, etc.) and the retraining hyperparameters (learning rate, optimizer, weight decay factor, etc.) the same for all evaluations throughout the evolution process. This guarantees a fair effectiveness comparison across functions in all generations and ensures we are evolving better functions rather than better hyperparameters. In this way, we can meta-learn powerful functions that perform well on both MNIST and CIFAR-10 and are generalizable to other datasets. Not surprisingly, co-evolution on both tasks produces stronger pruning functions than evolving on only one of them, as shown in Sec. 3.3. Moreover, in Sec. 6, we find our strategy enjoys better cost-effectiveness than direct evolution on a large dataset, e.g., ILSVRC-2012.
3.3 Function Fitness
After evaluation, each encoded function receives two accuracies, Acc1 and Acc2, from the two pruning tasks. We investigate two accuracy combination schemes, the weighted arithmetic mean (Eqn. 1) and the weighted geometric mean (Eqn. 2), to obtain the joint fitness of a function. A free parameter w ∈ [0, 1] is introduced to control the weights of the different tasks.

fitness = w · Acc1 + (1 − w) · Acc2    (1)

fitness = Acc1^w · Acc2^(1−w)    (2)
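A sketch of the two combination schemes, assuming the standard weighted-mean forms with a single trade-off parameter w:

```python
def fitness_arithmetic(acc1, acc2, w):
    """Weighted arithmetic mean of the two task accuracies (Eqn. 1)."""
    return w * acc1 + (1 - w) * acc2

def fitness_geometric(acc1, acc2, w):
    """Weighted geometric mean (Eqn. 2); w in [0, 1] trades off the two
    tasks, and w = 0 or w = 1 degenerates to single-dataset evolution."""
    return acc1 ** w * acc2 ** (1 - w)

# toy accuracies from the two pruning tasks
a = fitness_arithmetic(0.99, 0.93, 0.5)
g = fitness_geometric(0.99, 0.93, 0.5)
```

The geometric mean penalizes a function that excels on one task but collapses on the other more strongly than the arithmetic mean, which may be why it tends to select more transferable functions.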
Ablation Study. To decide the fitness combination scheme for the main experiments, we conduct 10 small preliminary evolution tests using a grid of w values with both combination schemes. Note that when w ∈ {0, 1}, the co-evolution degenerates to single-dataset evolution. We empirically evaluate the generalizability of the best evolved functions from each test by applying them to prune a ResNet-38 on CIFAR-100. Note that CIFAR-100 is not used in the evolution process, so performance on it speaks well for the evolved functions' generalizability. In Fig. 6, we find that evolving solely on MNIST would be the least effective option for CIFAR-100 transfer pruning. In addition, we find that functions evolved on two datasets (0 < w < 1) generally perform better than those evolved on a single dataset. We observe that the weighted geometric mean with an intermediate w leads to the best result, which is later adopted in the main experiments.
3.4 Genetic Operations
Selection. After evaluation, the population will undergo a selection process, where we adopt tournament selection [21] to choose a subset of competitive functions.
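Tournament selection can be sketched in a few lines; the population and fitness here are toy stand-ins for encoded functions and their joint fitness:

```python
import random

def tournament_select(population, fitness, k, tournsize):
    """Pick k winners; each winner is the fittest individual among a
    randomly sampled subset of `tournsize` contestants."""
    winners = []
    for _ in range(k):
        contestants = random.sample(population, tournsize)
        winners.append(max(contestants, key=fitness))
    return winners

random.seed(0)
pop = list(range(20))                      # stand-in individuals
best = tournament_select(pop, fitness=lambda x: x, k=10, tournsize=4)
```

Small tournaments keep selection pressure mild: weak individuals can still win a lucky tournament, which helps preserve genetic diversity.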
Reproduction. This subset of functions is then used to reproduce individuals for the next generation. However, we observe shrinkage of the population's genetic diversity when all offspring are reproduced from the selected parents, as those parents represent only a small pool of genomes. Such diversity shrinkage would result in premature convergence of the evolution process. To combat this issue, we reserve a slot in the reproduced population and fill it with individuals randomly cloned from SOAP or built as random trees. We find this adjustment empirically useful in helping the evolution proceed longer.
Mutation and Crossover. We finally conduct mutation and crossover on the reproduced population to traverse the function design space for new expressions. We adopt the conventional scheme of random tree mutation and one-point crossover [3], illustrated in Fig. 4 with toy examples. After mutation and crossover, the population goes through the next evolution iteration.
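A sketch of one-point crossover on expression trees represented as nested tuples; the tree representation and operator names are illustrative, not the paper's actual genome format:

```python
import random

def subtrees(tree, path=()):
    """Enumerate (path, subtree) pairs of a nested-tuple expression tree;
    tuples are operator nodes (head = operator), strings are leaves."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` swapped for `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(a, b, rng):
    """One-point crossover: swap a random subtree of `a` with one of `b`."""
    pa, sa = rng.choice(list(subtrees(a)))
    pb, sb = rng.choice(list(subtrees(b)))
    return replace(a, pa, sb), replace(b, pb, sa)

rng = random.Random(1)
a = ("add", ("abs", "w0"), "w1")
b = ("mul", "g", ("sqrt", "w2"))
child1, child2 = crossover(a, b, rng)
```

Random tree mutation works the same way, except the incoming subtree is freshly generated from the operand/operator sets rather than taken from a second parent.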
4 Co-Evolution on MNIST and CIFAR-10
(3) 
Experiment Settings. We conduct the experiment with a population size of 40 over 25 generations. The population is initialized with 20 individuals randomly cloned from SOAP and 20 random expression trees. The size of the selection tournament is 4, and we select 10 functions in each generation. 24 individuals are reproduced from the selected functions, while 6 individuals are cloned from SOAP or randomly built. The mutation and crossover probabilities are both set to 0.75. We prune 92.4% of FLOPs from a LeNet-5 (baseline acc: 99.26%) and 63.0% of FLOPs from a VGG-16 (baseline acc: 93.7%), respectively. Such aggressive pruning schemes help us better identify functions' effectiveness. We use the weighted geometric mean in Eqn. 2 to combine the two validation accuracies. Our code is implemented with DEAP [19] and TensorFlow [1] for the genetic operations and the neural network pruning, respectively. The experiments are carried out on a cluster with the SLURM job scheduler [71] for workload parallelization.

Experiment Result. Our co-evolution progress is typified in Fig. 6, where the red curve denotes the functions with the maximum fitness, while the green curve plots the ones at the top 25th percentile of fitness. Both curves increase monotonically over generations, indicating that the quality of both the best function and the entire population improves over time, which demonstrates the effectiveness of our scheme. Specifically, the best pruned LeNet-5/VGG-16 in the first generation have accuracies of 99.15%/93.55%, while the best accuracies in the last generation are 99.25%/94.0%. As the first generation is initialized with SOAP functions, such results suggest that the algorithm derives metrics that outperform the handcrafted functions. The whole function evolution takes 210 GPU-days, an order of magnitude less than prior network search algorithms, e.g., [76] (2,000 GPU-days) and [58] (3,000 GPU-days). Our approach is computationally efficient, which can be crucial when less computational resource is available.
Evolved Function. We present a winning function in Eqn. 3, whose operands include the sample average of the feature maps and an all-ones vector. The first two terms of the function award a high score to channels with class-diverged feature maps, where the statistic of one label partition is significantly smaller than that of the other. Channels with such feature maps contain rich class information, as they generate distinguishable responses to different classes. The third term's denominator computes the sum of the feature map variances, while its numerator draws statistics from the average feature maps and the distance between the partition averages, which resembles the concept of SNR. Two points are worth mentioning for this function: (1) it identifies important statistical concepts from human-designed metrics, learning from Symmetric Divergence [51] to measure the divergence of class feature maps; (2) it contains unique mathematical concepts that are empirically good for measuring channel importance, shown in the novel combination of feature map statistics in the third term's numerator. Our visual result in Sec. 6 also suggests the evolved function preserves better features, showing stronger pruning effectiveness.
5 Transfer Pruning
Network | Method | Test Acc (%) | Acc Δ (%) | FLOPs | Pruned (%) | Parameters | Pruned (%)
ResNet-164 | SLIM [48] | 98.22 → 98.15 | −0.07 | 172M | 31.1 | 1.46M | 14.5
ResNet-164 | Ours-A | 98.22 → 98.25 | +0.03 | 108M | 57.4 | 0.73M | 57.8
ResNet-164 | Ours-B | 98.22 → 98.26 | +0.04 | 92M | 63.2 | 0.64M | 63.0
Network | Method | Test Acc (%) | Acc Δ (%) | FLOPs | Pruned (%) | Parameters | Pruned (%)
VGG-19 | SLIM [48] | 73.26 → 73.48 | +0.22 | 256M | 37.1 | 5.0M | 75.1
VGG-19 | GSD [46] | 73.40 → 73.67 | +0.27 | 161M | 59.5 | 3.2M | 84.0
VGG-19 | Ours | 73.40 → 74.02 | +0.62 | 155M | 61.0 | 2.9M | 85.5
ResNet-56 | SFP [28] | 71.33 → 68.37 | −2.96 | 76M | 39.3 | – | –
ResNet-56 | FPGM [29] | 71.40 → 68.79 | −2.61 | 59M | 52.6 | – | –
ResNet-56 | LFPC [27] | 71.33 → 70.83 | −0.58 | 61M | 51.6 | – | –
ResNet-56 | LeGR [9] | 72.41 → 71.04 | −1.37 | 61M | 51.4 | – | –
ResNet-56 | Ours | 72.05 → 71.70 | −0.35 | 55M | 56.2 | 0.38M | 54.9
ResNet-110 | LCCL [14] | 72.79 → 70.78 | −2.01 | 173M | 31.3 | 1.75M | 0.0
ResNet-110 | SFP [28] | 74.14 → 71.28 | −2.86 | 121M | 52.3 | – | –
ResNet-110 | FPGM [29] | 74.14 → 72.55 | −1.59 | 121M | 52.3 | – | –
ResNet-110 | TAS [15] | 75.06 → 73.16 | −1.90 | 120M | 52.6 | – | –
ResNet-110 | Ours | 74.40 → 73.85 | −0.55 | 111M | 56.2 | 0.77M | 55.8
ResNet-164 | LCCL [14] | 75.67 → 75.26 | −0.41 | 195M | 21.3 | 1.73M | 0.0
ResNet-164 | SLIM [48] | 76.63 → 76.09 | −0.54 | 124M | 50.6 | 1.21M | 29.7
ResNet-164 | DI [36] | 76.63 → 76.11 | −0.52 | 105M | 58.0 | 0.95M | 45.1
ResNet-164 | Ours | 77.15 → 77.77 | +0.62 | 92M | 63.2 | 0.66M | 61.8
To show the generalizability of our evolved pruning function, we apply the function in Eqn. 3 to more challenging datasets that are not used in the co-evolution process: CIFAR-100 [35], SVHN [54], and ILSVRC-2012 [12]. We report our pruned models at different FLOPs levels by adding a letter suffix (e.g., Ours-A). Our method is compared with metrics from SOAP, e.g., L1 [40], FPGM [29], GSD [46], and DI [36], and our evolved function outperforms these handcrafted metrics. We also include other "learn to prune" methods such as Meta [47] and LFPC [27], and other state-of-the-art methods such as DSA [55] and CC [41], for comparison. We summarize the performance (including baseline accuracies) in Tab. 3, 4, and 5, where our evolved function achieves state-of-the-art results on all datasets.
As in evolution, we adopt a one-shot pruning scheme for transfer pruning and use the SGD optimizer with Nesterov momentum [53] for retraining. The weight decay factor and the momentum are set to 0.0001 and 0.9, respectively. On SVHN/CIFAR-100, we use a batch size of 32/128 to fine-tune the network for 20/200 epochs. The learning rate is initialized at 0.05 and multiplied by 0.14 at 40% and 80% of the total number of epochs. On ILSVRC-2012, we use a batch size of 128 to fine-tune VGG-16/ResNet-18/MobileNetV2 for 30/100/100 epochs. For VGG-16/ResNet-18, the learning rate starts at 0.0006 and is multiplied by 0.4 at 40% and 80% of the total number of epochs. We use a cosine-decay learning rate schedule [60] with an initial rate of 0.03 for MobileNetV2.

SVHN. We first evaluate on SVHN with ResNet-164. Ours-A outperforms SLIM [48] by 0.1% in accuracy with significant hardware resource savings: 26.3% more FLOPs saving and 43.3% more parameter saving. Moreover, Ours-B achieves an even better accuracy with greater FLOPs and parameter savings than Ours-A, which well demonstrates the pruning effectiveness of the evolved function.
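The two retraining schedules described above can be sketched as follows; the step thresholds follow the 40%/80% milestones in the text, while the framework-level details (per-iteration vs. per-epoch updates) are assumptions:

```python
import math

def step_lr(epoch, total_epochs, base_lr=0.05, factor=0.14):
    """Step schedule used for SVHN/CIFAR-100: multiply the rate by
    `factor` once at 40% and again at 80% of training."""
    lr = base_lr
    if epoch >= 0.4 * total_epochs:
        lr *= factor
    if epoch >= 0.8 * total_epochs:
        lr *= factor
    return lr

def cosine_lr(epoch, total_epochs, base_lr=0.03):
    """Cosine decay used for MobileNetV2 fine-tuning."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```

For example, with 200 CIFAR-100 epochs the step schedule runs at 0.05, drops to 0.007 at epoch 80, and to 0.00098 at epoch 160.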
CIFAR-100. On VGG-19, our pruned model achieves an accuracy gain of 0.35% with respect to GSD [46]. Compared to LFPC [27] and LeGR [9], our pruned ResNet-56 achieves accuracy gains of 0.87% and 0.66%, respectively, while having 5% fewer FLOPs. On ResNet-110, our method outperforms FPGM [29] and TAS [15] by 1.30% and 0.69% in accuracy with 4% fewer FLOPs. In comparison with LCCL [14], SLIM [48], and DI [36], our pruned ResNet-164 achieves an accuracy of 77.77% with a 63.2% FLOPs reduction, which advances all prior methods.
ILSVRC-2012. On VGG-16, Ours-A improves over the baseline by nearly 1.1% in top-1 accuracy with 2.4× acceleration. The 3.3×-accelerated Ours-B achieves top-1/top-5 accuracies of 71.64%/90.60%, advancing the state of the art. On ResNet-18, Ours-A reduces 16.8% of FLOPs without top-1 accuracy loss. Compared to LCCL [14], Ours-B achieves a 2.72% top-1 accuracy gain with a higher FLOPs reduction ratio. Ours-C demonstrates top-1 accuracy gains of 1.75% and 1.50% with respect to SFP [28] and DCP [74]. We finally show our performance on a much more compact network, MobileNetV2, which is specifically designed for mobile deployment. When 26.9% of FLOPs are pruned, Ours-A outperforms AMC [30], Meta [47], and LeGR [9] with a top-1 accuracy of 71.90%. At a higher pruning ratio, Ours-B advances DCP [74] and Meta [47] by top-1 accuracy margins of 4.94% and 0.96%, with a 53.4% FLOPs reduction.
Network | Method | Top-1 Acc (%) | Δ (%) | Top-5 Acc (%) | Δ (%) | FLOPs (B) (Pruned %) | Params (M) (Pruned %)
VGG-16 | L1 [40] | – | – | 89.90 → 89.10 | −0.80 | 7.74 (50.0) | –
VGG-16 | CP [31] | – | – | 89.90 → 89.90 | 0.00 | 7.74 (50.0) | –
VGG-16 | GSD [46] | 71.30 → 71.88 | +0.58 | 90.10 → 90.66 | +0.56 | 6.62 (57.2) | 133.6 (3.4)
VGG-16 | Ours-A | 71.30 → 72.37 | +1.07 | 90.10 → 91.05 | +0.95 | 6.34 (59.0) | 133.5 (3.5)
VGG-16 | RNP [43] | – | – | 89.90 → 86.67 | −3.23 | 5.16 (66.7) | 138.3 (0.0)
VGG-16 | SLIM [48] | – | – | 89.90 → 88.53 | −1.37 | 5.16 (66.7) | –
VGG-16 | FBS [20] | – | – | 89.90 → 89.86 | −0.04 | 5.16 (66.7) | 138.3 (0.0)
VGG-16 | Ours-B | 71.30 → 71.64 | +0.34 | 90.10 → 90.60 | +0.50 | 5.12 (66.9) | 131.6 (4.8)
ResNet-18 | Ours-A | 70.05 → 70.08 | +0.03 | 89.40 → 89.24 | −0.16 | 1.50 (16.8) | 11.2 (3.9)
ResNet-18 | SLIM [48] | 68.98 → 67.21 | −1.77 | 88.68 → 87.39 | −1.29 | 1.31 (28.0) | –
ResNet-18 | LCCL [14] | 69.98 → 66.33 | −3.65 | 89.24 → 86.94 | −2.30 | 1.18 (34.6) | 11.7 (0.0)
ResNet-18 | Ours-B | 70.05 → 69.09 | −0.96 | 89.40 → 88.59 | −0.81 | 1.14 (36.7) | 9.3 (20.1)
ResNet-18 | SFP [28] | 70.28 → 67.10 | −3.18 | 89.63 → 87.78 | −1.85 | 1.06 (41.8) | –
ResNet-18 | DCP [74] | 69.64 → 67.35 | −2.29 | 88.98 → 87.60 | −1.38 | 0.98 (46.0) | 6.2 (47.0)
ResNet-18 | FPGM [29] | 70.28 → 68.41 | −1.87 | 89.63 → 88.48 | −1.15 | 1.06 (41.8) | –
ResNet-18 | DSA [55] | 69.72 → 68.61 | −1.11 | 89.07 → 88.35 | −0.72 | 1.09 (40.0) | –
ResNet-18 | Ours-C | 70.05 → 68.85 | −1.20 | 89.40 → 88.45 | −0.95 | 1.07 (41.0) | 8.8 (24.5)
MobileNetV2 | Uniform [60] | 71.80 → 69.80 | −2.00 | – | – | 0.22 (26.9) | –
MobileNetV2 | AMC [30] | 71.80 → 70.80 | −1.00 | – | – | 0.22 (26.9) | –
MobileNetV2 | CC [41] | 71.88 → 70.91 | −0.89 | – | – | 0.22 (28.3) | –
MobileNetV2 | Meta [47] | 72.70 → 71.20 | −1.50 | – | – | 0.22 (27.9) | –
MobileNetV2 | LeGR [9] | 71.80 → 71.40 | −0.40 | – | – | 0.22 (26.9) | –
MobileNetV2 | Ours-A | 72.18 → 71.90 | −0.28 | 90.49 → 90.38 | −0.11 | 0.22 (26.9) | 2.8 (20.4)
MobileNetV2 | DCP [74] | 70.11 → 64.22 | −5.89 | – | −3.77 | 0.17 (44.7) | 2.6 (25.9)
MobileNetV2 | Meta [47] | 72.70 → 68.20 | −4.50 | – | – | 0.14 (53.4) | –
MobileNetV2 | Ours-B | 72.18 → 69.16 | −3.02 | 90.49 → 88.66 | −1.83 | 0.14 (53.4) | 2.1 (39.3)
6 Ablation Study
Random Initial Population. In Fig. 7, we conduct a control experiment that initializes all individuals as random expression trees to study the effectiveness of SOAP initialization. We also turn off SOAP function insertion in the reproduction process for the control experiment. All other parameters (number of generations, population size, etc.) are kept the same as in Sec. 4 for a fair comparison. We find that evolving with a random population also achieves good pruning fitness, which indicates that our design space is highly expressive. However, we observe earlier convergence and a final performance gap in the control experiment compared to the main experiment in Sec. 4, demonstrating the advantage of using SOAP for evolution.
Evolution on ILSVRC-2012. In contrast to our co-evolution strategy on MNIST and CIFAR-10, we conduct a function evolution on ILSVRC-2012 as a control experiment. We restrict the total computation budget to be the same as in Sec. 4, i.e., 210 GPU-days, and evolve on ResNet-18 with a population size of 40 over 25 generations. Due to the constrained budget, each pruned net is only retrained for 4 epochs. We include detailed evolution settings and results in the Supplementary. Two major drawbacks are found with this evolution strategy: (1) Imprecise evaluation. Due to the lack of training epochs, a function's actual effectiveness is not precisely revealed. We take two functions with fitness 63.24 and 63.46 from the last generation and use them again to prune ResNet-18, but fully retrain for 100 epochs. We find that the one with lower fitness in evolution achieves an accuracy of 68.27% in the full training, while the higher one only reaches 68.02%. Such results indicate that the evaluation in this evolution procedure can be inaccurate, while our strategy ensures full retraining for a precise effectiveness assessment. (2) Inferior performance. The best function evolved with this method (given in the Supplementary) performs worse than the function in Eqn. 3 when transferred to a different dataset. In particular, when pruning 56% of FLOPs from ResNet-110 on CIFAR-100, it only achieves an accuracy of 72.51%, while the function in Eqn. 3 reaches 73.85%. These two issues suggest that co-evolution on two small datasets has better cost-effectiveness than using a large-scale dataset like ILSVRC-2012.
Visualization of Feature Selection. We further visually examine the pruning decisions made by our evolved function (right) vs. DI [36] (middle) on MNIST features in Fig. 8. Red pixels indicate features deemed important by the metrics, while blue ones are redundant. Taking the average feature value map (left) for reference, we find that our evolved function tends to select features with higher means, where the MNIST pattern is more robust.

7 Conclusion
In this work, we propose a new paradigm for channel pruning, which first learns novel channel pruning functions on small datasets and then transfers them to larger and more challenging datasets. We develop an efficient genetic programming framework to automatically search for competitive pruning functions over our vast function design space. We present and analyze a closed-form evolved function that offers strong pruning performance and further streamlines the design of our pruning strategy. Without any manual modification, the learned pruning function exhibits remarkable generalizability to datasets different from those used in the evolution process. Specifically, on SVHN, CIFAR-100, and ILSVRC-2012, we achieve state-of-the-art pruning results.
References
 [1] (2016) Tensorflow: a system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §4, §9.3.
 [2] (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.
 [3] (1998) Genetic programming. Springer. Cited by: §1, §3.4, §3.
 [4] (2017) Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 459–468. Cited by: §2.

[5]
(2013)
Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures
. Cited by: §2.  [6] (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.
 [7] (2015) Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294. Cited by: §1.
 [8] (2017) Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 748–756. Cited by: §2.

[9]
(2020)
Towards efficient model compression via learned global ranking.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp. 1518–1528. Cited by: §2, §2, Table 4, Table 5, §5, §5.  [10] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or1. arXiv preprint arXiv:1602.02830. Cited by: §1.
 [11] (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11398–11407. Cited by: §2.
 [12] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1, §5.

 [13] (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: §1.
 [14] (2017) More is less: a more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848. Cited by: Table 4, Table 5, §5, §5.
 [15] (2019) Network pruning via transformable architecture search. In Advances in Neural Information Processing Systems, pp. 759–770. Cited by: Table 4, §5.

 [16] (2016) Convolution by evolution: differentiable pattern producing networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 109–116. Cited by: §2.
 [17] (2015) Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pp. 2962–2970. Cited by: §2.
 [18] (2008) Neuroevolution: from architectures to learning. Evolutionary Intelligence 1 (1), pp. 47–62. Cited by: §2.

 [19] (2012) DEAP: evolutionary algorithms made easy. Journal of Machine Learning Research 13, pp. 2171–2175. Cited by: §4.
 [20] (2018) Dynamic channel pruning: feature boosting and suppression. arXiv preprint arXiv:1810.05331. Cited by: Table 5.

 [21] (1991) A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms, Vol. 1, pp. 69–93. Cited by: §3.4.
 [22] (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (5439), pp. 531–537. Cited by: §3.1.
 [23] (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §3.1, §3.1, §3.1.
 [24] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1.
 [25] (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, Cited by: §1.
 [26] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §9.1, §9.2, §9.3.
 [27] (2020) Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2009–2018. Cited by: §1, §2, §2, Table 4, §5, §5.
 [28] (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: §1, §2, §3.1, Table 4, Table 5, §5.
 [29] (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §1, §1, §2, §2, §3.1, §3.1, Table 4, Table 5, §5, §5.
 [30] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §2, §2, Table 5, §5.
 [31] (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §1, §2, Table 5.
 [32] (2018) Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 709–718. Cited by: §2, §2.
 [33] (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §1.
 [34] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §9.2.
 [35] (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §3.2, §5, §9.1, §9.2.
 [36] (2019) Methodical design and trimming of deep learning networks: enhancing external BP learning with internal omnipresent-supervision training paradigm. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8058–8062. Cited by: §1, §1, §2, §2, §3.1, §3.1, Table 4, §5, §5, Figure 8, §6.
 [37] (2014) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §1.
 [38] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11). Cited by: §3.2, §9.2.
 [39] (2006) Testing statistical hypotheses. Springer Science & Business Media. Cited by: §3.1.
 [40] (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1, §2, §3.1, Table 5, §5, §9.2.
 [41] (2021) Towards compact cnns via collaborative compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6438–6447. Cited by: Table 5, §5.
 [42] (2019) Exploiting kernel sparsity and entropy for interpretable cnn compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2809. Cited by: §2.
 [43] (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. Cited by: Table 5.
 [44] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §2.
 [45] (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §2, §2.
 [46] (2020) Rethinking class-discrimination based CNN channel pruning. arXiv preprint arXiv:2004.14492. Cited by: §2, §3.1, Table 4, Table 5, §5, §5.
 [47] (2019) MetaPruning: meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3296–3305. Cited by: §1, §2, §2, Table 5, §5, §5.
 [48] (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §1, §2, §3.1, Table 4, Table 5, §5, §5.
 [49] (2017) Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312. Cited by: §2.
 [50] (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066. Cited by: §2.

 [51] (2006) A solution to the curse of dimensionality problem in pairwise scoring techniques. In International Conference on Neural Information Processing, pp. 314–323. Cited by: §3.1, §4.
 [52] (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §2.
 [53] (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, Vol. 269, pp. 543–547. Cited by: §10, §5, §9.1, §9.2.
 [54] (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §5.
 [55] (2020) DSA: more efficient budgeted pruning via differentiable sparsity allocation. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III, pp. 592–607. Cited by: Table 5, §5.
 [56] (2001) Gene functional classification from heterogeneous data. In Proceedings of the fifth annual international conference on Computational biology, pp. 249–255. Cited by: §3.1.

 [57] (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §2, §2.
 [58] (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2902–2911. Cited by: §2, §2, §4.
 [59] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §1.
 [60] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: Table 5, §5.
 [61] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
 [62] (2015) Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180. Cited by: §2.
 [63] (2019) Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35. Cited by: §2.
 [64] (2009) A hypercubebased encoding for evolving largescale neural networks. Artificial life 15 (2), pp. 185–212. Cited by: §2.
 [65] (2002) Evolving neural networks through augmenting topologies. Evolutionary computation 10 (2), pp. 99–127. Cited by: §2.
 [66] (2019) MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2.
 [67] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §2, §2.
 [68] (2017) Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 3751–3760. Cited by: §2.
 [69] (2017) Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1379–1388. Cited by: §2, §2.
 [70] (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124. Cited by: §2.
 [71] (2003) SLURM: simple Linux utility for resource management. In Workshop on Job Scheduling Strategies for Parallel Processing, pp. 44–60. Cited by: §4.

 [72] (2018) NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203. Cited by: §1, §2.
 [73] (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §1.
 [74] (2018) Discriminationaware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886. Cited by: §1, §1, §2, §2, Table 5, §5.
 [75] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §2, §2.
 [76] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §2, §2, §4.
8 SOAP Implementation
8.1 Operator Space
In Tab. 6, we present the detailed operator space with operators and their abbreviations.
Element-wise operators: addition, subtraction, multiplication, division, absolute value, square, square root, adding ridge factor.
Matrix operators: matrix trace, matrix multiplication, matrix inversion, inner product, outer product, matrix/vector transpose.
Statistics operators: summation, product, mean, standard deviation, variance, counting measure.
Specialized operators: RBF kernel matrix getter, geometric median getter, tensor slicer.
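To illustrate how candidate pruning functions over such an operator space can be encoded and evaluated as expression trees, here is a minimal Python sketch. The `Node` class, the small operator subset, and the operand naming are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

# Illustrative subset of the operator space: (arity, callable) per operator.
OPERATORS = {
    "add": (2, np.add),          # element-wise operators
    "sub": (2, np.subtract),
    "mul": (2, np.multiply),
    "abs": (1, np.abs),
    "sqrt": (1, lambda x: np.sqrt(np.abs(x))),  # guarded square root
    "sum": (1, np.sum),          # statistics operators (tensor -> scalar)
    "mean": (1, np.mean),
    "std": (1, np.std),
}

class Node:
    """One node of a pruning-function expression tree."""
    def __init__(self, op=None, children=(), leaf=None):
        self.op, self.children, self.leaf = op, tuple(children), leaf

    def evaluate(self, operands):
        if self.leaf is not None:              # leaf: look up an operand tensor
            return operands[self.leaf]
        _, fn = OPERATORS[self.op]             # internal node: apply operator
        return fn(*(c.evaluate(operands) for c in self.children))

# Example tree scoring a filter by mean(|W|), an ell-1-norm-style metric.
tree = Node(op="mean", children=[Node(op="abs", children=[Node(leaf="W")])])
score = tree.evaluate({"W": np.array([[1.0, -2.0], [3.0, -4.0]])})
```

Mutation and crossover then amount to replacing or swapping subtrees of such `Node` structures.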
8.2 SOAP Functions
With the abbreviations of the operators in Tab. 6 and the symbols of the operands presented in Tab. 1 of the main paper, we can give the precise expressions of the functions in SOAP:

- Filter's ℓ1-norm
- Filter's ℓ2-norm
- Batch normalization's scaling factor
- Filter's geometric median
- Discriminant Information
- Maximum Mean Discrepancy
- Generalized Absolute SNR
- Generalized Student's T-Test
- Generalized Fisher Discriminant Ratio
- Generalized Symmetric Divergence
(4)
9 Experimental Details
9.1 Study on Fitness Combination Scheme
Preliminary Evolution. We conduct 10 preliminary experiments, where the variables are: and the combination scheme {weighted geometric mean, weighted arithmetic mean}. For each experiment, we evolve a population of 15 functions for 10 generations. The population is initialized with 10 individuals randomly cloned from SOAP and 5 random expression trees. The tournament size is 3, and the number of selected functions is 5. The next generation is reproduced only from the selected functions. Other settings are the same as in the main evolution experiment.
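The tournament selection step described above (tournament size 3, 5 survivors from a population of 15) can be sketched as follows; the function name and signature are hypothetical, not the paper's code.

```python
import random

def tournament_select(population, fitnesses, n_select, tournament_size, rng):
    """Pick n_select individuals; each pick is the fittest member of a
    randomly drawn tournament of the given size."""
    selected = []
    for _ in range(n_select):
        contestants = rng.sample(range(len(population)), tournament_size)
        winner = max(contestants, key=lambda i: fitnesses[i])
        selected.append(population[winner])
    return selected

# Toy run mirroring the settings above: 15 functions, tournament size 3, 5 selected.
rng = random.Random(0)
population = [f"fn_{i}" for i in range(15)]
fitnesses = [i / 15 for i in range(15)]
parents = tournament_select(population, fitnesses,
                            n_select=5, tournament_size=3, rng=rng)
```

The selected parents would then seed the reproduction step for the next generation.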
CIFAR-100 Pruning. We apply the best evolved function from each preliminary evolution test to prune a ResNet-38 [26] on CIFAR-100 [35]. The baseline ResNet-38 adopts the bottleneck block structure and reaches an accuracy of 72.3%. We use each evolved function to prune 40% of the channels in all layers uniformly, resulting in a 54.7%/52.4% FLOPs/parameter reduction. The network is then fine-tuned with the SGD optimizer for 200 epochs. We use Nesterov momentum [53] with a momentum of 0.9. The mini-batch size is set to 128, and the weight decay is set to 1e-3. The training data is transformed with a standard data augmentation scheme [26]. The learning rate is initialized at 0.1 and divided by 10 at epochs 80 and 160.
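The step learning-rate schedule used in this fine-tuning run (start at 0.1, divide by 10 at epochs 80 and 160) can be written as a small helper; the function name and defaults are ours.

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 160), gamma=0.1):
    """Piecewise-constant schedule: multiply the learning rate by gamma
    once each milestone epoch has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For example, `step_lr(0)` gives 0.1, while epochs past 80 and 160 train at roughly 0.01 and 0.001 respectively.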
9.2 Main Evolution Experiment
MNIST Pruning. On the MNIST [38] pruning task, we prune a LeNet-5 [38] with a baseline accuracy of 99.26% from a shape of 20-50-800-500 to 5-12-160-40. This pruning reduces 92.4% of FLOPs and 98.0% of parameters. The pruned network is fine-tuned for 300 epochs with a batch size of 200 and a weight decay of 7e-5. We use the Adam optimizer [34] with a constant learning rate of 5e-4.
CIFAR-10 Pruning. For CIFAR-10 [35] pruning, we adopt the VGG-16 structure from [40] with a baseline accuracy of 93.7%. We uniformly prune 40% of the channels from all layers, resulting in a 63.0% FLOPs reduction and a 63.7% parameter reduction. The fine-tuning process takes 200 epochs with a batch size of 128. We set the weight decay to 1e-3 and the dropout ratio to 0.3. We use the SGD optimizer with Nesterov momentum [53], where the momentum is set to 0.9. We augment the training samples with a standard data augmentation scheme [26]. The initial learning rate is set to 0.006 and multiplied by 0.28 at 40% and 80% of the total number of epochs.
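The uniform pruning step used in these experiments, removing the lowest-scoring fraction of channels in every layer independently, can be sketched as below; the dictionary layout and function name are illustrative assumptions.

```python
import numpy as np

def uniform_prune_masks(layer_scores, prune_ratio=0.4):
    """For each layer independently, drop the prune_ratio fraction of
    channels with the lowest importance scores. layer_scores maps a layer
    name to a 1-D array of per-channel scores; returns boolean keep-masks."""
    masks = {}
    for name, scores in layer_scores.items():
        n_prune = int(len(scores) * prune_ratio)
        order = np.argsort(scores)              # ascending: weakest channels first
        mask = np.ones(len(scores), dtype=bool)
        mask[order[:n_prune]] = False           # mark the weakest for removal
        masks[name] = mask
    return masks

# Toy layer with 5 channels: the two lowest-scoring channels are pruned.
masks = uniform_prune_masks({"conv1": np.array([0.9, 0.1, 0.5, 0.3, 0.7])})
```

The resulting masks would then be used to rebuild each layer with only the kept channels before fine-tuning.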
9.3 Transfer Pruning
We implement the pruning experiments in TensorFlow [1] and carry them out on NVIDIA Tesla P100 GPUs. CIFAR-100 contains 50,000/10,000 training/test samples in 100 classes. SVHN is a 10-class dataset, where we use 604,388 training images for network training and a test set of 26,032 images. ILSVRC-2012 contains 1.28 million training images and 50 thousand validation images in 1000 classes. We adopt the standard data augmentation scheme [26] for CIFAR-100 and ILSVRC-2012.
9.4 Channel Scoring
As many of our pruning functions require the channels' activation maps to determine their importance, we need to feed forward input images for channel scoring. Specifically, for the pruning experiments on MNIST, CIFAR-10, and CIFAR-100, we use all training images to compute the channel scores. On SVHN and ILSVRC-2012, we randomly sample 20 thousand and 10 thousand training images for channel scoring, respectively.
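A minimal sketch of label-less channel scoring from forward-pass activations, assuming NHWC activation tensors and a simple mean-absolute-activation statistic; the evolved scoring functions in the paper are more elaborate than this placeholder metric.

```python
import numpy as np

def channel_activation_scores(batches):
    """Average the mean |activation| of each channel over sampled batches.
    Each batch has shape (N, H, W, C); returns a (C,) score vector."""
    total = None
    count = 0
    for acts in batches:
        per_channel = np.abs(acts).mean(axis=(0, 1, 2))  # reduce to (C,)
        total = per_channel if total is None else total + per_channel
        count += 1
    return total / count

# Toy example: one batch of constant activations gives a score of 1 per channel.
scores = channel_activation_scores([np.ones((2, 2, 2, 3))])
```

In practice the batches would come from the sampled training images described above, and the per-channel scores would feed the pruning function.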
10 Evolution on ILSVRC-2012
Evolution. We use ResNet-18 as the target network for pruning function evolution on ILSVRC-2012. Since only one task is evaluated, we directly use the retrained accuracy of the pruned network as the function's fitness. Other evolution settings for population, selection, mutation, and crossover are kept the same as in Sec. 4 of the main paper.
Evaluation. We uniformly prune 30% of the channels in each layer of a pretrained ResNet-18, resulting in a FLOPs reduction of 36.4%. Due to the constrained computational budget, we only fine-tune the network for 4 epochs using the SGD optimizer with Nesterov momentum [53]. We use a batch size of 128 and initialize the learning rate at 0.001. The learning rate is multiplied by 0.4 at epochs 1 and 2.
Result. We show the evolution progress in Fig. 9. Due to the limited training budget, the pruned networks are clearly not well retrained, as they only achieve around 63.5% accuracy, much lower than the performance of the methods shown in Tab. 5 of the main paper at a similar pruning level. Such inadequate training results in an imprecise function fitness evaluation, as evidenced in Sec. 6 of the main paper. Moreover, the best evolved function from this strategy (Eqn. 4) performs worse than the co-evolved function when transferred to CIFAR-100 pruning. These results demonstrate the cost-effectiveness advantage of our small-dataset co-evolution strategy.
11 Extra Evolved Functions
We present additional evolved functions from our co-evolution strategy:
(5) 
(6) 
(7) 
Eqn. 5 presents a metric built on the concept of SNR for classification, but with a novel way of combining statistics. Moreover, our evolution experiments find that measuring the variance across all elements in (Eqn. 6) and (Eqn. 7) helps identify important channels empirically. These two functions are simple and effective, yet remain undiscovered in the literature.
12 Function Validity
The function expressions generated by mutation and crossover can be invalid (non-invertible matrices, dimension inconsistency, etc.) due to the random selection of operators, operands, and nodes in the expression trees. To combat this issue and enlarge our valid function space, some operators are deliberately modified from their standard definitions. For instance, whenever we need to invert a positive semi-definite scatter matrix, we automatically add a ridge factor and then invert the matrix. For dimension inconsistency in element-wise operations, we have two options to pad the operand with the smaller dimension: (1) with 0 for and , and with 1 for and ; (2) with its own value if it is a scalar. Moreover, we conduct a validity test on the mutated/crossed-over functions every time after the mutation/crossover process. Invalid expressions are discarded, and the mutation/crossover operations are repeated until we recover the population size with all-valid functions. These methods ensure that we generate valid function expressions from our vast design space during the evolution process.