Evolving Transferable Pruning Functions

10/21/2021
by   Yuchen Liu, et al.
Princeton University

Channel pruning has made major headway in the design of efficient deep learning models. Conventional approaches adopt human-made pruning functions to score channels' importance for channel pruning, which requires domain knowledge and could be sub-optimal. In this work, we propose an end-to-end framework to automatically discover strong pruning metrics. Specifically, we craft a novel design space for expressing pruning functions and leverage an evolution strategy, genetic programming, to evolve high-quality and transferable pruning functions. Unlike prior methods, our approach can not only provide compact pruned networks for efficient inference, but also novel closed-form pruning metrics that are mathematically explainable and thus generalizable to different pruning tasks. The evolution is conducted on small datasets while the learned functions are transferable to larger datasets without any manual modification. Compared to direct evolution on a large dataset, our strategy shows better cost-effectiveness. When applied to more challenging datasets, different from those used in the evolution process, e.g., ILSVRC-2012, an evolved function achieves state-of-the-art pruning results.


1 Introduction

Convolutional neural networks (CNNs) have demonstrated superior performance on various computer vision tasks [12, 59, 13]. However, CNNs require large storage space, a high computational budget, and substantial memory, which can far exceed the resource limits of edge devices such as mobile phones and embedded gadgets. As a result, many methods have been proposed to reduce their cost, such as weight quantization [7, 10, 24], tensor factorization [33, 37], weight pruning [25, 73], and channel pruning [74, 36, 29]. Among them, channel pruning is a preferred approach for learning dense compact models and has received increasing attention from the research community.

Channel pruning is usually achieved in three steps: (1) score the channels' importance with a handcrafted pruning function; (2) remove redundant channels based on the scores; (3) retrain the network. The performance of channel pruning largely depends on the pruning function used in step (1). Current scoring metrics are mostly handcrafted to extract crucial statistics from channels' feature maps [31, 72] or kernel parameters [40, 29] in a labeless [48, 28] or label-aware [74, 36] manner. However, the design space of pruning functions is so large that handcrafted metrics are likely sub-optimal, yet enumerating all functions in the space is infeasible.
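For concreteness, the following is a minimal sketch of the three-step procedure on a single convolutional layer, using a handcrafted ℓ1-norm criterion as the example scoring function; it is illustrative only, not the authors' implementation, and the retraining step is left to the usual training loop.

```python
import numpy as np

def prune_conv_layer(weights, score_fn, prune_ratio):
    """One-shot pruning of a conv layer's output channels.

    weights: array of shape (kh, kw, in_ch, out_ch) (TensorFlow layout).
    score_fn: maps the weight tensor to one importance score per output channel.
    Returns the slimmed weight tensor and the indices of the kept channels.
    """
    scores = score_fn(weights)                        # step (1): score channels
    num_keep = int(weights.shape[-1] * (1.0 - prune_ratio))
    keep = np.sort(np.argsort(scores)[-num_keep:])    # step (2): drop low-scoring channels
    return weights[..., keep], keep                   # step (3): retrain the slimmed net

# Example: a handcrafted scoring function (filter l1-norm) on a toy 3x3 conv layer.
l1_norm = lambda w: np.abs(w).sum(axis=(0, 1, 2))
slim_w, kept = prune_conv_layer(np.random.randn(3, 3, 32, 64), l1_norm, prune_ratio=0.4)
```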

To this end, we propose a novel approach to automatically learn transferable pruning functions, which advances pruning performance, as shown in Fig. 1. In particular, we design a function space and leverage an evolution strategy, genetic programming [3], to discover novel pruning functions. We carry out an end-to-end evolution process in which a population of functions is evolved by applying them to pruning tasks on small datasets. The evolved functions are closed-form and explainable, and are later transferred to pruning tasks on larger and more challenging datasets. Our learned functions are transferable and generalizable: (1) they are applicable to pruning tasks on different datasets and networks without any manual modification after evolution; (2) they demonstrate competitive pruning performance on datasets and networks different from those used in the evolution process. Such transferability and generalizability give our approach a unique advantage over prior meta-pruning methods such as MetaPruning [47] and LFPC [27], which are learned and evolved on the same task, offer no transferability, and perform worse than our method.

Figure 1: Illustration of our approach. Compared to conventional methods which mainly use handcrafted pruning functions, we aim to learn the pruning functions automatically via an evolution strategy. The evolved functions are transferable and generalizable, further enhancing the pruning performance.

More specifically, we adopt an expression tree encoding scheme to represent a pruning function. To ensure the transferability of the evolved functions, the function space of operands and operators is carefully designed. We meta-learn the pruning effectiveness of each function by applying it to pruning tasks on two different networks and datasets, LeNet on MNIST and VGGNet on CIFAR-10. For each task, we keep the retraining hyper-parameters and other pruning settings identical for every function evaluation, so that we optimize solely for the functions' effectiveness. The accuracies from both tasks are combined as the indicator of a function's effectiveness. We observe that evolving on two tasks produces better functions than evolving on only one of them. More surprisingly, we find that our scheme produces more effective pruning functions than directly evolving on a large dataset, e.g., ILSVRC-2012, under the same computational budget. We analyze the merits of an evolved function both mathematically and visually, and transfer it to three larger datasets, CIFAR-100, SVHN, and ILSVRC-2012, where it exceeds the state-of-the-art pruning results on all of them.

Our main contributions are three-fold: (i) We propose a new paradigm of channel pruning, which learns transferable pruning functions to further improve pruning efficacy. (ii) We develop a unified co-evolution framework on two small datasets with a novel transferable function design space and leverage an evolution strategy to traverse it. We show this methodology is more cost-effective than evolution on a large dataset, e.g., ILSVRC-2012. (iii) Our evolved functions show strong generalizability to datasets unseen by the evolution process, and achieve state-of-the-art pruning results. For example, with 26.9% and 53.4% FLOPs reduction from MobileNet-V2, we achieve top-1 accuracies of 71.90% and 69.16% on ILSVRC-2012, outperforming the state of the art.

2 Related Work

Hand-Crafted Channel Pruning. Channel pruning [72, 36, 74, 29] is generally realized by using a handcrafted pruning function to score channels’ saliency and remove redundant ones. Based on the scoring procedure, it can be categorized into labeless pruning and label-aware pruning.

Labeless channel pruning typically adopts a norm-based property of the channel's feature maps or associated filters as the pruning criterion [40, 48, 31, 50, 49, 28, 70, 29, 42]. For example, Liu et al. [48] and Ye et al. [70] use the absolute value of the scaling factors in the batch norm, while the ℓ1-norm and ℓ2-norm of channels' associated filters are computed in [40, 28, 42] as channels' importance. On the other hand, researchers have designed metrics that evaluate the class discrepancy of channels' feature maps for label-aware pruning [74, 36, 46]. Zhuang et al. [74] insert discriminant losses into the network and remove the channels least correlated with these losses after iterative optimization. Kung et al. [36] and Liu et al. [46] adopt closed-form discriminant functions to accelerate the scoring process. While these works use handcrafted scoring metrics, we learn transferable and generalizable pruning functions automatically.
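To make the labeless vs. label-aware distinction concrete, here is a small illustrative sketch of one criterion of each kind: the ℓ2-norm of a channel's filter, and a simple Fisher-style between-class/within-class discrepancy of a channel's responses. These are simplified stand-ins, not the exact formulations of the cited works.

```python
import numpy as np

def l2_filter_score(weights, channel):
    """Labeless criterion: l2-norm of the filter producing this output channel.
    weights has shape (kh, kw, in_ch, out_ch)."""
    return np.linalg.norm(weights[..., channel])

def class_discrepancy_score(channel_maps, labels):
    """Label-aware criterion (illustrative Fisher-style ratio): between-class
    variance over within-class variance of the channel's per-sample mean
    activation. channel_maps has shape (N, H, W); labels has shape (N,)."""
    responses = channel_maps.reshape(len(channel_maps), -1).mean(axis=1)
    overall = responses.mean()
    between = sum((responses[labels == c].mean() - overall) ** 2
                  for c in np.unique(labels))
    within = sum(responses[labels == c].var() for c in np.unique(labels)) + 1e-8
    return between / within

# Toy usage on random data.
w = np.random.randn(3, 3, 16, 32)
maps, y = np.random.randn(256, 8, 8), np.random.randint(0, 10, size=256)
print(l2_filter_score(w, 0), class_discrepancy_score(maps, y))
```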

Meta-Learning. Our work falls into the category of meta-learning, where researchers have attempted to optimize machine learning components, including hyper-parameters [5, 62, 17], optimizers [8, 68, 4], and neural network structures [75, 76, 67, 44, 69, 58, 45, 57].

Prior works on neural architecture search (NAS) have leveraged reinforcement learning (RL) to discover high-performing network structures [75, 2, 76, 6, 66, 67]. Recently, NAS algorithms have also been adopted to find efficient network structures [67, 66]. Another line of work adopts evolution strategies (ES) to explore the space of network structures [16, 69, 58, 45, 57, 11, 52, 63], demonstrating performance competitive with RL methods. This notion was pioneered by neuro-evolution [65, 18, 64], which evolves the topology of small neural networks. In the era of deep learning, Real et al. [57] leverage ES to find networks that improve over those found by RL [76]. Dai et al. [11] apply ES to design efficient and deployable networks for mobile platforms.

Compared to prior works, we propose a new paradigm to leverage meta-learning techniques for efficient network design. More specifically, we learn transferable pruning functions that outperform current handcrafted metrics to improve channel pruning. These evolved functions can also be applied to prune redundant channels in the NAS-learned structures to further enhance their efficiency.

Meta-Pruning. Prior works [32, 30, 47, 9, 27] have also adopted the notion of learning to prune a CNN. We note that the evolution strategy is used in LeGR [9] and MetaPruning [47] to search for a pair of pruning parameters and for network encoding vectors, respectively. However, our evolutionary learning is drastically different from theirs in terms of search space and search candidates: we search for effective combinations of operands and operators to build transferable and powerful pruning functions. He et al. propose LFPC [27] to learn network pruning criteria (functions) across layers by training a differentiable criteria sampler. However, rather than learning new pruning functions, their goal is to search within a pool of existing pruning criteria and find candidates that suit a particular layer's pruning. In contrast, our evolution recombines operands and operators to produce novel pruning criteria that are generally good for all layers.

We also notice that MetaPruning [47], LFPC [27], and other methods [32, 30, 9] are all learned on one task (dataset and network) and applied only to that same task, with no transferability. In contrast, we need only one evolutionary learning process, which outputs evolved functions that are transferable across multiple tasks and demonstrate competitive performance on all of them.

3 Methodology

Figure 2: Illustration of our approach to evolve channel pruning functions. A population of functions is applied to conduct pruning tasks on two datasets, MNIST and CIFAR-10. Each function receives a fitness value by combining its pruned networks’ accuracies. The population will then go through a natural selection process to improve the functions’ effectiveness.

In Fig. 2, we present our evolution framework, which leverages genetic programming [3] to learn high-quality channel pruning functions. We first describe the design space to encode channel scoring functions. Next, we discuss the pruning tasks to evaluate the functions’ effectiveness. Lastly, genetic operators are defined to traverse the function space for competitive solutions.

3.1 Function Design Space

Filter-based operands: whole layer's filter, channel's incoming filter, channel's batch-normed parameter
Map-based operands: feature maps collection, two label-wise partitions of the feature maps collection
Table 1: Operand Space

Elementwise operators: addition, subtraction, multiplication, division, absolute value, square, square root, adding ridge factor
Matrix operators: matrix trace, matrix multiplication, matrix inversion, inner product, outer product, matrix/vector transpose
Statistics operators: summation, product, mean, standard deviation, variance, counting measure
Specialized operators: rbf kernel matrix getter, geometric median getter, tensor slicer
Table 2: Operator Space

Expression Tree. In channel pruning, a pruning function scores the channels to determine their importance/redundancy based on the feature maps, filters, and statistics associated with the channels. This scoring process can be viewed as a series of operations with operators (addition, matrix multiplication, etc.) and operands (feature maps, filters, etc.). We thus adopt an expression tree encoding scheme to represent a pruning function, where inner nodes are operators and leaves are operands.

As shown in Tab. 1 and 2, our function design space includes two types of operands (6 operands in total) and four types of operators (23 operators in total), via which a vast number of pruning functions can be expressed. The statistics operators can compute the statistics of an operand along two dimensions, namely the global dimension (subscript 'g') and the sample dimension (subscript 's'). Global-dimension operators flatten the operand into a 1D sequence and extract the corresponding statistic, while sample-dimension operators compute the statistic along the sample axis. For example, a global summation returns the sum of all entries of a kernel tensor, while a sample-dimension mean returns the average feature map over all samples. We also include specialized operators that allow us to build complicated but competitive metrics such as maximum mean discrepancy (MMD) [23] and the filter's geometric median [29].
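As an illustration of the encoding (a minimal sketch, not the authors' DEAP implementation; the node and operand names are invented for the example), a pruning function such as the filter's ℓ1-norm can be represented as a small expression tree whose leaves are operands and whose inner nodes are operators:

```python
import numpy as np

class Node:
    """Expression-tree node: an inner node holds an operator, a leaf holds an operand name."""
    def __init__(self, name, fn=None, children=()):
        self.name, self.fn, self.children = name, fn, list(children)

    def evaluate(self, operands):
        if not self.children:                      # leaf: fetch the named operand
            return operands[self.name]
        args = [child.evaluate(operands) for child in self.children]
        return self.fn(*args)                      # inner node: apply the operator

# Filter l1-norm encoded as sum_g(abs(channel_filter)): elementwise absolute
# value followed by a global summation over all entries of the kernel tensor.
l1_tree = Node("sum_g", np.sum, [Node("abs", np.abs, [Node("channel_filter")])])
score = l1_tree.evaluate({"channel_filter": np.random.randn(3, 3, 16)})
```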

Function Encoding. Channel scoring functions can be categorized into two types: labeless metrics and label-aware metrics. For labeless functions such as the filter's ℓ1-norm, we adopt a direct encoding scheme with the expression tree shown in Fig. 3.

For label-aware metrics such as the one in [36] and MMD [23], which measure the class discrepancy of the feature maps, we observe a common computation graph, as shown in Fig. 3: (1) partition the feature maps in a label-wise manner; (2) apply the same operations to each label partition and to all feature maps; (3) average/sum the scores of all partitions to obtain a single scalar. These metrics can be naively encoded as multi-branch trees with one branch per class label. However, directly using this naive encoding scheme results in data-dependent, non-transferable metrics because: (1) the number of class labels varies from dataset to dataset (e.g., a metric for CIFAR-10 is not transferable to CIFAR-100); (2) mutating the subtrees differently could make the metric overfit to a specific label numbering scheme (e.g., for a metric with different subtrees on class 1 and class 2, renumbering the labels would make the metric compute something different, which is undesirable).

To combat the above issues, we express a label-aware function by a uni-tree that encodes the common operations applied to each label partition, as explained in Fig. 3. Instead of directly encoding the operands of a specific label partition (e.g., the feature maps whose labels equal 1 and those whose labels do not equal 1), we use a symbolic representation of the partition and its complement to generically encode the partition concept. In the actual scoring process, the uni-tree is compiled back into a multi-branch computation graph with one branch per class, with the symbolic operands converted to the specific map partitions. Such uni-tree encoding allows us to evolve label-aware metrics independent of the number of classes and the label numbering scheme, which ensures their transferability to datasets unseen by the evolution process.
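A minimal sketch of this compilation step (illustrative only; the per-partition scoring operation here is an arbitrary stand-in for the operations encoded by the uni-tree): the uni-tree is evaluated once per class on that class's partition and its complement, and the per-class scores are averaged.

```python
import numpy as np

def compile_and_score(uni_tree_score, feature_maps, labels):
    """Expand a label-aware uni-tree over all class partitions and average the scores.

    uni_tree_score(partition, complement, all_maps) -> scalar stands for the
    common operations encoded by the uni-tree.
    feature_maps: (N, H, W) responses of one channel; labels: (N,) class ids.
    """
    scores = []
    for c in np.unique(labels):
        part = feature_maps[labels == c]     # maps belonging to class c
        comp = feature_maps[labels != c]     # maps of all other classes
        scores.append(uni_tree_score(part, comp, feature_maps))
    return float(np.mean(scores))            # average over all label partitions

# Toy example: the per-partition score is the squared gap between partition means.
toy_score = lambda p, q, a: (p.mean() - q.mean()) ** 2
s = compile_and_score(toy_score, np.random.randn(128, 8, 8), np.random.randint(0, 10, 128))
```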

Under this scheme, we can implement a broad range of competitive pruning functions: the filter's ℓ1-norm [40], the filter's ℓ2-norm [28], the batch norm's scaling factor [48], the filter's geometric median [29], Discriminant Information [36], MMD [23], Absolute SNR [22], Student's T-Test [39], Fisher Discriminant Ratio [56], and Symmetric Divergence [51]. For the last four metrics, we adopt the scheme in [46] for channel scoring. We name this group of functions the state-of-the-art population (SOAP), which helps our evolution in many aspects. For instance, in Sec. 6, we find that initializing the population with SOAP evolves better pruning functions than random initialization. The detailed implementation of SOAP is included in the Supplementary Material.

Figure 3: Illustration of the pruning function encoding. Left: for labeless scoring metrics such as the filter's ℓ1-norm, we adopt a direct tree encoding scheme. Right: for label-aware scoring metrics, we encode the per-class computation graph, which has one subtree per class label, by a uni-tree. The uni-tree encodes the common operations (op) on each label partition and on all feature maps. This scheme allows transferable function evolution.

3.2 Function Effectiveness Evaluation

The encoded functions are then applied to empirical pruning tasks to evaluate their effectiveness. To avoid overfitting to certain data patterns and to increase the generality of the evolved functions, we co-evolve the population of functions on two different pruning tasks, LeNet-5 [38] on MNIST [38] and VGG-16 [61] on CIFAR-10 [35]. In both pruning tasks, we adopt a one-shot pruning scheme and report the retrained accuracies on the validation sets. For each pruning task, we keep the pruning settings (layers' pruning ratios, target pruning layers, etc.) and the retraining hyper-parameters (learning rate, optimizer, weight decay factor, etc.) the same for all evaluations throughout the evolution process. This guarantees a fair effectiveness comparison over different functions in all generations and ensures we are evolving better functions rather than better hyper-parameters. In this way, we can meta-learn powerful functions that perform well on both MNIST and CIFAR-10 and generalize to other datasets. Not surprisingly, co-evolution on both tasks produces stronger pruning functions than evolving on only one of them, as shown in Sec. 3.3. Moreover, in Sec. 6, we find our strategy enjoys better cost-effectiveness than direct evolution on a large dataset, e.g., ILSVRC-2012.

3.3 Function Fitness

After evaluation, each encoded function receives two accuracies, $a_1$ and $a_2$, from the two pruning tasks. We investigate two accuracy combination schemes, the weighted arithmetic mean (Eqn. 1) and the weighted geometric mean (Eqn. 2), to obtain the joint fitness of a function. A free weight parameter $w \in [0, 1]$ is introduced to control the relative weights of the two tasks.

$\mathrm{Fit_{am}} = w \, a_1 + (1 - w) \, a_2$    (1)
$\mathrm{Fit_{gm}} = a_1^{\,w} \cdot a_2^{\,1-w}$    (2)
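A one-line sketch of the two combination schemes above in code (symbol names follow the reconstructed equations; the weight used in the main experiments is chosen by the ablation below):

```python
def weighted_arithmetic_mean(a1, a2, w):
    """Eqn. (1): joint fitness as a weighted arithmetic mean of two task accuracies."""
    return w * a1 + (1.0 - w) * a2

def weighted_geometric_mean(a1, a2, w):
    """Eqn. (2): joint fitness as a weighted geometric mean of two task accuracies."""
    return (a1 ** w) * (a2 ** (1.0 - w))

# e.g., combining a 99.2% MNIST accuracy with a 93.7% CIFAR-10 accuracy at w = 0.5
print(weighted_geometric_mean(0.992, 0.937, 0.5))
```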

Ablation Study. To decide the fitness combination scheme for the main experiments, we conduct 10 small preliminary evolution tests over a grid of weight values with both combination schemes. Note that when the weight places all mass on a single task, the co-evolution degenerates to single-dataset evolution. We empirically evaluate the generalizability of the best evolved function from each test by applying it to prune a ResNet-38 on CIFAR-100. Note that CIFAR-100 is not used in the evolution process, so performance on it speaks well for the evolved functions' generalizability. In Fig. 5, we find that evolving solely on MNIST is the least effective option for CIFAR-100 transfer pruning. In addition, we find that functions evolved on both datasets generally perform better than those evolved on a single dataset. We observe that an intermediate weight with the weighted geometric mean leads to the best result, which is adopted in the main experiments.

3.4 Genetic Operations

Selection. After evaluation, the population will undergo a selection process, where we adopt tournament selection [21] to choose a subset of competitive functions.

Reproduction. This subset of functions is then used to reproduce individuals for the next generation. However, we observe a shrinkage of the population's genetic diversity when all offspring are reproduced from the selected parents, as those parents represent only a small pool of genomes. Such diversity shrinkage can result in premature convergence of the evolution process. To combat this issue, we reserve a slot in the reproduced population and fill it with individuals randomly cloned from SOAP or built as random trees. We find this adjustment empirically useful for helping the evolution proceed longer.

Mutation and Crossover. We finally conduct mutation and crossover on the reproduced population to traverse the function design space for new expressions. We adopt the conventional schemes of random tree mutation and one-point crossover [3], which are illustrated in Fig. 4 with toy examples. After mutation and crossover, the population goes through the next evolution iteration.
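Since the genetic operations are implemented with DEAP [19], here is a minimal sketch of how random-tree mutation and one-point crossover look in that library; the toy primitive set below is a stand-in for the paper's full operator/operand space, not its actual definition.

```python
import operator
from deap import gp

# A toy primitive set standing in for the operator/operand space of Tab. 1 and 2.
pset = gp.PrimitiveSet("PRUNE", 2)          # two dummy operand arguments
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(operator.neg, 1)
pset.renameArguments(ARG0="filters", ARG1="feature_maps")

parent1 = gp.PrimitiveTree(gp.genFull(pset, min_=1, max_=3))
parent2 = gp.PrimitiveTree(gp.genFull(pset, min_=1, max_=3))

# One-point crossover: pick one node in each tree and exchange the subtrees.
child1, child2 = gp.cxOnePoint(gp.PrimitiveTree(parent1), gp.PrimitiveTree(parent2))

# Random-tree mutation: replace a randomly chosen node with a random subtree.
new_subtree = lambda pset, type_=None: gp.genFull(pset, min_=1, max_=2, type_=type_)
mutant, = gp.mutUniform(gp.PrimitiveTree(parent1), expr=new_subtree, pset=pset)
print(str(child1), str(mutant))
```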

Figure 4: Mutation and crossover operations. Left: Mutation operation: a node in the tree is randomly selected and replaced by a randomly constructed tree. Right: Crossover operation: one node is randomly selected for each tree, and the subtrees at the nodes are exchanged.

4 Co-Evolution on MNIST and CIFAR-10

Figure 5: Preliminary evolution tests on the choice of fitness combination scheme. The best evolved function from each scheme is applied to conduct a pruning test on CIFAR-100 with ResNet-38, and their accuracies are plotted.
Figure 6: Progress of the evolution experiment. Each dot represents an individual function evaluation. The red curve shows the functions with the best fitness over generations, while the green curve shows the individuals at the 25th-percentile fitness. The effectiveness of the best function and the population's overall quality both increase monotonically.
(3)

Experiment Settings. We conduct the experiment with a population size of 40 over 25 generations. The population is initialized with 20 individuals randomly cloned from SOAP and 20 random expression trees. The tournament size is 4, and we select 10 functions in each generation. 24 individuals are reproduced from the selected functions, while 6 individuals are cloned from SOAP or randomly built. The mutation and crossover probabilities are both set to 0.75. We prune 92.4% of the FLOPs from a LeNet-5 (baseline accuracy: 99.26%) and 63.0% of the FLOPs from a VGG-16 (baseline accuracy: 93.7%). Such aggressive pruning helps us better identify the functions' effectiveness. We use the weighted geometric mean in Eqn. 2 to combine the two validation accuracies. Our code is implemented with DEAP [19] and TensorFlow [1] for the genetic operations and the neural network pruning, respectively. The experiments are carried out on a cluster with the SLURM job scheduler [71] for workload parallelization.

Experiment Result. Our co-evolution progress is typified in Fig. 6, where the red curve denotes the functions with the maximum fitness and the green curve plots the ones at the top 25th-percentile fitness. Both curves increase monotonically over generations, indicating that the quality of both the best function and the entire population improves over time, which demonstrates the effectiveness of our scheme. Specifically, the best pruned LeNet-5/VGG-16 in the first generation have accuracies of 99.15%/93.55%, while the best accuracies in the last generation are 99.25%/94.0%. As the first generation is initialized with SOAP functions, these results suggest that the algorithm derives metrics that outperform the handcrafted functions. The whole function evolution takes 210 GPU-days, an order of magnitude less than prior network search algorithms, e.g., [76] (2000 GPU-days) and [58] (3000 GPU-days). Our approach is computationally efficient, which can be crucial when limited computational resources are available.

Evolved Function. We present a winning function in Eqn. 3, which involves the sample average of the feature maps and an all-ones vector. The first two terms of the function award a high score to channels with class-diverged feature maps, i.e., where one of the two partition statistics is significantly smaller than the other. Channels with such feature maps contain rich class information, as they generate distinguishable responses to different classes. The third term's denominator computes the sum of the feature-map variances, while its numerator draws statistics from the average feature maps and from a distance between partition statistics, which resembles the concept of SNR. Two points are worth mentioning for this function: (1) it identifies important statistical concepts from human-designed metrics, learning from Symmetric Divergence [51] to measure the divergence of class feature maps; (2) it contains unique mathematical structure that is empirically good for measuring channel importance, shown in the novel combination of feature-map statistics in the third term's numerator. Our visual result in Sec. 6 also suggests the evolved function preserves better features, indicating stronger pruning effectiveness.

5 Transfer Pruning

Network | Method | Baseline Acc (%) | Pruned Acc (%) | Acc Drop (%) | FLOPs (Pruned %) | Params (Pruned %)
ResNet-164 | SLIM [48] | 98.22 | 98.15 | 0.07 | 172M (31.1) | 1.46M (14.5)
ResNet-164 | Ours-A | 98.22 | 98.25 | -0.03 | 108M (57.4) | 0.73M (57.8)
ResNet-164 | Ours-B | 98.22 | 98.26 | -0.04 | 92M (63.2) | 0.64M (63.0)
Table 3: SVHN Transfer Pruning Results

Network | Method | Baseline Acc (%) | Pruned Acc (%) | Acc Drop (%) | FLOPs (Pruned %) | Params (Pruned %)
VGG-19 | SLIM [48] | 73.26 | 73.48 | -0.22 | 256M (37.1) | 5.0M (75.1)
VGG-19 | G-SD [46] | 73.40 | 73.67 | -0.27 | 161M (59.5) | 3.2M (84.0)
VGG-19 | Ours | 73.40 | 74.02 | -0.62 | 155M (61.0) | 2.9M (85.5)
ResNet-56 | SFP [28] | 71.33 | 68.37 | 2.96 | 76M (39.3) | -
ResNet-56 | FPGM [29] | 71.40 | 68.79 | 2.61 | 59M (52.6) | -
ResNet-56 | LFPC [27] | 71.33 | 70.83 | 0.58 | 61M (51.6) | -
ResNet-56 | LeGR [9] | 72.41 | 71.04 | 1.37 | 61M (51.4) | -
ResNet-56 | Ours | 72.05 | 71.70 | 0.35 | 55M (56.2) | 0.38M (54.9)
ResNet-110 | LCCL [14] | 72.79 | 70.78 | 2.01 | 173M (31.3) | 1.75M (0.0)
ResNet-110 | SFP [28] | 74.14 | 71.28 | 2.86 | 121M (52.3) | -
ResNet-110 | FPGM [29] | 74.14 | 72.55 | 1.59 | 121M (52.3) | -
ResNet-110 | TAS [15] | 75.06 | 73.16 | 1.90 | 120M (52.6) | -
ResNet-110 | Ours | 74.40 | 73.85 | 0.55 | 111M (56.2) | 0.77M (55.8)
ResNet-164 | LCCL [14] | 75.67 | 75.26 | 0.41 | 195M (21.3) | 1.73M (0.0)
ResNet-164 | SLIM [48] | 76.63 | 76.09 | 0.54 | 124M (50.6) | 1.21M (29.7)
ResNet-164 | DI [36] | 76.63 | 76.11 | 0.52 | 105M (58.0) | 0.95M (45.1)
ResNet-164 | Ours | 77.15 | 77.77 | -0.62 | 92M (63.2) | 0.66M (61.8)
Table 4: CIFAR-100 Transfer Pruning Results

To show the generalizability of our evolved pruning function, we apply the function in Eqn. 3 to more challenging datasets that are not used in the co-evolution process: CIFAR-100 [35], SVHN [54], and ILSVRC-2012 [12]. We report our pruned models at different FLOPs levels by adding a letter suffix (e.g., Ours-A). Our method is compared with metrics from SOAP, e.g., L1 [40], FPGM [29], G-SD [46], and DI [36], and our evolved function outperforms these handcrafted metrics. We also include other "learn to prune" methods such as Meta [47] and LFPC [27], and other state-of-the-art methods such as DSA [55] and CC [41], for comparison. We summarize the performance (including baseline accuracies) in Tab. 3, 4, and 5, where our evolved function achieves state-of-the-art results on all datasets.

As in the evolution, we adopt a one-shot pruning scheme for transfer pruning and use the SGD optimizer with Nesterov momentum [53] for retraining. The weight decay factor and the momentum are set to 0.0001 and 0.9, respectively. On SVHN/CIFAR-100, we use a batch size of 32/128 to fine-tune the network for 20/200 epochs. The learning rate is initialized at 0.05 and multiplied by 0.14 at 40% and 80% of the total number of epochs. On ILSVRC-2012, we use a batch size of 128 to fine-tune VGG-16/ResNet-18/MobileNet-V2 for 30/100/100 epochs. For VGG-16/ResNet-18, the learning rate starts at 0.0006 and is multiplied by 0.4 at 40% and 80% of the total number of epochs. We use a cosine-decay learning rate schedule for MobileNet-V2 [60] with an initial rate of 0.03.
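A sketch of how such a retraining setup could be written in TensorFlow/Keras; the rates, decay points, and epoch counts follow the text above, while the steps-per-epoch values and the weight-decay mechanism are placeholders that depend on the actual data pipeline and TensorFlow version.

```python
import tensorflow as tf

# CIFAR-100 schedule from the text: lr starts at 0.05 and is multiplied by 0.14
# at 40% and 80% of the 200 fine-tuning epochs (steps_per_epoch is a placeholder).
steps_per_epoch, epochs = 391, 200     # e.g., 50,000 images / batch size 128
boundaries = [int(0.4 * epochs * steps_per_epoch), int(0.8 * epochs * steps_per_epoch)]
piecewise_lr = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, [0.05, 0.05 * 0.14, 0.05 * 0.14 * 0.14])

# MobileNet-V2 on ILSVRC-2012 instead uses a cosine-decay schedule starting at 0.03.
cosine_lr = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.03, decay_steps=100 * 10009)  # 100 epochs, placeholder steps

# SGD with Nesterov momentum 0.9; the 1e-4 weight decay would be applied
# separately (e.g., via kernel regularizers), depending on the TensorFlow version.
optimizer = tf.keras.optimizers.SGD(learning_rate=piecewise_lr, momentum=0.9, nesterov=True)
```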

SVHN. We first evaluate on SVHN with ResNet-164. Ours-A outperforms SLIM [48] by 0.1% in accuracy with significant hardware resource savings: 26.3% more FLOPs saving and 43.3% more parameter saving. Moreover, Ours-B achieves an even better accuracy with greater FLOPs and parameter savings than Ours-A, which well demonstrates the pruning effectiveness of the evolved function.

CIFAR-100. On VGG-19, our pruned model achieves an accuracy gain of 0.35% with respect to G-SD [46]. Compared to LFPC [27] and LeGR [9], our pruned ResNet-56 achieves accuracy gains of 0.87% and 0.66%, respectively, while having 5% fewer FLOPs. On ResNet-110, our method outperforms FPGM [29] and TAS [15] by 1.30% and 0.69% in accuracy with 4% fewer FLOPs. In comparison with LCCL [14], SLIM [48], and DI [36], our pruned ResNet-164 achieves an accuracy of 77.77% with a 63.2% FLOPs reduction, which advances all prior methods.

ILSVRC-2012. On VGG-16, Ours-A improves over the baseline by nearly 1.1% in top-1 accuracy with 2.4× acceleration. The 3.3×-accelerated Ours-B achieves top-1/top-5 accuracies of 71.64%/90.60%, advancing the state of the art. On ResNet-18, Ours-A reduces 16.8% of the FLOPs without top-1 accuracy loss. Compared to LCCL [14], Ours-B achieves a 2.72% top-1 accuracy gain with a higher FLOPs reduction ratio. Ours-C demonstrates top-1 accuracy gains of 1.75% and 1.50% with respect to SFP [28] and DCP [74]. We finally show our performance on a much more compact network, MobileNet-V2, which is specifically designed for mobile deployment. When 26.9% of the FLOPs are pruned, Ours-A outperforms AMC [30], Meta [47], and LeGR [9] with a top-1 accuracy of 71.90%. At a higher pruning ratio, Ours-B advances DCP [74] and Meta [47] by 4.94% and 0.96% in top-1 accuracy, with a 53.4% FLOPs reduction.

Network | Method | Top-1 Baseline (%) | Top-1 Pruned (%) | Top-1 Drop (%) | Top-5 Baseline (%) | Top-5 Pruned (%) | Top-5 Drop (%) | FLOPs (B) (Pruned %) | Params (M) (Pruned %)
VGG-16 | L1 [40] | - | - | - | 89.90 | 89.10 | 0.80 | 7.74 (50.0) | -
VGG-16 | CP [31] | - | - | - | 89.90 | 89.90 | 0.00 | 7.74 (50.0) | -
VGG-16 | G-SD [46] | 71.30 | 71.88 | -0.58 | 90.10 | 90.66 | -0.56 | 6.62 (57.2) | 133.6 (3.4)
VGG-16 | Ours-A | 71.30 | 72.37 | -1.07 | 90.10 | 91.05 | -0.95 | 6.34 (59.0) | 133.5 (3.5)
VGG-16 | RNP [43] | - | - | - | 89.90 | 86.67 | 3.23 | 5.16 (66.7) | 138.3 (0.0)
VGG-16 | SLIM [48] | - | - | - | 89.90 | 88.53 | 1.37 | 5.16 (66.7) | -
VGG-16 | FBS [20] | - | - | - | 89.90 | 89.86 | 0.04 | 5.16 (66.7) | 138.3 (0.0)
VGG-16 | Ours-B | 71.30 | 71.64 | -0.34 | 90.10 | 90.60 | -0.50 | 5.12 (66.9) | 131.6 (4.8)
ResNet-18 | Ours-A | 70.05 | 70.08 | -0.03 | 89.40 | 89.24 | 0.16 | 1.50 (16.8) | 11.2 (3.9)
ResNet-18 | SLIM [48] | 68.98 | 67.21 | 1.77 | 88.68 | 87.39 | 1.29 | 1.31 (28.0) | -
ResNet-18 | LCCL [14] | 69.98 | 66.33 | 3.65 | 89.24 | 86.94 | 2.30 | 1.18 (34.6) | 11.7 (0.0)
ResNet-18 | Ours-B | 70.05 | 69.09 | 0.96 | 89.40 | 88.59 | 0.81 | 1.14 (36.7) | 9.3 (20.1)
ResNet-18 | SFP [28] | 70.28 | 67.10 | 3.18 | 89.63 | 87.78 | 1.85 | 1.06 (41.8) | -
ResNet-18 | DCP [74] | 69.64 | 67.35 | 2.29 | 88.98 | 87.60 | 1.38 | 0.98 (46.0) | 6.2 (47.0)
ResNet-18 | FPGM [29] | 70.28 | 68.41 | 1.87 | 89.63 | 88.48 | 1.15 | 1.06 (41.8) | -
ResNet-18 | DSA [55] | 69.72 | 68.61 | 1.11 | 89.07 | 88.35 | 0.72 | 1.09 (40.0) | -
ResNet-18 | Ours-C | 70.05 | 68.85 | 1.20 | 89.40 | 88.45 | 0.95 | 1.07 (41.0) | 8.8 (24.5)
MobileNet-V2 | Uniform [60] | 71.80 | 69.80 | 2.00 | - | - | - | 0.22 (26.9) | -
MobileNet-V2 | AMC [30] | 71.80 | 70.80 | 1.00 | - | - | - | 0.22 (26.9) | -
MobileNet-V2 | CC [41] | 71.88 | 70.91 | 0.89 | - | - | - | 0.22 (28.3) | -
MobileNet-V2 | Meta [47] | 72.70 | 71.20 | 1.50 | - | - | - | 0.22 (27.9) | -
MobileNet-V2 | LeGR [9] | 71.80 | 71.40 | 0.40 | - | - | - | 0.22 (26.9) | -
MobileNet-V2 | Ours-A | 72.18 | 71.90 | 0.28 | 90.49 | 90.38 | 0.11 | 0.22 (26.9) | 2.8 (20.4)
MobileNet-V2 | DCP [74] | 70.11 | 64.22 | 5.89 | - | - | 3.77 | 0.17 (44.7) | 2.6 (25.9)
MobileNet-V2 | Meta [47] | 72.70 | 68.20 | 4.50 | - | - | - | 0.14 (53.4) | -
MobileNet-V2 | Ours-B | 72.18 | 69.16 | 3.02 | 90.49 | 88.66 | 1.83 | 0.14 (53.4) | 2.1 (39.3)
Table 5: ILSVRC-2012 Transfer Pruning Results

6 Ablation Study

Figure 7: Comparing evolution with a randomly initialized population (dashed line) against the evolution in Sec. 4 (solid line). Thanks to the expressiveness of our function space, the evolution with randomly initialized functions also achieves good pruning fitness. However, it converges very early, around the 8th generation, and stalls at a plateau for a long period. Moreover, its final fitness has a clear gap with respect to the one in Sec. 4.

Random Initial Population. In Fig. 7, we conduct a control experiment that initializes all individuals as random expression trees to study the effectiveness of SOAP initialization. We also turn off the SOAP function insertion in the reproduction process for the control experiment. All other parameters (number of generations, population size, etc.) are kept the same as in Sec. 4 for a fair comparison. We find that evolving with a random population also achieves good pruning fitness, which indicates that our design space is highly expressive. However, we observe early convergence and a final performance gap in the control experiment compared to the main experiment in Sec. 4, demonstrating the advantage of using SOAP for evolution.

Evolution on ILSVRC-2012. In contrast to our co-evolution strategy on MNIST and CIFAR-10, we conduct a function evolution on ILSVRC-2012 as a control experiment. We restrict the total computation budget to be the same as in Sec. 4, i.e., 210 GPU-days, and evolve on ResNet-18 with a population size of 40 over 25 generations. Due to the constrained budget, each pruned net is only retrained for 4 epochs. We include the detailed evolution settings and results in the Supplementary Material. Two major drawbacks are found with this evolution strategy. (1) Imprecise evaluation. Due to the lack of training epochs, a function's actual effectiveness is not precisely revealed. We take two functions with fitness 63.24 and 63.46 from the last generation and use them again to prune ResNet-18, but fully retrain for 100 epochs. We find that the one with the lower fitness in evolution achieves an accuracy of 68.27% in the full training, while the higher one only reaches 68.02%. This result indicates that the evaluation in this evolution procedure can be inaccurate, whereas our strategy ensures a full retraining for precise effectiveness assessment. (2) Inferior performance. The best function evolved with this method (Eqn. 4 in the Supplementary Material) performs worse than the function in Eqn. 3 when transferred to a different dataset. In particular, when applied to pruning 56% of the FLOPs from ResNet-110 on CIFAR-100, it only achieves an accuracy of 72.51%, while the co-evolved function reaches 73.85%. These two issues suggest that co-evolution on two small datasets has better cost-effectiveness than using a large-scale dataset like ILSVRC-2012.

Visualization on Feature Selection.

We further visually examine the pruning decisions made by our evolved function (right) vs. DI [36] (middle) on MNIST features in Fig. 8. The red pixels indicate features judged important by the metrics, while the blue ones are redundant. Taking the average feature-value map (left) as a reference, we find that our evolved function tends to select features with higher means, where the MNIST pattern is more robust.

Figure 8: Feature importance evaluated by DI [36] (middle) and our evolved function (right) for MNIST, where the evolved function tends to preserve features with higher means and a more robust pattern, in reference to the average feature-value map (left).

7 Conclusion

In this work, we propose a new paradigm for channel pruning, which first learns novel channel pruning functions from small datasets, and then transfers them to larger and more challenging datasets. We develop an efficient genetic programming framework to automatically search for competitive pruning functions over our vast function design space. We present and analyze a closed-form evolved function which can offer strong pruning performance and further streamline the design of our pruning strategy. Without any manual modification, the learned pruning function exhibits remarkable generalizability to datasets different from those in the evolution process. More specifically, on SVHN, CIFAR-100, and ILSVRC-2012, we achieve state-of-the-art pruning results.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §4, §9.3.
  • [2] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.
  • [3] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone (1998) Genetic programming. Springer. Cited by: §1, §3.4, §3.
  • [4] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le (2017) Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 459–468. Cited by: §2.
  • [5] J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. Cited by: §2.
  • [6] H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.
  • [7] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294. Cited by: §1.
  • [8] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. De Freitas (2017) Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 748–756. Cited by: §2.
  • [9] T. Chin, R. Ding, C. Zhang, and D. Marculescu (2020) Towards efficient model compression via learned global ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1518–1528. Cited by: §2, §2, Table 4, Table 5, §5, §5.
  • [10] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830. Cited by: §1.
  • [11] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia, et al. (2019) Chamnet: towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11398–11407. Cited by: §2.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §5.
  • [13] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: §1.
  • [14] X. Dong, J. Huang, Y. Yang, and S. Yan (2017) More is less: a more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848. Cited by: Table 4, Table 5, §5, §5.
  • [15] X. Dong and Y. Yang (2019) Network pruning via transformable architecture search. In Advances in Neural Information Processing Systems, pp. 759–770. Cited by: Table 4, §5.
  • [16] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra (2016) Convolution by evolution: differentiable pattern producing networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 109–116. Cited by: §2.
  • [17] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In Advances in neural information processing systems, pp. 2962–2970. Cited by: §2.
  • [18] D. Floreano, P. Dürr, and C. Mattiussi (2008) Neuroevolution: from architectures to learning. Evolutionary Intelligence 1 (1), pp. 47–62. Cited by: §2.
  • [19] F. Fortin, F. De Rainville, M. Gardner, M. Parizeau, and C. Gagné (2012-07) DEAP: evolutionary algorithms made easy. Journal of Machine Learning Research 13, pp. 2171–2175. Cited by: §4.
  • [20] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C. Xu (2018) Dynamic channel pruning: feature boosting and suppression. arXiv preprint arXiv:1810.05331. Cited by: Table 5.
  • [21] D. E. Goldberg and K. Deb (1991) A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms, Vol. 1, pp. 69–93. Cited by: §3.4.
  • [22] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science 286 (5439), pp. 531–537. Cited by: §3.1.
  • [23] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §3.1, §3.1, §3.1.
  • [24] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1.
  • [25] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, Cited by: §1.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §9.1, §9.2, §9.3.
  • [27] Y. He, Y. Ding, P. Liu, L. Zhu, H. Zhang, and Y. Yang (2020) Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2009–2018. Cited by: §1, §2, §2, Table 4, §5, §5.
  • [28] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: §1, §2, §3.1, Table 4, Table 5, §5.
  • [29] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §1, §1, §2, §2, §3.1, §3.1, Table 4, Table 5, §5, §5.
  • [30] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §2, §2, Table 5, §5.
  • [31] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §1, §2, Table 5.
  • [32] Q. Huang, K. Zhou, S. You, and U. Neumann (2018) Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 709–718. Cited by: §2, §2.
  • [33] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §1.
  • [34] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §9.2.
  • [35] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §3.2, §5, §9.1, §9.2.
  • [36] S.Y. Kung, Z. Hou, and Y. Liu (2019) Methodical design and trimming of deep learning networks: enhancing external bp learning with internal omnipresent-supervision training paradigm. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8058–8062. Cited by: §1, §1, §2, §2, §3.1, §3.1, Table 4, §5, §5, Figure 8, §6.
  • [37] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §1.
  • [38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11). Cited by: §3.2, §9.2.
  • [39] E. L. Lehmann and J. P. Romano (2006) Testing statistical hypotheses. Springer Science & Business Media. Cited by: §3.1.
  • [40] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1, §2, §3.1, Table 5, §5, §9.2.
  • [41] Y. Li, S. Lin, J. Liu, Q. Ye, M. Wang, F. Chao, F. Yang, J. Ma, Q. Tian, and R. Ji (2021) Towards compact cnns via collaborative compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6438–6447. Cited by: Table 5, §5.
  • [42] Y. Li, S. Lin, B. Zhang, J. Liu, D. Doermann, Y. Wu, F. Huang, and R. Ji (2019) Exploiting kernel sparsity and entropy for interpretable cnn compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2809. Cited by: §2.
  • [43] J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. Cited by: Table 5.
  • [44] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §2.
  • [45] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §2, §2.
  • [46] Y. Liu, D. Wentzlaff, and S. Kung (2020) Rethinking class-discrimination based cnn channel pruning. arXiv preprint arXiv:2004.14492. Cited by: §2, §3.1, Table 4, Table 5, §5, §5.
  • [47] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K. Cheng, and J. Sun (2019) Metapruning: meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3296–3305. Cited by: §1, §2, §2, Table 5, §5, §5.
  • [48] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §1, §2, §3.1, Table 4, Table 5, §5, §5.
  • [49] C. Louizos, M. Welling, and D. P. Kingma (2017) Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312. Cited by: §2.
  • [50] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §2.
  • [51] M. Mak and S. Kung (2006) A solution to the curse of dimensionality problem in pairwise scoring techniques. In International Conference on Neural Information Processing, pp. 314–323. Cited by: §3.1, §4.
  • [52] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §2.
  • [53] Y. E. Nesterov (1983) A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Dokl. akad. nauk Sssr, Vol. 269, pp. 543–547. Cited by: §10, §5, §9.1, §9.2.
  • [54] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §5.
  • [55] X. Ning, T. Zhao, W. Li, P. Lei, Y. Wang, and H. Yang (2020) Dsa: more efficient budgeted pruning via differentiable sparsity allocation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp. 592–607. Cited by: Table 5, §5.
  • [56] P. Pavlidis, J. Weston, J. Cai, and W. N. Grundy (2001) Gene functional classification from heterogeneous data. In Proceedings of the fifth annual international conference on Computational biology, pp. 249–255. Cited by: §3.1.
  • [57] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §2, §2.
  • [58] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §2, §2, §4.
  • [59] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [60] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: Table 5, §5.
  • [61] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
  • [62] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §2.
  • [63] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen (2019) Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35. Cited by: §2.
  • [64] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci (2009) A hypercube-based encoding for evolving large-scale neural networks. Artificial life 15 (2), pp. 185–212. Cited by: §2.
  • [65] K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary computation 10 (2), pp. 99–127. Cited by: §2.
  • [66] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2.
  • [67] M. Tan and Q. V. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §2, §2.
  • [68] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein (2017) Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3751–3760. Cited by: §2.
  • [69] L. Xie and A. Yuille (2017) Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1379–1388. Cited by: §2, §2.
  • [70] J. Ye, X. Lu, Z. Lin, and J. Z. Wang (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124. Cited by: §2.
  • [71] A. B. Yoo, M. A. Jette, and M. Grondona (2003) Slurm: simple linux utility for resource management. In Workshop on Job Scheduling Strategies for Parallel Processing, pp. 44–60. Cited by: §4.
  • [72] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis (2018) Nisp: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203. Cited by: §1, §2.
  • [73] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang (2018) A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §1.
  • [74] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886. Cited by: §1, §1, §2, §2, Table 5, §5.
  • [75] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §2, §2.
  • [76] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §2, §2, §4.

8 SOAP Implementation

8.1 Operator Space

In Tab. 6, we present the detailed operator space.

Elementwise operators: addition, subtraction, multiplication, division, absolute value, square, square root, adding ridge factor
Matrix operators: matrix trace, matrix multiplication, matrix inversion, inner product, outer product, matrix/vector transpose
Statistics operators: summation, product, mean, standard deviation, variance, counting measure
Specialized operators: rbf kernel matrix getter, geometric median getter, tensor slicer
Table 6: Detailed Operator Space

8.2 SOAP Functions

With the operators in Tab. 6 and the operands presented in Tab. 1 of the main paper, the functions in SOAP are expressed as follows:


  • Filter’s -norm:

  • Filter’s -norm:

  • Batch normalization’s scaling factor:

  • Filter’s geometric median:

  • Discriminant Information:

  • Maximum Mean Discrepancy:

  • Generalized Absolute SNR:

  • Generalized Student’s T-Test:

  • Generalized Fisher Discriminant Ratio:

  • Generalized Symmetric Divergence:

(4)
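As an illustration of one member of SOAP, the Maximum Mean Discrepancy between a label partition and its complement can be computed with the standard biased RBF-kernel estimator of [23]; this is a generic sketch, and the exact form and normalization used in the paper may differ.

```python
import numpy as np

def rbf_kernel_matrix(x, y, gamma=1.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2_rbf(part, comp, gamma=1.0):
    """Biased squared-MMD estimate between two sets of flattened feature maps,
    following the kernel two-sample test of Gretton et al. [23]."""
    kxx = rbf_kernel_matrix(part, part, gamma).mean()
    kyy = rbf_kernel_matrix(comp, comp, gamma).mean()
    kxy = rbf_kernel_matrix(part, comp, gamma).mean()
    return kxx + kyy - 2.0 * kxy

# Toy usage: one channel's feature maps, flattened per sample.
a = np.random.randn(64, 36)    # maps of the current class partition
b = np.random.randn(192, 36)   # maps of the complement partition
print(mmd2_rbf(a, b, gamma=0.1))
```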

9 Experimental Details

9.1 Study on Fitness Combination Scheme

Preliminary Evolution. We conduct 10 preliminary experiments, where the variables are the task weight and the combination scheme {weighted geometric mean, weighted arithmetic mean}. For each experiment, we have a population of 15 functions evolved for 10 generations. The population is initialized with 10 individuals randomly cloned from SOAP and 5 random expression trees. The tournament size is 3, and the number of selected functions is 5. The next generation is reproduced only from the selected functions. Other settings are the same as in the main evolution experiment.

CIFAR-100 Pruning. We apply the best evolved function from each preliminary evolution test to prune a ResNet-38 [26] on CIFAR-100 [35]. The baseline ResNet-38 adopts the bottleneck block structure and has an accuracy of 72.3%. We use each evolved function to prune 40% of the channels in all layers uniformly, resulting in a 54.7%/52.4% FLOPs/parameter reduction. The network is then fine-tuned with the SGD optimizer for 200 epochs, using Nesterov momentum [53] with a momentum of 0.9. The mini-batch size is set to 128, and the weight decay is set to 1e-3. The training data is transformed with a standard data augmentation scheme [26]. The learning rate is initialized at 0.1 and divided by 10 at epochs 80 and 160.

9.2 Main Evolution Experiment

MNIST Pruning. On the MNIST [38] pruning task, we prune a LeNet-5 [38] with a baseline accuracy of 99.26% from a shape of 20-50-800-500 to 5-12-160-40. This pruning reduces 92.4% of the FLOPs and 98.0% of the parameters. The pruned network is fine-tuned for 300 epochs with a batch size of 200 and a weight decay of 7e-5. We use the Adam optimizer [34] with a constant learning rate of 5e-4.

CIFAR-10 Pruning. For CIFAR-10 [35] pruning, we adopt the VGG-16 structure from [40] with a baseline accuracy of 93.7%. We uniformly prune 40% of the channels from all layers resulting in 63.0% FLOPs reduction and 63.7% parameters reduction. The fine-tuning process takes 200 epochs with a batch size of 128. We set the weight decay to be 1e-3 and the dropout ratio to be 0.3. We use the SGD optimizer with Nesterov momentum [53], where the momentum is set to be 0.9. We augment the training samples with a standard data augmentation scheme [26]. The initial learning rate is set to be 0.006 and multiplied by 0.28 at 40% and 80% of the total number of epochs.

9.3 Transfer Pruning

We implement the pruning experiments in TensorFlow [1] and carry them out with NVIDIA Tesla P100 GPUs. CIFAR-100 contains 50,000/10,000 training/test samples in 100 classes. SVHN is a 10-class dataset where we use 604,388 training images for network training with a test set of 26,032 images. ILSVRC-2012 contains 1.28 million training images and 50 thousand validation images in 1000 classes. We adopt the standard data augmentation scheme [26] for CIFAR-100 and ILSVRC-2012.

9.4 Channel Scoring

As many of our pruning functions require the channels' activation maps to determine channel importance, we need to feed input images forward through the network for channel scoring. Specifically, for pruning experiments on MNIST, CIFAR-10, and CIFAR-100, we use all of their training images to compute the channel scores. On SVHN and ILSVRC-2012, we randomly sample 20 thousand and 10 thousand training images, respectively, for channel scoring.
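A sketch of how per-layer activation maps could be collected for scoring in TensorFlow/Keras; the sub-model pattern and the layer name used here are generic illustrations rather than the authors' exact code, and the random array stands in for the sampled training images described above.

```python
import numpy as np
import tensorflow as tf

def collect_channel_maps(model, layer_name, images, batch_size=128):
    """Feed images forward and collect one conv layer's activation maps.
    Returns an array of shape (N, H, W, C); channel scores are computed from it."""
    feature_extractor = tf.keras.Model(
        inputs=model.input, outputs=model.get_layer(layer_name).output)
    return feature_extractor.predict(images, batch_size=batch_size)

# Illustration: score the first conv layer of a Keras ResNet-50 on a small
# random batch standing in for the sampled subset of training images.
model = tf.keras.applications.ResNet50(weights=None)
subset = np.random.rand(32, 224, 224, 3).astype("float32")
maps = collect_channel_maps(model, "conv1_conv", subset)
scores = np.abs(maps).mean(axis=(0, 1, 2))   # a simple labeless per-channel score
```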

10 Evolution on ILSVRC-2012

Evolution. We use ResNet-18 as the target network for pruning function evolution on ILSVRC-2012. Since only one task is evaluated, we directly use the retrained accuracy of the pruned network as the function’s fitness. Other evolution settings for population, selection, mutation, and crossover are kept the same as Sec. 4 of the main paper.

Evaluation. We uniformly prune 30% of channels in each layer from a pretrained ResNet-18, resulting in a FLOPs reduction of 36.4%. Due to the constrained computational budget, we only fine-tune it for 4 epochs using the SGD optimizer with Nesterov momentum [53]. We use a batch size of 128 and initialize our learning rate at 0.001. The learning rate is multiplied by 0.4 at epoch 1 and 2.

Result. We show the evolution progress in Fig. 9. Due to the limited training budget, the pruned nets are clearly not well retrained, as they only achieve around 63.5% accuracy, much lower than the performance of the methods shown in Tab. 5 of the main paper at a similar pruning level. Such inadequate training results in an imprecise function fitness evaluation, as evidenced in Sec. 6 of the main paper. Moreover, the best evolved function from this strategy (Eqn. 4) performs worse than the co-evolved function when transferred to CIFAR-100 pruning. These results demonstrate the cost-effectiveness advantage of our small-dataset co-evolution strategy.

Figure 9: Function evolution on ImageNet.

11 Extra Evolved Functions

We present additional evolved functions from our co-evolution strategy:

(5)
(6)
(7)

Eqn. 5 presents a metric with the concept of SNR for classification, while exhibiting a novel combination of statistics. Moreover, our evolution experiments find that measuring the variance across all elements of the operands in Eqn. 6 and Eqn. 7 empirically helps identify important channels. These two functions are simple and effective, yet remain undiscovered in the literature.

12 Function Validity

The function expressions generated by mutation and crossover can be invalid (non-invertible matrix, dimension inconsistency, etc.) due to the random selection of operators, operands, and nodes in the expression trees. To combat this issue and enlarge our valid function space, some operators are deliberately modified from their standard definitions. For instance, whenever we need to invert a positive semi-definite scatter matrix, we automatically add a ridge factor and invert the regularized matrix. For dimension inconsistency in elementwise operations, we have two options to pad the operand with the smaller dimension: (1) with 0 for addition and subtraction, and with 1 for multiplication and division; (2) with its own value if it is a scalar. Moreover, we conduct a validity test on the mutated/crossed-over functions every time after the mutation/crossover process. Invalid expressions are discarded, and the mutation/crossover operations are repeated until we recover the population size with all valid functions. These measures ensure that we generate valid function expressions within our vast design space throughout the evolution process.