1 Introduction
Convolutional neural networks achieve prominent performance in computer vision, object detection and other fields by extracting features through neural architectures that imitate mechanisms of the human brain. Human-designed neural architectures such as ResNet (He et al., 2015), DenseNet (Huang et al., 2016) and PyramidNet (Han et al., 2016), each containing several effective blocks, have been successively proposed to increase the accuracy of image classification. In order to design neural architectures adaptable to various datasets, researchers have a growing interest in algorithmic solutions that achieve automatic neural architecture search (Zoph and Le, 2016; Liu et al., 2017b, 2018; Pham et al., 2018; Cai et al., 2018b, 2019).
Many architecture search algorithms perform remarkably well but demand substantial computational effort. For example, obtaining a state-of-the-art architecture for CIFAR-10 required 7 days with 450 GPUs using an evolutionary algorithm (Real et al., 2018), or 800 GPUs for 28 days using reinforcement learning (Zoph and Le, 2016). More recent algorithms based on reinforcement learning (RL) (Pham et al., 2018), sequential model-based optimization (SMBO) (Liu et al., 2017a) and Bayesian optimization (Kandasamy et al., 2018) over a discrete domain speed up the search process, but their essentially directionless search requires a large number of architecture evaluations. Several algorithms based on gradient descent over a continuous domain, such as DARTS (Liu et al., 2018) and NAO (Luo et al., 2018), address this problem to some extent, yet the training of every intermediate architecture remains computationally expensive.

In this work, we propose a method for efficient architecture search called EENA (Efficient Evolution of Neural Architecture), guided by the experience gained in prior learning, to speed up the search process and thus consume less computational effort. The concept of guidance by gained experience is inspired by Net2Net (Chen et al., 2015), which generates large networks by transforming small networks via function-preserving operations. There are several precedents (Cai et al., 2017; Wistuba, 2019) that build neural architecture search on this idea, but their basic operations exploit experience only in the parameters and are relatively simple, so the algorithms may degenerate into random search. We absorb more basic blocks from classical networks, discard several ineffective blocks, and even extend the guidance of gained experience to prior architectures through a crossover operation. Because the loss continues to decrease and the evolution becomes directional, robust and globally strong models can be discovered rapidly in the search space.
Our experiments (Sect. 3) on neural architecture search on CIFAR-10 show that our method, using minimal computational resources (0.65 GPU-days; all of our experiments were performed on a single NVIDIA Titan Xp GPU), can design a highly effective neural cell that achieves 2.56% test error with 8.47M parameters. We further transfer the best architecture discovered on CIFAR-10 to CIFAR-100, where it performs remarkably well too.
Our contributions are summarized as follows:

- We are the first to propose a crossover operation guided by gained experience that effectively reuses previously learned architectures and parameters.

- We study a large number of basic mutation operations absorbed from typical architectures and select the ones that have significant effects.

- We achieve remarkable architecture-search efficiency (2.56% error on CIFAR-10 in 0.65 GPU-days), which we attribute to the use of EENA.

- We show that the neural architectures searched by EENA on CIFAR-10 are transferable to CIFAR-100.

Part of the code implementation and several models searched on CIFAR-10 by EENA are available at https://github.com/zhuhui123/EENA.
2 Methods of Efficient Evolution of Neural Architectures
In this section, we illustrate our basic mutation and crossover operations with an example of several connected layers taken from a simple convolutional neural network, and we describe how individuals are selected from and discarded out of the population during the evolution process.
2.1 Search space and mutation operations.
The birth of a better network architecture is usually achieved through local improvements. Chollet (2016) proposes replacing Inception modules with depthwise separable convolutions to reduce the number of parameters. Grouped convolutions, introduced by Krizhevsky et al. (2012) to distribute a model over two GPUs, were extended by Xie et al. (2016), who showed that increasing cardinality is more effective than going deeper or wider. He et al. (2015) solves the degradation problem of deep neural networks with residual blocks, and Huang et al. (2016) proposes dense blocks to alleviate the vanishing-gradient problem and substantially reduce the number of parameters. Some of the existing methods (Chen et al., 2015; Wistuba, 2019) based on function-preserving transformations are briefly reviewed in this section, and our method builds on them. Specifically, we absorb more blocks of classical networks such as the dense block, add effective changes such as noise on new parameters, and discard several ineffective operations such as kernel widening.

Our method explores the search space by mutation and crossover operations, where every mutation operation is a random change to an individual. Let x be the input to the network and f(x; θ) the function computed by the teacher network with parameters θ. The guidance of experience gained in the parameters is to choose a new set of parameters θ' for a student network g(x; θ'), transformed from the teacher network, such that g(x; θ') = f(x; θ) for all x (the "=" here is not exact equivalence: noise may be added to make the student more robust). Assume the l-th convolutional layer to be changed is represented by a weight matrix W^(l), the input to the convolution operation in layer l is denoted h^(l-1), and the combined processing of BatchNorm and ReLU is expressed as σ(·). In this work, we consider the following mutation operations.

Widen a layer.
Fig. 1(b) is an example of this operation. W^(l) is extended by replicating its filters at random along the output axis, and the parameters in W^(l+1) are divided along the input axis by the number of copies of the corresponding filter in the l-th layer. Following Net2Net (Chen et al., 2015), with n filters in layer l and a mapping g(j) = j for j ≤ n, g(j) sampled uniformly from {1, ..., n} otherwise, the new parameter matrices U^(l) and U^(l+1) are

U^(l)_{·,j} = W^(l)_{·,g(j)},   U^(l+1)_{j,·} = W^(l+1)_{g(j),·} / |{x : g(x) = g(j)}|.

Specifically, a small noise is randomly added to every new parameter in U^(l) to break symmetry.
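As an illustration, the widening transform above can be sketched for the fully connected case (convolutions add kernel axes but follow the same mapping). This is a numpy sketch of the Net2WiderNet-style update, not the paper's actual implementation; the function and variable names are ours.

```python
import numpy as np

def widen_layer(w_l, w_next, new_width, noise_std=1e-3, rng=None):
    """Net2WiderNet-style widening for fully connected weights of shape
    (in, out). New units replicate randomly chosen existing units of
    layer l; the matching rows of layer l+1 are divided by the number
    of copies so the composed function is preserved."""
    if rng is None:
        rng = np.random.default_rng()
    n = w_l.shape[1]                       # current width of layer l
    # g maps each unit of the widened layer to an original unit
    g = np.concatenate([np.arange(n),
                        rng.integers(0, n, size=new_width - n)])
    counts = np.bincount(g, minlength=n)   # copies of each original unit
    u_l = w_l[:, g]                        # replicate columns of W^(l)
    u_next = w_next[g, :] / counts[g][:, None]  # rescale rows of W^(l+1)
    # small noise on the replicated columns breaks symmetry
    u_l[:, n:] += rng.normal(0.0, noise_std, size=u_l[:, n:].shape)
    return u_l, u_next
```

With noise_std=0 the transformation is exactly function-preserving in the linear case (and filter-wise in the ReLU case); the small noise trades a little exactness for broken symmetry between copies.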
Branch a layer.
Fig. 1(c) is an example of this operation. The filters of W^(l) are split into two parameter matrices W^(l)_1 and W^(l)_2 whose outputs are concatenated, leaving the layer's function unchanged. This operation adds no further parameters and is always combined with other operations.
Insert a single layer.
Fig. 1(d) is an example of this operation. The new layer's weight matrix, with a kernel initialized to an identity mapping, satisfies σ(I σ(x)) = σ(x) because the ReLU activation σ satisfies the restriction σ(σ(x)) = σ(x), so this operation is function-preserving.
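A minimal sketch of why the identity-initialized insertion is function-preserving, assuming a fully connected layer and ReLU activations (the names are illustrative, not from the paper's code):

```python
import numpy as np

def relu(x):
    # ReLU is idempotent: relu(relu(x)) == relu(x)
    return np.maximum(x, 0.0)

def insert_identity_layer(width):
    # The inserted layer computes relu(I @ h), which equals h whenever
    # h is already non-negative, i.e. the output of a preceding ReLU.
    return np.eye(width)
```

This is exactly the restriction in the text: identity initialization works because applying ReLU twice changes nothing.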
Insert a layer with shortcut connection.
Fig. 1(e) is an example of this operation. All the parameters of the new layer's weight matrix W_new are initialized to 0, so the residual branch initially outputs zero and the function is preserved:

h^(l+1) = h^(l) + σ(W_new h^(l)) = h^(l).
Insert a layer with dense connection.
Fig. 1(f) is an example of this operation. All the parameters of the new layer's weight matrix W_new are initialized to 0, so the newly concatenated feature maps are zero at insertion time and the network's output is unchanged:

h^(l+1) = concat(h^(l), σ(W_new h^(l))) = concat(h^(l), 0).
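A minimal sketch of the zero-initialization argument shared by the shortcut and dense insertions, assuming a fully connected residual branch (the names are our illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def insert_zero_residual(h, w_new):
    # With w_new == 0, the branch outputs relu(0) == 0, so the layer's
    # output equals its input h; training then grows the branch from zero.
    return h + relu(w_new @ h)
```

The dense-connection variant concatenates the zero branch instead of adding it; in both cases the function is preserved at insertion time.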
In addition, many other important methods, such as separable convolution, grouped convolution and bottlenecks, can be absorbed into the mutation operations. We ran several simple tests and noticed that although these operations expand the search space, classification accuracy does not improve. Therefore, we abandoned them in our experiments.
2.2 Crossover operation.
Crossover refers to combining prominent parents to produce offspring that may perform even better. The parents are architectures with high fitness (accuracy) that have already been discovered, and every offspring can be considered a new exploration of the search space. Although our mutation operations reduce the computational effort of repeated retraining, the exploration of the search space remains random and does not exploit the experience already gained in prior architectures. It is crucial, and difficult, to find a crossover operation that effectively reuses already-trained parameters and produces the next generation guided by the experience of prior excellent architectures.
NEAT (Stanley and Miikkulainen, 2002), an existing method in the field of evolutionary algorithms, identifies which genes line up with which by assigning an innovation number to each node gene. However, this method is limited to fine-grained crossover of nodes and connections, and it destroys parameters that have already been trained.
We notice that the architectures with high fitness all derive from the same ancestor at some point in the past (at worst, that ancestor is the initial architecture). Whenever a new architecture appears through a mutation operation, we record the type and location of the mutation. Based on these records, we can trace the historical origins and find the common ancestor of two individuals with high fitness. The offspring then inherits the common architecture (the ancestor) and randomly inherits the differing parts of the parents' architectures.
Fig. 2 is a visual example of the crossover operation in our experiments. Based on the records of previous mutation operations for each individual (for Parent 1, mutations c, d, b, e occurred at layers 2, 4, 3, 3, respectively; for Parent 2, mutations c, d, f occurred at layers 2, 4, 3, respectively), the common ancestor of the parents (the architecture with mutations c, d at layers 2, 4) is easily found. The mutation operations in which the two parents differ are then added to the ancestor architecture with a certain probability (mutations b, f at layers 3, 3 are inherited by the offspring, while mutation e is randomly discarded).
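The ancestry-based crossover just described can be sketched as follows, assuming each individual's history is stored as an ordered list of (mutation_type, layer) records; the encoding and names are our illustration, not the paper's code:

```python
import random

def crossover(parent1_history, parent2_history, inherit_prob=0.5, rng=None):
    """The shared prefix of the two mutation histories identifies the
    common ancestor; the offspring inherits that prefix plus each
    differing mutation independently with probability inherit_prob."""
    if rng is None:
        rng = random.Random()
    common = []
    for a, b in zip(parent1_history, parent2_history):
        if a != b:
            break
        common.append(a)
    k = len(common)
    differing = parent1_history[k:] + parent2_history[k:]
    return common + [m for m in differing if rng.random() < inherit_prob]
```

On the example from Fig. 2, the common prefix [(c, 2), (d, 4)] is always inherited, while b, e and f survive only with the given probability.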
2.3 The selection and discard of individuals in evolutionary algorithm
The selection of individuals.
Our evolutionary algorithm uses tournament selection (Goldberg and Deb, 1991) to select an individual for mutation: a fraction of individuals is sampled from the population at random, and the individual with the highest fitness in this sample is finally selected. For crossover, the two individuals with the highest fitness but different architectures are selected.
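A minimal sketch of tournament selection as described above (the parameter names are ours):

```python
import random

def tournament_select(population, fitness, k=3, rng=None):
    """Sample k individuals uniformly at random and return the fittest
    member of the sample."""
    if rng is None:
        rng = random.Random()
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)
```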
The discard of individuals.
In order to constrain the size of the population, the generation of each new individual is accompanied by a discard once the population reaches its size limit. We regulate aging and non-aging evolution (Real et al., 2018) via a probability p that affects the convergence rate and overfitting: within each round, we discard the worst model with probability p and the oldest model with probability 1 - p.
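A sketch of this discard rule, assuming each individual carries a fitness and an age (the names are illustrative):

```python
import random

def discard(population, fitness, age, p_worst=0.5, rng=None):
    """With probability p_worst remove the lowest-fitness individual
    (non-aging evolution); otherwise remove the oldest individual
    (aging regularization, as in Real et al., 2018). `age` returns the
    insertion round; a lower value means an older individual."""
    if rng is None:
        rng = random.Random()
    if rng.random() < p_worst:
        victim = min(population, key=fitness)
    else:
        victim = min(population, key=age)
    population.remove(victim)
    return victim
```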
3 Experiments
In this section, we report the performance of EENA in neural architecture search on CIFAR-10 and the feasibility of transferring the best architecture discovered on CIFAR-10 to CIFAR-100. In our experiments, we start the evolution from a simple initial convolutional neural network to show the efficiency of EENA, using the selection and discard methods of Sect. 2.3 to manage the population and the mutation (Sect. 2.1) and crossover (Sect. 2.2) operations to improve the neural architectures.
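The overall procedure (grow the population, tournament-select a parent, mutate it, and discard the worst or oldest individual) can be condensed into a toy loop. In this sketch, fitness is reduced to an abstract number standing in for validation accuracy; training, crossover, and real architectures are omitted, and all names are our own:

```python
import random

def evolve(rounds=50, pop_limit=20, k=3, p_worst=0.5, seed=0):
    """Toy stand-in for the EENA search loop: individuals are
    (fitness, birth_round) pairs and 'mutation' perturbs fitness."""
    rng = random.Random(seed)
    population = [(rng.random(), 0)]
    for t in range(1, rounds + 1):
        # tournament selection: fittest of a random sample of size k
        sample = rng.sample(population, min(k, len(population)))
        parent = max(sample, key=lambda ind: ind[0])
        # "mutation": perturb the stand-in fitness value
        population.append((parent[0] + rng.gauss(0, 0.05), t))
        if len(population) > pop_limit:
            # discard the worst model (prob p_worst) or the oldest one
            key = ((lambda ind: ind[0]) if rng.random() < p_worst
                   else (lambda ind: ind[1]))
            population.remove(min(population, key=key))
    return max(population, key=lambda ind: ind[0])
```

The real algorithm additionally trains each child for a few epochs, applies crossover every few rounds, and evaluates fitness on a held-out validation set.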
Initial model.
The initial model (with 0.67M parameters) is sketched in Figure 3. It starts with one convolutional layer, followed by three evolutionary blocks and two MaxPooling layers for downsampling, connected alternately. Another convolutional layer follows, then a GlobalAveragePooling layer and a Softmax layer to map the feature maps to a classification. Each MaxPooling layer has a stride of two and is followed by a DropBlock (Ghiasi et al., 2018) layer, with separate keep probabilities for the first and the second one. Specifically, the first convolutional layer contains 64 filters and the last convolutional layer contains 256 filters. An evolutionary block is initialized with a convolutional layer of 128 filters. Every convolutional layer mentioned is actually a Conv-BatchNorm-ReLU block with a fixed kernel size. The weights are initialized from the He normal distribution (He et al., 2015) and L2 regularization is applied to the weights.

Dataset.
We randomly sample 10,000 images by stratified sampling from the original training set to form a validation set for evaluating the fitness of individuals, and use the remaining 40,000 images to train the individuals during the evolution. We normalize the images using channel means and standard deviations for preprocessing and apply a standard data augmentation scheme: zero-padding with 4 pixels on each side to obtain a 40x40 image, randomly cropping it back to 32x32, and randomly flipping the image horizontally.

Search on CIFAR-10.
The initial population consists of 12 individuals, each formed by a single mutation operation applied to the common initial model. During the evolution, individual selection is determined by the fitness (accuracy) of the neural architecture evaluated on the validation set. In our experiments, the tournament size for the selection of individuals is fixed to 3 and the discard probability p is fixed to 0.5. We do not discard any individual at the beginning, letting the population grow to a size of 20. We then apply selection and discard together: each individual produced by mutation or crossover is put back into the population after training, and a discard is executed at the same time. The mutation and crossover operations are applied within the evolutionary blocks, and every mutation operation is selected with equal probability. The crossover operation is executed every 5 rounds, selecting as parents the two individuals in the population with the highest fitness but different architectures. All neural architectures are trained with a batch size of 128 using SGDR (Loshchilov and Hutter, 2016). The initial model is trained for 63 epochs; 15 epochs are trained after each mutation operation, and one round of 7 epochs plus another round of 15 epochs after each crossover operation.

One search process on CIFAR-10 is visualized in Figure 4. In the circular phylogenetic tree of EENA, the color of the outermost circle represents fitness, and matching colors in the penultimate circle indicate a common ancestor. In the rectangular phylogenetic tree, the color on the right side represents fitness. From the inside outward in the left figure, and from left to right in the right figure, along the time axis, the connections represent the relationships from ancestors to offspring. We can see that the fitness of the population increases steadily and rapidly via mutation and crossover operations. In addition, the population is quickly taken over by a high-performing homologous group. After the search budget is exhausted, or if the highest fitness of the population does not increase for 25 rounds, the individual with the highest fitness is extracted as the best neural architecture for post-training.

Post-training of the best neural architecture obtained.
We conduct post-processing and post-training on the best neural architecture designed by EENA. The model is trained on the full training dataset until convergence using Cutout (Devries and Taylor, 2017) and Mixup (Zhang et al., 2017), with the same configurations as in the original papers. To keep the comparison fair, we do not use the latest method proposed by Cubuk et al. (2018), which has significant effects but has not yet been widely used. The neural architectures are trained with a batch size of 128 using SGDR for 511 or 1023 epochs (we did not conduct extensive hyperparameter tuning due to limited computational resources). Finally, the error on the test dataset is reported. The comparison against state-of-the-art recognition results on CIFAR-10 is presented in Table 1. Our method, using minimal computational resources (0.65 GPU-days), designs a highly effective neural cell that achieves 2.56% test error with a small number of parameters (8.47M).

Table 1: Comparison against state-of-the-art results on CIFAR-10.

Method                                     | Params (Mil.) | Search Time (GPU-days) | Test Error (%)
DenseNet-BC (Huang et al., 2016)           | 25.6          | -                      | 3.46
PyramidNet-Bottleneck (Han et al., 2016)   | 26.0          | -                      | 3.31
ResNeXt + Shake-Shake (Gastaldi, 2017)     | 26.2          | -                      | 2.86
AmoebaNet-A (Real et al., 2018)            | 3.2           | 3150                   | 3.34
Large-scale Evolution (Real et al., 2017)  | 5.4           | 2600                   | 5.4
NAS-v3 (Zoph and Le, 2016)                 | 37.4          | 1800                   | 3.65
NASNet-A (Zoph et al., 2017)               | 3.3           | 1800                   | 2.65
Hierarchical Evolution (Liu et al., 2017b) | 15.7          | 300                    | 3.75
PNAS (Liu et al., 2017a)                   | 3.2           | 225                    | 3.41
Path-Level EAS (Cai et al., 2018b)         | 14.3          | 200                    | 2.30
NAONet (Luo et al., 2018)                  | 128           | 200                    | 2.11
EAS (Cai et al., 2018a)                    | 23.4          | 10                     | 4.23
DARTS (Liu et al., 2018)                   | 3.4           | 4                      | 2.83
Neuro-Cell-based Evolution (Wistuba, 2019) | 7.2           | 1                      | 3.58
ENAS (Pham et al., 2018)                   | 4.6           | 0.45                   | 2.89
NAC (Kamath et al., 2018)                  | 10            | 0.25                   | 3.33
Ours                                       | 8.47          | 0.65                   | 2.56
Comparison to search without crossover.
Unlike random search by mutation operations alone, crossover acts as a heuristic that makes the exploration directional. To verify the effect of the crossover operation, we conduct another experiment that removes crossover from the search process while keeping all other configurations unchanged. We run this experiment 5 times with the same 0.65 GPU-days budget and obtain a mean classification error of 3.44% and a best classification error of 2.96%. This confirms that the crossover operation is indeed effective.
Transfer the best cell searched on CIFAR-10 to CIFAR-100.
We further transfer the best cell (the one with the highest fitness) searched on CIFAR-10 to CIFAR-100, where it performs remarkably well too. For CIFAR-100, several hyperparameters are modified: the keep probabilities of the two DropBlock layers and the cutout size. The comparison against state-of-the-art recognition results on CIFAR-100 is presented in Table 2.
Table 2: Comparison against state-of-the-art results on CIFAR-100.

Method                                     | Params (Mil.) | Search Time (GPU-days) | Test Error (%)
DenseNet-BC (Huang et al., 2016)           | 25.6          | -                      | 17.18
ResNeXt + Shake-Shake (Gastaldi, 2017)     | 26.2          | -                      | 15.20
AmoebaNet-B (Real et al., 2018)            | 34.9          | 3150                   | 15.80
Large-scale Evolution (Real et al., 2017)  | 40.4          | 2600                   | 23.70
NASNet-A (Zoph et al., 2017)               | 50.9          | 1800                   | 16.03
PNAS (Liu et al., 2017a)                   | 3.2           | 225                    | 17.63
NAONet (Luo et al., 2018)                  | 128           | 200                    | 14.75
Neuro-Cell-based Evolution (Wistuba, 2019) | 5.3           | 1                      | 21.74
ENAS (Pham et al., 2018)                   | 4.6           | 0.45                   | 17.27
Ours (transferred from CIFAR-10)           | 8.49          | -                      | 17.78
4 Conclusions and Ongoing Work
We design an efficient method of neural architecture search based on evolution with the guidance of experience gained in prior learning. The method takes repeatable CNN blocks (cells) as the basic units of evolution and achieves state-of-the-art accuracy on CIFAR-10 and other datasets with few parameters and little search time. We notice that the initial model and the basic operations strongly affect both search speed and final accuracy. We are therefore experimenting with adding effective blocks, such as the Squeeze-and-Excitation block, as mutation operations, combined with other potentially effective methods such as macro search (Hu et al., 2018).
References
 Cai et al. [2017] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture search by network transformation. CoRR, abs/1707.04873, 2017. URL http://arxiv.org/abs/1707.04873.
 Cai et al. [2018a] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence, 2018a.
 Cai et al. [2018b] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. CoRR, abs/1806.02639, 2018b. URL http://arxiv.org/abs/1806.02639.
 Cai et al. [2019] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
 Chen et al. [2015] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. CoRR, abs/1511.05641, 2015. URL http://arxiv.org/abs/1511.05641.
 Chollet [2016] François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016. URL http://arxiv.org/abs/1610.02357.
 Cubuk et al. [2018] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018. URL http://arxiv.org/abs/1805.09501.
 Devries and Taylor [2017] Terrance Devries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.
 Gastaldi [2017] Xavier Gastaldi. Shake-shake regularization. CoRR, abs/1705.07485, 2017. URL http://arxiv.org/abs/1705.07485.
 Ghiasi et al. [2018] Golnaz Ghiasi, TsungYi Lin, and Quoc V. Le. Dropblock: A regularization method for convolutional networks. CoRR, abs/1810.12890, 2018. URL http://arxiv.org/abs/1810.12890.

 Goldberg and Deb [1991] David E. Goldberg and Kalyanmoy Deb. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms, volume 1, pages 69–93. Elsevier, 1991. doi: 10.1016/B978-0-08-050684-5.50008-2. URL http://www.sciencedirect.com/science/article/pii/B9780080506845500082.
 Han et al. [2016] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. CoRR, abs/1610.02915, 2016. URL http://arxiv.org/abs/1610.02915.
 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.
 Hu et al. [2018] Hanzhang Hu, John Langford, Rich Caruana, Eric Horvitz, and Debadeepta Dey. Macro neural architecture search revisited. 2018.
 Huang et al. [2016] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016. URL http://arxiv.org/abs/1608.06993.
 Kamath et al. [2018] Purushotham Kamath, Abhishek Singh, and Debo Dutta. Neural architecture construction using envelopenets. CoRR, abs/1803.06744, 2018. URL http://arxiv.org/abs/1803.06744.
 Kandasamy et al. [2018] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabás Póczos, and Eric Xing. Neural architecture search with bayesian optimisation and optimal transport. CoRR, abs/1802.07191, 2018. URL http://arxiv.org/abs/1802.07191.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks.pdf.
 Liu et al. [2017a] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan L. Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. CoRR, abs/1712.00559, 2017a. URL http://arxiv.org/abs/1712.00559.
 Liu et al. [2017b] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. CoRR, abs/1711.00436, 2017b. URL http://arxiv.org/abs/1711.00436.
 Liu et al. [2018] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018. URL http://arxiv.org/abs/1806.09055.
 Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. CoRR, abs/1608.03983, 2016. URL http://arxiv.org/abs/1608.03983.
 Luo et al. [2018] Renqian Luo, Fei Tian, Tao Qin, and TieYan Liu. Neural architecture optimization. CoRR, abs/1808.07233, 2018. URL http://arxiv.org/abs/1808.07233.
 Pham et al. [2018] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. CoRR, abs/1802.03268, 2018. URL http://arxiv.org/abs/1802.03268.
 Real et al. [2017] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc V. Le, and Alex Kurakin. Large-scale evolution of image classifiers. CoRR, abs/1703.01041, 2017. URL http://arxiv.org/abs/1703.01041.
 Real et al. [2018] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018. URL http://arxiv.org/abs/1802.01548.
 Stanley and Miikkulainen [2002] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002. URL http://nn.cs.utexas.edu/?stanley:ec02.
 Wistuba [2019] Martin Wistuba. Deep learning architecture search by neuro-cell-based evolution with function-preserving mutations. In Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim, editors, Machine Learning and Knowledge Discovery in Databases, pages 243–258, Cham, 2019. Springer International Publishing. ISBN 978-3-030-10928-8.
 Xie et al. [2016] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016. URL http://arxiv.org/abs/1611.05431.
 Zhang et al. [2017] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017. URL http://arxiv.org/abs/1710.09412.
 Zoph and Le [2016] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
 Zoph et al. [2017] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017. URL http://arxiv.org/abs/1707.07012.
5 Appendix
Here we plot the best architecture of CNN cells discovered by EENA in Fig. 5.