CARS: Continuous Evolution for Efficient Neural Architecture Search

09/11/2019 ∙ by Zhaohui Yang, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Peking University

Searching techniques in most existing neural architecture search (NAS) algorithms are dominated by differentiable methods for efficiency reasons. In contrast, we develop an efficient continuous evolutionary approach for searching neural networks. The architectures in the population, which share parameters within one supernet, are tuned on the training dataset for a few epochs in the latest iteration. The search in the next evolution iteration directly inherits both the supernet and the population, which accelerates the generation of optimal networks. A non-dominated sorting strategy is further applied to preserve only the results on the Pareto front, so that the supernet is updated accurately. Several neural networks with different model sizes and performance are produced after the continuous search, which takes only 0.4 GPU days. As a result, our framework provides a series of networks with parameter counts ranging from 3.7M to 5.1M under mobile settings, and these networks surpass those produced by state-of-the-art methods on the benchmark ImageNet dataset.


Related Works

Network Architecture Search

The network architecture search problem generally consists of two steps: network parameter optimization and network architecture optimization. The network parameter optimization step adjusts the parameters of the standard layers (e.g., convolution, batch normalization, and fully connected layers), which is similar to training a standard neural network (e.g., ResNet). The network architecture optimization step learns accurate network architectures with superior performance.

The parameter optimization step can be performed with either independent or shared optimization. Independent optimization learns each network separately; e.g., AmoebaNet [27] takes thousands of GPU days to evaluate thousands of models. To reduce training time, [8, 7] initialize parameters by network morphism. The one-shot method [1] goes a step further by sharing all the parameters of different architectures within one supernet: rather than training thousands of different architectures, only one supernet needs to be optimized.

The architecture optimization step includes RL-based, EA-based, and gradient-based approaches. RL-based methods [40, 41, 26] use a recurrent network as the architecture controller, and the performance of the generated architectures is used as the reward for training the controller. The controller converges during training and finally outputs architectures with superior performance. EA-based approaches [35, 27, 33, 30] search architectures with the help of evolutionary algorithms; the validation performance of each individual is used as its fitness for evolving the next generation. Gradient-based approaches [21, 36, 34] view the network architecture as a set of learnable parameters and optimize the architecture with the standard back-propagation algorithm.

Multi-objective Network Architecture Search

When multiple complementary objectives are considered, e.g., performance, the number of parameters, floating-point operations (FLOPs), and latency, no single architecture surpasses all the others on every objective. Therefore, a set of architectures on the Pareto front is desired. Many works have been proposed to handle multi-objective network architecture search. NEMO [16] targets speed and accuracy. DPPNet and LEMONADE [6, 8] consider device-related and device-agnostic objectives. MONAS [14] targets accuracy and energy. NSGANet [23] considers FLOPs and accuracy. These methods are less efficient because the models are optimized separately. In contrast, our architecture optimization and parameter optimization steps are conducted iteratively rather than first fully training a set of parameters and then optimizing architectures. Besides, the parameters of different architectures are shared, which makes the search much more efficient.

Approach

In this section, we develop a novel continuous evolutionary approach for searching neural architectures, which consists of two procedures: parameter optimization and architecture optimization.

We use a Genetic Algorithm (GA) for architecture evolution, because GA can cover a vast searching space. We maintain a population of architectures (a.k.a. connections) C = {C_1, ..., C_P}, where P is the population size. The architectures in the population are gradually updated according to the proposed pNSGA-III method during the architecture optimization step. To make the searching stage efficient, we maintain a supernet N that shares parameters across different architectures, which dramatically reduces the computational cost of separately training these architectures during the search.

Supernet of CARS

Different architectures are sampled from the supernet N. Each network can be represented by a set of full-precision parameters W and a set of binary connection parameters C (i.e., elements in {0, 1}). A 0-element means the network does not use the corresponding connection to transform the data flow, and a 1-element means the network uses this connection. From this point of view, each network can be represented as a (W, C) pair.

The full-precision parameters W are shared by a set of networks. If the network architectures are fixed, the optimal W can be obtained by standard back-propagation so that W fits all the networks and achieves higher recognition performance. After the parameters have converged, we alternately optimize the binary connections C by the GA. These two stages form the main optimization of the proposed continuous evolution algorithm and are processed in an alternating manner. We introduce the two kinds of update below.
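To make the (W, C) representation concrete, here is a minimal sketch, in the spirit of the released PyTorch code but not taken from it, of how a binary connection vector masks a set of shared operations; the class and variable names are ours.

```python
import torch

# Minimal sketch (not the official CARS implementation): a "supernet" holds one
# shared weight tensor per candidate connection; a binary vector C selects which
# connections an individual architecture actually uses, i.e. W_i = W masked by C_i.
class ToySupernet(torch.nn.Module):
    def __init__(self, num_connections, in_dim, out_dim):
        super().__init__()
        # One candidate operation (here a linear map) per possible connection.
        self.ops = torch.nn.ModuleList(
            [torch.nn.Linear(in_dim, out_dim) for _ in range(num_connections)]
        )

    def forward(self, x, connection):
        # connection: binary 0/1 list of length num_connections.
        # Only 1-elements contribute to the data flow; 0-elements are masked out,
        # so every individual reuses the same underlying shared weights W.
        outputs = [op(x) for c, op in zip(connection, self.ops) if c == 1]
        return torch.stack(outputs, dim=0).sum(dim=0)

supernet = ToySupernet(num_connections=4, in_dim=8, out_dim=8)
x = torch.randn(2, 8)
individual = [1, 0, 1, 0]   # one sampled architecture C_i
y = supernet(x, individual)
```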

Parameter Optimization

The parameter W is the collection of all the parameters in the supernet. The parameters of the i-th individual are W_i = W ⊙ C_i, where ⊙ is the mask operation that keeps the parameters of the complete graph only at the positions corresponding to the 1-elements of the connection C_i. With input data X fed into the network, the prediction of this network is N_i(X), where N_i is the i-th architecture with the sampled weights W_i. Given the ground truth Y, the prediction loss can be expressed as L_i = H(N_i(X), Y), where H is the criterion. The gradient of W_i can be calculated as

dW_i = ∂L_i/∂W_i = ∂H(N_i(X), Y)/∂W_i.    (1)

The parameters W should fit all the individuals, so the gradients of all the networks are accumulated to compute the gradient of the shared parameters:

dW = (1/P) Σ_{i=1}^{P} ∂L_i/∂W_i.    (2)

Any layer is only optimized by the networks that use this layer in their forward pass. By collecting the gradients of the individuals in the population, the parameters W are updated through the SGD algorithm.

Since we maintain a large set of architectures with shared weights in the supernet, we borrow the idea of stochastic gradient descent and use a mini-batch of architectures for updating the parameters. Accumulating the gradients of all the networks would take too much time for a single gradient descent step, so we instead use B different architectures, where B < P and the indexes of the sampled architectures are {n_1, ..., n_B}, to update the parameters. The efficient parameter update of Eqn 2 is detailed in Eqn 3:

dW ≈ (1/B) Σ_{j=1}^{B} ∂L_{n_j}/∂W_{n_j}.    (3)

Hence, the gradient over a mini-batch of architectures is taken as an approximation of the averaged gradients of all the different individuals. The time cost of each update is thus largely reduced, and an appropriate mini-batch size balances efficiency and accuracy.
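The following sketch illustrates the mini-batch architecture update of Eqn 3 under our own toy setup (the module, population, and data here are illustrative, not the paper's): gradients from B sampled architectures are accumulated and averaged before a single SGD step on the shared weights.

```python
import random
import torch

# Sketch of the mini-batch architecture update of Eqn 3 (our illustration).
torch.manual_seed(0)
random.seed(0)

num_connections, P, B = 4, 8, 2   # candidate connections, population size, architecture mini-batch
shared_ops = torch.nn.ModuleList([torch.nn.Linear(8, 8) for _ in range(num_connections)])
# Toy population of binary connection vectors; the first connection is always on
# so every individual has at least one active path.
population = [[1] + [random.randint(0, 1) for _ in range(num_connections - 1)] for _ in range(P)]
optimizer = torch.optim.SGD(shared_ops.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.MSELoss()

def forward(x, connection):
    # Masked forward pass: only 1-element connections transform the data flow.
    outs = [op(x) for c, op in zip(connection, shared_ops) if c == 1]
    return torch.stack(outs).sum(0)

x, y = torch.randn(16, 8), torch.randn(16, 8)
optimizer.zero_grad()
for connection in random.sample(population, B):        # B architectures per update
    loss = criterion(forward(x, connection), y) / B    # average over the architecture mini-batch
    loss.backward()                                     # gradients accumulate in the shared W
optimizer.step()                                        # one SGD step on the shared parameters
```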

Architecture Optimization

For the architecture optimization procedure, we use an evolutionary algorithm with the non-dominated sorting strategy introduced in NSGA-III [4]. Denote by x_1, ..., x_P the different networks and by F_1, ..., F_M the M different measurements we want to minimize. In general, these measurements, including the number of parameters, floating-point operations, latency, and performance, may conflict with each other, which makes it difficult to find a single solution that minimizes all of them.

In practice, if architecture x_1 dominates x_2, then x_1 is no worse than x_2 on all the measurements and strictly better on at least one of them. Formally, the definition of domination is given below.

Definition 1.

Consider two networks x_1 and x_2, and a series of measurements F_1, ..., F_M that we want to minimize. If

F_i(x_1) ≤ F_i(x_2), ∀ i ∈ {1, ..., M},  and  ∃ j ∈ {1, ..., M} such that F_j(x_1) < F_j(x_2),    (4)

then x_1 is said to dominate x_2, i.e., x_1 ⪯ x_2.
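Definition 1 translates directly into a small check; the function below is our illustration, with objective vectors assumed to hold values that should all be minimized.

```python
from typing import Sequence

def dominates(f1: Sequence[float], f2: Sequence[float]) -> bool:
    """Return True if solution 1 dominates solution 2 under Definition 1.

    f1 and f2 hold the M objective values (all to be minimized, e.g. error rate
    and number of parameters). x1 dominates x2 when it is no worse on every
    objective and strictly better on at least one.
    """
    no_worse_everywhere = all(a <= b for a, b in zip(f1, f2))
    strictly_better_somewhere = any(a < b for a, b in zip(f1, f2))
    return no_worse_everywhere and strictly_better_somewhere

# Example with (error %, params M): the first network dominates the second.
print(dominates((2.8, 3.1), (3.0, 3.3)))   # True
print(dominates((2.8, 3.5), (3.0, 3.3)))   # False: worse on model size
```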

Require: supernet N, connections C = {C_1, ..., C_P}, offspring expand ratio t, the number of continuous evolution iterations E_evo, multi-objectives F_1, ..., F_M, the number of parameter optimization epochs E_param between evolution iterations, criterion H.
1:  Warm up the parameters W of the supernet N.
2:  for evolution iteration = 1, ..., E_evo do
3:     for epoch = 1, ..., E_param do
4:        Get a training mini-batch of data X and targets Y; randomly sample B different networks from the P maintained architectures.
5:        Calculate the averaged loss of the sampled architectures, L = (1/B) Σ_j H(N_{n_j}(X), Y).
6:        Calculate the derivative with respect to the parameters W according to Eqn 3 and update the network parameters.
7:     end for
8:     Update the architectures within the Pareto front using pNSGA-III.
9:  end for
Output: Architectures meeting various objective requirements.
Algorithm 1 Continuous Evolution for Multi-objective Architecture Search

According to the above definition, if x_1 dominates x_2, then x_2 can be replaced by x_1 during the evolution procedure, since x_1 performs better on at least one metric and no worse on the others. By exploiting this criterion, we can select a series of excellent neural architectures from the population generated in the current iteration, and these networks can then be used to update the corresponding parameters in the supernet.

Although non-dominated sorting with the NSGA-III method [4] can help us select better models for updating the parameters, a small model trap phenomenon appears during the search. Specifically, since the parameters in the supernet still need optimization, the accuracy of each individual architecture in the current iteration does not always reflect the performance it can eventually achieve, as discussed for NAS-Bench-101 [37]. Thus, smaller models with fewer parameters but higher current test accuracy tend to dominate larger models whose current fitness is lower but which have the potential to reach higher accuracy.

Figure 1: Comparison between different evolution strategies (left: NSGA-III, right: pNSGA-III). The supernet is trained on the training set and evaluated on the validation set. The left figure shows five evolution iterations using NSGA-III; evolution with NSGA-III suffers from the Small Model Trap, which biases the distribution toward smaller models. The right figure shows the evolution iterations using pNSGA-III, which protects larger models.

Therefore, we propose an improvement of the conventional NSGA-III that protects these larger models, namely pNSGA-III. More specifically, the pNSGA-III algorithm takes the speed at which the performance increases into consideration. Take the validation performance and the number of parameters as an example. The NSGA-III method performs non-dominated sorting on these two objectives and selects individuals according to the sorted Pareto stages. The proposed pNSGA-III, besides considering the number of parameters and the performance, also performs a non-dominated sorting that considers the number of parameters and the speed of the performance increase. The two sets of Pareto stages are then merged, and individuals are gradually selected, stage by stage, to fill the next generation. In this way, large networks whose performance increases more slowly can still be kept in the population.
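The selection step can be sketched as follows under our reading of the description above: one non-dominated sort uses (parameters, error) and a second uses (parameters, speed of accuracy increase), and individuals are taken stage by stage from the merged fronts until the population is filled. The helper names are ours, and the NSGA-III niching step is omitted for brevity.

```python
from typing import List, Tuple

def dominates(f1, f2) -> bool:
    # Domination check from Definition 1 (all objectives minimized).
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def non_dominated_fronts(objs: List[Tuple[float, ...]]) -> List[List[int]]:
    """Plain O(n^2) non-dominated sorting; returns indices grouped by Pareto front."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def pnsga3_select(params, error, error_drop_speed, k: int) -> List[int]:
    """Sketch of pNSGA-III selection (our reading of the paper; niching omitted).

    Two sorts are merged: (params, error) and (params, -speed), so a large model
    whose accuracy is still improving quickly is kept even if its current error
    is not yet competitive with the small models.
    """
    fronts_a = non_dominated_fronts([(p, e) for p, e in zip(params, error)])
    fronts_b = non_dominated_fronts([(p, -s) for p, s in zip(params, error_drop_speed)])
    selected, stage = [], 0
    while len(selected) < k and stage < max(len(fronts_a), len(fronts_b)):
        merged = (fronts_a[stage] if stage < len(fronts_a) else []) + \
                 (fronts_b[stage] if stage < len(fronts_b) else [])
        for i in merged:                         # fill the next generation stage by stage
            if i not in selected and len(selected) < k:
                selected.append(i)
        stage += 1
    return selected

# Toy usage: 4 candidates described by (params M, error %, error-drop speed), keep 2.
print(pnsga3_select(params=[2.4, 3.3, 3.6, 5.0],
                    error=[3.1, 3.0, 3.4, 3.6],
                    error_drop_speed=[0.01, 0.03, 0.05, 0.06], k=2))
```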

Figure 1 illustrates the models selected in every iteration by the conventional NSGA-III and the modified pNSGA-III, respectively. The pNSGA-III provides models covering a wide range of model sizes with performance comparable to that of NSGA-III.

Continuous Evolution for CARS

In summary, the proposed CARS for searching optimal neural architectures has two stages: 1) architecture optimization and 2) parameter optimization. In addition, a parameter warmup is introduced for the first few epochs.

Parameter Warmup.

Since the shared weights of our supernet are initialized randomly, if the architectures in the population were also initialized randomly, the blocks used most frequently across the architectures would be trained more often than the others. Thus, following one-shot NAS methods [1, 11], we use a uniform sampling strategy to initialize the parameters of the supernet. In this way, the supernet trains each possible architecture with the same probability, which means each path in the searching space is sampled equally often. For example, in the DARTS [21] pipeline there are 8 different operations for each node, including convolutions, pooling, identity mapping, and no connection, so each operation is trained with a probability of 1/8. This initialization step trains the shared weights of the supernet for a few epochs.
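A minimal sketch of the uniform-sampling warmup, assuming a simplified DARTS-like space in which each node picks one of the 8 operations; the function names and the training step are placeholders.

```python
import random

# Sketch of parameter warmup by uniform path sampling (our illustration): every
# candidate operation of every node is sampled with equal probability (1/8 in the
# DARTS-like space), so no block is trained more often than the others.
NUM_OPS = 8      # 4 convolutions, 2 poolings, identity, none (DARTS-style space)
NUM_NODES = 4

def sample_uniform_architecture():
    # Each node independently picks one operation index with probability 1/NUM_OPS.
    # (Simplification: one operation per node instead of the full two-input cell.)
    return [random.randrange(NUM_OPS) for _ in range(NUM_NODES)]

def warmup(supernet_step, num_epochs, steps_per_epoch):
    """supernet_step(arch) is assumed to run one forward/backward/SGD step on the
    shared weights restricted to the sampled architecture `arch`."""
    for _ in range(num_epochs * steps_per_epoch):
        supernet_step(sample_uniform_architecture())

# Toy usage with a no-op training step.
warmup(lambda arch: None, num_epochs=2, steps_per_epoch=3)
```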

Architecture Optimization.

After initializing the parameters of the supernet, we first randomly sample P different architectures, where P is the number of individuals maintained on the Pareto front by pNSGA-III; P is a hyper-parameter. During the architecture evolution step, we first generate t × P offsprings, where t is the offspring expand ratio that controls the number of offsprings, and then use pNSGA-III to select P individuals from the resulting (1 + t) × P individuals, as sketched below.
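One architecture optimization iteration can be sketched as follows, under our assumptions; `pnsga3_select` refers to the selection sketch given earlier, while `evaluate`, `crossover`, `mutate`, and `random_arch` are hypothetical helpers (the genetic operators themselves are detailed in the Evolution Details section).

```python
import random

def evolve_one_iteration(parents, expand_ratio, evaluate, pnsga3_select,
                         crossover, mutate, random_arch):
    """One architecture-optimization iteration (sketch under our assumptions).

    parents       : current population of P connection encodings
    expand_ratio  : offspring expand ratio t, so t * P offsprings are generated
    evaluate      : arch -> (params, error, error_drop_speed) measured on the val set
    pnsga3_select : selection routine as in the earlier pNSGA-III sketch
    crossover / mutate / random_arch : hypothetical genetic operators
    """
    P = len(parents)
    offsprings = []
    for _ in range(int(expand_ratio * P)):
        r = random.random()
        if r < 0.25:                                   # crossover ratio 0.25
            offsprings.append(crossover(*random.sample(parents, 2)))
        elif r < 0.5:                                  # mutation ratio 0.25
            offsprings.append(mutate(random.choice(parents)))
        else:                                          # otherwise a fresh random architecture
            offsprings.append(random_arch())
    candidates = parents + offsprings                  # (1 + t) * P candidates in total
    params, error, speed = zip(*(evaluate(a) for a in candidates))
    keep = pnsga3_select(list(params), list(error), list(speed), k=P)
    return [candidates[i] for i in keep]
```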

Parameter Optimization.

Given the set of architectures, we use the proposed mini-batch architecture update scheme for parameter optimization according to Eqn 3.

Algorithm 1 summarizes the detailed procedure of the proposed continuous evolutionary algorithm for searching neural architectures.

Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  Search Method
DenseNet-BC [15] 3.46 25.6 - manual
PNAS [19] 3.41 3.2 225 SMBO
ENAS + cutout [26] 2.91 4.2 4 RL
NASNet-A + cutout [41] 2.65 3.3 2000 RL
AmoebaNet-A + cutout [27] 3.12 3.1 3150 evolution
Hierarchical evolution [20] 3.75 15.7 300 evolution
SNAS (mild) + cutout [36] 2.98 2.9 1.5 gradient
SNAS (moderate) + cutout [36] 2.85 2.8 1.5 gradient
SNAS (aggressive) + cutout [36] 3.10 2.3 1.5 gradient
DARTS (first) + cutout [21] 3.00 3.3 1.5 gradient
DARTS (second) + cutout [21] 2.76 3.3 4 gradient
RENA [39] 3.87 3.4 - RL
NSGANet [23] 3.85 3.3 8 evolution
LEMONADE [8] 3.05 4.7 80 evolution
CARS-A 3.00 2.4 0.4 evolution
CARS-B 2.87 2.7 0.4 evolution
CARS-C 2.84 2.8 0.4 evolution
CARS-D 2.95 2.9 0.4 evolution
CARS-E 2.86 3.0 0.4 evolution
CARS-F 2.79 3.1 0.4 evolution
CARS-G 2.74 3.2 0.4 evolution
CARS-H 2.66 3.3 0.4 evolution
CARS-I 2.62 3.6 0.4 evolution
Table 1: Comparison with state-of-the-art image classifiers on the CIFAR-10 dataset. The multi-objectives used for architecture optimization are performance and model size.

Search Time Analysis

During the searching stage of CARS, the training data is split into a train set and a val set; the train set is used for parameter optimization, while the val set is used for the architecture optimization stage. Assume the average time for training one architecture for one epoch on the train set is T_tr and the inference time on the val set is T_val. The warmup stage takes E_warm epochs, so it needs T_warm = E_warm × T_tr to initialize the parameters W of the supernet N.

Assume the architectures evolve for E_evo iterations in total, and that each iteration contains a parameter optimization stage and an architecture optimization stage. The parameter optimization stage trains the supernet for E_param epochs on the train set during one evolution iteration, so its time cost in one iteration is T_param = E_param × B × T_tr when the architecture mini-batch size is B. In the architecture optimization stage, all the individuals can be evaluated in parallel, so its time cost is T_arch = T_val. The total time for E_evo evolution iterations is therefore E_evo × (T_param + T_arch), and the overall searching time of CARS is

T_total = T_warm + E_evo × (T_param + T_arch) = E_warm × T_tr + E_evo × (E_param × B × T_tr + T_val).    (5)

Experiments

In this section, we first introduce the datasets, backbone, and evolution details used in our experiments. We then search a set of architectures with the proposed CARS on the CIFAR-10 [17] dataset and show the necessity of the proposed pNSGA-III over the widely used NSGA-III [4] during continuous evolution. We consider three objectives: besides performance, we separately consider device-aware latency and device-agnostic model size. Finally, we evaluate the searched architectures on the CIFAR-10 dataset and the ILSVRC2012 [5] large-scale recognition dataset to demonstrate the effectiveness of CARS.

Experimental Settings

Datasets.

Our CARS experiments are performed on the CIFAR-10 [17] and ILSVRC2012 [5] image classification tasks, two popular recognition benchmarks. We search architectures on the CIFAR-10 dataset, and the searched architectures are evaluated on both CIFAR-10 and ILSVRC2012.

Supernet Backbones.

To illustrate the effectiveness of our method, we evaluate CARS on the state-of-the-art NAS system DARTS [21]. DARTS is a differentiable NAS system that searches for cells and shares the searched cells from shallow to deep layers. The searching space contains 8 different blocks, including four types of convolution, two kinds of pooling, skip connection, and no connection. DARTS searches for two kinds of topology, a normal cell and a reduction cell. The normal cell is used for layers whose input and output features have the same spatial size, while the reduction cell is used for layers that downsample the input feature maps. After searching for these two kinds of cells, the network is constructed by stacking the searched cells.

Evolution Details.

In the DARTS method, each intermediate node in a cell is connected with two previous nodes, so each node has its own searching space. Crossover and mutation are conducted on the corresponding nodes that share the same searching space. The crossover ratio and the mutation ratio are both set to 0.25, and we randomly generate new architectures with the remaining probability of 0.5. For the crossover operation, each node has a probability of 0.5 of crossing over its connections, and for the mutation operation, each node has a probability of 0.5 of being randomly reassigned. A sketch of these operators is given below.
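A minimal sketch of these node-level operators, assuming each architecture is encoded as one operation index per node (a simplification of the full DARTS cell encoding); all names are illustrative.

```python
import random

NUM_OPS = 8   # size of each node's searching space in the DARTS-like backbone

def crossover(parent_a, parent_b, node_ratio=0.5):
    # Each node takes its connection from parent_b with probability 0.5,
    # otherwise it keeps parent_a's choice.
    return [b if random.random() < node_ratio else a
            for a, b in zip(parent_a, parent_b)]

def mutate(parent, node_ratio=0.5):
    # Each node is randomly reassigned within its own searching space
    # with probability 0.5.
    return [random.randrange(NUM_OPS) if random.random() < node_ratio else a
            for a in parent]

# Toy usage on 4-node encodings.
a, b = [0, 3, 5, 7], [1, 2, 6, 4]
print(crossover(a, b))
print(mutate(a))
```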

Experiments on CIFAR-10

Architecture  Top-1 Acc (%)  Top-5 Acc (%)  Params (M)  FLOPs (M)  Search Cost (GPU days)  Search Method
ResNet50 [13] 75.3 92.2 25.6 4100 - manual
InceptionV1 [32] 69.8 90.1 6.6 1448 - manual
MobileNetV2 (1×) [29] 72.0 90.4 3.4 300 - manual
ShuffleNetV2 (2×) [24] 74.9 90.1 7.4 591 - manual
PNAS [19] 74.2 91.9 5.1 588 224 SMBO
SNAS (mild) [36] 72.7 90.8 4.3 522 1.5 gradient
DARTS [21] 73.3 91.3 4.7 574 4 gradient
PARSEC [3] 74.0 91.6 5.6 548 1 gradient
NASNet-A [41] 74.0 91.6 5.3 564 2000 RL
NASNet-B [41] 72.8 91.3 5.3 488 2000 RL
NASNet-C [41] 72.5 91.0 4.9 558 2000 RL
AmoebaNet-A [27] 74.5 92.0 5.1 555 3150 evolution
AmoebaNet-B [27] 74.0 91.5 5.3 555 3150 evolution
AmoebaNet-C [27] 75.7 92.4 6.4 570 3150 evolution
CARS-A 72.8 90.8 3.7 430 0.4 evolution
CARS-B 73.1 91.3 4.0 463 0.4 evolution
CARS-C 73.3 91.4 4.2 480 0.4 evolution
CARS-D 73.3 91.5 4.3 496 0.4 evolution
CARS-E 73.7 91.6 4.4 510 0.4 evolution
CARS-F 74.1 91.8 4.5 530 0.4 evolution
CARS-G 74.2 91.9 4.7 537 0.4 evolution
CARS-H 74.7 92.2 4.8 559 0.4 evolution
CARS-I 75.2 92.5 5.1 591 0.4 evolution
Table 2: An overall comparison on the ILSVRC2012 dataset. The CARS models are architectures searched on the CIFAR-10 dataset and then evaluated on ILSVRC2012.

Our experiments on CIFAR-10 include the comparison of NSGA methods, searching, and evaluation. For the NSGA comparison, we compare NSGA-III with the proposed pNSGA-III. After that, we run CARS on CIFAR-10 and search for sets of architectures that separately consider latency, model size, and performance. Finally, we examine the searched architectures under different computing resources.

Search on CIFAR-10.

To search for a set of architectures with CARS, we split the CIFAR-10 training set into 25,000 images for searching, named the train set, and 25,000 images for evaluation, named the val set; the split strategy is the same as in DARTS and SNAS. We search for 500 epochs in total, and the parameter warmup stage lasts for the first 50 epochs. After that, we initialize the population, which maintains 128 different architectures, and gradually evolve it using the proposed pNSGA-III. The parameter optimization stage uses the train set and trains for 10 epochs during each evolution iteration. The architecture optimization stage uses the val set to update the architectures according to the proposed pNSGA-III. We search separately by considering performance together with model size or latency.

NSGA-III vs. pNSGA-III.

For comparison, we run CARS with the two different NSGA methods during the architecture optimization stage; the objectives are model size and validation performance. We visualize the growth trend under the different architecture optimization methods after the parameter warmup stage. As Figure 1 shows, CARS with NSGA-III encounters the Small Model Trap problem, because small models tend to eliminate large models during the architecture optimization stage. In contrast, architecture optimization with pNSGA-III protects the larger models, which have the potential to increase their accuracy in later epochs but converge more slowly than small models at the beginning. To search for models under various computing resources, it is essential to keep larger models in the population rather than dropping them during the architecture optimization stage.

Search Time Analysis.

For the experiment considering model size and performance, training one epoch on the train set takes around 1 minute, and inference on the val set takes around 5 seconds. The first initialization stage trains for 50 epochs, so its time cost is around an hour. The continuous evolution algorithm then evolves the architectures for E_evo iterations. For the architecture optimization stage in one iteration, the parallel evaluation time is T_val. For the parameter optimization stage, we set B to 1 in our experiments, since different architecture batch sizes do not affect the overall growth trend of the individuals in the population (discussed in the supplementary material), and we train the supernet for 10 epochs in one evolution iteration, so the parameter optimization time is about 10 minutes per iteration. The total time for the evolution iterations is around 9 hours, and the total searching time is around 0.4 GPU days. For the experiment considering latency and performance, the runtime latency of each model has to be evaluated during the architecture optimization step, so the searching time is around 0.5 GPU days.
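As a rough consistency check of Eqn 5 with these numbers (our arithmetic; the number of evolution iterations is our assumption, derived as (500 - 50) / 10 = 45 from the reported epoch counts):

```python
# Back-of-the-envelope check of Eqn 5 with the reported measurements.
T_tr  = 1.0 / 60        # ~1 minute per training epoch, in hours
T_val = 5.0 / 3600      # ~5 seconds per parallel evaluation, in hours
E_warm, E_param, B, E_evo = 50, 10, 1, 45   # E_evo = (500 - 50) / 10 is our assumption

T_warm = E_warm * T_tr                          # ~0.8 h warmup
T_evo  = E_evo * (E_param * B * T_tr + T_val)   # ~7.6 h of evolution iterations
total  = T_warm + T_evo
print(f"total ≈ {total:.1f} h ≈ {total / 24:.2f} GPU days")
# Prints roughly 8.4 h ≈ 0.35 GPU days, in the vicinity of the reported ~0.4 GPU days;
# the gap comes from the approximate per-epoch timings.
```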

Evaluate on CIFAR-10.

After the search finishes, our method keeps the architectures maintained in the population, and we evaluate those whose numbers of parameters are similar to previous works [21, 36] for comparison. We retrain the searched architectures on the CIFAR-10 dataset with all the training data and evaluate them on the test set. All the training hyper-parameters are the same as in DARTS [21].

We compare the searched architectures with state-of-the-art methods that use a similar searching space in Table 1. All the searched architectures can be found in the supplementary material. Our searched architectures have parameter counts ranging from 2.4M to 3.6M on the CIFAR-10 dataset, and their performance is on par with state-of-the-art NAS methods. In contrast, if we evolve with the NSGA-III method, we only obtain a set of architectures with approximately 2.4M parameters and no larger models, and those models perform relatively poorly.

Compared with previous methods such as DARTS and SNAS, our method is capable of searching for complete architectures over a large range of the searching space. Our CARS-G model achieves accuracy comparable to DARTS (second order), with an error rate of approximately 2.75%, at a smaller model size. With the same 3.3M parameters as DARTS (second order), our CARS-H achieves a lower test error, 2.66% vs. 2.76%. For the small models, our searched CARS-A/C/D also achieve results comparable to SNAS. Besides, our large model CARS-I achieves a lower error rate of 2.62% with slightly more parameters. The overall trend from CARS-A to CARS-I is that the error rate gradually decreases as the model size increases, and these models are all Pareto solutions. Compared with other multi-objective methods such as RENA [39], NSGANet [23], and LEMONADE [8], our searched architectures also show superior performance.

Figure 2: CARS-H and DARTS. On the top are the normal and reduction blocks of CARS-H; on the bottom are the normal and reduction blocks of DARTS (second order).

Comparison of Searched Blocks.

To gain a more explicit understanding of the proposed method, we further visualize the normal and reduction blocks searched by CARS and DARTS in Figure 2. CARS-H and DARTS (second order) have a similar number of parameters (3.3M), but CARS-H has higher accuracy. As shown in Figure 2, the CARS-H reduction block contains more parameters for preserving more useful information, while the normal block of CARS-H is smaller than that of DARTS (second order), avoiding unnecessary computation. This is mainly because the proposed EA-based method has a much larger searching space and the genetic operations can effectively jump out of local optima, which demonstrates its superiority.

Evaluate on ILSVRC2012

For the architectures searched on the CIFAR-10 dataset, we evaluate their transferability on the ILSVRC2012 dataset. We train on 8 Nvidia Tesla V100 GPUs in parallel with a batch size of 640. We train for 250 epochs in total; the learning rate is set to 0.5 with a linear decay scheduler, and we warm up [10] the learning rate during the first five epochs because of the large batch size. Momentum is set to 0.9 and weight decay to 3e-5. Label smoothing is also applied with a smoothing ratio of 0.1.
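The recipe above maps onto a standard PyTorch training setup roughly as follows (a sketch under our assumptions; `model` is a placeholder, and the data loading and distributed training across the 8 V100 GPUs are omitted).

```python
import torch

# Sketch of the ILSVRC2012 retraining schedule described above (our mapping to
# PyTorch, not the released training script).
EPOCHS, WARMUP_EPOCHS = 250, 5
BASE_LR, MOMENTUM, WEIGHT_DECAY = 0.5, 0.9, 3e-5

model = torch.nn.Linear(10, 1000)   # placeholder for a searched CARS network
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR,
                            momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing 0.1

def lr_factor(epoch):
    # Linear warmup over the first 5 epochs, then linear decay to zero at epoch 250.
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    return max(0.0, 1.0 - (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(EPOCHS):
    # ... one pass over the ImageNet train loader (batch size 640) would go here ...
    scheduler.step()
```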

The evaluation results demonstrate the transferability of our searched architectures. Our models cover an extensive range of parameters, from 3.7M to 5.1M, with 430 to 590 MFLOPs. For different deployment environments, we can easily select an architecture that satisfies the given restrictions.

Figure 3: CARS-Lat models are searched on the CIFAR-10 dataset considering mobile-device latency and validation performance. The top-1 accuracy is the searched models' performance on the ILSVRC2012 dataset.

Our CARS-I surpasses PNAS by 1% top-1 accuracy with the same number of parameters and comparable FLOPs. CARS-G surpasses DARTS by 0.9% top-1 accuracy with the same number of parameters, and CARS-D surpasses SNAS (mild) by 0.6% top-1 accuracy with the same number of parameters. Against the various NASNet and AmoebaNet models, our method also offers models that achieve higher accuracy with the same number of parameters. By using the proposed pNSGA-III, larger architectures such as CARS-I are protected during the architecture optimization stages. Because of the efficient parameter sharing, we can obtain a set of superior transferable architectures in a single search.

For the experiments that search architectures by runtime latency and performance, we use the runtime latency measured on a mobile phone as one objective and the performance as the other; these two objectives are used to generate the next population. We evaluate the searched architectures on ILSVRC2012 in Figure 3. The searched architectures cover actual runtime latencies from 40 ms to 90 ms and surpass their counterparts.

Conclusion

Evolutionary-algorithm-based NAS methods are able to find high-performance models, but their searching time is extremely long, mainly because each candidate network is trained separately. To make this process efficient, we propose a continuous evolution architecture search, namely CARS, which maximally utilizes the knowledge learned in the latest evolution iteration, such as the architectures and the parameters. A supernet is constructed with a considerable number of cells and blocks, and individuals are generated through the classical operations of an evolutionary algorithm. The non-dominated sorting strategy is used to select architectures with different model sizes and high accuracy for updating the supernet. Experiments on benchmark datasets show that the proposed CARS can provide a number of architectures on the Pareto front with high efficiency, e.g., the searching cost on the CIFAR-10 benchmark is only 0.4 GPU days. The searched models are superior to models produced by state-of-the-art methods in terms of both model size/latency and accuracy.

References

Appendix

In this supplementary material, we list all the architectures searched by CARS on the CIFAR-10 dataset and explore the effect of the hyper-parameters used in CARS.

Network Architectures

Table 3 lists all the searched architectures considering model size and validation performance, and Table 4 lists the searched architectures considering latency and validation performance. Notations are the same as in DARTS [21].

Evolution Trend of NSGA-III

Figure 4: The overall evolution trend of all 128 individuals during 15 iterations of multi-objective continuous evolution conducted by NSGA-III.

In this section, we visualize the evolution trend when NSGA-III [4] guides the architecture optimization. Figure 4 illustrates the first 15 generations, drawn from light to dark. The Small Model Trap phenomenon becomes more and more obvious during evolution, and larger models tend to be eliminated. Our proposed pNSGA-III protects large models and thus avoids the Small Model Trap problem.

Impact of Batch Size in pNSGA-III

Figure 5: The evolution trend of pNSGA-III with different architecture mini-batch sizes B.

For the parameter optimization stage in the continuous evolution, B different architectures are sampled from the maintained population to update the parameters. We evaluated different values of B and found that they all grow with a similar trend. Thus we use B = 1 in our experiments, which is similar to ENAS [26].

Architecture Cell Type N1 N2 N3 N4
CARS-A Normal Skip C(k-2) Max3 C(k-2) Max3 C(k-2) Sep3 C(k-2)
Sep5 C(k-1) Avg3 C(k-1) Max3 C(k-1) Dil5 N1
Reduce Avg3 C(k-2) Max3 C(k-2) Max3 C(k-2) Dil5 C(k-2)
Max3 C(k-1) Skip C(k-1) Dil5 C(k-1) Skip N1
CARS-B Normal Sep5 C(k-2) Sep3 C(k-2) Dil3 C(k-2) Avg3 C(k-2)
Dil3 C(k-1) Avg3 N1 Max3 C(k-1) Skip C(k-1)
Reduce Sep5 C(k-2) Sep3 C(k-2) Avg3 C(k-2) Dil3 N2
Skip C(k-1) Max3 C(k-1) Avg3 C(k-1) Max3 C(k-2)
CARS-C Normal Sep5 C(k-2) Skip C(k-2) Skip C(k-2) Sep5 C(k-2)
Skip C(k-1) Skip C(k-1) Max3 C(k-1) Sep3 C(k-1)
Reduce Max3 C(k-2) Sep5 C(k-2) Dil5 C(k-2) Sep5 C(k-2)
Max3 C(k-1) Sep5 C(k-1) Max3 C(k-1) Dil3 C(k-1)
CARS-D Normal Sep5 C(k-2) Skip C(k-2) Skip C(k-2) Sep5 C(k-2)
Dil3 C(k-1) Avg3 C(k-1) Max3 C(k-1) Sep3 C(k-1)
Reduce Max3 C(k-2) Max3 C(k-2) Dil5 C(k-2) Sep5 C(k-1)
Max3 C(k-1) Sep3 C(k-1) Max3 C(k-1) Dil3 C(k-1)
CARS-E Normal Sep3 C(k-2) Skip C(k-2) Avg3 C(k-1) Skip N2
Sep3 C(k-1) Sep3 N1 Sep3 N1 Skip N3
Reduce Skip C(k-2) Avg3 C(k-2) Sep3 N1 Avg3 C(k-2)
Dil3 C(k-1) Skip N1 Max3 C(k-2) Sep3 N3
CARS-F Normal Skip C(k-2) Sep5 C(k-2) Sep5 N2 Skip C(k-2)
Sep5 C(k-1) Skip N1 Max3 C(k-2) Sep3 C(k-1)
Reduce Avg3 C(k-2) Dil3 C(k-2) Sep5 C(k-1) Max3 C(k-2)
Sep5 C(k-1) Dil5 C(k-1) Skip N1 Max3 C(k-1)
CARS-G Normal Max3 C(k-2) Sep3 C(k-2) Dil5 C(k-2) Avg3 C(k-2)
Dil5 C(k-1) Skip C(k-1) Sep4 C(k-1) Sep3 C(k-1)
Reduce Max3 C(k-2) Sep3 C(k-2) Sep3 C(k-2) Avg3 C(k-2)
Sep3 C(k-1) Sep5 C(k-1) Skip C(k-1) Dil3 C(k-1)
CARS-H Normal Sep5 C(k-2) Sep3 C(k-2) Avg3 C(k-2) Sep5 N1
Sep3 C(k-1) Dil5 N1 Skip C(k-1) Max3 C(k-2)
Reduce Sep5 C(k-2) Sep3 C(k-2) Dil3 N1 Sep5 C(k-2)
Max3 C(k-1) Skip C(k-1) Max3 C(k-2) Avg3 N2
CARS-I Normal Sep3 C(k-2) Skip C(k-2) Skip N1 Sep3 C(k-2)
Sep3 C(k-1) Sep5 C(k-1) Sep3 N2 Dil5 N3
Reduce Dil3 C(k-2) Max3 C(k-2) Skip C(k-1) Dil3 C(k-1)
Skip C(k-1) Max3 N1 Sep5 N2 Max3 N3
Table 3: Architectures searched by CARS with the DARTS backbone on the CIFAR-10 dataset considering model size and performance. Notations are the same as in DARTS. Sep3/Sep5 denote separable convolutions with kernel sizes 3 and 5. Dil3/Dil5 denote dilated separable convolutions with kernel sizes 3 and 5. Max3/Avg3 denote max/average pooling with size 3. Skip denotes the identity connection, and None denotes no connection. C(k-2) and C(k-1) denote the two previous cells connected to the current cell C(k). N1 to N4 denote the four intermediate nodes within cell C(k).
Architecture Top-1 (%) Latency (ms) Cell Type N1 N2 N3 N4
CARS-Lat-A 62.6 41.9 Normal Skip C(k-2) Max3 C(k-1) Avg3 C(k-2) Max3 C(k-1)
Skip N1 Skip N2 Skip N1 Skip N3
Reduce Skip C(k-2) Sep5 C(k-1) Dil5 C(k-2) Max3 N1
Skip C(k-2) Max3 C(k-1) Skip C(k-1) Avg3 N3
CARS-Lat-B 67.8 44.9 Normal Skip C(k-2) Skip C(k-1) Skip C(k-1) Dil3 N1
Skip N1 Skip N2 Max3 C(k-2) Max3 N1
Reduce Max3 C(k-2) Max3 C(k-1) Skip C(k-1) Max3 C(k-2)
Sep3 C(k-2) Max3 C(k-1) Dil5 C(k-2) Avg3 N1
CARS-Lat-C 69.5 45.6 Normal Skip C(k-2) Avg3 C(k-1) Skip C(k-2) Skip C(k-1)
Max3 C(k-1) Skip N2 Dil3 N1 Skip N3
Reduce Skip C(k-2) Sep5 C(k-1) Avg3 C(k-1) Sep5 N1
Max3 C(k-2) Max3 C(k-1) Dil5 N1 Skip N3
CARS-Lat-D 71.9 57.6 Normal Sep3 C(k-2) Skip C(k-1) Skip C(k-2) Skip C(k-1)
Skip C(k-1) Avg3 N2 Dil3 N1 Skip N3
Reduce Dil5 C(k-2) Sep5 C(k-1) Avg3 C(k-1) Sep5 N1
Max3 C(k-2) Max3 C(k-1) Dil5 N1 Skip N3
CARS-Lat-E 72.0 60.8 Normal Dil5 C(k-2) Skip C(k-1) Skip C(k-1) Avg3 N1
Skip C(k-1) Avg3 N1 Skip C(k-2) Max3 C(k-1)
Reduce Skip C(k-2) Dil3 C(k-1) Sep3 C(k-2) Sep3 N1
Dil3 C(k-2) Avg3 N2 Sep3 C(k-1) Sep5 N1
CARS-Lat-F 72.2 64.5 Normal Sep5 C(k-2) Skip C(k-1) Skip C(k-2) Avg3 C(k-1)
Dil3 C(k-1) Max3 C(k-2) Skip C(k-2) Skip C(k-1)
Reduce Sep5 C(k-2) Max3 C(k-1) Sep5 C(k-1) Skip N1
Sep5 C(k-2) Sep5 C(k-1) Sep5 N2 Dil3 N3
CARS-Lat-G 74.0 89.3 Normal Sep5 C(k-2) Skip C(k-1) Sep3 C(k-2) Sep5 N1
Dil3 C(k-1) Max3 C(k-2) Skip C(k-2) Skip C(k-1)
Reduce Sep5 C(k-2) Max3 C(k-1) Sep3 C(k-2) Avg3 N1
Sep5 C(k-2) Sep5 C(k-1) Sep5 N2 Dil3 N3
Table 4: Architectures searched by CARS with the DARTS backbone on the CIFAR-10 dataset considering latency and performance. Notations are the same as in DARTS. Sep3/Sep5 denote separable convolutions with kernel sizes 3 and 5. Dil3/Dil5 denote dilated separable convolutions with kernel sizes 3 and 5. Max3/Avg3 denote max/average pooling with size 3. Skip denotes the identity connection, and None denotes no connection. C(k-2) and C(k-1) denote the two previous cells connected to the current cell C(k). N1 to N4 denote the four intermediate nodes within cell C(k).