### CARS

PyTorch code for the paper: CARS: Continuous Evolution for Efficient Neural Architecture Search


Search techniques in most existing neural architecture search (NAS) algorithms are dominated by differentiable methods for efficiency reasons. In contrast, we develop an efficient continuous evolutionary approach for searching neural networks. Architectures in the population, which share parameters within one supernet in the latest iteration, are tuned on the training dataset for a few epochs. The search in the next evolution iteration directly inherits both the supernet and the population, which accelerates the generation of optimal networks. The non-dominated sorting strategy is further applied to preserve only results on the Pareto front for accurately updating the supernet. Several neural networks with different model sizes and performance are produced after the continuous search with only 0.4 GPU days. As a result, our framework provides a series of networks with the number of parameters ranging from 3.7M to 5.1M under mobile settings. These networks surpass those produced by the state-of-the-art methods on the benchmark ImageNet dataset.


A network architecture search problem always contains two steps: network parameter optimization and network architecture optimization. The network parameter optimization step adjusts the parameters in the standard layers (*i.e.*, convolution, batch normalization, fully connected layer), which is similar to training a standard neural network.

The parameter optimization step contains independent and sharing optimization. Independent optimization learns each network separately; *e.g.*, AmoebaNet [27] takes thousands of GPU days to evaluate thousands of models. To accelerate training, [8, 7] initialize parameters by network morphism. The one-shot method [1] goes a step further by sharing all the parameters for different architectures within one supernet. Rather than training thousands of different architectures, only one supernet needs to be optimized.

The architecture optimization step includes RL-based, EA-based, and gradient-based approaches. RL-based methods [40, 41, 26] use recurrent networks as the network architecture controller, and the performance of the generated architectures is used as the reward for training the controller. The controller converges during training and finally outputs architectures with superior performance. EA-based approaches [35, 27, 33, 30] search architectures with the help of evolutionary algorithms. The validation performance of each individual is used as the fitness to evolve the next generation. Gradient-based approaches [21, 36, 34] view the network architecture as a set of learnable parameters and optimize the architecture with the standard back-propagation algorithm.

Considering multiple complementary objectives, *i.e.*, performance, the number of parameters, floating-point operations (FLOPs), and latency, no single architecture surpasses all the others on all the objectives. Therefore, a set of architectures on the Pareto front is desired. Many works have been proposed to deal with multi-objective network architecture search. NEMO [16] targets speed and accuracy. DPPNet and LEMONADE [6, 8] consider device-related and device-agnostic objectives. MONAS [14] targets accuracy and energy. NSGANet [23] considers FLOPs and accuracy. These methods are less efficient because the models are optimized separately. In contrast, our architecture optimization and parameter optimization steps are conducted iteratively, rather than first fully training a set of parameters and then optimizing architectures. Besides, the parameters for different architectures are shared, which makes the search much more efficient.

In this section, we develop a novel continuous evolutionary approach for searching neural architectures, which consists of two procedures, *i.e.*, parameter optimization and architecture optimization.

We use a Genetic Algorithm (GA) for architecture evolution, because GA provides a vast search space. We maintain a set of architectures (a.k.a. connections) $C = \{C_1, \dots, C_P\}$, where $P$ is the population size. The architectures in the population are gradually updated according to the proposed pNSGA-III method during the architecture optimization step. To make the search stage efficient, we maintain a supernet $\mathcal{N}$ which shares parameters $W$ across different architectures; this dramatically reduces the computational cost of separately training these architectures during the search.

Different architectures are sampled from the supernet $\mathcal{N}$, and each network $\mathcal{N}_i$ can be represented by a set of full-precision parameters $W_i$ and a set of binary connection parameters $C_i$ (*i.e.*, with entries in $\{0, 1\}$). A 0-element means the network does not contain this connection to transform the data flow, and a 1-element means the network uses this connection. From this point of view, each network $\mathcal{N}_i$ can be represented as a $(W_i, C_i)$ pair.
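The pairing of shared weights with a binary connection mask can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' implementation: each candidate connection owns a single scalar weight (real supernets hold tensors), and the names `shared_weights`, `sample_weights`, `arch_a`, and `arch_b` are hypothetical.

```python
# Toy supernet with 6 candidate connections, one scalar weight each.
# Real supernets hold a tensor per connection; scalars keep the sketch short.
shared_weights = [0.5, -1.2, 0.3, 0.8, -0.4, 1.1]   # shared by all architectures


def sample_weights(weights, connections):
    """Keep a weight where the binary mask is 1 and zero it where the mask is 0."""
    if len(weights) != len(connections):
        raise ValueError("mask and weights must have the same length")
    return [w if c == 1 else 0.0 for w, c in zip(weights, connections)]


arch_a = [1, 0, 1, 1, 0, 0]   # uses connections 0, 2, 3
arch_b = [0, 1, 1, 0, 0, 1]   # uses connections 1, 2, 5

weights_a = sample_weights(shared_weights, arch_a)
weights_b = sample_weights(shared_weights, arch_b)
```

Both architectures read the weight of connection 2 from the same shared entry, which is what lets one training step benefit several individuals at once.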

The full-precision parameters $W$ are shared by a set of networks. If these network architectures are fixed, the optimal $W$ can be found by standard back-propagation, so that it fits all the networks and achieves higher recognition performance. After the parameters have converged, we alternately optimize the binary connections $C$ with the GA. These two stages form the main optimization of our proposed continuous evolution algorithm and are carried out in an alternating manner. We introduce these two kinds of update in the following.

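The parameter $W$ is the collection of all the parameters in the supernet. The parameter of the $i$-th individual is $W_i = W \odot C_i$, $i \in \{1, \dots, P\}$, where $\odot$ is the mask operation which keeps the parameters of the complete graph only at the positions corresponding to 1-elements in the connection $C_i$. With input data $X$ fed into the network, the prediction of this network is $P_i = \mathcal{N}_i(X, W_i)$, where $\mathcal{N}_i$ is the $i$-th architecture and $W_i$ is the sampled weights. Given the ground truth $Y$, the prediction loss can be expressed as $L_i = \mathcal{H}(P_i, Y)$. The gradient of $W_i$ can be calculated as

$$dW_i = \frac{\partial L_i}{\partial W_i} = \frac{\partial L_i}{\partial W} \odot C_i. \tag{1}$$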

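The parameters $W$ should fit all the individuals, and thus the gradients of all networks are accumulated to calculate the gradient of $W$:

$$dW = \frac{1}{P} \sum_{i=1}^{P} dW_i = \frac{1}{P} \sum_{i=1}^{P} \frac{\partial L_i}{\partial W} \odot C_i. \tag{2}$$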

Any layer is optimized only by the networks that use this layer during the forward pass. By collecting the gradients of individuals in the population, the parameters $W$ are updated through the SGD algorithm.

As we maintain a large set of architectures with shared weights in the supernet, we borrow the idea of stochastic gradient descent and use mini-batches of architectures for updating parameters, since accumulating the gradients of all networks would take much time for a single gradient step. We use $B$ different architectures, where $B < P$, with indexes $\{n_1, \dots, n_B\}$, to update parameters. The efficient parameter update of Eqn 2 is detailed in Eqn 3:

$$dW \approx \frac{1}{B} \sum_{j=1}^{B} \frac{\partial L_{n_j}}{\partial W_{n_j}}. \tag{3}$$

Hence, the gradient over a mini-batch of architectures is taken as an approximation of the averaged gradients of all the different individuals. The time cost for each update could be largely reduced, and the appropriate mini-batch size leads to a balance between efficiency and accuracy.
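A minimal sketch of this mini-batch update follows, assuming a toy masked linear model with a squared-error loss in place of a real network. The function names, the toy population, and the learning rate are illustrative choices, not values from the paper.

```python
import random


def masked_gradient(weights, mask, x, y):
    """Gradient of the squared error of a masked linear model w.r.t. shared weights.

    The prediction is sum_k weights[k] * mask[k] * x[k]. Positions with mask 0
    receive a zero gradient, so only connections the architecture uses are updated.
    """
    pred = sum(w * c * xi for w, c, xi in zip(weights, mask, x))
    err = pred - y
    return [2.0 * err * xi * c for xi, c in zip(x, mask)]


def minibatch_arch_step(weights, population, x, y, batch_size, lr, rng):
    """One SGD step on shared weights using batch_size architectures drawn without replacement."""
    batch = rng.sample(population, batch_size)
    grad = [0.0] * len(weights)
    for mask in batch:
        g = masked_gradient(weights, mask, x, y)
        grad = [a + b for a, b in zip(grad, g)]
    grad = [g / batch_size for g in grad]      # average over the mini-batch
    return [w - lr * g for w, g in zip(weights, grad)]


rng = random.Random(0)
population = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
W = [0.1, 0.2, 0.3, 0.4]
W_new = minibatch_arch_step(W, population, x=[1.0, 2.0, 3.0, 4.0], y=1.0,
                            batch_size=2, lr=0.01, rng=rng)
```

Setting `batch_size=len(population)` recovers the full accumulation over all individuals. The last connection is used by no architecture in this toy population, so its weight never changes.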

For the architecture optimization procedure, we use an evolutionary algorithm with the non-dominated sorting strategy introduced in NSGA-III [4]. Denote $\{\mathcal{N}_1, \dots, \mathcal{N}_P\}$ as $P$ different networks and $\{\mathcal{F}_1, \dots, \mathcal{F}_M\}$ as $M$ different measurements we want to minimize. In general, these measurements, including the number of parameters, floating-point operations, latency, and performance, can conflict with each other, which makes it harder to discover an optimal solution that minimizes all of them.

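In practice, if architecture $\mathcal{N}_i$ dominates $\mathcal{N}_j$, the performance of $\mathcal{N}_i$ is no worse than that of $\mathcal{N}_j$ on all the measurements, and $\mathcal{N}_i$ behaves better on at least one metric. Formally, the definition of domination is given below.

Consider two networks $\mathcal{N}_i$ and $\mathcal{N}_j$, and a series of measurements $\{\mathcal{F}_1, \dots, \mathcal{F}_M\}$ we want to minimize. If

$$\begin{aligned} \mathcal{F}_k(\mathcal{N}_i) &\le \mathcal{F}_k(\mathcal{N}_j), \quad \forall k \in \{1, \dots, M\}, \\ \mathcal{F}_k(\mathcal{N}_i) &< \mathcal{F}_k(\mathcal{N}_j), \quad \exists k \in \{1, \dots, M\}, \end{aligned} \tag{4}$$

$\mathcal{N}_i$ is said to dominate $\mathcal{N}_j$, *i.e.*, $\mathcal{N}_i \preceq \mathcal{N}_j$.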

According to the above definition, if $\mathcal{N}_i$ dominates $\mathcal{N}_j$, then $\mathcal{N}_j$ can be replaced by $\mathcal{N}_i$ during the evolution procedure, since $\mathcal{N}_i$ performs better in terms of at least one metric and no worse on the other metrics. By exploiting this approach, we can select a series of excellent neural architectures from the population generated in the current iteration. These networks can then be used for updating the corresponding parameters in the supernet.
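The domination test translates directly into code. The sketch below assumes every objective is to be minimised and uses made-up (parameters in M, error in %) pairs. It only extracts the first Pareto front and leaves out the reference-point niching of the full NSGA-III procedure.

```python
def dominates(a, b):
    """True if objective vector a dominates b: no worse everywhere, strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Indices of the points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]


# Hypothetical (params in M, error in %) pairs, both to be minimised.
models = [(2.4, 3.0), (2.8, 2.8), (3.0, 3.1), (3.3, 2.6), (3.5, 2.9)]
front = pareto_front(models)   # models 2 and 4 are dominated and dropped
```

Here model 2 is dominated by model 0 and model 4 by model 1, so the front keeps one candidate per size/accuracy trade-off.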

Although the above non-dominated sorting of NSGA-III [4] helps us select better models for updating parameters, a *small model trap* phenomenon arises during the search. Specifically, since the parameters in the supernet still need optimization, the accuracy of each individual architecture in the current iteration may not reflect the performance it can eventually achieve, as discussed in NASBench-101 [37]. Thus, smaller models with fewer parameters but higher test accuracy tend to dominate larger models that currently have lower fitness but have the potential to reach higher accuracies.

[Figure 1: models selected in each iteration; panels: NSGA-III, pNSGA-III.]

Therefore, we propose pNSGA-III, an improvement of the conventional NSGA-III that protects these larger models. More specifically, pNSGA-III takes the speed at which performance increases into consideration. Take the validation performance and the number of parameters as an example. NSGA-III performs non-dominated sorting on these two objectives and selects individuals according to the sorted Pareto stages. The proposed pNSGA-III, besides considering the number of parameters and performance, also performs a non-dominated sort on the performance increasing speed and the number of parameters. The two resulting sets of Pareto stages are then merged, and individuals are gradually selected, starting from the first stage, to fill the generation. In this way, large networks that converge more slowly can be kept in the population.
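One possible reading of this merge step is sketched below. It is an interpretation rather than the authors' code: stage k of the merged ranking is taken to be the union of the k-th fronts of the two sorts, a partially filled stage is cut in index order, and the example assumes the large model is currently less accurate but still improving quickly. The names `pnsga_select`, `speeds`, and the numbers are hypothetical.

```python
from itertools import zip_longest


def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Split the indices of points into successive Pareto fronts (all objectives minimised)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(points[j], points[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts


def pnsga_select(params, errors, speeds, keep):
    """Pick `keep` indices by merging two non-dominated sorts stage by stage.

    Sort 1 minimises (params, error). Sort 2 minimises (params, -speed), where
    speed is the recent accuracy gain of each individual.
    """
    fronts_perf = non_dominated_sort(list(zip(params, errors)))
    fronts_speed = non_dominated_sort([(p, -s) for p, s in zip(params, speeds)])
    selected = []
    for f1, f2 in zip_longest(fronts_perf, fronts_speed, fillvalue=[]):
        for idx in sorted(set(f1) | set(f2)):
            if idx not in selected:
                selected.append(idx)
                if len(selected) == keep:
                    return selected
    return selected


params = [2.4, 2.8, 3.6]      # model sizes in M (made up)
errors = [5.0, 5.5, 6.0]      # current validation error in % (made up)
speeds = [0.1, 0.05, 0.9]     # recent accuracy gain per epoch (made up)

plain = [i for front in non_dominated_sort(list(zip(params, errors))) for i in front][:2]
protected = pnsga_select(params, errors, speeds, keep=2)
```

With two slots available, sorting on size and error alone keeps the two smallest models, while the merged ranking keeps the largest model because nothing dominates it on (size, improvement speed).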

Figure 1 illustrates the selected models in every iteration using the conventional NSGA-III and the modified pNSGA-III, respectively. It is obvious that the pNSGA-III can provide models with a wide range of model sizes and comparable performance to that of the NSGA-III.

In summary, the proposed CARS for searching optimal neural architectures has two stages: 1) architecture optimization and 2) parameter optimization. In addition, parameter warmup is introduced for the first few epochs.

Since the shared weights of our supernet are initialized randomly, if the architectures in the population are also initialized randomly, the blocks used most frequently across the architectures would be trained more times than other blocks. Thus, following one-shot NAS methods [1, 11], we use a uniform sampling strategy to initialize the parameters in the supernet. In this way, the supernet trains each possible architecture with the same probability, which means each path in the search space is sampled equally. For example, in the DARTS [21] pipeline, there are 8 different operations for each node, including convolution, pooling, identity mapping, and no connection. Each operation is trained with a probability of $\frac{1}{8}$. This warmup step trains the shared weights of the supernet for a few epochs.

After initializing the parameters of the supernet, we first randomly sample $P$ different architectures, where $P$ is the number of individuals maintained on the Pareto front using pNSGA-III, and we treat $P$ as a hyper-parameter. During the architecture evolution step, we first generate $t \times P$ offspring, where $t$ is the hyper-parameter controlling the number of offspring. We then use pNSGA-III to select $P$ individuals from the $(t + 1) \times P$ individuals.
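The uniform sampling used during warmup can be sketched as follows. The number of edges per cell (14) is an assumption made for the illustration, and operations are represented by their index only.

```python
import random
from collections import Counter

NUM_OPS = 8   # size of the operation set mentioned in the text


def sample_uniform_architecture(num_edges, rng):
    """Pick one operation index per edge, each with probability 1 / NUM_OPS."""
    return [rng.randrange(NUM_OPS) for _ in range(num_edges)]


rng = random.Random(0)
counts = Counter()
for _ in range(8000):
    counts.update(sample_uniform_architecture(num_edges=14, rng=rng))

total = sum(counts.values())
freqs = {op: counts[op] / total for op in range(NUM_OPS)}   # each close to 0.125
```

Over many warmup steps every operation is visited about equally often, so no block gets a head start before the evolution begins.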

Given a set of architectures, we use the proposed mini-batch architectures update scheme for parameter optimization according to Eqn 3.

Algorithm 1 summarizes the detailed procedure of the proposed continuous evolutionary algorithm for searching neural architectures.
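The overall loop described in the prose can be outlined as a skeleton. This is a sketch reconstructed from the text, not a transcription of Algorithm 1: the order of the two stages inside an iteration is an assumption, every callable is a placeholder supplied by the caller, and the toy stand-ins below use a single scalar score where CARS would use pNSGA-III on several objectives.

```python
import random


def cars_search(warmup, init_population, train_supernet, make_offspring,
                evaluate, select, iterations, offspring_factor):
    """Skeleton of the alternating parameter / architecture optimization loop."""
    warmup()                                    # parameter warmup
    population = init_population()              # initial random architectures
    size = len(population)
    for _ in range(iterations):
        train_supernet(population)              # parameter optimization
        children = make_offspring(population, offspring_factor * size)
        candidates = population + children      # parents plus offspring
        scores = [evaluate(arch) for arch in candidates]
        population = select(candidates, scores, size)   # back to the original size
    return population


# Toy stand-ins so the skeleton runs: an "architecture" is an integer and its
# score (lower is better) is its distance to 42.
rng = random.Random(0)
log = []
final = cars_search(
    warmup=lambda: log.append("warmup"),
    init_population=lambda: [rng.randrange(100) for _ in range(4)],
    train_supernet=lambda pop: log.append("train"),
    make_offspring=lambda pop, n: [rng.choice(pop) + rng.choice((-1, 1)) for _ in range(n)],
    evaluate=lambda arch: abs(arch - 42),
    select=lambda cands, scores, k: [c for _, c in sorted(zip(scores, cands))[:k]],
    iterations=5,
    offspring_factor=2,
)
```

The point of the skeleton is that both the supernet weights and the population persist across iterations, which is what the text means by continuous evolution.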

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|
| DenseNet-BC [15] | 3.46 | 25.6 | - | manual |
| PNAS [19] | 3.41 | 3.2 | 225 | SMBO |
| ENAS + cutout [26] | 2.91 | 4.2 | 4 | RL |
| NASNet-A + cutout [41] | 2.65 | 3.3 | 2000 | RL |
| AmoebaNet-A + cutout [27] | 3.12 | 3.1 | 3150 | evolution |
| Hierarchical evolution [20] | 3.75 | 15.7 | 300 | evolution |
| SNAS (mild) + cutout [36] | 2.98 | 2.9 | 1.5 | gradient |
| SNAS (moderate) + cutout [36] | 2.85 | 2.8 | 1.5 | gradient |
| SNAS (aggressive) + cutout [36] | 3.10 | 2.3 | 1.5 | gradient |
| DARTS (first) + cutout [21] | 3.00 | 3.3 | 1.5 | gradient |
| DARTS (second) + cutout [21] | 2.76 | 3.3 | 4 | gradient |
| RENA [39] | 3.87 | 3.4 | - | RL |
| NSGANet [23] | 3.85 | 3.3 | 8 | evolution |
| LEMONADE [8] | 3.05 | 4.7 | 80 | evolution |
| CARS-A | 3.00 | 2.4 | 0.4 | evolution |
| CARS-B | 2.87 | 2.7 | 0.4 | evolution |
| CARS-C | 2.84 | 2.8 | 0.4 | evolution |
| CARS-D | 2.95 | 2.9 | 0.4 | evolution |
| CARS-E | 2.86 | 3.0 | 0.4 | evolution |
| CARS-F | 2.79 | 3.1 | 0.4 | evolution |
| CARS-G | 2.74 | 3.2 | 0.4 | evolution |
| CARS-H | 2.66 | 3.3 | 0.4 | evolution |
| CARS-I | 2.62 | 3.6 | 0.4 | evolution |

Table 1: Comparison with state-of-the-art image classifiers on the CIFAR-10 dataset. The multi-objectives used for architecture optimization are performance and model size.

During the search stage of CARS, the training data is split into a train set and a val set; the train set is used for parameter optimization, while the val set is used for the architecture optimization stage. Assume the average training time on the train set for one architecture is $T_{tr}$, and the inference time on the val set is $T_{val}$. The first warmup stage takes $E_{warm}$ epochs, and it needs $T_{warm} = E_{warm} \times T_{tr}$ to initialize the parameters in the supernet $\mathcal{N}$.

Assume the architectures evolve for $E_{evo}$ iterations in total, and each iteration contains a parameter optimization stage and an architecture optimization stage. The parameter optimization stage trains the supernet for $E_{param}$ epochs on the train set during one evolution iteration, so the time cost of parameter optimization in one evolution iteration is $T_{param} = E_{param} \times T_{tr} \times B$ if we consider the mini-batch size $B$. For the architecture optimization stage, all the individuals can be inferred in parallel, so the search time in this stage is $T_{arch} = T_{val}$. Thus the total time cost for $E_{evo}$ evolution iterations is $T_{evo} = E_{evo} \times (T_{param} + T_{arch})$. The total search time of CARS is

$$\begin{aligned} T_{total} &= T_{warm} + T_{evo} \\ &= E_{warm} \times T_{tr} + E_{evo} \times (E_{param} \times T_{tr} \times B + T_{val}). \end{aligned} \tag{5}$$

In this section, we first introduce the datasets, backbone, and evolution details we used in our experiments. We then search a set of architectures using our proposed CARS on CIFAR-10 [17] dataset and show the necessity of proposed pNSGA-III over popularly used NSGA-III [4] during continuous evolution. We consider three objectives. Besides performance, we separately consider device-aware latency and device-agnostic model size. After that, we evaluate our searched architectures on CIFAR-10 dataset and ILSVRC2012 [5] large scale recognition dataset to demonstrate the effectiveness of CARS.

Our CARS experiments are performed on CIFAR-10 [17] and ILSVRC2012 [5] large scale image classification task. These two datasets are popular benchmarks on the recognition task. We search architectures on CIFAR-10 dataset, and the searched architectures are evaluated on CIFAR-10 and ILSVRC2012 datasets.

To illustrate the effectiveness of our method, we evaluate CARS on top of the state-of-the-art NAS system DARTS [21]. DARTS is a differentiable NAS system that searches for cells and shares the searched cells from shallow layers to deeper layers. The search space contains 8 different blocks, including four types of convolution, two kinds of pooling, skip connection, and no connection. DARTS searches for two kinds of topology: the normal cell and the reduction cell. A normal cell is used for layers whose input and output features have the same spatial size. A reduction cell is used for layers that downsample the input feature maps. After searching for these two kinds of cells, the network is constructed by stacking a set of searched cells.

In the DARTS method, each intermediate node in a cell is connected to two previous nodes, so each node has its own search space. Crossover and mutation are conducted on corresponding nodes with the same search space. The crossover ratio and the mutation ratio are both set to 0.25, and we randomly generate new architectures with a probability of 0.5. For the crossover operation, each node has a probability of 0.5 of crossing over its connections, and for the mutation operation, each node has a probability of 0.5 of being randomly reassigned.
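The genetic operators can be sketched on a simplified cell encoding. The encoding is an assumption made for illustration: a cell has four intermediate nodes, node k chooses two distinct inputs among its k + 2 predecessors, and each chosen edge carries one of 8 operation indices. The ratios 0.25, 0.25, and 0.5 are the ones quoted in the text; all function names are hypothetical.

```python
import random

NUM_OPS = 8


def random_node(node_idx, rng):
    """Node k picks two distinct inputs among its k + 2 predecessors, one op per edge."""
    inputs = rng.sample(range(node_idx + 2), 2)
    return [(src, rng.randrange(NUM_OPS)) for src in sorted(inputs)]


def random_cell(num_nodes, rng):
    return [random_node(k, rng) for k in range(num_nodes)]


def crossover(parent_a, parent_b, rng, node_rate=0.5):
    """Each node takes its connections from parent_b with probability node_rate."""
    return [b if rng.random() < node_rate else a for a, b in zip(parent_a, parent_b)]


def mutate(parent, rng, node_rate=0.5):
    """Each node is randomly reassigned with probability node_rate."""
    return [random_node(k, rng) if rng.random() < node_rate else node
            for k, node in enumerate(parent)]


def make_child(population, rng, p_cross=0.25, p_mut=0.25):
    """Crossover with prob. 0.25, mutation with prob. 0.25, otherwise a fresh random cell."""
    r = rng.random()
    if r < p_cross:
        a, b = rng.sample(population, 2)
        return crossover(a, b, rng)
    if r < p_cross + p_mut:
        return mutate(rng.choice(population), rng)
    return random_cell(len(population[0]), rng)


rng = random.Random(0)
population = [random_cell(4, rng) for _ in range(6)]
children = [make_child(population, rng) for _ in range(12)]
```

Because crossover swaps whole nodes at matching positions, every child remains a valid cell in which each node still has exactly two inputs drawn from its own predecessors.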

| Architecture | Top-1 Acc (%) | Top-5 Acc (%) | Params (M) | FLOPs (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|---|
| ResNet50 [13] | 75.3 | 92.2 | 25.6 | 4100 | - | manual |
| InceptionV1 [32] | 69.8 | 90.1 | 6.6 | 1448 | - | manual |
| MobileNetV2 (1×) [29] | 72.0 | 90.4 | 3.4 | 300 | - | manual |
| ShuffleNetV2 (2×) [24] | 74.9 | 90.1 | 7.4 | 591 | - | manual |
| PNAS [19] | 74.2 | 91.9 | 5.1 | 588 | 224 | SMBO |
| SNAS (mild) [36] | 72.7 | 90.8 | 4.3 | 522 | 1.5 | gradient |
| DARTS [21] | 73.3 | 91.3 | 4.7 | 574 | 4 | gradient |
| PARSEC [3] | 74.0 | 91.6 | 5.6 | 548 | 1 | gradient |
| NASNet-A [41] | 74.0 | 91.6 | 5.3 | 564 | 2000 | RL |
| NASNet-B [41] | 72.8 | 91.3 | 5.3 | 488 | 2000 | RL |
| NASNet-C [41] | 72.5 | 91.0 | 4.9 | 558 | 2000 | RL |
| AmoebaNet-A [27] | 74.5 | 92.0 | 5.1 | 555 | 3150 | evolution |
| AmoebaNet-B [27] | 74.0 | 91.5 | 5.3 | 555 | 3150 | evolution |
| AmoebaNet-C [27] | 75.7 | 92.4 | 6.4 | 570 | 3150 | evolution |
| CARS-A | 72.8 | 90.8 | 3.7 | 430 | 0.4 | evolution |
| CARS-B | 73.1 | 91.3 | 4.0 | 463 | 0.4 | evolution |
| CARS-C | 73.3 | 91.4 | 4.2 | 480 | 0.4 | evolution |
| CARS-D | 73.3 | 91.5 | 4.3 | 496 | 0.4 | evolution |
| CARS-E | 73.7 | 91.6 | 4.4 | 510 | 0.4 | evolution |
| CARS-F | 74.1 | 91.8 | 4.5 | 530 | 0.4 | evolution |
| CARS-G | 74.2 | 91.9 | 4.7 | 537 | 0.4 | evolution |
| CARS-H | 74.7 | 92.2 | 4.8 | 559 | 0.4 | evolution |
| CARS-I | 75.2 | 92.5 | 5.1 | 591 | 0.4 | evolution |

Our experiments on CIFAR-10 include NSGA method comparison, searching and evaluation. For NSGA method comparison, we compare NSGA-III with proposed pNSGA-III. After that, we conduct CARS on CIFAR-10 and search for a set of architectures which separately considers latency, model size and performance. At last, we examine the searched architectures with different computing resources.

To search for a set of architectures using CARS, we split the CIFAR-10 training data into 25,000 images for searching, named the train set, and 25,000 images for evaluation, named the val set. The split strategy is the same as in DARTS and SNAS. We search for 500 epochs in total, and the parameter warmup stage lasts for the first 50 epochs. After that, we initialize the population, which maintains $P$ different architectures ($P$ being the population size), and gradually evolve them using the proposed pNSGA-III. The parameter optimization stage uses the train set and trains for 10 epochs during one evolution iteration. The architecture optimization stage uses the val set to update architectures according to the proposed pNSGA-III. We separately search by considering performance together with either model size or latency.

We train CARS with different NSGA methods during architecture optimization stage for comparison and objectives are the model size and validation performance. We visualize the growing trend for different architecture optimization methods after parameter warmup stage. As Figure 1 shows, CARS with NSGA-III would encounter the *Small Model Trap* problem, for small models tend to eliminate large models during architecture optimization stage. In contrast, architecture optimization using pNSGA-III protects larger models which have the potential to increase their accuracies during later epochs but converge slower than small models at the beginning. To search for several models with various computing resources, it is essential to maintain larger models in the population rather than dropping them during architecture optimization stage.

For the experiment considering model size and performance, the training time on the train set $T_{tr}$ is around 1 minute, and the inference time on the val set $T_{val}$ is around 5 seconds. The initial warmup stage trains for 50 epochs, so the time cost of this stage $T_{warm}$ is around an hour. The continuous evolution algorithm evolves the architectures for $E_{evo}$ iterations. For the architecture optimization stage in one iteration, the parallel evaluation time is $T_{arch} = T_{val}$. For the parameter optimization stage, we set the mini-batch size of architectures $B$ to 1 in our experiments, since different batch sizes do not affect the overall growth trend of individuals in the population (discussed in the supplementary material), and we train the supernet for 10 epochs in one evolution iteration, so the parameter optimization time $T_{param}$ is about 10 minutes. Thus the time cost of all evolution iterations $T_{evo}$ is around 9 hours, and the total search time $T_{total}$ is around 0.4 GPU days. For the experiment considering latency and performance, the runtime latency of each model is evaluated during the architecture optimization step, so the search time is around 0.5 GPU days.
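The arithmetic behind these figures can be reproduced with a short calculation. The per-epoch times, warmup length, and epochs per iteration are the values quoted above. The number of evolution iterations is not stated in this passage, so the value 45 used below is an assumption inferred from 500 total epochs minus 50 warmup epochs at 10 epochs per iteration, and the result is only a rough check.

```python
def search_cost_minutes(t_train, t_val, e_warm, e_evo, e_param, batch_archs):
    """Total search time in minutes: warmup plus e_evo evolution iterations."""
    t_warm = e_warm * t_train                    # warmup stage
    t_param = e_param * t_train * batch_archs    # parameter optimization per iteration
    t_arch = t_val                               # parallel evaluation per iteration
    t_evo = e_evo * (t_param + t_arch)
    return t_warm + t_evo


total_minutes = search_cost_minutes(
    t_train=1.0,        # about 1 minute per training epoch
    t_val=5 / 60,       # about 5 seconds of inference on the val set
    e_warm=50,
    e_evo=45,           # assumed, see the lead-in
    e_param=10,
    batch_archs=1,
)
gpu_days = total_minutes / (60 * 24)
```

Under this assumption the estimate comes out slightly below the roughly 0.4 GPU days reported in the text, which is consistent with the quoted timings being rounded.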

After the CARS search finishes, our method keeps $P$ different architectures ($P$ being the population size), and we evaluate those with a number of parameters similar to previous works [21, 36] for comparison. We retrain the searched architectures on the CIFAR-10 dataset with all the training data and evaluate them on the test set. All the training hyper-parameters are the same as in DARTS [21].

We compare the searched architectures with state-of-the-art methods that use a similar search space in Table 1. All the searched architectures can be found in the supplementary material. The number of parameters of our searched architectures varies from 2.4M to 3.6M on the CIFAR-10 dataset, and their performance is on par with state-of-the-art NAS methods. In contrast, if we evolve with the NSGA-III method, we only obtain a set of architectures with approximately 2.4M parameters and no larger models, and these models perform relatively poorly.

Compared to previous methods like DARTS and SNAS, our method is capable of searching for complete architectures over a large range of the search space. Our CARS-G model achieves accuracy comparable to DARTS (second order), with an error rate of approximately 2.75% and a smaller model size. With the same 3.3M parameters as DARTS (second order), our CARS-H achieves a lower test error, 2.66% vs. 2.76%. For the small models, our searched CARS-A/C/D also achieve results comparable to SNAS. Besides, our large model CARS-I achieves a lower error rate of 2.62% with slightly more parameters. The overall trend from CARS-A to CARS-I is that the error rate gradually decreases as the model size increases. These models are all Pareto solutions. Compared to other multi-objective methods like RENA [39], NSGANet [23], and LEMONADE [8], our searched architectures also show superior performance.

To give an explicit understanding of the proposed method, we further visualize the normal and reduction blocks searched using CARS and DARTS in Figure 5. CARS-H and DARTS (second order) have a similar number of parameters (3.3M), but CARS-H has higher accuracy. Figure 5 shows that there are more parameters in the CARS-H reduction block, for preserving more useful information, and that the normal block of CARS-H is smaller than that of DARTS (second order), which avoids unnecessary computation. This is mainly because the proposed EA-based method has a much larger search space, and the genetic operations can effectively jump out of local optima, which demonstrates its superiority.

For the architectures searched on the CIFAR-10 dataset, we evaluate their transferability on the ILSVRC2012 dataset. We use 8 Nvidia Tesla V100 GPUs to train in parallel, and the batch size is set to 640. We train for 250 epochs in total; the learning rate is set to 0.5 with a linear decay scheduler, and we warm up [10] the learning rate for the first five epochs because of the large batch size. Momentum is set to 0.9, and weight decay is set to 3e-5. Label smoothing is also used with a smoothing ratio of 0.1.
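The schedule and label smoothing described above can be sketched as two small functions. Several details are assumptions because the text does not specify them: the warmup is taken to be linear, the decay is taken to run linearly towards zero by the last epoch, and the smoothing mass is spread over all classes including the true one. Only the values 0.5, 250, 5, and 0.1 come from the text.

```python
def learning_rate(epoch, base_lr=0.5, total_epochs=250, warmup_epochs=5):
    """Linear warmup for the first epochs, then linear decay (0-based epoch index)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * (total_epochs - epoch) / (total_epochs - warmup_epochs)


def smooth_labels(label, num_classes, eps=0.1):
    """One-hot target softened by spreading eps uniformly over all classes."""
    off = eps / num_classes
    return [1.0 - eps + off if k == label else off for k in range(num_classes)]


schedule = [learning_rate(e) for e in range(250)]
target = smooth_labels(label=2, num_classes=10)
```

The rate ramps up to 0.5 by the end of the fifth epoch and then falls steadily, which is a common way to keep early updates stable when the batch size is large.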

The evaluation results show the transferability of our searched architectures. Our models cover an extensive range of parameters, from 3.7M to 5.1M, with 430 to 590 MFLOPs. For different deployment environments, we can easily select an architecture that satisfies the restrictions.

Our CARS-I surpasses PNAS by 1% Top-1 accuracy with the same number of parameters and approximately the same FLOPs. CARS-G shows superior results over DARTS by 0.9% Top-1 accuracy with the same number of parameters. Also, CARS-D surpasses SNAS (mild) by 0.6% Top-1 accuracy with the same number of parameters. For the different models of NASNet and AmoebaNet, our method also has various models that achieve higher accuracy using the same number of parameters. By using the proposed pNSGA-III, larger architectures like CARS-I can be protected during the architecture optimization stages. Because of the efficient parameter sharing, we can search for a set of superior transferable architectures in a single search.

For experiments that search architectures by runtime latency and performance, we use the evaluated runtime latency on the mobile phone as an objective and the performance as another. These two objectives are used for generating the next population. We evaluated the searched architectures on ILSVRC2012 in Figure 3. The searched architectures cover an actual runtime latency from 40ms to 90ms and surpass the counterparts.

Evolutionary-algorithm-based NAS methods are able to find high-performance models, but their search time is extremely long, mainly because each candidate network is trained separately. To make this efficient, we propose continuous evolution architecture search, namely CARS, which maximally utilizes the learned knowledge, such as architectures and parameters, from the latest evolution iteration. A supernet is constructed with considerable cells and blocks. Individuals are generated through the benchmark operations of an evolutionary algorithm. The non-dominated sorting strategy is used to select architectures with different model sizes and high accuracies for updating the supernet. Experiments on benchmark datasets show that the proposed CARS can provide a number of architectures on the Pareto front with high efficiency; *e.g.*, the search cost on the CIFAR-10 benchmark is only 0.4 GPU days. The searched models are superior to models produced by state-of-the-art methods in terms of both model size/latency and accuracy.

- [1] (2018) Understanding and simplifying one-shot architecture search. ICML.
- [2] (2018) Efficient architecture search by network transformation. AAAI.
- [3] (2019) Probabilistic neural architecture search. arXiv.
- [4] (2014) An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part i: solving problems with box constraints. TEC 18 (4).
- [5] (2009) Imagenet: a large-scale hierarchical image database. CVPR.
- [6] (2018) DPP-net: device-aware progressive search for pareto-optimal neural architectures. ECCV.
- [7] (2018) Simple and efficient architecture search for convolutional neural networks. ICLR.
- [8] (2019) Efficient multi-objective neural architecture search via lamarckian evolution. ICLR.
- [9] (2015) Fast r-cnn. ICCV.
- [10] (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv.
- [11] (2019) Single path one-shot neural architecture search with uniform sampling. arXiv.
- [12] (2017) Mask r-cnn. ICCV.
- [13] (2016) Deep residual learning for image recognition. CVPR.
- [14] (2018) Monas: multi-objective neural architecture search using reinforcement learning. arXiv.
- [15] (2017) Densely connected convolutional networks. CVPR.
- [16] (2017) NEMO : neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy. ICML Workshop.
- [17] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
- [18] (2012) Imagenet classification with deep convolutional neural networks. NeurIPS.
- [19] (2018) Progressive neural architecture search. ECCV.
- [20] (2018) Hierarchical representations for efficient architecture search. ICLR.
- [21] (2019) Darts: differentiable architecture search. ICLR.
- [22] (2016) Ssd: single shot multibox detector. ECCV.
- [23] (2019) NSGA-net: a multi-objective genetic algorithm for neural architecture search. GECCO.
- [24] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. ECCV.
- [25] (2019) Evolving deep neural networks. Artificial Intelligence in the Age of Neural Networks and Brain Computing.
- [26] (2018) Efficient neural architecture search via parameter sharing. ICML.
- [27] (2019) Regularized evolution for image classifier architecture search. AAAI.
- [28] (2017) Large-scale evolution of image classifiers. ICML.
- [29] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. CVPR.
- [30] (2019) Co-evolutionary compression for unpaired image translation. ICCV.
- [31] (2015) Very deep convolutional networks for large-scale image recognition. ICLR.
- [32] (2015) Going deeper with convolutions. CVPR.
- [33] (2018) Towards evolutionary compression. SIGKDD.
- [34] (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. CVPR.
- [35] (2017) Genetic cnn. ICCV.
- [36] (2019) SNAS: stochastic neural architecture search. ICLR.
- [37] (2019) NAS-bench-101: towards reproducible neural architecture search. ICML.
- [38] (2018) Practical block-wise neural network architecture generation. CVPR.
- [39] (2018) Resource-efficient neural architect. arXiv.
- [40] (2017) Neural architecture search with reinforcement learning. ICLR.
- [41] (2018) Learning transferable architectures for scalable image recognition. CVPR.

In this supplementary material, we list all the architectures found by CARS on the CIFAR-10 dataset and explore the effect of the hyper-parameters used in CARS.

In this section, we visualize the evolution trend when NSGA-III [4] is used to guide architecture optimization. Figure 4 illustrates the first 15 generations, drawn from light to dark. The *Small Model Trap* phenomenon becomes increasingly obvious as evolution proceeds, and larger models tend to be eliminated. Our proposed pNSGA-III protects large models and thus addresses the *Small Model Trap* problem.
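The effect can be illustrated with a minimal sketch. It assumes that pNSGA-III runs non-dominated sorting twice, once on (model size, accuracy) and once on (model size, accuracy growth speed), and merges the resulting fronts layer by layer. The function names and the three-model population are hypothetical and are chosen only to show how a large, slowly converging model survives selection.

```python
from typing import Dict, List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and better on one (all minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points: Dict[str, Tuple[float, float]]) -> List[List[str]]:
    """Peel off Pareto fronts one layer at a time."""
    remaining = dict(points)
    fronts: List[List[str]] = []
    while remaining:
        front = sorted(
            name for name, p in remaining.items()
            if not any(dominates(q, p) for other, q in remaining.items() if other != name)
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


def protected_select(size, acc, speed, keep: int) -> List[str]:
    """Assumed pNSGA-III-style selection: merge the fronts of two sorts layer by layer."""
    by_acc = non_dominated_sort({n: (size[n], -acc[n]) for n in size})
    by_speed = non_dominated_sort({n: (size[n], -speed[n]) for n in size})
    selected: List[str] = []
    for i in range(max(len(by_acc), len(by_speed))):
        layer = (by_acc[i] if i < len(by_acc) else []) + (by_speed[i] if i < len(by_speed) else [])
        for name in layer:
            if name not in selected and len(selected) < keep:
                selected.append(name)
    return selected


# Hypothetical population early in the search: the large model is still
# inaccurate but improves fastest.
size = {"small": 2.0, "medium": 3.0, "large": 4.0}      # parameters (M)
acc = {"small": 0.60, "medium": 0.58, "large": 0.55}    # current accuracy
speed = {"small": 0.01, "medium": 0.02, "large": 0.05}  # accuracy gain per epoch

plain = [n for front in non_dominated_sort({n: (size[n], -acc[n]) for n in size}) for n in front][:2]
print(plain)                                   # ['small', 'medium']
print(protected_select(size, acc, speed, 2))   # ['small', 'large']
```

With accuracy alone, the large model falls into the last front and is dropped; once growth speed is taken into account, it is kept in the population.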

For the parameter optimization stage of the continuous evolution, different architectures are sampled from the maintained population to update the shared parameters. We evaluated different batch sizes and found that performance grows with a similar trend in all cases. We therefore adopt a setting similar to that of ENAS [26] in our experiments.
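The update described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not our training code: architectures are encoded as binary masks over the shared weights, sampling is uniform with replacement, and the gradient of the shared weights is approximated by the mean over the sampled batch. All names (`sample_batch`, `averaged_gradient`, `sgd_step`) and the toy loss are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Arch = Sequence[int]  # hypothetical encoding: 1 if the architecture uses a shared weight


def sample_batch(population: List[Arch], n_b: int, rng: random.Random) -> List[Arch]:
    """Draw n_b architectures uniformly, with replacement (an assumption)."""
    return [rng.choice(population) for _ in range(n_b)]


def averaged_gradient(batch: List[Arch], grad_fn: Callable[[Arch], List[float]]) -> List[float]:
    """Approximate the supernet gradient by the mean over the sampled architectures."""
    grads = [grad_fn(arch) for arch in batch]
    return [sum(column) / len(batch) for column in zip(*grads)]


def sgd_step(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """One plain gradient-descent update of the shared weights."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Toy supernet with three shared weights and the loss 0.5 * sum(w_i^2) over
# the weights an architecture uses, so its gradient is mask_i * w_i.
weights = [1.0, 2.0, 3.0]
population = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]


def grad_fn(arch: Arch) -> List[float]:
    return [m * w for m, w in zip(arch, weights)]


batch = sample_batch(population, n_b=2, rng=random.Random(0))
weights_after = sgd_step(weights, averaged_gradient(batch, grad_fn), lr=0.1)
print(batch, weights_after)
```

Weights that no sampled architecture uses receive a zero gradient, so a larger batch spreads each update over more of the supernet.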

| Architecture | Cell Type | N1 | N2 | N3 | N4 |
|---|---|---|---|---|---|
| CARS-A | Normal | Skip C(k-2) | Max3 C(k-2) | Max3 C(k-2) | Sep3 C(k-2) |
| | | Sep5 C(k-1) | Avg3 C(k-1) | Max3 C(k-1) | Dil5 N1 |
| | Reduce | Avg3 C(k-2) | Max3 C(k-2) | Max3 C(k-2) | Dil5 C(k-2) |
| | | Max3 C(k-1) | Skip C(k-1) | Dil5 C(k-1) | Skip N1 |
| CARS-B | Normal | Sep5 C(k-2) | Sep3 C(k-2) | Dil3 C(k-2) | Avg3 C(k-2) |
| | | Dil3 C(k-1) | Avg3 N1 | Max3 C(k-1) | Skip C(k-1) |
| | Reduce | Sep5 C(k-2) | Sep3 C(k-2) | Avg3 C(k-2) | Dil3 N2 |
| | | Skip C(k-1) | Max3 C(k-1) | Avg3 C(k-1) | Max3 C(k-2) |
| CARS-C | Normal | Sep5 C(k-2) | Skip C(k-2) | Skip C(k-2) | Sep5 C(k-2) |
| | | Skip C(k-1) | Skip C(k-1) | Max3 C(k-1) | Sep3 C(k-1) |
| | Reduce | Max3 C(k-2) | Sep5 C(k-2) | Dil5 C(k-2) | Sep5 C(k-2) |
| | | Max3 C(k-1) | Sep5 C(k-1) | Max3 C(k-1) | Dil3 C(k-1) |
| CARS-D | Normal | Sep5 C(k-2) | Skip C(k-2) | Skip C(k-2) | Sep5 C(k-2) |
| | | Dil3 C(k-1) | Avg3 C(k-1) | Max3 C(k-1) | Sep3 C(k-1) |
| | Reduce | Max3 C(k-2) | Max3 C(k-2) | Dil5 C(k-2) | Sep5 C(k-1) |
| | | Max3 C(k-1) | Sep3 C(k-1) | Max3 C(k-1) | Dil3 C(k-1) |
| CARS-E | Normal | Sep3 C(k-2) | Skip C(k-2) | Avg3 C(k-1) | Skip N2 |
| | | Sep3 C(k-1) | Sep3 N1 | Sep3 N1 | Skip N3 |
| | Reduce | Skip C(k-2) | Avg3 C(k-2) | Sep3 N1 | Avg3 C(k-2) |
| | | Dil3 C(k-1) | Skip N1 | Max3 C(k-2) | Sep3 N3 |
| CARS-F | Normal | Skip C(k-2) | Sep5 C(k-2) | Sep5 N2 | Skip C(k-2) |
| | | Sep5 C(k-1) | Skip N1 | Max3 C(k-2) | Sep3 C(k-1) |
| | Reduce | Avg3 C(k-2) | Dil3 C(k-2) | Sep5 C(k-1) | Max3 C(k-2) |
| | | Sep5 C(k-1) | Dil5 C(k-1) | Skip N1 | Max3 C(k-1) |
| CARS-G | Normal | Max3 C(k-2) | Sep3 C(k-2) | Dil5 C(k-2) | Avg3 C(k-2) |
| | | Dil5 C(k-1) | Skip C(k-1) | Sep4 C(k-1) | Sep3 C(k-1) |
| | Reduce | Max3 C(k-2) | Sep3 C(k-2) | Sep3 C(k-2) | Avg3 C(k-2) |
| | | Sep3 C(k-1) | Sep5 C(k-1) | Skip C(k-1) | Dil3 C(k-1) |
| CARS-H | Normal | Sep5 C(k-2) | Sep3 C(k-2) | Avg3 C(k-2) | Sep5 N1 |
| | | Sep3 C(k-1) | Dil5 N1 | Skip C(k-1) | Max3 C(k-2) |
| | Reduce | Sep5 C(k-2) | Sep3 C(k-2) | Dil3 N1 | Sep5 C(k-2) |
| | | Max3 C(k-1) | Skip C(k-1) | Max3 C(k-2) | Avg3 N2 |
| CARS-I | Normal | Sep3 C(k-2) | Skip C(k-2) | Skip N1 | Sep3 C(n-2) |
| | | Sep3 C(k-1) | Sep5 C(k-1) | Sep3 N2 | Dil5 N3 |
| | Reduce | Dil3 C(k-2) | Max3 C(k-2) | Skip C(k-1) | Dil3 C(k-1) |
| | | Skip C(k-1) | Max3 N1 | Sep5 N2 | Max3 N3 |
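To make the table easier to consume programmatically, a cell can be written as a small Python structure. The sketch below encodes the CARS-A normal cell from the table above, assuming that the two entries in each column are the two (operation, input) pairs feeding that node. The helper names `check_cell` and `op_histogram` are hypothetical and are not part of the released code.

```python
from collections import Counter
from typing import Dict, List, Tuple

Cell = Dict[str, List[Tuple[str, str]]]  # node -> [(operation, input), (operation, input)]

# CARS-A normal cell, copied from the table above.
CARS_A_NORMAL: Cell = {
    "N1": [("Skip", "C(k-2)"), ("Sep5", "C(k-1)")],
    "N2": [("Max3", "C(k-2)"), ("Avg3", "C(k-1)")],
    "N3": [("Max3", "C(k-2)"), ("Max3", "C(k-1)")],
    "N4": [("Sep3", "C(k-2)"), ("Dil5", "N1")],
}


def check_cell(cell: Cell) -> bool:
    """Each node needs two inputs, taken from the two previous cells or from earlier nodes."""
    available = {"C(k-2)", "C(k-1)"}
    for node in sorted(cell):  # N1, N2, N3, N4
        edges = cell[node]
        if len(edges) != 2 or any(src not in available for _, src in edges):
            return False
        available.add(node)
    return True


def op_histogram(cell: Cell) -> Counter:
    """Count how often each operation appears in the cell."""
    return Counter(op for edges in cell.values() for op, _ in edges)


print(check_cell(CARS_A_NORMAL))    # True
print(op_histogram(CARS_A_NORMAL))
```

The same structure applies to the reduction cells; only the operation and input entries change.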

| Architecture | Top-1 (%) | Latency (ms) | Cell Type | N1 | N2 | N3 | N4 |
|---|---|---|---|---|---|---|---|
| CARS-Lat-A | 62.6 | 41.9 | Normal | Skip C(k-2) | Max3 C(k-1) | Avg3 C(k-2) | Max3 C(k-1) |
| | | | | Skip N1 | Skip N2 | Skip N1 | Skip N3 |
| | | | Reduce | Skip C(k-2) | Sep5 C(k-1) | Dil5 C(k-2) | Max3 N1 |
| | | | | Skip C(k-2) | Max3 C(k-1) | Skip C(k-1) | Avg3 N3 |
| CARS-Lat-B | 67.8 | 44.9 | Normal | Skip C(k-2) | Skip C(k-1) | Skip C(k-1) | Dil3 N1 |
| | | | | Skip N1 | Skip N2 | Max3 C(k-2) | Max3 N1 |
| | | | Reduce | Max3 C(k-2) | Max3 C(k-1) | Skip C(k-1) | Max3 C(k-2) |
| | | | | Sep3 C(k-2) | Max3 C(k-1) | Dil5 C(k-2) | Avg3 N1 |
| CARS-Lat-C | 69.5 | 45.6 | Normal | Skip C(k-2) | Avg3 C(k-1) | Skip C(k-2) | Skip C(k-1) |
| | | | | Max3 C(k-1) | Skip N2 | Dil3 N1 | Skip N3 |
| | | | Reduce | Skip C(k-2) | Sep5 C(k-1) | Avg3 C(k-1) | Sep5 N1 |
| | | | | Max3 C(k-2) | Max3 C(k-1) | Dil5 N1 | Skip N3 |
| CARS-Lat-D | 71.9 | 57.6 | Normal | Sep3 C(k-2) | Skip C(k-1) | Skip C(k-2) | Skip C(k-1) |
| | | | | Skip C(k-1) | Avg3 N2 | Dil3 N1 | Skip N3 |
| | | | Reduce | Dil5 C(k-2) | Sep5 C(k-1) | Avg3 C(k-1) | Sep5 N1 |
| | | | | Max3 C(k-2) | Max3 C(k-1) | Dil5 N1 | Skip N3 |
| CARS-Lat-E | 72.0 | 60.8 | Normal | Dil5 C(k-2) | Skip C(k-1) | Skip C(k-1) | Avg3 N1 |
| | | | | Skip C(k-1) | Avg3 N1 | Skip C(k-2) | Max3 C(k-1) |
| | | | Reduce | Skip C(k-2) | Dil3 C(k-1) | Sep3 C(k-2) | Sep3 N1 |
| | | | | Dil3 C(k-2) | Avg3 N2 | Sep3 C(k-1) | Sep5 N1 |
| CARS-Lat-F | 72.2 | 64.5 | Normal | Sep5 C(k-2) | Skip C(k-1) | Skip C(k-2) | Avg3 C(k-1) |
| | | | | Dil3 C(k-1) | Max3 C(k-2) | Skip C(k-2) | Skip C(k-1) |
| | | | Reduce | Sep5 C(k-2) | Max3 C(k-1) | Sep5 C(k-1) | Skip N1 |
| | | | | Sep5 C(k-2) | Sep5 C(k-1) | Sep5 N2 | Dil3 N3 |
| CARS-Lat-G | 74.0 | 89.3 | Normal | Sep5 C(k-2) | Skip C(k-1) | Sep3 C(k-2) | Sep5 N1 |
| | | | | Dil3 C(k-1) | Max3 C(k-2) | Skip C(k-2) | Skip C(k-1) |
| | | | Reduce | Sep5 C(k-2) | Max3 C(k-1) | Sep3 C(k-2) | Avg3 N1 |
| | | | | Sep5 C(k-2) | Sep5 C(k-1) | Sep5 N2 | Dil3 N3 |
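
As a usage illustration, the accuracy and latency columns of the table above can be queried to pick a model for a given latency budget. The sketch below copies the reported numbers; the helper names `best_under_budget` and `is_pareto_consistent` and the example budgets are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (Top-1 accuracy in %, latency in ms), copied from the table above.
CARS_LAT: Dict[str, Tuple[float, float]] = {
    "CARS-Lat-A": (62.6, 41.9),
    "CARS-Lat-B": (67.8, 44.9),
    "CARS-Lat-C": (69.5, 45.6),
    "CARS-Lat-D": (71.9, 57.6),
    "CARS-Lat-E": (72.0, 60.8),
    "CARS-Lat-F": (72.2, 64.5),
    "CARS-Lat-G": (74.0, 89.3),
}


def best_under_budget(models: Dict[str, Tuple[float, float]], budget_ms: float) -> Optional[str]:
    """Return the most accurate model whose latency fits the budget, or None."""
    feasible = {name: v for name, v in models.items() if v[1] <= budget_ms}
    if not feasible:
        return None
    return max(feasible, key=lambda name: feasible[name][0])


def is_pareto_consistent(models: Dict[str, Tuple[float, float]]) -> bool:
    """True if accuracy strictly increases with latency, i.e. no listed model is dominated."""
    ordered = sorted(models.values(), key=lambda v: v[1])
    return all(a[0] < b[0] for a, b in zip(ordered, ordered[1:]))


print(best_under_budget(CARS_LAT, 60.0))   # CARS-Lat-D
print(is_pareto_consistent(CARS_LAT))      # True
```

Every listed model trades higher latency for higher accuracy, so each one is the best choice for some budget.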