1 Introduction
Neural networks have achieved remarkable performance on large-scale supervised learning tasks in computer vision. Most of this progress has been achieved by architectures designed manually by skilled practitioners. Neural Architecture Search (NAS) [43] attempts to automate this process to find good architectures for a given dataset. This promise has led to tremendous improvements in convolutional neural network architectures in terms of predictive performance, computational complexity, and model size on standard large-scale image classification benchmarks such as ImageNet [32], CIFAR-10 [17], and CIFAR-100 [17]. However, the utility of these developments has so far eluded more widespread practical applications: cases where one wishes to use NAS to obtain high-performance models on custom, non-standard datasets, optimizing possibly multiple competing objectives, without the steep computational burden of existing NAS methods.

The goal of NAS is to obtain both the optimal architecture and its associated optimal weights. The key barrier to realizing the full potential of NAS is the nature of its formulation. NAS is typically treated as a bilevel optimization problem, where an inner optimization loops over the weights of the network for a given architecture, while the outer optimization loops over the network architecture itself. The computational challenge of solving this problem stems from both levels. Learning the optimal weights of the network at the lower level necessitates costly iterations of stochastic gradient descent. Similarly, exhaustively searching for the optimal architecture at the upper level is prohibitive due to the discrete nature of the architecture description, the size of the search space, and our desire to optimize multiple, possibly competing, objectives. Mitigating both of these challenges explicitly and simultaneously is the goal of this paper.
Many approaches have been proposed to improve the efficiency of NAS algorithms at both the upper and the lower level. A majority of them focus on the lower level, including weight sharing [1, 30, 20], proxy models [43, 31], coarse training [35], etc. However, these approaches still have to sample, explicitly or implicitly, a large number of architectures to evaluate at the upper level. In contrast, there has been relatively little focus on improving the sample efficiency of the upper level optimization. A few recent approaches [19, 10] adopt surrogates that predict the lower level performance with the goal of navigating the upper level search space efficiently. However, these surrogate predictive models are still very sample-inefficient, since they are learned in an offline stage by first sampling a large number of architectures, each requiring full lower level optimization.
In this paper, we propose a practically efficient NAS algorithm by adopting explicit surrogate models simultaneously at both the upper and the lower level. Our lower level surrogate adopts a fine-tuning approach, where the initial weights for fine-tuning are obtained from a supernet model, such as [1, 4, 5]. Our upper level surrogate adopts an online learning algorithm that focuses on architectures in the search space that are close to the current trade-off front, as opposed to the random/uniform set of architectures used in offline surrogate approaches [10, 13, 19]. Our online surrogate significantly improves the sample efficiency of the upper level optimization problem in comparison to offline surrogates. For instance, Once-For-All [5] and PNAS [19] sample 16,000 and 1,160 architectures, respectively, to learn the upper level surrogate (the latter is an estimate based on the number of models evaluated by PNAS; the actual sample size is not reported). In contrast, we only have to sample 350 architectures to obtain a model with similar performance.
An overview of our approach is shown in Fig. 1. We refer to the proposed NAS algorithm as MSuNAS and the resulting architectures as NSGANetV2. Our method is designed to provide a set of high-performance models on a custom dataset (large- or small-scale, multi-class or fine-grained) while optimizing possibly multiple objectives of interest. Our key contributions are:
- An alternative approach to solving the bilevel NAS problem, i.e., simultaneously optimizing the architecture and learning the optimal model weights. Instead of gradient-based relaxations (e.g., DARTS), we advocate surrogate models. Given a dataset and a set of objectives to optimize, MSuNAS can design custom neural network architectures as efficiently as DARTS but with higher performance, and extends to multiple, possibly competing, objectives.
- A simple, yet highly effective, online surrogate model for the upper level optimization in NAS, resulting in a significant increase in sampling efficiency over other surrogate-based approaches.
- Scalability and practicality of MSuNAS on many datasets corresponding to different scenarios. These include standard datasets (ImageNet, CIFAR-10, and CIFAR-100) and six non-standard datasets, such as CINIC-10 [11] (multi-class), STL-10 [9] (small-scale multi-class), and Oxford Flowers102 [28] (small-scale fine-grained). Under mobile settings (≤600M MAdds), MSuNAS achieves state-of-the-art performance.
2 Related Work
Table 1: Comparison of existing NAS methods.

Methods | Search method | Lower-level surrogate | Upper-level surrogate | Multi-objective | Dataset(s) searched
NASNet [43] | RL | | | | C10
ENAS [30] | RL | ✓ | | | C10
PNAS [19] | SMBO | | ✓ | | C10
DPP-Net [13] | SMBO | | ✓ | ✓ | C10
DARTS [20] | Gradient | ✓ | | | C10
LEMONADE [14] | EA | ✓ | | ✓ | C10, C100
ProxylessNAS [6] | RL + gradient | ✓ | | ✓ | C10, ImageNet
MnasNet [35] | RL | | | ✓ | ImageNet
ChamNet [10] | EA | | ✓ | ✓ | ImageNet
MobileNetV3 [15] | RL + expert | | | ✓ | ImageNet
MSuNAS (ours) | EA | ✓ | ✓ | ✓ |
Lower Level Surrogate: Existing approaches [30, 4, 20, 23] primarily focus on mitigating the computational overhead induced by SGD-based weight optimization at the lower level, as this process needs to be repeated for every architecture sampled by a NAS method at the upper level. A common theme among these methods is training a supernet that contains all searchable architectures as its subnetworks. During search, the accuracy obtained with weights inherited from the supernet becomes the metric used to select architectures. However, completely relying on the supernet as a substitute for actual weight optimization when evaluating candidate architectures is unreliable. Numerous studies [18, 40, 41] have reported a weak correlation between the performance of the searched architectures (predicted by weight sharing) and the same architectures trained from scratch (using SGD) during the evaluation phase. MSuNAS instead uses the weights inherited from the supernet only as an initialization for the lower level optimization. Such a fine-tuning process affords the computational benefit of the supernet while improving the correlation between the performance of weights initialized from the supernet and those trained from scratch.
Upper Level Surrogate: MetaQNN [1] uses surrogate models to predict the final accuracy of candidate architectures (as a time-series prediction) from the first 25% of the SGD training learning curve. PNAS [19] uses a surrogate model to predict the top-1 accuracy of architectures when an additional branch is added to the cell structure that is repeatedly stacked together. Fundamentally, both of these approaches seek to extrapolate, rather than interpolate, the performance of architectures using the surrogates. Consequently, as we show later in the paper, the rank-order correlation between the predicted accuracy and the true accuracy is very low (0.476). (In Appendix 0.A we show that better rank-order correlation at the search stage ultimately leads to finding better-performing architectures.) Once-For-All [5] also uses a surrogate model to predict accuracy from the architecture encoding. However, that surrogate model is trained offline for the entire search space, thereby needing a large number of samples for learning (16K samples, about 2 GPU-days, twice the entire search cost of DARTS, just to construct the surrogate model). Instead of using uniformly sampled architectures and their validation accuracy to train a surrogate model that approximates the entire landscape, ChamNet [10] trains many architectures through full lower level optimization and selects only 300 samples with high accuracy and diverse efficiency (FLOPs, latency, energy) to train a surrogate model offline. In contrast, MSuNAS learns a surrogate model in an online fashion, only on the samples that are close to the current trade-off front as we explore the search space. The online learning approach significantly improves the sample efficiency of our search, since we only need lower level optimization (full or surrogate-assisted) for the samples near the current Pareto front.

Multi-Objective NAS: Approaches that consider more than one objective when optimizing the architecture can be categorized into two groups: (i) scalarization, and (ii) population-based approaches. The former includes ProxylessNAS [6], MnasNet [35], FBNet [39], and MobileNetV3 [15], which use a scalarized objective that encourages high accuracy and penalizes compute inefficiency at the same time, e.g., maximize Acc × (Latency/Target)^w. These methods require a pre-defined preference weighting of the importance of the different objectives before the search, which typically takes a number of trials to tune. Methods in the latter category, including [22, 14, 13, 7, 21], aim to approximate the entire Pareto-efficient frontier simultaneously. These approaches rely on heuristics (e.g., EA) to efficiently navigate the search space, which allows practitioners to visualize the trade-off between the objectives and to choose a suitable network a posteriori to the search. MSuNAS falls in the latter category, using surrogate models to mitigate the computational overhead.

3 Proposed Approach
The neural architecture search problem for a target dataset can be formulated as the following bilevel optimization problem [3],

    minimize_α   F(α) = ( f_1(α; w*(α)), ..., f_k(α; w*(α)), f_{k+1}(α), ..., f_m(α) ),        (1)
    subject to   w*(α) ∈ argmin_w L(w; α),    α ∈ Ω_α,  w ∈ Ω_w,

where the upper level variable α defines a candidate CNN architecture and the lower level variable w defines the associated weights. L(w; α) denotes the cross-entropy loss on the training data for a given architecture α, and F = (f_1, ..., f_m) constitutes the m desired objectives. These objectives can be further divided into two groups: the first group (f_1 to f_k) consists of objectives that depend on both the architecture and the weights, e.g., predictive performance on validation data, robustness to adversarial attack, etc. The other group (f_{k+1} to f_m) consists of objectives that depend only on the architecture, e.g., number of parameters, floating-point operations, latency, etc.
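To make the split between the two objective groups concrete, the sketch below evaluates the objective vector for one architecture; the weights are needed only for the first k objectives. All function names and the toy "architecture" here are illustrative placeholders, not the paper's implementation.

```python
# Sketch of the bilevel objective structure in Eq. (1): the first k
# objectives require trained weights w*(alpha); the remaining m - k
# depend on the architecture alone.

def evaluate_objectives(arch, train_weights, k_objs, arch_objs):
    """Return the m-dimensional objective vector F(alpha).

    train_weights : callable solving the lower-level problem (SGD in the paper)
    k_objs        : objectives f_1..f_k taking (arch, weights)
    arch_objs     : objectives f_{k+1}..f_m taking arch only (e.g. #MAdds)
    """
    weights = train_weights(arch)            # lower-level optimization
    f = [obj(arch, weights) for obj in k_objs]
    f += [obj(arch) for obj in arch_objs]    # no weights needed
    return f

# Toy instantiation: "architecture" = list of layer widths.
arch = [16, 32, 64]
objs = evaluate_objectives(
    arch,
    train_weights=lambda a: sum(a),                  # stand-in for SGD
    k_objs=[lambda a, w: 1.0 / (1 + w)],             # stand-in "error"
    arch_objs=[lambda a: sum(x * x for x in a)],     # stand-in "MAdds"
)
```

Note that only the first objective triggers the (expensive) `train_weights` call; architecture-only objectives are essentially free, which is what makes multi-objective NAS over efficiency metrics tractable.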
3.1 Search Space
MSuNAS searches over four important dimensions of convolutional neural networks (CNNs): depth (# of layers), width (# of channels), kernel size, and input resolution. Following previous works [35, 15, 5], we decompose a CNN architecture into five sequentially connected blocks with gradually reduced feature map size and increased number of channels. In each block, we search over the number of layers (a minimum of two and a maximum of four), where only the first layer uses stride 2 if the feature map size decreases. Every layer adopts the inverted bottleneck structure [33], and we search over the expansion ratio of the first convolution and the kernel size of the depthwise separable convolution. Additionally, we allow the input image size to range from 192 to 256. We use an integer string to encode these architectural choices, padding with zeros the strings of architectures that have fewer layers so that the encoding has a fixed length. A pictorial overview of this search space and encoding is shown in Fig. 2.

3.2 Overall Algorithm Description
The problem in Eq. 1 poses two main computational bottlenecks for conventional bilevel optimization methods. First, the lower level problem of learning the optimal weights for a given architecture involves a prolonged training process, e.g., one complete SGD training on the ImageNet dataset takes two days on an 8-GPU server. Second, even though techniques like weight-sharing exist to bypass gradient-descent-based weight learning, extensively sampling architectures at the upper level can still render the overall process computationally prohibitive: 10,000 evaluations on ImageNet take 24 GPU-hours, and for methods like NASNet and AmoebaNet that require more than 20,000 samples, the search still takes days to complete even with weight-sharing.
Algorithm 1 and Fig. 3 show the pseudocode and the corresponding steps from a sample run of MSuNAS on ImageNet, respectively. To overcome the aforementioned bottlenecks, we use surrogate models at both the upper and lower levels to make our NAS algorithm practically useful for a variety of datasets and objectives. At the upper level, we construct a surrogate model that predicts the top-1 accuracy from the integer strings that encode architectures. Previous approaches [10, 34, 5] that also used surrogate modeling of the accuracy follow an offline approach, where the accuracy predictor is built from samples collected separately prior to the architecture search and is not refined during the search. We argue that such a process makes the search outcome highly dependent on the initial training samples. As an alternative, we propose to model and refine the accuracy predictor iteratively, in an online manner, during the search. In particular, we start with an accuracy predictor constructed from only a limited number of architectures sampled randomly from the search space. We then use a standard multi-objective algorithm (NSGA-II [12], in our case) to search using the constructed accuracy predictor along with the other objectives of interest to the user. We then evaluate the architectures returned by NSGA-II and refine the accuracy predictor with these architectures as new training samples. We repeat this process for a pre-specified number of iterations and output the non-dominated solutions from the pool of evaluated architectures.
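The online loop described above can be sketched as follows. This is a deliberately simplified stand-in, not the paper's implementation: a 1-nearest-neighbour surrogate and random candidate generation replace the four predictor families and the NSGA-II step of Algorithm 1, and an analytic `true_acc` replaces expensive lower level training.

```python
import random

# Toy search space: architectures are fixed-length integer strings,
# mirroring the encoding of Section 3.1.
random.seed(0)
L = 6  # encoding length

def true_acc(arch):
    """Stand-in for expensive lower-level SGD training + validation."""
    return -sum((g - 2) ** 2 for g in arch)

def surrogate_predict(arch, archive):
    """1-nearest-neighbour accuracy predictor over evaluated samples."""
    key = min(archive, key=lambda a: sum(abs(x - y) for x, y in zip(a, arch)))
    return archive[key]

archive = {}                             # evaluated (arch -> accuracy)
for _ in range(10):                      # initial random samples
    a = tuple(random.randint(0, 4) for _ in range(L))
    archive[a] = true_acc(a)

for _ in range(5):                       # online refinement iterations
    candidates = [tuple(random.randint(0, 4) for _ in range(L))
                  for _ in range(50)]
    # rank candidates by the surrogate; evaluate only the most promising
    best = max(candidates, key=lambda a: surrogate_predict(a, archive))
    archive[best] = true_acc(best)       # refine the surrogate online

best_arch = max(archive, key=archive.get)
```

The key property preserved from the paper is that expensive evaluations are spent only on surrogate-recommended samples, and each evaluation immediately enriches the surrogate's training set.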
3.3 Speeding Up Upper Level Optimization
Recall that the nested nature of the bilevel problem makes the upper level optimization computationally very expensive, as every upper level function evaluation requires another optimization at the lower level. Hence, to improve the efficiency of our approach at the upper level, we focus on reducing the number of architectures that we send to the lower level for learning optimal weights. To achieve this goal, we need a surrogate model that predicts the accuracy of an architecture before we actually train it. There are two desired properties of such a predictor: (1) high rank-order correlation between predicted and true performance; and (2) sample efficiency, so that the number of architectures that must be trained through SGD to construct the predictor is minimized.
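Property (1) is typically quantified with Kendall's rank correlation [16]: what matters for search is whether the surrogate orders architectures correctly, not whether its absolute accuracy values are right. A small self-contained implementation of the pairwise definition:

```python
def kendall_tau(pred, true):
    """Kendall rank correlation between predicted and true accuracies.

    Counts concordant minus discordant pairs, normalized by the number
    of pairs; +1 means the surrogate ranks architectures exactly as full
    SGD training would.
    """
    n, num = len(pred), 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pred[i] - pred[j]) * (true[i] - true[j])
            num += (s > 0) - (s < 0)
    return num / (n * (n - 1) / 2)

# Perfectly ordered predictions give tau = 1.0 even when the predicted
# values themselves are far from the true accuracies.
tau = kendall_tau([0.1, 0.2, 0.3, 0.4], [70.0, 71.5, 72.0, 75.0])
```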
We first collected four different surrogate models for accuracy prediction from the literature: Multi-Layer Perceptron (MLP) [19], Classification And Regression Trees (CART) [34], Radial Basis Function (RBF) [1], and Gaussian Process (GP) [10]. From our ablation study, we observed that no one surrogate model is consistently better than the others in terms of the above two criteria on all datasets (see Section 4.1). Hence, we propose a selection mechanism, dubbed Adaptive Switching (AS), which constructs all four types of surrogate models at every iteration and adaptively selects the best model via cross-validation.

With the accuracy predictor selected by AS, we apply the NSGA-II algorithm to simultaneously optimize both the (predicted) accuracy and the other objectives of interest to the user (line 10 in Algorithm 1). For the purpose of illustration, we assume that the user is interested in optimizing #MAdds as the second objective. At the conclusion of the NSGA-II search, a set of non-dominated architectures is output, see Fig. 3(b). Oftentimes, we cannot afford to train all architectures in this set. To select a subset, we first pick the architecture with the highest predicted accuracy. We then project all other candidate architectures onto the #MAdds axis and pick the remaining architectures from the sparse regions, which helps extend the Pareto frontier to diverse #MAdds regimes, see Fig. 3(c)-(d). The architectures from the chosen subset are then sent to the lower level for SGD training. Finally, we add these architectures to the training samples to refine our accuracy predictor models and proceed to the next iteration, see Fig. 3(e).
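A minimal sketch of the Adaptive Switching idea: fit each candidate surrogate family on the current archive and keep the one with the best cross-validation score. The two toy families below (a constant predictor and 1-nearest-neighbour) stand in for the paper's MLP/CART/RBF/GP, and mean-squared error stands in for the actual model-selection criterion; both substitutions are assumptions for illustration.

```python
import statistics

def fit_mean(X, y):
    """Family 1: predict the global mean, ignoring the encoding."""
    m = statistics.mean(y)
    return lambda x: m

def fit_1nn(X, y):
    """Family 2: predict the accuracy of the nearest archived encoding."""
    data = list(zip(X, y))
    return lambda x: min(data, key=lambda p: sum((a - b) ** 2
                         for a, b in zip(p[0], x)))[1]

def cv_error(fit, X, y, folds=5):
    """Mean squared error under k-fold cross-validation."""
    errs = []
    for f in range(folds):
        tr = [(x, t) for i, (x, t) in enumerate(zip(X, y)) if i % folds != f]
        va = [(x, t) for i, (x, t) in enumerate(zip(X, y)) if i % folds == f]
        model = fit([x for x, _ in tr], [t for _, t in tr])
        errs += [(model(x) - t) ** 2 for x, t in va]
    return statistics.mean(errs)

def adaptive_switching(X, y, families):
    """Pick the surrogate family with the lowest cross-validation error."""
    return min(families, key=lambda fit: cv_error(fit, X, y))

# Toy archive where accuracy depends strongly on the encoding, so the
# 1-NN family should be selected over the constant predictor.
X = [(i, j) for i in range(5) for j in range(5)]
y = [i + 2 * j for i, j in X]
best_fit = adaptive_switching(X, y, [fit_mean, fit_1nn])
```

In the paper all four families are refit at every iteration, so the selected family can change as the archive grows; the re-fitting cost is negligible next to one SGD training.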
3.4 Speeding Up Lower Level Optimization
To further improve the search efficiency of the proposed algorithm, we adopt the widely used weight-sharing technique [4, 23, 25]. First, we need a supernet such that all searchable architectures are subnetworks of it. We construct such a supernet by taking the searched architectural hyperparameters at their maximum values, i.e., four layers in each of the five blocks, with expansion ratio set to 6 and kernel size set to 7 in each layer (see Fig. 2). We then follow the progressive shrinking algorithm [5] to train the supernet. This process is executed once, before the architecture search. The weights inherited from the trained supernet are used as a warm-start for the gradient descent algorithm during the architecture search.

4 Experiments and Results
In this section, we evaluate the surrogate predictors, the search efficiency, and the obtained architectures on CIFAR-10 [17], CIFAR-100 [17], and ImageNet [32].
4.1 Performance of the Surrogate Predictors
To evaluate the effectiveness of the considered surrogate models, we uniformly sample 2,000 architectures from our search space, train them using SGD for 150 epochs on each of the three datasets, and record their accuracy on 5,000 held-out images from the training set. We then fit surrogate models with different numbers of samples randomly selected from the 2,000 collected. We repeat the process for 10 trials and compare the mean and standard deviation of the rank-order correlation between the predicted and true accuracy, see Fig. 4. In general, we observe that no single surrogate model consistently outperforms the others on all three datasets. Hence, at every iteration, we adopt an Adaptive Switching (AS) routine that compares the four surrogate models and chooses the best based on 10-fold cross-validation. It is evident from Fig. 4 that AS works better than any one of the four surrogate models alone on all three datasets. The construction time of AS is negligible relative to the search cost.

4.2 Search Efficiency
Table 2: Search efficiency comparison on CIFAR-10 and ImageNet.

Dataset | Method | Type | Top-1 Acc. | #MAdds | #Models | Speedup | #Epochs | Speedup
CIFAR-10 | NASNet-A [43] | RL | 97.4% | 569M | 20,000 | 57x | 20 | up to 4x
CIFAR-10 | AmoebaNet-B [31] | EA | 97.5% | 555M | 27,000 | 77x | 25 | up to 5x
CIFAR-10 | PNASNet-5 [19] | SMBO | 96.6% | 588M | 1,160 | 3.3x | 20 | up to 4x
CIFAR-10 | MSuNAS (ours) | EA | 98.4% | 468M | 350 | 1x | 5 / 20 | 1x
ImageNet | MnasNet-A [35] | RL | 75.2% | 312M | 8,000 | 23x | 5 | up to 5x
ImageNet | Once-For-All [5] | EA | 76.0% | 230M | 16,000 | 46x | 0 | -
ImageNet | MSuNAS (ours) | EA | 75.9% | 225M | 350 | 1x | 0 / 5 | 1x
In this section, we first compare the search efficiency of MSuNAS to other single-objective methods on both CIFAR-10 and ImageNet. To quantify the speedup, we compare the two governing factors: the total number of architectures evaluated by each method to reach the reported accuracy, and the number of epochs undertaken to train each sampled architecture during search. The results are provided in Table 2. We observe that MSuNAS is over 20x faster than methods that use RL or EA. When compared to PNAS [19], which also utilizes an accuracy predictor, MSuNAS is still at least 3x faster.
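The bi-objective comparison that follows scores each search run by the dominated hypervolume of its archive; for two minimized objectives (e.g., classification error and #MAdds) the indicator reduces to a sum of rectangles up to a reference point. A minimal sketch, assuming minimization and a fixed reference point:

```python
def hypervolume_2d(points, ref):
    """Dominated hypervolume for two minimized objectives, relative to
    reference point `ref`: keep the non-dominated points, sort by the
    first objective, and sum the rectangles they dominate."""
    pareto = [p for p in points
              if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                         for q in points)]
    pareto.sort()
    hv, prev_y = 0.0, ref[1]
    for x, y in pareto:
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv

# Two illustrative fronts under the same reference point: the dominating
# front encloses a larger hypervolume.
ref = (1.0, 1.0)
front_a = [(0.2, 0.8), (0.5, 0.4), (0.8, 0.1)]
front_b = [(0.3, 0.9), (0.6, 0.5), (0.9, 0.2)]
```

Tracking this value against the number of architectures evaluated gives a single curve per search run, which is how the methods below are compared.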
We then compare the search efficiency of MSuNAS to NSGA-Net [22] and random search under a bi-objective setup: top-1 accuracy and #MAdds. To perform the comparison, we run MSuNAS for 30 iterations, leading to 350 architectures evaluated in total. We record the cumulative hypervolume [42] achieved against the number of architectures evaluated. We repeat this process five times on both ImageNet and CIFAR-10 to capture the variance in performance due to randomness in the search initialization. For a fair comparison to NSGA-Net, we apply its search code to our search space and record the number of architectures evaluated by NSGA-Net to reach a hypervolume similar to that achieved by MSuNAS. The random search baseline is performed by sampling uniformly from our search space. We plot the mean and standard deviation of the hypervolume achieved by each method in Fig. 5. Based on the rate of increase of the hypervolume metric, we observe that MSuNAS is, on average, 2-5x faster in achieving a better Pareto frontier in terms of the number of architectures evaluated.

4.3 Results on Standard Datasets
Prior to the search, we train the supernet following the training hyperparameter settings from [5]. For each dataset, we start MSuNAS with 100 randomly sampled architectures and run for 30 iterations. In each iteration, we evaluate 8 architectures selected from the candidates recommended by NSGA-II according to the accuracy predictor. When searching on CIFAR-10 and CIFAR-100, we fine-tune the weights inherited from the supernet for five epochs, then evaluate on 5K held-out validation images from the original training set. When searching on ImageNet, we recalibrate the running statistics of the BN layers after inheriting the weights from the supernet, and evaluate on 10K held-out validation images from the original training set. At the conclusion of the search, we pick four architectures from the achieved Pareto front and further fine-tune them for an additional 150-300 epochs on the entire training sets. For reference, we name the obtained architectures NSGANetV2s/m/l/xl, in ascending order of #MAdds. Architectural details can be found in Appendix 0.C.
Table 3: Performance comparison on the ImageNet validation set.

Model | Type | Search cost (GPU-days) | #Params | #MAdds | CPU Lat. (ms) | GPU Lat. (ms) | Top-1 (%) | Top-5 (%)
NSGANetV2s | auto | 1 | 6.1M | 225M | 9.1 | 30 | 77.4 | 93.5
MobileNetV2 [33] | manual | 0 | 3.4M | 300M | 8.3 | 23 | 72.0 | 91.0
FBNet-C [39] | auto | 9 | 5.5M | 375M | 9.1 | 31 | 74.9 | -
ProxylessNAS [6] | auto | 8.3 | 7.1M | 465M | 8.5 | 27 | 75.1 | 92.5
MobileNetV3 [15] | combined | - | 5.4M | 219M | 10.0 | 33 | 75.2 | -
Once-For-All [5] | auto | 2 | 6.1M | 230M | 9.5 | 31 | 76.9 | -
NSGANetV2m | auto | 1 | 7.7M | 312M | 11.4 | 37 | 78.3 | 94.1
EfficientNet-B0 [36] | auto | - | 5.3M | 390M | 14.4 | 46 | 76.3 | 93.2
MixNet-M [37] | auto | - | 5.0M | 360M | 24.3 | 79 | 77.0 | 93.3
AtomNAS-C+ [25] | auto | 1 | 5.5M | 329M | - | - | 77.2 | 93.5
NSGANetV2l | auto | 1 | 8.0M | 400M | 12.9 | 52 | 79.1 | 94.5
PNASNet-5 [19] | auto | 250 | 5.1M | 588M | 35.6 | 82 | 74.2 | 91.9
NSGANetV2xl | auto | 1 | 8.7M | 593M | 16.7 | 73 | 80.4 | 95.2
EfficientNet-B1 [36] | auto | - | 7.8M | 700M | 21.5 | 78 | 78.8 | 94.4
MixNet-L [37] | auto | - | 7.3M | 565M | 29.4 | 105 | 78.9 | 94.2
Table 3 shows the performance of our models on the ImageNet 2012 benchmark [32]. We compare models in terms of predictive performance on the validation set, model efficiency (measured by #MAdds and latencies on different hardware), and associated search cost. Overall, NSGANetV2 consistently either matches or outperforms other models across different accuracy levels with a highly competitive search cost. In particular, NSGANetV2s is 2.2% more accurate than MobileNetV3 [15] while being equivalent in #MAdds and latencies; NSGANetV2xl achieves 80.4% top-1 accuracy under 600M MAdds, which is 1.5% more accurate and 1.2x more efficient than EfficientNet-B1 [36]. Additional comparisons to models from multi-objective approaches are provided in Fig. 6.

For the CIFAR datasets, Fig. 6 compares our models with other approaches in terms of both predictive performance and computational efficiency. On CIFAR-10, we observe that NSGANetV2 dominates all previous models, including (1) NASNet-A [43], PNASNet-5 [19], and NSGA-Net [22], which search on CIFAR-10 directly, and (2) EfficientNet [36], MobileNetV3 [15], and MixNet [37], which are fine-tuned from ImageNet.
5 Scalability of MSuNAS
5.1 Types of Datasets
Existing NAS approaches are rarely evaluated for their search ability beyond the standard benchmark datasets, i.e., ImageNet, CIFAR-10, and CIFAR-100. Instead, they follow a conventional transfer learning setup, in which the architectures found by searching on standard benchmark datasets are transferred, with weights fine-tuned, to new datasets. We argue that such a process is conceptually contradictory to the goal of NAS, and that the architectures identified under such a process can be suboptimal. In this section, we demonstrate the scalability of MSuNAS on six additional datasets with various forms of difficulty, in terms of diversity of classification classes (multi-class vs. fine-grained) and size of the training set (see Table 4). We adopt the settings used for the CIFAR datasets as outlined in Section 3. For each dataset, one search takes less than one day on 8 GPU cards.
Table 4: Non-standard datasets used to evaluate MSuNAS.

Datasets | Type | #Classes | #Train | #Test
CINIC-10 [11] | multi-class | 10 | 90,000 | 90,000
STL-10 [9] | multi-class | 10 | 5,000 | 8,000
Flowers102 [28] | fine-grained | 102 | 2,040 | 6,149
Pets [29] | fine-grained | 37 | 3,680 | 3,369
DTD [8] | fine-grained | 47 | 3,760 | 1,880
Aircraft [24] | fine-grained | 100 | 6,667 | 3,333
Fig. 7 (bottom) compares the performance of NSGANetV2, obtained by searching directly on the respective datasets, to models from other approaches that transfer architectures learned from either CIFAR-10 or ImageNet. Overall, we observe that NSGANetV2 significantly outperforms the other models on all three datasets. In particular, NSGANetV2 achieves better performance than the currently known state-of-the-art on CINIC-10 [27] and STL-10 [2]. Furthermore, on Oxford Flowers102, NSGANetV2 achieves better accuracy than that of EfficientNet-B3 [36] while using 1.4B fewer MAdds.
5.2 Number of Objectives
Single-objective Formulation: Adding a hardware efficiency target as a penalty term to the objective of maximizing predictive performance is a common workaround for handling multiple objectives in the NAS literature [6, 35, 39]. We demonstrate that our proposed algorithm can also effectively handle such a scalarized single-objective search. Following the scalarization method in [35], we apply MSuNAS to maximize validation accuracy on ImageNet with 600M MAdds as the targeted efficiency. The progression of the top-1 accuracy achieved and the performance of the accuracy predictor are provided in Fig. 7(a). Without further fine-tuning, the obtained architecture yields 79.56% accuracy at 596M MAdds on the ImageNet validation set, i.e., it is more accurate than EfficientNet-B1 [36] while using 100M fewer MAdds.
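The scalarization of [35] can be written as a one-line function; the exponent below follows the soft-constraint variant of MnasNet (which reports w = -0.07), while the accuracy and MAdds values are purely illustrative.

```python
# MnasNet-style scalarization: maximize acc * (madds / target) ** w,
# where a negative exponent w penalizes models above the target budget
# and mildly rewards those below it.

def scalarized_objective(acc, madds, target=600e6, w=-0.07):
    return acc * (madds / target) ** w

# Under budget vs. over budget at the same accuracy: the over-budget
# model receives a lower scalarized score.
under = scalarized_objective(0.79, 596e6)
over = scalarized_objective(0.79, 700e6)
```

Note that the search outcome depends on the choice of w, which is exactly the preference-weighting issue that motivates the population-based alternative below.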
Many-objective Formulation: Practical deployment of learned models is rarely driven by a single objective; most often, it seeks to trade off many different, possibly competing, objectives. As an example of one such scenario, we use MSuNAS to simultaneously optimize five objectives, namely, accuracy on ImageNet, #Params, #MAdds, CPU latency, and GPU latency. We follow the same search setup as in the main experiments and increase the budget to ensure a thorough search of the expanded objective space. We show the obtained Pareto-optimal (with respect to all five objectives) architectures in Fig. 7(b). We use color and marker size to indicate CPU and GPU latency, respectively. We observe that a Pareto surface emerges, shown in the left 3D scatter plot, suggesting that trade-offs exist between the objectives, i.e., #Params and #MAdds are not fully correlated. We then project all architectures to 2D, visualizing accuracy vs. each of the four efficiency measurements, and highlight the architectures that are non-dominated in the corresponding two-objective cases. We observe that many architectures that are non-dominated in the five-objective case become dominated when only two objectives are considered. Empirically, we observe that accuracy is highly correlated with #MAdds and with CPU and GPU latency, but less so with #Params.
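The observation that non-dominated architectures become dominated under fewer objectives follows directly from the definition of Pareto dominance, sketched below; the five-dimensional vectors are illustrative values, not measurements from the paper.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives minimized here)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Illustrative 5-objective vectors: (error, #Params, #MAdds, CPU, GPU).
pts = [(0.21, 6.1, 225, 9.1, 30),
       (0.22, 5.4, 219, 10.0, 33),
       (0.25, 7.1, 465, 8.5, 27)]
full = non_dominated(pts)                             # all survive
two_obj = non_dominated([(p[0], p[2]) for p in pts])  # error vs MAdds only
```

Here the third point survives in the five-objective case only because of its low latencies; once those objectives are dropped, it is dominated.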
6 Conclusion
This paper introduced MSuNAS, an efficient neural architecture search algorithm for rapidly designing task-specific models under multiple, possibly competing, objectives. The efficiency of our approach stems from (i) online surrogate modeling at the architecture level, which improves the sample efficiency of the search, and (ii) a supernet-based surrogate model, which improves the efficiency of weight learning via fine-tuning. On standard datasets (CIFAR-10, CIFAR-100, and ImageNet), NSGANetV2 matches the state-of-the-art with a search cost of one day. The utility and versatility of MSuNAS are further demonstrated on non-standard datasets of various types of difficulty and with different numbers of objectives. Improvements beyond the state-of-the-art on STL-10 and Flowers102 (under the mobile setting) suggest that NAS is a more effective alternative to conventional transfer learning approaches.
References
 [1] Baker, B., Gupta, O., Raskar, R., Naik, N.: Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823 (2017)

[2]
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: A holistic approach to semisupervised learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
 [3] Bracken, J., McGill, J.T.: Mathematical programs with optimization problems in the constraints. Operations Research 21(1), 37–44 (1973), http://www.jstor.org/stable/169087
 [4] Brock, A., Lim, T., Ritchie, J., Weston, N.: SMASH: Oneshot model architecture search through hypernetworks. In: International Conference on Learning Representations (ICLR) (2018)
 [5] Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once for all: Train one network and specialize it for efficient deployment. In: International Conference on Learning Representations (ICLR) (2020)
 [6] Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target task and hardware. In: International Conference on Learning Representations (ICLR) (2019)
 [7] Chu, X., Zhang, B., Xu, R., Li, J.: Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845 (2019)

[8]
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2014)

[9]
Coates, A., Ng, A., Lee, H.: An analysis of singlelayer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011)
 [10] Dai, X., Zhang, P., Wu, B., Yin, H., Sun, F., Wang, Y., Dukhan, M., Hu, Y., Wu, Y., Jia, Y., et al.: Chamnet: Towards efficient network design through platformaware model adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
 [11] Darlow, L.N., Crowley, E.J., Antoniou, A., Storkey, A.J.: Cinic10 is not imagenet or cifar10. arXiv preprint arXiv:1810.03505 (2018)

[12]
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: Nsgaii. IEEE Transactions on Evolutionary Computation
6(2), 182–197 (2002). https://doi.org/10.1109/4235.996017  [13] Dong, J.D., Cheng, A.C., Juan, D.C., Wei, W., Sun, M.: Dppnet: Deviceaware progressive search for paretooptimal neural architectures. In: European Conference on Computer Vision (ECCV) (2018)
 [14] Elsken, T., Metzen, J.H., Hutter, F.: Efficient multiobjective neural architecture search via lamarckian evolution. In: International Conference on Learning Representations (ICLR) (2019)
 [15] Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for mobilenetv3. In: International Conference on Computer Vision (ICCV) (2019)
 [16] KENDALL, M.G.: A NEW MEASURE OF RANK CORRELATION. Biometrika 30(12), 81–93 (06 1938). https://doi.org/10.1093/biomet/30.12.81, https://doi.org/10.1093/biomet/30.12.81
 [17] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
 [18] Li, L., Talwalkar, A.: Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638 (2019)
[19] Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: European Conference on Computer Vision (ECCV) (2018)
 [20] Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. In: International Conference on Learning Representations (ICLR) (2019)
[21] Lu, Z., Deb, K., Boddeti, V.N.: MUXConv: Information multiplexing in convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
[22] Lu, Z., Whalen, I., Boddeti, V., Dhebar, Y., Deb, K., Goodman, E., Banzhaf, W.: NSGA-Net: Neural architecture search using multi-objective genetic algorithm. In: Genetic and Evolutionary Computation Conference (GECCO) (2019)
 [23] Luo, R., Tian, F., Qin, T., Chen, E., Liu, T.Y.: Neural architecture optimization. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)
[24] Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)
[25] Mei, J., Li, Y., Lian, X., Jin, X., Yang, L., Yuille, A., Yang, J.: AtomNAS: Fine-grained end-to-end neural architecture search. In: International Conference on Learning Representations (ICLR) (2020)
[26] Myburgh, C., Deb, K.: Derived heuristics-based consistent optimization of material flow in a gold processing plant. Engineering Optimization 50(1), 1–18 (2018). https://doi.org/10.1080/0305215X.2017.1296436
[27] Nayman, N., Noy, A., Ridnik, T., Friedman, I., Jin, R., Zelnik, L.: XNAS: Neural architecture search with expert advice. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
[28] Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing (2008)
 [29] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
[30] Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: International Conference on Machine Learning (ICML) (2018)
[31] Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: AAAI Conference on Artificial Intelligence (2019)
[32] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
[33] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[34] Sun, Y., Wang, H., Xue, B., Jin, Y., Yen, G.G., Zhang, M.: Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor. IEEE Transactions on Evolutionary Computation (2019). https://doi.org/10.1109/TEVC.2019.2924461
[35] Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: MnasNet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
[36] Tan, M., Le, Q.V.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (ICML) (2019)
[37] Tan, M., Le, Q.V.: MixConv: Mixed depthwise convolutional kernels. In: British Machine Vision Conference (BMVC) (2019)
[38] Wang, X., Kihara, D., Luo, J., Qi, G.J.: EnAET: Self-trained ensemble autoencoding transformations for semi-supervised learning. arXiv preprint arXiv:1911.09265 (2019)
[39] Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
 [40] Xie, S., Kirillov, A., Girshick, R., He, K.: Exploring randomly wired neural networks for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
 [41] Yu, K., Sciuto, C., Jaggi, M., Musat, C., Salzmann, M.: Evaluating the search phase of neural architecture search. In: International Conference on Learning Representations (ICLR) (2020)
 [42] Zitzler, E., Thiele, L.: Multiobjective optimization using evolutionary algorithms — a comparative case study. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.P. (eds.) Parallel Problem Solving from Nature — PPSN V. pp. 292–301. Springer Berlin Heidelberg, Berlin, Heidelberg (1998)
 [43] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Appendix
Recall that Neural Architecture Search (NAS) is formulated as a bilevel optimization problem in the original paper. The key idea of MSuNAS is to adopt a surrogate model at both the upper and lower level in order to improve the efficiency of solving the NAS bilevel problem. In this appendix, we include the following material:
- Further analysis of the upper-level surrogate model of MSuNAS in Section 0.A.
- Post-search analysis, in terms of mining for architectural design insights, in Section 0.B.1.
- "Objective transfer" in Section 0.B.2, where we seek to quickly search for architectures optimized for target objectives by initializing the search with architectures sampled from insights gained by searching on source objectives.
- Visualization of the final architectures on the six datasets we searched in Section 0.C.
Appendix 0.A Correlation Between Search Performance and Surrogate Model
In MSuNAS, we use a surrogate model at the upper (architecture) level to reduce the number of architectures sent to the lower level for weight learning. A surrogate model should have at least two properties:
- a high rank-order correlation between the performance predicted by the surrogate model and the true performance;
- high sample efficiency, so that the number of architectures that must be fully trained and evaluated to construct the surrogate model is as low as possible.
In this section, we quantify the relationship between the surrogate model's rank-order correlation (Kendall's Tau [16]) and MSuNAS's search performance. On the ImageNet dataset, we run MSuNAS with four different surrogate models: a Multi-Layer Perceptron (MLP), Classification And Regression Trees (CART), Radial Basis Functions (RBF), and Gaussian Processes (GP). We record the cumulative hypervolume [42] and compute the rank-order correlation over all architectures evaluated during the search. The results are provided in Fig. 9. In MSuNAS, we iteratively fit and refine the surrogate models using only architectures that are close to the Pareto frontier. The surrogate models can therefore focus on interpolating across a much more restricted region of the search space (models close to the current Pareto front), leading to significantly better rank-order correlation than existing methods [19, 13] achieve (e.g., 0.476 for Progressive NAS [19]). Furthermore, we empirically observe that a higher rank-order correlation in a surrogate model translates into better search performance (lower sample complexity), measured by hypervolume [42], when paired with MSuNAS. On ImageNet, RBF outperforms the other three surrogate models considered. However, to improve generalization to other datasets, we follow an adaptive switching routine that compares all four surrogate models and selects the best based on cross-validation (see Section 3.3 in the main paper).
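The adaptive switching routine can be sketched as follows. This is an illustrative approximation rather than the released implementation: the hyperparameters, the `KernelRidge` stand-in for RBF interpolation, and the helper name `select_surrogate` are our own assumptions; the score is the cross-validated Kendall's Tau between predicted and true accuracy.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor


def select_surrogate(X, y, n_splits=5):
    """Fit MLP / CART / RBF / GP surrogates and keep the one with the
    highest cross-validated Kendall's Tau (rank-order correlation)
    between predicted and true performance. Hyperparameters are
    illustrative assumptions."""
    candidates = {
        "MLP": lambda: MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000),
        "CART": lambda: DecisionTreeRegressor(max_depth=8),
        "RBF": lambda: KernelRidge(kernel="rbf"),  # stand-in for RBF interpolation
        "GP": lambda: GaussianProcessRegressor(),
    }
    scores = {}
    for name, make in candidates.items():
        taus = []
        for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
            model = make().fit(X[tr], y[tr])
            tau, _ = kendalltau(y[te], model.predict(X[te]))
            taus.append(0.0 if np.isnan(tau) else tau)  # guard degenerate folds
        scores[name] = float(np.mean(taus))
    best = max(scores, key=scores.get)
    # Refit the winner on all available (architecture, accuracy) pairs.
    return best, candidates[best]().fit(X, y), scores
```

The selection is re-run every time the surrogate is refined, so the model family can change as more architectures near the Pareto front are evaluated.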
Appendix 0.B Post Search Analysis
0.B.1 Mining for Insights
Every single run of MSuNAS generates a set of architectures. Mining the information generated through this process allows practitioners to choose a suitable architecture a posteriori to the search. To demonstrate one such scenario, we ran MSuNAS to optimize predictive performance along with one of four different efficiency-related measurements: MAdds, Params, CPU latency, and GPU latency. At the end of the evolution, we identify the non-dominated architectures and visualize their architectural choices in Figs. 9(a)–9(d). We observe that the efficient architectures under MAdds, CPU-latency, and GPU-latency requirements are similar, indicating a positive correlation among these objectives; this is not the case with Params. We notice that MSuNAS implicitly exploits the fact that Params is agnostic to the image resolution, and chooses input images at the highest allowed resolution to improve predictive performance (see the Input Resolution heatmap in Fig. 9(b)).
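Identifying the non-dominated architectures from a finished run reduces to a standard Pareto filter. A minimal sketch, assuming every objective is expressed as a minimization (e.g. top-1 error and MAdds); the function name is ours:

```python
import numpy as np


def non_dominated(points):
    """Return the indices of non-dominated points.

    A point p is dominated if some other point is no worse than p in
    every objective and strictly better in at least one.
    """
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(
            np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep
```

This O(n^2) filter is adequate for the hundreds of architectures a single search run produces; faster non-dominated sorting (e.g. the NSGA-II procedure [12]) matters only at much larger scales.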
0.B.2 Transfer Across Objectives
Further post-optimal analysis of the set of non-dominated architectures often reveals valuable design principles, referred to as derived heuristics [26]. Such derived heuristics can be utilized for novel tasks. Here we consider one such example: transferring architectures and their associated weights from models searched with respect to one pair of objectives to architectures that are optimal with respect to a different pair of objectives. The idea is that if the objectives we want to transfer across are related but not identical, for instance MAdds and latency, we can improve search efficiency by exploiting such correlations. More specifically, we can search for optimal architectures with respect to a target set of objectives much more efficiently than starting from scratch by initializing the search with architectures from a source set of objectives. As a demonstration of this property, we conduct the following experiment:
- Target objectives: predictive performance and CPU latency.
- Approach 1 ("from scratch"): MSuNAS initialized with randomly (uniformly) sampled architectures.
- Approach 2 ("objective transfer"): MSuNAS initialized with architectures sampled from a distribution constructed from the non-dominated architectures of the source objectives, namely predictive performance and MAdds (Fig. 9(a)).
In Approach 1, we initialize the search process for the target objectives with randomly sampled architectures (uniform over the search space). In contrast, in Approach 2, we initialize the search process for the target objectives with architectures sampled from the insights obtained by searching on a related pair of source objectives (predictive performance and MAdds), i.e., from the distribution in Fig. 9(a).
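The initialization in Approach 2 can be sketched as follows, assuming an integer-encoded search space with one categorical choice per position. The add-one smoothing (so that no option is ruled out entirely) and the helper names are our own assumptions, not the paper's implementation:

```python
import numpy as np


def build_choice_distribution(source_archs, n_options):
    """Per-position categorical frequencies over the non-dominated source
    architectures, with add-one smoothing so every option keeps nonzero
    probability."""
    archs = np.asarray(source_archs)
    dists = []
    for pos in range(archs.shape[1]):
        counts = np.bincount(archs[:, pos], minlength=n_options) + 1
        dists.append(counts / counts.sum())
    return dists


def sample_initial_population(dists, pop_size, rng):
    """Draw each architecture by sampling every position independently
    from its source-derived categorical distribution."""
    return np.stack([
        np.array([rng.choice(len(d), p=d) for d in dists])
        for _ in range(pop_size)
    ])
```

Sampling positions independently discards correlations between choices in the source front; it is the simplest way to bias the initial population toward the source insights while retaining diversity.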
In Fig. 11, we first compare the hypervolume achieved by these two approaches over five runs. We also visualize the obtained Pareto front (from the run with median hypervolume) in Fig. 11 (right). We observe that utilizing insights from searching on related objectives can significantly improve search performance. In general, we believe that heuristics can be derived and utilized to improve search performance on related tasks (e.g., MAdds and CPU latency); this is another desirable property of MSuNAS, which obtains a set of architectures in a single run. The efficiency gains from objective transfer (Approach 2) are directly proportional to the correlation between the source and target objectives: if the source and target objectives are unrelated, Approach 2 may be no more efficient than Approach 1.
Appendix 0.C Evolved Architectures
In this section, we visualize the obtained architectures in Fig. 12. All architectures are found by simultaneously maximizing predictive performance and minimizing MAdds. We observe that different datasets require different architectures for an efficient trade-off between MAdds and performance. Finding such architectures is only possible by searching directly on the target dataset, as MSuNAS does.