1 Introduction
During the past two years, there has been growing interest in Neural Architecture Search (NAS), which aims to automate the laborious process of designing neural networks. Architectures found by NAS have achieved remarkable results in image classification [1, 2], object detection and segmentation [3, 4, 5], as well as other domains such as language tasks [6, 7].
Starting from a hand-designed discrete model space and action space, NAS utilizes search techniques to explore the search space and find the best-performing architectures with respect to one or more objectives (e.g., accuracy, latency, or memory), preferably with minimal search cost.
However, one common issue faced by previous works on NAS is that the action space must be manually designed. The action space proposed by Zoph et al. [8] involves sequential actions to construct a network, such as selecting two nodes and choosing their operations. Other prior works, including gradient-based approaches [9, 10] (e.g., ProxylessNAS), reinforcement-learning-based [1, 11], evolution-based [2, 12], and MCTS-based [13, 14] approaches, all use manually designed action spaces. As suggested in [15, 16, 17], action space design alone can be critical to network performance. Furthermore, a manually designed action space is often unrelated to the performance metric that needs to be optimized. In Sec. 2, we demonstrate an example where subtly different action spaces lead to significantly different search efficiency. Finally, unlike games that generally have a predefined action space (e.g., Atari, Chess, and Go), in NAS it is the final network that matters rather than the specific search path, which leaves a large playground for action space learning.

Based on the above observations, we propose Latent Action Neural Architecture Search (LaNAS), which learns the action space to maximize search efficiency for a given performance metric. While previous methods typically construct an architecture from an empty network by sequentially applying predefined actions, LaNAS takes a dual approach and treats each action as a linear constraint that intersects the current model space to yield a smaller region. Our goal is to find a high-performance sub-region once multiple actions are applied to the entire model space. To achieve this, LaNAS iterates between a learning stage and an exploration stage. In the learning stage, each action is learned to partition the model space into high-performance and low-performance regions, so that performance prediction becomes accurate. In the exploration stage, LaNAS applies MCTS on the learned action space to obtain more model architectures and their corresponding performance.
The learned action space provides an informed guide for the search algorithm, while the exploration in MCTS collects more data to progressively bias the learned space towards more promising regions. The iterative process is jump-started by first collecting a few random models.
We show that LaNAS yields a tremendous acceleration on a diverse set of benchmark tasks, including the publicly available NASBench-101 (420,000 NASNet models trained on CIFAR-10) [18], our self-curated ConvNet-60K (60,000 plain VGG-style ConvNets trained on CIFAR-10), and LSTM-10K (10,000 LSTM cells trained on PTB). Our algorithm consistently finds the best-performing architecture on all three tasks with, on average, at least an order of magnitude fewer samples than random search, vanilla MCTS, and Regularized Evolution. In the open-domain search scenario, our algorithm finds a network that achieves 98.0% accuracy on CIFAR-10 and 75.0% top-1 accuracy (mobile setting) on ImageNet in only 6,000 samples, using 4.4x fewer samples and achieving higher accuracy than AmoebaNet [2]. Moreover, we empirically demonstrate that the learned latent actions can transfer to a new search task to further boost efficiency. Finally, we provide empirical observations to illustrate the search dynamics and analyze the behavior of our approach. We also conduct various ablation studies, together with a partition analysis, to provide guidance for determining search hyperparameters and deploying the LaNAS framework in practice.
2 A Motivating Example
To demonstrate the importance of the action space in NAS, we start with a motivating example. Consider a simple scenario of designing a plain Convolutional Neural Network (CNN) for CIFAR-10 image classification. The primitive operation is a Conv-ReLU layer. Free structural parameters that can vary include the network depth, the number of filter channels, and the kernel size. This configuration results in a search space of 1,364 networks. To perform the search, there are two natural choices of action space: sequential and global. Sequential comprises actions in the following order: adding a layer, setting its kernel size, and setting its filter channels, repeated once per layer. Global instead uses the actions: {setting the network depth, setting kernel sizes, setting filter channels}. For these two action spaces, MCTS is employed to perform the search. Note that both action spaces cover the entire search space but produce very different search trajectories.

Fig. 2(a) visualizes the search for the two action spaces. Actions in global clearly separate desired and undesired network clusters, while actions in sequential lead to clusters that mix good and bad networks in terms of performance. As a result, the overall distribution of accuracy along the search path (Fig. 2(b)) shows concentration behavior for global, which is not the case for sequential. We also show the overall search performance in Fig. 2(c): global finds desired networks much faster than sequential.
This observation suggests that changing the action space can lead to very different search behavior and thus potentially better sample efficiency. In this example, an early decision on the network depth is critical: increasing the depth is an optimization direction that can lead to better model accuracy. A natural question arises from this motivating example: is there a principled way to distinguish a good action space from a bad one in NAS, and is it possible to learn an action space that best fits the performance metric to be optimized?
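The gap between the two action spaces can be illustrated with a toy version of this setting. The parameter ranges and the synthetic accuracy function below are illustrative assumptions, not the paper's actual 1,364-network configuration; the sketch compares the within-cluster accuracy variance after the first action of each action space:

```python
import statistics

# Hypothetical plain-CNN space (illustrative ranges, not the paper's exact ones).
depths, kernels, channels = [1, 2, 3, 4], [3, 5], [16, 32, 64]
space = [(d, k, c) for d in depths for k in kernels for c in channels]

def accuracy(d, k, c):
    # Synthetic metric, dominated by depth, mirroring the example's observation.
    return 0.5 + 0.1 * d + 0.01 * (c / 64) + 0.005 * (k == 3)

def mean_cluster_variance(first_action):
    # Group networks by the outcome of the first action, then average the
    # accuracy variance within each resulting cluster.
    clusters = {}
    for arch in space:
        clusters.setdefault(first_action(arch), []).append(accuracy(*arch))
    return statistics.mean(statistics.pvariance(vs) for vs in clusters.values())

# "global": the first action fixes the depth.
# "sequential": the first actions fix the first layer's kernel and channels.
var_global = mean_cluster_variance(lambda a: a[0])
var_sequential = mean_cluster_variance(lambda a: (a[1], a[2]))
assert var_global < var_sequential   # global's clusters are far purer
```

Because the synthetic metric depends mostly on depth, clustering by depth first yields nearly pure clusters, while clustering by per-layer settings first mixes shallow and deep networks, which is the behavior Fig. 2 illustrates.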
3 Learning Latent Action Space
In this section, we present LaNAS, which comprises two phases: (1) a search phase and (2) a learning phase. LaNAS iteratively learns an action space and explores with the current action space. Fig. 3 gives a high-level illustration of LaNAS; the corresponding algorithms are further described in Alg. 1.
3.1 Learning Phase
In the learning phase, we have a fixed dataset D obtained from previous explorations. Each data point (x, v) in D has two components: x represents the network attributes (e.g., depth, number of filters, kernel size, connectivity) and v represents the performance metric estimated from training (or retrieved from a pre-trained dataset such as NASBench-101). Our goal is to learn a good action space from the dataset to guide future exploration and to find models with the desired performance metric efficiently.
Starting from the entire model/architecture space Ω, we recursively (and greedily) split it into smaller regions such that the estimate of the performance metric becomes more accurate. This helps us prune away poor regions as early as possible and increases the sample efficiency of architecture search.
In particular, we model the recursive splitting process as a tree. The root node corresponds to the entire model space Ω, while each tree node corresponds to a sub-region of Ω (Fig. 1). At each tree node, we partition its region into disjoint child regions such that, on each child region, the estimate of the performance metric is as accurate as possible (or, equivalently, has the lowest variance).
At each node j, we learn a classifier to split the model space. The linear classifier takes the portion of the dataset that falls into the node's own region and outputs one of several outcomes, each corresponding to one possible action at node j. To minimize the metric variance within each child, we learn a linear regressor f_j that minimizes the squared difference between the predicted metric f_j(x) and the observed metric v on the node's data. Once learned, we use f_j to predict an estimated metric for attributes x, sort the samples by this estimate, and partition them into parts. For convenience, we always send the network attributes with the best predicted performance to the leftmost child, and so on. The partition thresholds, combined with f_j, become the classifier at node j.

Note that we cannot use v directly for partitioning, since during the search phase a new architecture does not have v available until it has been trained and evaluated. Instead, we use f_j to decide which child node a sample falls into, and explore the branches of the subtree that are likely to give higher performance.
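One learning-phase split can be sketched in a few lines. The 4-attribute encoding and the synthetic linear metric below are hypothetical stand-ins for a real architecture dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D = {(x, v)}: x encodes architecture attributes, v is the metric.
X = rng.random((200, 4))                       # 200 sampled architectures
v = X @ np.array([0.5, 0.3, 0.1, 0.05]) + 0.01 * rng.standard_normal(200)

# Fit the node's linear regressor f(x) ~ v by least squares (bias appended).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, v, rcond=None)
pred = A @ w

# Partition into two children: the best-predicted half goes to the left child.
order = np.argsort(-pred)                      # sort by predicted metric, descending
left, right = order[: len(order) // 2], order[len(order) // 2 :]
threshold = pred[order[len(order) // 2]]       # with w, this defines the classifier
assert v[left].mean() > v[right].mean()        # left child holds the promising region
```

At search time the node routes a new, untrained architecture by computing its predicted metric against `threshold`, exactly because its true v is not yet available.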
3.2 Search Phase
Once the action space is learned, the search phase follows. It uses the learned action space to explore more architectures and their performance. Note that in the learning phase we use a fixed (static) tree structure; what we learn is the decision made at each node. Therefore, during the search phase we need to decide which tree branches to try first.
Following the construction of the classifier at each node, a trivial search strategy would evaluate f_j at each node j and send the sample to a child according to the thresholds. However, this is not a good strategy, since it only exploits the current action space, which is learned from the current dataset and may not be optimal. Good model regions may be hidden in the right (bad) leaves and still need to be explored.
To overcome this issue, we use Monte Carlo Tree Search (MCTS) as the search method; it explores adaptively and has shown superior efficiency in various tasks. MCTS keeps visitation statistics at each node to balance exploitation of existing good children with exploration of new children. As in regular MCTS, our search phase has select, sampling, and backpropagate stages; LaNAS skips the expansion stage since the connectivity of our search tree is static. Note that we reset all visitation counts when a new search phase starts, since the counts from the last search phase correspond to the old action space. In the first iteration, when there is no learned action space yet, we randomly sample the model space to jump-start the process.
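The select and backpropagate stages on a static tree can be sketched as follows. The UCB1 formula follows the standard definition; the binary tree, the exploration constant, and the synthetic reward are illustrative assumptions rather than the paper's exact configuration:

```python
import math

class Node:
    def __init__(self, depth, max_depth):
        self.n, self.value = 0, 0.0           # visitation count, accumulated reward
        self.children = ([] if depth == max_depth
                         else [Node(depth + 1, max_depth) for _ in range(2)])

def ucb(child, parent_n, c=0.5):
    if child.n == 0:
        return float("inf")                   # always try unvisited children first
    return child.value / child.n + c * math.sqrt(2 * math.log(parent_n) / child.n)

def select(root):
    """Walk from the root to a leaf, picking the child with the largest UCB."""
    path = [root]
    while path[-1].children:
        node = path[-1]
        path.append(max(node.children, key=lambda ch: ucb(ch, max(node.n, 1))))
    return path

def backpropagate(path, reward):
    for node in path:                         # update statistics along the path
        node.n += 1
        node.value += reward

root = Node(0, max_depth=3)
for _ in range(50):
    path = select(root)
    # Synthetic reward: pretend the left subtree covers the good region.
    reward = 1.0 if path[1] is root.children[0] else 0.2
    backpropagate(path, reward)
assert root.children[0].n > root.children[1].n   # exploitation dominates, with
                                                 # occasional visits to the right
```

Resetting the statistics when a new action space is learned amounts to rebuilding the tree (or zeroing every `n` and `value`) before the next round of selects.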
3.3 LaNAS Procedures
Search phase: 1) select w.r.t. UCB: the calculation of UCB follows the standard definition in [19]; as illustrated in the search procedure of Alg. 1, the inputs are the numbers of visits of the current and next states and the next state's value. The selection policy takes the child with the largest UCB score. Starting from the root, we follow this policy to traverse down to a leaf (Alg. 2, lines 7-13). 2) sampling from a leaf: as shown in Fig. 3, the sequence of actions imposes several linear constraints on the original model space Ω, yielding a polytope region for the leaf. Alg. 2, line 16 explains how to obtain the constraints from an action sequence. There are various techniques [20, 21] for uniform sampling in a polytope; here we use a variant of a Markov Chain Monte Carlo (MCMC) sampler to draw uniformly distributed samples. 3) backpropagate reward: after training the sampled network, LaNAS backpropagates the reward, i.e., the accuracy, to update the node statistics (visit counts and values). It also backpropagates the sampled network itself, so that every parent node keeps the network's data for training.

Learning phase: the learning phase trains the regressor f_j at every node j. With more samples, f_j becomes more accurate and the learned partition approaches the ideal one. Moreover, as MCTS biases towards selecting good samples, LaNAS zooms into the promising subspace, and the average metric of the collected samples rises accordingly.
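One standard choice for such an MCMC sampler is hit-and-run, sketched below on a toy 2-D polytope. The paper does not specify its exact sampler variant, so treat this as an assumed, illustrative implementation of uniform sampling under linear "action" constraints A x <= b:

```python
import numpy as np

rng = np.random.default_rng(1)

def hit_and_run(A, b, x0, n_samples, burn_in=100):
    """Hit-and-run MCMC: approximately uniform samples from {x : A x <= b}."""
    x, out = np.asarray(x0, dtype=float), []
    for t in range(burn_in + n_samples):
        d = rng.standard_normal(len(x))
        d /= np.linalg.norm(d)                 # random direction on the sphere
        # Solve A(x + s d) <= b for the feasible interval [lo, hi] of step sizes s.
        num, den = b - A @ x, A @ d
        lo = max([n / d_ for n, d_ in zip(num, den) if d_ < 0], default=-np.inf)
        hi = min([n / d_ for n, d_ in zip(num, den) if d_ > 0], default=np.inf)
        x = x + rng.uniform(lo, hi) * d        # jump uniformly along the chord
        if t >= burn_in:
            out.append(x.copy())
    return np.array(out)

# Hypothetical leaf region: x >= 0, y >= 0, x + y <= 1 (three linear "actions").
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
samples = hit_and_run(A, b, x0=[0.25, 0.25], n_samples=500)
assert np.all(samples @ A.T <= b + 1e-9)       # every sample stays in the polytope
```

In LaNAS the constraint rows would come from the learned classifiers along the selected root-to-leaf path, and continuous samples would then be rounded to valid architecture attributes.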
3.4 Partition Analysis
The sample efficiency, i.e., the number of samples needed to find the global optimum, is closely related to the quality of the partitions at the tree nodes. Here we derive an upper bound on the number of samples in the leftmost node (the most promising subdomain) to estimate sampling efficiency.
Assumption 1

Given a search domain Ω having N finite samples, there exists a probability density p such that N ∫ p(v) dv over [a, b] gives the number of networks having accuracies in [a, b], where v is the network accuracy. Since the accuracy distribution has finite variance σ², the following holds [22]:

(1)  |v̄ − m| ≤ σ

where v̄ is the mean accuracy over Ω and m is the median accuracy. Note that the median m splits Ω into two halves of N/2 networks each, while the learned partition is based on the estimated mean; let ε denote the fraction of networks whose accuracy falls between m and v̄. The maximal distance from v̄ to m is σ, and the number of networks falling between m and v̄ is at most εN. Therefore, the root partitions Ω into two sets of at most (1/2 + ε)N and at least (1/2 − ε)N architectures.
Theorem 1

Given a search tree with height h, the subdomain represented by the leftmost leaf contains at most N(1/2 + ε_max)^h architectures, where ε_max is the largest partition error among the nodes on the leftmost path.

Proof: assume the left child is always assigned the larger partition, i.e., at most a (1/2 + ε_max) fraction of its parent's architectures, and recursively apply this all the way down to the leftmost leaf h times, resulting in N(1/2 + ε_max)^h. The error ε_max is related to |v̄ − m| and σ; note that ε_max → 0 as the number of collected samples approaches N, and it can be estimated from samples. The analysis indicates that LaNAS narrows in on a good search region at a rate of roughly 2^(−h), suggesting that 1) the performance improvement will plateau as h grows, while the computational cost (2^h nodes) increases exponentially; and 2) the improvement is limited when h is small. These two points are empirically verified in the tree-height ablation (Sec. 4.5) and in Fig. 5, respectively.
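To get a feel for the bound in Theorem 1, the snippet below evaluates N(1/2 + ε)^h for a NASBench-sized domain under a hypothetical partition error ε = 0.05 (an illustrative assumption, not a measured value):

```python
# Worked example of the leftmost-leaf bound (illustration only).
N, eps = 420_000, 0.05        # NASBench-sized domain, assumed per-split error

for h in (3, 5, 8):
    bound = N * (0.5 + eps) ** h      # at most this many networks in the leftmost leaf
    nodes = 2 ** (h + 1) - 1          # tree size grows exponentially with h
    print(f"h={h}: leftmost-leaf bound ~ {bound:,.0f} networks, tree nodes = {nodes}")
```

Under these assumed numbers, going from h = 5 to h = 8 shrinks the leftmost region by only a few more factors of ~0.55 while roughly octupling the number of classifiers to train, matching the plateau-versus-cost trade-off noted above.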
4 Experiment
We performed extensive experiments on both offline benchmark datasets (e.g., NASBench-101 [18]) and the open search domain to validate the effectiveness of LaNAS.
4.1 Analysis of Search Algorithm
We analyze LaNAS using NASBench-101, which contains more than 400K models. The dataset provides the true distribution of model accuracy given any subset of model specifications, or equivalently any collection of actions (constraints). By construction, left nodes contain regions with good metrics while right nodes contain regions with poor metrics. Therefore, at each node i, we can construct a reference distribution p̄_i from the entire dataset by sorting all models with respect to the metric and partitioning them into buckets of even size. We compare p̄_i with the empirical distribution p_i estimated from the n_i samples accumulated at the node under the learned action space. To compare p_i and p̄_i, we use the KL divergence KL(p̄_i ∥ p_i), together with its mean value over internal nodes and over leaves.
In our experiments, we use a complete binary tree of height 5. We label nodes 0-14 as internal nodes and nodes 15-29 as leaves. By definition, the reference distributions p̄_i are fixed. At the beginning of the search, the empirical distributions p_i at nodes belonging to good subdomains are expected to differ from p̄_i due to random initialization. With more samples, p_i starts to approximate p̄_i, manifested by the increasing similarity between p_i and p̄_i and the decreasing mean KL divergence; this is because MCTS explores the under-explored regions. As the search continues, LaNAS explores deeper into the promising subdomains and p_i becomes biased toward the high-performance region, deviating from the even partition used to construct p̄_i; as a result, the KL divergence bounces back. These search dynamics demonstrate that our algorithm adapts to different stages over the course of the search.
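The KL-divergence comparison above can be sketched numerically. The bucket distributions below are synthetic stand-ins (hypothetical numbers) for a "good" node's reference distribution and its empirical distribution early versus late in the search:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same buckets."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Reference p_bar for the root's left child: it should hold an even slice of the
# metric-sorted dataset, i.e. the good half (buckets = accuracy deciles).
p_ref = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0])

early = np.full(10, 0.1)                          # random init: samples spread evenly
late = np.array([0.19, 0.2, 0.2, 0.2, 0.19,
                 0.01, 0.005, 0.005, 0.0, 0.0])   # after learning: concentrated

assert kl(p_ref, late) < kl(p_ref, early)         # learned actions approach p_bar
```

A further drift of `late` toward only the top one or two deciles would then increase the divergence again, mirroring the "bounce back" once MCTS zooms into the best region.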
4.2 Performance on NAS Datasets
Evaluating on NAS datasets: We use NASBench-101 [18] as one benchmark; it contains over 420K NASNet-style CNN models. For each network, it records the architecture and the associated accuracy for fast retrieval by NAS algorithms, avoiding time-consuming model retraining. In addition, we construct two more datasets for benchmarking, ConvNet-60K (plain VGG-style ConvNets without residual connections, trained on CIFAR-10) and LSTM-10K (LSTM cells trained on PTB), to further validate the effectiveness of the proposed LaNAS framework.
Baselines: We compare LaNAS with baseline methods that can reach the optimal solution given sufficient exploration. Random Search finds the global optimum of a dataset of size n in (n+1)/2 samples in expectation (sampling without replacement) and is dataset-agnostic. Regularized Evolution empirically finds the global optimum and was used in AmoebaNet [2], which achieved state-of-the-art performance in image recognition. MCTS is an anytime search algorithm used in NAS [13] with global optimality guarantees.
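The Random Search expectation can be sanity-checked by simulation: when sampling without replacement, the position of the single best model in a random trajectory is uniform over the n positions, giving (n+1)/2 expected draws. A quick check with a toy n (illustrative only):

```python
import random

random.seed(0)
n, trials = 1000, 2000

# Expected number of draws (without replacement) until the best of n models is seen.
draws = []
for _ in range(trials):
    order = random.sample(range(n), n)       # one random-search trajectory
    draws.append(order.index(n - 1) + 1)     # position at which the best appears

mean_draws = sum(draws) / trials
assert abs(mean_draws - (n + 1) / 2) < 40    # close to the (n+1)/2 expectation
```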
Analysis of results: Fig. 5 demonstrates that LaNAS consistently outperforms the baselines by significant margins on three separate tasks. In particular, on NASBench-101, LaNAS uses on average 22x, 14.6x, and 12.4x fewer samples than Random Search, Regularized Evolution, and MCTS, respectively, to find the global optimum. On the LSTM task, LaNAS still performs best even though the dataset is small.
Fig. 4 shows that LaNAS minimizes the variance of the reward along each search path, so that good networks concentrate on good paths, drastically improving search efficiency. In contrast, Random Search is blind and performs worst. Regularized Evolution uses a static exploration strategy, maintaining a pool of the top 500 architectures for random mutation, and is not guided by previous search experience. MCTS builds online models of both performance and visitation counts for adaptive exploration; however, without a good action space, the performance model at each node cannot be highly selective, leading to inefficient search (Fig. 2).
4.3 Performance on Open Domain Search
| Model | Params | top-1 err | M |
|---|---|---|---|
| NASNet-A [8] | 3.3 M | 2.65 | 20000 |
| NASNet-A [8] | 27.6 M | 2.40 | 20000 |
| AmoebaNet-B [2] | 3.2 M | | 27000 |
| AmoebaNet-B [2] | 34.9 M | | 27000 |
| PNASNet-5 [23] | 3.2 M | | 1160 |
| NAO [10] | 10.6 M | 3.18 | 1000 |
| NAO [10] | 128.0 M | 2.11 | 1000 |
| ENAS [24] | 4.6 M | 2.89 | N/A |
| DARTS [9] | 3.3 M | | N/A |
| LaNet | 3.2 M | | 6000 |
| LaNet | 38.7 M | | 6000 |

Trained with cutout (where applicable). M: number of samples selected.
| Model | FLOPs | Params | top-1 / top-5 err |
|---|---|---|---|
| NASNet-A [8] | 564M | 5.3 M | 26.0 / 8.4 |
| NASNet-B [8] | 488M | 5.3 M | 27.2 / 8.7 |
| NASNet-C [8] | 558M | 4.9 M | 27.5 / 9.0 |
| AmoebaNet-A [2] | 555M | 5.1 M | 25.5 / 8.0 |
| AmoebaNet-B [2] | 555M | 5.3 M | 26.0 / 8.5 |
| AmoebaNet-C [2] | 570M | 6.4 M | 24.3 / 7.6 |
| PNASNet-5 [23] | 588M | 5.1 M | 25.8 / 8.1 |
| DARTS [9] | 574M | 4.7 M | 26.7 / 8.7 |
| FBNet-C [25] | 375M | 5.5 M | 25.1 / |
| RandWire-WS [17] | 583M | 5.6 M | 25.3 / 7.8 |
| LaNet | 570M | 5.1 M | 25.0 / 7.7 |
Table 2 compares our results in the context of searching NASNet-style architectures on CIFAR-10, a common setting in current NAS research. The experimental setup is further described in the Appendix. In only 6,000 samples, our best-performing architecture (LaNet) achieves an average accuracy of 97.47% (#filters = 32, #params = 3.22M) and 98.01% (#filters = 128, #params = 38.7M), better than all existing NAS-based results. It is worth noting that we achieve this accuracy with 4.5x fewer samples than AmoebaNet. Since AmoebaNet and LaNet share the same search space, the saving comes purely from our sample-efficient search algorithm. Gradient-based methods and their weight-sharing variants, e.g., DARTS and NAO, exhibit weaker performance; we suspect they are easily trapped in local optima, as also observed in [15].
4.4 Transfer Learning
Transfer LaNet to ImageNet: Transferring the best-performing architecture found on CIFAR-10 to ImageNet has become a standard technique. Following the mobile setting [8], the ImageNet comparison table above shows that the LaNet found on CIFAR-10, when transferred to the ImageNet mobile setting (FLOPs constrained under 600M), achieves competitive performance.
Intra-task latent action transfer: We learn actions from a subset (1%, 5%, 10%, 20%, 60%) of NASBench-101 and test their transferability on the remaining dataset. Interestingly, the improvement remains steady beyond 10%; consistent with Fig. 4, 10% of the samples suffice to learn the action space.
4.5 Ablation studies
The effect of tree height and #selects: We relate the tree height h and the number of selects per learning phase (#selects) to search performance. In the ablation table, each entry reports the number of samples needed to reach the optimum on NASBench-101, averaged over 100 runs. A deeper tree performs better, since the model space is partitioned into more leaves; similarly, a smaller #selects results in more frequent updates of the action space and thus improves performance. On the other hand, the number of classifiers increases exponentially as the tree grows deeper, and a small #selects requires frequent action updates; both significantly increase the computational cost.
Choice of classifiers: Using a linear classifier at each node performs better than a multi-layer perceptron (MLP) classifier. This indicates that adding complexity to the decision boundary of actions does not necessarily help; performance can even degrade due to the harder optimization.
5 Future Work
Recent work on shared models [10, 24] improves training efficiency by reusing trained components from similar, previously explored architectures, e.g., weight sharing [9, 10, 24]. Our work focuses on sample efficiency and is complementary to these techniques.
To encourage reproducibility in NAS research, various architecture search baselines have been discussed in [15, 16]. We will also open-source the proposed LaNAS framework, together with the three NAS benchmark datasets used in our experiments.
References
 [1] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [2] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 [3] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. arXiv preprint arXiv:1904.07392, 2019.
 [4] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. DetNAS: Neural architecture search on object detection. arXiv preprint arXiv:1903.10979, 2019.
 [5] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985, 2019.
 [6] Minh-Thang Luong, David Dohan, Adams Wei Yu, Quoc V Le, Barret Zoph, and Vijay Vasudevan. Exploring neural architecture search for language tasks. arXiv preprint arXiv:1901.02985, 2018.
 [7] David R So, Chen Liang, and Quoc V Le. The evolved transformer. arXiv preprint arXiv:1901.11117, 2019.

 [8] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
 [9] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [10] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and TieYan Liu. Neural architecture optimization. In Advances in neural information processing systems, pages 7816–7827, 2018.
 [11] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.

 [12] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2902–2911. JMLR.org, 2017.
 [13] Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, and Rodrigo Fonseca. AlphaX: Exploring neural architectures with deep neural networks and Monte Carlo tree search. arXiv preprint arXiv:1903.11059, 2019.
 [14] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
 [15] Christian Sciuto, Kaicheng Yu, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142, 2019.
 [16] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.
 [17] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.
 [18] Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.
 [19] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 [20] Edward I George and Robert E McCulloch. Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
 [21] Radford M Neal et al. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
 [22] Colin Mallows. Letters to the editor. The American Statistician, 45(3):256–262, 1991.
 [23] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [24] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pages 4092–4101, 2018.
 [25] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
6 Appendix
6.1 NAS Datasets and Experiment Setup:
A NAS dataset enables directly querying a model's performance (e.g., accuracy). This allows a search algorithm to be truly evaluated over hundreds of independent searches without any actual training. NASBench-101 [18] is the only publicly available NAS dataset, containing over 420K DAG-style networks for image recognition. However, a search algorithm might overfit to NASBench-101 and lose generality. This motivates us to collect two additional NAS datasets: one for image recognition using sequential CNNs and one for language modeling using LSTMs.
Collecting the ConvNet-60K dataset: following a setup similar to the 1,364-network collection in Sec. 2, the free structural parameters that can vary are the network depth, the number of filter channels, and the kernel size. We train every possible architecture for 100 epochs and record its final test accuracy in the dataset.
Collecting the LSTM-10K dataset: following an LSTM cell definition similar to [24], we represent an LSTM cell by a connectivity matrix and a node list of operators. Each operator is either a fully connected layer followed by ReLU or an identity layer, and the connectivity is limited to 6 nodes. We randomly sampled 10K architectures from this constrained search domain and trained them following the exact setup in [9]. We then keep the test perplexity as the performance metric for NAS.
These three datasets (420K NASBench-101, 60K ConvNet, and 10K LSTM) enable us to fairly and truly evaluate the search efficiency of LaNAS against different algorithms. In each task, we perform 100 independent searches with different random seeds. The mean performance, along with the 25%-75% error range, is shown in Fig. 5. The search tree structure is the same across all three tasks: height = 5 and #selects = 100. We randomly pick 2,000 networks to initialize LaNAS.
6.2 Open Domain Search and Experiment Setup:
Verifying search efficiency exclusively on NAS datasets is not sufficient, since the search space in a real setting contains far more networks than any dataset. This motivates us to design an open-domain search that tests LaNAS on a search domain with billions of candidate architectures. A truly effective search algorithm is expected to find a good architecture with few samples.
Our search space is consistent with the widely adopted NASNet search space [8]. We allow 6 types of layers in the search: 3x3 max pool, 5x5 max pool, 3x3 depth-separable conv, 5x5 depth-separable conv, 7x7 depth-separable conv, and identity. Inside a cell, we allow up to 7 blocks. The search tree is the same as the one used on the NAS datasets, except that we increase the tree height from 5 to 8. Similar to PNAS [23], we use the same cell architecture for both the "normal" and "reduction" layers. The best convolutional cell found by LaNAS is visualized below.