Sample-Efficient Neural Architecture Search by Learning Action Space

06/17/2019 ∙ by Linnan Wang, et al. ∙ 6

Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing NAS approaches often utilize a manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy). As a result, using a manually designed action space to perform NAS often leads to sample-inefficient exploration of architectures and can thus be sub-optimal. To improve sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS), which learns the action space to recursively partition the architecture search space into regions, each with concentrated performance metrics (i.e., low variance). During the search phase, as different architecture search action sequences lead to regions of different performance, the search efficiency can be significantly improved by biasing towards the regions with good performance. On the largest NAS dataset, NASBench-101, our experimental results demonstrate that LaNAS is 22x, 14.6x and 12.4x more sample-efficient than random search, regularized evolution, and Monte Carlo Tree Search (MCTS), respectively. When applied to the open domain, LaNAS finds an architecture that achieves SoTA 98.0% accuracy on CIFAR-10 and 75.0% top1 accuracy on ImageNet (mobile setting), after exploring only 6,000 architectures.


1 Introduction

Figure 1: Latent Action Neural Architecture Search. Starting from the entire model space, at each search stage we learn an action (or a set of linear constraints) to separate good from bad models, providing distinctive rewards for better searching.

During the past two years, there has been a growing interest in Neural Architecture Search (NAS) that aims to automate the laborious process of designing neural networks. Architectures found by NAS have achieved remarkable results in image classification zoph2016neural ; real2018regularized , object detection and segmentation ghiasi2019fpn ; chen2019detnas ; liu2019auto , as well as other domains such as language tasks luong2018exploring ; so2019evolved .

Starting from hand-designed discrete model space and action space, NAS utilizes search techniques to explore the search space and find the best performing architectures with respect to a single or multiple objectives (e.g., accuracy, latency, or memory), and preferably with minimal search cost.

However, one common issue faced by previous works on NAS is that the action space needs to be manually designed. The action space proposed by Zoph et al. zoph2018learning involves sequential actions to construct a network, such as selecting two nodes and choosing their operations. Other prior works, including gradient-based liu2018darts ; luo2018neural ; cai2018proxylessnas , reinforcement learning based zoph2016neural ; baker2016designing , evolution-based real2018regularized ; real2017large , and MCTS-based wang2019alphax ; negrinho2017deeparchitect approaches, all use manually designed action spaces. As suggested in sciuto2019evaluating ; li2019random ; xie2019exploring , action space design alone can be critical to network performance. Furthermore, it is often the case that a manually designed action space is not related to the performance that needs to be optimized. In Sec 2, we demonstrate an example where subtly different action spaces can lead to significantly different search efficiency. Finally, unlike games that generally have a predefined action space (e.g., Atari, Chess and Go), in NAS it is the final network that matters rather than the specific path of search, which gives a large playground for action space learning.

Based on the above observations, we propose Latent Action Neural Architecture Search (LaNAS), which learns the action space to maximize search efficiency for a given performance metric. While previous methods typically construct an architecture from an empty network by sequentially applying predefined actions, LaNAS takes a dual approach and treats each action as a linear constraint that intersects with the current model space to yield a smaller region. Our goal is to find a high-performance sub-region once multiple actions are applied to the entire model space. To achieve this goal, LaNAS iterates between a learning stage and an exploration stage. In the learning stage, each action in LaNAS is learned to partition the model space into high-performance and low-performance regions, to achieve accurate performance prediction. In the exploration stage, LaNAS applies MCTS on the learned action space to obtain more model architectures and their corresponding performance. The learned action space provides an informed guide for the search algorithm, while the exploration in MCTS collects more data to progressively bias the learned space towards more promising regions. The iterative process is jump-started by first collecting a few random models.

We show that LaNAS yields a tremendous acceleration on a diverse set of benchmark tasks, including the publicly available NASBench-101 (420,000 NASNet models trained on CIFAR-10) ying2019bench , our self-curated ConvNet-60K (60,000 plain VGG-style ConvNets trained on CIFAR-10), and LSTM-10K (10,000 LSTM cells trained on PTB). Our algorithm consistently finds the best performing architecture on all three tasks with, on average, at least an order of magnitude fewer samples than random search, vanilla MCTS, and Regularized Evolution. In the open domain search scenario, our algorithm finds a network that achieves 98.0% accuracy on CIFAR-10 and 75.0% top1 accuracy (mobile setting) on ImageNet in only 6,000 samples, using 4.5x fewer samples and achieving higher accuracy than AmoebaNet real2018regularized . Moreover, we empirically demonstrate that the learned latent actions can transfer to a new search task to further boost efficiency. Finally, we provide empirical observations to illustrate the search dynamics and analyze the behavior of our approach. We also conduct various ablation studies, together with a partition analysis, to provide guidance in determining search hyper-parameters and deploying the LaNAS framework in practice.

2 A Motivating Example

Figure 2: Illustration of motivation: (a) visualizes the MCTS search trees using the sequential and global action spaces. The node value (i.e. accuracy) is higher if the color is darker. (b) For a given node, the reward distributions of its children; the reported value is the average distance over all nodes. global better separates the search space by network quality, and provides distinctive rewards for recognizing a promising path. (c) As a result, global finds the best network much faster than sequential.

To demonstrate the importance of the action space in NAS, we start with a motivating example. Consider a simple scenario of designing a plain Convolutional Neural Network (CNN) for CIFAR-10 image classification. The primitive operation is a Conv-ReLU layer. Free structural parameters that can vary include the network depth, the number of filter channels, and the kernel size. This configuration results in a search space of 1,364 networks. To perform the search, there are two natural choices of action space: sequential and global. sequential comprises actions in the following order: adding a layer, setting its kernel size, and setting its number of filter channels. These actions are repeated for each layer. On the other hand, global uses the following actions instead: setting the network depth, setting the kernel size, and setting the number of filter channels. For these two action spaces, MCTS is employed to perform the search. Note that both action spaces can cover the entire search space but have very different search trajectories.
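The size of this toy search space can be checked with a short enumeration. The sketch below is illustrative only: the concrete parameter values (depths 1 to 5, two channel widths, two kernel sizes) are assumptions chosen because they reproduce the reported total of 1,364 networks, and the helper names are hypothetical.

```python
from itertools import product

# Assumed values (not given in the text above); they are chosen because
# they reproduce the reported search-space size of 1,364 networks.
DEPTHS = (1, 2, 3, 4, 5)
CHANNELS = (32, 64)
KERNELS = (3, 5)

def enumerate_networks():
    """Yield every plain CNN as a tuple of (channels, kernel) layers."""
    layer_choices = list(product(CHANNELS, KERNELS))
    for depth in DEPTHS:
        yield from product(layer_choices, repeat=depth)

def group_by_depth(networks):
    """Group networks by the first decision of the `global` action space."""
    groups = {}
    for net in networks:
        groups.setdefault(len(net), []).append(net)
    return groups

networks = list(enumerate_networks())
groups = group_by_depth(networks)
print(len(networks))                              # total number of networks
print({d: len(g) for d, g in groups.items()})     # networks per depth
```

Under these assumptions, the first action of global (choosing the depth) already splits the space into five groups of very different size, whereas the first action of sequential only fixes a property of the first layer.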

Fig. 2(a) visualizes the search for these two action spaces. Actions in global clearly separate desired and undesired network clusters, while actions in sequential lead to network clusters that mix good and bad networks in terms of performance. As a result, the overall distribution of accuracy along the search path (Fig. 2(b)) is concentrated for global, which is not the case for sequential. We also show the overall search performance in Fig. 2(c). As shown in the figure, global finds desired networks much faster than sequential.

This observation suggests that changing the action space can lead to very different search behavior and thus potentially better sample efficiency. In other words, an early exploration of the network depth is critical: increasing the depth is an optimization direction that can potentially lead to better model accuracy. Two natural questions arise from this motivating example. Is it possible to find a principled way to distinguish a good action space from a bad one in NAS? Is it possible to learn an action space such that it best fits the performance metric to be optimized?

3 Learning Latent Action Space

In this section, we present LaNAS, which comprises two phases: (1) a search phase and (2) a learning phase. LaNAS iteratively learns an action space and explores with the current action space. Fig. 3 gives a high-level description of LaNAS; the corresponding algorithms are described in Alg. 1.

3.1 Learning Phase

In the learning phase of each iteration, we have a fixed dataset obtained from previous explorations. Each data point in the dataset has two components: the network attributes (e.g., depth, number of filters, kernel size, connectivity, etc.) and the performance metric estimated from training (or looked up from a pre-trained dataset such as NASBench-101). Our goal is to learn a good action space from the dataset to guide future exploration as well as to find the model with the desired performance metric efficiently.

Starting from the entire model/architecture space, we recursively (and greedily) split it into smaller regions such that the estimation of the performance metric becomes more accurate. This helps us prune away poor regions as soon as possible and increases the sample efficiency of architecture search.

In particular, we model the recursive splitting process as a tree. The root node corresponds to the entire model space, while each tree node corresponds to a sub-region (Fig. 1). At each tree node, we partition its region into disjoint child regions, such that on each child region the estimate of the performance metric is the most accurate (or equivalently, has the lowest variance).

At each node, we learn a classifier to split the model space. The linear classifier takes the portion of the dataset that falls into the node's own region and outputs one of several outcomes, each corresponding to one possible action at the current node. To minimize the variance of the performance metric within every child node, we learn a linear regressor from network attributes to the performance metric on that portion of the dataset. Once learned, we use the regressor to predict an estimated metric for each set of attributes, sort the predictions, and partition them into parts. For convenience, we always send network attributes with the best predicted performance to the leftmost child, and so on. The partition thresholds, combined with the regressor, become the classifier at the node.
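The node-level split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a binary split, an ordinary least-squares fit, and a threshold at the median prediction; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_node_classifier(attrs, metric):
    """Fit a least-squares linear regressor and a median threshold for one node.

    attrs:  (n, d) array of network attributes that fall in this node's region
    metric: (n,) array of measured performance (e.g. accuracy)
    """
    X = np.hstack([attrs, np.ones((attrs.shape[0], 1))])  # append a bias column
    weights, *_ = np.linalg.lstsq(X, metric, rcond=None)
    threshold = np.median(X @ weights)  # assumed: two children of similar size
    return weights, threshold

def goes_left(attrs, weights, threshold):
    """Route architectures: True means the left (predicted-better) child."""
    attrs = np.atleast_2d(attrs)
    X = np.hstack([attrs, np.ones((attrs.shape[0], 1))])
    return X @ weights >= threshold

# Toy data: accuracy grows with depth (column 0); kernel size (column 1) is noise.
rng = np.random.default_rng(0)
attrs = np.column_stack(
    [rng.integers(1, 6, 200), rng.choice([3, 5], 200)]
).astype(float)
metric = 0.6 + 0.05 * attrs[:, 0] + rng.normal(0, 0.01, 200)

weights, threshold = fit_node_classifier(attrs, metric)
left = goes_left(attrs, weights, threshold)
print(metric[left].mean(), metric[~left].mean())
```

Routing uses only the predicted metric, so an architecture that has not been trained yet can still be assigned to a child.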

Note that we cannot use the measured performance directly for partitioning, since during the search phase a new architecture will not have it available until it has been trained and evaluated. Instead, we use the regressor's prediction to decide which child node the architecture falls into, and explore the branches of the subtree that are likely to give higher performance.

3.2 Search Phase

Once the action space is learned, the search phase follows. It uses the learned action space to explore more architectures as well as their performance. Note that in the learning phase, we use a fixed (and static) tree structure, and what we learn is the decision at each node. Therefore, during the search phase we need to decide which tree branches to try first.

Given the classifier constructed at each node, a trivial search strategy is to evaluate the classifier on a candidate architecture at each node and send it to a child node according to the thresholds. However, this is not a good strategy since it only exploits the current action space, which is learned from the current dataset and may not be optimal. There can be good model regions hidden in the right (or bad) leaves that need to be explored.

To overcome this issue, we use Monte Carlo Tree Search (MCTS) as the search method, which explores adaptively and has shown superior efficiency in various tasks. MCTS keeps track of visiting statistics at each node to balance exploitation of existing good children against exploration of new children. Like MCTS, our search phase has select, sampling and back-propagate stages. LaNAS skips the expansion stage of regular MCTS since the connectivity of our search tree is static. Note that we reset all the visitation counts when a new search phase starts, since the counts from the last search phase correspond to the old action space. In the first iteration, when there is no learned action space, we randomly sample the model space to get started.
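The select and back-propagate stages on a static tree can be sketched as below. This is a hedged illustration: it assumes a complete binary tree, the UCB1 score (mean value plus an exploration bonus) with an exploration constant `c` that is a free parameter here, and a toy reward; the class and function names are hypothetical.

```python
import math

class Node:
    """A node of the static search tree with MCTS visiting statistics."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.visits = 0
        self.value_sum = 0.0
        self.children = []  # stays empty for a leaf

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def build_tree(height):
    """Build a complete binary tree with `height` levels (heap-style indices)."""
    nodes = [Node(i) for i in range(2 ** height - 1)]
    for i, node in enumerate(nodes):
        if 2 * i + 2 < len(nodes):
            node.children = [nodes[2 * i + 1], nodes[2 * i + 2]]
    return nodes[0]

def ucb(parent_visits, child_visits, child_value, c=1.0):
    """UCB1 score; unvisited children are tried first."""
    if child_visits == 0:
        return math.inf
    return child_value + c * math.sqrt(2 * math.log(parent_visits) / child_visits)

def select(root, c=1.0):
    """Follow the largest UCB score from the root down to a leaf."""
    path = [root]
    while path[-1].children:
        parent = path[-1]
        path.append(max(
            parent.children,
            key=lambda ch: ucb(parent.visits, ch.visits, ch.mean_value, c),
        ))
    return path

def backpropagate(path, reward):
    """Update the statistics of every node on the selected path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward

def reset_counts(node):
    """Clear statistics before a new search phase (the action space changed)."""
    node.visits, node.value_sum = 0, 0.0
    for child in node.children:
        reset_counts(child)

root = build_tree(height=3)
for _ in range(20):
    path = select(root)
    reward = 0.9 if path[-1].node_id == 3 else 0.5  # toy reward: leaf 3 is best
    backpropagate(path, reward)
print(root.visits, [(n.node_id, n.visits) for n in root.children])
```

There is no expansion step because the tree is fixed in advance; only the statistics change during a search phase.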

Figure 3: An overview of LaNAS: LaNAS is an iterative algorithm in which each iteration comprises a search phase and a learning phase. The search phase uses MCTS to sample networks, while the learning phase learns a linear model between network hyper-parameters and their accuracies.

3.3 LaNAS Procedures

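Algorithm 1 LaNAS search procedure. Data: specifications of the search domain. The listing first defines a scoring function with three arguments and a conditional return, then runs a main loop "while acc < target" that contains two inner for-loops. The individual statements of the listing are not legible here.

Algorithm 2 Subroutines in Alg. 1. The first subroutine starts from an empty list, loops over the steps of a path, branches on whether a step goes "on left", and returns the result. The second subroutine walks the tree in a "while not" loop with an if/else branch at each step and returns the result. The individual statements of the listing are not legible here.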

Search phase: 1) select w.r.t. UCB: the calculation of UCB follows the standard definition in auer2002finite . As illustrated in the search procedure in Alg. 1, its inputs are the number of visits of the current and next states and the value of the next state. The selection policy takes the node with the largest UCB score. Starting from the root, we follow this policy to traverse down to a leaf (Alg. 2 line 7-13). 2) sampling from a leaf: in Fig. 3, the sequence of actions imposes several linear constraints on the original model space, yielding a polytope region for the leaf. Alg. 2 line 1-6 explains how to obtain constraints from an action sequence. There are various techniques george1993variable ; neal2003slice to perform uniform sampling in a polytope. Here we use a variant of a Markov Chain Monte Carlo (MCMC) sampler to get uniformly distributed samples. 3) back-propagate reward: after training the sampled network, LaNAS back-propagates the reward, i.e. accuracy, to update the node statistics (e.g., visit counts and values). It also back-propagates the sampled network so that every parent node keeps the network in its dataset for training.
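Step 2 above (sampling from the polytope of a leaf) can be illustrated with hit-and-run, a standard MCMC sampler for the uniform distribution on a bounded convex set. The text only states that "a variant of MCMC" is used, so hit-and-run is a swapped-in choice for illustration; the constraint matrix, the starting point and the function name are hypothetical.

```python
import numpy as np

def hit_and_run(A, b, x0, n_samples, rng, thin=10):
    """Draw approximately uniform samples from the bounded polytope {x : A x <= b}.

    x0 must be a feasible starting point; `thin` steps are taken per sample.
    """
    x = np.asarray(x0, dtype=float)
    if not np.all(A @ x <= b):
        raise ValueError("x0 must lie inside the polytope")
    samples = []
    for step in range(n_samples * thin):
        # Pick a uniformly random direction on the unit sphere.
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)
        # x + t*d stays feasible while t * (A d) <= b - A x for every row.
        Ad = A @ d
        slack = b - A @ x
        pos, neg = Ad > 1e-12, Ad < -1e-12
        t_max = np.min(slack[pos] / Ad[pos], initial=np.inf)
        t_min = np.max(slack[neg] / Ad[neg], initial=-np.inf)
        # Move to a uniformly random point on the feasible chord.
        x = x + rng.uniform(t_min, t_max) * d
        if (step + 1) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)

# Unit square 0 <= x <= 1 intersected with one learned constraint x0 + x1 <= 1.
A = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 1]], dtype=float)
b = np.array([1, 1, 0, 0, 1], dtype=float)
samples = hit_and_run(A, b, x0=[0.2, 0.2], n_samples=500,
                      rng=np.random.default_rng(0))
print(samples.mean(axis=0))
```

Each latent action on the path to a leaf contributes one row of `A` and one entry of `b`; the box constraints keep the region bounded, which the sampler requires.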

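Learning phase: the learning phase trains the regressor at every node. With more samples, the regressor becomes more accurate, and the distribution of the samples collected at each node becomes closer to that of the underlying dataset. However, as MCTS biases towards selecting good samples, LaNAS zooms into the promising sub-space, so the average metric of the samples goes up and the two distributions diverge again.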

3.4 Partition Analysis

The sample efficiency, i.e. the number of samples needed to find the global optimum, is closely related to the quality of the partitions produced by the tree nodes. Here we seek an upper bound on the number of samples in the leftmost node (the most promising sub-domain) to estimate the sampling efficiency.

Assumption 1

Given a search domain having finite samples , there exists a probabilistic density , where is the network accuracy.

Therefore, gives the number of networks having accuracies in . Since the accuracy distribution has finite variance, the following holds median_mean

(1)

is the mean accuracy over , and is the median accuracy. Note , and let’s denote . Therefore, the maximal distance from to is ; and the number of networks falling between and is , denoted as . Therefore, the root partitions into two sets that have architectures.
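For reference, a classical bound that fits the description above (a distribution with finite variance, relating its mean and its median) is the mean-median inequality. The symbols $X$, $\mu$, $m$ and $\sigma$ below are assumed notation rather than the paper's own, and this is offered as a plausible reading, not as the authors' exact statement:

```latex
% Assumed notation: X is the accuracy of a randomly drawn network,
% \mu its mean, m its median, \sigma its standard deviation.
\begin{align*}
|\mu - m| = \bigl|\mathbb{E}[X - m]\bigr|
  &\le \mathbb{E}|X - m|
     && \text{(Jensen's inequality)} \\
  &\le \mathbb{E}|X - \mu|
     && \text{(the median minimizes } c \mapsto \mathbb{E}|X - c| \text{)} \\
  &\le \sqrt{\mathbb{E}(X - \mu)^2} = \sigma
     && \text{(Cauchy--Schwarz)}
\end{align*}
```

Under this reading, a split at the predicted mean can deviate from the even split at the median by at most the mass of networks whose accuracy lies within one standard deviation of the mean.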

Theorem 1

Given a search tree with height = , the sub-domain represented by the leftmost leaf contains at most architectures, and is the largest partition error from the node on the leftmost path.

Proof: let’s assume the left child is always assigned with the larger partition, and let’s recursively apply this all the way down to the leftmost leaf times, resulting in . is related to and ; note with more samples as as samples , and can be estimated from samples. The analysis indicates that LaNAS is approximating a good search region at the speed of , suggesting 1) the performance improvement will remain near plateau as , while the computational costs ( nodes) exponentially increase; 2) the performance improvement is limited when is small. These two points are empirically verified in Fig. 6(a) and Fig. 5, respectively.

4 Experiment

We performed extensive experiments on both offline collected benchmark datasets (e.g., NASBench ying2019bench ) and open search domain to validate the effectiveness of LaNAS.

Figure 4: Evaluations of search dynamics: (a) sample distribution approximates dataset distribution when the number of samples . The search algorithm then zooms into the promising sub-domain, as shown by the growth of when . (b) KL-divergence of and dips and bounces back. continues to grow, showing the average metric over different nodes becomes higher when the search progresses.

4.1 Analysis of Search Algorithm

We analyze LaNAS using NASBench-101, which contains more than 400K models. The dataset provides us with the true distribution of model accuracy, given any subset of model specifications, or equivalently a collection of actions (or constraints). By construction, left nodes contain regions of the good metric while right nodes contain regions of the poor metric. Therefore, at each node , we can construct reference distribution from the entire dataset, by sorting them with respect to the metric, and partition them into different buckets of even size. We compare with the empirical distribution estimated from the learned action space, where is the number of accumulated samples at the node . To compare and , we use KL-divergence , and their mean value and .
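The comparison between a reference distribution and the empirical distribution of samples can be sketched with a discrete KL-divergence over buckets. The sketch below is illustrative: the bucket counts are made up, the direction of the divergence (empirical relative to reference) is an assumption, and the small `eps` smoothing term is added only to avoid taking the logarithm of zero.

```python
import numpy as np

def bucket_distribution(counts):
    """Turn per-bucket sample counts into a probability distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL(p || q) with light smoothing for empty buckets."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Reference: buckets of even size. Empirical: hypothetical sample counts
# early in the search (close to even) and late (biased to the best bucket).
reference = bucket_distribution([100, 100, 100, 100])
early = bucket_distribution([30, 20, 25, 25])
late = bucket_distribution([70, 20, 5, 5])

print(kl_divergence(early, reference), kl_divergence(late, reference))
```

A divergence that first shrinks and later grows again is what one would expect if the search initially covers the space evenly and then concentrates on the promising buckets.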

In our experiments, we use a complete binary tree with a height of 5. We label nodes 0-14 as internal nodes, and nodes 15-30 as leaves. By definition, . At the beginning of the search (), where belongs to good sub-domains, e.g. , are expected to be different from due to their random initialization. With more samples (), starts to approximate , manifested by the increased similarity between and , the transition from to in Fig. 4(b), and the decreasing in Fig. 4(a). This is because MCTS explores the under-explored regions. As the search continues (), LaNAS explores deeper into promising sub-domains and is biased toward the region with good performance, deviating from the even partition that is used to construct . As a result, bounces back. These search dynamics demonstrate that our algorithm can adapt to different stages during the course of the search.

4.2 Performance on NAS Datasets

Evaluating on NAS Datasets: We use NASBench-101 ying2019bench as one benchmark; it contains over 420K NASNet CNN models. For each network, it records the architecture information and the associated accuracy for fast retrieval by a NAS algorithm, avoiding time-consuming model retraining. In addition, we construct two more datasets for benchmarking, ConvNet-60K (plain ConvNet models, VGG-style, no residual connections, trained on CIFAR-10) and LSTM-10K (LSTM cells trained on PTB), to further validate the effectiveness of the proposed LaNAS framework.

Baselines: We compare LaNAS with a few baseline methods that can obtain the optimal solution given sufficient exploration. Random Search finds the global optimum in an expected number of samples that scales with the dataset size, and is dataset-agnostic. Regularized Evolution empirically finds the global optimum, and is applied in AmoebaNet real2018regularized , which achieves SoTA performance for image recognition. MCTS is an anytime search algorithm used in NAS wang2019alphax with global optimality guarantees.

Figure 5: Evaluations of sample-efficiency on NASBench, ConvNet-60K and LSTM-10K. Each search algorithm is repeated 100 times on each dataset with different random seeds. The top row shows the time-course of the search algorithms (current best accuracy with interquartile range), while the bottom row illustrates the number of samples needed to reach the global optimum.

Analysis of Results: Fig. 5 demonstrates that LaNAS consistently outperforms the baselines by significant margins on three separate tasks. In particular, on NASBench, LaNAS uses on average 22x, 14.6x, and 12.4x fewer samples than Random Search, Regularized Evolution, and MCTS, respectively, to find the global optimum. On LSTM, LaNAS still performs the best even though the dataset is small.

Fig. 4 shows that LaNAS minimizes the variance of reward on a search path, making good networks located on good paths, thereby drastically improving the search efficiency. In contrast, Random Search relies on blind search and leads to the worst performance. Regularized Evolution utilizes a static exploration strategy that maintains a pool of top 500 architectures for random mutations, which is not guided by previous search experience. MCTS builds online models of both performance and visitation counts for adaptive exploration. However, without a good action space, the performance model at each node cannot be highly selective, leading to inefficient search (Fig. 2).

4.3 Performance on Open Domain Search

Table 1: Results on CIFAR-10.
Model                             Params     top1 err   M
NASNet-A zoph2018learning         3.3 M      2.65       20000
NASNet-A zoph2018learning         27.6 M     2.40       20000
AmoebaNet-B real2018regularized   3.2 M                 27000
AmoebaNet-B real2018regularized   34.9 M                27000
PNASNet-5 liu2018progressive      3.2 M                 1160
NAO luo2018neural                 10.6 M     3.18       1000
NAO luo2018neural                 128.0 M    2.11       1000
ENAS pham2018efficient            4.6 M      2.89       N/A
DARTS liu2018darts                3.3 M                 N/A
LaNet                             3.2 M                 6000
LaNet                             38.7 M                6000
  • trained with cutout.
  • M: number of samples selected.

Table 2: Results on ImageNet (mobile setting)
Model                             FLOPs   Params   top1 / top5 err
NASNet-A zoph2018learning         564M    5.3 M    26.0 / 8.4
NASNet-B zoph2018learning         488M    5.3 M    27.2 / 8.7
NASNet-C zoph2018learning         558M    4.9 M    27.5 / 9.0
AmoebaNet-A real2018regularized   555M    5.1 M    25.5 / 8.0
AmoebaNet-B real2018regularized   555M    5.3 M    26.0 / 8.5
AmoebaNet-C real2018regularized   570M    6.4 M    24.3 / 7.6
PNASNet-5 liu2018progressive      588M    5.1 M    25.8 / 8.1
DARTS liu2018darts                574M    4.7 M    26.7 / 8.7
FBNet-C FBNet                     375M    5.5 M    25.1 / -
RandWire-WS xie2019exploring      583M    5.6 M    25.3 / 7.8
LaNet                             570M    5.1 M    25.0 / 7.7

Table 1 compares our results in the context of searching NASNet-style architectures on CIFAR-10, a common setting in current NAS research. The experimental setup is further described in the Appendix. In only 6,000 samples, our best performing architecture (LaNet) demonstrates an average accuracy of 97.47% (#filters = 32, #params = 3.22M) and 98.01% (#filters = 128, #params = 38.7M), which is better than all existing NAS-based results. It is worth noting that we achieved this accuracy with 4.5x fewer samples than AmoebaNet. Since AmoebaNet and LaNet share the same search space, the saving comes purely from our sample-efficient search algorithm. Gradient-based methods and their weight-sharing variants, e.g. DARTS and NAO, exhibit weaker performance. We suspect that they are easily trapped in a local optimum, as observed in sciuto2019evaluating .

4.4 Transfer Learning

Transfer LaNet to ImageNet: Transferring the best performing architecture (found through searching) from CIFAR-10 to ImageNet is already a standard technique. Following the mobile setting zoph2018learning , Table 2 shows that LaNet found on CIFAR-10, when transferred to the ImageNet mobile setting (FLOPs are constrained under 600M), achieves competitive performance.

Figure 6: Ablation study: (a) the effect of different tree heights and #selects in MCTS. The number in each entry is the #samples needed to reach the global optimum. (b) the choice of predictor for splitting the search space.
Figure 7: Latent actions transfer: learned latent actions can generalize within the same task or across different tasks, to further boost search efficiency.

Intra-task latent action transfer: We learn actions from a subset (1%, 5%, 10%, 20%, 60%) of NASBench and test their transferability to the remaining dataset, as shown in Fig. 7(a). Interestingly, the improvement remains steady after 10%. Consistent with Fig. 4, it is enough to use 10% of the samples to learn the action space.

Inter-task latent action transfer: We compare 100 architectures selected by LaNAS from Sec. 4.3 (on CIFAR-10) with 100 random trials. Networks are trained for 100 epochs and their performances are compared. Fig. 7(b) indicates that inter-task action transfer is also beneficial.

4.5 Ablation studies

The effect of tree height and #selects: Fig. 6(a) relates the tree height and the number of selects (#selects) to the search performance. In Fig. 6(a), each entry represents the #samples needed to achieve optimality on NASBench, averaged over 100 runs. A deeper tree leads to better performance, since the model space is partitioned by more leaves. Similarly, a small #selects results in more frequent updates of the action space, and thus leads to improvement. On the other hand, the number of classifiers increases exponentially as the tree goes deeper, and a small #selects requires frequent action updates. Therefore, both can significantly increase the computation cost.
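The exponential growth in the number of classifiers can be made concrete with a short calculation. The sketch assumes a complete binary tree whose height counts levels (the root alone has height 1) and one classifier per internal node; the function name is hypothetical.

```python
def tree_size(height):
    """Node counts of a complete binary tree with `height` levels."""
    total = 2 ** height - 1
    leaves = 2 ** (height - 1)
    # Assumed: one learned classifier per internal (non-leaf) node.
    return {"nodes": total, "classifiers": total - leaves, "leaves": leaves}

for h in (3, 5, 8):
    print(h, tree_size(h))
```

Under these assumptions, moving from height 5 to height 8 raises the number of classifiers to retrain from 15 to 127 per learning phase.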

Choice of classifiers: Fig. 6(b) shows that using a linear classifier performs better than a multi-layer perceptron (MLP) classifier. This indicates that adding complexity to the decision boundary of actions may not help performance. On the contrary, performance can degrade because the more complex model is potentially harder to optimize.

5 Future Work

Recent work on shared models luo2018neural ; pham2018efficient improves training efficiency by reusing trained components from similar, previously explored architectures, e.g., weight sharing liu2018darts ; luo2018neural ; pham2018efficient . Our work focuses on sample-efficiency and is complementary to the above techniques.

To encourage reproducibility in NAS research, various architecture search baselines have been discussed in sciuto2019evaluating ; li2019random . We will also open source the proposed LaNAS framework, together with the three NAS benchmark datasets used in our experiments.

References

6 Appendix

6.1 NAS Datasets and Experiment Setup:

A NAS dataset enables directly querying a model’s performance, e.g. accuracy. This allows a search algorithm to be evaluated by repeating hundreds of independent searches without any actual training. NASBench-101 ying2019bench is the only publicly available NAS dataset; it contains over 420K DAG-style networks for image recognition. However, a search algorithm might overfit NASBench-101 and lose generality. This motivates us to collect two more NAS datasets: one for image recognition using sequential CNNs and the other for language modeling using LSTMs.

Collecting ConvNet-60K dataset: following a setup similar to the one used to collect the 1,364 networks in Sec. 2, the free structural parameters that can vary are the network depth, the number of filter channels, and the kernel size. We train every possible architecture for 100 epochs, and collect the final test accuracy of each in the dataset.

Collecting LSTM-10K dataset: following an LSTM cell definition similar to that in pham2018efficient , we represent an LSTM cell with a connectivity matrix and a node list of operators. The choice of operators is limited to either a fully connected layer followed by ReLU or an identity layer, while the connectivity is limited to 6 nodes. We randomly sampled 10K architectures from this constrained search domain, and trained them following the exact same setup as in liu2018darts . We then keep the test perplexity as the performance metric for NAS.

These 3 datasets, 420K NASBench-101, 60K CNN, and 10K LSTM, enable us to fairly evaluate the search efficiency of LaNAS against different algorithms. In each task, we perform an independent search 100 times with different random seeds. The mean performance, along with the 25% to 75% error range, is shown in Fig. 5. The structure of the search tree is consistent across all 3 tasks, using a tree with height = 5 and #selects = 100. We randomly pick 2000 networks to initialize LaNAS.

6.2 Open Domain Search and Experiment Setup:

It is not sufficient to verify the search efficiency exclusively on NAS datasets, as the search space in a real setting contains far more networks than any dataset. This motivates us to design an open domain search to test LaNAS on a search domain with billions of potential architectures. A truly effective search algorithm is expected to find a good architecture with fewer samples.

Our search space is consistent with the widely adopted NASNet zoph2018learning . We allow for 6 types of layers in the search, which are 3x3 max pool, 5x5 max pool, 3x3 depth-separable conv, 5x5 depth-separable conv, 7x7 depth-separable conv, and identity. Inside a Cell, we allow for up to 7 blocks. The search architecture is still consistent with the one used on the NAS datasets, except that we increase the tree height from 5 to 8. Similar to PNAS liu2018progressive , we use the same cell architecture for both “normal” and “reduction” layers. The best convolutional cell found by LaNAS is visualized below.