AceNAS: Learning to Rank Ace Neural Architectures with Weak Supervision of Weight Sharing

08/06/2021
by   Yuge Zhang, et al.
Microsoft

Architecture performance predictors have been widely used in neural architecture search (NAS). Although they are shown to be simple and effective, the optimization objectives in prior work (e.g., precise accuracy estimation or perfect ranking of all architectures in the space) do not capture the ranking nature of NAS. In addition, a large number of ground-truth architecture-accuracy pairs are usually required to build a reliable predictor, making the process too computationally expensive. To overcome these limitations, in this paper we look at NAS from a novel point of view and introduce Learning to Rank (LTR) methods to select the best (ace) architectures from a space. Specifically, we propose to use Normalized Discounted Cumulative Gain (NDCG) as the target metric and LambdaRank as the training algorithm. We also propose to leverage weak supervision from weight sharing by pretraining architecture representations on weak labels obtained from the super-net and then finetuning the ranking model using a small number of architectures trained from scratch. Extensive experiments on NAS benchmarks and large-scale search spaces demonstrate that our approach outperforms SOTA with a significantly reduced search cost.


1 Introduction

Neural Architecture Search (NAS) has shown its effectiveness on various tasks including computer vision [39, 68, 10] and natural language processing [29, 13, 48], and is increasingly expanding to more domains and tasks. Because it removes the burden of manually designing neural architectures, NAS is becoming a powerful tool to facilitate the design of new deep learning models.

Early NAS methods [67, 68, 50] adopt Reinforcement Learning (RL) as their search strategy. Although they outperform hand-crafted architectures on vision tasks, their excessively expensive cost (e.g., 1,800 GPU days) makes them impractical. Recent works are dedicated to improving search efficiency, e.g., via evolutionary algorithms (EA) [43] and weight-sharing based methods [4, 39, 32, 11, 10].

In particular, performance predictor based methods [3, 19, 18, 35, 55, 12, 57, 56] are gaining popularity because of their simplicity, effectiveness [55], and ease of integration into other search algorithms such as EA for even better performance [57, 54]. However, little attention has been paid to the optimization objective of the performance predictor. Current research focuses either on precise accuracy estimation for each architecture [18, 55] or on a perfect ranking of all architectures [12, 57, 36], but neither is fully aligned with the nature of NAS, which is to identify the best architecture. Those objectives thus tend to be unnecessarily strict, making the algorithms difficult to optimize and less efficient.

In this paper, we propose a novel algorithm named AceNAS. We review NAS from the perspective of Learning to Rank (LTR) [33] and rethink NAS in comparison to the document ranking problem in Information Retrieval (IR). With the insight that the goals of NAS and LTR are essentially similar, we borrow techniques from LTR and relax the objective to emphasize finding the best-performing architectures. Specifically, we propose to optimize the Normalized Discounted Cumulative Gain (NDCG) [25]. Unlike correlation metrics such as Kendall's tau, NDCG attaches larger weights to the best architectures and thus encourages identifying the top-performing ones rather than considering each architecture equally. We utilize LambdaRank [9], a list-wise LTR approach, to directly optimize a ranking model that maximizes NDCG.

Another novelty of our method is that we leverage weak supervision obtained from a weight-sharing trained super-net to dramatically reduce the demand for architecture-accuracy annotations. Typical weight-sharing NAS algorithms [4, 39, 32, 11, 10] construct the search space as a super-net and optimize it by training one sampled architecture per iteration. The shared weights are then inherited to estimate architectures' performance in the search phase. Because of the mutual interference in super-net training [37, 6], such estimations are very weak (inaccurate), which makes finding the best architectures based on them challenging [65]. Instead of using the inaccurate weak performance estimations directly for searching, we pretrain the ranking model on a massive number of inaccurate but easily obtained weight-sharing labels and then finetune the model with only a small number of samples with accurate labels obtained from training from scratch.

We conduct extensive experiments on 12 combinations of search spaces and datasets with benchmarks (e.g., NAS-Bench-101 [60], NAS-Bench-201 [21]) and the ProxylessNAS search space [11]. The results demonstrate that AceNAS consistently outperforms state-of-the-art performance predictors. Specifically, we achieve the same level of accuracy with only 110 architectures trained from scratch on NAS-Bench-101, reducing the search cost by 18× compared to NAS-GBDT and by 8× compared to RE, RL, BOHB, SemiNAS and BONAS. On NAS-Bench-201, we achieve an improvement of up to 3.67% in accuracy under similar costs. On ProxylessNAS, AceNAS is twice as fast as Neural Predictor and comparable to the state of the art under mobile settings (ImageNet top-1 75.13%, 84 ms). Remarkably, AceNAS surpasses two GCN-based accuracy predictors on all benchmarked search spaces with even smaller costs.

To sum up, our main contributions are listed as follows:

  • We view NAS from a Learning to Rank perspective, relax the objective from considering each architecture equally to finding those best-performing architectures, and propose a novel algorithm named AceNAS.

  • To the best of our knowledge, this is the first work to transfer implicit knowledge from weight sharing by utilizing the weak labels produced by the super-net.

  • We comprehensively evaluate our approach on various search spaces and datasets. The results demonstrate the superiority of our approach in performance and efficiency over state-of-the-art NAS methods. We will open-source the whole code base to facilitate future NAS research.

2 Related works

Learning to Rank and Information Retrieval. Learning to rank (LTR) [33] refers to a family of methods that leverage machine learning technologies to build effective ranking models. LTR is widely used to solve ranking problems in Information Retrieval (IR). Over the past decades, many LTR methods have been proposed and have shown their effectiveness in modern IR systems such as search engines. They can be categorized into pointwise [16, 15], pairwise [8, 52] and listwise [9] approaches. Pointwise methods reduce the problem to an ordinal regression or relevant/irrelevant classification problem, but this can make the problem unnecessarily hard to solve [8] and does not directly optimize for rank itself. Subsequent work therefore proposed pairwise methods, which care more about the relative order among items than about exact item scores. However, pairwise methods suffer from the tendency of the ranking model to improve ranking orders at the bottom of the list instead of the top. Listwise approaches instead consider a whole list during training and are able to put more weight on the top scores. They often use ranking metrics that focus on the top, e.g., NDCG [25] or Mean Reciprocal Rank (MRR) [41], as their optimization goal. According to [33], listwise approaches are usually more practical and perform better in large-scale experiments.

Performance predictor in NAS. Early NAS methods use Reinforcement Learning (RL) [67, 2, 66], Evolutionary Algorithms (EA) [44, 31] or Bayesian Optimization (BO) [7, 20, 26] to explore a huge search space, which comes at a prohibitive cost because even the training of one single architecture takes hours to complete. Therefore, a good predictor of neural architecture performance becomes the key component for quickly filtering out bad-performing architectures and identifying the best ones. Various works use different machine learning techniques to build an accurate predictor, e.g., random forests [3], LSTM [19, 35], Gaussian processes [18] and GCN [51, 12, 55, 54, 14, 36]. Notably, the usage of Graph Convolutional Networks (GCN) [28] in NAS has become a trend in recent works, due to the power of GCN in extracting features from a neural architecture, which is by its essence a graph structure. However, most of these methods still require thousands of architectures to be trained, which remains too expensive for large-scale search spaces on large-scale datasets.

Weight sharing in NAS. Weight sharing [4, 39, 32, 11, 10] is a commonly used technique in NAS to avoid the massive computational cost of training each architecture independently from scratch. Despite its efficiency, its effectiveness is still under debate [61, 30, 59, 64, 1, 47, 5]. Zhang et al. [65] conduct extensive experiments on 5 search spaces and conclude that weight sharing is useful in distinguishing relatively good architectures from bad ones, but fails to identify the top ones. This inspires us to treat weight-sharing accuracy as weakly supervised labels and transfer the knowledge from weight sharing to our performance predictor. Recent works have also shown the potential of leveraging cheap metrics (e.g., FLOPs [17] and latency [12]) to improve the accuracy predictor; these techniques prove helpful in our setting as well.

3 Methodology

Figure 1: Overall illustration of AceNAS. (Top) AceNAS first samples weak labels from the super-net to train the ranking model with a multi-task loss. (Bottom) The trained GCN is then transferred to optimize the LTR model.

We design AceNAS, which incorporates techniques from information retrieval together with new NAS-specific designs and adaptations. In this section, we first formulate NAS as a Learning to Rank problem and justify NDCG as a new optimization objective for NAS (§ 3.1). Then we design a new NAS ranking model based on LambdaRank, in which we use a GCN to capture the representation of neural architectures (§ 3.2). As it is hard and expensive to collect sufficient architecture-accuracy pairs, we pretrain the GCN with weak labels obtained from a well-trained super-net, which greatly reduces the number of architecture-accuracy pairs needed (§ 3.3).

The overall illustration of AceNAS is shown in Figure 1, which consists of two stages. First, we pretrain a GCN-based ranking model with weakly supervised labels from a well-trained super-net. Second, we transfer the pretrained GCN into another ranking model and train it with LambdaRank on a limited number of architecture-accuracy pairs.

3.1 NDCG: a new optimization goal on NAS

Given a search space A = {a_1, a_2, ..., a_N}, where N is the search space size and a_i is the i-th architecture, the goal of NAS is to find the top-k (k is a hyper-parameter indicating the threshold) architectures with the highest ground-truth scores y_i, where y_i is an estimation of the test accuracy of architecture a_i. (We use the term "accuracy" throughout this paper; it can easily be replaced with other metrics.)

We find that this optimization goal is quite similar to that in Information Retrieval (IR) which retrieves the best matched documents from a large number of documents, where the ranking quality of high relevant documents is more important than that of low relevant documents. Correspondingly, in NAS, model developers care more about identifying the top architecture among relatively good-performing models, while distinguishing which one is worse among bad-performing models is of less importance.

Instead of using rank correlation (e.g., Kendall’s tau) on the whole search space like many NAS solutions do [12, 4, 57, 36], we propose to use Normalized Discounted Cumulative Gain (NDCG) [25], a metric which has been proved effective and widely adopted in IR [33], as the optimization objective of NAS.

Normalized Discounted Cumulative Gain.

NDCG is a measure of ranking quality and is often used to measure the effectiveness of a ranking model. It takes graded relevance values into account and encourages highly relevant items to appear at the top of the recommended list. In IR applications, NDCG has shown its effectiveness in improving top-ranked results and has been well studied in previous works, both theoretically and empirically [27, 53].

Given a query, the ranking model generates a sequence of items in descending order of predicted relevance, while their real relevance scores are r_1, r_2, ..., r_n. NDCG is computed as:

NDCG = \frac{DCG}{IDCG}    (1)

where DCG is defined as:

DCG = \sum_{i=1}^{n} \frac{2^{r_i} - 1}{\log_2(i + 1)}    (2)

Ideal Discounted Cumulative Gain (IDCG) is the DCG in the ideal case, in which the relevance list is sorted in descending order (r_1 >= r_2 >= ... >= r_n). NDCG is thus a normalized DCG between 0 and 1. If an item with a high relevance score is ranked poorly, DCG is penalized. The "2^{r_i} - 1" term emphasizes items with higher relevance, thus encouraging the model to retrieve more of them rather than focusing on the global rank.
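
As a minimal illustration of Equations 1 and 2, the following Python sketch computes NDCG for a list of graded relevance values ordered by a model's predicted scores; the function names are illustrative and not taken from the AceNAS code base.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for items listed in predicted-rank order."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg(relevances):
    """NDCG = DCG of the predicted ordering divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance grades of architectures, ordered by the ranking model's predicted scores.
predicted_order = [3, 2, 3, 0, 1]
print(ndcg(predicted_order))  # equals 1.0 only if the best architectures come first
```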

Adapt NDCG to NAS.

It is non-trivial to apply NDCG to NAS, because accuracy values of architectures span a broader range (e.g., 0–100) than relevance scores in IR, and, more importantly, the distribution of accuracy is highly skewed over this range. Figure 2 shows a typical distribution in which 80% of accuracy values lie between 60% and 70%, while the remaining 20% span from 0 to 60% (we call them outliers). Treating the whole range equally destroys the ability to distinguish the accuracy of the top 80% of architectures. To alleviate the negative effect of outliers, we clip the distribution to a smaller range. Concretely, we compute the 20%-quantile of the distribution of the training data (i.e., architecture-accuracy pairs) as the lower bound and use the maximum accuracy in the training data as the upper bound. Values in this range are then linearly mapped to relevance grades in [0, T] (T is 20 in our experiments) for computing NDCG.
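
A small sketch of this relabeling step, under our reading of the description above: clip accuracies to the range between the 20%-quantile and the maximum of the training data, then linearly map them to integer relevance grades in [0, 20]. The function and parameter names are ours.

```python
import numpy as np

def accuracies_to_relevance(train_acc, max_grade=20, quantile=0.2):
    """Map raw accuracies to graded relevance values for computing NDCG."""
    train_acc = np.asarray(train_acc, dtype=float)
    lower = np.quantile(train_acc, quantile)   # 20%-quantile as lower bound
    upper = train_acc.max()                    # best training accuracy as upper bound
    clipped = np.clip(train_acc, lower, upper)
    # Linear mapping of [lower, upper] onto [0, max_grade], rounded to integer grades.
    grades = (clipped - lower) / (upper - lower + 1e-12) * max_grade
    return np.round(grades).astype(int)

print(accuracies_to_relevance([10.0, 62.0, 65.0, 68.0, 70.0]))
```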

Figure 2: Test accuracy distribution in NAS-Bench-201.
NDCG vs. rank correlation.

We use a simple experiment to compare NDCG and Kendall's tau in Figure 3. In the experiment, we train a vanilla accuracy predictor [55] using two different training configurations. The left figure shows a higher Kendall's tau, but its ability to identify the top architectures is weaker: the true accuracy of architectures whose predicted accuracy is around 95% spans from 85% to 95%, and its NDCG is very low. In contrast, the right figure has a much higher NDCG, and accordingly, the top architectures are identified more accurately. Therefore, NDCG is a better optimization objective and metric for NAS than rank correlation. More quantitative experiments are presented in § 4.2.1.

Figure 3: Prediction vs. ground-truth scatter plots of two predictors (left: better Kendall's tau; right: better NDCG). Although the two predictors have close Kendall's tau values, the top architectures in the left figure are more scattered, resulting in a failure to find the best architectures.

3.2 NAS ranking model

To optimize NDCG, we design a ranking model as the main body of AceNAS. LambdaRank is the key ingredient that lets this ranking model focus on good-performing architectures. Below, we briefly introduce LambdaRank and then elaborate on the design of the ranking model.

LambdaRank.

Different from ranking models that optimize the whole rank (e.g., the pair-wise ranking loss in RankNet [8]), LambdaRank [9] uses NDCG to put higher emphasis on good-performing architectures. It takes the position of an item (e.g., a document in IR or an architecture in NAS) in the ranking into consideration. Model gradients are computed directly by LambdaRank. Specifically, for a pair of items (i, j) with ranking scores s_i and s_j, where i is more relevant than j, the gradient with respect to a model parameter w is computed as follows:

\frac{\partial C}{\partial w} = \sum_{(i,j)} \lambda_{ij} \left( \frac{\partial s_i}{\partial w} - \frac{\partial s_j}{\partial w} \right)    (3)

\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, |\Delta NDCG_{ij}|    (4)

|\Delta NDCG_{ij}| measures the change of NDCG when the positions of i and j are swapped. Swapping higher-ranked items incurs a larger penalty, leading to a larger gradient.
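
Below is a hedged PyTorch sketch of the LambdaRank update in Equations 3 and 4: for every pair where one architecture is more relevant than the other, the RankNet-style pairwise gradient is weighted by the NDCG change from swapping the pair, and the resulting per-item lambdas are backpropagated through the predicted scores. This is a simplified re-implementation for illustration, not the authors' code; an optimizer step after the call applies the parameter update.

```python
import torch

def lambdarank_backward(scores, relevance, sigma=1.0):
    """Accumulate LambdaRank gradients (Equations 3 and 4) into the model parameters.

    scores:    (n,) predicted ranking scores produced by the model (requires grad)
    relevance: (n,) integer relevance grades, higher is better
    """
    n = scores.numel()
    rel = relevance.float().to(scores.device)

    # Ideal DCG used to normalise the gain terms.
    ideal_rel, _ = torch.sort(rel, descending=True)
    discounts = 1.0 / torch.log2(torch.arange(n, device=scores.device).float() + 2.0)
    idcg = ((2.0 ** ideal_rel - 1.0) * discounts).sum().clamp(min=1e-12)

    # Ranks currently induced by the predicted scores.
    order = torch.argsort(scores.detach(), descending=True)
    rank = torch.empty(n, dtype=torch.long, device=scores.device)
    rank[order] = torch.arange(n, device=scores.device)

    gains = (2.0 ** rel - 1.0) / idcg
    disc = 1.0 / torch.log2(rank.float() + 2.0)

    # |delta NDCG| obtained by swapping every pair (i, j).
    delta_ndcg = (gains.unsqueeze(1) - gains.unsqueeze(0)).abs() * \
                 (disc.unsqueeze(1) - disc.unsqueeze(0)).abs()

    s_diff = scores.detach().unsqueeze(1) - scores.detach().unsqueeze(0)
    more_relevant = (rel.unsqueeze(1) - rel.unsqueeze(0) > 0).float()
    lambda_ij = -sigma / (1.0 + torch.exp(sigma * s_diff)) * delta_ndcg * more_relevant

    # Per-item lambdas: contributions as the "better" item minus those as the "worse" item.
    lambdas = lambda_ij.sum(dim=1) - lambda_ij.sum(dim=0)

    # Treat the lambdas as dC/ds and backpropagate them through the scores.
    scores.backward(gradient=lambdas)
```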

Ranking model architecture.

A crucial step in building a ranking model for neural architectures is to obtain an appropriate embedding of architectures, i.e., converting dynamically constructed graphs with different depths and widths into a fixed-length vector. Graph Convolutional Networks (GCN) [28] are a natural fit for generating this embedding due to their advantage in dealing with graph-structured data, and have thus been adopted in recent works [12, 55, 54, 14, 36, 51, 46]. We also use a GCN for the embedding. Specifically, we choose the Deep Graph Convolutional Neural Network (DGCNN) [63], which performs well in our model. It has four directed graph convolution layers followed by sort-pooling and a 1D convolution, as shown in Figure 4.

Figure 4: Architecture of our ranking model.

Another component of the ranking model is the ranking head, a multi-layer perceptron, which in our case consists of two fully-connected layers with ReLU and dropout in between. It predicts a ranking score s_i for an architecture a_i by taking a_i's embedding from DGCNN and the corresponding architecture hyper-parameters as input. With the ranking scores, we use Equations 3 and 4 to optimize the model.
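
The following PyTorch sketch shows the overall shape of such a ranking model: a stack of graph convolutions producing an architecture embedding, concatenated with architecture hyper-parameters and fed into a two-layer MLP head. It is a deliberately simplified stand-in (plain mean pooling instead of DGCNN's sort-pooling and 1D convolution, and illustrative dimensions), not the exact model used in the paper.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph convolution: aggregate neighbour features via the adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim), adj: (num_nodes, num_nodes) with self-loops added.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear(adj @ x / deg))

class RankingModel(nn.Module):
    """GCN architecture encoder + two-layer MLP ranking head."""
    def __init__(self, node_feat_dim, hp_dim, hidden=128):
        super().__init__()
        self.gcn = nn.ModuleList([
            SimpleGCNLayer(node_feat_dim, hidden),
            SimpleGCNLayer(hidden, hidden),
            SimpleGCNLayer(hidden, hidden),
            SimpleGCNLayer(hidden, hidden),
        ])
        self.head = nn.Sequential(          # two FC layers with ReLU and dropout in between
            nn.Linear(hidden + hp_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1),
        )

    def forward(self, node_feats, adj, hparams):
        x = node_feats
        for layer in self.gcn:
            x = layer(x, adj)
        embedding = x.mean(dim=0)                           # graph-level embedding
        return self.head(torch.cat([embedding, hparams]))   # scalar ranking score

# Toy usage: a 5-node cell with one-hot operator features and 2 extra hyper-parameters.
model = RankingModel(node_feat_dim=8, hp_dim=2)
nodes = torch.eye(8)[:5]
adj = torch.eye(5) + torch.diag(torch.ones(4), diagonal=1)  # chain graph + self-loops
score = model(nodes, adj, torch.tensor([0.5, 1.0]))
```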

The ranking model is trained using iterative sampling, which has been widely used in AutoML algorithms (e.g., BOHB [22] and BRP-NAS [12]). We split the training process into multiple rounds. In each round, we train a batch of architectures from scratch to obtain architecture-accuracy pairs, which are used to train the ranking model. The ranking model is then used to sample architectures for the next round. To balance exploration and exploitation, in each round a fraction of the architectures are the best ones predicted by our ranking model, while the rest are sampled randomly.
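
A compact sketch of this loop under the hyper-parameters used later in § 4.1 (5 rounds of 20 architectures, half greedy and half random); `train_from_scratch`, `fit_ranking_model`, and `predict_scores` are placeholders for the expensive training step, LambdaRank fine-tuning, and ranking-model inference. Algorithm 1 in the appendix gives the full procedure.

```python
import random

def iterative_sampling(search_space, train_from_scratch, fit_ranking_model, predict_scores,
                       rounds=5, per_round=20, explore_ratio=0.5):
    """Alternate between training architectures from scratch and refitting the ranking model."""
    labelled = []  # (architecture, ground-truth accuracy) pairs
    for r in range(rounds):
        n_random = per_round if r == 0 else int(explore_ratio * per_round)
        n_best = per_round - n_random
        best = []
        if n_best > 0:
            # Exploitation: architectures ranked highest by the current model.
            candidates = random.sample(search_space, k=min(len(search_space), 10000))
            best = sorted(candidates, key=predict_scores, reverse=True)[:n_best]
        # Exploration: uniformly random architectures.
        rand = random.sample(search_space, k=n_random)
        for arch in best + rand:
            labelled.append((arch, train_from_scratch(arch)))  # the costly step
        fit_ranking_model(labelled)                            # LambdaRank fine-tuning
    return labelled
```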

3.3 Weak supervision of weight sharing

As it is computationally expensive to obtain a sufficient number of ground-truth architecture-accuracy pairs, training the ranking model becomes particularly challenging and unstable. We propose to use weak supervision from a well-trained super-net (i.e., accuracy evaluated using weights inherited from the super-net) to pretrain the ranking model. The super-net is trained using uniform sampling, i.e., each mini-batch trains one sampled architecture within the super-net [23], so the computation cost is similar to training a single architecture. Treating weight-sharing accuracy as weakly supervised labels is inspired by the observation in previous research [65] that a weight-sharing super-net is capable of differentiating good architectures from bad ones with relatively high rank correlation (e.g., Kendall's tau can be higher than 0.6 on many search spaces).
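
For concreteness, here is a toy PyTorch sketch of uniform-sampling (single-path) super-net training as described above: every block holds several candidate operators, and each mini-batch updates only the operators on one randomly sampled path. The super-net definition, sizes, and hyper-parameters are illustrative, not the ones used in the paper.

```python
import random
import torch
import torch.nn as nn

class ChoiceBlock(nn.Module):
    """One super-net position holding several candidate operators with identical shapes."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, k, padding=k // 2), nn.ReLU())
            for k in (1, 3, 5)
        ])

    def forward(self, x, choice):
        return self.ops[choice](x)   # only the sampled operator is executed and updated

class SuperNet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList([ChoiceBlock(channels) for _ in range(depth)])
        self.head = nn.Linear(channels, 10)

    def forward(self, x, arch):
        x = self.stem(x)
        for block, choice in zip(self.blocks, arch):
            x = block(x, choice)
        return self.head(x.mean(dim=(2, 3)))   # global average pooling + classifier

supernet = SuperNet()
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.05, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Uniform sampling: every mini-batch trains one randomly sampled sub-net.
for step in range(10):  # stand-in for iterating over the real training loader
    images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    arch = [random.randrange(len(block.ops)) for block in supernet.blocks]
    optimizer.zero_grad()
    loss = criterion(supernet(images, arch), labels)
    loss.backward()
    optimizer.step()
```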

To empower our ranking model with knowledge from the super-net, we replace the ranking head in the model (Figure 4) with a WS-accuracy head, another two-layer MLP that predicts weight-sharing accuracy. This model is trained with a mean-squared-error (MSE) loss instead of LambdaRank, because weight-sharing labels are not qualified for identifying the best architectures among the good-performing ones.

To further boost the effectiveness of pretraining, we incorporate multi-task training into the ranking model. Apart from the WS-accuracy head, two additional heads are introduced to predict FLOPs and the number of parameters, respectively. The ranking model is trained to minimize the following multi-task mean-squared-error (MSE) loss:

L = MSE(y^*_{ws}, y_{ws}) + \lambda_1 \cdot MSE(y^*_{flops}, y_{flops}) + \lambda_2 \cdot MSE(y^*_{params}, y_{params})    (5)

where variables marked with stars (*) are predictions. Empirically, we find that training is not sensitive to \lambda_1 and \lambda_2 after we normalize the ground-truth labels by subtracting the mean and dividing by the standard deviation; therefore we simply fix both coefficients to the same constant.
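
A short sketch of Equation 5, assuming (as stated above) that each target is z-score normalized before computing the MSE; the coefficient names lambda_flops and lambda_params are ours. In practice the normalization statistics would be computed once over the whole training set rather than per batch.

```python
import torch
import torch.nn.functional as F

def multitask_pretrain_loss(pred_ws, pred_flops, pred_params,
                            ws_acc, flops, params,
                            lambda_flops=1.0, lambda_params=1.0):
    """Multi-task MSE loss of Equation 5 on z-score normalised targets."""
    def z(t):  # subtract the mean, divide by the standard deviation
        return (t - t.mean()) / (t.std() + 1e-8)
    return (F.mse_loss(pred_ws, z(ws_acc))
            + lambda_flops * F.mse_loss(pred_flops, z(flops))
            + lambda_params * F.mse_loss(pred_params, z(params)))
```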

4 Experiments

4.1 Experiment setup

Search spaces in NAS benchmarks. As shown in Table 1, we evaluate AceNAS on 10 different benchmarked search spaces and three datasets: CIFAR-10, CIFAR-100, and ImageNet16-120. Apart from NAS-Bench-101 [60] and NAS-Bench-201 [21], which have been evaluated by many prior works, we leverage 8 more benchmarks from NDS [42]. These search spaces are more practical than NAS-Bench-101 and NAS-Bench-201, as they originate from SOTA NAS works (e.g., NASNet [68]), search over more dimensions (e.g., up to 13 op types, width, and depth), and hence contain even more architectures than commonly used spaces in the NAS literature.

ProxylessNAS search space. ProxylessNAS [11] is essentially different from the search spaces provided in benchmarks. It is a popular chain-wise search space that consists of 21 sequential MB-Conv choice blocks, each with multiple candidate operations. We use the latency lookup table provided by [5] to constrain the search space so that models have latency comparable to ProxylessNAS-mobile (83 ms – 85 ms).

Weight-sharing super-net. To obtain the shared-weights models, we build the full search space into a super-net and adopt the widely used uniform random sampling approach [23] to train it. We follow [39, 23, 49, 10] for handling dynamic channels and depths during super-net training on NAS-Bench-101 and NDS. At evaluation time, we recalculate the batch-normalization statistics on the fly with a batch size of 512.
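
A sketch of this on-the-fly BN recalibration, as we understand it: reset the running statistics of every BatchNorm layer and accumulate fresh statistics over a few evaluation batches in training mode (momentum=None makes PyTorch use a cumulative average). Shown for a plain module; for a super-net, the forward call would additionally take the sampled architecture.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def recalibrate_bn(model, loader, num_batches=4):
    """Recompute batch-norm running statistics before evaluating a sampled sub-net."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
            m.momentum = None        # cumulative moving average over the batches below
    model.train()                    # running stats are only updated in train mode
    for i, (images, _) in enumerate(loader):
        if i >= num_batches:
            break
        model(images)
    model.eval()
```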

GCN model and weakly supervised pretraining. Our GCN predictor has four graph convolution layers with 128 hidden units each, followed by a sort-pooling layer that aggregates node-level information and a fully-connected head with 128 hidden units. We use PyTorch for the GCN implementation (we refer to [24] when implementing the LTR algorithms). For the pretraining stage, we obtain 4k weak labels, i.e., validation accuracies evaluated with the weight-sharing super-net; FLOPs and parameter sizes are also collected for the same 4k architectures. More hyper-parameter settings can be found in the supplementary materials.

LTR training. After pretraining, the ground truths (i.e., architecture-accuracy pairs) are used to train the LambdaRank ranking model. The training budget is split into 5 rounds; for example, if 100 architectures are sampled in total, 20 architectures are sampled in each round. We set the exploration-exploitation factor to 0.5, which means half are sampled with the ranking model and half are sampled randomly. After training, the architectures top-ranked by the ranking model are retrained from scratch. We select the model with the highest validation accuracy and report its test accuracy as the final result (i.e., top-k accuracy).

| Search space   | # Cells | # SD | # Benchmarked |
|----------------|---------|------|---------------|
| NAS-Bench-101  | 423,624 | 1    | 423,624       |
| NAS-Bench-201  | 15,625  | 1    | 15,625        |
| DARTS          | –       | 3    | 5,000         |
| DARTS-fixwd    | –       | 3    | 5,000         |
| ENAS           | –       | 3    | 4,999         |
| ENAS-fixwd     | –       | 3    | 5,000         |
| PNAS           | –       | 3    | 4,999         |
| PNAS-fixwd     | –       | 3    | 4,599         |
| Amoeba         | –       | 3    | 4,983         |
| NASNet         | –       | 3    | 4,846         |

Table 1: Characteristics of the search spaces in benchmarks: number of different cells in total (# Cells), number of search dimensions (# SD), and number of architectures available in the benchmark (# Benchmarked). All search spaces have CIFAR-10 data; NAS-Bench-201 additionally has CIFAR-100 and ImageNet16-120 data.

| Method                 | NAS-Bench-101 Budget | NAS-Bench-101 Test Acc. | NAS-Bench-201 Budget | NAS-Bench-201 Test Acc. |
|------------------------|----------------------|-------------------------|----------------------|-------------------------|
| Oracle                 | 423,624              | 94.34                   | 15,625               | 73.48                   |
| Random                 | 1,000                | 93.42                   | 100                  | 69.94                   |
| NAS-GBDT [34]          | 2,000                | 94.14                   | -                    | -                       |
| RE [43]                | 1,000                | 93.72                   | 100                  | 70.69                   |
| RL [67]                | 1,000                | 93.58                   | 100                  | 70.68                   |
| BOHB [22]              | 1,000                | 93.72                   | 100                  | 69.71                   |
| SemiNAS [51]           | 1,000                | 94.01                   | -                    | -                       |
| BONAS [45]             | 1,000                | 94.24                   | -                    | -                       |
| Unsup. encoding [58]   | 400                  | 94.10                   | -                    | 73.37                   |
| Neural Predictor [55]  | 219                  | 94.04                   | -                    | -                       |
| BRP-NAS [12]           | 110                  | 94.05                   | 110                  | 72.79                   |
| AceNAS                 | 110                  | 94.10                   | 110                  | 73.38                   |
| AceNAS (large)         | 1,000                | 94.32                   | 500                  | 73.47                   |

Table 2: Comparison against SOTA NAS methods. Budget refers to the number of architectures sampled. We reproduced BRP-NAS using FLOPs for pretraining.
Figure 5: AceNAS consistently surpasses Vanilla and BRP-NAS on all search spaces.

4.2 AceNAS on NAS benchmarks

Comparison with state-of-the-art results. We first evaluate AceNAS by comparing it with prior works on NAS-Bench-101 and NAS-Bench-201; the results are listed in Table 2. Compared to prior performance predictors, we achieve comparable accuracy with only 110 ground-truth architectures on NAS-Bench-101. This cost is 18× smaller than NAS-GBDT and 8× smaller than RE, RL, BOHB, SemiNAS and BONAS. On NAS-Bench-201, we achieve 0.59% – 3.67% higher accuracy when the same number of ground truths is sampled. When we increase the number of samples to 1,000 on NAS-Bench-101 and 500 on NAS-Bench-201, AceNAS significantly outperforms other predictors and achieves accuracy close to the best architecture (oracle), with only a 0.02% and 0.01% gap on NAS-Bench-101 and NAS-Bench-201, respectively.

Results on different search spaces. We now further demonstrate the effectiveness of AceNAS on the other search spaces. We implement the two most related GCN-based approaches: (i) Vanilla, the basic GCN predictor proposed in [55], and (ii) BRP-NAS, a binary-relation predictor that also leverages FLOPs and latency in pretraining. To show effectiveness with fewer samples, we train all predictors with 20, 40, 60, 80, and 100 architectures. We report the highest accuracy among the top-10 models returned by each predictor and repeat each experiment 50 times.

The results are shown in Figure 5. On all 10 search spaces, AceNAS consistently finds higher-accuracy models than Vanilla and BRP-NAS under the same number of samples. Remarkably, AceNAS also outperforms both baselines with only 20 samples, achieving up to 0.62% higher accuracy than Vanilla and up to 0.11% higher than BRP-NAS.

4.2.1 Improvement in NDCG

Correlation between NDCG and end-to-end accuracy. To answer whether NDCG is a good indicator, we collect the Kendall's tau and NDCG of all experiments and runs covered in § 4.2 and calculate their Pearson correlation coefficients with top-10 accuracy (Figure 6). The correlation coefficients of NDCG are significantly higher than those of Kendall's tau (1.3 – 1.8 times), showing that NDCG is much better correlated with the end-to-end performance of NAS.

Figure 6: Pearson correlation coefficients of different metrics with respect to top-10 accuracy.

NDCG improved by AceNAS. To further demonstrate the effectiveness of NDCG and of our method, the top-10 accuracy vs. NDCG for each method on different search spaces is illustrated in Figure 7. Each point corresponds to one experiment (i.e., a ranking model trained with a specific seed and budget). The upward trends show the correlation between NDCG and accuracy, and the distribution of the different methods illustrates how AceNAS improves both. Compared to the points from Vanilla and BRP-NAS, most points from AceNAS lie in the top-right corner, meaning that they enjoy both a better NDCG and a better accuracy.

Figure 7: AceNAS achieves better NDCG and accuracy on different search spaces. The upward trends show the correlation between NDCG and accuracy.

4.2.2 Ablation study

Effectiveness of the ranking model. We first evaluate the effectiveness of training a ranking model. As a baseline, we implement a weight-sharing guided search [40], which selects the top-100 architectures with the highest accuracy on the super-net. For AceNAS, we also sample 100 architectures, but the first 80 are selected by iterative sampling and the remaining 20 by the final ranking model. As shown in Figure 8, AceNAS outperforms the baseline at the same search cost (100 samples). Remarkably, we reduce the test regret (the gap between the test accuracy and the best test accuracy in the search space) by 3% on NAS-Bench-201.

Figure 8: Comparison of test regret between weight-sharing-guided greedy search and AceNAS on 100 samples.

Effectiveness of LambdaRank. We replace LambdaRank in AceNAS with either an MSE loss or the RankNet loss. The RankNet [8] loss is essentially the same as LambdaRank but does not take the change in NDCG into account and treats all pairs equally. We report the test regret of the top-10 architectures. As shown in Figure 9 (top), on all search spaces AceNAS with LambdaRank achieves much smaller test regrets than MSE and RankNet. This echoes the analysis in § 3 and demonstrates the effectiveness of LambdaRank.

Effectiveness of weak supervision from weight sharing. Finally, we investigate the effectiveness of using weight-sharing labels to pretrain the GCN. Our baseline, Parameters & FLOPs, removes weight sharing from AceNAS (which pretrains with Params. & FLOPs & Weight-sharing). Figure 9 (bottom) shows test regrets on 12 search spaces and datasets. With the weak supervision of weight-sharing labels, we significantly reduce the test regret, by up to 0.39%.

Figure 9: Comparison among top-10 test regrets when using different LTR loss (top) and when pretrained with different labels (bottom). Number of sampled architectures is fixed to 100. Each bar is an average of 50 runs.

4.3 AceNAS on ProxylessNAS

To run AceNAS on ProxylessNAS, we first train a super-net and pretrain our ranking model on weight-sharing accuracy. After this, we randomly sample 80 architectures and fine-tune the model with LambdaRank. In the following stages, we run iterative sampling, selecting 10 random and 20 best architectures per iteration. The 20 best are taken from a random set of 100,000 architectures, following [55]. In the final stage, we follow [54] and run an evolutionary algorithm with the ranking model serving as a surrogate function. We use exactly the same training setup as [5]. During the full search process (including super-net training), architectures are evaluated on a 50,046-image validation set held out from the original training set; the original validation set is used only in the final test.
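
A hedged sketch of what such a surrogate-guided evolutionary step can look like: the ranking model scores candidates, so each generation only needs mutation, a latency lookup, and surrogate inference rather than any training. The population sizes, mutation function, and helper names are placeholders; the paper follows [54] for the actual algorithm.

```python
import random

def surrogate_evolution(init_pop, score_fn, mutate_fn, latency_fn,
                        latency_range=(83.0, 85.0), generations=20,
                        population=64, parents=16):
    """Evolutionary search in which the ranking model (score_fn) replaces real training."""
    pop = list(init_pop)[:population]
    for _ in range(generations):
        pop.sort(key=score_fn, reverse=True)      # rank by surrogate scores, no training
        survivors = pop[:parents]
        children = []
        while len(children) < population - parents:
            child = mutate_fn(random.choice(survivors))
            if latency_range[0] <= latency_fn(child) <= latency_range[1]:
                children.append(child)            # keep only latency-feasible offspring
        pop = survivors + children
    return sorted(pop, key=score_fn, reverse=True)
```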

| Method                | Test Acc. (%) | Latency (ms) |
|-----------------------|---------------|--------------|
| Neural Predictor [55] | 74.75         | 84.95        |
| TuNAS [5]             | 75.0          | 84.0         |
| ProxylessNAS [11]     | 74.6          | 84.4         |
| MnasNet-B1 [50]       | 74.5          | 84.5         |
| AceNAS (Stage 1)      | 74.73         | 83.30        |
| AceNAS (Final)        | 75.13         | 84.59        |

Table 3: Experiment results on ProxylessNAS. [5] reports 74.8% for TuNAS with an improved training setup.

Results are shown in Table 3. With 234 architectures trained and evaluated (in the final stage), AceNAS reaches 75.93% on the validation set and 75.13% on the test set (an average over 3 runs reporting 75.29%, 75.09%, and 75.00%, respectively), outperforming ProxylessNAS by as much as 0.53%. Remarkably, at the first stage, with only 100 architectures trained, AceNAS already reaches 74.73%, which is comparable to Neural Predictor but at least twice as fast.

To further understand why AceNAS works, we run a qualitative ablation study on its two key components, i.e., NDCG and the weak supervision of weight sharing. We train and validate on 250 random architecture-accuracy pairs provided by [5] and test whether the model can distinguish the top-performing architectures reported in MnasNet, TuNAS, and ProxylessNAS. The test architectures have higher accuracy than the training and validation architectures, so they are isolated along the y-axis; a good model should be able to isolate them by prediction score along the x-axis as well. As shown in the middle of Figure 10, without NDCG the predictor performs well over a wide range on the training and validation data (scores are roughly linear in true accuracy), but fails to isolate the test data along the x-axis (the test data overlaps with the rest). When empowered by NDCG (left), AceNAS is encouraged to emphasize the top-ranked architectures rather than the whole range, and the test architectures become more clearly isolated instead of mixed up with the random samples. Pretraining also plays an important role: without pretraining on a large number of weak labels, learning from hundreds of samples is harder, and the points in Figure 10 (right) look more scattered than in the other two plots.

Figure 10: Qualitative ablation study of AceNAS on ProxylessNAS. The x-axis is the prediction score of an architecture (greater is better) and the y-axis is the validation accuracy (ground truth). In the middle figure, we exclude NDCG and replace it with an MSE loss. In the right figure, we exclude the whole pretraining process.

5 Conclusion

We observe and empirically show that NDCG is a better optimization goal for NAS. Based on this, we introduce AceNAS, a GCN-based ranking model that directly optimizes NDCG via LambdaRank and incorporates weak supervision from weight sharing. AceNAS demonstrates consistent improvements across various settings. We hope our work will encourage future NAS research to view NAS from a Learning to Rank perspective and to connect with research in other directions.

References

  • [1] G. Adam and J. Lorraine (2019) Understanding neural architecture search techniques. External Links: 1904.00438 Cited by: §2.
  • [2] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.
  • [3] B. Baker, O. Gupta, R. Raskar, and N. Naik (2017) Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823. Cited by: §1, §2.
  • [4] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018-10–15 Jul) Understanding and simplifying one-shot architecture search. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 550–559. External Links: Link Cited by: §1, §1, §2, §3.1.
  • [5] G. Bender, H. Liu, B. Chen, G. Chu, S. Cheng, P. Kindermans, and Q. V. Le (2020) Can weight sharing outperform random architecture search? an investigation with TuNAS. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14323–14332. Cited by: §B.2, §B.2, §2, §4.1, §4.3, §4.3, Table 3.
  • [6] Y. Benyahia, K. Yu, K. B. Smires, M. Jaggi, A. C. Davison, M. Salzmann, and C. Musat (2019-09–15 Jun) Overcoming multi-model forgetting. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 594–603. External Links: Link Cited by: §1.
  • [7] J. Bergstra, D. Yamins, and D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115–123. Cited by: §2.
  • [8] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005) Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pp. 89–96. Cited by: §2, §3.2, §4.2.2.
  • [9] C. J. Burges, R. Ragno, and Q. V. Le (2007) Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems, pp. 193–200. Cited by: §1, §2, §3.2.
  • [10] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han (2019) Once-for-all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791. Cited by: §1, §1, §1, §2, §4.1.
  • [11] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. External Links: 1812.00332 Cited by: §1, §1, §1, §2, §4.1, Table 3.
  • [12] T. Chau, Ł. Dudziak, M. S. Abdelfattah, R. Lee, H. Kim, and N. D. Lane (2020) Brp-nas: prediction-based nas using gcns. arXiv preprint arXiv:2007.08668. Cited by: §B.1.1, §1, §2, §2, §3.1, §3.2, §3.2, Table 2.
  • [13] J. Chen, K. Chen, X. Chen, X. Qiu, and X. Huang (2018) Exploring shared structures and hierarchies for multiple nlp tasks. arXiv preprint arXiv:1808.07658. Cited by: §1.
  • [14] H. Cheng, T. Zhang, S. Li, F. Yan, M. Li, V. Chandra, H. Li, and Y. Chen (2020) Nasgem: neural architecture search via graph embedding method. arXiv preprint arXiv:2007.04452. Cited by: §2, §3.2.
  • [15] W. S. Cooper, F. C. Gey, and D. P. Dabney (1992) Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 198–210. Cited by: §2.
  • [16] K. Crammer and Y. Singer (2001) Pranking with ranking. Advances in neural information processing systems 14, pp. 641–647. Cited by: §2.
  • [17] X. Dai, A. Wan, P. Zhang, B. Wu, Z. He, Z. Wei, K. Chen, Y. Tian, M. Yu, P. Vajda, and J. E. Gonzalez (2020) FBNetV3: joint architecture-recipe search using neural acquisition function. External Links: 2006.02049 Cited by: §2.
  • [18] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia, P. Vajda, M. Uyttendaele, and N. K. Jha (2018) ChamNet: towards efficient network design through platform-aware model adaptation. External Links: 1812.08934 Cited by: §1, §2.
  • [19] B. Deng, J. Yan, and D. Lin (2017) Peephole: predicting network performance before training. arXiv preprint arXiv:1712.03351. Cited by: §1, §2.
  • [20] T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: §2.
  • [21] X. Dong and Y. Yang (2020) NAS-bench-201: extending the scope of reproducible neural architecture search. External Links: 2001.00326 Cited by: §1, §4.1.
  • [22] S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: §3.2, Table 2.
  • [23] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2020) Single path one-shot neural architecture search with uniform sampling. External Links: 1904.00420 Cited by: §B.2, §3.3, §4.1.
  • [24] haowei01 (2020) Pytorch-examples. GitHub. Note: https://github.com/haowei01/pytorch-examples Cited by: footnote 2.
  • [25] K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446. Cited by: §1, §2, §3.1.
  • [26] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in neural information processing systems, pp. 2016–2025. Cited by: §2.
  • [27] E. Kanoulas and J. A. Aslam (2009) Empirical justification of the gain and discount function for ndcg. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, New York, NY, USA, pp. 611–620. External Links: ISBN 9781605585123, Link, Document Cited by: §3.1.
  • [28] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2, §3.2.
  • [29] N. Klyuchnikov, I. Trofimov, E. Artemova, M. Salnikov, M. Fedorov, and E. Burnaev (2020) Nas-bench-nlp: neural architecture search benchmark for natural language processing. arXiv preprint arXiv:2006.07116. Cited by: §1.
  • [30] L. Li and A. Talwalkar (2019) Random search and reproducibility for neural architecture search. External Links: 1902.07638 Cited by: §2.
  • [31] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §2.
  • [32] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §1, §2.
  • [33] T. Liu (2011) Learning to rank for information retrieval. Springer Science & Business Media. Cited by: §1, §2, §3.1.
  • [34] R. Luo, X. Tan, R. Wang, T. Qin, E. Chen, and T. Liu (2020) Neural architecture search with gbdt. arXiv preprint arXiv:2007.04785. Cited by: Table 2.
  • [35] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in neural information processing systems, pp. 7816–7827. Cited by: §1, §2.
  • [36] X. Ning, Y. Zheng, T. Zhao, Y. Wang, and H. Yang (2020) A generic graph-based neural architecture encoding scheme for predictor-based nas. arXiv preprint arXiv:2004.01899. Cited by: §1, §2, §3.1, §3.2.
  • [37] S. Niu, J. Wu, Y. Zhang, Y. Guo, P. Zhao, J. Huang, and M. Tan (2020) Disturbance-immune weight sharing for neural architecture search. External Links: 2003.13089 Cited by: §1.
  • [38] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun (2018-06) MegDet: a large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §B.2.
  • [39] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1, §1, §1, §2, §4.1.
  • [40] A. Pourchot, A. Ducarouge, and O. Sigaud (2020) To share or not to share: a comprehensive appraisal of weight-sharing. External Links: 2002.04289 Cited by: §4.2.2.
  • [41] D. R. Radev, H. Qi, H. Wu, and W. Fan (2002) Evaluating web-based question answering systems.. In LREC, Cited by: §2.
  • [42] I. Radosavovic, J. Johnson, S. Xie, W. Lo, and P. Dollár (2019) On network design spaces for visual recognition. External Links: 1905.13214 Cited by: Table 4, §4.1.
  • [43] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. External Links: 1802.01548 Cited by: §1, Table 2.
  • [44] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041. Cited by: §2.
  • [45] H. Shi, R. Pi, H. Xu, Z. Li, J. T. Kwok, and T. Zhang (2020) Bridging the gap between sample-based and one-shot neural architecture search with bonas. External Links: 1911.09336 Cited by: Table 2.
  • [46] J. Siems, L. Zimmer, A. Zela, J. Lukasik, M. Keuper, and F. Hutter (2020) NAS-bench-301 and the case for surrogate benchmarks for neural architecture search. External Links: 2008.09777 Cited by: §3.2.
  • [47] P. Singh, T. Jacobs, S. Nicolas, and M. Schmidt (2019) A study of the learning progress in neural architecture search techniques. External Links: 1906.07590 Cited by: §2.
  • [48] D. R. So, C. Liang, and Q. V. Le (2019) The evolved transformer. arXiv preprint arXiv:1901.11117. Cited by: §1.
  • [49] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. External Links: 1904.02877 Cited by: §4.1.
  • [50] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §1, Table 3.
  • [51] Y. Tang, Y. Wang, Y. Xu, H. Chen, B. Shi, C. Xu, C. Xu, Q. Tian, and C. Xu (2020) A semi-supervised assessor of neural architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1810–1819. Cited by: §2, §3.2, Table 2.
  • [52] M. Tsai, T. Liu, T. Qin, H. Chen, and W. Ma (2007) FRank: a ranking method with fidelity loss. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 383–390. Cited by: §2.
  • [53] Y. Wang, L. Wang, Y. Li, D. He, T. Liu, and W. Chen (2013) A theoretical analysis of ndcg type ranking measures. External Links: 1304.6480 Cited by: §3.1.
  • [54] C. Wei, C. Niu, Y. Tang, and J. Liang (2020) NPENAS: neural predictor guided evolution for neural architecture search. arXiv preprint arXiv:2003.12857. Cited by: §1, §2, §3.2, §4.3.
  • [55] W. Wen, H. Liu, Y. Chen, H. Li, G. Bender, and P. Kindermans (2020) Neural predictor for neural architecture search. In European Conference on Computer Vision, pp. 660–676. Cited by: §B.1.1, §B.1.2, §1, §2, §3.1, §3.2, §4.2, §4.3, Table 2, Table 3.
  • [56] J. Wu, X. Dai, D. Chen, Y. Chen, M. Liu, Y. Yu, Z. Wang, Z. Liu, M. Chen, and L. Yuan (2021) Weak nas predictors are all you need. External Links: 2102.10490 Cited by: §1.
  • [57] Y. Xu, Y. Wang, K. Han, Y. Tang, S. Jui, C. Xu, and C. Xu (2021) ReNAS:relativistic evaluation of neural architecture search. External Links: 1910.01523 Cited by: §1, §3.1.
  • [58] S. Yan, Y. Zheng, W. Ao, X. Zeng, and M. Zhang (2020) Does unsupervised architecture representation learning help neural architecture search?. External Links: 2006.06936 Cited by: Table 2.
  • [59] A. Yang, P. M. Esperança, and F. M. Carlucci (2019) NAS evaluation is frustratingly hard. External Links: 1912.12522 Cited by: §2.
  • [60] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter (2019) Nas-bench-101: towards reproducible neural architecture search. In International Conference on Machine Learning, pp. 7105–7114. Cited by: §1, §4.1.
  • [61] K. Yu, C. Sciuto, M. Jaggi, C. Musat, and M. Salzmann (2019) Evaluating the search phase of neural architecture search. External Links: 1902.08142 Cited by: §2.
  • [62] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. External Links: 1803.08904 Cited by: §B.2.
  • [63] M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.2.
  • [64] Y. Zhang, Z. Lin, J. Jiang, Q. Zhang, Y. Wang, H. Xue, C. Zhang, and Y. Yang (2020) Deeper insights into weight sharing in neural architecture search. External Links: 2001.01431 Cited by: §2.
  • [65] Y. Zhang, Q. Zhang, and Y. Yang (2020) How does supernet help in neural architecture search?. External Links: 2010.08219 Cited by: §1, §2, §3.3.
  • [66] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2423–2432. Cited by: §2.
  • [67] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2, Table 2.
  • [68] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1, §1, §4.1.

Appendix A Pseudo-code of AceNAS

In Algorithm 1, we present the pseudo-code for our algorithm.

Require: Search space A, budget per round m, budget after training k, number of rounds R, exploration-exploitation factor ε.
Ensure: The ranking model f, the best architecture a*, its test accuracy acc*.
   // Pre-training
  Build a weight-sharing super-net based on A.
  while not converged do
     Randomly sample one sub-net a from the super-net.
     Optimize the weights corresponding to a and update them in the super-net.
  end while
  Sample sufficient architectures from A and evaluate their accuracy, FLOPs and number of parameters on the super-net.
  Optimize f to minimize the multi-task MSE loss (Equation 5).

   // Fine-tuning
  Initialize the set of sampled architectures D ← ∅.
  for r = 1, ..., R do
     if r = 1 then
        S ← m randomly sampled architectures.
     else
        S ← (1 − ε) · m best architectures predicted by f and ε · m random architectures.
     end if
     D ← D ∪ {(a, accuracy of a trained from scratch) : a ∈ S}.   // This is the most costly step.
     Fine-tune the ranking model f on D with LambdaRank.
  end for
  T ← top-k architectures predicted by f; train them from scratch to obtain their validation accuracy.
  D ← D ∪ T.
  a* ← the architecture with the best validation accuracy in D.
  acc* ← accuracy of a* on the test dataset.
Algorithm 1 AceNAS

Appendix B Implementation details

B.1 Experiments on NAS benchmarks

B.1.1 Hyper-parameter settings

We list important hyper-parameters used in super-net training in Table 4. We run our training on a single Nvidia Tesla V100 with 16GB memory.

Batch size 192
Number of epochs 600
Optimizer SGD
Initial learning rate 0.05
Ending learning rate 0
Learning rate schedule Cosine decay
Weight decay 0.0001
Gradient clip 5
Evaluate batch size 512
Table 4: Important hyper-parameters used in super-net training. Note: in search spaces provided by NDS [42], we set the batch size to 128 due to limited GPU memory.

We list important hyper-parameters used to train the ranking model in AceNAS in Table 5. For BRP-NAS [12] and Neural predictor [55], we follow the hyper-parameters used in their paper.

Batch size 20
Number of epochs 300
Optimizer Adam
Initial learning rate 0.005
Ending learning rate 0
Learning rate schedule Cosine decay
Weight decay 0.0005
Early stop patience 50
Table 5: Important hyper-parameters used in ranking model training. In the pre-training stage, we use the same hyper-parameters, except that the initial learning rate is 0.001, a different weight decay is used, and early stopping is disabled.

B.1.2 Handling neural networks as graphs

Following [55], we encode each type of cell into a directed graph. The operator type is encoded as a one-hot vector used as the node attribute, and the connections between operators are encoded as edges. Some pseudo-nodes are necessary to make the graph connected, for example nodes labeled add/input/output/concatenate. In the search spaces provided by NDS, the neural networks search over multiple different cell types and a series of architectural hyper-parameters (e.g., number of stacked cells, channel-size multiplier). The graphs are then fed into the GCN, and the resulting embeddings are concatenated with the hyper-parameter features.
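
A toy example of this encoding: each node carries a one-hot vector over an operator vocabulary (including the pseudo-nodes mentioned above), and directed edges form the adjacency matrix. The operator vocabulary and the example cell are made up for illustration.

```python
import numpy as np

# Illustrative operator vocabulary, including the pseudo-nodes mentioned above.
OPS = ["input", "output", "add", "concat", "conv3x3", "conv1x1", "maxpool3x3", "skip"]

def encode_cell(node_ops, edges):
    """node_ops: list of operator names; edges: list of (src, dst) index pairs."""
    features = np.zeros((len(node_ops), len(OPS)), dtype=np.float32)
    for i, op in enumerate(node_ops):
        features[i, OPS.index(op)] = 1.0          # one-hot operator type per node
    adjacency = np.zeros((len(node_ops), len(node_ops)), dtype=np.float32)
    for src, dst in edges:
        adjacency[src, dst] = 1.0                 # directed edge src -> dst
    return features, adjacency

# A tiny cell: input feeds a conv3x3 branch and a skip branch, merged by add, then output.
feats, adj = encode_cell(
    ["input", "conv3x3", "skip", "add", "output"],
    [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)],
)
```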

B.2 Experiments on ProxylessNAS

To run the experiment on ProxylessNAS, we first train a super-net with Single-Path One-Shot [23]. The hyper-parameters are slightly different from those listed in Table 5; we list them in Table 6. We follow the implementation in [5] and sample skip connections with probability 0.5, although we do not apply the other tricks, e.g., merging convolution kernels. After super-net training, we sample 10,000 architectures, half of which satisfy the latency constraint (83 – 85 ms) while the other half are randomly sampled from the distribution used in the super-net training phase.

Batch size 2048
Number of epochs 360
Warm-up epochs 5
Optimizer SGD
Initial learning rate 0.48
Ending learning rate 0
Learning rate schedule Cosine decay
Weight decay 0.00005
Accelerator 16 GPUs
Table 6: Important hyper-parameters used in super-net training on ProxylessNAS. Note: we split the 2048 batch size across 16 GPUs, and in each mini-batch every GPU samples architectures independently.

To train the GCN, we use hyper-parameters identical to those in Table 5, except that the initial learning rate is decreased to 0.001.

To train the searched architectures (both for validation and for the final test), we follow the settings proposed by [5] but re-implement them in PyTorch, as the original implementation supports TPU only. To align the batch size (4096) with the original setting, we use 16 V100 GPUs so that each GPU takes a mini-batch of 256 samples. Ideally, Sync-BN [38, 62] should be applied to synchronize batch normalization across all GPUs; however, we find that it slows training by about 50%. To balance training speed and performance, we use Distribute-BN, which synchronizes the running statistics of batch normalization at the end of each epoch.
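
A minimal sketch of such a Distribute-BN step with torch.distributed, assuming the process group has already been initialized: at the end of each epoch, the BN running statistics are summed across workers and averaged. This is our reading of the description, not the authors' implementation.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

@torch.no_grad()
def distribute_bn(model):
    """Average BN running statistics across all workers at the end of an epoch."""
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            for buf in (m.running_mean, m.running_var):
                dist.all_reduce(buf, op=dist.ReduceOp.SUM)
                buf.div_(world_size)
```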

Appendix C Iterative sampling visualization

In Figure 11, we visualize the whole process of AceNAS on the NAS benchmarks. As the 100 architectures are sampled iteratively over 5 rounds, there are jumps at 20, 40, 60, 80, and 100 architectures. Compared with the baselines, the results show consistent improvements.

Figure 11: The test accuracy of the architecture with the best validation accuracy, with respect to the number of architectures trained. The vertical black line (100 architectures) indicates the end of ranking model training. The budget of 100 architectures is split into 5 rounds. Each line is an average of 50 runs.

Similarly, we show the sampling process on the ProxylessNAS search space. Figure 12 shows an example where we sample 80 architectures in the initialization phase, 20 greedy plus 10 random architectures in each of the following 3 stages, and 30 greedy architectures in stage 4. The 20 "greedy" architectures are selected by the ranking model from 100,000 random architectures satisfying the latency constraint. In the final stage, we exploit the ranking model by using evolution to find the architectures with the best predicted scores.

Figure 12: AceNAS optimization process on ProxylessNAS.

Appendix D Quality of searched architectures

For the NAS benchmarks, Table 7 shows the test accuracy, test regret, and rank of our searched architectures. AceNAS is able to find approximately the best architecture out of every one thousand architectures.

Remarkably, on NAS-Bench-201-ImageNet, it almost always finds the architecture with the best validation accuracy in the search space (the negative test-regret estimate is caused by variance). Even though the best validation architecture has been located, it still ranks 2.14‰ in test accuracy, meaning that some architectures have an even better test accuracy. This is due to the gap between the validation and test datasets: the best architecture on validation is not necessarily the best on test.

| Search space    | Test acc.   | Test regret | Rank (‰) |
|-----------------|-------------|-------------|----------|
| NAS-Bench-101   | 94.10±0.20  | 0.24        | 0.33     |
| NB201-CIFAR100  | 73.38±0.51  | 0.10        | 0.16     |
| NB201-CIFAR10   | 94.52±0.15  | 0.04        | 0.05     |
| NB201-ImageNet  | 46.34±0.56  | -0.08       | 2.14     |
| Amoeba          | 94.84±0.07  | 0.09        | 0.80     |
| DARTS           | 94.92±0.07  | 0.14        | 1.20     |
| DARTS-fix-w-d   | 94.16±0.12  | 0.16        | 1.00     |
| ENAS            | 94.97±0.11  | 0.20        | 0.60     |
| ENAS-fix-w-d    | 94.13±0.04  | 0.06        | 0.40     |
| NASNet          | 95.02±0.15  | 0.25        | 0.41     |
| PNAS            | 95.06±0.06  | 0.13        | 0.60     |
| PNAS-fix-w-d    | 94.34±0.09  | 0.16        | 1.54     |

Table 7: Test accuracy and rank of the best architecture, averaged over 50 runs. For test accuracy, we report the mean and standard deviation; for test regret and rank, we report only the mean. Test regret: the gap between the test accuracy of the searched architecture and that of the best validation architecture. Rank: the average rank of the test accuracy among all test accuracies, in per mille.

For the ProxylessNAS search space, we show the architectures found in 3 different runs and name them AceNAS-M1, AceNAS-M2, and AceNAS-M3 (M for Mobile). The network structures are shown in Figure 13, and the accuracy and latency on the ImageNet test set (commonly called the validation set for historical reasons) are shown in Table 8. Pretrained checkpoints of these models will be released.

Architecture Test acc. (%) Latency (ms)
AceNAS-M1 75.25 84.60
AceNAS-M2 75.07 84.59
AceNAS-M3 75.11 84.92
Table 8: Test accuracy and latency of architectures searched on ProxylessNAS (shown in Figure 13).
Figure 13: Searched architectures on the ProxylessNAS search space: (a) AceNAS-M1, (b) AceNAS-M2, (c) AceNAS-M3.

Appendix E Pretraining quality

In Figure 14, we show how well our ranking model captures the pretraining information (i.e., WS-accuracy, FLOPs, and number of parameters). Apart from the 4k architectures used in training, we sample an extra 1k architectures per search space to evaluate the model. We compute R^2 scores (coefficient of determination), whose best possible value is 1. For parameters and FLOPs, our model achieves very high scores on all search spaces, which means it is very good at predicting model size. For WS-accuracy, the scores vary between 0.4 and 1, implying that the model always learns information from the super-net, at least to some extent. Clearly, on some benchmarks (e.g., NAS-Bench-101 and NAS-Bench-201) the scores are better than on others. We conjecture that search spaces with lower diversity are easier to learn.

Figure 14: The validation R^2 score in the pretraining stage, indicating how good our ranking model is at predicting weight-sharing accuracy, number of parameters, and FLOPs.