Neural architecture search (NAS) has received growing attention in the past few years, yielding state-of-the-art performance on several machine learning tasks [1, 2, 3, 4]. One of the milestones that led to the popularity of NAS is weight sharing [5, 6], which, by allowing all possible network architectures to share the same parameters, has reduced the computational requirements from thousands of GPU hours to just a few. Figure 1 shows the two phases that are common to weight-sharing NAS (WS-NAS) algorithms: the search phase, including the design of the search space and the search algorithm; and the evaluation phase, which encompasses the final training protocol on the target task, i.e., the task that neural architecture search aims to optimize for.
However, existing works tend to overlook or gloss over important factors related to the design and training of the shared-weight backbone network, i.e., the super-net. For example, the literature encompasses significant variations in learning hyper-parameter settings, batch normalization and dropout usage, capacities of the initial layers of the network, and depth of the super-net. Furthermore, some of these heuristics are directly transferred from standalone network training to super-net training without a careful study of their impact in this drastically different scenario. For example, the fundamental assumption of batch normalization, that the input data follows a slowly changing distribution whose statistics can be tracked during training, is violated in WS-NAS, but is nonetheless typically assumed to hold.
In this paper, we revisit and systematically evaluate commonly-used super-net design and training heuristics, and uncover the strong influence of certain factors on the success of super-net training. To this end, we leverage three benchmark search spaces, NASBench-101, NASBench-201, and DARTS-NDS, for which the ground-truth stand-alone performance of a large number of architectures is available. We report the results of our experiments according to two sets of metrics: i) metrics that directly measure the quality of the super-net, such as the widely-adopted super-net accuracy (the mean accuracy over a small set of randomly sampled architectures during super-net training) and a modified Kendall-Tau correlation between the searched architectures and their ground-truth performance, which we refer to as sparse Kendall-Tau; ii) proxy metrics, such as the ability to surpass random search and the stand-alone accuracy of the model found by the WS-NAS algorithm.
Via our extensive experiments (over 700 GPU days), we uncover that (i) the training behavior of a super-net drastically differs from that of a standalone network, e.g., in terms of feature statistics and loss landscape, thus allowing us to define training settings, e.g., for batch normalization (BN) and learning rate, that are better suited for super-nets; (ii) while some neglected factors, such as the number of training epochs, have a strong impact on the final performance, others believed to be important, such as path sampling, only have a marginal effect, and some commonly-used heuristics, such as the use of low-fidelity estimates, negatively impact it; (iii) the commonly-adopted super-net accuracy is an unreliable indicator of super-net quality.
Altogether, our work is the first to systematically analyze the impact of the diverse factors of super-net design and training, uncovering the factors that are crucial to designing a super-net, as well as the unimportant ones. Aggregating these findings allows us to boost the performance of simple weight-sharing random search to the point where it matches that of complex state-of-the-art NAS algorithms across all tested search spaces. Our code is available at https://github.com/kcyu2014/nas-supernet, and we will release our trained models so as to establish a solid baseline and facilitate further research.
2 Preliminaries and Related Work
We first introduce the necessary concepts that will be used throughout the paper. As shown in Figure 1(a), weight-sharing NAS algorithms consist of three key components: a search algorithm that samples an architecture from the search space in the form of an encoding, a mapping function that maps the encoding to its corresponding neural network, and a training protocol for a proxy task on which the network is optimized.
To train the search algorithm, one needs to additionally define the mapping function that generates the shared-weight network. Note that this super-net mapping frequently differs from the stand-alone mapping, since in practice the final model contains many more layers and parameters so as to yield competitive results on the proxy task. After fixing the mapping, a training protocol is required to learn the super-net. In practice, this protocol often hides factors that are critical for the final performance of an approach, such as hyper-parameter settings or the use of data augmentation strategies [6, 15, 9]. Again, the super-net training protocol may differ from the one used to train the architecture found by the search. For example, our experiments reveal that the learning rate and the total number of epochs frequently differ due to the different training behavior of the super-net and stand-alone architectures.
Many strategies have been proposed to implement the search algorithm, such as reinforcement learning [16, 17, 18, 19, 20, 21, 22], gradient-based optimization [6, 23, 11], Bayesian optimization [24, 25, 26, 27], and separate performance predictors [21, 28]. Until very recently, the common trend to evaluate NAS consisted of reporting the searched architecture’s performance on the proxy task [8, 29, 4]. This, however, hardly provides real insights about the NAS algorithms themselves, because of the many components involved in them. Many factors that differ from one algorithm to another can influence the performance. In practice, the literature even commonly compares NAS methods that employ different protocols to train the final model.
Recent studies were the first to systematically compare different algorithms under the same settings for the proxy task and with several random initializations. Their surprising results revealed that many NAS algorithms produce architectures that do not significantly outperform a randomly-sampled one. Subsequent work highlighted the importance of the training protocol, showing that optimizing it can improve the final architecture performance on the proxy task by three percent on CIFAR-10. This non-trivial improvement can be achieved regardless of the chosen sampler, which provides clear evidence for the importance of unifying the protocol to build a solid foundation for comparing NAS algorithms.
In parallel to this line of research, the recent series of “NASBench” works [12, 33, 13] proposed to benchmark NAS approaches by providing a complete, tabular characterization of a search space. This was achieved by training every realizable stand-alone architecture using a fixed protocol. Similarly, other works proposed to provide a partial characterization by sampling and training a sufficient number of architectures in a given search space using a fixed protocol [14, 9, 27].
While these recent advances in systematic evaluation are promising, no work has yet thoroughly studied the influence of the super-net training protocol and mapping function. Previous works [9, 30] performed hyper-parameter tuning to evaluate their own algorithms and focused only on a few parameters. We fill this gap by benchmarking different choices of the mapping and the training protocol, and by proposing novel variations to improve the super-net quality.
Recent works have shown that sub-networks of a super-net can surpass some human-designed models without retraining [34, 35], and that reinforcement learning can surpass the performance of random search. However, these findings have only been shown on MobileNet-like search spaces, where one searches only for the size of the convolution kernels and the channel ratio of each layer. This is an effective approach to discover a compact network, but it does not change the fact that, on more complex, cell-based search spaces, the super-net quality remains low.
3 Evaluation Methodology
We first isolate 14 factors that need to be considered during the design and training of a super-net, and then introduce the metrics to evaluate the quality of the trained super-net. Note that these factors are agnostic to the search policy that is used after training the super-net.
3.1 Disentangling the super-net design heuristics from the search algorithm
Our goal is to evaluate the influence of the super-net mapping and of the weight-sharing training protocol. As shown in Figure 3.1, the mapping translates an architecture encoding, which typically consists of a discrete number of choices or parameters, into a neural network. Based on a well-defined mapping, the super-net is a network in which every sub-path has a one-to-one mapping with an architecture encoding. Recent works [23, 11, 12] separate the encoding into cell parameters, which define the basic building blocks of a network, and macro parameters, which define how cells are assembled into a complete architecture.
Weight-sharing mapping. To make the search space manageable, all cell and macro parameters are fixed during the search, except for the topology of the cell and its possible operations. However, the exact choices for each of these fixed factors differ between algorithms and search spaces. We report the common factors in the left part of Table I. They include various implementation choices, e.g., the use of convolutions with a dynamic number of channels (Dynamic Channeling), super-convolutional layers that support dynamic kernel sizes (OFA Kernel), weight-sharing batch normalization (WSBN), which tracks independent running statistics and affine parameters for different incoming edges, and path and global dropout [5, 28, 6]. They also include the use of low-fidelity estimates to reduce the complexity of super-net training, e.g., by reducing the number of layers and channels [32, 38], the portion of the training set used for super-net training, or the batch size [6, 5, 32].
[Table I: factors studied in this work, grouped into the weight-sharing mapping (e.g., OFA Conv, Op on Node/Edge) and the weight-sharing protocol (e.g., train portion, learning rate, Random-NAS).]
Weight-sharing protocol. Given a mapping, different training protocols can be employed to train the super-net. Protocols can differ in the training hyper-parameters and the sampling strategies they rely on. We will evaluate the different hyper-parameter choices listed in the right part of Table I. This includes the initial learning rate, the hyper-parameters of batch normalization, the total number of training epochs, and the amount of weight decay.
We randomly sample one path to train the super-net, an approach also known as single-path one-shot (SPOS) or Random-NAS. The reason for this choice is that Random-NAS is equivalent to the initial state of many search algorithms [6, 5, 28], some of which even freeze the sampler training so as to use random sampling to warm up the super-net [23, 40]. Note that we also evaluated two variants of Random-NAS, but found their improvement to be only marginal.
In our experiments, for the sake of reproducibility, we ensure that the super-net mapping and training protocol remain as close as possible to their stand-alone counterparts. For the hyper-parameters of the training protocol, we cross-validate each factor following the order in Table I and, after each validation, keep the value that yields the best performance. For all other factors, we change one factor at a time.
3.1.1 Search spaces
We employ three commonly-used cell-based search spaces, for which a large number of stand-alone architectures have been trained and evaluated on CIFAR-10 to obtain their ground-truth performance. In particular, we use NASBench-101, which consists of 423k architectures and is compatible with weight-sharing NAS [31, 33]; NASBench-201, which contains more operations than NASBench-101 but fewer nodes; and DARTS-NDS, which contains over 243 billion architectures, of which a subset of 5000 models was sampled and trained in a stand-alone fashion. A summary of these search spaces and their properties is given in Table II. The search spaces differ in the number of architectures with known stand-alone accuracy (# Arch.), the number of possible operations (# Op.), how the channels are handled in the convolution operations (Channel), where dynamic means that the number of super-net channels may change based on the sampled architecture, and the type of optimum that is known for the search space (Optimal). We further provide the maximum number of nodes in each cell, excluding the input and output nodes, as well as bounds on the number of shared weights (Param.) and edge connections (Edges). Finally, the search spaces differ in how the nodes aggregate their inputs when they have multiple incoming edges (Merge).
Chain-like search spaces have also been shown to be effective for computer vision tasks. However, to the best of our knowledge, no benchmark currently exists for such search spaces. We nonetheless construct a simplified chain-like space based on NASBench-101. See the Appendix for more details.
3.2 Sparse Kendall-Tau - A novel super-net evaluation metric
We define a novel super-net metric, which we name sparse Kendall-Tau. It is inspired by the Kendall-Tau metric used in prior work to measure the discrepancy between the ordering of stand-alone architectures and the ordering implied by the trained super-net. An ideal super-net should yield the same ordering of architectures as stand-alone training and would thus lead to a high Kendall-Tau. However, Kendall-Tau is not robust to negligible performance differences between architectures (c.f. Figure 2). To robustify this metric, we share the rank between two architectures if their stand-alone accuracies differ by less than a threshold (0.1% here). Since the resulting ranks are sparse, we call this metric sparse Kendall-Tau (s-KdT). See the Appendix for implementation details.
Sparse Kendall-Tau threshold. This value should be chosen according to what is considered a significant improvement for a given task. For CIFAR-10, where accuracy is larger than 90%, we consider a 0.1% performance gap to be sufficient. For tasks with smaller state-of-the-art performance, larger values might be better suited.
Number of architectures. In practice, we observed that the sparse Kendall-Tau metric becomes stable and reliable given a sufficient number of sampled architectures. We used 200 architectures in our experiments to guarantee the stability and fairness of the comparison between the different factors.
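To make the construction concrete, the rank-sharing step can be sketched in a few lines of Python. The function names, the toy accuracies, and the pure-Python Kendall-Tau below are our own illustration, not the released implementation.

```python
# Sketch of sparse Kendall-Tau: ground-truth accuracies within `threshold`
# of each other share a rank, so negligible accuracy gaps no longer count
# as ranking disorder.

def sparse_ranks(accuracies, threshold=0.1):
    """Assign shared (sparse) ranks; a new rank starts only when the gap to
    the previous accuracy reaches `threshold`."""
    order = sorted(range(len(accuracies)), key=lambda i: accuracies[i])
    ranks, rank, prev = [0] * len(accuracies), 0, None
    for idx in order:
        if prev is not None and accuracies[idx] - prev >= threshold:
            rank += 1
        ranks[idx] = rank
        prev = accuracies[idx]
    return ranks

def kendall_tau(x, y):
    """Kendall-Tau over all pairs; tied pairs count as neither concordant
    nor discordant."""
    n, c, d = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            c += s > 0
            d += s < 0
    return (c - d) / (n * (n - 1) / 2)

# Ground-truth accuracies (%) and super-net proxy accuracies: the super-net
# swaps the two best and the two worst models, but the corresponding
# ground-truth gaps (0.02% and 0.05%) are negligible.
gt = [93.10, 93.12, 92.50, 91.00, 90.95]
sn = [61.0, 60.0, 55.0, 50.0, 52.0]

plain = kendall_tau(gt, sn)                              # penalizes both swaps
skdt = kendall_tau(sparse_ranks(gt, threshold=0.1), sn)  # treats them as ties
```

On this toy example, the plain Kendall-Tau is 0.6 because the two negligible swaps are penalized, whereas the sparse variant treats them as ties and yields 0.8.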
Limitations of sparse Kendall-Tau. We nonetheless acknowledge that our sparse Kendall-Tau has some limitations. For example, a failure case may occur when the top 10% of architectures are perfectly ordered while the bottom 90% are ordered purely at random. In this case, the Kendall-Tau will be close to 0, yet the search algorithm will always return the best model, as desired.
Nevertheless, while this corner case would indeed be problematic for the standard Kendall-Tau, it can be circumvented by tuning the threshold of our sparse Kendall-Tau. A large threshold value leads to a small number of groups, whose ranking might be more meaningful. For instance, in some randomly-picked NASBench-101 search processes, setting the threshold to 0.1% merges the top 3000 models into 9 ranks, yet yields an s-KdT of only 0.2. Increasing the threshold to 10% clusters the 423K models into 3 ranks, yet still yields an s-KdT of only 0.3. This indicates the stability of our metric.
In Figure 3, we randomly picked 12 settings and show the corresponding bipartite graphs relating the super-net and ground-truth rankings to investigate where disorder occurs. In practice, the corner case discussed above virtually never occurs; the ranking disorder is typically spread uniformly across the architectures.
3.2.1 Other metrics
Although sparse Kendall-Tau captures the super-net quality well, it may fail in extreme cases, such as when the top-performing architectures are ranked perfectly while poor ones are ordered randomly. To account for such rare situations and ensure the soundness of our analysis, we also report additional metrics. We define two groups of metrics to holistically evaluate different aspects of a trained super-net.
The first group of metrics directly evaluates the quality of the super-net, including sparse Kendall-Tau and the widely-adopted super-net accuracy. For the super-net accuracy, we report the average accuracy of 200 architectures on the validation set of the dataset of interest. We will refer to this metric simply as accuracy. It is frequently used [39, 15] to assess the quality of the trained super-net, but we will show later that it is in fact a poor predictor of the final stand-alone performance. The metrics in the second group evaluate the search performance of a trained super-net. The first such metric is the probability to surpass random search: given the ground-truth rank $r$ of the best architecture found after $n$ runs and the maximum rank $N$, equal to the total number of architectures, the probability that the best architecture found is better than one obtained with random search under the same budget is given by $p = \left(1 - r/N\right)^{n}$.
Finally, where appropriate, we report the stand-alone accuracy (a.k.a. final performance) of the model found by the complete WS-NAS algorithm. Concretely, we randomly sample 200 architectures, select the 3 best models based on the super-net accuracy, and query their ground-truth performance. We then report the mean accuracy of these architectures as the stand-alone accuracy. Note that the same architectures are used to compute the sparse Kendall-Tau.
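The selection procedure above can be sketched as follows; the function and variable names are hypothetical, and a perfectly-ranked toy super-net stands in for real measurements.

```python
import random

def final_performance(supernet_acc, ground_truth, n_samples=200, top_k=3):
    """Sample architectures, keep the top-k by super-net accuracy, and
    report the mean of their ground-truth accuracies."""
    sample = random.sample(range(len(ground_truth)), n_samples)
    best = sorted(sample, key=lambda a: supernet_acc[a], reverse=True)[:top_k]
    return sum(ground_truth[a] for a in best) / top_k

# Toy check: if the super-net ranks architectures exactly like the ground
# truth, the metric simply recovers the best models in the sample.
random.seed(0)
gt = [90.0 + 0.01 * i for i in range(1000)]   # hypothetical accuracies (%)
perf = final_performance(gt, gt)
```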
4 Experiments

We provide an analysis of the impact of the factors shown in Table I across three different search spaces. In addition, we report the complete numerical results of all metrics in Section 4.6.
We use PyTorch for our experiments. Since NASBench-101 was constructed in TensorFlow, we implement a mapper that translates TensorFlow parameters into our PyTorch models. We use two large-scale experiment management tools, SLURM and Kubernetes, to deploy our experiments. We use various GPUs throughout our project, including NVIDIA Tesla V100, RTX 2080 Ti, GTX 1080 Ti, and Quadro 6000, with CUDA 10.1. Depending on the number of training epochs, the parameter sizes, and the batch size, most super-net trainings finish within 12 to 24 hours, with the exception of FairNAS, whose training time is longer, as discussed below. We split the data into training/validation sets using a 90/10 ratio for all experiments, except those involving validation on the training portion. Please consult our submitted code for more details.
Reproducing the ground truth from TensorFlow. Since the ground-truth performance used by NASBench-101 was obtained with TensorFlow on TPUs, we first reproduce these results with our PyTorch implementation to verify that the re-implementation is trustworthy. We uniformly sampled 10 architectures at random, repeating this 3 times, which results in 30 architectures covering the performance spectrum from 82% to 93%. We adopted the optimizer and hyper-parameter settings of the NASBench-101 code release, repeated training with 3 random initializations, and report the mean performance. Note that we copied the initialization from the released TensorFlow models into PyTorch format to minimize discrepancies between frameworks. We plot the performance comparison in Figure 4. The Kendall-Tau between the two implementations is 0.81, which should be considered an upper bound for super-net training. This clearly indicates that, even though our reproduced results are not perfectly aligned with the TensorFlow originals, this misalignment cannot explain the significant drop to 0.2 observed when using weight sharing.
[Table: Spearman correlation between super-net accuracy (Acc.), sparse Kendall-Tau (S-KdT), and final performance (Perf.), reported pairwise as Acc.–Perf., S-KdT–Perf., and Acc.–S-KdT.]
4.1 Evaluation of a super-net
The stand-alone performance of the architecture found by a NAS algorithm is clearly the most important metric to judge its merits. However, in practice, one cannot access this metric directly: we would not need NAS if stand-alone performance were easy to query (the cost of computing stand-alone performance is discussed in Section 4.1.2).
Furthermore, stand-alone performance inevitably depends on the sampling policy and does not directly evaluate the quality of the super-net (see Section 4.1.2). Consequently, it is important to rely on metrics that are well correlated with the final performance but can be queried efficiently. To this end, we collect all our experiments and plot the pairwise correlations between final performance, sparse Kendall-Tau, and super-net accuracy. As shown in Figure 5, the super-net accuracy has a low correlation with the final performance on NASBench-101 and DARTS-NDS. Only on NASBench-201 does it reach a correlation of 0.52. The sparse Kendall-Tau yields a consistently higher correlation with the final performance. This is evidence that one should not focus too strongly on improving the super-net accuracy. While sparse Kendall-Tau remains computationally heavy, it serves as a middle ground that is feasible to evaluate in real-world applications.
In the following experiments, we thus mainly rely on sparse Kendall-Tau, and use final search performance as a reference only.
4.1.1 Kendall-Tau vs. Spearman ranking correlation
Kendall-Tau is not the only metric for evaluating ranking correlation; the Spearman ranking correlation (SpR) is also widely adopted in this field [39, 13]. Note that our idea of sparsity also applies to SpR. In Table 4.1.1, we compare Kendall-Tau (KdT), SpR, and their sparse variants in the same setting as Figure 5. SpR and KdT perform similarly, but their sparse variants effectively improve the correlation on all search spaces.
4.1.2 Stand-alone Accuracy vs. Sparse Kendall-Tau
A common misconception is that the super-net quality is well reflected by the stand-alone accuracy of the final selected architecture. In truth, neither sparse Kendall-Tau (s-KdT) nor stand-alone accuracy is perfect; they are tools that measure different aspects of a super-net.
Let us consider a completely new search space for which we have no prior knowledge about performance. As depicted in Figure 6, if we rely only on the stand-alone accuracy, the following situation might occur: due to the lack of knowledge, the ranking of the super-net is purely random, and the search space accuracy is uniformly distributed. When trying different settings, one configuration will ‘outperform’ the others in terms of stand-alone accuracy. However, this configuration will be selected by pure chance. By only measuring stand-alone accuracy, it is technically impossible to realize that the ranking is random. By contrast, if one measures the s-KdT (which is close to 0 in this example), an ill-conditioned super-net can easily be identified. In other words, relying purely on stand-alone accuracy can lead to pathological outcomes that are avoidable with sparse Kendall-Tau.
Additionally, stand-alone accuracy depends on both the super-net and the search algorithm, whereas sparse Kendall-Tau allows us to judge the super-net quality independently of the search algorithm. As an example, consider using a reinforcement learning algorithm, instead of random sampling, on top of the super-net. When observing a poor stand-alone accuracy, one cannot tell whether the problem stems from a poor super-net or from a poor performance of the RL algorithm. Prior to our work, super-net accuracy was commonly used to analyze super-net quality; as shown in Figure 5, it is not a reliable metric. We believe that sparse Kendall-Tau is a better alternative.
Computational Cost. Computing the final accuracy is more expensive than training the super-net. Even with low-fidelity heuristics reducing the cost of weight sharing, training a stand-alone network to convergence costs more; e.g., DARTS searches for 50 epochs but trains the final model from scratch for 600 epochs. Furthermore, debugging and hyper-parameter tuning typically require training thousands of stand-alone models. Note that, as one typically evaluates a random subset of architectures to understand the design space, sparse Kendall-Tau can be computed without additional cost. In any event, the budget for computing sparse Kendall-Tau is bounded by the number of sampled architectures (200 in our experiments).
4.2 Weight-sharing Protocol – Hyperparameters
4.2.1 Batch normalization in the super-net
Batch normalization (BN) is commonly used in standalone networks to allow for faster and more stable training. It is thus also employed in most CNN search spaces. However, BN behaves differently in the context of WS-NAS, and special care has to be taken when using it. In a standalone network (c.f. Figure 8 (Top)), a BN layer during training computes the batch statistics $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$, normalizes the activations as $\hat{x} = (x - \mu_{\mathcal{B}}) / \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}$, and finally updates the population statistics using a moving average with momentum $m$. For instance, the mean statistic is updated as $\mu_{pop} \leftarrow (1 - m)\,\mu_{pop} + m\,\mu_{\mathcal{B}}$. At test time, the stored population statistics are used to normalize the feature map. In the standalone setting, both batch and population statistics are unbiased estimators of the population distribution.
By contrast, when training a super-net (Figure 8 (Bottom)), the population statistics computed from the running average are not unbiased estimators of the population distribution, because the effective architecture preceding the BN layer varies during training. More formally, let $a_i$ denote the $i$-th architecture. During training, the batch statistics $\mu_{\mathcal{B}}^{(i)}$ and $\sigma_{\mathcal{B}}^{(i)}$ are computed from the current batch, and the output feature map follows a distribution with statistics $\mu^{(i)}$ and $\sigma^{(i)}$, where the superscript indicates that the current batch statistics depend on $a_i$ only. The population mean statistic is then updated as $\mu_{pop} \leftarrow (1 - m)\,\mu_{pop} + m\,\mu_{\mathcal{B}}^{(i)}$. However, since different architectures are sampled from the super-net during training, the population mean essentially becomes a weighted combination of the means of different architectures, i.e., $\mu_{pop} \approx \sum_i p_i\, \mu^{(i)}$, where $p_i$ is the sampling frequency of the $i$-th architecture. When evaluating a specific architecture at test time, the estimated population statistics thus depend on the other architectures in the super-net. This leads to a train-test discrepancy. One solution to mitigate this problem is to re-calibrate the batch statistics by recomputing them on the entire training set before the final evaluation. While the cost of doing so is negligible for a standalone network, NAS algorithms typically sample a large number of architectures for evaluation, which makes this approach intractable.
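The bias described above can be reproduced with a few lines of Python. The two-architecture setup and all numbers below are toy values of our own choosing, meant only to illustrate the drift of the tracked mean.

```python
import random

random.seed(0)
# Two sub-architectures feed the same BN layer with different feature means.
arch_means = {0: 0.0, 1: 5.0}
momentum, running_mean = 0.01, 0.0

for step in range(2000):
    arch = random.choice([0, 1])                      # single-path sampling
    batch = [random.gauss(arch_means[arch], 1.0) for _ in range(64)]
    batch_mean = sum(batch) / len(batch)
    # standard BN moving-average update of the population mean
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean

# The tracked statistic settles near the mixture mean (about 2.5), matching
# neither architecture: exactly the train-test discrepancy described above.
```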
In contrast to prior works that also use the training mode at test time, we formalize a simple, yet effective, approach to tackle the train-test discrepancy of BN in super-net training: we leave the normalization based on batch statistics during training unchanged, but use batch statistics during testing as well. Since super-net evaluation is always conducted over a complete dataset, we are free to perform inference in mini-batches of the same size as the ones used during training. This allows us to compute the batch statistics on the fly in the exact same way as during training.
Figure 8 compares standard BN to our proposed modification. Using the tracked population statistics leads to many architectures with an accuracy around 10%, i.e., performing no better than random guessing. Our proposed modification significantly increases the fraction of high-performing architectures. Our results also show that the choice of fixing vs. learning an affine transformation in batch normalization should match the standalone protocol.
4.2.2 Learning rate
The training loss of the super-net encompasses the task losses of all possible architectures. We suspect that the training difficulty increases with the number of architectures represented by the super-net. To better study this, we visualize the loss landscape of a standalone network and of the super-net. Concretely, the landscape is computed over the super-net training loss under the single-path one-shot sampling method, i.e., the expected task loss $\mathbb{E}_{a \sim \mathcal{U}(\mathcal{A})}\left[\ell(a, w)\right]$ over architectures $a$ sampled uniformly from the search space $\mathcal{A}$, with shared weights $w$.
Figure 9 shows that the loss landscape of the super-net is less smooth than that of a standalone architecture, which confirms our intuition. A smoother landscape indicates that optimization will converge more easily to a good local optimum. With a smooth landscape, one can thus use a relatively large learning rate. By contrast, a less smooth landscape requires using a smaller one.
Our experiments further confirm this observation. The stand-alone protocols use larger initial learning rates, with different values for NASBench-101, NASBench-201, and DARTS-NDS; all protocols use a cosine learning rate decay. Figure 10 shows that super-net training requires lower learning rates than stand-alone training. The same trend holds for the other search spaces (Section 4.6, Table VIII). We set the learning rate to 0.025 to be consistent across the three search spaces.
4.2.3 Number of epochs
Since the cosine learning rate schedule decays the learning rate to zero towards the end of training, we evaluate the impact of the number of training epochs. In stand-alone training, the number of epochs was set to 108 for NASBench-101, 200 for NASBench-201, and 100 for DARTS-NDS. Figure 11 shows that increasing the number of epochs significantly improves the accuracy at first, but eventually decreases it for NASBench-101 and DARTS-NDS. Interestingly, beyond 400 epochs, the number of epochs impacts neither the ranking correlation nor the performance of the final selected model. We thus use 400 epochs for the remaining experiments.
4.2.4 Weight decay
Weight decay is used to reduce overfitting. For WS-NAS, however, overfitting does not occur, because billions of architectures share the same set of parameters, which in fact rather causes underfitting. Based on this observation, it has been proposed to disable weight decay during super-net training. Figure 12, however, shows that the behavior of weight decay varies across search spaces. While on DARTS-NDS weight decay is indeed harmful, it improves the results on NASBench-101 and 201. We conjecture that this is due to the much larger number of architectures in DARTS-NDS (243 billion) than in the NASBench series (fewer than 500,000).
4.3 Weight-sharing Protocol – Sampling
Aside from the Random-NAS described in Section 3.1, we additionally include two variants of Random-NAS: 1) As pointed out in prior work, two super-net architectures might be topologically equivalent in the stand-alone network, obtained by simply swapping operations. We thus include architecture-aware random sampling that ensures equal probability for unique architectures; we name this variant Random-A. 2) We evaluate a variant called FairNAS, which ensures that each operation is selected with equal probability during super-net training. Although FairNAS was designed for a search space where only the operations are searched, not the topology, we adapt it to our setting.
Adaptation of FairNAS. FairNAS was originally proposed for a search space with a fixed sequential topology, as depicted in Figure 13 (a), where every node is sequentially connected to the previous one and only the operations on the edges are subject to change. However, our benchmark search spaces exploit a more complex, dynamic topology, as illustrated in Figure 13 (b), where one node can connect to one or more previous nodes. Before generalizing to a dynamic-topology search space, we simplify the original approach to a 2-node scenario: for each input batch, FairNAS first randomly generates a sequence of all possible operations. It then samples one operation at a time, computes gradients for the fixed input batch, and accumulates the gradients across the operations. Once all operations have been sampled, the super-net parameters are updated with the average gradients. This ensures that all possible paths are equally exploited. With this simplification, FairNAS can be applied regardless of the topology. For a sequential-topology search space, we repeat the 2-node policy for every consecutive node pair. Naturally, for a dynamic topology space, FairNAS can be adopted in a similar manner, i.e., one first samples a topology and then applies the 2-node strategy to all connected node pairs. Note that this adaptation increases the training time by a factor equal to the number of candidate operations.
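A minimal sketch of the adapted 2-node update, with toy scalar "operations" standing in for real shared-weight layers; the names and the placeholder gradient are our own assumptions, not the released code.

```python
import random

OPS = ["conv3x3", "conv1x1", "skip"]          # hypothetical candidate ops
weights = {op: 1.0 for op in OPS}             # one shared weight per op

def grad(op, batch):
    """Placeholder gradient pulling each op's weight toward the batch mean."""
    return weights[op] - sum(batch) / len(batch)

def fairnas_step(batch, lr=0.1):
    """For one batch, visit every op exactly once in random order, accumulate
    the gradients, then apply a single update with the accumulated values."""
    order = OPS[:]
    random.shuffle(order)                           # fair, random visiting order
    grads = {op: grad(op, batch) for op in order}   # same batch for every op
    for op in OPS:                                  # one update at the end
        weights[op] -= lr * grads[op]

fairnas_step([2.0] * 8)                       # every op is updated exactly once
```

Because the gradients are computed before any parameter is touched, the update is independent of the random visiting order, which is what makes the sampling fair.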
Results. With the hyper-parameters fixed, we now compare the three path-sampling techniques. Since DARTS-NDS does not contain enough stand-alone trained samples, we only report results on NASBench-101 and 201. In Figure 14, we show the sampling distributions of the different approaches and their impact on the super-net in terms of sparse Kendall-Tau. These experiments reveal that, on NASBench-101, uniformly sampling one architecture at random, as in [30, 31], is strongly biased in terms of accuracy and ranking. This can be observed from the peaks around ranks 0, 100,000, and 400,000. The reason is that a single architecture can have multiple encodings, and uniform sampling thus oversamples architectures with many equivalent encodings. FairNAS samples architectures more evenly and yields consistently better sparse Kendall-Tau values, albeit by a small margin.
On NASBench-201, the three sampling policies have similar coverage. This is because, in NASBench-201, topologically-equivalent encodings were not pruned. Here, Random-NAS performs better than on NASBench-101, and FairNAS yields good early performance but quickly saturates. In short, sampling strategies beyond plain uniform sampling can be beneficial in general, and we advocate FairNAS when the training budget is limited.
4.4 Weight-sharing Mapping – Lower-fidelity estimates lower the ranking correlation
|S-KdT||0.740±0.07||0.728±0.16||0.703±0.16|
|SAA||92.92±0.48||92.37±0.61||92.35±0.34|
|S-KdT||0.740±0.07||0.677±0.10||0.691±0.15|
|SAA||92.92±0.48||92.32±0.37||92.79±0.85|
|S-KdT||0.751±0.09||0.692±0.18||0.502±0.21|
|SAA||91.91±0.09||91.95±0.10||90.30±0.71|
|S-KdT||0.751±0.11||0.742±0.12||0.693±0.13|
|SAA||92.13±0.51||92.74±0.43||91.47±0.81|
Reducing the memory footprint and training time by training smaller super-nets has been an active research direction; the resulting super-nets are referred to as lower-fidelity estimates. The impact of this approach on super-net quality, however, has never been studied systematically across multiple search spaces. We compare four popular strategies in Table III. We deliberately prolong training in inverse proportion to the computational budget saved by the low-fidelity estimate. For example, if the number of channels is reduced by half, we train the model for twice as many epochs. Note that this provides an upper bound on the performance of low-fidelity estimates.
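The epoch-compensation protocol above amounts to a one-line rule; the helper name is ours.

```python
def compensated_epochs(base_epochs, budget_fraction):
    # Prolong training in inverse proportion to the saved budget:
    # e.g., halving the channels (budget_fraction = 0.5) doubles the
    # number of epochs, giving an upper bound on what the low-fidelity
    # estimate could achieve with the same total compute.
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    return int(round(base_epochs / budget_fraction))
```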
A commonly-used approach to reduce memory requirements is to decrease the batch size. Surprisingly, lowering the batch size from 256 to 64 has limited impact on the super-net accuracy, but decreases both the sparse Kendall-Tau and the performance of the final searched model, the most important metric in practice.
As an alternative, one can decrease the number of repeated cells [5, 15]. This reduces the total number of parameters proportionally, since the number of cells in consecutive layers depends on the first one. Table III shows that this decreases the sparse Kendall-Tau from 0.7 to 0.5. By contrast, reducing the number of channels in the first layer has little impact. Hence, to train a good super-net, one should avoid reducing the number of repeated cells, but one can halve the batch size and reduce the number of initial channels.
The last lower-fidelity factor is the portion of the training data that is used [6, 23]. Surprisingly, reducing the training portion only marginally decreases the sparse Kendall-Tau for all three search spaces. On NASBench-201, keeping only 25% of the CIFAR-10 dataset results in a mere 0.1 drop in sparse Kendall-Tau. This explains why DARTS-based methods typically use only 50% of the data to train the super-net and can still produce reasonable results.
4.5 Weight-sharing Mapping - Implementation
4.5.1 Dynamic channeling hurts super-net quality
|Edges||Accuracy||S-KdT||P R||Stand-alone Acc.|
|Baseline: random sampling of sub-spaces with dynamic channeling.|
|Average||76.905±10.05||0.223||0.865||89.435±4.30|
|Disable dynamic channeling by fixing the edges to the output node.|
|Average||76.95±8.29||0.460||0.949||93.65±0.73|
Dynamic channeling is an implicit factor in many search spaces [12, 7, 39, 40]. It refers to the fact that the number of channels of the intermediate layers depends on the number of incoming edges to the output node. This is depicted by Figure 4.5.1 (a): consider a search cell with n intermediate nodes, where the input node I and the output node O have C_in and C_out channels, respectively. When there are k incoming edges to the output node (c.f. Figure 4.5.1 (b)), the associated channel numbers decrease so that their sum equals C_out. That is, the intermediate nodes have C_out / k channels. In the general case, shown in Figure 4.5.1 (c), the number of channels in the intermediate nodes is thus floor(C_out / k) for k incoming edges. A weight-sharing approach has to cope with this architecture-dependent fluctuation of the number of channels during training.
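The channel arithmetic above can be written down directly; the floor rounding is an assumption of this sketch (the exact rounding rule is search-space specific).

```python
def intermediate_channels(c_out, k):
    # With dynamic channeling, the k intermediate nodes feeding the
    # output node must concatenate to c_out channels, so each one gets
    # floor(c_out / k) channels: fewer incoming edges mean wider nodes,
    # which is exactly the fluctuation the super-net must cope with.
    if k < 1:
        raise ValueError("need at least one incoming edge")
    return c_out // k
```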
Let c denote the number of channels of a given architecture, and C the maximum number of channels for a node across the entire search space. All existing approaches allocate C channels and, during training, extract a subset of c of these channels. The existing methods then differ in how they extract the channels: some use a fixed chunk, e.g., the first c channels; some randomly shuffle the channels before extracting a fixed chunk; and some linearly interpolate the C channels into c channels using a moving average across neighboring channels.
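The three extraction schemes can be sketched on a 1-D list of channel weights; these are pure-Python stand-ins with our own names (real implementations operate on 4-D convolution tensors).

```python
import random

def extract_fixed(weights, c):
    # Fixed chunk: always take the first c channels.
    return weights[:c]

def extract_shuffled(weights, c, rng=random):
    # Shuffle the channels, then take a fixed chunk of c of them.
    w = weights[:]
    rng.shuffle(w)
    return w[:c]

def extract_interpolated(weights, c):
    # Linearly interpolate C channels down to c channels via a weighted
    # average of neighbouring channels (a simplified 1-D stand-in for
    # the channel-interpolation strategy).
    C = len(weights)
    out = []
    for i in range(c):
        pos = i * (C - 1) / (c - 1) if c > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, C - 1)
        frac = pos - lo
        out.append((1 - frac) * weights[lo] + frac * weights[hi])
    return out
```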
Instead of sharing the channels between architectures, we propose to disable dynamic channeling completely. As the channel number only depends on the incoming edges, we separate the search space into a discrete number of sub-spaces, each with a fixed number of incoming edges. As shown in Table 4.5.1, disabling dynamic channeling improves the sparse Kendall-Tau and the final search performance by a large margin and yields a new state of the art on NASBench-101.
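The sub-space construction can be sketched as a simple partition; `edges_to_output` is a user-supplied function (an assumption of this sketch) that counts the incoming edges of an architecture's output node.

```python
from collections import defaultdict

def split_by_output_degree(architectures, edges_to_output):
    # Partition the search space into sub-spaces that each have a fixed
    # number of incoming edges to the output node. Within a sub-space,
    # the channel widths are static, so each sub-space can train its own
    # super-net without any dynamic channeling.
    subspaces = defaultdict(list)
    for arch in architectures:
        subspaces[edges_to_output(arch)].append(arch)
    return dict(subspaces)
```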
Ablation study. Since each sub-space now encompasses fewer architectures, it is not fair to perform a comparison with the full NASBench 101 search space. Therefore, for each sub-space, we construct a baseline space where we drop architectures uniformly at random until the number of remaining architectures matches the size of the sub-space. We repeat this process with 3 different initializations, while keeping all other factors unchanged when training the super-net.
We also provide additional results in Table IV for each individual sub-space and show that the sparse Kendall-Tau remains similar to that of the baseline using the full search space, which clearly evidences the effectiveness of disabling dynamic channeling.
Furthermore, we evaluate the effect of disabling dynamic channels during the super-net training and test phase individually in Table 4.5.1. Disabling dynamic channeling during both phases yields the best results.
4.5.2 WS on Edges or Nodes?
|Baseline||0.236 / 92.32||0.740 / 92.92||0.159 / 93.59|
|Op-Edge||N/A||as Baseline||0.189 / 93.97|
|Op-Node||as Baseline||0.738 / 92.36||as Baseline|
figure (a) Consider a search space with 2 intermediate nodes, 1 and 2, together with one input node (I) and one output node (O). This yields 5 edges. Let us assume that we have 4 possible operations to choose from, indicated by the purple color code. (b) When the operations are on the nodes, there are 2 × 4 operations to share, i.e., edges I→2 and 1→2 share weights on node 2. (c) If the operations are on the edges, then there are 5 × 4 operations to share.
Most existing works choose to define the shared operations on the graph nodes rather than on the edges. This is because, if the operations are mapped to the edges, the number of shared operation sets grows with the number of edges, which is quadratic in the number of intermediate nodes, rather than linear as for nodes. We provide a concrete example in the figure above. However, the high sparse Kendall-Tau on NASBench-201 in the top part of Table V, which is obtained by mapping to the edges, may suggest that sharing on the edges is beneficial. Here we investigate whether this is truly the case.
On NASBench-101, by design, each node merges the outputs of the previous nodes and then applies parametric operations. This makes it impossible to build an equivalent sharing scheme on the edges there. We therefore construct sharing on the edges for DARTS-NDS and sharing on the nodes for NASBench-201. As shown in Table VI, for both spaces, sharing on the edges yields a marginally better super-net than sharing on the nodes. Such small differences might be due to the fact that, in both spaces, the number of nodes is 4 while the number of edges is only 6, so mapping to the edges does not drastically increase the number of parameters. Nevertheless, this indicates that one should consider a larger number of shared weights when resources are not a bottleneck.
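The parameter-count argument can be made concrete with the cell from the figure (2 intermediate nodes, 5 edges, 4 candidate operations); the counting helper is ours.

```python
def shared_op_count(num_nodes, num_edges, num_ops, on_edges):
    # Node mapping keeps one set of candidate operations per node that
    # receives inputs; edge mapping keeps one set per edge, so the count
    # grows with the (potentially quadratic) number of edges instead.
    units = num_edges if on_edges else num_nodes
    return units * num_ops
```

For the example cell this gives 2 × 4 = 8 shared operation sets on the nodes versus 5 × 4 = 20 on the edges.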
4.5.3 Other mapping factors
|Baseline||0.236 / 92.32||0.740 / 92.92||0.159 / 93.59|
|WSBN||0.056 / 91.33||0.675 / 92.04||0.331 / 92.95|
|Global-Dropout||0.179 / 90.95||0.676 / 91.76||0.102 / 92.30|
|Path-Dropout||0.128 / 91.19||0.431 / 91.42||0.090 / 91.90|
|OFA Kernel||0.132 / 92.01||0.574 / 91.83||0.112 / 92.83|
|ENAS ||91.83±0.42||54.30±0.00||94.45±0.09||97.11|
|DARTS-V2 ||92.21±0.61||54.30±0.00||94.79±0.11||97.37|
|NAO ||92.59±0.59||-||-||97.10|
|GDAS ||-||93.51±0.13||-||96.23|
|Random NAS ||89.89±3.89||87.66±1.69||91.33±0.12||96.74|
|Random NAS (Ours)||93.12±0.06||92.71±0.15||94.26±0.05||97.08|
|Results from |
|Trained according to  for 600 epochs.|
|On NASBench-201, both Random NAS and our approach sample 100 final architectures, following .|
We evaluate the weight-sharing batch normalization (WSBN) of , which keeps an independent set of parameters for each incoming edge. Furthermore, we test the two commonly-used dropout strategies: right before global pooling (global dropout) and at all edge connections between the nodes (path dropout). Note that path dropout has been widely used in WS-NAS [28, 6, 5]. For both dropout strategies, we set the dropout rate to 0.2. Finally, we evaluate the super convolution layer of , referred to as the OFA kernel, which accounts for the fact that, in CNN search spaces, convolution operations appear in groups; it thus merges the convolutions within the same group, keeping only the parameters of the largest kernel and obtaining the smaller kernels via a parametric projection. The results in Table VI show that all these factors negatively impact the search performance and the super-net quality.
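For reference, path dropout, unlike global dropout, zeroes entire edge outputs (whole paths) rather than individual units; a minimal inverted-scaling sketch with our own names:

```python
import random

def path_dropout(edge_outputs, rate, rng=random):
    # Zero each edge's entire output with probability `rate`, and scale
    # the survivors by 1 / (1 - rate) so the expected value is unchanged
    # (standard inverted-dropout scaling, applied per path).
    kept = []
    for out in edge_outputs:
        if rng.random() < rate:
            kept.append(0.0)           # drop the whole path
        else:
            kept.append(out / (1 - rate))
    return kept
```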
|settings||Accuracy||S-KdT||P R||Performance||Accuracy||S-KdT||P R||Performance||Accuracy||S-KdT||P R||Performance|
|affine F track F||0.651±0.05||0.161||0.996||0.916±0.13||0.660±0.13||0.783||0.997||92.67±1.21||0.735±0.18||0.056||0.224||93.14±0.28|
|affine T track F||0.710±0.04||0.240||0.996||0.924±0.01||0.713±0.14||0.718||0.707||91.71±1.05||0.265±0.21||-0.071||0.213||91.89±2.01|
|affine F track T||0.144±0.09||0.084||0.112||0.882±0.02||0.182±0.15||-0.171||0.583||86.41±4.84||0.359±0.25||-0.078||0.023||90.33±0.76|
|affine T track T||0.153±0.10||-0.008||0.229||0.905±0.01||0.134±0.09||-0.417||0.274||90.77±0.40||0.216±0.18||-0.050||0.109||90.49±0.32|
|settings||Accuracy||S-KdT||P R||Performance||Accuracy||S-KdT||P R||Performance||Accuracy||S-KdT||P R||Performance|
|Number of layers (-X indicates the baseline minus X)|
|Batch size (/X indicates the baseline divided by X)|
|# channels (/X indicates the baseline divided by X)|
|settings||Accuracy||S-KdT||P R||Performance||Accuracy||S-KdT||P R||Performance||Accuracy||S-KdT||P R||Performance|
|OFA Kernel||0.708±0.08||0.132||0.203||92.01±0.19||0.672±0.18||0.574||0.605||91.83±0.86||0.782±0.05||0.112||0.399||93.22±0.43|
|Path dropout rate|
|Please refer to Section 4.5.2 for mapping on the node or edge and Section 4.5.1 for dynamic channel factor results.|
4.6 Results for All Factors
5 How should you train your super-net?
Figure 15 summarizes the influence of all tested factors on the final performance. It stands out that properly tuned hyper-parameters lead to the biggest improvements by far. Surprisingly, most other factors and techniques either have a hardly measurable effect or, in some cases, even degrade performance. Based on these findings, here is how you should train your super-net:
Do not use super-net accuracy to judge the quality of your super-net. The sparse Kendall-Tau has much higher correlation with the final search performance.
When batch normalization is used, do not use the moving average statistics during evaluation. Instead, compute the statistics on the fly over a batch of the same size as used during training.
The loss landscape of super-nets is less smooth than that of standalone networks. Start from a smaller learning rate than for standalone training.
Do not use other low-fidelity estimates than moderately reducing the training set size to decrease the search time.
Do not use dynamic channeling in search spaces that have a varying number of channels in the intermediate nodes. Break the search space into multiple sub-spaces such that dynamic channeling is not required.
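The sparse Kendall-Tau recommended above can be sketched as follows; the tie-inducing rounding step is one plausible reading of the metric, not the paper's exact definition, and the helper names are ours.

```python
def kendall_tau(xs, ys):
    # Plain Kendall-Tau rank correlation: (concordant - discordant)
    # pairs over all pairs; tied pairs are simply skipped here.
    n = len(xs)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) // 2)

def sparse_kendall_tau(supernet_acc, standalone_acc, precision=1):
    # Round the ground-truth accuracies so that architectures with
    # near-identical stand-alone performance tie, then correlate the
    # resulting coarse ranking with the super-net ranking.
    coarse = [round(a, precision) for a in standalone_acc]
    return kendall_tau(supernet_acc, coarse)
```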
Comparison to the state of the art. Table VII shows that carefully controlling the relevant factors and adopting the techniques proposed in Section 4 allow us to considerably improve the performance of Random-NAS. Thanks to our evaluation, we were able to show that simple Random-NAS, together with an appropriate training protocol and mapping function, yields results that are competitive with, and sometimes even surpass, state-of-the-art algorithms. Our results provide a strong baseline upon which future work can build.
We also report the best settings in Table XI.
|Search Space||implementation||low fidelity||hyperparam.||sampling|
|Dynamic Conv||OFA Conv||WSBN||Dropout||Op map||layer||portion||batch-size||channels||batch-norm||learning rate||epochs||weight decay|
|For batch-norm, we report Track statistics (Tr) and Affine (A) setting with True (T) or False (F).|
|For other notation, Y = Yes, N = No.|
This work was partially done during an internship at Intel and at Abacus.AI, and supported in part by the Swiss National Science Foundation.
-  C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei, “Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation,” CVPR, 2019.
-  B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search,” CVPR, 2019.
-  Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun, “DetNAS: Neural Architecture Search on Object Detection,” NeurIPS, 2019.
-  M. S. Ryoo, A. Piergiovanni, M. Tan, and A. Angelova, “Assemblenet: Searching for multi-stream neural connectivity in video architectures,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=SJgMK64Ywr
-  H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” ICML, 2018.
-  H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” ICLR, 2019.
-  H. Cai, L. Zhu, and S. Han, “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware,” in ICLR, 2019.
-  S. Xie, H. Zheng, C. Liu, and L. Lin, “SNAS: Stochastic neural architecture search,” ICLR, 2019.
-  A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter, “Understanding and robustifying differentiable architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=H1gDNyrKDS
-  N. Nayman, A. Noy, T. Ridnik, I. Friedman, R. Jin, and L. Zelnik, “Xnas: Neural architecture search with expert advice,” in NeurIPS, 2019.
-  X. Li, C. Lin, C. Li, M. Sun, W. Wu, J. Yan, and W. Ouyang, “Improving one-shot NAS by suppressing the posterior fading,” CVPR, 2020.
-  C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter, “NAS-Bench-101: Towards reproducible neural architecture search,” ICML, 2019.
-  X. Dong and Y. Yang, “NAS-Bench-201: Extending the scope of reproducible neural architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=HJxyZkBKDr
-  I. Radosavovic, J. Johnson, S. Xie, W.-Y. Lo, and P. Dollár, “On Network Design Spaces for Visual Recognition,” in ICCV, 2019.
-  X. Chu, B. Zhang, R. Xu, and J. Li, “FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search,” arXiv:, 2019.
-  B. Zoph and Q. V. Le, “Neural Architecture Search with Reinforcement Learning,” ICLR, 2017.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in CVPR, 2018.
-  E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” ICML, 2017.
-  R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing, 2019, pp. 293–312.
-  D. R. So, C. Liang, and Q. V. Le, “The evolved transformer,” in ICML, 2019.
-  C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in ECCV, 2018.
-  Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf, “NSGA-NET: A multi-objective genetic algorithm for neural architecture search,” arXiv:1810.03522, 2018.
-  Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, “PC-DARTS: Partial channel connections for memory-efficient architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=BJlS634tPr
-  K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, “Neural architecture search with bayesian optimisation and optimal transport,” in NeurIPS, 2018.
-  H. Jin, Q. Song, and X. Hu, “Auto-Keras: An efficient neural architecture search system,” in International Conference on Knowledge Discovery & Data Mining, 2019.
-  H. Zhou, M. Yang, J. Wang, and W. Pan, “BayesNAS: A bayesian approach for neural architecture search,” in ICML, 2019.
-  L. Wang, S. Xie, T. Li, R. Fonseca, and Y. Tian, “Neural architecture search by learning action space for monte carlo tree search,” AAAI, 2020. [Online]. Available: https://openreview.net/forum?id=SklR6aEtwH
-  R. Luo, F. Tian, T. Qin, E.-H. Chen, and T.-Y. Liu, “Neural architecture optimization,” in NeurIPS, 2018.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” AAAI, 2019.
-  L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” UAI, 2019.
-  K. Yu, C. Sciuto, M. Jaggi, C. Musat, and M. Salzmann, “Evaluating the search phase of neural architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=H1loF2NFwr
-  A. Yang, P. M. Esperança, and F. M. Carlucci, “NAS evaluation is frustratingly hard,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=HygrdpVKvr
-  A. Zela, J. Siems, and F. Hutter, “NAS-Bench-1Shot1: Benchmarking and dissecting one-shot neural architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=SJx9ngStPH
-  J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, “BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models,” ECCV, 2020.
-  H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once for all: Train one network and specialize it for efficient deployment,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=HylxE1HKwS
-  G. Bender, H. Liu, B. Chen, G. Chu, S. Cheng, P.-J. Kindermans, and Q. V. Le, “Can weight sharing outperform random architecture search? an investigation with tunas,” in CVPR, 2020.
-  T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 55, pp. 1–21, 2019.
-  X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in ICCV, 2019.
-  Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single Path One-Shot Neural Architecture Search with Uniform Sampling,” ECCV, 2019.
-  X. Dong and Y. Yang, “Searching for a robust neural architecture in four gpu hours,” in CVPR, 2019.
-  A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR-10 (canadian institute for advanced research),” 2009.
-  A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in NeurIPS, 2019, pp. 8026–8037.
-  Slurm, “Slurm workload manager,” 2020. [Online]. Available: https://slurm.schedmd.com/documentation.html
-  Kubernetes, “kubernetes.io,” 2020. [Online]. Available: https://kubernetes.io/docs/reference/
-  J. Yu and T. S. Huang, “Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers,” ICLR, 2019.
-  H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in NeurIPS, 2018.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in CVPR, 2018.
-  X. Dong and Y. Yang, “Network pruning via transformable architecture search,” in Advances in Neural Information Processing Systems, 2019, pp. 760–771.
-  R. Luo, F. Tian, T. Qin, E.-H. Chen, and T.-Y. Liu, “Weight sharing batch normalization code,” 2018. [Online]. Available: https://www.github.com/renqianluo/NAO_pytorch/NAO_V2/operations.py##L144