1 Introduction
Neural architecture search (NAS) has received growing attention in the past few years, yielding state-of-the-art performance on several machine learning tasks [1, 2, 3, 4]. One of the milestones that led to the popularity of NAS is weight sharing [5, 6], which, by allowing all possible network architectures to share the same parameters, has reduced the computational requirements from thousands of GPU hours to just a few. Figure 1 shows the two phases that are common to weight-sharing NAS (WSNAS) algorithms: the search phase, including the design of the search space and the search algorithm; and the evaluation phase, which encompasses the final training protocol on the target task (i.e., the task that neural architecture search aims to optimize).
While most works focus on developing a good sampling algorithm [7, 8] or improving existing ones [9, 10, 11], they tend to overlook or gloss over important factors related to the design and training of the shared-weight backbone network, i.e., the supernet. For example, the literature encompasses significant variations of learning hyperparameter settings, batch normalization and dropout usage, capacities for the initial layers of the network, and depth of the supernet. Furthermore, some of these heuristics are directly transferred from standalone network training to supernet training without carefully studying their impact in this drastically different scenario. For example, the fundamental assumption of batch normalization that the input data follows a slowly changing distribution whose statistics can be tracked during training is violated in WSNAS, but nonetheless typically assumed to hold.
In this paper, we revisit and systematically evaluate commonly-used supernet design and training heuristics and uncover the strong influence of certain factors on the success of supernet training. To this end, we leverage three benchmark search spaces, NAS-Bench-101 [12], NAS-Bench-201 [13], and DARTS-NDS [14], for which the ground-truth standalone performance of a large number of architectures is available. We report the results of our experiments according to two sets of metrics: i) metrics that directly measure the quality of the supernet, such as the widely-adopted supernet accuracy (the mean accuracy over a small set of randomly sampled architectures during supernet training) and a modified Kendall-Tau correlation between the searched architectures and their ground-truth performance, which we refer to as sparse Kendall-Tau; ii) proxy metrics such as the ability to surpass random search and the standalone accuracy of the model found by the WSNAS algorithm.
Via our extensive experiments (over 700 GPU days), we uncover that (i) the training behavior of a supernet drastically differs from that of a standalone network, e.g., in terms of feature statistics and loss landscape, thus allowing us to define training settings, e.g., for batch normalization (BN) and learning rate, that are better suited for supernets; (ii) while some neglected factors, such as the number of training epochs, have a strong impact on the final performance, others, believed to be important, such as path sampling, only have a marginal effect, and some commonly-used heuristics, such as the use of low-fidelity estimates, negatively impact it; (iii) the commonly-adopted supernet accuracy is unreliable to evaluate the supernet quality.
Altogether, our work is the first to systematically analyze the impact of the diverse factors of supernet design and training, and we uncover the factors that are crucial to design a supernet, as well as the non-important ones. Aggregating these findings allows us to boost the performance of simple weight-sharing random search to the point where it reaches that of complex state-of-the-art NAS algorithms across all tested search spaces. Our code is available at https://github.com/kcyu2014/nassupernet, and we will release our trained models so as to establish a solid baseline to facilitate further research.
2 Preliminaries and Related Work
We first introduce the necessary concepts that will be used throughout the paper. As shown in Figure 1(a), weight-sharing NAS algorithms consist of three key components: a search algorithm that samples an architecture from the search space in the form of an encoding, a mapping function that maps the encoding into its corresponding neural network, and a training protocol for a proxy task for which the network is optimized.
To train the search algorithm, one needs to additionally define the mapping function that generates the shared-weight network. Note that the mapping used for the final model frequently differs from this weight-sharing mapping, since in practice the final model contains many more layers and parameters so as to yield competitive results on the proxy task. After fixing the weight-sharing mapping, a training protocol is required to learn the supernet. In practice, this protocol often hides factors that are critical for the final performance of an approach, such as hyperparameter settings or the use of data augmentation strategies to achieve state-of-the-art performance [6, 15, 9]. Again, it may differ from the protocol that is used to train the architecture that has been found by the search. For example, our experiments reveal that the learning rate and the total number of epochs frequently differ due to the different training behavior of the supernet and standalone architectures.
Many strategies have been proposed to implement the search algorithm, such as reinforcement learning [16, 17][18, 19, 20, 21, 22], gradient-based optimization [6, 23, 11], Bayesian optimization [24, 25, 26, 27], and separate performance predictors [21, 28]. Until very recently, the common trend to evaluate NAS consisted of reporting the searched architecture’s performance on the proxy task [8, 29, 4]. This, however, hardly provides real insights about the NAS algorithms themselves, because of the many components involved in them. Many factors that differ from one algorithm to another can influence the performance. In practice, the literature even commonly compares NAS methods that employ different protocols to train the final model.
[30] and [31] were the first to systematically compare different algorithms with the same settings for the proxy task and using several random initializations. Their surprising results revealed that many NAS algorithms produce architectures that do not significantly outperform a randomly-sampled architecture. [32] highlighted the importance of the training protocol. They showed that optimizing the training protocol can improve the final architecture performance on the proxy task by three percent on CIFAR-10. This non-trivial improvement can be achieved regardless of the chosen sampler, which provides clear evidence for the importance of unifying the protocol to build a solid foundation for comparing NAS algorithms.
In parallel to this line of research, the recent series of “NAS-Bench” works [12, 33, 13] proposed to benchmark NAS approaches by providing a complete, tabular characterization of a search space. This was achieved by training every realizable standalone architecture using a fixed protocol. Similarly, other works proposed to provide a partial characterization by sampling and training a sufficient number of architectures in a given search space using a fixed protocol [14, 9, 27].
While recent advances for systematic evaluation are promising, no work has yet thoroughly studied the influence of the supernet training protocol and the mapping function. Previous works [9, 30] performed hyperparameter tuning to evaluate their own algorithms, and focused only on a few parameters. We fill this gap by benchmarking different choices of the supernet training protocol and mapping function and by proposing novel variations to improve the supernet quality.
Recent works have shown that sub-networks obtained from supernet training can surpass some human-designed models without retraining [34, 35] and that reinforcement learning can surpass the performance of random search [36]. However, these findings are still only shown on MobileNet-like search spaces, where one only searches for the size of the convolution kernels and the channel ratio for each layer. This is an effective approach to discover a compact network, but it does not change the fact that on more complex, cell-based search spaces the supernet quality remains low.
3 Evaluation Methodology
We first isolate 14 factors that need to be considered during the design and training of a supernet, and then introduce the metrics to evaluate the quality of the trained supernet. Note that these factors are agnostic to the search policy that is used after training the supernet.
3.1 Disentangling the supernet design heuristics from the search algorithm
Our goal is to evaluate the influence of the supernet mapping and the weight-sharing training protocol. As shown in Figure 3.1, the mapping translates an architecture encoding, which typically consists of a discrete number of choices or parameters, into a neural network. Based on a well-defined mapping, the supernet is a network in which every sub-path has a one-to-one mapping with an architecture encoding [5]. Recent works [23, 11, 12] separate the encoding into cell parameters, which define the basic building blocks of a network, and macro parameters, which define how cells are assembled into a complete architecture.
Weight-sharing mapping. To make the search space manageable, all cell and macro parameters are fixed during the search, except for the topology of the cell and its possible operations. However, the exact choices for each of these fixed factors differ between algorithms and search spaces. We report the common factors in the left part of Table I. They include various implementation choices, e.g., the use of convolutions with a dynamic number of channels (Dynamic Channeling), super-convolutional layers that support dynamic kernel sizes (OFA Kernel) [35], weight-sharing batch normalization (WSBN) that tracks independent running statistics and affine parameters for different incoming edges [28], and path and global dropout [5, 28, 6]. They also include the use of low-fidelity estimates [37] to reduce the complexity of supernet training, e.g., by reducing the number of layers [6] and channels [32, 38], the portion of the training set used for supernet training [6], or the batch size [6, 5, 32].
WS Mapping                           | WS Protocol
implementation     | low fidelity    | hyperparam.   | sampling
Dynamic Channeling | layer           | batchnorm     | FairNAS
OFA Conv           | train portion   | learning rate | RandomNAS
WSBN               | batch size      | epochs        | RandomA
Dropout            | channels        | weight decay  |
Op on Node/Edge    |                 |               |
Weight-sharing protocol. Given a mapping, different training protocols can be employed to train the supernet. Protocols can differ in the training hyperparameters and the sampling strategies they rely on. We will evaluate the different hyperparameter choices listed in the right part of Table I. This includes the initial learning rate, the hyperparameters of batch normalization, the total number of training epochs, and the amount of weight decay.
We randomly sample one path to train the supernet [39], which is also known as single-path one-shot (SPOS) or RandomNAS [30]. The reason for this choice is that RandomNAS is equivalent to the initial state of many search algorithms [6, 5, 28], some of which even freeze the sampler training so as to use random sampling to warm up the supernet [23, 40]. Note that we also evaluated two variants of RandomNAS, but found their improvement to be only marginal.
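As a rough illustration of single-path sampling, the sketch below draws one operation per edge uniformly at random; in a real supernet, only the weights on the sampled path would be trained for the current mini-batch. The operation names, the number of edges, and the helper `sample_single_path` are hypothetical placeholders and do not correspond to the search spaces or code used in the paper.

```python
import random

# Hypothetical candidate operations; real search spaces define their own sets.
CANDIDATE_OPS = ["conv3x3", "conv1x1", "maxpool3x3"]


def sample_single_path(num_edges, ops=CANDIDATE_OPS, rng=random):
    """Uniformly pick one operation per edge, i.e. one path through the supernet."""
    return [rng.choice(ops) for _ in range(num_edges)]


# Each training step samples a fresh path; only the weights on that path
# would receive gradients for the current mini-batch.
rng = random.Random(0)
paths = [sample_single_path(num_edges=4, rng=rng) for _ in range(3)]
for step, path in enumerate(paths):
    print(step, path)
```

Passing an explicit `random.Random` instance keeps the sampled paths reproducible across runs.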
In our experiments, for the sake of reproducibility, we ensure that the weight-sharing and standalone mappings, as well as the corresponding training protocols, are as close to each other as possible. For the hyperparameters of the weight-sharing protocol, we cross-validate each factor following the order in Table I, and after each validation, use the value that yields the best performance. For all other factors, we change one factor at a time.
3.1.1 Search spaces
We employ three commonly-used cell-based search spaces, for which a large number of standalone architectures have been trained and evaluated on CIFAR-10 [41] to obtain their ground-truth performance. In particular, we use NAS-Bench-101 [12], which consists of 423,624 architectures and is compatible with weight-sharing NAS [31, 33]; NAS-Bench-201 [13], which contains more operations than NAS-Bench-101 but fewer nodes; and DARTS-NDS [14], which contains a much larger number of architectures, of which a subset of 5000 models was sampled and trained in a standalone fashion. A summary of these search spaces and their properties is shown in Table II. The search spaces differ in the number of architectures that have known standalone accuracy (# Arch.), the number of possible operations (# Op.), how the channels are handled in the convolution operations (Channel), where dynamic means that the number of supernet channels might change based on the sampled architecture, and the type of optimum that is known for the search space (Optimal). We further provide the maximum number of nodes (Nodes), excluding the input and output nodes, in each cell, as well as a bound on the number of shared weights (Param.) and edge connections (Edges). Finally, the search spaces differ in how the nodes aggregate their inputs if they have multiple incoming edges (Merge).
Recently, chain-like search spaces [7, 39, 23, 2] have been shown to be effective for computer vision tasks. However, to the best of our knowledge, no benchmark space currently exists for such search spaces. We nonetheless construct a simplified chain-like space based on NAS-Bench-101. See Appendix for more details.
         | NAS-Bench-101 | NAS-Bench-201 | DARTS-NDS
# Arch.  | 423,624       | 15,625        | >
# Op.    | 3             | 5             | 8
Channel  | Dynamic       | Fix           | Fix
Optimal  | Global        | Global        | Sample
Nodes    | 5             | 4             | 4
Param.   |               |               |
Edges    |               |               |
Merge    | Concat.       | Sum           | Sum
3.2 Sparse Kendall-Tau: A novel supernet evaluation metric
We define a novel supernet metric, which we name sparse Kendall-Tau. It is inspired by the Kendall-Tau metric used by [31] to measure the discrepancy between the ordering of standalone architectures and the ordering that is implied by the trained supernet. An ideal supernet should yield the same ordering of architectures as the standalone one and thus would lead to a high Kendall-Tau. However, Kendall-Tau is not robust to negligible performance differences between architectures (cf. Figure 2). To robustify this metric, we share the rank between two architectures if their standalone accuracies differ by less than a threshold (0.1% here). Since the resulting ranks are sparse, we call this metric sparse Kendall-Tau (sKdT). See Appendix for implementation details.
Sparse Kendall-Tau threshold. This value should be chosen according to what is considered a significant improvement for a given task. For CIFAR-10, where accuracy is larger than 90%, we consider a 0.1% performance gap to be sufficient. For tasks with smaller state-of-the-art performance, larger values might be better suited.
Number of architectures. In practice, we observed that the sparse Kendall-Tau metric became stable and reliable once a sufficient number of architectures was used. We used 200 architectures in our experiments to guarantee stability and fairness of the comparison of the different factors.
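To make the rank-sharing idea concrete, the following is a minimal sketch rather than the implementation from the Appendix. It assumes that rank sharing can be approximated by rounding each standalone accuracy (in percent) to the nearest multiple of the threshold and then computing a tie-aware Kendall tau-b. The function names and the toy accuracies are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b between two equally long score lists (O(n^2), handles ties)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    denom = math.sqrt((pairs - ties_x) * (pairs - ties_y))
    return (concordant - discordant) / denom if denom > 0 else 0.0


def sparse_kendall_tau(standalone_acc, supernet_acc, threshold=0.1):
    """Assumed variant: bin standalone accuracies (in %) to multiples of
    `threshold` so that near-identical architectures share a rank."""
    binned = [round(acc / threshold) for acc in standalone_acc]
    return kendall_tau_b(binned, supernet_acc)


standalone = [93.01, 93.04, 92.50, 91.00]  # toy ground truth, in percent
supernet = [0.61, 0.60, 0.55, 0.40]        # toy supernet accuracies
plain = kendall_tau_b(standalone, supernet)
sparse = sparse_kendall_tau(standalone, supernet)
print(plain, sparse)
```

In this toy example, the first two architectures differ by only 0.03% in standalone accuracy. The plain Kendall-Tau penalizes the supernet for swapping them, whereas the sparse variant treats them as tied and therefore reports a higher correlation.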
Limitation of sparse Kendall-Tau. We nonetheless acknowledge that our sparse Kendall-Tau has some limitations. For example, a failure case of using sparse Kendall-Tau for supernet evaluation may occur when the top 10% architectures are perfectly ordered, while the bottom 90% architectures are purely randomly distributed. In this case, the Kendall-Tau will be close to 0. However, the search algorithm will always return the best model, as desired.
Nevertheless, while this corner case would indeed be problematic for the standard Kendall-Tau, it can be circumvented by tuning the threshold of our sKdT. A large threshold value will lead to a small number of groups, whose ranking might be more meaningful. For instance, in some randomly-picked NAS-Bench-101 search processes, setting the threshold to 0.1% merges the top 3000 models into 9 ranks, but still yields an sKdT of only 0.2. Increasing the threshold to 10% clusters the 423K models into 3 ranks, but still yields an sKdT of only 0.3. This indicates the stability of our metric.
In Figure 3, we randomly picked 12 settings and show the corresponding bipartite graphs relating the supernet and ground-truth rankings to investigate where disorder occurs. In practice, the corner case discussed above virtually never occurs; the ranking disorder is typically spread uniformly across the architectures.
3.2.1 Other metrics
Although sparse Kendall-Tau captures the supernet quality well, it may fail in extreme cases, such as when the top-performing architectures are ranked perfectly while poor ones are ordered randomly. To account for such rare situations and ensure the soundness of our analysis, we also report additional metrics. We define two groups of metrics to holistically evaluate different aspects of a trained supernet.
The first group of metrics directly evaluates the quality of the supernet, including sparse Kendall-Tau and the widely-adopted supernet accuracy. For the supernet accuracy, we report the average accuracy of 200 architectures on the validation set of the dataset of interest. We will refer to this metric simply as accuracy. It is frequently used [39, 15] to assess the quality of the trained supernet, but we will show later that it is in fact a poor predictor of the final standalone performance.
The metrics in the second group evaluate the search performance of a trained supernet. The first metric is the probability to surpass random search: given the ground-truth rank of the best architecture found after a number of runs and the maximum rank, equal to the total number of architectures, one can compute the probability that the best architecture found is better than a randomly searched one. Finally, where appropriate, we report the standalone accuracy of the model, a.k.a. final performance, that was found by the complete WSNAS algorithm. Concretely, we randomly sample 200 architectures, select the 3 best models based on the supernet accuracy and query the ground-truth performance. We then take the mean of these architectures as standalone accuracy. Note that the same architectures are used to compute the sparse Kendall-Tau.
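One way to make the probability of surpassing random search concrete is sketched below. It rests on assumptions that are ours and not stated in the text: random search draws architectures uniformly and independently (with replacement), rank 1 denotes the best architecture, and a draw of equal rank does not count as worse. Under these assumptions, the chance that every random draw is ranked worse than the architecture found is (1 - rank / max_rank) raised to the number of runs. The function name is hypothetical.

```python
def prob_surpass_random(rank, max_rank, runs):
    """Probability that `runs` uniform random draws (with replacement) are all
    ranked worse than `rank`, where rank 1 is the best architecture."""
    if not 1 <= rank <= max_rank:
        raise ValueError("rank must lie in [1, max_rank]")
    return (1.0 - rank / max_rank) ** runs


# An architecture ranked 50th out of 100 beats two random draws with
# probability 0.5 * 0.5.
print(prob_surpass_random(rank=50, max_rank=100, runs=2))  # 0.25
```

A better (smaller) rank or a smaller random-search budget both push this probability towards 1.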
4 Analysis
We provide an analysis of the impact of the factors that are shown in Table I across three different search spaces. In addition, we report the complete numerical results of all metrics in Section 4.6.
Training Details.
We use PyTorch [42] for our experiments. Since NAS-Bench-101 was constructed in TensorFlow, we implement a mapper that translates TensorFlow parameters into our PyTorch model. We exploit two large-scale experiment management tools, SLURM [43] and Kubernetes [44], to deploy our experiments. We use various GPUs throughout our project, including NVIDIA Tesla V100, RTX 2080 Ti, GTX 1080 Ti and Quadro 6000 with CUDA 10.1. Depending on the number of training epochs, parameter sizes and batch size, most of the supernet training finishes within 12 to 24 hours, with the exception of FairNAS, whose training time is longer, as discussed earlier. We split the data into training/validation using a 90/10 ratio for all experiments, except those involving validation on the training portion. Please consult our submitted code for more details.
Reproducing the Ground Truth from TensorFlow.
The ground-truth performance used by NAS-Bench-101 was obtained with TensorFlow on TPUs. We therefore first reproduce these results in PyTorch to make sure that our re-implementation is trustworthy. We uniformly sampled 10 architectures at random and repeated this 3 times, which results in 30 architectures covering the performance spectrum from 82% to 93%. We adopted the optimizer and hyperparameter settings of the NAS-Bench-101 code release, repeated the training with 3 random initializations, and took the mean performance. Note that we copied the initialization from the released TensorFlow model into PyTorch format to minimize framework discrepancies. We plot the performance comparison in Figure 4. The Kendall-Tau metric is 0.81, which should be considered the upper bound for supernet training. Although the reproduced results are not perfectly aligned with the TensorFlow originals, this discrepancy cannot explain the significant drop to 0.2 after using weight sharing [31].
Spearman Corr. | Acc.-Perf. | SKdT-Perf. | Acc.-SKdT
NAS-Bench-101  | 0.09       | 0.45       | 0.23
NAS-Bench-201  | 0.52       | 0.55       | 0.94
DARTS-NDS      | 0.07       | 0.19       | 0.47
4.1 Evaluation of a supernet
The standalone performance of the architecture that is found by a NAS algorithm is clearly the most important metric to judge its merits. However, in practice, one cannot access this metric: we would not need NAS if standalone performance were easy to query (the cost of computing standalone performance is discussed in Section 4.1.2).
Furthermore, standalone performance inevitably depends on the sampling policy, and does not directly evaluate the quality of the supernet (see Section 4.1.2). Consequently, it is important to rely on metrics that are well correlated with the final performance but can be queried efficiently. To this end, we collect all our experiments and plot the pairwise correlation between final performance, sparse Kendall-Tau, and supernet accuracy. As shown in Figure 5, the supernet accuracy has a low correlation with the final performance on NAS-Bench-101 and DARTS-NDS. Only on NAS-Bench-201 does it reach a correlation of 0.52. The sparse Kendall-Tau yields a consistently higher correlation with the final performance. This is evidence that one should not focus too strongly on improving the supernet accuracy. While the sparse Kendall-Tau remains computationally heavy, it serves as a middle ground that is feasible to evaluate in real-world applications.
In the following experiments, we thus mainly rely on sparse Kendall-Tau, and use final search performance as a reference only.
4.1.1 Kendall-Tau vs. Spearman ranking correlation
Kendall-Tau is not the only metric to evaluate the ranking correlation. Spearman ranking correlation (SpR) is also widely adopted in this field [39, 13]. Note that our idea of sparsity also applies to SpR. In Table 4.1.1, we compare the performance of Kendall-Tau (KdT), SpR, and their sparse variants, in the same setting as Figure 5. Note that SpR and KdT perform similarly but that their sparse variants effectively improve the correlation on all search spaces.
4.1.2 Standalone Accuracy vs. Sparse Kendall-Tau
A common misconception is that the supernet quality is well reflected by the standalone accuracy of the final selected architecture. Neither sparse Kendall-Tau (sKdT) nor standalone accuracy is perfect. Both are tools to measure different aspects of a supernet.
Let us consider a completely new search space in which we have no prior knowledge about performance. As depicted by Figure 6, if we only rely on the standalone accuracy, the following situation might happen: due to the lack of knowledge, the ranking of the supernet is purely random, and the search space accuracy is uniformly distributed. When trying different settings, there will be one configuration that ‘outperforms’ the others in terms of standalone accuracy. However, this configuration will be selected by pure chance. By only measuring standalone accuracy, it is technically impossible to realize that the ranking is random. By contrast, if one measures the sKdT (which is close to 0 in this example), an ill-conditioned supernet can easily be identified. In other words, purely relying on standalone accuracy could lead to pathological outcomes that can be avoided using sparse Kendall-Tau.
Additionally, standalone accuracy is related to both the supernet and the search algorithm. Sparse Kendall-Tau allows us to judge the supernet quality independently from the search algorithm. As an example, consider the use of a reinforcement learning algorithm, instead of random sampling, on top of the supernet. When observing a poor standalone accuracy, one cannot conclude whether the problem is due to a poor supernet or to a poor performance of the RL algorithm. Prior to our work, people relied on the supernet accuracy to analyze the supernet quality. This is not a reliable metric, as shown in Figure 5. We believe that sparse Kendall-Tau is a better alternative.
Computational Cost. Computing the final accuracy is more expensive than training the supernet. Despite the low-fidelity heuristics reducing the weight-sharing costs, training a standalone network to convergence has a higher cost, e.g., DARTS searches for 50 epochs but trains from scratch for 600 epochs [6]. Furthermore, debugging and hyperparameter tuning typically require training thousands of standalone models. Note that, as one typically evaluates a random subset of architectures to understand the design space [14], sparse Kendall-Tau can be computed without additional costs. In any event, the budget for sparse Kendall-Tau is bounded.
4.2 Weight-sharing Protocol – Hyperparameters
4.2.1 Batch normalization in the supernet
Batch normalization (BN) is commonly used in standalone networks to allow for faster and more stable training. It is thus also employed in most CNN search spaces. However, BN behaves differently in the context of WSNAS, and special care has to be taken when using it. In a standalone network (cf. Figure 8 (Top)), a BN layer during training computes the batch statistics (mean and variance), normalizes the activations with them, and finally updates the population statistics using a moving average. For instance, the population mean is updated as a moving average of the batch means. At test time, the stored population statistics are used to normalize the feature map. In the standalone setting, both batch and population statistics are unbiased estimators of the population distribution.
By contrast, when training a supernet (Figure 8 (Bottom)), the population statistics that are computed based on the running average are not unbiased estimators of the population distribution, because the effective architecture before the BN layer varies in each epoch. More formally, consider the i-th architecture of the supernet. During training, the batch statistics are computed on the features produced by this architecture, so that the normalized output feature depends on the i-th architecture only. The population mean is then updated with these architecture-specific batch statistics. However, during training, different architectures from the supernet are sampled. Therefore, the population mean essentially becomes a weighted combination of the means of different architectures, where each weight is the sampling frequency of the corresponding architecture. When evaluating a specific architecture at test time, the estimated population statistics thus depend on the other architectures in the supernet. This leads to a train-test discrepancy. One solution to mitigate this problem is to re-calibrate the batch statistics by recomputing the statistics on the entire training set before the final evaluation [45]. While the cost of doing so is negligible for a standalone network, NAS algorithms typically sample a large number of architectures for evaluation, which makes this approach intractable.
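For reference, the quantities discussed above can be written in standard batch-normalization notation. The symbols below are our own choice and only a sketch, not necessarily the expressions of the original derivation: y_j denotes the activations of a mini-batch of size m, alpha the moving-average momentum, epsilon a small constant, and p_i the sampling frequency of the i-th architecture. The last line should be read as holding roughly, in expectation, since a moving average weights recent batches more strongly.

```latex
\begin{align}
\mu_B &= \frac{1}{m}\sum_{j=1}^{m} y_j, &
\sigma_B^2 &= \frac{1}{m}\sum_{j=1}^{m} (y_j - \mu_B)^2, \\
\hat{y}_j &= \frac{y_j - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, &
\mu &\leftarrow (1-\alpha)\,\mu + \alpha\,\mu_B, \\
\mu_{\text{supernet}} &\approx \sum_i p_i\,\mu_B^{(i)}.
\end{align}
```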
In contrast to [13] and [36], which use the training mode also during testing, we formalize a simple, yet effective, approach to tackle the train-test discrepancy of BN in supernet training: we leave the normalization based on batch statistics during training unchanged, but use batch statistics also during testing. Since supernet evaluation is always conducted over a complete dataset, we are free to perform inference in mini-batches of the same size as the ones used during training. This allows us to compute the batch statistics on the fly in the exact same way as during training.
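The toy sketch below illustrates the discrepancy and the batch-statistics remedy on plain Python lists. It is an assumption-laden illustration and not the paper's code: two hypothetical architectures feed the same BN layer with differently centered features, only the mean is tracked for brevity, and the feature values and momentum are made up.

```python
def mean(values):
    return sum(values) / len(values)


def std(values, eps=1e-5):
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values) + eps) ** 0.5


# Toy features produced before the same BN layer by two sampled architectures.
features = {
    "arch_a": [-1.0, 0.0, 1.0],  # batch mean 0
    "arch_b": [3.0, 4.0, 5.0],   # batch mean 4
}

# Training: the tracked (population) mean mixes both architectures.
momentum, running_mean = 0.1, 0.0
for step in range(200):
    arch = "arch_a" if step % 2 == 0 else "arch_b"
    running_mean = (1 - momentum) * running_mean + momentum * mean(features[arch])

# Evaluating arch_a with the tracked mean leaves a large offset ...
batch = features["arch_a"]
with_tracked = [(v - running_mean) / std(batch) for v in batch]
# ... whereas using batch statistics at test time centers the features.
with_batch = [(v - mean(batch)) / std(batch) for v in batch]

print(round(running_mean, 3), round(mean(with_tracked), 3), round(mean(with_batch), 3))
```

The tracked mean settles between the two architecture-specific means, so neither architecture is normalized correctly at test time, while the batch-statistics variant reproduces the training-time normalization.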
Figure 8 compares standard BN to our proposed modification. Using the tracked population statistics leads to many architectures with an accuracy around 10%, i.e., performing no better than random guessing. Our proposed modification allows us to significantly increase the fraction of high-performing architectures. Our results also show that the choice of fixing vs. learning an affine transformation in batch normalization should match the standalone protocol.
4.2.2 Learning rate
The training loss of the supernet encompasses the task losses of all possible architectures. We suspect that the training difficulty increases with the number of architectures represented by the supernet. To better study this, we visualize the loss landscape [46] of the standalone network and of a supernet. Concretely, the landscape is computed over the supernet training loss under the single-path one-shot sampling method (Eq. 1).
Figure 9 shows that the loss landscape of the supernet is less smooth than that of a standalone architecture, which confirms our intuition. A smoother landscape indicates that optimization will converge more easily to a good local optimum. With a smooth landscape, one can thus use a relatively large learning rate. By contrast, a less smooth landscape requires using a smaller one.
Our experiments further confirm this observation. In the standalone protocol, the learning rate is prescribed by each benchmark (NAS-Bench-101, NAS-Bench-201, and DARTS-NDS). All protocols use a cosine learning rate decay. Figure 10 shows that supernet training requires lower learning rates than standalone training. The same trend is shown for other search spaces in Section 4.6, Table VIII. We set the learning rate to 0.025 to be consistent across the three search spaces.
4.2.3 Number of epochs
Since the cosine learning rate schedule decays the learning rate to zero towards the end of training, we evaluate the impact of the number of training epochs. In standalone training, the number of epochs was set to 108 for NAS-Bench-101, 200 for NAS-Bench-201, and 100 for DARTS-NDS. Figure 11 shows that increasing the number of epochs significantly improves the accuracy in the beginning, but eventually decreases the accuracy for NAS-Bench-101 and DARTS-NDS. Interestingly, the number of epochs impacts neither the correlation of the ranking nor the final selected model performance after 400 epochs. We thus use 400 epochs for the remaining experiments.
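The interplay between the cosine schedule and the number of epochs can be sketched as follows. This is a minimal illustration that assumes a plain cosine decay without warm-up and with a final learning rate of exactly zero. The initial learning rate of 0.025 and the 400 epochs are the values chosen in the text, while the function name is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=400, base_lr=0.025):
    """Cosine decay from `base_lr` at epoch 0 to zero at `total_epochs`."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


for epoch in (0, 100, 200, 300, 400):
    print(epoch, round(cosine_lr(epoch), 5))
```

Because the schedule is stretched over the whole run, changing the number of epochs also changes how long the supernet is trained at each learning rate, which is one reason the two hyperparameters are studied together.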
4.2.4 Weight decay
Weight decay is used to reduce overfitting. For WSNAS, however, overfitting does not occur because there are billions of architectures sharing the same set of parameters, which in fact rather causes underfitting. Based on this observation, [10] propose to disable weight decay during supernet training. Figure 12, however, shows that the behavior of weight decay varies across datasets. While on DARTS-NDS weight decay is indeed harmful, it improves the results on NAS-Bench-101 and NAS-Bench-201. We conjecture that this is due to the much larger number of architectures in DARTS-NDS (243 billion) than in the NAS-Bench series (less than 500,000).
4.3 Weight-sharing Protocol – Sampling
Aside from the RandomNAS described in Section 3.1, we additionally include two variants of RandomNAS: 1) As pointed out by [12], two supernet architectures might be topologically equivalent in the standalone network by simply swapping operations. We thus include architecture-aware random sampling that ensures equal probability for unique architectures [31]. We name this variant RandomA; 2) We evaluate a variant called FairNAS [15], which ensures that each operation is selected with equal probability during supernet training. Although FairNAS was designed for a search space where only operations are searched but not the topology, we adapt it to our setting.
Adaptation of FairNAS. Originally, FairNAS [15] was proposed in a search space with a fixed sequential topology, as depicted by Figure 13 (a), where every node is sequentially connected to the previous one, and only the operations on the edges are subject to change. However, our benchmark search spaces exploit a more complex dynamic topology, as illustrated in Figure 13 (b), where one node can connect to one or more previous nodes. Before generalizing to a dynamic topology search space, we simplify the original approach into a 2-node scenario: for each input batch, FairNAS will first randomly generate a sequence of all possible operations. It then samples one operation at a time, computes gradients for the fixed input batch, and accumulates the gradients across the operations. Once all operations have been sampled, the supernet parameters are updated with the average gradients. This ensures that all possible paths are equally exploited. With this simplification, FairNAS can be applied regardless of the topology. For a sequential-topology search space, we repeat the 2-node policy for every consecutive node pair. Naturally, for a dynamic topology space, FairNAS can be adopted in a similar manner, i.e., one first samples a topology, then applies the 2-node strategy for all connected node pairs. Note that adapting FairNAS increases the training time by a factor equal to the number of candidate operations.
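The 2-node update described above can be sketched on a toy problem. Everything in this example is an illustrative assumption rather than the FairNAS implementation: each "operation" is a single scalar gain, a scalar `shared` weight stands in for parameters used by every path, the loss is a squared error, and the function name is hypothetical.

```python
import random


def fair_two_node_step(weights, shared, x, target, lr=0.1, rng=random):
    """One update in a toy 2-node setting: visit every operation once in a
    random order, accumulate gradients on the same batch, then apply the mean."""
    order = list(weights)   # all candidate operations
    rng.shuffle(order)      # random sequence without replacement
    grad_shared = 0.0
    grad_ops = {}
    for op in order:
        out = weights[op] * shared * x        # toy "operation": a scalar gain
        err = out - target                    # derivative of 0.5 * err^2 w.r.t. out
        grad_ops[op] = err * shared * x       # gradient for this operation's weight
        grad_shared += err * weights[op] * x  # accumulated on the shared weight
    n = len(order)
    new_weights = {op: w - lr * grad_ops[op] / n for op, w in weights.items()}
    new_shared = shared - lr * grad_shared / n
    return new_weights, new_shared, order


ops = {"conv3x3": 0.5, "conv1x1": 1.5, "skip": 1.0}
new_ops, new_shared, order = fair_two_node_step(
    ops, shared=1.0, x=1.0, target=1.0, rng=random.Random(0)
)
print(sorted(order), new_ops, new_shared)
```

Since all gradients are computed at the same parameter values before the single update, the visiting order does not change the result; what matters is that every operation is visited exactly once per batch, which is also why the cost per batch grows with the number of operations.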
Results. With the hyperparameters fixed, we now compare the three path-sampling techniques. Since DARTSNDS does not contain enough samples trained in a standalone manner, we only report results on NASBench101 and 201. In Figure 14, we show the sampling distributions of the different approaches and their impact on the supernet in terms of sparse KendallTau. These experiments reveal that, on NASBench101, sampling one architecture uniformly at random, as in [30, 31], is strongly biased in terms of accuracy and ranking. This can be observed from the peaks around rank 0, 100,000, and 400,000. The reason is that a single architecture can have multiple encodings, and uniform sampling thus oversamples architectures with several equivalent encodings. FairNAS samples architectures more evenly and yields consistently better sparse KendallTau values, albeit by a small margin.
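The bias of uniform encoding sampling can be illustrated with a toy search space. The sketch below is not the NASBench101 encoding: we assume a hypothetical cell with two operation slots whose order does not matter, so that `canonical` maps equivalent encodings to one architecture.

```python
import itertools
import random
from collections import Counter

OPS = ("conv1x1", "conv3x3", "maxpool")   # hypothetical operation set

def canonical(encoding):
    # Toy equivalence (assumption): swapping the two operations gives the same architecture.
    return tuple(sorted(encoding))

ALL_ENCODINGS = list(itertools.product(OPS, repeat=2))
UNIQUE_ARCHS = sorted({canonical(e) for e in ALL_ENCODINGS})

def encodings_per_architecture():
    """Count how many encodings map to each unique architecture."""
    return Counter(canonical(e) for e in ALL_ENCODINGS)

def sample_uniform_encoding(rng=random):
    """Uniform over encodings: architectures with several encodings are oversampled."""
    return canonical(rng.choice(ALL_ENCODINGS))

def sample_architecture_aware(rng=random):
    """Uniform over unique architectures, in the spirit of RandomA."""
    return rng.choice(UNIQUE_ARCHS)
```

In this toy space, an architecture with two distinct operations has two encodings and is therefore drawn twice as often by `sample_uniform_encoding` as one that repeats the same operation, whereas `sample_architecture_aware` draws every architecture with equal probability.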
On NASBench201, the three sampling policies have a similar coverage. This is because, in NASBench201, topologically equivalent encodings were not pruned. In this case, RandomNAS performs better than in NASBench101, and FairNAS yields good early performance but quickly saturates. In short, using different sampling strategies might in general be beneficial, but we advocate for FairNAS in the presence of a limited training budget.
4.4 Weightsharing Mapping – Lower Fidelity Estimates lower the ranking correlation
Metrics  Settings
Batchsize  256  128  64
SKdT  0.740 ± 0.07  0.728 ± 0.16  0.703 ± 0.16
SAA  92.92 ± 0.48  92.37 ± 0.61  92.35 ± 0.34
Init Channel  16  8  4
SKdT  0.740 ± 0.07  0.677 ± 0.10  0.691 ± 0.15
SAA  92.92 ± 0.48  92.32 ± 0.37  92.79 ± 0.85
Repeated cells  3  2  1
SKdT  0.751 ± 0.09  0.692 ± 0.18  0.502 ± 0.21
SAA  91.91 ± 0.09  91.95 ± 0.10  90.30 ± 0.71
Train portion  0.75  0.5  0.25
SKdT  0.751 ± 0.11  0.742 ± 0.12  0.693 ± 0.13
SAA  92.13 ± 0.51  92.74 ± 0.43  91.47 ± 0.81
Reducing memory footprint and training time by proposing smaller supernets has been an active research direction, and the resulting supernets are referred to as lower-fidelity estimates [37]. The impact of this approach on the supernet quality, however, has never been studied systematically over multiple search spaces. We compare four popular strategies in Table III. We deliberately prolong the training, increasing the number of epochs in inverse proportion to the computational budget that the low-fidelity estimate would save. For example, if the number of channels is reduced by half, we train the model for twice as many epochs. Note that this provides an upper bound on the performance of low-fidelity estimates.
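The epoch scaling rule can be written as a one-line helper. This is a sketch under our own assumption that `budget_fraction` denotes the fraction of the baseline computational cost that the low-fidelity setting retains (0.5 for the halved-channel example); the function name is hypothetical.

```python
def scaled_epochs(base_epochs, budget_fraction):
    """Prolong training in inverse proportion to the retained compute budget."""
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must lie in (0, 1]")
    return round(base_epochs / budget_fraction)

# Halving the budget doubles the number of epochs.
epochs_half_channels = scaled_epochs(400, 0.5)
```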
A commonlyused approach to reduce memory requirements is to decrease the batch size [32]. Surprisingly, lowering the batch size from 256 to 64 has limited impact on the accuracy, but decreases sparse KendallTau and the final searched model’s performance, the most important metric in practice.
As an alternative, one can decrease the number of repeated cells [5, 15]. This reduces the total number of parameters proportionally, since the number of cells in consecutive layers depends on the first one. Table III shows that this decreases the sparse KendallTau from 0.7 to 0.5. By contrast, reducing the number of channels in the first layer [6] has little impact. Hence, to train a good supernet, one should avoid changes between and , but one can reduce the batch size by a factor 0.5 and use only one repeated cell.
The last lower-fidelity factor is the portion of training data that is used [6, 23]. Surprisingly, reducing the training portion only marginally decreases the sparse KendallTau for all three search spaces. On NASBench201, keeping only 25% of the CIFAR10 dataset results in a 0.1 drop in sparse KendallTau. This explains why DARTSbased methods typically use only 50% of the data to train the supernet but can still produce reasonable results.
4.5 Weightsharing Mapping – Implementation
4.5.1 Dynamic channeling hurts supernet quality
Edges  Accuracy  SKdT  P > R  Standalone Acc.
Baseline: random sampling subspaces with dynamic channeling.
1  70.04 ± 8.15  0.173  0.797  91.19 ± 2.01
2  78.29 ± 10.51  0.206  0.734  82.03 ± 1.50
3  79.92 ± 9.42  0.242  0.576  92.20 ± 1.19
4  79.37 ± 17.34  0.270  0.793  92.32 ± 1.10
Average  76.905 ± 10.05  0.223  0.865  89.435 ± 4.30
Disable dynamic channels by fixing the edges to the output node.
1  76.92 ± 7.87  0.435  0.991  93.94 ± 0.22
2  74.32 ± 8.21  0.426  0.925  93.34 ± 0.01
3  77.24 ± 9.18  0.487  0.901  93.66 ± 0.07
4  79.31 ± 7.04  0.493  0.978  93.65 ± 0.07
Average  76.95 ± 8.29  0.460  0.949  93.65 ± 0.73
Dynamic channeling is an implicit factor in many search spaces [12, 7, 39, 40]. It refers to the fact that the number of channels of the intermediate layers depends on the number of incoming edges to the output node. This is depicted by Figure 4.5.1 (a) for a search cell with several intermediate nodes, whose input and output nodes each have a fixed number of channels. When multiple edges enter the output node (c.f. Figure 4.5.1 (b)), the associated channel numbers decrease so that their sum equals the number of output channels. In the general case, shown in Figure 4.5.1 (c), the number of channels in each intermediate node is thus the number of output channels divided by the number of incoming edges. A weight-sharing approach has to cope with this architecture-dependent fluctuation of the number of channels during training.
Let $c$ denote the number of channels of a given architecture, and $c_{\max}$ the maximum number of channels for a node across the entire search space. All existing approaches allocate $c_{\max}$ channels and, during training, extract a subset of these channels. The existing methods then differ in how they extract the channels: [39] use a fixed chunk of channels; [47] randomly shuffle the channels before extracting a fixed chunk; and [48] linearly interpolate the $c_{\max}$ channels into $c$ channels using a moving average across neighboring channels.
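The three extraction strategies can be contrasted on a toy example. The sketch below is not taken from [39], [47], or [48]: each list entry stands in for one allocated channel, taking the first channels as the fixed chunk is our assumption, and `interpolated` uses plain linear resampling as a simple stand-in for the moving-average scheme.

```python
import random

def fixed_chunk(channels, c):
    """Keep a fixed chunk; here the first c channels (which chunk is an assumption)."""
    return channels[:c]

def shuffled_chunk(channels, c, rng=random):
    """Randomly shuffle the channels, then keep a fixed chunk."""
    perm = list(channels)
    rng.shuffle(perm)
    return perm[:c]

def interpolated(channels, c):
    """Linearly resample the allocated channels down to c values (assumed scheme)."""
    c_max = len(channels)
    if c == 1:
        return [sum(channels) / c_max]
    out = []
    for i in range(c):
        pos = i * (c_max - 1) / (c - 1)    # fractional position in the allocated channels
        lo = int(pos)
        hi = min(lo + 1, c_max - 1)
        frac = pos - lo
        out.append((1 - frac) * channels[lo] + frac * channels[hi])
    return out
```

All three variants share one allocated parameter tensor across architectures with different channel counts, which is the coupling that the approach proposed next avoids.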
Instead of sharing the channels between architectures, we propose to disable dynamic channeling completely. As the channel number only depends on the incoming edges, we separate the search space into a discrete number of subspaces, each with a fixed number of incoming edges. As shown in Table 4.5.1, disabling dynamic channeling improves the sparse KendallTau and the final search performance by a large margin and yields a new state of the art on NASBench101.
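Splitting the search space by the number of incoming edges to the output node can be sketched as follows. We assume, for illustration only, that each architecture is given as an adjacency matrix (list of lists) whose last node is the output node; the function names are hypothetical.

```python
from collections import defaultdict

def output_in_degree(adj):
    """Number of edges entering the output node (assumed to be the last node)."""
    out = len(adj) - 1
    return sum(row[out] for row in adj)

def split_into_subspaces(archs):
    """Group architectures so that each subspace has a fixed number of incoming edges."""
    subspaces = defaultdict(list)
    for adj in archs:
        subspaces[output_in_degree(adj)].append(adj)
    return dict(subspaces)
```

Within each resulting subspace, the channel count of the intermediate nodes is constant, so one supernet can be trained per subspace without extracting channel subsets.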
Ablation study. Since each subspace now encompasses fewer architectures, it is not fair to perform a comparison with the full NASBench 101 search space. Therefore, for each subspace, we construct a baseline space where we drop architectures uniformly at random until the number of remaining architectures matches the size of the subspace. We repeat this process with 3 different initializations, while keeping all other factors unchanged when training the supernet.
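Constructing such a size-matched baseline amounts to drawing a uniform sample without replacement from the full space. A minimal sketch, assuming the search space fits in a Python list and using hypothetical names and seeds:

```python
import random

def size_matched_baseline(full_space, subspace_size, seed):
    """Uniformly keep subspace_size architectures from the full space."""
    rng = random.Random(seed)
    return rng.sample(full_space, subspace_size)

# Three baselines obtained from three different seeds (toy space of 100 architecture ids).
baselines = [size_matched_baseline(list(range(100)), 10, seed) for seed in (0, 1, 2)]
```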
We also provide additional results in Table IV for each individual subspace and show that the sparse KendallTau remains similar to that of the baseline using the full search space, which clearly evidences the effectiveness of our approach to disable the dynamic channeling.
Furthermore, we evaluate the effect of disabling dynamic channels during the supernet training and test phase individually in Table 4.5.1. Disabling dynamic channeling during both phases yields the best results.
4.5.2 WS on Edges or Nodes?
NASBench101  NASBench201  DARTSNDS  

Baseline  0.236 92.32  0.740 92.92  0.159 93.59 
OpEdge  N/A  as Baseline  0.189 93.97 
OpNode  as Baseline  0.738 92.36  as Baseline 
Figure: (a) Consider a search space with 2 intermediate nodes, 1 and 2, with one input (I) and one output (O) node. This yields 5 edges. Let us assume that we have 4 possible operations to choose from, as indicated by the purple color code. (b) When the operations are on the nodes, there are 2 × 4 ops to share, i.e., I→2 and 1→2 share weights on node 2. (c) If the operations are on the edges, then we have 5 × 4 ops to share.
Most existing works define the weight-sharing mapping so that the shared operations lie on the graph nodes rather than on the edges. This is because, if the mapping assigns the shared operations to the edges, the parameter size increases from $O(n)$ to $O(n^2)$, where $n$ is the number of intermediate nodes. We provide a concrete example in Figure V. However, the high sparse KendallTau on NASBench201 in the top part of Table V, which is obtained by mapping to the edges, may suggest that sharing on the edges is beneficial. Here we investigate if this is truly the case.
On NASBench101, by design, each node merges the previous nodes’ outputs and then applies parametric operations. This makes it impossible to build an equivalent sharing on the edges. We therefore construct sharing on the edges for DARTSNDS and sharing on the nodes for NASBench201. As shown in Table VI, for both spaces, sharing on the edges yields a marginally better supernet than sharing on the nodes. Such small differences might be due to the fact that, in both spaces, the number of nodes is 4, while the number of edges is 6, thus mapping to edges will not drastically affect the number of parameters. Nevertheless, this indicates that one should consider having a larger number of shared weights when the resources are not a bottleneck.
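The difference in the number of shared operations follows from simple counting. The sketch below only restates the numbers given in the text and figure; the helper name is hypothetical.

```python
def shared_op_count(n_locations, n_ops):
    """Shared operations when each location (node or edge) holds n_ops candidates."""
    return n_locations * n_ops

# Figure example: 2 intermediate nodes, 5 edges, 4 candidate operations.
on_nodes = shared_op_count(2, 4)
on_edges = shared_op_count(5, 4)

# Spaces compared in the text: 4 nodes versus 6 edges.
growth = shared_op_count(6, 4) / shared_op_count(4, 4)
```

With 4 nodes and 6 edges, mapping to the edges increases the number of shared operations by a factor of only 1.5, which is consistent with the small differences observed between the two mappings.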
4.5.3 Other mapping factors
NASBench101  NASBench201  DARTSNDS  

Baseline  0.236 92.32  0.740 92.92  0.159 93.59 
WSBN  0.056 91.33  0.675 92.04  0.331 92.95 
GlobalDropout  0.179 90.95  0.676 91.76  0.102 92.30 
PathDropout  0.128 91.19  0.431 91.42  0.090 91.90 
OFA Kernel  0.132 92.01  0.574 91.83  0.112 92.83 
Method  NASBench101 (n=7)  NASBench201  DARTSNDS  DARTSNDS
ENAS [5]  91.83 ± 0.42  54.30 ± 0.00  94.45 ± 0.09  97.11
DARTSV2 [6]  92.21 ± 0.61  54.30 ± 0.00  94.79 ± 0.11  97.37
NAO [28]  92.59 ± 0.59  -  -  97.10
GDAS [40]  -  93.51 ± 0.13  -  96.23
Random NAS [30]  89.89 ± 3.89  87.66 ± 1.69  91.33 ± 0.12  96.74
Random NAS (Ours)  93.12 ± 0.06  92.71 ± 0.15  94.26 ± 0.05  97.08
Results from [30].
Trained according to [6] for 600 epochs.
On NASBench201, both random NAS and our approach sample 100 final architectures to follow [13].
We evaluate the weight-sharing batch normalization (WSBN) of [49], which keeps an independent set of parameters for each incoming edge. Furthermore, we test the two commonly used dropout strategies: right before global pooling (global dropout); and at all edge connections between the nodes (path dropout). Note that path dropout has been widely used in WSNAS [28, 6, 5]. For both dropout strategies, we set the dropout rate to 0.2. Finally, we evaluate the super convolution layer of [35], referred to as OFA kernel, which accounts for the fact that, in CNN search spaces, convolution operations appear as groups, and thus merges the convolutions within the same group, keeping only the largest kernel parameters and performing a parametric projection to obtain the other kernels. The results in Table VI show that all these factors negatively impact the search performance and the supernet quality.
Factor and settings  NASBench101  NASBench201  DARTSNDS
Supernet  Final  Supernet  Final  Supernet  Final
Accuracy  SKdT  P > R  Performance  Accuracy  SKdT  P > R  Performance  Accuracy  SKdT  P > R  Performance
Batchnorm.
affine F track F  0.651 ± 0.05  0.161  0.996  0.916 ± 0.13  0.660 ± 0.13  0.783  0.997  92.67 ± 1.21  0.735 ± 0.18  0.056  0.224  93.14 ± 0.28
affine T track F  0.710 ± 0.04  0.240  0.996  0.924 ± 0.01  0.713 ± 0.14  0.718  0.707  91.71 ± 1.05  0.265 ± 0.21  0.071  0.213  91.89 ± 2.01
affine F track T  0.144 ± 0.09  0.084  0.112  0.882 ± 0.02  0.182 ± 0.15  0.171  0.583  86.41 ± 4.84  0.359 ± 0.25  0.078  0.023  90.33 ± 0.76
affine T track T  0.153 ± 0.10  0.008  0.229  0.905 ± 0.01  0.134 ± 0.09  0.417  0.274  90.77 ± 0.40  0.216 ± 0.18  0.050  0.109  90.49 ± 0.32
Learning rate.
0.005  0.627 ± 0.07  0.091  0.326  0.908 ± 0.01  0.658 ± 0.11  0.668  0.141  90.14 ± 0.55  0.792 ± 0.08  0.130  0.033  91.81 ± 0.68
0.01  0.668 ± 0.06  0.095  0.546  0.919 ± 0.00  0.713 ± 0.12  0.670  0.711  91.21 ± 1.18  0.727 ± 0.05  0.131  0.258  92.86 ± 0.64
0.025  0.715 ± 0.05  0.220  0.910  0.917 ± 0.01  0.659 ± 0.13  0.665  0.844  92.42 ± 0.58  0.656 ± 0.14  0.218  0.299  93.42 ± 0.20
0.05  0.727 ± 0.05  0.143  0.905  0.911 ± 0.02  0.631 ± 0.14  0.594  0.730  92.02 ± 0.70  0.623 ± 0.04  0.147  0.489  91.70 ± 0.33
0.1  0.690 ± 0.07  0.005  0.905  0.909 ± 0.02  0.609 ± 0.28  0.571  0.618  91.82 ± 0.81  0.735 ± 0.06  0.096  0.099  92.73 ± 0.24
0.15  0.000 ± 0.00  0.274  N/A  N/A  0.551 ± 0.14  0.506  0.553  91.22 ± 1.20  0.371 ± 0.27  0.027  0.218  91.20 ± 0.72
0.2  -  -  -  -  0.519 ± 0.12  0.557  0.035  88.74 ± 0.11  0.102 ± 0.48  0.366  N/A  N/A
Epochs.
100  0.468 ± 0.07  0.190  0.759  0.920 ± 0.01  0.472 ± 0.09  0.355  0.997  92.11 ± 1.67  0.643 ± 0.04  0.144  0.901  93.90 ± 0.49
200  0.662 ± 0.05  0.131  0.685  0.914 ± 0.01  0.604 ± 0.12  0.610  0.881  91.88 ± 2.01  0.761 ± 0.05  0.169  0.778  94.08 ± 0.21
300  0.727 ± 0.03  0.251  0.739  0.920 ± 0.01  0.664 ± 0.13  0.627  0.840  91.42 ± 1.91  0.793 ± 0.06  0.098  0.870  93.22 ± 0.95
400  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.697 ± 0.14  0.667  0.158  89.83 ± 0.97  0.798 ± 0.07  0.106  0.036  92.34 ± 0.22
600  0.815 ± 0.02  0.246  0.556  0.911 ± 0.01  0.720 ± 0.13  0.682  0.285  90.28 ± 0.82  0.734 ± 0.10  0.090  0.209  93.23 ± 0.19
800  0.826 ± 0.02  0.243  0.177  0.907 ± 0.00  0.760 ± 0.13  0.711  0.378  91.53 ± 0.53  0.728 ± 0.10  0.044  0.853  93.29 ± 0.81
1000  0.794 ± 0.03  0.177  0.831  0.920 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.717 ± 0.09  0.044  0.997  93.92 ± 0.90
1200  -  -  -  -  0.775 ± 0.13  0.723  0.198  90.81 ± 0.56  -  -  -  -
1400  -  -  -  -  0.774 ± 0.13  0.750  0.604  92.26 ± 0.33  -  -  -  -
1600  -  -  -  -  0.778 ± 0.13  0.731  0.882  91.85 ± 1.20  -  -  -  -
1800  -  -  -  -  0.783 ± 0.13  0.746  0.266  90.64 ± 0.82  -  -  -  -
Weight decay.
0.0  0.645 ± 0.05  0.037  0.179  0.899 ± 0.01  0.713 ± 0.13  0.652  0.266  90.58 ± 0.99  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
0.0001  0.719 ± 0.03  0.109  0.659  0.912 ± 0.01  0.756 ± 0.13  0.734  0.612  91.88 ± 0.59  0.751 ± 0.05  0.143  0.396  93.37 ± 0.44
0.0003  0.771 ± 0.03  0.144  0.648  0.915 ± 0.01  0.772 ± 0.13  0.721  0.726  92.34 ± 0.57  0.759 ± 0.06  0.110  0.890  93.82 ± 0.51
0.0005  0.782 ± 0.03  0.117  0.910  0.911 ± 0.02  0.764 ± 0.13  0.705  0.882  92.61 ± 0.59  0.739 ± 0.07  0.077  0.051  91.61 ± 1.01
Sampling.
RandomA  0.717 ± 0.04  0.133  0.862  0.919 ± 0.02  0.764 ± 0.13  0.705  0.882  92.61 ± 0.59  -  -  -  -
RandomNAS  0.638 ± 0.20  0.167  0.949  0.913 ± 0.02  0.765 ± 0.14  0.750  0.897  92.17 ± 1.01  -  -  -  -
FairNAS  0.789 ± 0.03  0.288  0.382  0.908 ± 0.01  0.774 ± 0.14  0.713  0.917  93.06 ± 0.31  -  -  -  -
Factor and settings  NASBench101  NASBench201  DARTSNDS
Supernet  Final  Supernet  Final  Supernet  Final
Accuracy  SKdT  P > R  Performance  Accuracy  SKdT  P > R  Performance  Accuracy  SKdT  P > R  Performance
Number of layers (-X indicates the baseline minus X)
Baseline  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
-1  0.759 ± 0.03  0.214  0.222  0.901 ± 0.01  0.749 ± 0.13  0.710  0.796  91.85 ± 0.92  0.843 ± 0.04  0.178  0.299  92.35 ± 1.25
-2  0.817 ± 0.03  0.228  0.713  0.910 ± 0.02  0.777 ± 0.13  0.700  0.822  92.68 ± 0.37  0.852 ± 0.03  0.205  0.609  92.65 ± 1.89
Train portion
0.25  0.433 ± 0.07  0.216  0.281  0.901 ± 0.01  0.660 ± 0.11  0.668  0.979  92.30 ± 1.14  0.597 ± 0.14  0.132  0.359  92.27 ± 1.84
0.5  0.612 ± 0.06  0.251  0.424  0.896 ± 0.02  0.740 ± 0.12  0.669  0.979  93.17 ± 0.47  0.666 ± 0.17  0.083  0.551  92.22 ± 1.36
0.75  0.688 ± 0.05  0.222  0.857  0.920 ± 0.01  0.758 ± 0.13  0.725  0.618  92.46 ± 0.19  0.715 ± 0.18  0.096  0.081  92.29 ± 0.47
0.9  0.722 ± 0.05  0.186  0.996  0.931 ± 0.01  0.772 ± 0.13  0.721  0.726  92.34 ± 0.57  0.703 ± 0.18  0.042  0.065  92.78 ± 0.10
Batch size (/ X indicates the baseline divided by X)
Baseline  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
/ 2  0.670 ± 0.05  0.246  0.807  0.920 ± 0.01  0.728 ± 0.16  0.719  0.842  92.37 ± 0.61  0.698 ± 0.20  0.037  0.209  93.24 ± 0.13
/ 4  0.686 ± 0.07  0.155  0.913  0.921 ± 0.01  0.703 ± 0.16  0.679  0.672  92.35 ± 0.34  0.633 ± 0.20  0.033  0.690  93.68 ± 0.62
Number of channels (/ X indicates the baseline divided by X)
Baseline  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
/ 2  0.658 ± 0.05  0.156  0.704  0.898 ± 0.02  0.697 ± 0.14  0.667  0.158  89.83 ± 0.97  0.776 ± 0.05  0.190  0.993  93.90 ± 0.71
/ 4  0.604 ± 0.06  0.093  0.907  0.922 ± 0.01  0.606 ± 0.13  0.616  0.878  92.86 ± 0.34  0.707 ± 0.05  0.202  0.359  92.93 ± 0.58
Factor and settings  NASBench101  NASBench201  DARTSNDS
Supernet  Final  Supernet  Final  Supernet  Final
Accuracy  SKdT  P > R  Performance  Accuracy  SKdT  P > R  Performance  Accuracy  SKdT  P > R  Performance
Other factors
Baseline  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
OFA Kernel  0.708 ± 0.08  0.132  0.203  92.01 ± 0.19  0.672 ± 0.18  0.574  0.605  91.83 ± 0.86  0.782 ± 0.05  0.112  0.399  93.22 ± 0.43
WSBN  0.155 ± 0.07  0.085  0.504  0.809 ± 0.13  0.703 ± 0.14  0.676  0.585  92.06 ± 0.48  0.744 ± 0.16  0.033  0.682  92.88 ± 1.22
Path dropout rate
Baseline  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
0.05  0.750 ± 0.02  0.206  0.819  0.915 ± 0.07  0.490 ± 0.09  0.712  0.881  92.25 ± 0.89  0.184 ± 0.06  0.006  0.359  92.93 ± 0.60
0.15  0.726 ± 0.02  0.186  0.482  0.910 ± 0.01  0.250 ± 0.03  0.640  0.526  91.44 ± 1.25  0.366 ± 0.05  0.059  0.570  92.61 ± 1.28
0.2  0.669 ± 0.01  0.110  0.282  0.901 ± 0.01  0.185 ± 0.02  0.431  0.809  92.15 ± 0.85  0.518 ± 0.06  0.090  0.009  91.45 ± 0.58
Global dropout
Baseline  0.769 ± 0.03  0.236  0.932  0.921 ± 0.01  0.782 ± 0.13  0.740  0.589  92.92 ± 0.48  0.670 ± 0.03  0.159  0.629  93.09 ± 0.73
0.2  0.739 ± 0.05  0.233  0.221  0.910 ± 0.00  0.712 ± 0.13  0.702  0.950  91.76 ± 1.36  0.557 ± 0.19  0.018  0.451  93.51 ± 0.27
Please refer to Section 4.5.2 for mapping on the node or edge and Section 4.5.1 for dynamic channel factor results.
4.6 Results for All Factors
5 How should you train your supernet?
Figure 15 summarizes the influence of all tested factors on the final performance. It stands out that properly tuned hyperparameters lead to the biggest improvements by far. Surprisingly, most other factors and techniques either have a hardly measurable effect or in some cases even lead to worse performance. Based on these findings, here is how you should train your supernet:

Do not use supernet accuracy to judge the quality of your supernet. The sparse KendallTau has much higher correlation with the final search performance.

When batch normalization is used, do not use the moving average statistics during evaluation. Instead, compute the statistics on the fly over a batch of the same size as used during training.

The loss landscape of supernets is less smooth than that of standalone networks. Start from a smaller learning rate than standalone training.

Do not use low-fidelity estimates other than a moderate reduction of the training set size to decrease the search time.

Do not use dynamic channeling in search spaces that have a varying number of channels in the intermediate nodes. Break the search space into multiple subspaces such that dynamic channeling is not required.
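The batch normalization recommendation above can be made concrete with a scalar toy example. The sketch below is ours, not the implementation used in the experiments: it contrasts normalizing with statistics computed on the current batch against normalizing with stored moving averages. In PyTorch, constructing a batch normalization layer with `track_running_stats=False` gives the former behavior in both training and evaluation.

```python
import math

def batchnorm_batch_stats(batch, eps=1e-5):
    """Normalize with the mean and variance of the current batch (computed on the fly)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

def batchnorm_running_stats(batch, running_mean, running_var, eps=1e-5):
    """Normalize with moving-average statistics collected during training."""
    return [(x - running_mean) / math.sqrt(running_var + eps) for x in batch]
```

When the sampled architecture changes the activation distribution, stored statistics may no longer match the current activations, whereas batch statistics always produce normalized outputs for the architecture being evaluated.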
Comparison to the state of the art. Table VII shows that carefully controlling the relevant factors and adopting the techniques proposed in Section 4 allow us to considerably improve the performance of RandomNAS. Thanks to our evaluation, we were able to show that simple RandomNAS, together with an appropriate training protocol and mapping function, yields results that are competitive with, and sometimes even surpass, stateoftheart algorithms. Our results provide a strong baseline upon which future work can build.
We also report the best settings in Table XI.
Search Space  Implementation: Dynamic Conv  OFA Conv  WSBN  Dropout  Op map  Low fidelity: layer  portion  batchsize  channels  Hyperparam.: batchnorm  learning rate  epochs  weight decay  Sampling
NASBench101  Interpolation  N  N  0.  Node  9  0.75  256  128  Tr=F A=T  0.025  400  1e-3  FairNAS
NASBench201  Fix  N  N  0.  Edge  5  0.9  128  16  Tr=F A=T  0.025  1000  3e-3  FairNAS
DARTSNDS  Fix  N  Y  0.  Edge  12  0.9  256  36  Tr=F A=F  0.025  400  0  FairNAS
For batchnorm, we report Track statistics (Tr) and Affine (A) setting with True (T) or False (F).
For other notation, Y = Yes, N = No.
Acknowledgments
This work was partially done during an internship at Intel and at Abacus.AI, and supported in part by the Swiss National Science Foundation.
References
 [1] C. Liu, L.C. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. FeiFei, “AutoDeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation,” CVPR, 2019.
 [2] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “FBNet: HardwareAware Efficient ConvNet Design via Differentiable Neural Architecture Search,” CVPR, 2019.
 [3] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun, “DetNAS: Neural Architecture Search on Object Detection,” NeurIPS, 2019.
 [4] M. S. Ryoo, A. Piergiovanni, M. Tan, and A. Angelova, “Assemblenet: Searching for multistream neural connectivity in video architectures,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=SJgMK64Ywr
 [5] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” ICML, 2018.
 [6] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” ICLR, 2019.
 [7] H. Cai, L. Zhu, and S. Han, “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware,” in ICLR, 2019.
 [8] S. Xie, H. Zheng, C. Liu, and L. Lin, “SNAS: Stochastic neural architecture search,” ICLR, 2019.
 [9] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter, “Understanding and robustifying differentiable architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=H1gDNyrKDS
 [10] N. Nayman, A. Noy, T. Ridnik, I. Friedman, R. Jin, and L. Zelnik, “Xnas: Neural architecture search with expert advice,” in NeurIPS, 2019.
 [11] X. Li, C. Lin, C. Li, M. Sun, W. Wu, J. Yan, and W. Ouyang, “Improving oneshot NAS by suppressing the posterior fading,” CVPR, 2020.
 [12] C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter, “NASBench101: Towards reproducible neural architecture search,” ICLR, 2019.
 [13] X. Dong and Y. Yang, “NASBench201: Extending the scope of reproducible neural architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=HJxyZkBKDr
 [14] I. Radosavovic, J. Johnson, S. Xie, W.Y. Lo, and P. Dollár, “On Network Design Spaces for Visual Recognition,” in ICCV, 2019.
 [15] X. Chu, B. Zhang, R. Xu, and J. Li, “FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search,” arXiv:, 2019.
 [16] B. Zoph and Q. V. Le, “Neural Architecture Search with Reinforcement Learning,” ICLR, 2017.
 [17] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in CVPR, 2018.
 [18] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Largescale evolution of image classifiers,” ICML, 2017.
 [19] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing, 2019, pp. 293–312.
 [20] D. R. So, C. Liang, and Q. V. Le, “The evolved transformer,” in ICML, 2019.
 [21] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.J. Li, L. FeiFei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in ECCV, 2018.
 [22] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf, “NSGANET: A multiobjective genetic algorithm for neural architecture search,” arXiv:1810.03522, 2018.
 [23] Y. Xu, L. Xie, X. Zhang, X. Chen, G.J. Qi, Q. Tian, and H. Xiong, “PCDARTS: Partial channel connections for memoryefficient architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=BJlS634tPr
 [24] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, “Neural architecture search with bayesian optimisation and optimal transport,” in NeurIPS, 2018.
 [25] H. Jin, Q. Song, and X. Hu, “AutoKeras: An efficient neural architecture search system,” in International Conference on Knowledge Discovery & Data Mining, 2019.
 [26] H. Zhou, M. Yang, J. Wang, and W. Pan, “BayesNAS: A bayesian approach for neural architecture search,” in ICML, 2019.
 [27] L. Wang, S. Xie, T. Li, R. Fonseca, and Y. Tian, “Neural architecture search by learning action space for monte carlo tree search,” AAAI, 2020. [Online]. Available: https://openreview.net/forum?id=SklR6aEtwH
 [28] R. Luo, F. Tian, T. Qin, E.H. Chen, and T.Y. Liu, “Neural architecture optimization,” in NeurIPS, 2018.
 [29] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” AAAI, 2019.
 [30] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” UAI, 2019.
 [31] K. Yu, C. Sciuto, M. Jaggi, C. Musat, and M. Salzmann, “Evaluating the search phase of neural architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=H1loF2NFwr
 [32] A. Yang, P. M. Esperança, and F. M. Carlucci, “NAS evaluation is frustratingly hard,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=HygrdpVKvr
 [33] A. Zela, J. Siems, and F. Hutter, “NASBench1Shot1: Benchmarking and dissecting oneshot neural architecture search,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=SJx9ngStPH
 [34] J. Yu, P. Jin, H. Liu, G. Bender, P.J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, “BigNAS: Scaling Up Neural Architecture Search with Big SingleStage Models,” ECCV, 2020.
 [35] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once for all: Train one network and specialize it for efficient deployment,” in ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=HylxE1HKwS
 [36] G. Bender, H. Liu, B. Chen, G. Chu, S. Cheng, P.J. Kindermans, and Q. V. Le, “Can weight sharing outperform random architecture search? an investigation with tunas,” in CVPR, 2020.
 [37] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 55, pp. 1–21, 2019.
 [38] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in ICCV, 2019.
 [39] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single Path OneShot Neural Architecture Search with Uniform Sampling,” ECCV, 2019.
 [40] X. Dong and Y. Yang, “Searching for a robust neural architecture in four gpu hours,” in CVPR, 2019.
 [41] A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR10 (canadian institute for advanced research),” 2009.
 [42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, highperformance deep learning library,” in NeurIPS. Curran Associates, Inc., 2019, pp. 8026–8037.
 [43] Slurm, “Slurm workload manager,” 2020. [Online]. Available: https://slurm.schedmd.com/documentation.html
 [44] Kubernetes, “kubernetes.io,” 2020. [Online]. Available: https://kubernetes.io/docs/reference/
 [45] J. Yu and T. S. Huang, “Network slimming by slimmable networks: Towards oneshot architecture search for channel numbers,” ICLR, 2019.
 [46] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in NeurIPS, 2018.
 [47] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in CVPR, 2018.
 [48] X. Dong and Y. Yang, “Network pruning via transformable architecture search,” in Advances in Neural Information Processing Systems, 2019, pp. 760–771.
 [49] R. Luo, F. Tian, T. Qin, E.H. Chen, and T.Y. Liu, “Weight sharing batch normalization code,” 2018. [Online]. Available: https://www.github.com/renqianluo/NAO_pytorch/NAO_V2/operations.py##L144