1 Introduction
The success of deep learning is partially built upon the architecture of neural networks. However, varying the network architecture often changes performance unpredictably, which demands tremendous effort in ad hoc architecture design. Neural Architecture Search (NAS) is believed to be promising in alleviating this pain. Practitioners from industry would like to see NAS techniques that automatically discover task-specific networks with reasonable performance, regardless of their generalization capability. Therefore, NAS is usually formulated as a hyperparameter optimization problem, whose algorithmic realizations span evolutionary algorithms [21, 7], reinforcement learning [27], Bayesian optimization [9], Monte Carlo Tree Search [24], and differentiable architecture search [14, 26, 3]. Recently, these algorithmic frameworks have exhibited pragmatic success in various challenging tasks, e.g. semantic segmentation [12] and object detection [4].
However, even as an optimization problem, NAS remains vaguely defined. Most recently proposed NAS methods are implicitly two-stage methods. The two stages are searching and evaluation (or retraining). The architecture optimization process refers to the searching stage, in which a co-optimization scheme is designed for parameters and architectures; yet another round of parameter optimization runs in the evaluation stage, on the same set of training data for the same task. This to some extent contradicts the norm in machine learning that no optimization is allowed during evaluation. A seemingly sensible argument could be that the optimization result of NAS is only the architecture, and that evaluating an architecture means checking its performance after retraining. There is certainly no doubt that architectures that achieve high performance when retrained from scratch are reasonable choices for deployment. But is the search method still valid if the searched architecture does not perform well after retraining, due to the inevitable difference in training setup between searching and evaluation?
This question can only be answered under the assumption that the final searching performance generalizes to the evaluation stage even though the training schemes in the two stages differ. Specifically, the differences may include different numbers of cells, different batch sizes, and different numbers of epochs. Using parameter sharing for efficiency during search is another common cause. Unfortunately, this assumption is not valid. The correlation between the performance at the end of searching and after retraining in evaluation is fairly low whenever the parameter-sharing technique is used [20, 5].
We are thus motivated to rethink the problem definition of neural architecture search. We argue that, as an application-driven field, NAS can admit a diverse set of problem definitions, but none of them should be vague. In this work, we put our cards on the table: we aim to tackle the task-specific end-to-end NAS problem. Given a task, defined by a data set and an objective (e.g. training loss), the expected NAS solution optimizes architecture and parameters to automatically discover a neural network with reasonable (if not optimal in principle) performance. By the term end-to-end, we highlight that the solution needs only single-stage training to obtain a ready-to-deploy neural network for the given task. The term task-specific highlights the boundary of this solution: the searched neural network can only handle this specific task. We are not confident that this neural network generalizes well to other tasks. Rather, what can be expected to generalize is the NAS framework.
Under this definition, the evaluation metrics of a proposed framework become clear, namely searching efficiency and final performance. Scrutinizing most existing methods under these two metrics, we find a big niche for a brand new framework. On one side of the spectrum, gradient-based methods such as ENAS [17], DARTS [14] and ProxylessNAS [3] require two-stage parameter optimization. This is because the approximations that make them differentiable introduce unbounded bias or variance into their gradients. Two-stage methods always consume more computation than single-stage ones, not only because of the additional round of training but also because of the reproducibility issue [11]. On the other side of the spectrum, one-shot methods such as random search [11] and SPOS [7] can be extended to single-stage training. But since they do not optimize the architecture distribution during parameter training, the choice of prior distribution becomes crucial. A uniform sampling strategy may consume too many resources to reach satisfying accuracy. Lying in the middle, SNAS [26] shows a proof of concept, where the derived network maintains the performance of the searching stage. However, the Gumbel-softmax relaxation makes it necessary to store the whole parent network in memory in both the forward and backward passes, inducing tremendous memory and computation waste.
In this work, we confront the challenge of single-stage simultaneous optimization of architecture and parameters. Our proposal is an efficient differentiable NAS framework, Discrete Stochastic Neural Architecture Search (DSNAS). Once the search process finishes, the best-performing subnetwork is derived with optimized parameters, and no further retraining is needed. DSNAS is built upon a novel search gradient, combining the stability and robustness of differentiable NAS with the memory efficiency of discrete-sampling NAS. This search gradient is shown to be equivalent to SNAS’s gradient at the discrete limit, optimizing the task-specific end-to-end objective with little bias. It can be calculated in the same round of back-propagation as the gradients to neural parameters. Its forward pass and back-propagation only involve the compact subnetwork, whose computational complexity can be shown to be much friendlier than that of DARTS, SNAS and even ProxylessNAS, enabling large-scale direct search. We instantiate this framework in a single-path setting. The experimental results show that DSNAS discovers networks with comparable performance on the ImageNet classification task while reducing the total time to obtain a ready-to-deploy solution relative to two-stage NAS (Table 3).
To summarize, our main contributions are as follows:

We propose a well-defined neural architecture search problem, task-specific end-to-end NAS, under whose evaluation metrics most existing NAS methods still have room for improvement.

We propose a plug-and-play NAS framework, DSNAS, as an efficient solution to this problem at large scale. DSNAS updates architecture parameters with a novel search gradient, combining the advantages of the policy gradient and the SNAS gradient. A simple but smart implementation is also introduced.

We instantiate it in a single-path parent network. The empirical study shows that DSNAS robustly discovers neural networks with state-of-the-art performance on ImageNet, reducing the computational resources by a big margin over two-stage NAS methods. Code will be released once this paper is published.
2 Problem definition of NAS
2.1 Two-Stage NAS
Most existing NAS methods involve optimization in both the searching stage and the evaluation stage. In the searching stage, there must be parameter training and architecture optimization, even though they may not run simultaneously. The ideal way is to train all possible architectures from scratch and then select the optimal one. However, this is infeasible given the combinatorial complexity of architectures. Therefore, designing the co-occurrence of parameter and architecture optimization to improve efficiency is the main challenge of any general NAS problem. This challenge has not yet been overcome elegantly. The accuracy at the end of the searching stage has rarely been reported to be satisfying, and an ad hoc solution is to perform another round of parameter optimization in the evaluation stage.
Optimizing parameters in the evaluation stage is not normal in traditional machine learning. Normally, the provided data set is divided into a training set and a validation set. One learns in the training stage, with data from the training set. The learned model is then tested on the withheld validation set, where no further training is conducted. With the assumption that training data and validation data are from the same distribution, the learning problem is reduced to an optimization problem. One can hence be confident that models with high training accuracy will, if the assumption is correct, have high evaluation accuracy.
Allowing parameter retraining in the evaluation stage makes NAS a vaguely defined machine learning problem. Terming the problem Neural Architecture Search gives people a biased interpretation that only the architecture, and not the parameters, is the learning result. But if the searched architecture is the answer, what is the problem? Most NAS methods claim they efficiently discover the best-performing architecture in the designated space [3, 7, 9], but what specifically does best-performing mean? Given that retraining is conducted in the evaluation stage, one may naturally presume it is a meta-learning-like hyperparameter problem. Then the optimization result should exhibit some meta-level advantages, such as faster convergence, a better optimum or higher transferability. These are objectives that one is supposed to state clearly in a NAS proposal. Nonetheless, objectives are only implicitly conveyed (mostly a better optimum) in experiments.
Defining a problem precisely is one of the milestones in scientific research, whose direct gift in a machine learning task is a clear objective and evaluation metric. Subsequent efforts can then be devoted to validating whether the proposed learning loss approximates a necessary and sufficient equivalent of this objective. Unfortunately, under this criterion, most existing two-stage NAS methods are reported [20, 11] to fail to show a correlation between the searching accuracy and the retraining accuracy.
2.2 Task-specific end-to-end NAS
Seeing that the aforementioned dilemma lies in the ambiguity of evaluating an architecture alone, we propose a type of problem termed task-specific end-to-end NAS, the solution to which should provide a ready-to-deploy network with optimized architecture and parameters.
Task refers generally to any machine learning task (in this work we discuss computer vision tasks specifically). A well-defined task should at least have a set of data representing its functioning domain and a learning objective for its task-specific motive, e.g. classification or segmentation. The task is overwritten if either factor is modified, even by a trivial augmentation of the data. In other words, task-specific sets a boundary on what we can and cannot expect from the searched result. This can bring tremendous operational benefits to industrial applications.
End-to-end highlights that, given a task, the expected solution can provide a ready-to-deploy network with satisfying accuracy, the whole process of which can be regarded as a black-box module. Theoretically, it requires a direct confrontation with the main challenge of any general NAS problem, i.e. co-optimizing parameters and architecture efficiently. Empirically, task-specific end-to-end is the best description of NAS’s industrial application scenarios: i) the NAS method itself should be generalizable to any off-the-shelf task; and ii) when applied to a specific task, practitioners can at least have some conventional guarantees on the results. Basically, it reduces vaguely defined NAS problems to established tasks.
The evaluation metrics become clear under this problem definition. The performance of the final result is, in principle, the accuracy on this task. The efficiency should be calculated as the time from when the NAS solver starts taking data to when it outputs the neural network whose architecture and parameters are optimized. This efficiency metric is different from that of all existing works. For two-stage methods, the time for both searching and evaluation should be taken into account in this metric. Therefore, their efficiency may not be what they claim. Moreover, two-stage methods do not optimize the objective, i.e. higher accuracy of the final derived network, in an end-to-end manner.
3 Direct NAS without retraining
3.1 Stochastic Neural Architecture Search (SNAS)
In the literature, SNAS is one of the methods closest to a solution to the task-specific end-to-end NAS problem. Given any task with a differentiable loss, the SNAS framework directly optimizes the expected performance over architectures in terms of this task. In this subsection, we provide a brief introduction to SNAS.
Basically, SNAS is a differentiable NAS framework that maintains the generative nature of reinforcement-learning-based methods [27]. Exploiting the deterministic nature of the MDP of the network construction process, SNAS reformulates it as a Markov chain. This reformulation leads to a novel representation of the network construction process. As shown in Fig. 2, nodes (blue lumps) in the DAG represent feature maps. Edges (arrow lines) represent information flows between pairs of nodes, on which the possible operations (orange lumps) are attached. Different from DARTS, which avoids sampling subnetworks by using an attention mechanism, SNAS instantiates this DAG with a stochastic computational graph. A forward pass of a SNAS parent network first samples random variables and then multiplies them onto the edges in the DAG:
(1) 
One can thus obtain a Monte Carlo estimate of the expectation of the task objective over possible architectures:
(2) 
where the expectation is taken under the parameterized architecture distribution and the objective depends on the parameters of the neural operations. This is exactly the task-specific end-to-end NAS objective.
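As a notational sketch of the sampling step and the objective described above, one possible way to write them is given below. The symbols are assumptions chosen for illustration: $\mathbf{Z}_{i,j}$ for the one-hot random variable on edge $(i,j)$, $\mathbf{O}_{i,j}$ for the vector of candidate operations on that edge, $\alpha$ for the parameters of the architecture distribution, $\theta$ for the parameters of the neural operations, and $M$ for the number of sampled architectures.

```latex
% Sampling step: the one-hot variable selects one candidate operation per edge.
\tilde{O}_{i,j}(\cdot) = \mathbf{Z}_{i,j}^{T}\, \mathbf{O}_{i,j}(\cdot)

% Objective: expected task loss over architectures drawn from p_\alpha,
% estimated in practice by averaging over sampled architectures.
\mathbb{E}_{\mathbf{Z} \sim p_{\alpha}(\mathbf{Z})}\left[ L_{\theta}(\mathbf{Z}) \right]
\approx \frac{1}{M} \sum_{m=1}^{M} L_{\theta}\big(\mathbf{Z}^{(m)}\big),
\qquad \mathbf{Z}^{(m)} \sim p_{\alpha}(\mathbf{Z})
```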
To optimize parameters and architecture simultaneously with Eq. 2 (termed single-level optimization in [14]), SNAS relaxes the discrete one-hot random variable to a continuous random variable with the Gumbel-softmax trick. However, the continuous relaxation requires storing the whole parent network on the GPU, preventing it from being applied directly to large-scale networks. In Xie et al. [26], SNAS is still a two-stage method.
If the temperature in SNAS’s Gumbel-softmax trick could be pushed directly to zero, SNAS could be extended to large-scale networks trivially. However, this is not the case. Consider the search gradient given in Xie et al. [26]:
(3) 
One can see that the temperature cannot validly be set to zero in this search gradient. Xie et al. [26] only gradually annealed it to be close to zero. In this work, we seek an alternative way to differentiate Eq. 2, combining the efficiency of discrete sampling and the robustness of continuous differentiation. We start from SNAS’s credit assignment.
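The effect of the temperature can be illustrated with a small standalone sketch of Gumbel-softmax sampling. This is not the SNAS implementation: the function name, the three-way categorical distribution and the temperature values are assumptions made for illustration. The sketch shows that the relaxed sample stays dense at high temperature, approaches a one-hot vector as the temperature shrinks, and that a temperature of exactly zero is not admissible because it appears as a divisor.

```python
import math
import random


def gumbel_softmax_sample(log_probs, temperature, rng):
    """Draw one relaxed one-hot sample from a categorical distribution.

    log_probs: unnormalised log-probabilities of the candidate operations.
    temperature: positive softmax temperature (must not be zero).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    noisy = []
    for lp in log_probs:
        u = rng.random()
        while u == 0.0:  # avoid log(0) in the Gumbel transform
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((lp + gumbel) / temperature)
    # Numerically stable softmax over the perturbed scores.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]  # illustrative values
for temperature in (1.0, 0.1, 0.01):
    sample = gumbel_softmax_sample(log_probs, temperature, rng)
    print(temperature, [round(s, 3) for s in sample])
```

Because every entry of the relaxed sample is generally non-zero, every candidate operation has to be evaluated, which is the memory cost discussed above.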
3.2 Discrete SNAS (DSNAS)
In the original SNAS [26], to prove its efficiency over ENAS, a policy gradient equivalent of the search gradient is provided:
(4) 
where the random variable is the Gumbel-softmax random variable, and the cost term is treated as independent of the architecture parameters for gradient calculation. In other words, Eq. 4 and Eq. 3 both optimize the task-specific end-to-end NAS objective, i.e. Eq. 2.
In order to get rid of SNAS’s continuous relaxation, we push the temperature in the PG equivalent (Eq. 4) to the limit of zero, with the insight that only the reparameterization trick needs continuous relaxation, whereas the policy gradient does not. The expected search gradient for the architecture parameters at each edge becomes:
(5) 
where the sampled random variable is strictly one-hot, with one element per candidate operation, and the cost term is again treated as independent of the architecture parameters for gradient calculation. Line 3 is derived from line 2 following [16].
Exploiting the one-hot nature of the sampled variable, i.e. only the element of the operation selected on an edge is 1 and all others are 0, the cost function can be further reduced to
(6) 
as long as . Here is the output of the operation chosen at edge . The equality in line 3 is due to .
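The policy-gradient form used in this subsection can be sketched on a single edge. The code below is an illustrative toy, not the DSNAS implementation: it assumes the architecture distribution is a softmax over logits, and it uses hypothetical per-operation costs in place of the credit computed by back-propagation. It estimates the expectation of the log-probability gradient times the cost, with the cost treated as a constant.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, chosen):
    """Gradient of log p(chosen) w.r.t. the logits of a softmax categorical."""
    probs = softmax(logits)
    return [(1.0 if k == chosen else 0.0) - p for k, p in enumerate(probs)]


def search_gradient_estimate(logits, cost_fn, num_samples, rng):
    """Monte Carlo estimate of E[ grad log p(Z) * cost(Z) ] with one-hot Z."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        chosen = rng.choices(range(len(logits)), weights=probs)[0]
        cost = cost_fn(chosen)  # treated as a constant w.r.t. the logits
        for k, g in enumerate(grad_log_prob(logits, chosen)):
            grad[k] += g * cost / num_samples
    return grad


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
costs = [3.0, 1.0, 2.0]  # hypothetical per-operation costs
for _ in range(200):
    grad = search_gradient_estimate(logits, lambda k: costs[k], 32, rng)
    logits = [a - 0.5 * g for a, g in zip(logits, grad)]  # gradient descent
print([round(p, 2) for p in softmax(logits)])
```

Only the sampled operation is ever evaluated inside `cost_fn`, which is what allows the discrete formulation to avoid instantiating the whole parent network.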
3.3 Implementation
The algorithmic fruit of the mathematical manipulation in Eq. 6 is a parallel-friendly implementation of Discrete SNAS, as illustrated in Fig. 2. In SNAS, the network construction process is a forward pass of the stochastic computational graph, and the whole network has to be instantiated with the batch dimension. In DSNAS we offer an alternative implementation. Note that the cost function only needs to be calculated for the sampled subnetworks, and the same clearly holds for the gradients to the neural parameters. That is to say, the back-propagation of DSNAS only involves the sampled network, instead of the whole parent network. Thus we only instantiate the subnetwork with the batch dimension for the forward and backward passes. However, the subnetwork derived in this way does not necessarily contain the sampled random variable. Without line 3 of Eq. 6, the policy gradient loss would depend explicitly on the intermediate output of the selected operation, which may need an extra forward pass if it is not stored by the automatic differentiation infrastructure. With the manipulation in Eq. 6, one can simply multiply a dummy one onto the output of each selected operation and calculate the cost from the gradient with respect to this dummy. The whole algorithm is shown in Alg. 1.
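The idea of reading the cost off a dummy multiplier can be checked on a scalar toy example. The sketch below rests on stated assumptions: the dummy is a constant 1 multiplied onto the output of the selected operation, the loss is a squared error, and the candidate operations are simple hypothetical functions. In an automatic differentiation framework the derivative with respect to the dummy would come from back-propagation; here it is written out by hand and compared with a finite-difference estimate.

```python
def edge_forward(x, ops, chosen, dummy=1.0):
    """Output of one edge when only the sampled operation is instantiated."""
    return dummy * ops[chosen](x)


def loss_fn(y, target):
    return (y - target) ** 2


def cost_via_dummy(x, ops, chosen, target):
    """dL/d(dummy) evaluated at dummy = 1, written out by hand.

    By the chain rule dL/d(dummy) = dL/dy * op(x), i.e. the product of the
    gradient flowing into the edge output and the selected operation's output.
    """
    op_out = ops[chosen](x)        # computed once in the forward pass
    y = 1.0 * op_out               # edge output with the dummy applied
    dloss_dy = 2.0 * (y - target)  # analytic derivative of the squared loss
    return dloss_dy * op_out


def cost_via_finite_difference(x, ops, chosen, target, eps=1e-6):
    """Numerical check of the same derivative, perturbing only the dummy."""
    hi = loss_fn(edge_forward(x, ops, chosen, 1.0 + eps), target)
    lo = loss_fn(edge_forward(x, ops, chosen, 1.0 - eps), target)
    return (hi - lo) / (2.0 * eps)


ops = [lambda v: 2.0 * v, lambda v: v + 1.0, lambda v: v * v]  # toy candidates
print(cost_via_dummy(3.0, ops, 0, 5.0))
print(cost_via_finite_difference(3.0, ops, 0, 5.0))
```

The design point is that the gradient with respect to the dummy already contains the operation output, so no separate handle on that intermediate result is needed.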
3.4 Complexity analysis
In this subsection, we provide a complexity analysis of DSNAS, SNAS, and ProxylessNAS. Without loss of generality, we consider a parent network in which every layer has the same number of candidate choice blocks, and take as the unit of comparison the forward time, backward time and memory requirement of one round on a sampled subnetwork.
As the original SNAS instantiates the whole graph with the batch dimension, its GPU memory and computation are multiples of those of a subnetwork, growing with the number of candidate operations. The same holds for DARTS.
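To make the comparison concrete, the toy cost model below scales the cost of a sampled subnetwork by the number of candidate paths a method instantiates per edge. The scaling factors are assumptions for illustration only: a whole-graph method is taken to instantiate every candidate, a two-path sampling heuristic instantiates two, and discrete sampling instantiates one. The function names are hypothetical.

```python
def relative_cost(paths_per_edge, subnetwork_cost=1.0):
    """Cost of one training round, in units of a single sampled subnetwork."""
    if paths_per_edge < 1:
        raise ValueError("at least one path must be instantiated per edge")
    return paths_per_edge * subnetwork_cost


def compare_methods(num_candidates):
    """Relative cost per family of methods under the stated assumptions."""
    return {
        "whole parent network": relative_cost(num_candidates),
        "two sampled paths": relative_cost(2),
        "one sampled subnetwork": relative_cost(1),
    }


# Four candidates per choice block, as in the single-path search space used later.
print(compare_methods(num_candidates=4))
```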
Method | Forward time | Backward time | Memory |
Subnetwork | | | |
SNAS | | | |
ProxylessNAS* | | | |
ProxylessNAS | | | |
DSNAS | | | |
This memory consumption problem of differentiable NAS was first raised by [3], who proposed an approximation to DARTS’s optimization objective with the BinaryConnect [6] technique:
(7) 
where the attention-based estimator, as in DARTS [14], is written distinctly from the discrete random variable to highlight how the approximation is done. But this approximation does not directly save memory and computation. Different from Eq. 5 and Eq. 6, the calculation of Eq. 7 theoretically still involves the whole network, as indicated by the summation. To reduce memory consumption, they further proposed an empirical path sampling heuristic that decreases the number of paths involved. Table 1 shows the comparison.
3.5 Progressive early stop
One potential problem in sample-based differentiable NAS is that, empirically, the entropy of the architecture distribution does not converge to zero, even though, compared to attention-based NAS [14], sample-based methods are reported [26] to converge with smaller entropy. The non-zero entropy keeps the sampling going until the end, regardless of the fact that sampling at that uncertainty level does not bring significant gains. On the contrary, it may even hinder the learning on other edges.
To avoid this side effect of architecture sampling, DSNAS applies a progressive early stop strategy. Sampling and optimization stop at layers/edges in a progressive manner. Specifically, a threshold is set for the stopping condition:
(8) 
Once this condition is satisfied on any edge/layer, we directly select the operation with the highest probability there, and stop sampling and updating architecture parameters on it for the rest of training.
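The bookkeeping behind such a strategy can be sketched as follows. The stopping criterion used here is an assumption made for illustration, namely that the probability margin between the best and second-best operation on an edge reaches a threshold; it is not claimed to be the condition of Eq. 8. The function names and the example probabilities are hypothetical.

```python
import random


def update_frozen_edges(edge_probs, frozen, threshold):
    """Freeze every edge whose distribution is confident enough.

    edge_probs: list of per-edge probability lists over candidate operations.
    frozen: dict mapping edge index -> chosen operation index (updated in place).
    threshold: assumed margin between the best and second-best operation.
    """
    for edge, probs in enumerate(edge_probs):
        if edge in frozen:
            continue  # already stopped: no more sampling or updates here
        ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if probs[best] - probs[runner_up] >= threshold:
            frozen[edge] = best  # select the highest-probability operation
    return frozen


def sample_architecture(edge_probs, frozen, rng):
    """Sample one operation per edge, reusing the fixed choice on frozen edges."""
    choice = []
    for edge, probs in enumerate(edge_probs):
        if edge in frozen:
            choice.append(frozen[edge])
        else:
            choice.append(rng.choices(range(len(probs)), weights=probs)[0])
    return choice


edge_probs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25]]  # illustrative values
frozen = update_frozen_edges(edge_probs, {}, threshold=0.5)
print(frozen)  # edge 0 is frozen on operation 0; edge 1 keeps sampling
print(sample_architecture(edge_probs, frozen, random.Random(0)))
```

In a training loop, the architecture parameters of frozen edges would simply be excluded from further updates.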
3.6 Comparison with one-shot NAS
Different from all differentiable NAS methods, one-shot NAS methods optimize the architecture only once, in one shot, before which they obtain a rough estimation of the graph through either pretraining [1, 7] or an auxiliary hypernetwork [2]. All of them are two-stage methods. The advantage of DSNAS is that it optimizes the architecture alongside the parameters, which is expected to save some of the resources spent in the pretraining stage. Intuitively, DSNAS rules out non-promising architectures adaptively by directly optimizing the objective in an end-to-end manner. Although one-shot methods can also have an end-to-end realization by investing more resources in pretraining, it may take them more epochs to achieve performance comparable to DSNAS. They can also do fine-tuning, but the parameters of the optimal networks are still updated less frequently than in DSNAS. One can expect better performance from DSNAS given equivalent training epochs.
4 Experimental Results
In this section, we first demonstrate why the proposed task-specific end-to-end NAS is an open problem, by investigating the performance correlation between the searching stage and the evaluation stage of two-stage NAS. We then validate the effectiveness and efficiency of DSNAS under the proposed task-specific end-to-end metric on the same search space as SPOS [7]. We further provide a breakup of time consumption to illustrate the computational efficiency of DSNAS.
4.1 Accuracy correlation of two-stage NAS
Since the validity of the search in two-stage NAS relies on a high correlation between the performance in the searching stage and the evaluation stage, we check this assumption with a ranking correlation measure, the Kendall Tau metric [10]:
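$$\tau = \frac{N_{\text{concordant}} - N_{\text{disconcordant}}}{N_{\text{total}}} \qquad (9)$$
where $N_{\text{total}}$ is the total number of pairs ($x_i$, $y_i$) from the searching stage and evaluation stage, consisting of $N_{\text{concordant}}$ concordant ranking pairs ($x_1 > x_2, y_1 > y_2$ or $x_1 < x_2, y_1 < y_2$) and $N_{\text{disconcordant}}$ disconcordant ranking pairs ($x_1 > x_2, y_1 < y_2$ or $x_1 < x_2, y_1 > y_2$). The Kendall Tau metric ranges from -1 to 1, which means the ranking order changes from reversed to identical. $\tau$ being close to 0 indicates the absence of correlation.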
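The metric can be computed directly from its definition by counting concordant and disconcordant pairs. The sketch below is a plain-Python illustration; the function name is hypothetical and the accuracy values in the usage example are made up, not results from this paper. Ties are counted as neither concordant nor disconcordant, which is one of several possible conventions.

```python
from itertools import combinations


def kendall_tau(search_scores, eval_scores):
    """Kendall Tau rank correlation between two paired score lists.

    Tied comparisons count as neither concordant nor disconcordant, but
    still contribute to the total number of comparisons.
    """
    if len(search_scores) != len(eval_scores):
        raise ValueError("score lists must have equal length")
    if len(search_scores) < 2:
        raise ValueError("need at least two architectures")
    concordant = disconcordant = total = 0
    for a, b in combinations(range(len(search_scores)), 2):
        total += 1
        product = (search_scores[a] - search_scores[b]) * (eval_scores[a] - eval_scores[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            disconcordant += 1
    return (concordant - disconcordant) / total


# Made-up searching-stage and evaluation-stage accuracies of four architectures.
print(kendall_tau([0.61, 0.64, 0.66, 0.70], [0.72, 0.74, 0.73, 0.75]))
```

For larger studies, a library routine that also handles tie corrections may be preferable to this quadratic loop.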
We measure the ranking correlation by calculating the Kendall Tau metric from two perspectives: (1) the metric is calculated based on the top-k model performance of the searching and evaluation stages in one single searching process; (2) the metric is calculated by running the two-stage NAS methods several times with different random seeds, using the top-1 model performance in each searching process. As shown in Table 2, the performance correlation between the searching stage and the evaluation stage in both SPOS and ProxylessNAS is fairly low. This indicates the necessity of the task-specific end-to-end NAS problem formulation. Fairly low correlation may also imply reproducibility problems.
Model | | |
Single Path One-Shot [7] | 0.33 | 0.07 |
ProxylessNAS [3] | | 0.33 |
4.2 Single-path architecture search
Motivation To compare the efficiency and accuracy of networks derived by DSNAS with those of existing two-stage methods, we conduct experiments in a single-path setting. Results are compared under the task-specific end-to-end metrics.
Dataset All our experiments are conducted in a mobile setting on the ImageNet Classification task [18] with a resource constraint . This dataset consists of around training images and validation images. Data transformation is achieved by the standard preprocessing techniques described in the supplementary material.
Search Space The basic building block design is inspired by ShuffleNet v2 [15]. There are 4 candidates for each choice block in the parent network, i.e., choice_3, choice_5, choice_7, and choice_x. These candidates differ in the kernel size and the number of depthwise convolutions, spanning a search space with single path models. The overall architecture of the parent network and building blocks are shown in the supplementary material.
Training Settings We follow the same setting as SPOS [7] except that we do not have an evaluation stage in our searching process. We adopt an SGD optimizer with a momentum of 0.9 [22] to update the weight parameters of the parent network. A cosine learning rate scheduler with an initial learning rate of 0.5 is applied. Moreover, L2 weight decay is used to regularize the training process. The architecture parameters are updated using the Adam optimizer with an initial learning rate of 0.001. All our experiments are done on 8 NVIDIA TITAN X GPUs.
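For readers unfamiliar with the schedule, a cosine learning rate scheduler can be sketched in a few lines. The version below is a common textbook form and rests on assumptions not stated in the text: the rate is annealed per epoch, it decays to zero, and there is no warm-up. Only the initial value of 0.5 is taken from the settings above; the 240-epoch horizon in the usage example is chosen for illustration.

```python
import math


def cosine_learning_rate(step, total_steps, initial_lr=0.5, final_lr=0.0):
    """Cosine-annealed learning rate at a given step (0 <= step <= total_steps)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 60, 120, 180, 240):
    print(epoch, round(cosine_learning_rate(epoch, 240), 4))
```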
Searching Process To demonstrate the efficiency of DSNAS, we compare the whole process needed to accomplish task-specific end-to-end NAS on ImageNet classification with that of two-stage NAS methods. Among all existing two-stage methods, SPOS [7] is the one with state-of-the-art accuracy and efficiency. SPOS only does one-shot architecture search after pretraining the uniformly sampled parent network. Its very different architecture search mechanism also makes it a good baseline for DSNAS’s ablation study.
Figure 3 shows DSNAS’s advantage over several different configurations of SPOS. We purposefully present curves in terms of both epoch number and time to illustrate that even though DSNAS updates the architecture on a per-iteration basis, almost no extra computation time is introduced. Among the four configurations of SPOS, SPOS-search120-retrain240 is the original one as in Guo et al. [7], using the two-stage paradigm. DSNAS achieves comparable accuracy in an end-to-end manner, with less computational resources. As SPOS-search120-retrain240 updates block parameters for only 120 epochs (the same learning rate scheduler is used in DSNAS and SPOS), we run the SPOS-search120-tune120 and SPOS-search240-retrain240 configurations for a fair comparison. At the end of the 240th epoch, the accuracy of both SPOS models is lower than DSNAS’s.
In addition, for the ablation study of DSNAS’s progressive early stop strategy, we call the EA algorithm of SPOS at the 120th epoch in the one-stage DSNAS-search240 configuration. As parameter training continues, the selected models experience a leap in accuracy and converge with accuracy lower than DSNAS’s. However, seeking this EA point is fairly ad hoc and prone to random noise.
Searching Results The experimental results are reported in Table 3. Compared with all existing two-stage NAS methods, DSNAS shows comparable performance using at least 1/3 less computational resources. More importantly, the standard deviation of DSNAS’s accuracy is lower than those of both the searching and evaluation stages of EA-based SPOS (0.22 vs 0.38/0.36). This shows that, as a differentiable NAS framework, DSNAS is a more robust method under the task-specific end-to-end metric.
Model | FLOPS | Search: Top-1 acc (%) | Search: Top-5 acc (%) | Retrain: Top-1 acc (%) | Retrain: Top-5 acc (%) | | Time (GPU hour): Search | Time (GPU hour): Retrain |
MobileNet V1 (0.75x)[8]  325M  Manual  68.4    Manual  
MobileNet V2 (1.0x)[19]  300M  Manual  72.0  91.00  Manual  
ShuffleNet V2 (1.5x)[15]  299M  Manual  72.6  90.60  Manual  
NASNETA(4@1056)[28]  564M      74.0  91.60  False  48000    
PNASNET[13]  588M      74.2  91.90  False  5400    
MnasNetA1[23]  312M      75.2  92.50  False    
DARTS[14]  574M      73.3  91.30  False  24  288  
SNAS[26]  522M      72.7  90.80  False  36  288  
ProxylessR (mobile)[3]  320M      74.6  92.20  True  384  
Single Path OneShot[7]  319M    74.3    True  250  384  
Single Path OneShot*  323M  68.2  88.28  74.3  91.79  True  250  384  
Random Search  330M  68.2  88.31  73.9  91.8  True  250  384  
DSNAS  324M  74.4  91.54  74.3  91.90  True  420 
4.3 Time consumption breakup
In the last subsection, we show that DSNAS can achieve comparable performance under the task-specific end-to-end metric with much less computation than one-shot NAS methods. In this subsection, we further break up the time consumption of DSNAS into several specific parts, i.e. forward, backward, optimization and test (to clarify, we also evaluate on the testing set; what we do not do is retrain parameters), and conduct a controlled comparison with other differentiable NAS methods. We also hope such a detailed breakup can help readers gain insight into further optimizing our implementation.
We first compare the computation time of SNAS and DSNAS on the CIFAR-10 dataset. The average time of each part (calculated for one batch on one NVIDIA TITAN X GPU with the same setting, batch size 64, as in [26]) is shown in Table 4. Under the same setting, DSNAS is almost five times faster than SNAS and consumes only a fraction of SNAS’s GPU memory, a fraction that shrinks with the total number of candidate operations on each edge.
Method | Train: Forward | Train: Backward | Train: Opt | Test |
SNAS | 0.26s | 0.46s | 0.14s | 0.18s |
DSNAS | 0.05s | 0.07s | 0.13s | 0.04s |
We further compare the average time of each part between DSNAS and ProxylessNAS in a mobile setting on the ImageNet classification task. As shown in Table 5, the average time is calculated on the same search space as ProxylessNAS [3] with a total batch size of 512 (since ProxylessNAS takes 2 times the GPU memory of DSNAS, as shown in Table 1, we use 8 TITAN X GPUs for ProxylessNAS and 4 for DSNAS to calculate the time). In this fair comparison, DSNAS is roughly twice as fast as ProxylessNAS.
Method | Train: Forward | Train: Backward | Train: Opt | Test |
ProxylessNAS | 3.3s | 2.3s | 3.6s | 1.2s |
DSNAS | 1.9s | 1.3s | 2.6s | 0.9s |
5 Summary and future work
In this work, we first define a task-specific end-to-end NAS problem, which reduces the evaluation of the vaguely defined NAS problem to that of traditional machine learning tasks. Under its evaluation metrics, we scrutinize the efficiency of two-stage NAS methods, based on the observation that the performance in the searching and evaluation stages correlates poorly. We then propose an efficient differentiable NAS framework, DSNAS, which optimizes architecture and parameters in the same round of back-propagation. Since a low-biased approximation of the task-specific end-to-end NAS objective is optimized, subnetworks derived from DSNAS are ready-to-deploy. Based upon a simple but smart mathematical manipulation, the implementation of DSNAS tremendously reduces the computation requirement relative to its continuous counterpart SNAS. Compared with two-stage NAS methods, DSNAS discovers networks with on-par performance through one-stage optimization, reducing the total computation by a significant margin.
As a plug-and-play NAS solution, DSNAS exhibits its efficiency and effectiveness in the single-path setting. This is orthogonal to the random wiring solution, which focuses on graph topology search [25]. We look forward to their combination for a joint search of topology, operations, and parameters.
References
[1] (2018) Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558.
[2] (2017) SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344.
[3] (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
[4] (2019) Detnas: neural architecture search on object detection. arXiv preprint arXiv:1903.10979.
[5] (2019) Fairnas: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845.
[6] (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131.
[7] (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
[8] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[9] (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025.
[10] (1938) A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93.
[11] (2019) Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638.
[12] (2019) Autodeeplab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92.
[13] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
[14] (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055.
[15] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131.
[16] (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
[17] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
[18] (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
[19] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
[20] (2019) Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142.
[21] (2002) Evolving neural networks through augmenting topologies. Evolutionary computation 10 (2), pp. 99–127.
[22] (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147.
[23] (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828.
[24] (2017) Finding competitive network architectures within a day using uct. arXiv preprint arXiv:1712.07420.
[25] (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569.
[26] (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926.
[27] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[28] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710.
Appendix A Detailed Settings of Experimental Results
Data Preprocessing We employ the commonly used preprocessing techniques in our experiments: A 224x224 crop is randomly sampled from an image or its horizontal flip, with a normalization on the pixel values per channel.
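The preprocessing described above can be sketched on a nested-list image without any external library. This is an illustrative toy rather than the pipeline used in the experiments: the function name is hypothetical, the flip probability of 0.5 is an assumption, and the mean and standard deviation passed in the usage example are placeholders. Cropping first and flipping the crop afterwards yields the same kind of sample as cropping from the image or its horizontal flip.

```python
import random


def random_crop_flip_normalize(image, crop, mean, std, rng):
    """image: H x W x C nested lists of floats; returns a crop x crop x C sample."""
    height, width = len(image), len(image[0])
    if crop > height or crop > width:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, height - crop)    # inclusive bounds
    left = rng.randint(0, width - crop)
    flip = rng.random() < 0.5              # assumed flip probability
    out = []
    for r in range(top, top + crop):
        row = image[r][left:left + crop]
        if flip:
            row = row[::-1]  # horizontal flip of the cropped row
        # Per-channel normalization of every pixel in the row.
        out.append([[(px[c] - mean[c]) / std[c] for c in range(len(px))] for px in row])
    return out


# A tiny 4 x 4 single-channel image stands in for a real photo here.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
sample = random_crop_flip_normalize(image, 2, mean=[0.0], std=[1.0], rng=random.Random(0))
print(sample)
```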
Appendix B Details about the architectures
Structures of choice blocks (we follow the setting, including the choice blocks, used in the released implementation of SPOS [7]).
Supernet architecture
Input  Block  Channels  Repeat  Stride 

Conv  16  1  2  
CB  64  4  2  
CB  160  4  2  
CB  320  8  2  
CB  640  4  2  
Conv  1024  1  1  
GAP    1    
1024  FC  1000  1   
Structures of searched architectures