1 Introduction
Our community has made substantial progress toward designing better convolutional neural network architectures for visual recognition tasks over the past several years. This overall research endeavor is analogous to a form of stochastic gradient descent where every new proposed model architecture is a noisy gradient step traversing the infinite-dimensional landscape of possible neural network designs. The overall objective of this optimization is to find network architectures that are easy to optimize, make reasonable tradeoffs between speed and accuracy, generalize to many tasks and datasets, and overall withstand the test of time. To make continued steady progress toward this goal, we must use the right loss function to guide our search: in other words, a research methodology for comparing network architectures that can reliably tell us whether newly proposed models are truly better than what has come before.
One promising avenue lies in developing better theoretical understanding of neural networks to guide the development of novel network architectures. However, we need not wait for a general theory of neural networks to emerge to make continued progress. Classical statistics provides tools for drawing informed conclusions from empirical studies, even in the absence of a generalized theory governing the subject at hand. We believe that making use of such statistically grounded scientific methodologies in deep learning research may facilitate our future progress.
Overall, there has already been a general trend toward better empiricism in the literature on network architecture design. In the simplest and earliest methodology in this area (Figure 1a), progress was marked by simple point estimates: an architecture was deemed superior if it achieved lower error on a benchmark dataset [15, 19, 37, 32], often irrespective of model complexity.
An improved methodology adopted in more recent work compares curve estimates (Figure 1b) that explore design tradeoffs of network architectures by instantiating a handful of models from a loosely defined model family and tracing curves of error vs. model complexity [36, 11, 41]. A model family is then considered superior if it achieves lower error at every point along such a curve. Note, however, that in this methodology other confounding factors may vary between model families or may be suboptimal for one of them.
Comparing model families while varying a single degree of freedom to generate curve estimates hints at a more general methodology. For a given model family, rather than varying a single network hyperparameter (e.g., network depth) while keeping all others fixed (e.g., stagewise width, groups), what if instead we vary all relevant network hyperparameters? While in principle this would remove confounding factors that may affect conclusions about a model family, it would also yield a vast (often infinite) number of possible models. Are comparisons of model families under such unconstrained conditions even feasible?
To move toward such more robust settings, we introduce a new comparison paradigm: that of distribution estimates (Figure 1c). Rather than comparing a few selected members of a model family, we instead sample models from a design space parameterizing possible architectures, giving rise to distributions of error rates and model complexities. We then compare network design spaces by applying statistical techniques to these distributions, while controlling for confounding factors like network complexity. This paints a more complete and unbiased picture of the design landscape than is possible with point or curve estimates.
To validate our proposed methodology we perform a large-scale empirical study, training over 100,000 models spanning multiple model families including VGG [32], ResNet [8], and ResNeXt [36] on CIFAR [14]. This large set of trained models allows us to perform simulations of distribution estimates and draw robust conclusions about our methodology. In practice, however, we show that sampling between 100 and 1000 models from a given model family is sufficient to perform robust estimates. We further validate our estimates by performing a study on ImageNet [4]. This makes the proposed methodology feasible under typical settings and thus a practical tool that can be used to aid in the discovery of novel network architectures.
As a case study of our methodology, we examine the network design spaces used by several recent methods for neural architecture search (NAS) [41, 30, 20, 29, 21]. Surprisingly, we find that there are significant differences between the design spaces used by different NAS methods, and we hypothesize that these differences may explain some of the performance improvements between these methods. Furthermore, we demonstrate that design spaces for standard model families such as ResNeXt [36] can be comparable to the more complex ones used in recent NAS methods.
We note that our work complements NAS. Whereas NAS is focused on finding the single best model in a given model family, our work focuses on characterizing the model family itself. In other words, our methodology may enable research into designing the design space for model search.
We will release the code, baselines, and statistics for all tested models so that proposed future model architectures can compare against the design spaces we consider.
2 Related Work
Reproducible research.
There has been an encouraging recent trend toward better reproducibility in machine learning [26, 22, 9]. For example, Henderson et al. [9] examine recent research in reinforcement learning (RL) and propose guidelines to improve reproducibility and thus enable continued progress in the field. Likewise, we share the goal of introducing a more robust methodology for evaluating model architectures in the domain of visual recognition.
Empirical studies.
In the absence of rigorous theoretical understanding of deep networks, it is imperative to perform large-scale empirical studies of deep networks to aid development [6, 3, 28]. For example, in natural language processing, recent large-scale studies [26, 27] demonstrate that when design spaces are well explored, the original LSTM [10] can outperform more recent models on language modeling benchmarks. These results suggest the crucial role empirical studies and robust methodology play in enabling progress toward discovering better architectures.
Hyperparameter search.
General hyperparameter search techniques [2, 33] address the laborious model tuning process in machine learning. A possible approach for comparing networks from two different model families is to first tune their hyperparameters [16]. However, such comparisons can be challenging in practice. Instead, [1] advocates using random search as a strong baseline for hyperparameter search and suggests that it additionally helps improve reproducibility. In our work we propose to directly compare the full model distributions (not just their minima).
Neural architecture search.
Recently, NAS has proven effective for learning network architectures [40]. A NAS instantiation has two components: a network design space and a search algorithm over that space. Most work on NAS focuses on the search algorithm, and various search strategies have been studied, including RL [40, 41, 29], heuristic search [20], gradient-based search [21, 23], and evolutionary algorithms [30]. Instead, in our work, we focus on characterizing the model design space. As a case study we analyze recent NAS design spaces [41, 30, 20, 29, 21] and find significant differences that have been largely overlooked.
Complexity measures.
In this work we focus on analyzing network design spaces while controlling for confounding factors like network complexity. While statistical learning theory [31, 35] has introduced theoretical notions of complexity of machine learning models, these are often not predictive of neural network behavior [38, 39]. Instead, we adopt commonly-used network complexity measures, including the number of model parameters or multiply-add operations [7, 36, 11, 41]. Other measures, e.g., wall-clock speed [12], can easily be integrated into our paradigm.
3 Design Spaces
We begin by describing the core concepts defining a design space in §3.1 and give more details about the actual design spaces used in our experiments in §3.2.
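Table 1 (left): network structure.
stage   | operation     | output
stem    | 3×3 conv      | 32×32×16
stage 1 | {block}×$d_1$ | 32×32×$w_1$
stage 2 | {block}×$d_2$ | 16×16×$w_2$
stage 3 | {block}×$d_3$ | 8×8×$w_3$
head    | pool + fc     | 1×1×10
Table 1 (right): baseline ResNet models.
             | R-56 | R-110
flops (B)    | 0.13 | 0.26
params (M)   | 0.86 | 1.73
error [8]    | 6.97 | 6.61
error [ours] | 6.22 | 5.91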
3.1 Definitions
I. Model family.
A model family is a large (possibly infinite) collection of related neural network architectures, typically sharing some high-level architectural structures or design principles (e.g., residual connections). Example model families include standard feedforward networks such as ResNets [8] or the NAS model family from [40, 41].
II. Design space.
Performing empirical studies on model families is difficult since they are broadly defined and typically not fully specified. As such we make a distinction between abstract model families, and a design space which is a concrete set of architectures that can be instantiated from the model family. A design space consists of two components: a parametrization of a model family such that specifying a set of model hyperparameters fully defines a network instantiation and a set of allowable values for each hyperparameter. For example, a design space for the ResNet model family could include a parameter controlling network depth and a limit on its maximum allowable value.
III. Model distribution.
To perform empirical studies of design spaces, we must instantiate and evaluate a set of network architectures. As a design space can contain an exponential number of candidate models, exhaustive evaluation is not feasible. Therefore, we sample and evaluate a fixed set of models from a design space, giving rise to a model distribution, and turn to tools from classical statistics for analysis. Any standard distribution, as well as learned distributions like in NAS, can be integrated into our paradigm.
IV. Data generation.
To analyze network design spaces, we sample and evaluate numerous models from each design space. In doing so, we effectively generate datasets of trained models upon which we perform empirical studies.
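Table 2: allowable hyperparameters for each design space.
          | depth  | width     | ratio | groups | total
Vanilla   | 1,24,9 | 16,256,12 |       |        | 1,259,712
ResNet    | 1,24,9 | 16,256,12 |       |        | 1,259,712
ResNeXt-A | 1,16,5 | 16,256,5  | 1,4,3 | 1,4,3  | 11,390,625
ResNeXt-B | 1,16,5 | 64,1024,5 | 1,4,3 | 1,16,5 | 52,734,375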
3.2 Instantiations
We provide a precise description of the design spaces used in the analysis of our methodology. We introduce additional design spaces for NAS model families in §5.
I. Model family.
II. Design space.
Following [8], we use networks consisting of a stem, followed by three stages, and a head, see Table 1 (left). Each stage consists of a sequence of blocks. For our ResNet design space, a single block consists of two convolutions (all convs are 3×3 and are followed by Batch Norm [13] and ReLU) and a residual connection. Our Vanilla design space uses an identical block structure but without residuals. Finally, in case of the ResNeXt design spaces, we use bottleneck blocks with groups [36]. Table 1 (right) shows some baseline ResNet models for reference (for details of the training setup see appendix). To complete the design space definitions, in Table 2 we specify the set of allowable hyperparameters for each. Note that we consider two variants of the ResNeXt design spaces with different hyperparameter sets: ResNeXt-A and ResNeXt-B.
III. Model distribution.
We generate model distributions by uniformly sampling hyperparameters from the allowable values for each design space (as specified in Table 2).
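As a rough illustration of uniform sampling from these design spaces, the sketch below copies the entries of Table 2 and checks the "total" column. It rests on two assumptions that are not stated in this section: the third number of each triple is read as the count of allowable values, and each of the three stages picks its hyperparameters independently. The names `DESIGN_SPACES`, `space_size`, and `sample_model` are hypothetical, and the sampler returns indices into each value list rather than concrete values, since the exact value grids are not given here.

```python
import math
import random

# Entries copied from Table 2 as triples per hyperparameter.
# Assumption: the third number is the count of allowable values per stage.
DESIGN_SPACES = {
    "Vanilla": {"depth": (1, 24, 9), "width": (16, 256, 12)},
    "ResNet": {"depth": (1, 24, 9), "width": (16, 256, 12)},
    "ResNeXt-A": {"depth": (1, 16, 5), "width": (16, 256, 5),
                  "ratio": (1, 4, 3), "groups": (1, 4, 3)},
    "ResNeXt-B": {"depth": (1, 16, 5), "width": (64, 1024, 5),
                  "ratio": (1, 4, 3), "groups": (1, 16, 5)},
}
NUM_STAGES = 3  # the three stages of Table 1 (left)


def space_size(space):
    """Number of models: per-stage choices raised to the number of stages."""
    per_stage = math.prod(count for (_, _, count) in space.values())
    return per_stage ** NUM_STAGES


def sample_model(space, rng):
    """For every stage, uniformly pick an index into each hyperparameter's value list."""
    return [
        {name: rng.randrange(count) for name, (_, _, count) in space.items()}
        for _ in range(NUM_STAGES)
    ]


rng = random.Random(0)
for name, space in DESIGN_SPACES.items():
    print(name, space_size(space), sample_model(space, rng))
```

Under these assumptions the computed sizes match the totals listed in Table 2, which suggests the reading of the triples is at least consistent with the table.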
IV. Data generation.
Our main experiments use image classification models trained on CIFAR-10 [14]. This setting enables large-scale analysis and is often used as a testbed for recognition networks, including for NAS. While we find that sparsely sampling models from a given design space is sufficient to obtain robust estimates, we perform much denser sampling to evaluate our methodology. We sample and train 25k models from each of the design spaces from Table 2, for a total of 100k models. To reduce computational load, we consider models for which the flops (following common practice, we use flops to mean multiply-adds) or parameters are below the ResNet-56 values (Table 1, right).
4 Proposed Methodology
In this section we introduce and evaluate our methodology for comparing design spaces. Throughout this section we use the design spaces introduced in §3.2.
4.1 Comparing Distributions
When developing a new network architecture, human experts employ a combination of grid and manual search to evaluate models from a design space, and select the model achieving the lowest error (as described in [16]). The final model is a point estimate of the design space. As a community we commonly use such point estimates to draw conclusions about which methods are superior to others.
Unfortunately, comparing design spaces via point estimates can be misleading. We illustrate this using a simple example: we consider comparing two sets of models of different sizes sampled from the same design space.
Point estimates.
As a proxy for human-derived point estimates, we use random search [1]. We generate a baseline model set (B) by uniformly sampling 100 architectures from our ResNet design space (see Table 2). To generate the second model set (M), we instead use 1000 samples. In practice, the difference in number of samples could arise from more effort in the development of M over the baseline, or simply access to more computational resources for generating M. Such imbalanced comparisons are common in practice.
After training, M’s minimum error is lower than B’s minimum error. Since the best error is lower, a naive comparison of point estimates concludes that M is superior. Repeating this experiment yields the same result: Figure 2 (left) plots the difference in the minimum error of B and M over multiple trials (simulated by repeatedly sampling B and M from our pool of 25k pretrained models). In 90% of cases M has a lower minimum than B, often by a large margin. Yet clearly B and M were drawn from the same design space, so this analysis based on point estimates can be misleading.
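The 90% figure can be sanity-checked with a short simulation. If B and M are disjoint draws from the same pool and ties are ignored, the overall best of the 1100 drawn models is equally likely to be any one of them, so it lands in M with probability 1000/1100 ≈ 0.91. The sketch below uses a synthetic log-normal pool as a stand-in for the trained-model errors (it is not the authors' data), and the function name `min_error_win_rate` and the disjoint-draw setup are assumptions made for illustration.

```python
import random


def min_error_win_rate(pool, size_b=100, size_m=1000, trials=2000, seed=0):
    """Fraction of trials in which the larger set M has a lower minimum than B."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Assumption: B and M are disjoint subsets drawn from the same pool.
        drawn = rng.sample(pool, size_b + size_m)
        b, m = drawn[:size_b], drawn[size_b:]
        wins += min(m) < min(b)
    return wins / trials


# Synthetic stand-in for a pool of 25k model errors (not real results).
pool_rng = random.Random(1)
pool = [pool_rng.lognormvariate(2.0, 0.3) for _ in range(25_000)]
rate = min_error_win_rate(pool)
print(f"M has the lower minimum in {rate:.1%} of trials")
```

The win rate depends only on the two sample sizes, not on the shape of the error distribution, which is why a point-estimate comparison says little about the design space itself.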
Distributions.
In this work we make the case that one can draw more robust conclusions by directly comparing distributions rather than point estimates such as minimum error.
To compare distributions, we use empirical distribution functions (EDFs). Let $\mathbf{1}$ be the indicator function. Given a set of $n$ models with errors $\{e_i\}$, the error EDF is given by:
$$F(e) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[e_i < e] \qquad (1)$$
$F(e)$ gives the fraction of models with error less than $e$.
We revisit our B vs. M comparison in Figure 2 (right), but this time plotting the full error distributions instead of just their minimums. Their shape is typical of error EDFs: the small tail to the bottom left indicates a small population of models with low error and the long tail on the upper right shows there are few models with error over 10%.
Qualitatively, there is little visible difference between the error EDFs for B and M, suggesting that these two sets of models were drawn from an identical design space. We can make this comparison quantitative using the (two-sample) Kolmogorov-Smirnov (KS) test [25], a nonparametric statistical test for the null hypothesis that two samples were drawn from the same distribution. Given EDFs $F_1$ and $F_2$, the test computes the KS statistic $D$, defined as:
$$D = \sup_{x} \left| F_1(x) - F_2(x) \right| \qquad (2)$$
$D$ measures the maximum vertical discrepancy between EDFs (see the zoomed in panel in Figure 2); small values suggest that $F_1$ and $F_2$ are sampled from the same distribution. For our example, the KS test gives a $p$-value of 0.60, so we fail to reject the null hypothesis that B and M follow the same distribution.
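A minimal sketch of the error EDF and the two-sample KS statistic follows, using only the Python standard library. The function names are hypothetical. Because EDFs are step functions, it is enough to evaluate the gap at the observed error values.

```python
from bisect import bisect_left


def edf(errors, e):
    """Error EDF: fraction of models with error strictly less than e."""
    return sum(1 for err in errors if err < e) / len(errors)


def ks_statistic(errors_1, errors_2):
    """KS statistic D: largest vertical gap between the two error EDFs."""
    x1, x2 = sorted(errors_1), sorted(errors_2)
    d = 0.0
    for x in x1 + x2:
        # bisect_left counts the values strictly below x in a sorted list.
        f1 = bisect_left(x1, x) / len(x1)
        f2 = bisect_left(x2, x) / len(x2)
        d = max(d, abs(f1 - f2))
    return d


print(edf([6.1, 7.4, 8.0, 9.5], 8.0))               # two of four errors are below 8.0
print(ks_statistic([6.1, 7.4, 8.0], [6.0, 7.5, 8.2]))
```

This computes only the statistic. For a p-value, an existing implementation such as `scipy.stats.ks_2samp` is a more practical choice than hand-rolling the null distribution.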
Discussion.
While pedagogical, the above example demonstrates the necessity of comparing distributions rather than point estimates, as the latter can give misleading results in even simple cases. We emphasize that such imbalanced comparisons occur frequently in practice. Typically, most work reports results for only a small number of best models, and rarely reports the number of total points explored during model development, which can vary substantially.
4.2 Controlling for Complexity
While comparing distributions can lead to more robust conclusions about design spaces, when performing such comparisons, we need to control for confounding factors that correlate with model error to avoid biased conclusions. A particularly relevant confounding factor is model complexity. We study controlling for complexity next.
Unnormalized comparison.
Figure 3 (left) shows the error EDFs for the ResNeXt-A and ResNeXt-B design spaces, which differ only in the allowable hyperparameter sets for sampling models (see Table 2). Nevertheless, the curves have clear qualitative differences and suggest that ResNeXt-B is a better design space. In particular, the EDF for ResNeXt-B is higher; i.e., it has a higher fraction of better models at every error threshold. This clear difference illustrates that different design spaces from the same model family under the same model distribution can result in very different error distributions. We investigate this gap further.
Error vs complexity.
From prior work we know that a model’s error is related to its complexity; in particular more complex models are often more accurate [7, 36, 11, 41]. We explore this relationship using our large-scale data. Figure 4 plots the error of each trained model against its complexity, measured by either its parameter or flop counts. While there are poorly-performing high-complexity models, the best models have the highest complexity. Moreover, in this setting, we see no evidence of saturation: as complexity increases we continue to find better models.
Complexity distribution.
Can the differences between the ResNeXt-A and ResNeXt-B EDFs in Figure 3 (left) be due to differences in their complexity distributions? In Figure 5, we plot the empirical distributions of model complexity for the two design spaces. We see that ResNeXt-A contains a much larger number of low-complexity models, while ResNeXt-B contains a heavy tail of high-complexity models. It therefore seems plausible that ResNeXt-B’s apparent superiority is due to the confounding effect of complexity.
Normalized comparison.
We propose a normalization procedure to factor out the confounding effect of the differences in the complexity of model distributions. Given a set of $n$ models where each model has complexity $c_i$, the idea is to assign to each model a weight $w_i$, where $\sum_i w_i = 1$, to create a more representative set under that complexity measure.
Specifically, given a set of $n$ models with errors, complexities, and weights given by $\{e_i\}$, $\{c_i\}$, and $\{w_i\}$, respectively, we define the normalized complexity EDF as:
$$C(c) = \sum_{i=1}^{n} w_i \mathbf{1}[c_i < c] \qquad (3)$$
Likewise, we define the normalized error EDF as:
$$\hat{F}(e) = \sum_{i=1}^{n} w_i \mathbf{1}[e_i < e] \qquad (4)$$
Then, given two model sets, our goal is to find weights for each model set such that $C_1(c) \approx C_2(c)$ for all $c$ in a given complexity range. Once we have such weights, comparisons between $\hat{F}_1$ and $\hat{F}_2$ reveal differences between design spaces that cannot be explained by model complexity alone.
In practice, we set the weights for a model set such that its complexity distribution (Eqn. 3) is uniform. Specifically, we bin the complexity range into $k$ bins, and assign each of the $m_j$ models that fall into a bin $j$ a weight $w_j = 1/(k m_j)$. Given sparse data, the assignment into bins could be made in a soft manner to obtain smoother estimates. While other options for matching $C_1 \approx C_2$ are possible, we found normalizing both $C_1$ and $C_2$ to be uniform to be effective.
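The binning step can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes equal-width bins over the observed complexity range and divides by the number of occupied bins so that the weights still sum to one when some bins are empty. The function names are hypothetical.

```python
def uniform_complexity_weights(complexities, num_bins):
    """Weight each model by 1 / (bins * models in its bin), using hard binning."""
    lo, hi = min(complexities), max(complexities)
    span = (hi - lo) or 1.0  # guard against all-equal complexities
    # Equal-width bins (an assumption); the maximum falls into the last bin.
    bins = [min(int((c - lo) / span * num_bins), num_bins - 1) for c in complexities]
    counts = [0] * num_bins
    for b in bins:
        counts[b] += 1
    # Assumption: empty bins are skipped so that the weights sum to one.
    occupied = sum(1 for m in counts if m > 0)
    return [1.0 / (occupied * counts[b]) for b in bins]


def normalized_edf(values, weights, threshold):
    """Weighted fraction of models whose value is below the threshold (Eqns. 3 and 4)."""
    return sum(w for v, w in zip(values, weights) if v < threshold)


complexities = [1.0, 1.0, 1.0, 9.0]  # three small models, one large one
errors = [10.0, 9.0, 8.0, 5.0]
weights = uniform_complexity_weights(complexities, num_bins=2)
print(weights)
print(normalized_edf(errors, weights, 8.5))
```

In this toy case the single large model receives as much total weight as the three small ones combined, so the normalized error EDF no longer reflects how many models of each size happened to be sampled.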
In Figure 3 we show ResNeXt-A and ResNeXt-B error EDFs, normalized by parameters (middle) and flops (right). Controlling for complexity brings the curves closer, suggesting that much of the original gap was due to mismatched complexity distributions. This is not unexpected as the design spaces are similar and both parameterize the same underlying model family. We observe, however, that their normalized EDFs still show a small gap. We note that ResNeXt-B contains wider models with more groups (see Table 2), which may account for this remaining difference.
4.3 Characterizing Distributions
An advantage of examining the full error distribution of a design space is it gives insights beyond the minimum achievable error. Often, we indeed focus on finding the best model under some complexity constraint, for example, if a model will be deployed in a production system. In other cases, however, we may be interested in finding a good model quickly, when experimenting in a new domain or under constrained computation. Examining distributions allows us to more fully characterize a design space.
Distribution shape.
Figure 6 (left) shows EDFs for the Vanilla and ResNet design spaces (see Table 2). In the case of ResNet, the majority (80%) of models have error under 8%. In contrast, the Vanilla design space has a much smaller fraction of such models (15%). This makes it easier to find a good ResNet model. While this is not surprising given the well-known effectiveness of residual connections, it does demonstrate how the shape of the EDFs can give additional insight into characterizing a design space.
Distribution area.
We can summarize an EDF by the average area under the curve up to some max error $\epsilon$. That is, we can compute $\int_0^{\epsilon} \hat{F}(e)/\epsilon \, de = 1 - \sum_i w_i \min(1, e_i/\epsilon)$. For our example, ResNet has a larger area under the curve. However, like the min, the area gives only a partial view of the EDF.
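For a weighted error EDF with weights summing to one, the average area up to a cutoff has a simple closed form: each model contributes its weight times the fraction of the interval on which its error lies below the threshold. The sketch below assumes that reading; `edf_area` and `eps` are hypothetical names.

```python
def edf_area(errors, weights, eps):
    """Average height of the weighted error EDF over the interval [0, eps]."""
    return 1.0 - sum(w * min(1.0, e / eps) for e, w in zip(errors, weights))


# Toy example: one model at 5% error, one at 20%, equal weights, cutoff at 10%.
print(edf_area([5.0, 20.0], [0.5, 0.5], eps=10.0))
```

Here the EDF is 0.5 on the interval from 5 to 10 and zero elsewhere below the cutoff, giving an average height of 0.25.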
Random search efficiency.
Another way to assess the ease of finding a good model is to measure random search efficiency. To simulate random search experiments of varying size we follow the procedure described in [1]. Specifically, for each experiment size $m$, we sample $m$ models from our pool of $n$ models and take their minimum error. We repeat this process multiple times to obtain the mean along with error bars for each $m$. To factor out the confounding effect of complexity, we assign a weight to each model such that $C_1(c) \approx C_2(c)$ (Eqn. 3) and use these weights for sampling.
In Figure 6 (right), we use our 50k pretrained models from the Vanilla and ResNet design spaces to simulate random search (conditioned on parameters) for varying $m$. Our findings are consistent with those above: random search finds better models faster in the ResNet design space.
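The random search simulation can be approximated with weighted sampling from a pool of model errors. The sketch below is one plausible reading, not the authors' code: it samples with replacement via `random.choices`, uses a fixed number of repeats, and runs on synthetic errors with uniform weights. All names and parameter values are assumptions.

```python
import random
import statistics


def random_search_curve(errors, weights, sizes, repeats=200, seed=0):
    """Mean and standard deviation of the best error found for each experiment size."""
    rng = random.Random(seed)
    curve = {}
    for m in sizes:
        # Assumption: models are drawn with replacement, in proportion to their weights.
        minima = [min(rng.choices(errors, weights=weights, k=m)) for _ in range(repeats)]
        curve[m] = (statistics.mean(minima), statistics.stdev(minima))
    return curve


# Synthetic pool (not real results) with uniform weights.
pool_rng = random.Random(3)
errors = [pool_rng.uniform(5.0, 15.0) for _ in range(5000)]
weights = [1.0 / len(errors)] * len(errors)
for m, (mean, std) in random_search_curve(errors, weights, [1, 10, 100]).items():
    print(f"m={m:4d}  best error {mean:.2f} +/- {std:.2f}")
```

Passing complexity-normalized weights instead of uniform ones would condition the simulated search on complexity, in the spirit of the procedure described above.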
4.4 Minimal Sample Size
Our experiments thus far used very large sets of trained models. In practice, however, far fewer samples can be used to compare distributions of models as we now demonstrate.
Qualitative analysis.
Figure 7 (left) shows EDFs for the ResNet design space with varying number of samples. Using 10 samples to generate the EDF is quite noisy; however, 100 gives a reasonable approximation and 1000 is visually indistinguishable from 10,000. This suggests that 100 to 1000 samples may be sufficient to compare distributions.
Quantitative analysis.
We perform quantitative analysis to give a more precise characterization of the number of samples necessary to compare distributions. In particular, we compute the KS statistic $D$ (Eqn. 2) between the full sample of 25k models and subsamples of increasing size $n$. The results are shown in Figure 7 (right). As expected, as $n$ increases, $D$ decreases. At 100 samples $D$ is about 0.1, and at 1000 $D$ begins to saturate. Going beyond 1000 samples shows diminishing returns. Thus, our earlier estimate of 100 samples is indeed a reasonable lower bound, and 1000 should be sufficient for more precise comparisons. We note, however, that these bounds may vary under other circumstances.
Feasibility discussion.
One might wonder about the feasibility of training between 100 and 1000 models for evaluating a distribution. In our setting, training 500 CIFAR models requires about 250 GPU hours. In comparison, training a typical ResNet-50 baseline on ImageNet requires about 192 GPU hours. Thus, evaluating the full distribution for a small-sized problem like CIFAR requires a computational budget on par with a point estimate for a medium-sized problem like ImageNet. To put this in further perspective, NAS methods can require considerably more GPU hours on CIFAR [29]. Overall, we expect distribution comparisons to be quite feasible under typical settings. To further aid such comparisons, we will release data for all studied design spaces to serve as baselines for future work.
            | #ops | #nodes | output | #cells (B)
NASNet [41] | 13   | 5      | L      | 71,465,842
Amoeba [30] | 8    | 5      | L      | 556,628
PNAS [20]   | 8    | 5      | A      | 556,628
ENAS [29]   | 5    | 5      | L      | 5,063
DARTS [21]  | 8    | 4      | A      | 242
Table 3: For each NAS design space, we list the number of candidate ops (e.g., 3×3 max pool), number of nodes (excluding the inputs), and which nodes are concatenated for the output (‘A’ if ‘all’ nodes, ‘L’ if ‘loose’ nodes not used as input to other nodes). Given $o$ ops to choose from, there are $o^2 \cdot (j+1)^2$ choices when adding the $j$-th node, leading to $o^{2k} \cdot ((k+1)!)^2$ possible cells with $k$ nodes (of course many of these cells are redundant). The spaces vary substantially; indeed, even exact candidate ops for each vary.
5 Case Study: NAS
As a case study of our methodology we examine design spaces from recent neural architecture search (NAS) literature. In this section we perform studies on CIFAR [14] and in Appendix B we further validate our results by replicating the study on ImageNet [4], yielding similar conclusions.
NAS has two core components: a design space and a search algorithm over that space. While normally the focus is on the search algorithm (which can be viewed as inducing a distribution over the design space), we instead focus on comparing the design spaces under a fixed distribution. Our main finding is that in recent NAS papers, significant design space differences have been largely overlooked. Our approach complements NAS by decoupling the design of the design space from the design of the search algorithm, which we hope will aid the study of new design spaces.
5.1 Design Spaces
I. Model family.
The general NAS model family was introduced in [40, 41]. A NAS model is constructed by repeatedly stacking a single computational unit, called a cell, where a cell can vary in the operations it performs and in its connectivity pattern. In particular, a cell takes outputs from two previous cells as inputs and contains a number of nodes. Each node in a cell takes as input two previously constructed nodes (or the two cell inputs), applies an operator to each input (e.g., convolution), and combines the output of the two operators (e.g., by summing). We refer to ENAS [29] for a more detailed description.
II. Design space.
In spite of many recent papers using the general NAS model family, most recent approaches use different design space instantiations. In particular, we carefully examined the design spaces described in NASNet [41], AmoebaNet [30], PNAS [20], ENAS [29], and DARTS [21]. The cell structure differs substantially between them, see Table 3 for details. In our work, we define five design spaces by reproducing these five cell structures, and name them accordingly, i.e., NASNet, Amoeba, etc.
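The "#cells (B)" column of Table 3 can be reproduced with a short count. The sketch assumes that when the j-th node is added it picks two inputs from the j+1 available candidates (the two cell inputs plus the j-1 earlier nodes) and one of the candidate ops for each input, which matches the node description above and reproduces the tabulated values. The helper names are hypothetical.

```python
def num_cells(num_ops, num_nodes):
    """Count cells assuming each node picks two inputs and one op per input."""
    total = 1
    for j in range(1, num_nodes + 1):
        # (j + 1) candidate inputs per slot, num_ops candidate ops per slot.
        total *= (num_ops ** 2) * ((j + 1) ** 2)
    return total


def in_billions(count):
    """Round an integer count to the nearest billion (halves round up)."""
    return (count + 5 * 10**8) // 10**9


# (#ops, #nodes) per design space, copied from Table 3.
SPACES = {"NASNet": (13, 5), "Amoeba": (8, 5), "PNAS": (8, 5),
          "ENAS": (5, 5), "DARTS": (8, 4)}
for name, (ops, nodes) in SPACES.items():
    print(f"{name:7s} {in_billions(num_cells(ops, nodes)):>12,d} B")
```

The count ignores the output rule ('A' vs 'L') and treats redundant cells as distinct, so it is an upper bound on the number of genuinely different cells.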
How the cells are stacked to generate the full network architecture also varies slightly between recent papers, but less so than the cell structure. We therefore standardize this aspect of the design spaces; that is we adopt the network architecture setting from DARTS [21]. Core aspects include the stem structure, even placement of the three reduction cells, and filter width that doubles after each reduction cell.
The network depth and initial filter width are typically kept fixed. However, these hyperparameters directly affect model complexity. Specifically, Figure 8 shows the complexity distribution generated with different cell structures with depth and width kept fixed. The ranges of the distributions differ due to the varying cell structure designs. To factor out this confounding factor, we let both depth and width vary. This spreads the range of the complexity distributions for each design space, allowing for more controlled comparisons.
III. Model distribution.
We sample NAS cells by using uniform sampling at each step (e.g., operator and node selection). Likewise, we sample the network depth and initial filter width uniformly at random.
IV. Data generation.
5.2 Design Space Comparisons
We adopt our distribution comparison tools (EDFs, KS test, etc.) from §4 to compare the five NAS design spaces, each of which varies in its cell structure (see Table 3).
Distribution comparisons.
Figure 9 shows normalized error EDFs for each of the NAS design spaces. Our main observation is that the EDFs vary considerably: the NASNet and Amoeba design spaces are noticeably worse than the others, while DARTS is best overall. Comparing ENAS and PNAS shows that while the two are similar, PNAS has more models with intermediate errors while ENAS has more lower/higher performing models, causing the EDFs to cross.
Interestingly, according to our analysis the design spaces corresponding to newer work outperform the earliest design spaces introduced in NASNet [41] and Amoeba [30]. While the NAS literature typically focuses on the search algorithm, the design spaces also seem to be improving. For example, PNAS [20] removed five ops from NASNet that were not selected in the NASNet search, effectively pruning the design space. Hence, at least part of the gains in each paper may come from improvements of the design space.
Random search efficiency.
We simulate random search in the NAS design spaces (after normalizing for complexity) following the setup from §4.3. Results are shown in Figure 10. First, we observe that the ordering of design spaces by random search efficiency is consistent with the ordering of the EDFs in Figure 9. Second, for a fixed search algorithm (random search in this case), this shows that the differences in the design spaces alone lead to clear differences in performance. This reinforces that care should be taken to keep the design space fixed if the search algorithm is varied.
5.3 Comparisons to Standard Design Spaces
We next compare the NAS design spaces with the design spaces from §3. We select the best and worst performing NAS design spaces (DARTS and NASNet) and compare them to the two ResNeXt design spaces from Table 2. EDFs are shown in Figure 11. ResNeXt-B is on par with DARTS when normalizing by params (left), while DARTS slightly outperforms ResNeXt-B when normalizing by flops (right). ResNeXt-A is worse than DARTS in both cases.
It is interesting that ResNeXt design spaces can be comparable to the NAS design spaces (which vary in cell structure in addition to width and depth). These results demonstrate that the design of the design space plays a key role and suggest that designing design spaces, manually or via data-driven approaches, is a promising avenue for future work.
Table 4: point comparisons.
           | flops (B) | params (M) | error (original) | error (default) | error (enhanced)
ResNet-110 | 0.26      | 1.7        | 6.61             | 5.91            | 3.65
ResNeXt    | 0.38      | 2.5        | -                | 4.90            | 2.75
DARTS      | 0.54      | 3.4        | 2.83             | 5.21            | 2.63
5.4 Sanity Check: Point Comparisons
We note that recent NAS papers report lower overall errors due to higher complexity models and enhanced training settings. As a sanity check, we perform point comparisons using larger models and the exact training settings from DARTS [21], which uses a 600 epoch schedule with deep supervision [18], Cutout [5], and modified DropPath [17]. We consider three models: DARTS (the best model found in DARTS [21]), ResNeXt (the best model from ResNeXt-B with increased widths), and ResNet-110 [8].
Results are shown in Table 4. With the enhanced setup, ResNeXt achieves similar error as DARTS (with comparable complexity). This reinforces that performing comparisons under the same settings is crucial: simply using an enhanced training setup gives over 2% gain, and even the original ResNet-110 is competitive under these settings.
6 Conclusion
We present a methodology for analyzing and comparing model design spaces. Although we focus on convolutional networks for image classification, our methodology should be applicable to other model types (e.g., RNNs), domains (e.g., NLP), and tasks (e.g., detection). We hope our work will encourage the community to consider design spaces as a core part of model development and evaluation.
Appendix A: Supporting Experiments
In the appendix we provide details about training settings and report extra experiments to verify our methodology.
Training schedule.
We use a half-period cosine schedule which sets the learning rate via $lr_t = lr \cdot (1 + \cos(\pi \cdot t / T)) / 2$, where $lr$ is the initial learning rate, $t$ is the current epoch, and $T$ is the total epochs. The advantage of this schedule is that it has just two hyperparameters: $lr$ and $T$. To determine $lr$ and also weight decay $wd$, we ran a large-scale study with three representative design spaces: Vanilla, ResNet, and DARTS. We train 5k models sampled from each design space for 100 epochs with random $lr$ and $wd$ sampled from a log uniform distribution and plot the results in Figure 12. The results across the three different design spaces are remarkably consistent; in particular, we found a single $lr$ and $wd$ to be effective across all design spaces. For all remaining experiments, we use $lr = 0.1$ and $wd = 5 \times 10^{-4}$.
Training settings.
For all experiments, we use SGD with momentum of 0.9 and minibatch size of 128. By default we train for 100 epochs. We adopt weight initialization from [8] and use standard CIFAR data augmentations [18]. For ResNets, our settings improve upon the original settings used in [8] (see Table 1). We note that recent NAS papers use much longer schedules and stronger regularization (see DARTS [21]). Using our settings but with 600 epochs and comparable extra regularization, we can achieve similar errors (see Table 4). Thus, we hope that our simple setup can provide a recipe for strong baselines for future work.
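As an illustration of the schedule and the hyperparameter study described above, the following is a minimal Python sketch. It assumes the standard half-period cosine form, in which the learning rate decays from its initial value at epoch 0 to zero at the final epoch. The function names and the sampling ranges are hypothetical; the text does not state the ranges used for the log uniform distribution.

```python
import math
import random


def cosine_lr(base_lr: float, epoch: int, total_epochs: int) -> float:
    """Half-period cosine schedule: base_lr at epoch 0, decaying to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)

# Hypothetical search ranges (assumed for illustration, not taken from the text).
lr = sample_log_uniform(1e-3, 1.0, rng)   # initial learning rate
wd = sample_log_uniform(1e-5, 1e-2, rng)  # weight decay

# Per-epoch learning rates for the default setting: initial rate 0.1, 100 epochs.
schedule = [cosine_lr(0.1, t, 100) for t in range(100)]
```

Sampling in log space spreads the draws evenly across orders of magnitude, which is the usual motivation for a log uniform distribution when the useful range of a hyperparameter spans several decades.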
Model consistency.
Training is a stochastic process and hence error estimates vary across multiple training runs. (The main source of noise in model error estimates is due to the random number generator seed that determines the initial weights and data ordering. However, surprisingly, even fixing the seed does not reduce the overall variance much due to nondeterminism of floating point operations.) Figure 13 shows error distributions of a top-ranked and a mid-ranked model from the ResNet design space over 100 training runs (other models show similar results). We note that the gap between models is large relative to the error variance. Nevertheless, we check to see if the reliability of error estimates impacts the overall trends observed in our main results. In Figure 14, we show error EDFs where the error for each of 5k models was computed by averaging over 1 to 3 runs. The EDFs are indistinguishable. Given these results, we use error estimates from a single run in all other experiments.
Bucket comparison.
Another way to account for complexity is stratified analysis [24]. As in §4.2 we can bin the complexity range into bins, but instead of reweighing models per bin to generate normalized EDFs, we instead perform analysis within each bin independently. We show the results of this approach applied to our example from §4.2 in Figure 15. We observe similar trends as in Figure 3. Indeed, bucket analysis can be seen as a first step to computing the normalized EDF (Eqn. 4), where the data across all bins is combined into a single distribution estimate, which has the advantage that substantially fewer samples are necessary. We thus rely on normalized EDFs for all comparisons.
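The relationship between bucket analysis and the normalized EDF can be sketched in a few lines of Python. This is an illustrative sketch, not our implementation: the function names are hypothetical, and it assumes that the normalized EDF gives every non-empty complexity bin the same total weight, with models inside a bin weighted equally.

```python
from bisect import bisect_right


def edf(errors, thresholds):
    """Fraction of models with error below each threshold."""
    n = len(errors)
    return [sum(e < t for e in errors) / n for t in thresholds]


def bucket_edfs(models, bin_edges, thresholds):
    """Per-bin EDFs. `models` is a list of (complexity, error) pairs and
    `bin_edges` is a sorted list defining half-open bins [edge_i, edge_{i+1})."""
    bins = {}
    for complexity, error in models:
        if bin_edges[0] <= complexity < bin_edges[-1]:
            idx = bisect_right(bin_edges, complexity) - 1
            bins.setdefault(idx, []).append(error)
    return {idx: edf(errs, thresholds) for idx, errs in sorted(bins.items())}


def normalized_edf(models, bin_edges, thresholds):
    """Combine the per-bin EDFs into one curve, weighting each non-empty bin equally
    (an assumed weighting that makes the complexity distribution uniform over bins)."""
    per_bin = bucket_edfs(models, bin_edges, thresholds)
    k = len(per_bin)
    return [sum(curve[j] for curve in per_bin.values()) / k
            for j in range(len(thresholds))]


# Toy data: (complexity, error) pairs, two complexity bins, one error threshold.
models = [(1.0, 5.0), (1.5, 7.0), (3.0, 6.0)]
per_bin = bucket_edfs(models, [0, 2, 4], [6.5])
combined = normalized_edf(models, [0, 2, 4], [6.5])
```

In the toy data the unnormalized EDF at threshold 6.5 is 2/3, while the combined curve is 0.75, because the sparsely populated higher-complexity bin counts as much as the lower one.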
Appendix B: ImageNet Experiments
We now evaluate our methodology in the large-scale regime of ImageNet [4] (IN for short). In particular, we repeat the NAS case study from §5 on IN. We note that these experiments were performed after our methodology was finalized, and we ran these experiments exactly once. Thus, IN can be considered as a test case for our methodology.
Design spaces.
We provide details about the IN design spaces used in our study. We use the same model families as in §3.2 and §5.1. The precise design spaces on IN are as close as possible to their CIFAR counterparts. We make only two necessary modifications: we adopt the IN stem from DARTS [21] and adjust the NAS width and depth values. We keep the allowable hyperparameter values for ResNeXt design spaces unchanged from Table 2 (our IN models have 3 stages). We further upper-bound models to 6M parameters or 0.6B flops, which gives models in the mobile regime commonly used in the NAS literature. The model distributions follow §3.2 and §5.1. For data generation, we train 100 models on IN for each of the five NAS design spaces in Table 3 and the two ResNeXt design spaces in Table 2. Note that to stress test our methodology we choose to use the minimal number of samples (see §4.4 for discussion).
NAS distribution comparisons.
In Figure 16a we show normalized EDFs for the NAS design spaces on IN. We observe that the general EDF shapes match their CIFAR counterparts in Figure 9. Moreover, the relative ordering of the NAS design spaces is consistent between the two datasets as well. This provides evidence that current NAS design spaces, developed on CIFAR, are transferable to IN.
NAS random search efficiency.
Analogously to Figure 10, we simulate random search in the NAS design spaces on IN and show the results in Figure 16b. Our main findings are consistent with CIFAR: (1) random search efficiency ordering is consistent with EDF ordering and (2) differences in design spaces alone result in differences in performance.
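A random search simulation of this kind can be sketched as follows. The sketch assumes a pool of already trained models with known errors, from which models are drawn without replacement while the best error seen so far is tracked; the function name, the number of trials, and the toy error values are hypothetical and chosen only for illustration.

```python
import random


def simulate_random_search(errors, num_trials, max_models, rng):
    """Mean best-so-far error after drawing 1..max_models models from the pool,
    averaged over num_trials independent simulated searches."""
    curve = [0.0] * max_models
    for _ in range(num_trials):
        # Draw models without replacement; requires max_models <= len(errors).
        picks = rng.sample(errors, max_models)
        best = float("inf")
        for m, err in enumerate(picks):
            best = min(best, err)
            curve[m] += best
    return [total / num_trials for total in curve]


# Toy pool of model errors standing in for one design space.
pool = [5.0, 6.0, 7.0, 8.0]
curve = simulate_random_search(pool, num_trials=200, max_models=4,
                               rng=random.Random(0))
```

Running the same simulation on pools drawn from different design spaces yields one curve per design space, so differences between the curves reflect the design spaces alone, since the search procedure is identical.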
Comparison to standard design spaces.
We next follow the setup from Figure 11 and compare NAS design spaces to standard design spaces on IN in Figure 16c. The main observation is again consistent: standard design spaces can be comparable to the NAS ones. In particular, ResNeXtB is similar to DARTS when normalizing by params (left), while NAS design spaces outperform the standard ones by a considerable margin when normalizing by flops (right).
Discussion.
Overall, our IN results closely follow their CIFAR counterparts. As before, our core observation is that the design of the design space can play a key role in determining the potential effectiveness of architecture search. These experiments also demonstrate that using 100 models per design space is sufficient to apply our methodology and strengthen the case for its feasibility in practice. We hope these results can further encourage the use of distribution estimates as a guiding tool in model development.
Training settings.
We conclude by listing detailed IN training settings, which follow the ones from Appendix A unless specified next. We train our models for 50 epochs. To determine the initial learning rate and weight decay we follow the same procedure as for CIFAR (results in Figure 16d) and set them to 0.05 and 5e-5, respectively. We adopt standard IN data augmentations: aspect ratio [34], flipping, PCA [15], and per-channel mean and SD normalization. At test time, we rescale images to 256 (shorter side) and evaluate the model on the center 224×224 crop.
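The test-time geometry described above can be made concrete with a small Python sketch that computes only the image sizes and the crop box, leaving the actual resampling to an image library. The function names and the rounding choice are assumptions made for illustration.

```python
def resize_shorter_side(width, height, target=256):
    """Return the (width, height) after scaling so the shorter side equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Example: a 500x375 image is rescaled to 341x256, then center-cropped to 224x224.
resized = resize_shorter_side(500, 375)
box = center_crop_box(*resized)
```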
Acknowledgements
We would like to thank Ross Girshick, Kaiming He, and Agrim Gupta for valuable discussions and feedback.
References
[1] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. JMLR, 2012.
[2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.
[3] J. Collins, J. Sohl-Dickstein, and D. Sussillo. Capacity and trainability in recurrent neural networks. In ICLR, 2017.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[5] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.
[6] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. arXiv:1503.04069, 2015.
[7] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[9] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In AAAI, 2018.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[12] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[14] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[16] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.
[17] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017.
[18] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[19] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[20] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.
[21] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. In ICLR, 2019.
[22] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. In NIPS, 2018.
[23] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In NIPS, 2018.
[24] N. Mantel and W. Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the national cancer institute, 1959.
[25] F. J. Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association, 1951.
[26] G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. In ICLR, 2018.
[27] S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing lstm language models. In ICLR, 2018.
[28] R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.
[29] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.
[30] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI, 2019.
[31] R. E. Schapire. The strength of weak learnability. Machine learning, 1990.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[35] V. Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
[36] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[37] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[38] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
[39] C. Zhang, S. Bengio, and Y. Singer. Are all layers created equal? arXiv:1902.01996, 2019.
[40] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[41] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.