FBNetV3: Joint Architecture-Recipe Search using Neural Acquisition Function

Neural Architecture Search (NAS) yields state-of-the-art neural networks that outperform their best manually-designed counterparts. However, previous NAS methods search for architectures under one training recipe (i.e., training hyperparameters), ignoring the significance of training recipes and overlooking superior architectures under other training recipes. Thus, they fail to find higher-accuracy architecture-recipe combinations. To address this oversight, we present JointNAS to search both (a) architectures and (b) their corresponding training recipes. To accomplish this, we introduce a neural acquisition function that scores architectures and training recipes jointly. Following pre-training on a proxy dataset, this acquisition function guides both coarse-grained and fine-grained searches to produce FBNetV3. FBNetV3 is a family of state-of-the-art compact ImageNet models, outperforming both automatically and manually-designed architectures. For example, FBNetV3 matches both EfficientNet and ResNeSt accuracy with 1.4x and 5.0x fewer FLOPs, respectively. Furthermore, the JointNAS-searched training recipe yields significant performance gains across different networks and tasks.





1 Introduction

Designing efficient models is a challenging but important problem: A myriad of applications from autonomous vehicles to augmented reality require compact models that must be highly-accurate – even under constraints on power, computation, memory, and latency. The number of possible constraint and architecture combinations is combinatorially large, making manual design very difficult.

In response, recent work employs neural architecture search (NAS) to design state-of-the-art efficient deep neural networks (DNNs). One category of NAS is differentiable neural architecture search (DNAS). These path-finding algorithms cannot search for non-architecture hyperparameters, which are crucial to the model’s performance. Furthermore, supernet-based NAS methods suffer from a limited search space, as the entire supergraph must fit into memory to avoid slow convergence Cai et al. (2018) or paging. Other methods include reinforcement learning Tan et al. (2018) and evolutionary algorithms Real et al. (2017). However, all previous methods share these drawbacks:

  1. Ignore training hyperparameters: NAS, true to its name, searches only for architectures but not their associated training hyperparameters (i.e., “training recipe”). This ignores the fact that different training recipes may drastically change the success or failure of an architecture.

  2. Support only one-time use: Most conventional NAS approaches produce one model for a specific set of resource constraints. This means deploying to a line of products, each with different resource constraints, requires rerunning NAS once for each resource setting.

To solve these previously-ubiquitous weaknesses, we propose JointNAS to jointly search for training recipes and architectures. We start by pre-training the neural acquisition function (i.e., predictor) on a proxy dataset, predicting architecture statistics from architecture embeddings. JointNAS is then a two-stage, constraint-aware search method, broken up into coarse-grained and fine-grained stages: In the first stage, we identify high-promise candidates by alternately (a) exploring the search space using constrained iterative optimization and (b) fine-tuning the predictor. In the second stage, we refine the set of candidates by using (not training) the predictor to perform fast evolutionary search. The search integrates our hyperparameter-optimizer AutoTrain to produce accurate, compact architectures with customized training recipes. We summarize our contributions:

  1. Joint training-architecture search: We show that training recipes are as crucial as architectures to model performance. We propose an efficient constraint-aware architecture search and present the joint search over both training recipes and architectures for the first time.

  2. Generalizable training recipe: JointNAS’s search over training recipes achieves significant accuracy gains, as illustrated in Fig. 2. Our ResNeXt101-32x8d achieves 82.6% top-1 accuracy; this even outperforms its weakly-supervised counterpart trained on 1B extra images Mahajan et al. (2018). Furthermore, AutoTrain’s favored mechanism, stochastic weight averaging via exponential moving average (EMA), demonstrates remarkable mAP gain on other tasks such as object detection.

  3. Multi-use predictor: We pre-train the predictor to predict architecture statistics, then fine-tune the predictor with coarse-grained, constrained iterative optimization. The predictor, now fully trained, can then be used in fast, fine-grained evolutionary searches to quickly generate models for a wide variety of resource budgets.

  4. State-of-the-art ImageNet accuracy per-FLOP for the searched FBNetV3s. For example, our FBNetV3 matches EfficientNet accuracy with 28.4% fewer FLOPs, as shown in Fig. 1.

Figure 1: Comparison of FBNetV3 with other efficient convolutional neural networks.

Figure 2: Accuracy improvement on existing architectures with the searched training recipe. WSL refers to the weakly supervised learning model using 1B additional images Mahajan et al. (2018).

2 Related work

Work on efficient neural networks began with manual design, which can be divided into architectural and non-architectural modifications.

Manual architecture search: Most early work compresses existing architectures. One method was pruning Han et al. (2015a); Dai et al. (2017); Yin et al. (2019); Cai et al. (2020), where either layers or channels were removed according to certain heuristics. However, pruning either considers only one architecture Han et al. (2015b) or can only sequentially search smaller and smaller architectures Yang et al. (2018). This limits the search space. Later work designs new architectures from the ground up, using new operations that are cost-friendly. This includes convolutional variants like depthwise convolutions in MobileNet, inverted residual blocks in MobileNetV2; activations such as hswish in MobileNetV3 Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019); and memory-based operations like shift Wu et al. (2017) and shuffle Ma et al. (2018). Although many of these operations are still used in state-of-the-art neural networks today, manually-designed architectures have been superseded by automatically-searched counterparts.

Non-architectural modifications: This involves a number of network compression techniques including low-bit quantization Han et al. (2015a) to as few as two Zhu et al. (2016) or even one bit Hubara et al. (2016). Other work downsamples input non-uniformly Wu et al. (2018); Xu et al. (2020); Marin et al. (2019) to reduce computational cost. These methods can be combined with architecture improvements for roughly additive reduction in latency. Other non-architecture modifications involve hyperparameter tuning, including tuning libraries from the pre-deep-learning era Bergstra et al. (2013). Several deep-learning-specific tuning libraries are also widely used Liaw et al. (2018). A newer category of approaches automatically searches for the optimal combination of data augmentation strategies. These methods use policy search Cubuk et al. (2019), population-based training Ho et al. (2019), and Bayesian-based augmentation Tran et al. (2017). However, all such methods optimize for a specific architecture.

Automatic architecture search: NAS automates neural network design for state-of-the-art performance. Several of the most common techniques for NAS include reinforcement learning Zoph and Le (2016); Tan et al. (2018), evolutionary algorithms Real et al. (2017), and DNAS Liu et al. (2018); Wu et al. (2019a); Wan et al. (2020); Guo et al. (2019); Xie et al. (2018). DNAS is typically efficient but supports only a limited search space due to memory constraints. Several works seek to address this issue, by training only subsets at a time Cai et al. (2018) or by introducing approximations Wan et al. (2020). However, its flexibility is still less than that of rival reinforcement-learning and evolutionary methods. Furthermore, all of the above methods search only for the model architecture. By contrast, our JointNAS integrates a hyperparameter-optimizer AutoTrain to jointly search both architectures and training recipes.

3 Method

Our goal is to find the most accurate architecture and training recipe under given resource constraints. We thus formulate the architecture search as a constrained optimization problem:

$$\max_{(A, h) \in \Omega} \mathrm{acc}(A, h) \quad \text{subject to} \quad g_i(A) \le C_i, \quad i = 1, \dots, \gamma$$

where $A$, $h$, and $\Omega$ refer to the neural network architecture embedding, training recipe embedding, and designed search space, respectively. Embedding contents can be found in Table 1. $\mathrm{acc}$ maps the architecture and training recipe to accuracy. $g_i(A)$ and $\gamma$ refer to the formula and count of resource constraints, such as computational cost, storage cost, and run-time latency, and $C_i$ denotes the budget of the $i$-th constraint.

The search space is typically combinatorially large, making exhaustive evaluation an impossibility. To address this complexity, we design a two-stage search algorithm: (1) coarse-grained search, followed by (2) fine-grained search, as shown in Algorithm 1. We explain the details below.

Input:
$\Omega$: the designed search space;
$n$: the size of the pool $\Lambda$;
$m$: the number of DNN candidates ($X \subset \Lambda$) to train in each iteration;
$T$: the number of batches for constrained iterative optimization;
Stage 1: Constrained Iterative Optimization:
Initialize $\mathcal{D}_0$ as $\emptyset$; Generate a pool $\Lambda$ with QMC sampling subject to constraints $g_i(A) \le C_i$;
for $t = 1, 2, \dots, T$ do
        Find a batch of DNN candidates $X \subset \Lambda$ based on predicted scores, $u(x)$;
        Evaluate all $x \in X$ by training in parallel; if $t = 1$: Determine early stopping criteria; Update the dataset: $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \{(x, \mathrm{acc}(x)) : x \in X\}$;
        Retrain the accuracy predictor $u$ on $\mathcal{D}_t$;
end for
Stage 2: Predictor-based Evolutionary Search:
Initialize $\mathcal{D}^*$ with $p$ best-performing and $q$ randomly generated candidates paired with scores predicted by $u$; Initialize $s^*$ with the best score in $\mathcal{D}^*$; set the previous best score $s_0 = 0$ and a small tolerance $\epsilon$;
while $(s^* - s_0) > \epsilon$ do
        for $x \in \mathcal{D}^*$ do
               Generate a set of children $\mathcal{C} \subset \Omega$ subject to constraints $g_i(c) \le C_i$, by the adaptive genetic algorithm Dai et al. (2019), where $c \in \mathcal{C}$;
        end for
        Augment $\mathcal{D}^*$ with $\mathcal{C}$ paired with scores predicted by $u$;
        Select top $K$ candidates from the augmented set to update $\mathcal{D}^*$;
        Update the previous best ranking score by $s_0 = s^*$;
        Update the current best ranking score $s^*$ by the best predicted score in $\mathcal{D}^*$.
end while
Result: $\mathcal{D}^*$, i.e., top $K$ best samples with their predicted scores.
Algorithm 1 Efficient Two-stage Constraint-aware Architecture Search
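
The first stage of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `train_and_eval` and `fit_predictor` are hypothetical callables supplied by the caller, the pool is assumed to be pre-filtered by the resource constraints, and a nearest-neighbor lookup stands in for the neural predictor.

```python
def constrained_iterative_optimization(pool, train_and_eval, fit_predictor,
                                       batch_size, num_iters):
    """Alternate between picking promising candidates and refitting the predictor.

    pool: candidates that already satisfy the resource constraints.
    train_and_eval: hypothetical callable, candidate -> measured accuracy.
    fit_predictor: hypothetical callable, dataset -> scoring function.
    """
    dataset = []                       # (candidate, accuracy) pairs seen so far
    remaining = list(pool)
    predictor = lambda x: 0.0          # uninformed predictor before any data
    for _ in range(num_iters):
        # Select the batch with the highest predicted scores.
        remaining.sort(key=predictor, reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Train and evaluate the batch (run in parallel in the paper).
        dataset.extend((x, train_and_eval(x)) for x in batch)
        # Refit the predictor on everything evaluated so far.
        predictor = fit_predictor(dataset)
    return predictor, dataset


def fit_nearest_neighbor(dataset):
    """Toy stand-in predictor: return the accuracy of the closest evaluated candidate."""
    def predict(x):
        nearest = min(dataset, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict
```

With scalar toy candidates, calling `constrained_iterative_optimization(pool, accuracy_fn, fit_nearest_neighbor, batch_size=5, num_iters=3)` evaluates 15 candidates in total and returns the refitted predictor alongside the collected dataset.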

3.1 Coarse-grained search: Constrained iterative optimization

This initial search produces both an accuracy predictor and a set of high-promise candidates.

Neural Acquisition Function (i.e., Predictor). This multi-layer perceptron (Fig. 4) consists of an architecture encoder and two heads: (1) An auxiliary “proxy” head, used for pre-training the encoder, predicts architecture statistics (e.g., FLOPs and #Params) from architecture representations; architectures are represented with one-hot categorical variables and integral (min-max normalized) range variables. (2) The accuracy predictor, which accepts training recipes and architecture representations jointly, is fine-tuned in constrained iterative optimization.

Step 1. Pre-train embedding layer. We introduce an embedding layer and pre-training process that significantly improves the predictor’s accuracy: First, we train a model that takes an architecture representation as input and predicts architecture statistics (i.e., FLOPs and #Params). Note that this dataset is “free” to obtain, as we can generate arbitrary architectures and compute architecture statistics with negligible cost. Second, we transfer the pre-trained embedding layer to initialize the accuracy predictor. This leads to significant gains in the predictor’s accuracy and reliability, as shown in Fig. 5.
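
To illustrate why the pre-training labels are essentially free, the sketch below generates (architecture, #Params) pairs without training anything. It is an assumption-laden toy: the parameter count covers only the three convolutions of a single inverted residual block (1x1 expansion, k x k depthwise, 1x1 projection) and ignores biases, batch normalization, and squeeze-and-excitation; the function names and the sampled value ranges are hypothetical.

```python
import random


def mbconv_params(c_in, c_out, kernel, expansion):
    """Weight count of one inverted residual block (convolutions only).

    Assumes a 1x1 expansion conv, a k x k depthwise conv, and a 1x1 projection
    conv; biases, batch norm, and squeeze-and-excitation are ignored.
    """
    c_mid = c_in * expansion
    expand = c_in * c_mid                 # 1x1 pointwise expansion
    depthwise = kernel * kernel * c_mid   # one k x k filter per channel
    project = c_mid * c_out               # 1x1 pointwise projection
    return expand + depthwise + project


def sample_proxy_dataset(num_samples, seed=0):
    """Generate (architecture, #params) pairs at negligible cost."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_samples):
        arch = {
            "c_in": rng.choice([16, 24, 32]),      # illustrative choices only
            "c_out": rng.choice([16, 24, 32]),
            "kernel": rng.choice([3, 5]),
            "expansion": rng.randint(1, 6),
        }
        data.append((arch, mbconv_params(**arch)))
    return data
```

A regression model trained on such pairs only needs the architecture encoding as input, which is what allows the embedding layer to be reused by the accuracy predictor afterwards.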

Step 2. Constrained iterative optimization: We first use quasi Monte-Carlo (QMC) sampling to generate a sample pool of architecture-recipe pairs from the search space. Then, we build the neural acquisition function iteratively: We (a) select a batch of favorable candidates based on predicted accuracy, (b) train and evaluate the candidates, and (c) update the predictor. This iterative shrinking of the candidate space avoids unnecessary evaluations and improves exploration efficiency.

  • Early-stopping. We introduce an early stopping mechanism to cut down on the computational cost of evaluating candidates. Specifically, we (a) rank samples by both early-stopping and final accuracy after the first iteration of constrained iterative optimization, (b) compute the rank correlation, and (c) find the epoch $e$ where correlation exceeds a particular threshold (e.g., 0.92), as shown in Fig. 3. For all remaining candidates, we train only for $e$ epochs to approximate the final accuracy. This allows us to use far fewer training iterations to evaluate each queried sample.

  • Predictor training. After obtaining the pre-trained architecture embedding, we first train the predictor for 50 epochs with the embedding layer frozen. Then, we train the entire model with reduced learning rate for another 50 epochs. We adopt the Huber loss to train the accuracy predictor, i.e., $L = 0.5\,(y - \hat{y})^2$ if $|y - \hat{y}| < 1$ else $|y - \hat{y}| - 0.5$, where $y$ and $\hat{y}$ are the prediction and standardized label, respectively. Utilizing the Huber loss prevents the model from being dominated by outliers, as observed in Wen et al. (2019). Thus, with the Huber loss, we alleviate the impact of “abnormal” candidate architectures that may damage predictor performance on the majority of samples.
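
The Huber loss is quadratic for small residuals and linear for large ones, which is what limits the influence of outliers. A minimal sketch, assuming the common threshold of 1; the `delta` parameter is an illustrative generalization and not something the paper specifies:

```python
def huber_loss(prediction, label, delta=1.0):
    """Huber loss for a single (prediction, standardized label) pair."""
    error = abs(prediction - label)
    if error < delta:
        return 0.5 * error ** 2               # quadratic near zero
    return delta * (error - 0.5 * delta)      # linear for large residuals
```

For a residual of 3, the squared-error term would be 4.5, whereas the Huber loss is 2.5, so a single badly mispredicted architecture pulls on the predictor far less.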

Figure 3: Rank correlation vs. epochs for the first iteration in coarse-grained search. Correlation threshold is 0.92.
Figure 4: Architecture of the predictor.
Figure 5: Predictor’s performance on both proxy metrics and accuracy.
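
The early-stopping rule from Section 3.1 can be sketched as follows. This is an illustrative version, assuming Spearman rank correlation without tied values and per-epoch accuracy curves recorded for the first batch of fully trained candidates; the function names are hypothetical.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free lists of equal length."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def early_stop_epoch(curves, threshold=0.92):
    """First epoch whose accuracies rank candidates like the final epoch does.

    curves[i][e] is the accuracy of candidate i after epoch e + 1.
    """
    final = [curve[-1] for curve in curves]
    for epoch in range(len(curves[0])):
        at_epoch = [curve[epoch] for curve in curves]
        if spearman(at_epoch, final) > threshold:
            return epoch + 1                  # 1-based epoch count
    return len(curves[0])
```

Later candidates would then be trained only for the returned number of epochs, trading a small loss in ranking fidelity for a large reduction in training cost.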

3.2 Fine-grained search: Predictor-based evolutionary search

The second stage of the proposed method is an iterative process based on adaptive genetic algorithms Srinivas and Patnaik (1994). The best architecture-recipe pairs from the first stage are inherited as the first generation candidates. In each iteration, we introduce mutations to the candidates and generate a set of children subject to given constraints. We evaluate the score for each child with the pre-trained accuracy predictor, and select the highest-scoring candidates for the next generation. We compute the gain of the highest score after each iteration, and terminate the loop when the improvement saturates. Finally, the evolutionary search produces high-accuracy neural network architectures and training recipes.

Note that the accuracy predictor can be reused to search networks for different resource constraints. With the accuracy predictor, which provides an instant evaluation of a candidate, JointNAS requires just the fine-grained search, which in turn incurs a negligible cost (i.e., several CPU minutes).
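
The fine-grained stage can be sketched as a short loop. This is a simplified illustration rather than the paper's procedure: mutation is a fixed user-supplied function instead of an adaptive genetic algorithm, and `mutate`, `satisfies`, `children_per_parent`, and `eps` are hypothetical names and defaults.

```python
import random


def evolutionary_search(seeds, predictor, mutate, satisfies, top_k,
                        children_per_parent=4, eps=1e-6, rng=None):
    """Predictor-based evolutionary search (simplified, non-adaptive mutation).

    seeds: first-generation candidates from the coarse-grained stage.
    predictor: candidate -> predicted score (no training happens here).
    mutate: hypothetical callable, (candidate, rng) -> mutated candidate.
    satisfies: candidate -> bool, the resource-constraint check.
    """
    rng = rng or random.Random(0)
    population = sorted(seeds, key=predictor, reverse=True)[:top_k]
    best, previous = predictor(population[0]), float("-inf")
    while best - previous > eps:              # stop once improvement saturates
        children = []
        for parent in population:
            for _ in range(children_per_parent):
                child = mutate(parent, rng)
                if satisfies(child):          # discard constraint violators
                    children.append(child)
        population = sorted(population + children, key=predictor,
                            reverse=True)[:top_k]
        previous, best = best, predictor(population[0])
    return population
```

Because every candidate is scored by the predictor rather than trained, rerunning the loop with a different `satisfies` check is cheap, which mirrors how one predictor serves several resource budgets.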

3.3 Search space

Our search space consists of both training recipes and architecture configurations. The search space for training recipes features optimizer type, initial learning rate, weight decay, mixup ratio Zhang et al. (2017), dropout ratio, stochastic depth drop ratio Huang et al. (2016), and whether or not to use model EMA. Our architecture configuration search space is based on the inverted residual block Sandler et al. (2018) and includes input resolution, kernel size, expansion, number of channels per layer, and depth, as detailed in Table 1.

block k e c n s se act.
Conv 3 - (16, 24, 2) 1 2 - hswish
MBConv [3, 5] 1 (16, 24, 2) (1, 4) 1 N hswish
MBConv [3, 5] (4, 7) / (2, 5) (20, 32, 4) (4, 7) 2 N hswish
MBConv [3, 5] (4, 7) / (2, 5) (24, 48, 4) (4, 7) 2 Y hswish
MBConv [3, 5] (4, 7) / (2, 5) (56, 84, 4) (4, 8) 2 N hswish
MBConv [3, 5] (4, 7) / (2, 5) (96, 144, 4) (6, 10) 1 Y hswish
MBConv [3, 5] (4, 7) (180, 224, 4) (5, 9) 2 Y hswish
MBConv [3, 5] 6 (180, 224, 4) 1 1 Y hswish
MBPool [3, 5] 6 1984 1 - - hswish
FC - - 1000 1 - - -
res lr() optim ema p() d() m() wd()
(224, 272, 8) (20, 30) [RMSProp, SGD] [true, false] (1, 31) (10, 31) (0, 41) (7, 21)
Table 1: The network architecture configuration and search space in our experiments. MBConv, MBPool, k, e, c, n, s, se, and act. refer to the inverted residual block Sandler et al. (2018), efficient last stage Howard et al. (2019), kernel size, expansion, #Channel, #Layers, stride, squeeze-and-excitation, and activation function, respectively. res, lr, optim, ema, p, d, m, and wd refer to resolution, initial learning rate, optimizer type, EMA, dropout ratio, stochastic depth drop probability, mixup ratio, and weight decay, respectively. Expansion on the left of the slash is used in the first block in the stage, while that on the right for the rest. Tuples of three values in parentheses represent the lowest value, highest, and steps; two-value tuples imply a step of 1, and tuples in brackets represent all available choices during search. Note that lr is multiplied by 4 if the optim chooses SGD. Architecture parameters with the same superscript share the same values during the search.
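
The tuple notation of Table 1 maps directly onto code. The sketch below expands an entry into its allowed values and encodes a chosen value the way Section 3.1 describes (one-hot for categorical choices, min-max normalization for ranges). It assumes that range bounds are inclusive; `expand` and `encode` are hypothetical helper names.

```python
def expand(spec):
    """Turn a Table 1 entry into the list of values it allows.

    Tuples are treated as inclusive ranges: (low, high, step), or (low, high)
    with an implied step of 1. Lists enumerate the available choices.
    """
    if isinstance(spec, list):
        return list(spec)
    if len(spec) == 2:
        low, high, step = spec[0], spec[1], 1
    else:
        low, high, step = spec
    return list(range(low, high + 1, step))


def encode(value, spec):
    """One-hot encode categorical choices; min-max normalize range values."""
    if isinstance(spec, list):
        return [1.0 if value == choice else 0.0 for choice in spec]
    low, high = spec[0], spec[1]
    return [(value - low) / (high - low)]
```

For example, the resolution entry `(224, 272, 8)` expands to seven values, and a resolution of 248 is encoded as `[0.5]`.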

In the AutoTrain experiment, we only tune training recipes on a fixed architecture. However, for joint search, we search both training recipes and architectures, within the search space in Table 1. Overall, the space pairs a combinatorially large number of architecture candidates with a large number of possible training recipes. Exploring such a vast search space for an optimal network architecture and its corresponding training recipe is non-trivial.

4 Experiments

In this section, we first validate our search method in a narrowed search space to discover the training recipe for a given network. Then, we evaluate our search method for joint search over architecture and training recipes. We use PyTorch Paszke et al. (2017) for the implementation and conduct search on the ImageNet 2012 classification dataset Deng et al. (2009). In the search process, we randomly sample 200 classes from the entire dataset to reduce the training time. Then, we randomly withhold 10K images from the training set as the validation set.

4.1 AutoTrain search

We first search for the training recipe on a fixed architecture. We adopt FBNetV2-L3 Wan et al. (2020) (detailed configurations can be found in the supplementary materials) as our base architecture, which achieves 79.1% top-1 accuracy with the original training method used in Wan et al. (2020). We fix the sample pool size, the batch size, and the number of iterations of constrained iterative optimization. We train the sampled candidates for 150 epochs with a learning rate decay factor of 0.963 per epoch during the search, and train the final model with a 3× slower learning rate decay (i.e., 0.9875 per epoch, since 0.9875^3 ≈ 0.963). We do not set FLOP constraints since the architecture is fixed. The searched training recipe prefers RMSProp over SGD and adopts EMA. It improves the accuracy of our base architecture by 0.8%. We show detailed settings and the searched training recipe in the supplementary materials.

We extend the searched training recipe to existing renowned DNNs to further validate its generality. Although the training recipe was tailored to FBNetV2-L3, it generalizes surprisingly well, as shown in Table 2. The searched training recipe leads to substantial accuracy gains of up to 5.7% on ImageNet. With the searched training recipe, ResNet50 outperforms the baseline ResNet152 by 0.9%. ResNeXt101-32x8d even surpasses the weakly supervised learning model, which is trained with 1B weakly-labeled images and achieves 82.2% top-1 accuracy. Notably, it is possible to achieve even better performance by searching for a specific training recipe for each DNN, at the price of a higher search cost.

We next explain an interesting finding in the discovered training recipe. We plot the training curves on ResNeXt101-32x8d in Fig. 6. We observe that the EMA model brings substantially higher accuracy than the original model in the middle stage of training. We refer to this phenomenon as stochastic weight averaging via EMA. Note that the EMA model even outperforms the final fully trained model by more than 2%. We hypothesize that EMA intrinsically functions as a strong “ensemble” mechanism, and thus leads to higher accuracy compared to a single model. However, such accuracy gains diminish as the learning rate decays: As the model becomes stable when the learning rate is tiny, the EMA model converges to the model itself. We will further validate the generality of the stochastic weight averaging via EMA mechanism in the next section.
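
The EMA mechanism discussed above keeps a second copy of the weights that trails the trained weights. A minimal sketch of the update on plain Python lists follows; the decay value used in the usage note is an assumption for illustration, since the paper's setting is not given in this section.

```python
def ema_update(ema_weights, weights, decay):
    """One EMA step: move the averaged weights slightly toward the current ones."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]


def run_ema(weight_history, decay):
    """Apply the EMA update over a sequence of weight snapshots."""
    ema = list(weight_history[0])             # start from the first snapshot
    for weights in weight_history[1:]:
        ema = ema_update(ema, weights, decay)
    return ema
```

When the underlying weights stop changing, for instance once the learning rate has decayed to a tiny value, repeated updates drive the EMA copy to the weights themselves. With `decay=0.9`, 200 updates toward a fixed weight of 1.0 leave the EMA within 1e-6 of it, which matches the observation that the gap between the EMA model and the original model closes late in training.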

Figure 6: Training curve of the searched recipe on ResNeXt101-32x8d.

Model Original (Top-1, %) AutoTrain (Top-1, %) Gain
FBNetV2-L3 Wan et al. (2020) 79.1 79.9 +0.8
AlexNet Krizhevsky et al. (2012) 56.6 62.3 +5.7
ResNet34 He et al. (2016) 73.3 76.3 +3.0
ResNet50 He et al. (2016) 76.1 79.2 +3.1
ResNet101 He et al. (2016) 77.4 81.2 +3.8
ResNet152 He et al. (2016) 78.3 81.9 +3.6
DenseNet201 Huang et al. (2017) 77.2 80.2 +3.0
ResNeXt101 Xie et al. (2017) 79.3 82.6 +3.3
Table 2: Accuracy improvements with the searched training recipes. Above, ResNeXt101 refers to the 32x8d variant.

4.2 Search for efficient networks

Next, we perform a joint search of architecture and training recipes to discover compact DNNs. Note that based on our observations on stochastic weight averaging via EMA, we shrink the search space to always use EMA. Most of the settings are the same as in the AutoTrain search, while we increase the optimization iteration and set the FLOPs constraint for the sample pool from 400M to 800M. We pre-train the architecture embedding layer using 80% of the sample pool, which contains 20K samples, and plot the validation on the remaining 20% in Fig. 5. In the fine-grained search, we set four different FLOPs constraints: 450M, 550M, 650M, and 750M and discover four models (namely FBNetV3-B/C/D/E) with the same accuracy predictor. We further scale down and up the minimum and maximum models and generate FBNetV3-A and FBNetV3-F/G to fit more use scenarios, respectively, with compound scaling Tan and Le (2019). We provide more training details in the supplementary materials.

We compare our searched model against other relevant NAS baselines in Fig. 1, and list the detailed comparison in Table 3. Among all the existing efficient models such as EfficientNet, MobileNetV3, ResNeSt, and FBNetV2, our searched model delivers substantial improvements on the accuracy-efficiency trade-off. For example, our model achieves 80.4% top-1 accuracy with only 752M FLOPs, while the corresponding performance metrics for EfficientNet-B2 are 80.3% accuracy with 1G FLOPs. FBNetV3-G achieves the same accuracy with over 5× fewer FLOPs compared to ResNeSt.

Model Search method Search space FLOPs Accuracy (%, Top-1 ImageNet)
FBNet Wu et al. (2019a) gradient arch 375M 74.9
ProxylessNAS Cai et al. (2018) RL/gradient arch 465M 75.1
ChamNet Dai et al. (2019) predictor arch 553M 75.4
RegNet Radosavovic et al. (2020) pop. param. arch 600M 75.5
MobileNetV3-1.25x Howard et al. (2019) RL/NetAdapt arch 356M 76.6
EfficientNetB0 Tan and Le (2019) RL/scaling arch 390M 77.3
AtomNAS Mei et al. (2019) gradient arch 363M 77.6
FBNetV2-L2 Wan et al. (2020) gradient arch 423M 78.1
FBNetV3-A JointNAS arch/training 343M 78.0
ResNet152 He et al. (2016) manual - 11G 78.3
EfficientNetB1 Tan and Le (2019) RL/scaling arch 734M 79.2
ResNeXt101-32x8d Xie et al. (2017) manual - 7.8G 79.3
FBNetV3-C JointNAS arch/training 544M 79.6
EfficientNetB2 Tan and Le (2019) RL/scaling arch 1.0G 80.3
FBNetV3-E JointNAS arch/training 752M 80.4
EfficientNetB3 Tan and Le (2019) RL/scaling arch 1.8G 81.7
ResNeSt-101 Zhang et al. (2020) manual - 10.2G 82.3
FBNetV3-G JointNAS arch/training 2.0G 82.3
Table 3: Comparisons of different architecture search methods. For baselines, we cite statistics on ImageNet from the original papers. Our results are bolded. pop. param.: population parameterization.
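
The FLOPs comparisons quoted in the text can be checked directly against the rows of Table 3. The short sketch below does this arithmetic; the dictionary simply transcribes four rows of the table, whose FLOPs figures are rounded as published, and `flops_ratio` is a hypothetical helper name.

```python
# (FLOPs, top-1 accuracy in %) transcribed from Table 3; FLOPs are rounded there.
TABLE3 = {
    "EfficientNetB2": (1.0e9, 80.3),
    "FBNetV3-E": (752e6, 80.4),
    "ResNeSt-101": (10.2e9, 82.3),
    "FBNetV3-G": (2.0e9, 82.3),
}


def flops_ratio(baseline, model):
    """How many times more FLOPs the baseline needs than the model."""
    return TABLE3[baseline][0] / TABLE3[model][0]


def accuracy_gap(baseline, model):
    """Top-1 accuracy of the model minus that of the baseline, in points."""
    return TABLE3[model][1] - TABLE3[baseline][1]
```

With these rows, ResNeSt-101 needs about 5.1 times the FLOPs of FBNetV3-G at equal accuracy, and EfficientNetB2 needs about 1.33 times the FLOPs of FBNetV3-E while being 0.1 points less accurate.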

4.3 Transferability of the searched models

We further apply the searched models and training recipes to the CIFAR-10 dataset, which has 60K images from 10 classes Krizhevsky et al. (2009). We compare the performance of different models in Fig. 7. Note that, unlike Tan and Le (2019), which scales up the base input resolution to 224x224, we keep the original base input resolution of 32x32 and scale up the input resolutions for larger models based on the scaling ratio. We also replace the second stride-two block with a stride-one block to fit the low-resolution inputs. Again, our searched models significantly outperform EfficientNet baselines.

5 Ablation study and discussions

Architecture and training recipe pairing. Our method yields different training recipes for different models. For example, smaller models prefer less regularization (e.g., smaller stochastic depth drop ratio and mixup). We swap the training recipes for FBNetV3-B and FBNetV3-E, observing a significant accuracy drop for both models (Table 4). This highlights the importance of correct architecture-recipe pairings, emphasizing the downfall of conventional NAS: Ignoring the training recipe and only searching for the network architecture fails to obtain optimal performance.

Architecture FBNetV3-B train recipe FBNetV3-E train recipe
FBNetV3-B Arch 79.0% 78.5%
FBNetV3-E Arch 79.8% 80.4%
Table 4: Accuracy comparison for the searched models with swapped training recipes.

Model Fine-grained FLOPs Accuracy
FBNetV3-B Y 454M 79.0%
FBNetV3-B N 448M 78.1%
FBNetV3-E Y 752M 80.4%
FBNetV3-E N 746M 79.6%
Table 5: Performance improvement by the fine-grained search. Rows with Fine-grained = N are models derived from the coarse-grained search.

Fine-grained search improvements. Fine-grained search (i.e., predictor-based evolutionary search) yields substantial improvement on top of the coarse-grained search (i.e., constrained iterative optimization). To demonstrate this, we compare the best-performing candidates derived from the first search stage with the final searched FBNetV3 under the same FLOPs constraints (Table 5). We observe an accuracy drop of up to 0.9% if the fine-grained search is removed. Thus, the second search stage, though requiring only several CPU minutes, is equally crucial to the final models’ performance.

Generality of stochastic weight averaging via EMA. We observe that stochastic weight averaging via EMA yields significant accuracy gain for the classification tasks. We hypothesize that such a mechanism could be used as a general technique to improve other DNN models. To validate this, we employ it to train a RetinaNet Lin et al. (2017) on COCO object detection Lin et al. (2014) with ResNet50 and ResNet101 backbones. We follow most of the training settings in Wu et al. (2019b) but introduce EMA and a cosine learning rate. We observe similar training curves and behavior as in the classification tasks, as shown in Fig. 8. The generated RetinaNets with ResNet50 and ResNet101 backbones achieve 40.3 and 41.9 mAP, respectively, both of which substantially outperform the best reported values in Wu et al. (2019b) (38.7 and 40.4 for ResNet50 and ResNet101, respectively). A promising future direction is to study such techniques and extend them to other DNNs and applications.

Figure 7: Accuracy vs. FLOPs on CIFAR-10.
Figure 8: Training curves for RetinaNet with ResNet101 backbone on COCO object detection.

6 Conclusion

Previous neural architecture search methods overlook the significance of training hyperparameters (i.e., “training recipe”), leading to missed opportunities. However, our two-stage JointNAS does not: JointNAS features a coarse-grained search, using constrained iterative optimization, and a fine-grained search, using predictor-based evolutionary search. Both stages jointly search over both architectures and training recipes, leading to FBNetV3 with state-of-the-art per-FLOP accuracies. We furthermore show generalizable training hyperparameters, a multi-use predictor, and a number of ablation studies to support our design choices.

We thank Ross Girshick for the discussions.


References

  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures.
  • H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han (2020) Once for all: train one network and specialize it for efficient deployment. In International Conference on Learning Representations.
  • H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123.
  • X. Dai, H. Yin, and N. K. Jha (2017) NeST: A neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017.
  • X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia, et al. (2019) Chamnet: towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11398–11407.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 248–255.
  • Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
  • S. Han, H. Mao, and W. J. Dally (2015a) Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015b) Learning both weights and connections for efficient neural network. In Proc. Advances in Neural Information Processing Systems, pp. 1135–1143.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 770–778.
  • D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen (2019) Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
  • G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In European conference on computer vision, pp. 646–661.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In Proc. Advances in Neural Information Processing Systems, pp. 4107–4115.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  • R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755.
  • H. Liu, K. Simonyan, and Y. Yang (2018) Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: Figure 2, item 2.
  • D. Marin, Z. He, P. Vajda, P. Chatterjee, S. Tsai, F. Yang, and Y. Boykov (2019) Efficient segmentation: learning downsampling near semantic boundaries. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2131–2141. Cited by: §2.
  • J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. Yuille, and J. Yang (2019) AtomNAS: fine-grained end-to-end neural architecture search. arXiv preprint arXiv:1912.09640. Cited by: Table 3.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In Proc. Neural Information Processing Systems Workshop on Autodiff. Cited by: §4.
  • I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár (2020) Designing network design spaces. arXiv preprint arXiv:2003.13678. Cited by: Table 3.
  • E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §1, §2.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381. Cited by: §2, §3.3, Table 1.
  • M. Srinivas and L. M. Patnaik (1994) Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Trans. Systems, Man, and Cybernetics 24 (4), pp. 656–667. Cited by: §3.2.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le (2018) MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626. Cited by: §1, §2.
  • M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §4.2, §4.3, Table 3.
  • T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid (2017) A bayesian data augmentation approach for learning deep models. In Advances in neural information processing systems, pp. 2797–2806. Cited by: §2.
  • A. Wan, X. Dai, P. Zhang, Z. He, Y. Tian, S. Xie, B. Wu, M. Yu, T. Xu, K. Chen, et al. (2020) FBNetV2: differentiable neural architecture search for spatial and channel dimensions. arXiv preprint arXiv:2004.05565. Cited by: §2, Figure 6, §4.1, Table 3.
  • W. Wen, H. Liu, H. Li, Y. Chen, G. Bender, and P. Kindermans (2019) Neural predictor for neural architecture search. arXiv preprint arXiv:1912.00848. Cited by: 2nd item.
  • B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019a) Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §2, Table 3.
  • B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer (2017) Shift: A zero FLOP, zero parameter alternative to spatial convolutions. arXiv preprint arXiv:1711.08141. Cited by: §2.
  • B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893. Cited by: §2.
  • Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019b) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §5.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: Figure 6, Table 3.
  • S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: Stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §2.
  • C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka (2020) SqueezeSegV3: spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803. Cited by: §2.
  • T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018) NetAdapt: Platform-aware neural network adaptation for mobile applications. In Proc. European Conf. Computer Vision, Vol. 41, pp. 46. Cited by: §2.
  • H. Yin, P. Molchanov, Z. Li, J. M. Alvarez, A. Mallya, D. Hoiem, N. K. Jha, and J. Kautz (2019) Dreaming to distill: data-free knowledge transfer via deepinversion. arXiv preprint arXiv:1912.08795. Cited by: §2.
  • H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang, H. Lin, Y. Sun, T. He, J. Mueller, R. Manmatha, et al. (2020) ResNeSt: split-attention networks. arXiv preprint arXiv:2004.08955. Cited by: Table 3.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §3.3.
  • C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.
  • B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §2.

Appendix A Appendix

A.1 Architecture of FBNetV2-L3

Table 6 shows the architecture of FBNetV2-L3 (a scaled and improved version of FBNetV2-L2) for an input resolution of 256x256. This is the base architecture used in the training recipe search in Section 4.1. It achieves 79.1% top-1 accuracy on ImageNet with the original training recipe used for FBNetV2. With the searched training recipes, it achieves 79.9% top-1 accuracy.

block   k  e     c     n  s  se  act.
Conv    1  3     16    1  2  -   hswish
MBConv  3  1     16    2  1  N   hswish
MBConv  5  5.46  24    1  2  N   hswish
MBConv  5  1.79  24    1  1  N   hswish
MBConv  3  1.79  24    1  1  N   hswish
MBConv  5  1.79  24    2  1  N   hswish
MBConv  5  5.35  40    1  2  Y   hswish
MBConv  5  3.54  32    1  1  Y   hswish
MBConv  5  4.54  32    3  1  Y   hswish
MBConv  5  5.71  72    1  2  N   hswish
MBConv  3  2.12  72    1  1  N   hswish
Skip    -  -     72    -  -  -   hswish
MBConv  3  3.12  72    1  1  N   hswish
MBConv  3  5.03  128   1  1  N   hswish
MBConv  5  2.51  128   1  1  Y   hswish
MBConv  5  1.77  128   1  1  Y   hswish
MBConv  5  2.77  128   1  1  Y   hswish
MBConv  5  3.77  128   4  1  Y   hswish
MBConv  3  5.57  208   1  2  Y   hswish
MBConv  5  2.84  208   2  1  Y   hswish
MBConv  5  4.88  208   3  1  Y   hswish
Skip    -  -     248   -  -  -   hswish
MBPool  -  6     1984  1  -  -   hswish
FC      -  -     1000  1  -  -   -
Table 6: FBNetV2-L3 architecture configurations. The block notations are identical to Table 1. A Skip block is an identity connection if the input and output channels are equal, and a 1x1 convolution otherwise.
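
As a quick sanity check on Table 6, the following minimal sketch parses the rows and derives the overall downsampling factor and the number of MBConv layers. It assumes, as is conventional for such tables but not stated here, that the stride in a row is applied once (by the first of the n repeated layers) and that "-" means "not applicable". The names `TABLE6`, `parse_rows`, and the derived variables are illustrative only.

```python
# Rows copied from Table 6: block k e c n s se act.
TABLE6 = """\
Conv 1 3 16 1 2 - hswish
MBConv 3 1 16 2 1 N hswish
MBConv 5 5.46 24 1 2 N hswish
MBConv 5 1.79 24 1 1 N hswish
MBConv 3 1.79 24 1 1 N hswish
MBConv 5 1.79 24 2 1 N hswish
MBConv 5 5.35 40 1 2 Y hswish
MBConv 5 3.54 32 1 1 Y hswish
MBConv 5 4.54 32 3 1 Y hswish
MBConv 5 5.71 72 1 2 N hswish
MBConv 3 2.12 72 1 1 N hswish
Skip - - 72 - - - hswish
MBConv 3 3.12 72 1 1 N hswish
MBConv 3 5.03 128 1 1 N hswish
MBConv 5 2.51 128 1 1 Y hswish
MBConv 5 1.77 128 1 1 Y hswish
MBConv 5 2.77 128 1 1 Y hswish
MBConv 5 3.77 128 4 1 Y hswish
MBConv 3 5.57 208 1 2 Y hswish
MBConv 5 2.84 208 2 1 Y hswish
MBConv 5 4.88 208 3 1 Y hswish
Skip - - 248 - - - hswish
MBPool - 6 1984 1 - - hswish
FC - - 1000 1 - - -
"""


def parse_rows(table):
    """Parse each row into a dict; '-' entries become None."""
    rows = []
    for line in table.strip().splitlines():
        block, k, e, c, n, s, se, act = line.split()
        rows.append({
            "block": block,
            "c": int(c),
            "n": None if n == "-" else int(n),
            "s": None if s == "-" else int(s),
        })
    return rows


rows = parse_rows(TABLE6)

# Assumption: each row's stride is applied exactly once.
total_stride = 1
for row in rows:
    if row["s"] is not None:
        total_stride *= row["s"]

mbconv_layers = sum(r["n"] for r in rows if r["block"] == "MBConv")
final_resolution = 256 // total_stride  # spatial size before MBPool

print(total_stride, mbconv_layers, final_resolution)
```

Under these assumptions, the five stride-2 rows reduce the 256x256 input by a factor of 32 before the pooling block.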

A.2 Search settings and details

In the AutoTrain search experiment, we set the early-stop rank correlation threshold to 0.92 and find the corresponding early-stop epoch to be 103. In the fine-grained search, we set the population of the initial generation to 100 (the 50 best-performing candidates from the coarse-grained search and 50 randomly generated samples). We generate 12 children from each candidate and pick the top 40 candidates for the next generation. Most of these settings are shared by the joint search of architectures and training recipes, except that the early-stop epoch is 108. The accuracy predictor consists of one embedding layer (the architecture encoder layer) and one extra hidden layer. The embedding width is 24 for the joint search (note that there is no pre-trained embedding layer for the AutoTrain search). We set the minimum and maximum FLOPs constraints to 400M and 800M, respectively, for the joint search. The selection of best-performing samples in the constrained iterative optimization involves two steps: (1) divide the FLOP range into equal-width bins and (2) pick the sample with the highest predicted score within each bin.
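
The two-step, per-bin selection described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the number of bins is not given in the text, so `num_bins` is a placeholder, and the function name `select_per_bin` and the `(name, flops, predicted_score)` tuple layout are assumptions.

```python
def select_per_bin(candidates, flops_min=400e6, flops_max=800e6, num_bins=4):
    """Return the highest-scoring candidate in each equal-width FLOP bin.

    candidates: iterable of (name, flops, predicted_score) tuples.
    num_bins is a hypothetical value; the source does not state it.
    """
    width = (flops_max - flops_min) / num_bins
    best = {}
    for name, flops, score in candidates:
        # Discard candidates outside the FLOP constraints.
        if not (flops_min <= flops <= flops_max):
            continue
        # Step 1: map the candidate to its bin (the upper edge joins the last bin).
        idx = min(int((flops - flops_min) / width), num_bins - 1)
        # Step 2: keep only the best predicted score seen in that bin.
        if idx not in best or score > best[idx][2]:
            best[idx] = (name, flops, score)
    return [best[i] for i in sorted(best)]


# Made-up candidates, for illustration only.
candidates = [
    ("a", 410e6, 0.78),
    ("b", 430e6, 0.79),
    ("c", 610e6, 0.80),
    ("d", 800e6, 0.81),
    ("e", 900e6, 0.90),  # violates the 800M constraint
]
selected = select_per_bin(candidates)
print([name for name, _, _ in selected])
```

Picking one winner per bin, instead of a global top-k, keeps the selected samples spread across the whole FLOP range.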

We show the detailed searched training recipe in Table 7.

Notation  lr     optim    ema   p     d     m     wd
Value     0.026  RMSProp  true  0.17  0.09  0.19  7e-6
Table 7: Searched training recipe.

A.3 Training details

We use distributed training with 8 nodes for the final models, and scale up the learning rate by the number of distributed nodes (e.g., $8\times$ for 8-node training). We use synchronized batch normalization in distributed training. We train the models for 400 epochs with a learning rate decay factor of 0.9875 per epoch after a 5-epoch warmup. The batch size is set to 256 per node. We use label smoothing and AutoAugment in training. Additionally, we set the weight decay and momentum for batch normalization parameters to zero and 0.9, respectively. We train the scaled models FBNetV3-A and FBNetV3-F/G with the searched training recipes for FBNetV3-B and FBNetV3-E, respectively, only increasing the stochastic depth drop ratio for FBNetV3-F/G to 0.2.
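
The learning rate schedule described above can be sketched as a function of the epoch index. Several details are assumptions: the warmup shape is not specified (linear is used here), the decay is assumed to start counting once warmup ends, and the base rate of 0.026 is borrowed from Table 7 purely as an illustrative value. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.026, num_nodes=8,
                  warmup_epochs=5, decay=0.9875):
    """Per-epoch learning rate with node scaling, warmup, and exponential decay."""
    # Scale the base rate by the number of distributed nodes.
    peak_lr = base_lr * num_nodes
    if epoch < warmup_epochs:
        # Assumed linear warmup: reaches peak_lr at the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Multiply by the decay factor once per epoch after warmup.
    return peak_lr * decay ** (epoch - warmup_epochs)


schedule = [learning_rate(e) for e in range(400)]
print(schedule[0], schedule[5], schedule[-1])
```

With a decay of 0.9875 per epoch over 395 post-warmup epochs, the rate falls by more than two orders of magnitude by the end of training.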

We implement the EMA model as a copy of the original network (they share the same weights at $t = 0$). After each backward pass and model weight update, we update the EMA weights as

$$w^{ema}_{t+1} = \alpha \, w^{ema}_{t} + (1 - \alpha) \, w_{t+1},$$

where $w^{ema}_{t+1}$, $w^{ema}_{t}$, and $w_{t+1}$ refer to the EMA weight at step $t+1$, the EMA weight at step $t$, and the model weight at step $t+1$, and $\alpha$ is the EMA decay. We use an EMA decay of 0.99985, 0.999, and 0.9998 in our experiments on ImageNet, CIFAR-10, and COCO, respectively.
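
A minimal sketch of this EMA bookkeeping is given below, assuming the standard exponential-moving-average form (new EMA weight = decay times old EMA weight plus one minus decay times the freshly updated model weight). Weights are plain Python floats in a dict to keep the example self-contained; the name `ema_update` and the toy values are illustrative only.

```python
import copy


def ema_update(ema_weights, model_weights, decay=0.99985):
    """In-place EMA update, called after each optimizer step."""
    for name, weight in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * weight


# The EMA model starts as a copy of the network (same weights at step 0).
model = {"w": 1.0, "b": 0.0}
ema = copy.deepcopy(model)

# Stand-in for one backward pass plus weight update (hypothetical values).
model["w"] = 2.0
model["b"] = -1.0

# A small decay is used here only to make the effect visible.
ema_update(ema, model, decay=0.9)
print(ema)
```

With decays as close to 1 as those listed above, the EMA weights move very slowly and effectively average the model over thousands of recent steps.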

The AlexNet, ResNet, DenseNet, and ResNeXt models are taken from the PyTorch open-source implementations without any architecture change. The input resolution is 224x224.