1 Introduction
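Table 1: Performance of two randomly sampled models under two hyperparameter settings (LR: learning rate; L2: weight of the L2 penalty).
Hyperparameters  Model1  rank  Model2
HP1 (LR=5.5, L2=1.5e-4)  56.9%  >  55.6%
HP2 (LR=1.1, L2=8.4e-4)  54.7%  <  56.2%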
Neural Architecture Search (NAS) has brought significant improvements in many applications, such as machine perception Howard et al. (2019); Tan et al. (2020); Cai et al. (2020); Xie and Yuille (2017); Wu et al. (2019), language modeling Liu et al. (2019); Dong and Yang (2019b), and model compression Han (2018); Dong and Yang (2019a); Han et al. (2016). Most NAS works apply the same hyperparameters while searching for network architectures. For example, all models in Zoph and Le (2017); Tan and Le (2019); Ying et al. (2019) are trained with the same optimizer, learning rate, and weight decay. As a result, the relative ranking of models in the search space is determined only by their architectures. However, we observe that different models favor different hyperparameters. Table 1 shows the performance of two randomly sampled models with different hyperparameters: under hyperparameters HP1, Model1 outperforms Model2, but Model2 is better under HP2. These results suggest that using fixed hyperparameters in NAS leads to suboptimal results.
A natural question is: could we extend NAS to a broader scope for joint Hyperparameter and Architecture Search (HAS)? In HAS, each model can potentially be coupled with its own best hyperparameters, thus achieving better performance than existing NAS with fixed hyperparameters. However, jointly searching for architectures and hyperparameters is challenging. The first challenge is how to deal with both categorical and continuous values in the joint HAS search space. While architecture choices are mostly categorical values (e.g., convolution kernel size), hyperparameter choices can be both categorical (e.g., the type of optimizer) and continuous (e.g., weight decay). There is not yet a good solution to this challenge: previous NAS methods only focus on categorical search spaces, while hyperparameter optimization methods only focus on continuous search spaces. They thus cannot be directly applied to such a mixture of categorical and continuous search spaces. The second challenge is how to efficiently search over the larger joint HAS search space, as it combines both architecture and hyperparameter choices.
In this paper, we propose AutoHAS, a differentiable HAS algorithm. It is, to the best of our knowledge, the first algorithm that can efficiently handle the large joint HAS search space. To address the mixture of categorical and continuous search spaces, we first discretize each continuous hyperparameter into a linear combination of multiple categorical bases, so that both kinds of choices can be unified during search. As explained below, we use a differentiable method to search over this combination, i.e., the architecture and HP encodings in Fig. 1. These encodings represent the probability distribution over all candidates in the respective space. They can be used to find the best architecture together with its associated hyperparameters.
To efficiently navigate the much larger search space, we further introduce a novel weight sharing technique for AutoHAS. Weight sharing has been widely used in previous NAS approaches Pham et al. (2018); Liu et al. (2019) to reduce the search cost. The main idea is to train a SuperModel, where each candidate in the architecture space is its submodel. Using a SuperModel can avoid training millions of candidates from scratch Liu et al. (2019); Dong and Yang (2019b); Cai et al. (2019); Pham et al. (2018). Motivated by the weight sharing in NAS, AutoHAS extends its scope from architecture search to both architecture and hyperparameter search. We not only share the weights of the SuperModel with each architecture but also share this SuperModel across hyperparameters. At each search step, AutoHAS optimizes the shared SuperModel by a combination of the basis of HAS space, and the shared SuperModel serves as a good initialization for all hyperparameters at the next step of search (see Fig. 1 and Sec. 3).
In this paper, we focus on architecture, learning rate, and L2 penalty weight optimization, but it should be straightforward to apply AutoHAS to other hyperparameters. A summary of our results is in Fig. 2, which shows that AutoHAS outperforms many AutoML methods regarding both accuracy and efficiency (more details in Sec. 4.3). In Sec. 4, we show that it improves a number of computer vision and natural language processing models, i.e., MobileNetV2 Sandler et al. (2018), ResNet He et al. (2016), EfficientNet Tan and Le (2019), and BERT finetuning Devlin et al. (2019).
2 Related Works
Neural Architecture Search (NAS). Since the seminal works Baker et al. (2017); Zoph and Le (2017) showed promising improvements over manually designed architectures, more efforts have been devoted to NAS. The accuracy of the found architectures has been improved by carefully designed search spaces Zoph et al. (2018), better search methods Real et al. (2019), or compound scaling Tan and Le (2019). The model size and latency of the searched architectures have been reduced by Pareto optimization Tan et al. (2019); Wu et al. (2019); Cai et al. (2019, 2020) and an enlarged search space of network sizes Cai et al. (2020); Dong and Yang (2019a). The efficiency of NAS algorithms has been improved by weight sharing Pham et al. (2018), differentiable optimization Liu et al. (2019), or stochastic sampling Dong and Yang (2019b); Xie et al. (2019). These methods have found state-of-the-art architectures; however, their performance is bounded by the fixed or manually tuned hyperparameters.
Hyperparameter optimization (HPO). Black-box and multi-fidelity HPO methods have a long-standing history Bergstra and Bengio (2012); Hutter (2009); Hutter et al. (2011, 2019); Kohavi and John (1995). Black-box methods, e.g., grid search and random search Bergstra and Bengio (2012), regard the evaluation function as a black box. They sample some hyperparameters and evaluate them one by one to find the best. Bayesian methods can make the sampling procedure in random search more efficient Jones et al. (1998); Shahriari et al. (2015); Snoek et al. (2015). They employ a surrogate model and an acquisition function to decide which candidate to evaluate next Thornton et al. (2013). Multi-fidelity optimization methods accelerate the above methods by evaluating on a proxy task, e.g., using fewer training epochs or a subset of data Domhan et al. (2015); Jaderberg et al. (2017); Kohavi and John (1995); Li et al. (2017). These HPO methods are computationally expensive when searching for deep learning models Krizhevsky et al. (2012).
Recently, gradient-based HPO methods have shown better efficiency Baydin et al. (2018); Lorraine et al. (2020) by computing the gradient with respect to the hyperparameters. For example, Maclaurin et al. (2015) calculate the exact gradients w.r.t. hyperparameters. Pedregosa (2016) leverages the implicit function theorem to calculate an approximate hypergradient. Following that, different approximation methods have been proposed Lorraine et al. (2020); Pedregosa (2016); Shaban et al. (2019). Despite their efficiency, they can only be applied to differentiable hyperparameters such as weight decay, but not to non-differentiable hyperparameters, such as learning rate Lorraine et al. (2020) or optimizer Shaban et al. (2019). Our AutoHAS is not only as efficient as gradient-based HPO methods but also applicable to both differentiable and non-differentiable hyperparameters. Moreover, we show significant improvements on state-of-the-art models with large-scale datasets, which addresses the lack of strong empirical evidence in previous HPO work.
3 AutoHAS
3.1 Preliminaries
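HAS aims to find an architecture $\alpha$ and hyperparameters $h$ that achieve high performance on the validation set. HAS can be formulated as a bilevel optimization problem:
$$\min_{\alpha, h} \mathcal{L}(\alpha, \omega^*_\alpha, \mathcal{D}_{val}) \quad \text{s.t.} \quad \omega^*_\alpha = f_h(\alpha, \omega^0_\alpha, \mathcal{D}_{train}) \qquad (1)$$
where $\mathcal{L}$ indicates the objective function (e.g., cross-entropy loss) and $\omega^0_\alpha$ indicates the initial weights of the architecture $\alpha$. $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$ denote the training data and the validation data, respectively. $f_h$ represents the algorithm with hyperparameters $h$ to obtain the optimal weights $\omega^*_\alpha$, such as using SGD to minimize the training loss. In that case, $\omega^*_\alpha = \arg\min_{\omega} \mathcal{L}(\alpha, \omega, \mathcal{D}_{train})$. We can also use a HyperNetwork Ha et al. (2017) to generate the weights $\omega^*_\alpha$.
HAS generalizes both NAS and HPO by introducing a broader search space. On the one hand, NAS is a special case of HAS, where the hyperparameters $h$ used by the inner optimization $f_h$ are fixed. On the other hand, HPO is a special case of HAS, where $\alpha$ is fixed in Eq. (1).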
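To make the bilevel structure of Eq. (1) concrete, the following minimal sketch solves a toy instance by brute force. Everything in it is an illustrative assumption rather than part of AutoHAS: an "architecture" is reduced to a choice of feature map, the only hyperparameter is the learning rate of plain gradient descent, and the names `ARCHS`, `LEARNING_RATES`, `train`, and `brute_force_has` are hypothetical. The inner function plays the role of the training algorithm, and the outer loop selects the pair with the lowest validation loss.

```python
import itertools

# Toy stand-ins (assumptions): the model is w * arch(x) with a single weight w.
ARCHS = {"linear": lambda x: x, "quadratic": lambda x: x * x}
LEARNING_RATES = [0.001, 0.01, 0.1]


def loss(arch, w, data):
    """Mean squared error of the one-parameter model w * arch(x)."""
    return sum((w * arch(x) - y) ** 2 for x, y in data) / len(data)


def train(arch, lr, data, w0=0.0, steps=50):
    """Inner problem: gradient descent on the training loss, starting from w0."""
    w = w0
    for _ in range(steps):
        grad = sum(2 * (w * arch(x) - y) * arch(x) for x, y in data) / len(data)
        w -= lr * grad
    return w


def brute_force_has(train_data, val_data):
    """Outer problem: keep the (architecture, learning rate) pair with the lowest validation loss."""
    best = None
    for (name, arch), lr in itertools.product(ARCHS.items(), LEARNING_RATES):
        w = train(arch, lr, train_data)
        val_loss = loss(arch, w, val_data)
        if best is None or val_loss < best[0]:
            best = (val_loss, name, lr)
    return best


# Synthetic data generated by y = 3 * x^2 (an assumption for illustration).
train_data = [(x, 3 * x * x) for x in (0.5, 1.0, 1.5, 2.0)]
val_data = [(x, 3 * x * x) for x in (0.75, 1.25, 1.75)]
best_loss, best_arch, best_lr = brute_force_has(train_data, val_data)
```

This exhaustive loop trains every pair from scratch, so its cost grows with the product of the two candidate sets; that cost is what a weight-sharing, differentiable search is meant to avoid.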
3.2 Representation of the HAS Search Space in AutoHAS
The search space of HAS in AutoHAS is a Cartesian product of the architecture and hyperparameter candidates. To search over this mixed search space, we need a unified representation of different searchable components, i.e., architectures, learning rates, optimizers, etc.
Architecture Search Space. We use the simplest case as an example. First, let the set of predefined candidate operations (e.g., 3x3 convolution, pooling, etc.) be $\mathcal{O} = \{O_1, O_2, \ldots, O_n\}$, where the cardinality of $\mathcal{O}$ is $n$. Suppose an architecture is constructed by stacking multiple layers, where each layer takes a tensor $F$ as input and outputs $\pi(F)$, which serves as the next layer's input. $\pi \in \mathcal{O}$ denotes the operation at a layer and might be different at different layers. Then a candidate architecture $\alpha$ is essentially the sequence of $\pi$ for all layers. Further, a layer can be represented as a linear combination of the operations in $\mathcal{O}$ as follows:
$$\pi(F) = \sum_{i=1}^{n} C^\alpha_i O_i(F) \quad \text{s.t.} \quad \sum_{i=1}^{n} C^\alpha_i = 1, \; C^\alpha_i \in \{0, 1\} \qquad (2)$$
where $C^\alpha_i$ (the $i$-th element of the vector $C^\alpha$) is the coefficient of operation $O_i$ for a layer. We call the set of all coefficients $\mathcal{A} = \{C^\alpha \text{ for all layers}\}$ the architecture encoding, which can then represent the search space of the architecture.
Hyperparameter Search Space. Now we can define the hyperparameter search space in a similar way. The major difference is that we have to consider both categorical and continuous cases:
$$h = \sum_{i=1}^{m} C^h_i \mathcal{B}_i \quad \text{s.t.} \quad \sum_{i=1}^{m} C^h_i = 1, \; C^h_i \in \begin{cases} [0, 1] & \text{if continuous} \\ \{0, 1\} & \text{if categorical} \end{cases} \qquad (3)$$
where $\mathcal{B}$ is a predefined set of hyperparameter bases with cardinality $m$ and $\mathcal{B}_i$ is the $i$-th basis in $\mathcal{B}$. $C^h_i$ (the $i$-th element of the vector $C^h$) is the coefficient of hyperparameter basis $\mathcal{B}_i$. If we have a continuous hyperparameter, we have to discretize it into a linear combination of bases so that categorical and continuous hyperparameters are unified. For example, for weight decay, $\mathcal{B}$ could be {1e-1, 1e-2, 1e-3}, and therefore, all possible weight decay values can be represented as a linear combination over $\mathcal{B}$. For categorical hyperparameters, taking the optimizer as an example, $\mathcal{B}$ could be {Adam, SGD, RMSProp}. In this case, a constraint on $C^h_i$ is applied: $C^h_i \in \{0, 1\}$ as in Eq. (3). When there are multiple different types of hyperparameters, each of them has its own $C^h$. The hyperparameter basis becomes the Cartesian product of their individual bases and the coefficient is the product of the corresponding $C^h_i$. We name the set of all coefficients $\mathcal{H} = \{C^h \text{ for all types of hyperparameters}\}$ the hyperparameter encoding, which can then represent the search space of hyperparameters.
3.3 AutoHAS: Automated Hyperparameter and Architecture Search
Since each candidate in the HAS search space can be represented by a pair of a hyperparameter encoding $\mathcal{H}$ and an architecture encoding $\mathcal{A}$, the search problem is converted to optimizing the encodings $\mathcal{H}$ and $\mathcal{A}$. However, it is computationally prohibitive to compute the exact gradient of $\mathcal{L}$ in Eq. (1) w.r.t. $\mathcal{H}$ and $\mathcal{A}$. Instead, we propose a simple approximation strategy with weight sharing to accelerate this procedure.
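As a small illustration of how a pair of encodings identifies one candidate, the sketch below decodes a hand-written hyperparameter encoding and architecture encoding. The weight decay and optimizer bases are the examples given in Sec. 3.2; the operation names, the coefficient values, and the helper names `combine` and `select` are assumptions made only for this example.

```python
def combine(coeffs, basis):
    """Continuous hyperparameter: convex combination of the basis values, as in Eq. (3)."""
    assert abs(sum(coeffs) - 1.0) < 1e-9 and all(0.0 <= c <= 1.0 for c in coeffs)
    return sum(c * b for c, b in zip(coeffs, basis))


def select(one_hot, basis):
    """Categorical choice: exactly one coefficient is 1 and the rest are 0."""
    assert sorted(one_hot) == [0] * (len(one_hot) - 1) + [1]
    return basis[one_hot.index(1)]


WEIGHT_DECAY_BASIS = [1e-1, 1e-2, 1e-3]       # example basis from the text
OPTIMIZER_BASIS = ["Adam", "SGD", "RMSProp"]  # example basis from the text
OPERATIONS = ["conv3x3", "conv5x5", "pool"]   # hypothetical candidate operations

# Hypothetical encodings: one coefficient vector per hyperparameter and per layer.
hp_encoding = {"weight_decay": [0.0, 0.5, 0.5], "optimizer": [0, 1, 0]}
arch_encoding = [[1, 0, 0], [0, 0, 1]]

weight_decay = combine(hp_encoding["weight_decay"], WEIGHT_DECAY_BASIS)
optimizer = select(hp_encoding["optimizer"], OPTIMIZER_BASIS)
architecture = [select(row, OPERATIONS) for row in arch_encoding]
```

Because the continuous coefficients are non-negative and sum to one, the decoded value always lies between the smallest and largest basis values, so the choice of basis fixes the range that can be searched.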
First of all, we leverage a SuperModel to share weights among all candidate architectures in the architecture space, where each candidate is a submodel of this SuperModel Pham et al. (2018); Liu et al. (2019). The weights of the SuperModel, $\mathcal{W}$, are the union of the weights of all basis operations in each layer. The weights of an architecture $\alpha$ can thus be represented by $\omega_\alpha$, a subset of $\mathcal{W}$. Computing the exact gradients of $\mathcal{L}$ w.r.t. $\mathcal{H}$ and $\mathcal{A}$ requires backpropagating through the initial network state $\omega^0_\alpha$, which is too expensive. Inspired by Liu et al. (2019); Pham et al. (2018), we approximate it using the current SuperModel weights $\mathcal{W}$ as follows:
$$\omega^*_\alpha = f_h(\alpha, \omega^0_\alpha, \mathcal{D}_{train}) \approx f_{\mathcal{H}}(\mathcal{A}, \mathcal{W}, \mathcal{D}_{train}) \qquad (4)$$
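Ideally, we should backpropagate through $f_{\mathcal{H}}$ to modify the encoding $\mathcal{H}$. However, $f$ might be a complex optimization algorithm that does not allow backpropagation. To solve this problem, we regard $f$ as a black-box function and reformulate $f_{\mathcal{H}}$ as follows:
$$f_{\mathcal{H}}(\mathcal{A}, \mathcal{W}, \mathcal{D}_{train}) \approx \sum_{i=1}^{m} C^h_i \, f_{\mathcal{B}_i}(\mathcal{A}, \mathcal{W}, \mathcal{D}_{train}) \qquad (5)$$
In this way, $f_{\mathcal{H}}$ is calculated as a weighted sum of the weights generated by each $f_{\mathcal{B}_i}$, with the coefficients $C^h_i$ of the hyperparameter encoding as the weights of the sum.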
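The weighted sum in Eq. (5) can be sketched for the simplest setting, where the searched hyperparameter is the learning rate and the training algorithm is one step of plain SGD. This is an assumption-laden illustration, not the authors' implementation: weights and gradients are flat Python lists, and `sgd_step` and `shared_update` are hypothetical names.

```python
def sgd_step(weights, grads, lr):
    """One plain SGD step, treated as a black-box training algorithm for a single basis value."""
    return [w - lr * g for w, g in zip(weights, grads)]


def shared_update(weights, grads, coeffs, lr_basis):
    """Combine the weights produced by every basis learning rate, weighted by the coefficients."""
    candidates = [sgd_step(weights, grads, lr) for lr in lr_basis]
    return [
        sum(c * cand[j] for c, cand in zip(coeffs, candidates))
        for j in range(len(weights))
    ]


# Hypothetical numbers for illustration only.
weights = [1.0, -2.0]
grads = [0.5, -1.0]
lr_basis = [0.01, 0.1, 1.0]

hard = shared_update(weights, grads, [0, 1, 0], lr_basis)        # one-hot coefficients
soft = shared_update(weights, grads, [0.5, 0.5, 0.0], lr_basis)  # soft coefficients
```

With one-hot coefficients the result is exactly the step taken with the selected learning rate. For a single plain SGD step, the soft combination equals one step with the coefficient-weighted learning rate (here 0.055); the black-box formulation does not rely on this property, which is specific to this simple update rule.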
In practice, it is not easy to directly optimize the encodings $\mathcal{H}$ and $\mathcal{A}$, because they naturally have some constraints associated with them, such as Eq. (3). Inspired by the continuous relaxation Liu et al. (2019); Dong and Yang (2019b), we instead use another set of relaxed variables $\tilde{\mathcal{H}} = \{\tilde{C}^h\}$ and $\tilde{\mathcal{A}} = \{\tilde{C}^\alpha\}$ to calculate $\mathcal{H}$ and $\mathcal{A}$. $\tilde{\mathcal{H}}$ and $\tilde{\mathcal{A}}$ have the same dimensions as $\mathcal{H}$ and $\mathcal{A}$. The calculation procedure encapsulates the constraints of $\mathcal{A}$ and $\mathcal{H}$ in Eq. (2) and Eq. (3) as follows:
$$C^h = \text{one\_hot}\big(\arg\max_i \hat{C}^h_i\big) \qquad (6)$$
$$\hat{C}^h_i = \frac{\exp\big((\tilde{C}^h_i + o_i)/\tau\big)}{\sum_{k=1}^{m} \exp\big((\tilde{C}^h_k + o_k)/\tau\big)} \qquad (7)$$
$$o_i = -\log(-\log(u_i)), \quad u_i \sim \text{Uniform}(0, 1) \qquad (8)$$
where $\hat{C}^h$ is computed by applying the Gumbel-Softmax function Jang et al. (2017); Maddison et al. (2017) on $\tilde{C}^h$. $\tau$ is a temperature value and the $o_i$ are i.i.d. samples drawn from Gumbel(0,1). The Gumbel-Softmax in Eq. (7) incorporates a stochastic procedure during search. It can help explore more candidates in the HAS search space and avoid overfitting to some suboptimal architecture and hyperparameters. We use the same procedure as Eq. (6)-(8) to define $C^\alpha$ and $\hat{C}^\alpha$ for the architecture encodings. Ideally, the encodings should be optimized with Eq. (6) by backpropagation, but unfortunately the one-hot encodings $C^h$ and $C^\alpha$ are not differentiable. To address this issue, we follow Dong and Yang (2019b); Jang et al. (2017); Maddison et al. (2017) to relax the one-hot encodings: in the forward pass, we use the one-hot encodings to compute the validation loss, but in the backward pass, we apply the relaxation and substitute $C^h$ by $\hat{C}^h$ during backpropagation.
We describe our AutoHAS algorithm in Algorithm 1. During search, we jointly optimize the SuperModel weights $\mathcal{W}$ and the relaxed encodings $\tilde{\mathcal{A}}$ and $\tilde{\mathcal{H}}$ in an iterative way. $\mathcal{W}$ is updated as follows:
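$$\mathcal{W} \leftarrow f_{\mathcal{H}}(\mathcal{A}, \mathcal{W}, \mathcal{D}_{train}) \approx \sum_{i=1}^{m} C^h_i \, f_{\mathcal{B}_i}(\mathcal{A}, \mathcal{W}, \mathcal{D}_{train}) \qquad (9)$$
where $f$ is a training algorithm: in our experiments, it is implemented as minimizing the training loss with hyperparameter basis $\mathcal{B}_i$ for one step. Notably, in Eq. (9), $\mathcal{H}$ is computed from $\tilde{\mathcal{H}}$ and $\mathcal{A}$ is computed from $\tilde{\mathcal{A}}$.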
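The stochastic step that turns relaxed variables into an encoding, i.e., Gumbel-Softmax followed by a one-hot conversion, can be sketched in plain Python as follows. The function names and the numerical guard on the uniform sample are assumptions made for this example, and since no automatic differentiation is used, the sketch simply returns both the soft vector (which would carry the gradient in the backward pass) and the hard one-hot vector (which would be used in the forward pass).

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse transform -log(-log(u))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1
    return -math.log(-math.log(u))


def gumbel_softmax(relaxed, tau, rng):
    """Softmax over noise-perturbed relaxed variables at temperature tau."""
    noisy = [(value + sample_gumbel(rng)) / tau for value in relaxed]
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(probs):
    """One-hot vector marking the largest entry."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == k else 0 for i in range(len(probs))]


def sample_encoding(relaxed, tau, rng):
    """Return (soft, hard): soft is used for gradients, hard for the forward pass."""
    soft = gumbel_softmax(relaxed, tau, rng)
    return soft, one_hot_argmax(soft)


rng = random.Random(0)
soft, hard = sample_encoding([0.0, 0.0, 50.0], tau=1.0, rng=rng)
```

Dropping the noise term recovers a plain softmax, and returning only the soft vector corresponds to a non-discretized variant; both differ from the hard, stochastic sampling sketched here.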
3.4 Deriving Hyperparameters and Architecture
After obtaining the optimized encoding of architecture and hyperparameters following Sec. 3.3, we use them to derive the final architecture and hyperparameters. For hyperparameters, we apply different strategies to the continuous and categorical values:
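$$h = \begin{cases} \sum_{i=1}^{m} p_i \, \mathcal{B}_i & \text{if continuous} \\ \mathcal{B}_{\arg\max_i p_i} & \text{if categorical} \end{cases} \qquad (10)$$
where $p_i$ denotes the learned probability of the $i$-th basis $\mathcal{B}_i$ in the optimized hyperparameter encoding.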
For architectures, since all values are categorical, we apply the same strategy in Eq. (10) for categorical values.
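The two deriving strategies can be sketched as follows. The sketch assumes that the learned probabilities are obtained with a plain softmax over the optimized relaxed variables, which is a choice made for this example and is not spelled out in the text; the function names and the numeric values are likewise hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def derive_continuous(relaxed, basis):
    """Continuous value: probability-weighted sum of the basis values."""
    probs = softmax(relaxed)
    return sum(p * b for p, b in zip(probs, basis))


def derive_categorical(relaxed, basis):
    """Categorical value: the basis entry with the highest probability."""
    probs = softmax(relaxed)
    return basis[max(range(len(basis)), key=probs.__getitem__)]


# Hypothetical optimized relaxed variables.
weight_decay = derive_continuous([0.0, 0.0, 0.0], [1e-1, 1e-2, 1e-3])
optimizer = derive_categorical([0.1, 2.0, -1.0], ["Adam", "SGD", "RMSProp"])
kernel_size = derive_categorical([0.3, 1.5, 0.2], [3, 5, 7])
```

The weighted sum lets a continuous hyperparameter take values between the basis points, whereas picking only the most probable basis restricts it to the discretization grid.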
Notably, unlike other fixed hyperparameters, the learning rate can have different values at each training step, so it is a list of continuous values instead of a single scalar. To deal with this special case, we use Eq. (10) to derive the continuous learning rate value at each searching step, such that we can obtain a list of learning rate values corresponding to each specific step.
After we derive the final architecture and hyperparameters as in Algorithm 1, we will use the searched hyperparameters to retrain the searched architecture.
4 Experiments
4.1 Experimental Settings
Datasets. We demonstrate the effectiveness of our AutoHAS on five vision datasets, i.e., ImageNet Deng et al. (2009), Birdsnap Berg et al. (2014), CIFAR10, CIFAR100 Krizhevsky and Hinton (2009), and Cars Krause et al. (2013), and an NLP dataset, i.e., SQuAD 1.1 Rajpurkar et al. (2016).
Searching settings. We call the hyperparameters that control the behavior of AutoHAS meta hyperparameters. For the meta hyperparameters, we fix the temperature $\tau$ in Gumbel-Softmax and employ the Adam optimizer with a fixed learning rate of 0.002. Notably, we use the same meta hyperparameters for all search experiments. The number of searching epochs and the batch size are set to be the same as in the training settings of the baseline models, i.e., they can be different for different baseline models. When searching for MBConv-based models Tan et al. (2019); Sandler et al. (2018), we search for the kernel size from {3, 5, 7} and the expansion ratio from {3, 6}. For vision tasks, the hyperparameter basis for a continuous value is the product of the default value and the multipliers {0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0}. For the NLP task, we use smaller multipliers {0.01, 0.05, 0.1, 0.5, 1.0, 1.2, 1.5} since they are for finetuning on top of pretrained models. If a model has a default learning rate schedule, we create a range of values around the default learning rate at each step and use AutoHAS to find the best learning rate at each step.
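The construction of the search space from default values and multipliers can be sketched as below. The multipliers and the kernel size and expansion ratio candidates are the ones listed above; the default values (peak learning rate 1.6 and L2 weight 1e-4, taken from the ImageNet setup described elsewhere in the paper), the helper names, and the short example schedule are assumptions for illustration.

```python
import itertools

VISION_MULTIPLIERS = [0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
NLP_MULTIPLIERS = [0.01, 0.05, 0.1, 0.5, 1.0, 1.2, 1.5]


def make_basis(default_value, multipliers):
    """Hyperparameter basis: the default value scaled by every multiplier."""
    return [default_value * m for m in multipliers]


def per_step_basis(default_schedule, multipliers):
    """One basis per training step, centered on the default schedule value at that step."""
    return [make_basis(lr, multipliers) for lr in default_schedule]


# Assumed defaults for the ImageNet experiments.
lr_basis = make_basis(1.6, VISION_MULTIPLIERS)
l2_basis = make_basis(1e-4, VISION_MULTIPLIERS)

# Architecture candidates of one MBConv block: (kernel size, expansion ratio).
mbconv_choices = list(itertools.product([3, 5, 7], [3, 6]))

# A hypothetical three-step default schedule, only to show the per-step layout.
schedule_basis = per_step_basis([0.0, 0.8, 1.6], NLP_MULTIPLIERS)
```

Each MBConv block therefore has six architecture candidates, and each continuous hyperparameter has nine (vision) or seven (NLP) basis values per step.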
Training settings. On vision datasets, we use six models, i.e., three variants of MobileNetV2 (MNet2), EfficientNetA0 (ENetA0), ResNet18, and ResNet50. We use the batch size of 4096 for ImageNet and 1024 for other vision datasets. We use the same data augmentation as shown in He et al. (2016). On the NLP dataset SQuAD, we finetune the pretrained BERT model and follow the setting of Devlin et al. (2019). The number of training epochs is different for different datasets, and we will explain their details later. For learning rate and weight decay, we use the values found by AutoHAS.
4.2 Ablation Studies
Table 2: Accuracy of different searching and deriving strategies.
Searching Strategy  Deriving Strategy  MNet2 (S0)  MNet2
Softmax  Eq. (6)  44.0%  63.5%
GS (soft)  Eq. (6)  45.5%  65.2%
GS (hard)  Eq. (6)  45.9%  66.4%
Softmax  Eq. (10)  40.8%  61.4%
GS (soft)  Eq. (10)  41.5%  67.0%
GS (hard)  Eq. (10)  46.3%  67.5%
We conducted a series of experiments to study the effect of (I) different searching strategies; (II) different deriving strategies; (III) AutoHAS-searched vs. manually tuned hyperparameters.
The effect of searching strategies. One of the key questions in searching is how to relax and optimize the architecture and hyperparameter encodings. Our AutoHAS leverages Gumbel-Softmax in Eq. (7) to stochastically explore different hyperparameter and architecture bases. We evaluate two different variants in Table 2. “Softmax” does not add the Gumbel-distributed noise and performs poorly compared to using Gumbel-Softmax. This strategy has an overfitting problem, which is also found in NAS Dong and Yang (2019b); Xie et al. (2019); Dong and Yang (2020); Wu et al. (2019). “GS (soft)” does not convert the Gumbel-Softmax output of Eq. (7) into a one-hot vector and thus explores too many hyperparameters during searching. As a result, its optimization might become difficult and the found hyperparameters are worse than those of AutoHAS.
The effect of deriving strategies. We evaluate two kinds of strategies to derive the final hyperparameters and architectures. The vanilla strategy is to follow previous NAS methods: selecting the basis hyperparameters with the maximal probability. However, it does not work well for the continuous choices. As shown in Table 2, our proposed strategy “GS (hard) + Eq. (10)” improves the accuracy by 0.4% on MNet2 (S0) and by 1.1% on MNet2 compared to the vanilla strategy “GS (hard) + Eq. (6)”.
Searched hyperparameters vs. manually tuned hyperparameters. We show the searched and manually tuned hyperparameters in Fig. 3. For the weight of the L2 penalty, it is interesting that AutoHAS suggests using a large penalty for the large models (ResNet) at the beginning and decaying it to a smaller value at the end of searching. With manual tuning, every model has to be tuned one by one to obtain its optimal hyperparameters. In contrast, AutoHAS only requires tuning two meta hyperparameters, with which it can successfully find good hyperparameters for tens of models. Besides, some hyperparameters, such as the learning rate, change dynamically at every training step. It is hard for humans to tune such per-step values, while AutoHAS can deal with such hyperparameters.
4.3 AutoHAS for Vision Datasets
ImageNet: We first apply AutoHAS to ImageNet and compare the performance with previous AutoML algorithms. We choose the hyperparameters used for ResNet He et al. (2016) as our default hyperparameters: warm up the learning rate over the first 5 epochs from 0 to the peak value (1.6 for our batch size of 4096; see Appendix A.2) and decay it to 0 via a cosine annealing schedule Goyal et al. (2017); use 1e-4 as the weight of the L2 penalty. Since these hyperparameters have been heavily tuned by human experts, there is limited headroom to improve. Therefore, we study how to train a model to achieve good performance in a shorter time, i.e., 30 epochs.
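The default learning rate schedule described above (linear warm-up followed by cosine annealing) can be written down as a short function. This is a sketch under stated assumptions: the peak value 1.6 is the one given in Appendix A.2 for a batch size of 4096, the schedule is expressed in (possibly fractional) epochs rather than steps, and the cosine phase is assumed to start at the end of the warm-up; the function name `default_lr` is hypothetical.

```python
import math


def default_lr(epoch, peak=1.6, warmup_epochs=5, total_epochs=30):
    """Linear warm-up from 0 to the peak, then cosine decay from the peak to 0."""
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each of the 30 epochs.
schedule = [default_lr(epoch) for epoch in range(30)]
```

A per-step search would build one set of candidate values around each entry of such a schedule instead of searching for a single scalar.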
Table 3: Accuracy (%, mean±variance) on ImageNet for different searching methods. LR: learning rate; L2: weight of the L2 penalty; A: architecture.
Type  Model  default HP  RS Bergstra and Bengio (2012)  Vizier Golovin et al. (2017)  IFT Lorraine et al. (2020)  HGD Baydin et al. (2018)  AutoHAS
LR  MNet2 (S0)  44.6±0.6  12.3±8.7  6.1±4.5  N/A  29.6±2.1  44.8±0.4
LR  MNet2 (T0)  52.4±0.5  17.5±3.0  14.3±20.0  N/A  33.0±4.4  52.0±0.2
LR  MNet2  66.8±0.2  38.9±5.6  49.0±3.6  N/A  49.1±4.6  66.9±0.1
LR  ENetA0  60.8±0.0  46.6±1.2  50.8±0.8  N/A  50.0±1.4  61.0±0.0
LR  ResNet18  67.6±0.1  60.4±1.2  63.5±0.2  N/A  56.3±0.5  67.9±0.2
LR  ResNet50  74.8±0.1  67.2±0.1  71.1±0.3  N/A  62.3±0.3  75.2±0.1
L2  MNet2 (S0)  44.6±0.6  45.9±0.7  46.3±0.2  46.2±0.1  N/A  46.3±0.1
L2  MNet2 (T0)  52.4±0.5  52.2±0.0  52.4±0.4  52.5±0.2  N/A  53.5±0.3
L2  MNet2  66.8±0.2  66.4±0.8  67.0±0.2  66.4±0.2  N/A  67.5±0.1
L2  ENetA0  60.8±0.0  60.0±2.0  62.0±0.2  61.1±0.2  N/A  62.2±0.1
L2  ResNet18  67.6±0.1  67.9±0.2  67.6±0.1  66.6±0.3  N/A  67.9±0.0
L2  ResNet50  74.8±0.1  75.0±0.1  74.8±0.1  73.1±0.4  N/A  75.0±0.1
LR+L2  MNet2 (S0)  44.6±0.6  13.1±10.9  15.2±7.3  N/A  N/A  45.7±0.3
LR+L2  MNet2 (T0)  52.4±0.5  29.3±20.6  30.2±15.9  N/A  N/A  53.8±0.2
LR+L2  MNet2  66.8±0.2  21.6±15.1  25.2±14.6  N/A  N/A  67.3±0.1
LR+L2  ENetA0  60.8±0.0  47.3±4.7  49.3±2.4  N/A  N/A  61.5±0.1
LR+L2  ResNet18  67.6±0.1  54.2±8.5  53.5±0.5  N/A  N/A  67.8±0.0
LR+L2  ResNet50  74.8±0.1  67.4±4.7  66.7±1.9  N/A  N/A  74.8±0.1
A+LR+L2  MNet2 (S0)  44.6±0.6  22.4±12.4  25.4±4.1  46.4±0.4  N/A  47.5±0.3
A+LR+L2  ENetA0  60.8±0.0  53.4±5.7  56.4±3.9  61.8±0.5  N/A  62.9±0.2
Table 4: Model size, training time, and search cost in seconds of different searching methods. The number in parentheses is the ratio of the search cost to the training time.
Model  Params (MB)  FLOPs (M)  Train Time (seconds)  RS / Vizier  IFT-Neumann  HGD  AutoHAS
MNet2 (S0)  1.49  35.0  2.0e3  1.9e4 (9.4)  2.0e3 (1.0)  2.6e3 (1.3)  2.8e3 (1.4)
MNet2 (T0)  1.77  89.5  2.1e3  2.0e4 (9.3)  2.5e3 (1.2)  4.1e3 (2.0)  2.4e3 (1.2)
MNet2  3.51  307.3  2.4e3  1.8e4 (7.5)  5.7e3 (2.3)  2.5e3 (1.1)  4.7e3 (1.9)
ENetA0  2.17  76.2  1.4e3  1.2e4 (8.7)  2.2e3 (1.6)  1.9e3 (1.4)  2.2e3 (1.6)
ResNet18  11.69  1818  2.0e3  1.9e4 (9.6)  2.7e3 (1.4)  2.2e3 (1.1)  2.2e3 (1.1)
ResNet50  25.56  4104  2.6e3  2.0e4 (7.6)  2.9e3 (1.1)  2.8e3 (1.1)  2.8e3 (1.1)
Table 3 and Table 4 show the performance comparison. There are some interesting observations: (I) AutoHAS is applicable to searching for almost all kinds of hyperparameters and architectures, while previous hypergradient-based methods Lorraine et al. (2020); Baydin et al. (2018) can only be applied to some hyperparameters. (II) AutoHAS shows improvements in seven different representative models, including both lightweight and heavy models. (III) The hyperparameters found by AutoHAS outperform the (default) manually tuned hyperparameters. (IV) The hyperparameters found by AutoHAS outperform those found by other AutoML algorithms. (V) Searching over the large joint HAS search space can obtain better results compared to searching for hyperparameters only. (VI) Gradient-based AutoML algorithms are more efficient than black-box optimization methods, such as random search and Vizier.
Table 5: Accuracy (%, mean±variance) on four smaller vision datasets.
Method  Birdsnap  CIFAR10  CIFAR100  Cars
default  51.7±0.7  93.9±0.2  75.7±0.2  72.0±0.8
GDAS (Arch) Dong and Yang (2019b)  55.8±0.6  93.5±0.0  76.4±0.5  77.0±2.5
AutoHAS (HP)  54.4±0.7  93.9±0.1  76.0±0.1  77.7±0.1
AutoHAS (HP+Arch)  56.5±0.9  93.7±0.3  76.0±0.4  80.3±0.5
Smaller datasets: To analyze the effect of architecture and hyperparameters, we compare AutoHAS with two variants: searching for architecture only, i.e., GDAS (Arch), and searching for hyperparameters only, i.e., AutoHAS (HP). The results on four datasets are shown in Table 5. On Birdsnap and Cars, AutoHAS significantly outperforms GDAS (Arch) and AutoHAS (HP). On CIFAR100, the accuracy of AutoHAS is similar to that of GDAS (Arch) and AutoHAS (HP), while all of them outperform the default. On CIFAR10, the accuracy of the auto-tuned architecture or hyperparameters is similar to or slightly lower than the default. This might be because the default choices are close to the optimal solution in the current HAS search space on CIFAR10.
4.4 AutoHAS for SQuAD
To further validate the generalizability of AutoHAS, we also conduct experiments on a reading comprehension dataset in the NLP domain, i.e., SQuAD 1.1 Rajpurkar et al. (2016). We pretrain a BERT model following Devlin et al. (2019) and then apply AutoHAS when finetuning it on SQuAD 1.1. In particular, we search for the per-step learning rate and weight decay of Adam. For AutoHAS, we split the training set of SQuAD 1.1 into 80% for training and 20% for validation. In Fig. 4, we show the results on the dev set, and compare the default setup in Devlin et al. (2019) with the hyperparameters found by AutoHAS. We vary the finetuning steps from 2K to 22K and each setting is run 5 times. We can see that AutoHAS is superior to the default hyperparameters under most circumstances, in terms of both F1 and exact match (EM) scores. Notably, the average gain on F1 over all the steps is 0.3, which is highly nontrivial (see footnote 1).
Footnote 1: As of 06/03/2020, it took 11 months of effort for the best model on SQuAD 1.1, LUKE, to outperform the runner-up, XLNet, on F1 by 0.3 (see https://rajpurkar.github.io/SQuAD-explorer/).
5 Conclusion
In this paper, we study the joint search of hyperparameters and architectures. Our framework overcomes the unrealistic assumption in NAS that the relative ranking of models' performance is primarily affected by their architecture. To address the challenge of joint search, we proposed AutoHAS, an efficient and differentiable searching algorithm for both hyperparameters and architectures. AutoHAS represents the hyperparameters and architectures in a unified way to handle the mixture of categorical and continuous values in the search space. AutoHAS shares weights across all hyperparameters and architectures, which enables it to search efficiently over the large joint search space. Experiments on both large-scale vision and NLP datasets demonstrate the effectiveness of AutoHAS.
Appendix A More Experimental Details
A.1 Datasets
We use five vision datasets and an NLP dataset to validate the effectiveness of our AutoHAS.
The ImageNet dataset Deng et al. (2009) is a large-scale image classification dataset, which has 1.28 million training images and 50 thousand images for validation. All images in ImageNet are categorized into 1000 classes. During searching, we split the training images into 1231121 images to optimize the weights and 50046 images to optimize the encoding.
The Birdsnap dataset Berg et al. (2014) is for fine-grained visual classification, with 49829 images of 500 species. There are 47386 training images and 2443 test images. During searching, we split the training images into 42405 images to optimize the weights and 4981 images to optimize the encoding.
The CIFAR10 dataset Krizhevsky and Hinton (2009) consists of 60000 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. During searching, we split the training images into 45000 images to optimize the weights and 5000 images to optimize the encoding.
The CIFAR100 dataset Krizhevsky and Hinton (2009) is similar to CIFAR10, but it classifies each image into 100 fine-grained classes. During searching, we split the training images into 45000 images to optimize the weights and 5000 images to optimize the encoding.
The Cars dataset Krause et al. (2013) contains 16185 images of 196 classes of cars. The data is split into 8144 training images and 8041 testing images. During searching, we split the training images into 6494 images to optimize the weights and 1650 images to optimize the encoding.
The SQuAD 1.1 dataset Rajpurkar et al. (2016) is a reading comprehension dataset of 107.7K data examples, with 87.5K for training, 10.1K for validation, and another 10.1K (hidden on server) for testing. Each example is a question-paragraph pair, where the question is generated by crowd-sourced workers, and the answer must be a span from the paragraph, which is also labeled by the worker. In this paper, we train on the training set, and only report the results on the validation set. During search, we use 80% of the training data to optimize the weights and the remaining 20% to optimize the encoding.
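The search-time splits of the vision datasets listed above can be collected in one place and checked for consistency. The numbers are the ones stated in this section; the dictionary layout and the helper name `holdout_fraction` are assumptions for illustration.

```python
# (images used to optimize the weights, images used to optimize the encoding)
SEARCH_SPLITS = {
    "ImageNet": (1231121, 50046),
    "Birdsnap": (42405, 4981),
    "CIFAR10": (45000, 5000),
    "CIFAR100": (45000, 5000),
    "Cars": (6494, 1650),
}

# Number of training images per dataset, as stated in the text.
TRAIN_SET_SIZES = {
    "Birdsnap": 47386,
    "CIFAR10": 50000,
    "CIFAR100": 50000,
    "Cars": 8144,
}


def holdout_fraction(name):
    """Share of the training images that is held out to optimize the encoding."""
    weights_part, encoding_part = SEARCH_SPLITS[name]
    return encoding_part / (weights_part + encoding_part)


fractions = {name: round(holdout_fraction(name), 3) for name in SEARCH_SPLITS}
```

The held-out share ranges from roughly 4% on ImageNet to roughly 20% on Cars, the latter matching the 80%/20% split used for SQuAD 1.1.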
A.2 Experimental Settings
Hyperparameters in Table 1. Both HP1 and HP2 train the model for 30 epochs in total, warm up the learning rate from 0 to the peak value in 5 epochs, and then decay the learning rate to 0 with a cosine schedule. The model is trained by momentum SGD. For HP1, we use a peak value of 5.5945 and an L2 penalty weight of 0.000153. For HP2, we use a peak value of 1.1634 and an L2 penalty weight of 0.0008436. Both Model1 and Model2 are similar to MobileNetV2 but use different kernel sizes and expansion ratios for the MBConv blocks. The (kernel size, expansion ratio) in all blocks for Model1 are {(7,1), (3,6), (3,6), (7,3), (3,3), (5,6), (3,6), (7,6), (3,3), (5,3), (5,6), (7,6), (3,6), (3,6), (3,6), (7,3), (3,3)}. The (kernel size, expansion ratio) in all blocks for Model2 are {(7,1), (7,3), (5,3), (5,3), (7,6), (5,6), (7,3), (5,6), (3,3), (3,3), (7,6), (3,6), (5,6), (7,6), (7,3), (7,6), (5,6)}.
More details in Figure 2. The baseline model is EfficientNetA0. For Random Search and Vizier, we report their results when searching for both learning rate and the weight of L2 penalty. When searching for the architecture, we search for the kernel size from {3, 5, 7} and the expansion ratio from {3, 6}.
Architecture of six models on ImageNet. We use the standard MobileNetV2, ResNet18, and ResNet50. MNet2 (S0) is similar to MobileNetV2 but sets the width multiplier and depth multiplier to 0.3. MNet2 (T0) sets the width multiplier to 0.2 and the depth multiplier to 3.0, thus it is a very thin network. EfficientNetA0 (ENetA0) uses 0.7, 0.5, and 0.7 as the coefficients for scaling network width, depth, and resolution, respectively.
Data augmentation. On ImageNet, we use the standard data augmentation following He et al. (2016). Since we use a large batch size, i.e., 4096, the learning rate is increased to 1.6. By default, we train the model on ImageNet for 30 epochs. On other smaller vision datasets, we apply the same data augmentation as Szegedy et al. (2015), but resize the image to 112 instead of the 224 used on ImageNet. We use a relatively small batch size of 1024 and a learning rate of 0.4. By default, we train the model on small vision datasets for 9000 iterations.
Architecture and hyperparameters of BERT. We start with the pretrained BERT model Devlin et al. (2019), whose backbone is essentially a transformer model with 24 layers, 1024 hidden units and 16 heads. When finetuning the model on SQuAD 1.1, we adopt the hyperparameters in Devlin et al. (2019) as the default values: the learning rate is warmed up from 0 to 5e-5 over the first 10% of the training steps and then linearly decayed to 0, with a batch size of 32.
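The default finetuning schedule described above (linear warm-up over the first 10% of the steps, then linear decay to 0) can be sketched as follows. The function name and the integer rounding of the warm-up length are assumptions; the peak value 5e-5 and the 10% warm-up fraction are the defaults stated in the text.

```python
def bert_default_lr(step, total_steps, peak=5e-5, warmup_fraction=0.1):
    """Linear warm-up to the peak, followed by linear decay to 0 at total_steps."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak * (total_steps - step) / (total_steps - warmup_steps)


# Example with 2000 finetuning steps, the shortest budget considered in Sec. 4.4.
schedule = [bert_default_lr(step, 2000) for step in range(2000)]
```

Searching for a per-step learning rate then amounts to choosing, at every step, among scaled versions of the corresponding entry of this default schedule.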
Hardware. ImageNet experiments are performed on a 32-core TPUv3, and others are performed on an 8-core TPUv3 by default. When the memory is not enough, we increase the number of cores to meet the memory requirements.
All code is implemented in TensorFlow. We run each searching experiment three times for the vision tasks and report the mean±variance. For the NLP task, since it is more sensitive than the vision tasks, we run each searching experiment five times and report the mean±variance.
References

[1]
(2017)
Designing neural network architectures using reinforcement learning
. In iclr, Cited by: §2.  [2] (2018) Online learning rate adaptation with hypergradient descent. In iclr, Cited by: §2, §4.3, Table 3.
 [3] (2014) Birdsnap: largescale finegrained visual categorization of birds. In cvpr, pp. 2011–2018. Cited by: §A.1, §4.1.
 [4] (2012) Random search for hyperparameter optimization. jmlr 13 (Feb), pp. 281–305. Cited by: §2, Table 3.
 [5] (2020) Once for all: train one network and specialize it for efficient deployment. In iclr, Cited by: §1, §2.
 [6] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In iclr, Cited by: §1, §2.
 [7] (2009) ImageNet: a largescale hierarchical image database. In cvpr, pp. 248–255. Cited by: §A.1, §4.1.
[8] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §A.2, §1, Figure 4, §4.1, §4.4.

[9] (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, pp. 3460–3468. Cited by: §2.
[10] (2019) Network pruning via transformable architecture search. In NeurIPS, pp. 760–771. Cited by: §1, §2.
[11] (2019) Searching for a robust neural architecture in four GPU hours. In CVPR, pp. 1761–1770. Cited by: §1, §1, §2, §3.3, §4.2, Table 5.
[12] (2020) NAS-Bench-201: extending the scope of reproducible neural architecture search. In ICLR. Cited by: §4.2.
[13] (2017) Google Vizier: a service for black-box optimization. In SIGKDD. Cited by: Table 3.
[14] (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §4.3.
[15] (2017) HyperNetworks. In ICLR. Cited by: §3.1.
[16] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR. Cited by: §1.
[17] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, pp. 183–202. Cited by: §1.
[18] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §A.2, §1, §4.1, §4.3, Table 3.
[19] (2019) Searching for MobileNetV3. In ICCV, pp. 1314–1324. Cited by: §1.
[20] (2011) Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523. Cited by: §2.

[21] (2019) Automated machine learning. Springer. Cited by: §2.
[22] (2009) Automated configuration of algorithms for solving hard computational problems. Ph.D. Thesis, University of British Columbia. Cited by: §2.
[23] (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §2.
[24] (2017) Categorical reparameterization with Gumbel-softmax. In ICLR. Cited by: §3.3.
[25] (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (4), pp. 455–492. Cited by: §2.
[26] (2019) Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970. Cited by: §2.

[27] (1995) Automatic parameter selection by minimizing estimated error. In Machine Learning Proceedings, pp. 304–312. Cited by: §2.
[28] (2013) 3D object representations for fine-grained categorization. In International IEEE Workshop on 3D Representation and Recognition (3dRR). Cited by: §A.1, §4.1.
[29] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: §A.1, §A.1, §4.1.

[30] (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §2.
[31] (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18 (1), pp. 6765–6816. Cited by: §2.
[32] (2019) DARTS: differentiable architecture search. In ICLR. Cited by: Figure 1, §1, §1, §2, §3.3, §3.3.
[33] (2020) Optimizing millions of hyperparameters by implicit differentiation. In AISTATS. Cited by: §2, §4.3, Table 3.

[34] (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR. Cited by: Table 3.
[35] (2015) Gradient-based hyperparameter optimization through reversible learning. In ICML, pp. 2113–2122. Cited by: §2.

[36] (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR. Cited by: §3.3.
[37] (2016) Hyperparameter optimization with approximate gradient. In ICML, pp. 737–746. Cited by: §2.
[38] (2018) Efficient neural architecture search via parameter sharing. In ICML, pp. 4092–4101. Cited by: Figure 1, §1, §2, §3.3.
[39] (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pp. 2383–2392. Cited by: §A.1, §4.1, §4.4.
[40] (2019) Regularized evolution for image classifier architecture search. In AAAI, pp. 4780–4789. Cited by: §2.
[41] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: Table 1, §1, §4.1.
[42] (2019) Truncated backpropagation for bilevel optimization. In AISTATS, pp. 1723–1732. Cited by: §2.
[43] (2015) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §2.
[44] (2015) Scalable Bayesian optimization using deep neural networks. In ICML, pp. 2171–2180. Cited by: §2.
[45] (2015) Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §A.2.
[46] (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR, pp. 2820–2828. Cited by: §2, §4.1.
[47] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114. Cited by: §1, §1, §2.
[48] (2020) EfficientDet: scalable and efficient object detection. In CVPR. Cited by: §1.
[49] (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In SIGKDD, pp. 847–855. Cited by: §2.
[50] (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In CVPR, pp. 10734–10742. Cited by: §1, §2, §4.2.
[51] (2017) Genetic CNN. In ICCV, pp. 1379–1388. Cited by: §1.
[52] (2019) SNAS: stochastic neural architecture search. In ICLR. Cited by: §2, §4.2.
[53] (2019) NAS-Bench-101: towards reproducible neural architecture search. In ICML, pp. 7105–7114. Cited by: §1.
[54] (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. In ICML Workshop. Cited by: §2.
[55] (2017) Neural architecture search with reinforcement learning. In ICLR. Cited by: §1, §2.
[56] (2018) Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710. Cited by: §2.