AutoHAS: Differentiable Hyper-parameter and Architecture Search

06/05/2020 ∙ Xuanyi Dong et al. ∙ Google ∙ University of Technology Sydney

Neural Architecture Search (NAS) has achieved significant progress in pushing state-of-the-art performance. While previous NAS methods search for different network architectures with the same hyper-parameters, we argue that such search would lead to sub-optimal results. We empirically observe that different architectures tend to favor their own hyper-parameters. In this work, we extend NAS to a broader and more practical space by combining hyper-parameter and architecture search. As architecture choices are often categorical whereas hyper-parameter choices are often continuous, a critical challenge is how to handle these two types of values in a joint search space. To tackle this challenge, we propose AutoHAS, a differentiable hyper-parameter and architecture search approach, with the idea of discretizing the continuous space into a linear combination of categorical basis values. A key element of AutoHAS is the use of weight sharing across all architectures and hyper-parameters, which enables efficient search over the large joint search space. Experimental results on MobileNet/ResNet/EfficientNet/BERT show that AutoHAS significantly improves accuracy, by up to 2% on ImageNet and 0.4 F1 on SQuAD 1.1, with search cost comparable to training a single model. Compared to other AutoML methods, such as random search or Bayesian methods, AutoHAS can achieve better accuracy with 10x less compute cost.


1 Introduction

                          Model 1   rank   Model 2
HP1 (LR=5.5, L2=1.5e-4)    56.9%      >     55.6%
HP2 (LR=1.1, L2=8.4e-4)    54.7%      <     56.2%
Table 1: ImageNet accuracy of two models randomly sampled from a search space based on MobileNet-V2 Sandler et al. (2018). Model 1 favors HP1 while Model 2 favors HP2.

Neural Architecture Search (NAS) has brought significant improvements in many applications, such as machine perception Howard et al. (2019); Tan et al. (2020); Cai et al. (2020); Xie and Yuille (2017); Wu et al. (2019), language modeling Liu et al. (2019); Dong and Yang (2019b), and model compression Han (2018); Dong and Yang (2019a); Han et al. (2016). Most NAS works apply the same hyper-parameters while searching for network architectures. For example, all models in Zoph and Le (2017); Tan and Le (2019); Ying et al. (2019) are trained with the same optimizer, learning rate, and weight decay. As a result, the relative ranking of models in the search space is determined only by their architectures. However, we observe that different models favor different hyper-parameters. Table 1 shows the performance of two randomly sampled models with different hyper-parameters: under hyper-parameter setting HP1, Model 1 outperforms Model 2, but Model 2 is better under HP2. These results suggest that using fixed hyper-parameters in NAS would lead to sub-optimal results.

A natural question is: could we extend NAS to a broader scope for joint Hyper-parameter and Architecture Search (HAS)? In HAS, each model can potentially be coupled with its own best hyper-parameters, thus achieving better performance than existing NAS with fixed hyper-parameters. However, jointly searching for architectures and hyper-parameters is challenging. The first challenge is how to deal with both categorical and continuous values in the joint HAS search space. While architecture choices are mostly categorical (e.g., convolution kernel size), hyper-parameter choices can be both categorical (e.g., the type of optimizer) and continuous (e.g., weight decay). There is not yet a good solution to this challenge: previous NAS methods only focus on categorical search spaces, while hyper-parameter optimization methods only focus on continuous search spaces, so they cannot be directly applied to such a mixed categorical and continuous search space. The second critical challenge is how to efficiently search over the larger joint HAS search space, as it combines both architecture and hyper-parameter choices.

In this paper, we propose AutoHAS, a differentiable HAS algorithm. It is, to the best of our knowledge, the first algorithm that can efficiently handle the large joint HAS search space. To address the mixture of categorical and continuous search spaces, we first discretize the continuous hyper-parameters into a linear combination of categorical basis values, so that both types can be unified during search. As explained below, we use a differentiable method to search over this combination, i.e., the architecture and HP encodings in Fig. 1. These encodings represent the probability distribution over all candidates in the respective space and can be used to find the best architecture together with its associated hyper-parameters.

Figure 1: The AutoHAS framework. The architecture encoding ($C^{A}$) and the hyper-parameter (HP) encoding ($C^{H}$) represent the distribution over possible choices. Similar to Pham et al. (2018); Liu et al. (2019), we use a SuperModel to share the weights among all candidate architectures. AutoHAS alternates between updating the shared weights and updating the encodings ($C^{A}$ and $C^{H}$). When updating the encodings, each HP basis element produces a separate copy of the model weights; these copies are weighted by the HP encoding to compute the final weights. The encodings are updated by back-propagation to minimize the validation loss. When updating the shared weights, we first forward the SuperModel to compute the training loss. Then, the HP basis elements are weighted by the HP encoding to compute one set of hyper-parameters, which is used to back-propagate the gradients from the training loss and update the shared weights $\mathcal{W}$. After this search procedure, AutoHAS derives the final architecture and hyper-parameters from the learned architecture and HP encodings.
Figure 2: AutoHAS achieves higher accuracy with 10x less search cost than other AutoML methods. We search for the learning rate (LR), L2 penalty, kernel size, and expansion ratio of EfficientNet-A0.

To efficiently navigate the much larger search space, we further introduce a novel weight sharing technique for AutoHAS. Weight sharing has been widely used in previous NAS approaches Pham et al. (2018); Liu et al. (2019) to reduce the search cost. The main idea is to train a SuperModel, where each candidate in the architecture space is a sub-model. Using a SuperModel avoids training millions of candidates from scratch Liu et al. (2019); Dong and Yang (2019b); Cai et al. (2019); Pham et al. (2018). Motivated by weight sharing in NAS, AutoHAS extends its scope from architecture search to joint architecture and hyper-parameter search: we not only share the weights of the SuperModel across architectures but also share this SuperModel across hyper-parameters. At each search step, AutoHAS optimizes the shared SuperModel with a combination of basis elements of the HAS space, and the shared SuperModel then serves as a good initialization for all hyper-parameters at the next search step (see Fig. 1 and Sec. 3), as sketched below.
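To make the weight-sharing idea concrete, here is a minimal sketch (our own construction, not the paper's TensorFlow implementation; all names are hypothetical) of a SuperModel as a pool of per-layer, per-operation weights that any candidate architecture indexes into instead of training its own copy from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)

# The SuperModel keeps one weight tensor per (layer, candidate operation);
# its weight set W is the union of the weights of all candidate operations.
supermodel = {
    (layer, op): rng.normal(size=(4, 4))
    for layer in range(3)
    for op in ("conv3x3", "conv5x5", "identity")
}

def candidate_weights(architecture):
    """Return W_alpha: the subset of shared weights used by one candidate."""
    return {(l, op): supermodel[(l, op)] for l, op in enumerate(architecture)}

# A sampled architecture re-uses (rather than re-trains) the shared weights.
sub_model = candidate_weights(["conv3x3", "identity", "conv5x5"])
print(sorted(sub_model))
```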

In this paper, we focus on searching for the architecture, learning rate, and L2 penalty weight, but it should be straightforward to apply AutoHAS to other hyper-parameters. A summary of our results is shown in Fig. 2: AutoHAS outperforms other AutoML methods in both accuracy and efficiency (more details in Sec. 4.3). In Sec. 4, we show that it improves a number of computer vision and natural language processing models, i.e., MobileNet-V2 Sandler et al. (2018), ResNet He et al. (2016), EfficientNet Tan and Le (2019), and BERT fine-tuning Devlin et al. (2019).

2 Related Works

Neural Architecture Search (NAS). Since the seminal works Baker et al. (2017); Zoph and Le (2017) showed promising improvements over manually designed architectures, more efforts have been devoted to NAS. The accuracy of the found architectures has been improved by carefully designed search spaces Zoph et al. (2018), better search methods Real et al. (2019), and compound scaling Tan and Le (2019). The model size and latency of the searched architectures have been reduced by Pareto optimization Tan et al. (2019); Wu et al. (2019); Cai et al. (2019, 2020) and by enlarging the search space of network sizes Cai et al. (2020); Dong and Yang (2019a). The efficiency of NAS algorithms has been improved by weight sharing Pham et al. (2018), differentiable optimization Liu et al. (2019), and stochastic sampling Dong and Yang (2019b); Xie et al. (2019). These methods have found state-of-the-art architectures; however, their performance is bounded by fixed or manually tuned hyper-parameters.

Hyper-parameter optimization (HPO). Black-box and multi-fidelity HPO methods have a long-standing history Bergstra and Bengio (2012); Hutter (2009); Hutter et al. (2011, 2019); Kohavi and John (1995). Black-box methods, e.g., grid search and random search Bergstra and Bengio (2012), regard the evaluation function as a black box: they sample some hyper-parameters and evaluate them one by one to find the best. Bayesian methods make the sampling procedure in random search more efficient Jones et al. (1998); Shahriari et al. (2015); Snoek et al. (2015); they employ a surrogate model and an acquisition function to decide which candidate to evaluate next Thornton et al. (2013). Multi-fidelity optimization methods accelerate the above methods by evaluating on a proxy task, e.g., using fewer training epochs or a subset of the data Domhan et al. (2015); Jaderberg et al. (2017); Kohavi and John (1995); Li et al. (2017). These HPO methods are computationally expensive when applied to deep learning models Krizhevsky et al. (2012).

Recently, gradient-based HPO methods have shown better efficiency Baydin et al. (2018); Lorraine et al. (2020) by computing the gradient with respect to the hyper-parameters. For example, Maclaurin et al. (2015) calculate the exact gradients w.r.t. hyper-parameters, and Pedregosa (2016) leverages the implicit function theorem to calculate approximate hyper-gradients. Following that, different approximation methods have been proposed Lorraine et al. (2020); Pedregosa (2016); Shaban et al. (2019). Despite their efficiency, they can only be applied to differentiable hyper-parameters such as weight decay, but not to non-differentiable hyper-parameters such as the learning rate Lorraine et al. (2020) or the optimizer Shaban et al. (2019). Our AutoHAS is not only as efficient as gradient-based HPO methods but also applicable to both differentiable and non-differentiable hyper-parameters. Moreover, we show significant improvements on state-of-the-art models with large-scale datasets, addressing the lack of strong empirical evidence in previous HPO methods.

Joint Hyper-parameter and Architecture Search (HAS). A few approaches have been developed for joint hyper-parameter and architecture search Klein and Hutter (2019); Zela et al. (2018). However, they focus on small datasets and small search spaces, and they are more computationally expensive than our AutoHAS.

3 AutoHAS

3.1 Preliminaries

HAS aims to find architecture and hyper-parameters that achieve high performance on the validation set. HAS can be formulated as a bi-level optimization problem:

$$\min_{\alpha, h} \; \mathcal{L}\big(\omega^{*}_{\alpha,h}, \alpha, \mathcal{D}_{val}\big), \qquad \text{s.t.}\;\; \omega^{*}_{\alpha,h} = f_{h}\big(\omega_{\alpha}, \alpha, \mathcal{D}_{train}\big) \qquad (1)$$

where $\mathcal{L}$ indicates the objective function (e.g., cross-entropy loss) and $\omega_{\alpha}$ indicates the initial weights of the architecture $\alpha$. $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$ denote the training data and the validation data, respectively. $f_{h}$ represents the learning algorithm with hyper-parameters $h$ that produces the optimal weights $\omega^{*}_{\alpha,h}$, such as using SGD to minimize the training loss; in that case, $f_{h}$ corresponds to the SGD update steps whose learning rate and weight decay are given by $h$. We can also use a HyperNetwork Ha et al. (2017) to generate the weights $\omega^{*}_{\alpha,h}$.

HAS generalizes both NAS and HPO by introducing a broader search space. On the one hand, NAS is a special case of HAS, where the hyper-parameters $h$ of the inner optimization are fixed and only the architecture $\alpha$ is optimized. On the other hand, HPO is a special case of HAS, where the architecture $\alpha$ is fixed in Eq. (1).
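To make the bi-level structure of Eq. (1) concrete, here is a small self-contained toy example of our own (names such as `inner_train` are not from the paper): the inner problem trains the weights under a given hyper-parameter setting $h$, and the outer problem scores the resulting weights on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: y = 2x + noise, split into training and validation sets.
x = rng.normal(size=200)
y = 2.0 * x + 0.1 * rng.normal(size=200)
x_tr, y_tr, x_val, y_val = x[:150], y[:150], x[150:], y[150:]

def inner_train(lr, weight_decay, steps=50):
    """Plays the role of f_h in Eq. (1): SGD with hyper-parameters h."""
    w = 0.0
    for _ in range(steps):
        grad = np.mean(2 * (w * x_tr - y_tr) * x_tr) + 2 * weight_decay * w
        w -= lr * grad
    return w

def outer_objective(lr, weight_decay):
    """Validation loss of the weights produced by the inner optimization."""
    w_star = inner_train(lr, weight_decay)
    return np.mean((w_star * x_val - y_val) ** 2)

# Different hyper-parameters h lead to different validation losses.
for lr in (0.01, 0.1, 0.5):
    print(f"lr={lr}: val loss = {outer_objective(lr, weight_decay=1e-3):.4f}")
```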

3.2 Representation of the HAS Search Space in AutoHAS

The search space of HAS in AutoHAS is a Cartesian product of the architecture and hyper-parameter candidates. To search over this mixed search space, we need a unified representation of different searchable components, i.e., architectures, learning rates, optimizers, etc.

Architecture Search Space. We use the simplest case as an example. First of all, let the set of predefined candidate operations (e.g., 3x3 convolution, pooling, etc.) be $\mathcal{O} = \{O_{1}, \ldots, O_{|\mathcal{O}|}\}$, where the cardinality of $\mathcal{O}$ is $|\mathcal{O}|$. Suppose an architecture is constructed by stacking multiple layers, where each layer takes a tensor $X$ as input and outputs $F(X)$, which serves as the next layer's input. Here $F \in \mathcal{O}$ denotes the operation at a layer and might be different at different layers. A candidate architecture is then essentially the sequence of chosen operations over all layers. Further, a layer can be represented as a linear combination of the operations in $\mathcal{O}$ as follows:

$$F(X) = \sum_{i=1}^{|\mathcal{O}|} C^{A}_{i}\, O_{i}(X), \qquad (2)$$

where $C^{A}_{i}$ (the $i$-th element of the vector $C^{A}$) is the coefficient of operation $O_{i}$ for a layer. We call the set of all coefficients across layers the architecture encoding, which can then represent the search space of the architecture.
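A minimal sketch of Eq. (2) with toy stand-ins of our own for the candidate operations: a relaxed layer outputs the coefficient-weighted sum of every candidate operation applied to the same input, so the architecture encoding enters the forward computation directly.

```python
import numpy as np

def relaxed_layer(x, ops, coeffs):
    """Eq. (2): mix all candidate operations with the architecture coefficients.

    `ops` is the candidate set O (a list of callables); `coeffs` is the slice
    of the architecture encoding C^A for this layer and sums to one.
    """
    return sum(c * op(x) for c, op in zip(coeffs, ops))

# Toy 1-D "operations" standing in for e.g. 3x3 conv, 5x5 conv, identity.
ops = [lambda x: 0.5 * x, lambda x: x ** 2, lambda x: x]
coeffs = np.array([0.2, 0.3, 0.5])
print(relaxed_layer(np.array([1.0, 2.0]), ops, coeffs))
```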

Hyper-parameter Search Space. Now we can define the hyper-parameter search space in a similar way. The major difference is that we have to consider both categorical and continuous cases:

$$h = \sum_{j=1}^{|\mathcal{B}|} C^{H}_{j}\, \mathcal{B}_{j}, \qquad \text{s.t.}\;\; \sum_{j=1}^{|\mathcal{B}|} C^{H}_{j} = 1,\; C^{H}_{j} \geq 0, \qquad (3)$$

where $\mathcal{B}$ is a predefined set of hyper-parameter basis with cardinality $|\mathcal{B}|$ and $\mathcal{B}_{j}$ is the $j$-th basis element in $\mathcal{B}$. $C^{H}_{j}$ (the $j$-th element of the vector $C^{H}$) is the coefficient of hyper-parameter basis $\mathcal{B}_{j}$. If a hyper-parameter is continuous, we discretize it into a linear combination of basis values, which unifies the categorical and continuous cases. For example, for weight decay, $\mathcal{B}$ could be {1e-1, 1e-2, 1e-3}, and therefore all possible weight decay values can be represented as a linear combination over $\mathcal{B}$. For categorical hyper-parameters, taking the optimizer as an example, $\mathcal{B}$ could be {Adam, SGD, RMSProp}. In this case, an additional constraint is applied on top of Eq. (3): each $C^{H}_{j} \in \{0, 1\}$, so that exactly one basis element is selected. When there are multiple types of hyper-parameters, each of them has its own $\mathcal{B}$ and $C^{H}$; the overall hyper-parameter basis becomes the Cartesian product of the individual bases and the overall coefficient is the product of the corresponding $C^{H}_{j}$. We name the set of all coefficients the hyper-parameter encoding, which can then represent the search space of hyper-parameters.
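To make Eq. (3) concrete, a short sketch (our own names) of how a continuous hyper-parameter such as weight decay is expressed over a categorical basis, and how a categorical hyper-parameter such as the optimizer reduces to a one-hot selection over its basis.

```python
import numpy as np

# Continuous case: a basis B of weight-decay values and an encoding C^H that
# sums to one; the effective hyper-parameter is the linear combination.
wd_basis = np.array([1e-1, 1e-2, 1e-3])
c_h = np.array([0.1, 0.7, 0.2])
assert np.isclose(c_h.sum(), 1.0)
weight_decay = float(c_h @ wd_basis)
print(weight_decay)  # ~0.0172

# Categorical case: the encoding is constrained to be one-hot, so the
# combination simply selects one basis element.
opt_basis = ["adam", "sgd", "rmsprop"]
c_h_cat = np.array([0.0, 1.0, 0.0])
print(opt_basis[int(np.argmax(c_h_cat))])  # sgd
```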

3.3 AutoHAS: Automated Hyper-parameter and Architecture Search

Since each candidate in the HAS search space can be represented by a pair of encodings $(C^{A}, C^{H})$, the search problem is converted into optimizing $C^{A}$ and $C^{H}$. However, it is computationally prohibitive to compute the exact gradient of the validation loss in Eq. (1) w.r.t. $C^{A}$ and $C^{H}$. Instead, we propose a simple approximation strategy with weight sharing to accelerate this procedure.

First of all, we leverage a SuperModel to share weights among all candidate architectures in the architecture space, where each candidate is a sub-model of this SuperModel Pham et al. (2018); Liu et al. (2019). The weights $\mathcal{W}$ of the SuperModel are the union of the weights of all candidate operations in each layer, and the weights of an architecture $\alpha$ can thus be represented by $\mathcal{W}_{\alpha}$, a subset of $\mathcal{W}$. Computing the exact gradients w.r.t. $C^{A}$ and $C^{H}$ requires back-propagating through the whole inner optimization starting from the initial network state $\omega_{\alpha}$, which is too expensive. Inspired by Liu et al. (2019); Pham et al. (2018), we approximate the optimal weights using the current SuperModel weights as follows:

$$\omega^{*}_{\alpha,h} \approx f_{h}\big(\mathcal{W}_{\alpha}, \alpha, \mathcal{D}_{train}\big). \qquad (4)$$

Ideally, we should back-propagate through $f_{h}$ to modify the encoding $C^{H}$. However, $f_{h}$ might be a complex optimization algorithm that does not allow back-propagation. To solve this problem, we regard $f_{h}$ as a black-box function and reformulate $\omega^{*}_{\alpha,h}$ as follows:

$$\omega^{*}_{\alpha,h} \approx \sum_{j=1}^{|\mathcal{B}|} C^{H}_{j}\, f_{\mathcal{B}_{j}}\big(\mathcal{W}_{\alpha}, \alpha, \mathcal{D}_{train}\big). \qquad (5)$$

In this way, $\omega^{*}_{\alpha,h}$ is calculated as a weighted sum, with weights given by the HP encoding $C^{H}$, of the candidate weights generated by applying each hyper-parameter basis element via $f_{\mathcal{B}_{j}}$.
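A small sketch of the black-box reformulation in Eq. (5), with toy names of our own: the training step is run once per hyper-parameter basis element, and the resulting weight copies are mixed by the HP encoding.

```python
import numpy as np

def combine_candidate_weights(w_shared, per_basis_steps, c_h):
    """Eq. (5): run the black-box training step once per HP basis element and
    mix the resulting weight copies with the HP encoding C^H."""
    candidates = [step(w_shared) for step in per_basis_steps]   # w_1, ..., w_|B|
    return sum(c * w for c, w in zip(c_h, candidates))

# Toy example: three learning-rate basis values applied to a scalar weight
# with a fixed gradient of 0.4.
grad = 0.4
lr_basis = [0.01, 0.1, 1.0]
steps = [lambda w, lr=lr: w - lr * grad for lr in lr_basis]
print(combine_candidate_weights(1.0, steps, c_h=np.array([0.2, 0.6, 0.2])))
```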

In practice, it is not easy to directly optimize the encodings $C^{A}$ and $C^{H}$, because they naturally have constraints associated with them, such as Eq. (3). Inspired by the continuous relaxation of Liu et al. (2019); Dong and Yang (2019b), we instead use another set of relaxed variables $\hat{C}^{A}$ and $\hat{C}^{H}$ to calculate $C^{A}$ and $C^{H}$. $\hat{C}^{A}$ and $\hat{C}^{H}$ have the same dimensions as $C^{A}$ and $C^{H}$. The calculation procedure encapsulates the constraints of Eq. (2) and Eq. (3) as follows:

$$C^{H} = \text{one\_hot}\big(\arg\max_{j} \tilde{C}^{H}_{j}\big), \qquad (6)$$
$$\tilde{C}^{H}_{j} = \frac{\exp\big((\hat{C}^{H}_{j} + g_{j}) / \tau\big)}{\sum_{k=1}^{|\mathcal{B}|} \exp\big((\hat{C}^{H}_{k} + g_{k}) / \tau\big)}, \qquad (7)$$
$$h = \sum_{j=1}^{|\mathcal{B}|} C^{H}_{j}\, \mathcal{B}_{j}, \qquad (8)$$

where $\tilde{C}^{H}$ is computed by applying the Gumbel-Softmax function Jang et al. (2017); Maddison et al. (2017) on the relaxed variables $\hat{C}^{H}$, $\tau$ is a temperature value, and the $g_{j}$ are i.i.d. samples drawn from Gumbel(0,1). The Gumbel-Softmax in Eq. (7) incorporates a stochastic procedure during search: it helps explore more candidates in the HAS search space and avoids over-fitting to some sub-optimal architecture and hyper-parameters. We use the same procedure as Eq. (6)-(8) to define $C^{A}$ and $\tilde{C}^{A}$ for the architecture encoding. Ideally, the encodings should be optimized through Eq. (6) by back-propagation, but unfortunately the one-hot encodings $C^{A}$ and $C^{H}$ are not differentiable. To address this issue, we follow Dong and Yang (2019b); Jang et al. (2017); Maddison et al. (2017) to relax the one-hot encodings: in the forward pass, we use the one-hot encodings to compute the validation loss, but in the backward pass, we substitute $C^{A}$ and $C^{H}$ with $\tilde{C}^{A}$ and $\tilde{C}^{H}$ during back-propagation.
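Below is a dependency-free sketch of the sampling in Eq. (7) together with the hard one-hot of Eq. (6); it shows the forward pass only, whereas the actual method routes gradients through the soft probabilities in the backward pass. The NumPy implementation and names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=10.0, hard=True):
    """Sample a (relaxed) categorical from the relaxed encoding (Eqs. (6)-(7)).

    hard=True mirrors "GS (hard)": a one-hot vector in the forward pass;
    hard=False mirrors "GS (soft)": the soft probabilities are used directly.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=np.shape(logits))))  # Gumbel(0,1)
    soft = np.exp((np.asarray(logits, dtype=float) + gumbel) / tau)
    soft /= soft.sum()
    if not hard:
        return soft
    one_hot = np.zeros_like(soft)
    one_hot[np.argmax(soft)] = 1.0
    return one_hot

print(gumbel_softmax([0.1, 0.5, 0.2], hard=False))
print(gumbel_softmax([0.1, 0.5, 0.2], hard=True))
```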

We describe our AutoHAS algorithm in Algorithm 1. During search, we jointly optimize the shared weights $\mathcal{W}$ and the encodings in an alternating way. The shared weights $\mathcal{W}$ are updated as follows:

$$\mathcal{W} \leftarrow f_{h}\big(\mathcal{W}, \alpha, \mathcal{D}_{train}\big), \qquad (9)$$

where $f_{h}$ is a training algorithm; in our experiments, it is implemented as minimizing the training loss with hyper-parameters $h$ for one step. Notably, in Eq. (9), the hyper-parameters $h$ are computed from the HP encoding $C^{H}$ via Eq. (8) and the architecture $\alpha$ is determined by the architecture encoding $C^{A}$.
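The sketch below mimics the alternation of Algorithm 1 on a toy regression problem. It is our own schematic under simplifying assumptions: a scalar model replaces the SuperModel, only the learning rate is searched, and finite differences stand in for back-propagating the validation loss to the relaxed encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)
x_tr, y_tr, x_val, y_val = x[:160], y[:160], x[160:], y[160:]

lr_basis = np.array([0.01, 0.1, 0.5])   # HP basis B: candidate learning rates
logits = np.zeros(3)                    # relaxed HP encoding
w = 0.0                                 # shared model weight

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def train_step(w_, lr):                 # f_h: one SGD step on the training loss
    return w_ - lr * np.mean(2 * (w_ * x_tr - y_tr) * x_tr)

def val_loss(w_):
    return np.mean((w_ * x_val - y_val) ** 2)

for _ in range(50):
    mixed_lr = float(softmax(logits) @ lr_basis)
    # (a) update the shared weights with the encoding-mixed hyper-parameter (Eq. (9))
    w = train_step(w, mixed_lr)
    # (b) update the relaxed encoding to lower the validation loss of a
    #     one-step lookahead (Eqs. (4)-(5)); finite differences used here
    base = val_loss(train_step(w, mixed_lr))
    grad_logits = np.zeros_like(logits)
    for j in range(len(logits)):
        bumped = logits.copy()
        bumped[j] += 1e-3
        look = val_loss(train_step(w, float(softmax(bumped) @ lr_basis)))
        grad_logits[j] = (look - base) / 1e-3
    logits -= 0.5 * grad_logits

print("derived learning rate:", float(softmax(logits) @ lr_basis))
```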

3.4 Deriving Hyper-parameters and Architecture

0:  Randomly initialize the SuperModel weights $\mathcal{W}$
0:  Randomly initialize the relaxed encodings $\hat{C}^{A}$ and $\hat{C}^{H}$
0:  Split the available data into two disjoint sets: $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$
1:  while not converged do
2:     Update the shared weights $\mathcal{W}$ via Eq. (9)
3:     Optimize the encodings $\hat{C}^{A}$ and $\hat{C}^{H}$ via Eq. (4)-(8)
4:  end while
5:  Derive the final architecture from $C^{A}$ and the hyper-parameters from $C^{H}$
Algorithm 1 The AutoHAS Procedure

After obtaining the optimized encoding of architecture and hyper-parameters following Sec. 3.3, we use them to derive the final architecture and hyper-parameters. For hyper-parameters, we apply different strategies to the continuous and categorical values:

$$h = \begin{cases} \sum_{j=1}^{|\mathcal{B}|} p_{j}\, \mathcal{B}_{j}, \;\; p_{j} = \frac{\exp(\hat{C}^{H}_{j})}{\sum_{k} \exp(\hat{C}^{H}_{k})} & \text{if the hyper-parameter is continuous,} \\ \mathcal{B}_{j^{*}}, \;\; j^{*} = \arg\max_{j} \hat{C}^{H}_{j} & \text{if the hyper-parameter is categorical.} \end{cases} \qquad (10)$$

For architectures, since all values are categorical, we apply the same strategy in Eq. (10) for categorical values.

Notably, unlike other fixed hyper-parameters, the learning rate can have different values at each training step, so it is a list of continuous values instead of a single scalar. To deal with this special case, we use Eq. (10) to derive the continuous learning rate value at each searching step, such that we can obtain a list of learning rate values corresponding to each specific step.
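A short sketch of the deriving rule in Eq. (10), with naming of our own: categorical choices take the argmax basis element, continuous ones take the encoding-weighted combination, and the per-step learning rate therefore becomes a schedule of such combinations.

```python
import numpy as np

def derive_hyper_parameter(c_h, basis, categorical):
    """Eq. (10): argmax selection for categorical choices, expectation over the
    basis for continuous ones."""
    c_h = np.asarray(c_h, dtype=float)
    if categorical:
        return basis[int(np.argmax(c_h))]
    return float(c_h @ np.asarray(basis, dtype=float))

print(derive_hyper_parameter([0.1, 0.7, 0.2], [1e-1, 1e-2, 1e-3], categorical=False))
print(derive_hyper_parameter([0.2, 0.5, 0.3], ["adam", "sgd", "rmsprop"], categorical=True))

# The learning rate is searched per training step, so deriving it yields a
# schedule: one combined value per step rather than a single scalar.
lr_basis = [0.4, 0.2, 0.1]
per_step_encodings = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
print([derive_hyper_parameter(c, lr_basis, categorical=False) for c in per_step_encodings])
```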

After we derive the final architecture and hyper-parameters as in Algorithm 1, we will use the searched hyper-parameters to re-train the searched architecture.

4 Experiments

4.1 Experimental Settings

Datasets. We demonstrate the effectiveness of our AutoHAS on five vision datasets, i.e., ImageNet Deng et al. (2009), Birdsnap Berg et al. (2014), CIFAR-10, CIFAR-100 Krizhevsky and Hinton (2009), and Cars Krause et al. (2013), and an NLP dataset, i.e., SQuAD 1.1 Rajpurkar et al. (2016).

Searching settings. We call the hyper-parameters that control the behavior of AutoHAS meta hyper-parameters. For the meta hyper-parameters, we set the Gumbel-Softmax temperature $\tau$ to 10 and employ the Adam optimizer with a fixed learning rate of 0.002. Notably, we use the same meta hyper-parameters for all search experiments. The number of searching epochs and the batch size are set to be the same as in the training settings of the baseline models, i.e., they can differ across baseline models. When searching for MBConv-based models Tan et al. (2019); Sandler et al. (2018), we search for the kernel size from {3, 5, 7} and the expansion ratio from {3, 6}. For vision tasks, the hyper-parameter basis for a continuous value is the product of the default value and the multipliers {0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0}. For the NLP task, we use smaller multipliers {0.01, 0.05, 0.1, 0.5, 1.0, 1.2, 1.5} since we fine-tune on top of pretrained models. If a model has a default learning rate schedule, we create a range of values around the default learning rate at each step and use AutoHAS to find the best learning rate at each step.
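For concreteness, a tiny sketch (the helper name is our own) of how such a basis can be built from a default value and the multipliers above, e.g., for the L2 penalty weight with default 1e-4.

```python
def make_basis(default_value, multipliers):
    """Basis for a continuous hyper-parameter: default value times multipliers."""
    return [default_value * m for m in multipliers]

vision_multipliers = [0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
print(make_basis(1e-4, vision_multipliers))
```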

Training settings. On vision datasets, we use six models, i.e., three variants of MobileNet-V2 (MNet2), EfficientNet-A0 (ENet-A0), ResNet-18, and ResNet-50. We use a batch size of 4096 for ImageNet and 1024 for the other vision datasets, with the same data augmentation as in He et al. (2016). On the NLP dataset SQuAD, we fine-tune the pretrained BERT model and follow the setting of Devlin et al. (2019). The number of training epochs differs across datasets; we explain the details later. For the learning rate and weight decay, we use the values found by AutoHAS.

4.2 Ablation Studies

Searching Strategy Deriving Strategy MNet2 (S0) MNet2
Softmax Eq. (6) 44.0% 63.5%
GS (soft) Eq. (6) 45.5% 65.2%
GS (hard) Eq. (6) 45.9% 66.4%
Softmax Eq. (10) 40.8% 61.4%
GS (soft) Eq. (10) 41.5% 67.0%
GS (hard) Eq. (10) 46.3% 67.5%
Table 2: We analyze different strategies used in AutoHAS. “GS” indicates Gumbel-Softmax. “hard” indicates using one-hot in forward pass and relaxation during backward pass. “soft” indicates using relaxation during both forward and backward passes.

We did a series of experiments to study the effect of (I) different searching strategies; (II) different deriving strategies; (III) AutoHAS-searched vs. manually tuned hyper-parameters.

Figure 3: AutoHAS finds different learning rates and L2 penalty weights for different models.

The effect of searching strategies. One of the key questions in searching is how to relax and optimize the architecture and hyper-parameter encodings. AutoHAS leverages the Gumbel-Softmax in Eq. (7) to stochastically explore different hyper-parameter and architecture basis elements. We evaluate two different variants in Table 2. “Softmax” does not add the Gumbel-distributed noise and performs poorly compared to using Gumbel-Softmax; this strategy has an over-fitting problem, which has also been observed in NAS Dong and Yang (2019b); Xie et al. (2019); Dong and Yang (2020); Wu et al. (2019). “GS (soft)” does not use the one-hot vector in Eq. (7) and thus explores too many hyper-parameters during searching; as a result, its optimization might become difficult and the found hyper-parameters and architectures are worse than those of AutoHAS.

The effect of deriving strategies. We evaluate two kinds of strategies to derive the final hyper-parameters and architectures. The vanilla strategy is to follow previous NAS methods: selecting the basis hyper-parameters with the maximal probability. However, it does not work well for the continuous choices. As shown in Table 2, our proposed strategy “GS (hard) + Eq. (10)” can improve the accuracy by 1.1% compared to the vanilla strategy “GS (hard) + Eq. (6)”.

Searched hyper-parameters vs. manually tuned hyper-parameters. We show the searched and manually tuned hyper-parameters in Fig. 3. For the L2 penalty weight, it is interesting that AutoHAS suggests using a large penalty for the large models (ResNet) at the beginning and decaying it to a smaller value toward the end of the search. For manual tuning, one needs to tune every model individually to obtain its optimal hyper-parameters. In contrast, AutoHAS only requires tuning two meta hyper-parameters, with which it can successfully find good hyper-parameters for tens of models. Besides, some hyper-parameters, such as the learning rate, change dynamically at every training step. It is hard for humans to tune such per-step values, while AutoHAS can handle such hyper-parameters.

4.3 AutoHAS for Vision Datasets

ImageNet: We first apply AutoHAS to ImageNet and compare the performance with previous AutoML algorithms. We choose the hyper-parameters used for ResNet He et al. (2016) as our default hyper-parameters: warm up the learning rate from 0 to the peak value (1.6 for our batch size of 4096) over the first 5 epochs and decay it to 0 via a cosine annealing schedule Goyal et al. (2017), and set the L2 penalty weight to 1e-4. Since these hyper-parameters have been heavily tuned by human experts, there is limited headroom to improve. Therefore, we study how to train a model to achieve good performance in a shorter time, i.e., 30 epochs.

Type Model Searching Methods
default HP RS Bergstra and Bengio (2012) Vizier Golovin et al. (2017) IFT Lorraine et al. (2020) HGD Baydin et al. (2018) AutoHAS
LR MNet2 (S0) 44.6±0.6 12.3±8.7 6.1±4.5 N/A 29.6±2.1 44.8±0.4
MNet2 (T0) 52.4±0.5 17.5±3.0 14.3±20.0 N/A 33.0±4.4 52.0±0.2
MNet2 66.8±0.2 38.9±5.6 49.0±3.6 N/A 49.1±4.6 66.9±0.1
ENet-A0 60.8±0.0 46.6±1.2 50.8±0.8 N/A 50.0±1.4 61.0±0.0
ResNet-18 67.6±0.1 60.4±1.2 63.5±0.2 N/A 56.3±0.5 67.9±0.2
ResNet-50 74.8±0.1 67.2±0.1 71.1±0.3 N/A 62.3±0.3 75.2±0.1
L2 MNet2 (S0) 44.6±0.6 45.9±0.7 46.3±0.2 46.2±0.1 N/A 46.3±0.1
MNet2 (T0) 52.4±0.5 52.2±0.0 52.4±0.4 52.5±0.2 N/A 53.5±0.3
MNet2 66.8±0.2 66.4±0.8 67.0±0.2 66.4±0.2 N/A 67.5±0.1
ENet-A0 60.8±0.0 60.0±2.0 62.0±0.2 61.1±0.2 N/A 62.2±0.1
ResNet-18 67.6±0.1 67.9±0.2 67.6±0.1 66.6±0.3 N/A 67.9±0.0
ResNet-50 74.8±0.1 75.0±0.1 74.8±0.1 73.1±0.4 N/A 75.0±0.1
LR +L2 MNet2 (S0) 44.6±0.6 13.1±10.9 15.2±7.3 N/A N/A 45.7±0.3
MNet2 (T0) 52.4±0.5 29.3±20.6 30.2±15.9 N/A N/A 53.8±0.2
MNet2 66.8±0.2 21.6±15.1 25.2±14.6 N/A N/A 67.3±0.1
ENet-A0 60.8±0.0 47.3±4.7 49.3±2.4 N/A N/A 61.5±0.1
ResNet-18 67.6±0.1 54.2±8.5 53.5±0.5 N/A N/A 67.8±0.0
ResNet-50 74.8±0.1 67.4±4.7 66.7±1.9 N/A N/A 74.8±0.1
A+LR +L2 MNet2 (S0) 44.6±0.6 22.4±12.4 25.4±4.1 46.4±0.4 N/A 47.5±0.3
ENet-A0 60.8±0.0 53.4±5.7 56.4±3.9 61.8±0.5 N/A 62.9±0.2
Table 3: We compare four AutoML algorithms Bergstra and Bengio (2012); Golovin et al. (2017); Baydin et al. (2018); Lorraine et al. (2020) on four search spaces. We choose the hyper-parameters in ResNet He et al. (2016) with the warm-up mechanism Loshchilov and Hutter (2017) as the default setting (“default HP”) to train models on ImageNet. The number of training epochs is 30. “RS” indicates the random search algorithm Bergstra and Bengio (2012). “N/A” indicates the corresponding search algorithm cannot be applied to that search space. “ENet” and “MNet2” indicate EfficientNet and MobileNet-V2, respectively. “A+LR+L2” indicates searching for the architecture, learning rate, and L2 penalty weight. For AutoHAS, we use the same meta hyper-parameters for all search experiments: Adam with a learning rate of 0.002, a Gumbel-Softmax temperature $\tau$ of 10, and the same multipliers to create the basis.
Model Params FLOPs Train Time Searching Methods
(MB) (M) (seconds) RS / Vizier IFT-Neumann HGD AutoHAS
MNet2 (S0) 1.49 35.0 2.0e3 1.9e4 (9.4) 2.0e3 (1.0) 2.6e3 (1.3) 2.8e3 (1.4)
MNet2 (T0) 1.77 89.5 2.1e3 2.0e4 (9.3) 2.5e3 (1.2) 4.1e3 (2.0) 2.4e3 (1.2)
MNet2 3.51 307.3 2.4e3 1.8e4 (7.5) 5.7e3 (2.3) 2.5e3 (1.1) 4.7e3 (1.9)
ENet-A0 2.17 76.2 1.4e3 1.2e4 (8.7) 2.2e3 (1.6) 1.9e3 (1.4) 2.2e3 (1.6)
ResNet-18 11.69 1818 2.0e3 1.9e4 (9.6) 2.7e3 (1.4) 2.2e3 (1.1) 2.2e3 (1.1)
ResNet-50 25.56 4104 2.6e3 2.0e4 (7.6) 2.9e3 (1.1) 2.8e3 (1.1) 2.8e3 (1.1)
Table 4: We report the computational costs of each model and the searching costs of each AutoML algorithm on ImageNet. Since the time may vary on batch size, platforms, accelerators, or devices, we also report the relative cost to the training time. We use the same notation as used in Table 3.

Table 3 and Table 4 show the performance comparison. There are some interesting observations: (I) AutoHAS is applicable to searching for almost all kinds of hyper-parameters and architectures, while previous hyper-gradient based methods Lorraine et al. (2020); Baydin et al. (2018) can only be applied to some hyper-parameters. (II) AutoHAS shows improvements on seven different representative models, including both light-weight and heavy models. (III) The hyper-parameters found by AutoHAS outperform the (default) manually tuned hyper-parameters. (IV) The hyper-parameters found by AutoHAS outperform those found by other AutoML algorithms. (V) Searching over the large joint HAS search space obtains better results than searching for hyper-parameters only. (VI) Gradient-based AutoML algorithms are more efficient than black-box optimization methods, such as random search and Vizier.

Birdsnap CIFAR-10 CIFAR-100 Cars
default 51.7±0.7 93.9±0.2 75.7±0.2 72.0±0.8
GDAS (Arch) Dong and Yang (2019b) 55.8±0.6 93.5±0.0 76.4±0.5 77.0±2.5
AutoHAS (HP) 54.4±0.7 93.9±0.1 76.0±0.1 77.7±0.1
AutoHAS (HP+Arch) 56.5±0.9 93.7±0.3 76.0±0.4 80.3±0.5
Table 5: We use AutoHAS to search for hyper-parameters (HP), architectures (Arch), and both hyper-parameters and architectures (HP+Arch) on four datasets. We use MobileNet-V2 as the default model and follow Table 3 to set up the default hyper-parameters. For all datasets, AutoHAS either outperforms or is competitive with searching for HP or Arch only.

Smaller datasets: To analyze the effect of architecture and hyper-parameters, we compare AutoHAS with two variants: searching for the architecture only, i.e., GDAS (Arch), and searching for hyper-parameters only, i.e., AutoHAS (HP). The results on four datasets are shown in Table 5. On Birdsnap and Cars, AutoHAS significantly outperforms GDAS (Arch) and AutoHAS (HP). On CIFAR-100, the accuracy of AutoHAS is similar to GDAS (Arch) and AutoHAS (HP), while all of them outperform the default. On CIFAR-10, the accuracy of the auto-tuned architecture or hyper-parameters is similar to or slightly lower than the default. This might be because the default choices are already close to the optimal solution in the current HAS search space on CIFAR-10.

4.4 AutoHAS for SQuAD

Figure 4: Performance comparison on SQuAD 1.1 by fine-tuning the BERT model. AutoHAS achieves better performance in both F1 and exact match (EM) than the default setting in Devlin et al. (2019) under various maximum numbers of training steps, averaged over 5 runs.

To further validate the generalizability of AutoHAS, we also conduct experiments on a reading comprehension dataset in the NLP domain, i.e., SQuAD 1.1 Rajpurkar et al. (2016). We pretrain a BERT model following Devlin et al. (2019) and then apply AutoHAS when fine-tuning it on SQuAD 1.1. In particular, we search for the per-step learning rate and the weight decay of Adam. For AutoHAS, we split the training set of SQuAD 1.1 into 80% for training and 20% for validation. In Fig. 4, we show the results on the dev set and compare the default setup in Devlin et al. (2019) with the hyper-parameters found by AutoHAS. We vary the number of fine-tuning steps from 2K to 22K and run each setting 5 times. AutoHAS is superior to the default hyper-parameters under most circumstances, in terms of both F1 and exact match (EM) scores. Notably, the average gain on F1 over all the steps is 0.3, which is highly nontrivial. (As of 06/03/2020, it took an 11-month effort for the best SQuAD 1.1 model, LUKE, to outperform the runner-up XLNet on F1 by 0.3; see https://rajpurkar.github.io/SQuAD-explorer/.)

5 Conclusion

In this paper, we study the joint search of hyper-parameters and architectures. Our framework overcomes the unrealistic assumption in NAS that the relative ranking of models' performance is primarily determined by their architectures. To address the challenge of joint search, we proposed AutoHAS, an efficient and differentiable search algorithm for both hyper-parameters and architectures. AutoHAS represents hyper-parameters and architectures in a unified way to handle the mixture of categorical and continuous values in the search space, and it shares weights across all hyper-parameters and architectures, which enables efficient search over the large joint search space. Experiments on both large-scale vision and NLP datasets demonstrate the effectiveness of AutoHAS.

Appendix A More Experimental Details

A.1 Datasets

We use five vision datasets and an NLP dataset to validate the effectiveness of our AutoHAS.

The ImageNet dataset Deng et al. (2009) is a large scale image classification dataset, which has 1.28 million training images and 50 thousand images for validation. All images in ImageNet are categorized into 1000 classes. During searching, we split the training images into 1231121 images to optimize the weights and 50046 images to optimize the encoding.

The Birdsnap dataset Berg et al. (2014) is for fine-grained visual classification, with 49829 images for 500 species. There are 47386 training images and 2443 test images. During searching, we split the training images into 42405 images to optimize the weights and 4981 images to optimize the encoding.

The CIFAR-10 dataset Krizhevsky and Hinton (2009) consists of 60000 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. During searching, we split the training images into 45000 images to optimize the weights and 5000 images to optimize the encoding.

The CIFAR-100 dataset Krizhevsky and Hinton (2009)

is similar to CIFAR-10, while it classifies each image into 100 fine-grained classes. During searching, we split the training images into 45000 images to optimize the weights and 5000 images to optimize the encoding.

The Cars dataset Krause et al. (2013) contains 16185 images of 196 classes of cars. The data is split into 8144 training images and 8041 testing images. During searching, we split the training images into 6494 images to optimize the weights and 1650 images to optimize the encoding.

The SQuAD 1.1 dataset Rajpurkar et al. (2016) is a reading comprehension dataset of 107.7K data examples, with 87.5K for training, 10.1K for validation, and another 10.1K (hidden on server) for testing. Each example is a question-paragraph pair, where the question is generated by crowd-sourced workers, and the answer must be a span from the paragraph, which is also labeled by the worker. In this paper, we train on the training set, and only report the results on validation set. During search, we use 80% of the training data to optimize the weights and the remaining to optimize the encoding.

A.2 Experimental Settings

Hyper-parameters in Table 1. Both HP1 and HP2 train the model for 30 epochs in total, warm up the learning rate from 0 to the peak value in 5 epochs, and then decay the learning rate to 0 with the cosine schedule. The model is trained by momentum SGD. For HP1, we use a peak value of 5.5945 and an L2 penalty weight of 0.000153. For HP2, we use a peak value of 1.1634 and an L2 penalty weight of 0.0008436. Both Model 1 and Model 2 are similar to MobileNet-V2 but use different kernel sizes and expansion ratios for the MBConv blocks. The (kernel size, expansion ratio) in all blocks for Model 1 are {(7,1), (3,6), (3,6), (7,3), (3,3), (5,6), (3,6), (7,6), (3,3), (5,3), (5,6), (7,6), (3,6), (3,6), (3,6), (7,3), (3,3)}. The (kernel size, expansion ratio) in all blocks for Model 2 are {(7,1), (7,3), (5,3), (5,3), (7,6), (5,6), (7,3), (5,6), (3,3), (3,3), (7,6), (3,6), (5,6), (7,6), (7,3), (7,6), (5,6)}.

More details in Figure 2. The baseline model is EfficientNet-A0. For Random Search and Vizier, we report their results when searching for both learning rate and the weight of L2 penalty. When searching for the architecture, we search for the kernel size from {3, 5, 7} and the expansion ratio from {3, 6}.

Architecture of the six models on ImageNet. We use the standard MobileNet-V2, ResNet-18, and ResNet-50. MNet2 (S0) is similar to MobileNet-V2 but sets the width multiplier and depth multiplier to 0.3. MNet2 (T0) sets the width multiplier to 0.2 and the depth multiplier to 3.0, so it is a very thin network. EfficientNet-A0 (ENet-A0) uses coefficients of 0.7, 0.5, and 0.7 for scaling the network width, depth, and resolution, respectively.

Data augmentation. On ImageNet, we use the standard data augmentation following He et al. (2016). Since we use a large batch size, i.e., 4096, the learning rate is increased to 1.6. By default, we train models on ImageNet for 30 epochs. On the other, smaller vision datasets, we apply the same data augmentation as Szegedy et al. (2015), but we resize the images to 112 instead of the 224 used for ImageNet. We use a relatively small batch size of 1024 and a learning rate of 0.4. By default, we train models on the small vision datasets for 9000 iterations.

Architecture and hyper-parameters of BERT. We start with the pre-trained BERT model Devlin et al. (2019), whose backbone is essentially a transformer with 24 layers, 1024 hidden units, and 16 heads. When fine-tuning the model on SQuAD 1.1, we adopt the hyper-parameters in Devlin et al. (2019) as the default values: the learning rate warms up from 0 to 5e-5 over the first 10% of training steps and then linearly decays to 0, with a batch size of 32.

Hardware. ImageNet experiments are performed on a 32-core TPUv3, and the others are performed on an 8-core TPUv3 by default. When memory is not sufficient, we increase the number of cores to meet the memory requirements.

All code is implemented in TensorFlow. We run each searching experiment three times for the vision tasks and report the mean ± variance. For the NLP task, since it is more sensitive than the vision tasks, we run each searching experiment five times and report the mean ± variance.

References

  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning. In iclr, Cited by: §2.
  • [2] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood (2018) Online learning rate adaptation with hypergradient descent. In iclr, Cited by: §2, §4.3, Table 3.
  • [3] T. Berg, J. Liu, S. Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur (2014) Birdsnap: large-scale fine-grained visual categorization of birds. In cvpr, pp. 2011–2018. Cited by: §A.1, §4.1.
  • [4] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. jmlr 13 (Feb), pp. 281–305. Cited by: §2, Table 3.
  • [5] H. Cai, C. Gan, and S. Han (2020) Once for all: train one network and specialize it for efficient deployment. In iclr, Cited by: §1, §2.
  • [6] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In iclr, Cited by: §1, §2.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In cvpr, pp. 248–255. Cited by: §A.1, §4.1.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In acl, pp. 4171–4186. Cited by: §A.2, §1, Figure 4, §4.1, §4.4.
  • [9] T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In ijcai, pp. 3460–3468. Cited by: §2.
  • [10] X. Dong and Y. Yang (2019) Network pruning via transformable architecture search. In neurips, pp. 760–771. Cited by: §1, §2.
  • [11] X. Dong and Y. Yang (2019) Searching for a robust neural architecture in four gpu hours. In cvpr, pp. 1761–1770. Cited by: §1, §1, §2, §3.3, §4.2, Table 5.
  • [12] X. Dong and Y. Yang (2020) NAS-bench-201: extending the scope of reproducible neural architecture search. In iclr, Cited by: §4.2.
  • [13] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley (2017) Google vizier: a service for black-box optimization. In sigkdd, Cited by: Table 3.
  • [14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §4.3.
  • [15] D. Ha, A. Dai, and Q. V. Le (2017) HyperNetworks. In iclr, Cited by: §3.1.
  • [16] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In iclr, Cited by: §1.
  • [17] S. Han (2018) AMC: automl for model compression and acceleration on mobile devices. In eccv, pp. 183–202. Cited by: §1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In cvpr, pp. 770–778. Cited by: §A.2, §1, §4.1, §4.3, Table 3.
  • [19] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In iccv, pp. 1314–1324. Cited by: §1.
  • [20] F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523. Cited by: §2.
  • [21] F. Hutter, L. Kotthoff, and J. Vanschoren (2019) Automated machine learning. Springer. Cited by: §2.
  • [22] F. Hutter (2009) Automated configuration of algorithms for solving hard computational problems. Ph.D. Thesis, University of British Columbia. Cited by: §2.
  • [23] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §2.
  • [24] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In iclr, Cited by: §3.3.
  • [25] D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (4), pp. 455–492. Cited by: §2.
  • [26] A. Klein and F. Hutter (2019) Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970. Cited by: §2.
  • [27] R. Kohavi and G. H. John (1995) Automatic parameter selection by minimizing estimated error. In Machine Learning Proceedings, pp. 304–312. Cited by: §2.
  • [28] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In International IEEE Workshop on 3D Representation and Recognition (3dRR), Cited by: §A.1, §4.1.
  • [29] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §A.1, §A.1, §4.1.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In neurips, pp. 1097–1105. Cited by: §2.
  • [31] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. jmlr 18 (1), pp. 6765–6816. Cited by: §2.
  • [32] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In iclr, Cited by: Figure 1, §1, §1, §2, §3.3, §3.3.
  • [33] J. Lorraine, P. Vicol, and D. Duvenaud (2020) Optimizing millions of hyperparameters by implicit differentiation. In aistats, Cited by: §2, §4.3, Table 3.
  • [34] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In iclr, Cited by: Table 3.
  • [35] D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In icml, pp. 2113–2122. Cited by: §2.
  • [36] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In iclr, Cited by: §3.3.
  • [37] F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In icml, pp. 737–746. Cited by: §2.
  • [38] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In icml, pp. 4092–4101. Cited by: Figure 1, §1, §2, §3.3.
  • [39] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In emnlp, pp. 2383–2392. Cited by: §A.1, §4.1, §4.4.
  • [40] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In aaai, pp. 4780–4789. Cited by: §2.
  • [41] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In cvpr, pp. 4510–4520. Cited by: Table 1, §1, §4.1.
  • [42] A. Shaban, C. Cheng, N. Hatch, and B. Boots (2019) Truncated back-propagation for bilevel optimization. In aistats, pp. 1723–1732. Cited by: §2.
  • [43] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas (2015) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §2.
  • [44] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In icml, pp. 2171–2180. Cited by: §2.
  • [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In cvpr, pp. 1–9. Cited by: §A.2.
  • [46] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MNasNet: platform-aware neural architecture search for mobile. In cvpr, pp. 2820–2828. Cited by: §2, §4.1.
  • [47] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In icml, pp. 6105–6114. Cited by: §1, §1, §2.
  • [48] M. Tan, R. Pang, and Q. V. Le (2020) EfficientDet: scalable and efficient object detection. In cvpr, Cited by: §1.
  • [49] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In sigkdd, pp. 847–855. Cited by: §2.
  • [50] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In cvpr, pp. 10734–10742. Cited by: §1, §2, §4.2.
  • [51] L. Xie and A. Yuille (2017) Genetic CNN. In iccv, pp. 1379–1388. Cited by: §1.
  • [52] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In iclr, Cited by: §2, §4.2.
  • [53] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter (2019) NAS-bench-101: towards reproducible neural architecture search. In icml, pp. 7105–7114. Cited by: §1.
  • [54] A. Zela, A. Klein, S. Falkner, and F. Hutter (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. In icml_w, Cited by: §2.
  • [55] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In iclr, Cited by: §1, §2.
  • [56] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In cvpr, pp. 8697–8710. Cited by: §2.