Searching by Generating: Flexible and Efficient One-Shot NAS with Architecture Generator

03/12/2021, by Sian-Yao Huang et al.

In one-shot NAS, sub-networks need to be searched from the supernet to meet different hardware constraints. However, the search cost is high, and N searches are needed for N different constraints. In this work, we propose a novel search strategy called architecture generator, which searches sub-networks by generating them, so that the search process becomes much more efficient and flexible. With the trained architecture generator, given target hardware constraints as input, N good architectures can be generated for N constraints with just one forward pass each, without re-searching or supernet retraining. Moreover, we propose a novel single-path supernet, called the unified supernet, to further improve search efficiency and reduce the GPU memory consumption of the architecture generator. With the architecture generator and the unified supernet, we propose a flexible and efficient one-shot NAS framework, called Searching by Generating NAS (SGNAS). With the pre-trained supernet, the search time of SGNAS for N different hardware constraints is only 5 GPU hours, which is 4N times faster than previous SOTA single-path methods. After training from scratch, the top-1 accuracy of SGNAS on ImageNet is 77.1%. Code: https://github.com/eric8607242/SGNAS.


1 Introduction

It is time-consuming and difficult to manually design neural architectures under specific hardware constraints. Neural architecture search (NAS) [evolutionnas][oneshotnas][rlnas], which aims to automatically search for the best neural architecture, is thus in high demand. However, how to efficiently and flexibly determine architectures conforming to various constraints is still very challenging [onceforall].

The earliest NAS methods were developed based on reinforcement learning (RL) [mnasnet][rlnas] or the evolution algorithm [evolutionnas]. However, extremely expensive computation is needed. For example, 2,000 GPU days are needed by an RL method [rlnas], and 3,150 GPU days are needed by the evolution algorithm [evolutionnas].

Figure 1: Overview of SGNAS. Given the target hardware constraint as the input, the architecture generator can generate architecture parameters instantly within the inference time of one forward pass. With the generated parameters, the specific architectures can be sampled from the unified supernet.

To improve search efficiency, one-shot NAS methods [oneshotnas][darts][proxylessnas][fbnet] were proposed to encode the entire search space into an over-parameterized neural network, called a supernet. Once the supernet is trained, all sub-networks in the supernet can be evaluated by inheriting the weights of the supernet without additional training. One-shot NAS methods can be divided into two categories: differentiable NAS (DNAS) and single-path NAS.

In addition to optimizing the supernet, DNAS [gdas][darts][fbnet][pcdarts:][atomnas] utilizes additional differentiable parameters, called architecture parameters, to indicate the architecture distribution in the search space. Because DNAS couples architecture parameter optimization with supernet optimization, for N different hardware constraints, the supernet and the architecture parameters must be jointly trained N times to find the N best architectures. This makes DNAS methods inflexible.

In contrast, single-path methods [uniform_sampling][fairnas][scarlet][greedynas] decouple supernet training from architecture searching. During supernet training, only a single path consisting of one block in each layer is activated and optimized in each iteration. The main idea is to simulate the discrete neural architectures in the search space and save GPU memory. Once the supernet is trained, different search strategies, such as the evolution algorithm [greedynas][uniform_sampling], can be used to search for architectures under different constraints without retraining the supernet. Single-path methods are thus more flexible than DNAS. However, re-executing the search strategy N times for N different constraints is still costly and not flexible enough.

On top of one-shot NAS, we especially investigate efficiency and flexibility. By efficiency, we mean the time required to search for the best architecture for a specific hardware constraint once the supernet is available. By flexibility, we mean the total time required to search for the best architectures when different hardware constraints are to be met. As a comparison instance, GreedyNAS [greedynas] takes almost 24 GPU hours to search for the best neural architecture under a specific constraint; in total, 24N GPU hours are required for N different constraints.

In this work, we focus on improving the efficiency and flexibility of the search strategy of single-path methods. The main idea is to search for the best architecture by generating it. First, we decouple supernet training from architecture searching and train the supernet as in single-path methods. After obtaining the supernet, we build an architecture generator to generate the best architecture directly. Given a hardware constraint as input, the architecture generator can generate the architecture parameters within the inference time of one forward pass. This method is extremely efficient and flexible: the total search time of the architecture generator for various hardware constraints is only 5 GPU hours. Moreover, we do not need to re-execute search strategies or retrain the supernet once the architecture generator is trained. When different constraints are to be met, the search strategy only needs to be conducted once, which is more flexible than the N searches required in previous single-path methods [fairnas][greedynas][scarlet].

The aforementioned idea is built on top of a trained supernet. However, we notice that searching on a single-path supernet still requires a lot of GPU memory and time because of the huge number of supernet parameters and the complex supernet structure. Previous single-path NAS methods [uniform_sampling][greedynas][scarlet] determine a block for each layer, and there may be many different candidate blocks with various configurations. For example, GreedyNAS [greedynas] has 13 types of candidate blocks for each layer, and thus the size of the search space is $13^l$, where $l$ denotes the total number of layers in the supernet. Inspired by the fine-grained supernet in AtomNAS [atomnas], we propose a novel single-path supernet called the unified supernet to reduce GPU memory consumption. In the unified supernet, we construct only one block, called a unified block, in each layer. There are multiple sub-blocks in a unified block, and each sub-block can be implemented by different operations. By combining sub-blocks, all configurations can be described within one block. In this way, the number of parameters of the unified supernet is much smaller than that of previous single-path methods.

The contributions of this paper are summarized as follows. With the architecture generator and the unified supernet, we propose Searching by Generating NAS (SGNAS), a flexible and efficient one-shot NAS framework. We illustrate the process of SGNAS in Fig. 1. Given various hardware constraints as input, the architecture generator can generate the best architecture for each hardware constraint instantly in one forward pass. After training the best architecture from scratch, the evaluation results show that SGNAS achieves 77.1% top-1 accuracy on the ImageNet dataset [imagenet] at around 370M FLOPs, which is comparable with state-of-the-art single-path methods. Meanwhile, SGNAS outperforms SOTA single-path NAS in terms of efficiency and flexibility.

2 Related Work

Recently, one-shot NAS [darts][fbnet][uniform_sampling] has received much attention because of the reduced search cost brought by the supernet. To further reduce the search cost, a number of methods have been proposed, which can be roughly divided into two types: efficient NAS and flexible NAS.

2.1 Efficient NAS

For efficiency, many methods have been proposed to improve the supernet training strategy or redesign the supernet architecture. Stamoulis et al. [singlelessfour] proposed a single-path supernet that encodes architectures with shared convolutional kernel parameters, which reduces the search cost of differentiable NAS. To reduce the huge cost of training on large-scale datasets, training the supernet and searching on proxy datasets such as CIFAR-10 or a subset of ImageNet was proposed in [fbnet][fbnetv2][tfnas][darts][cars]. PC-DARTS [pcdarts:] sampled only a small part of the supernet for training in each iteration to reduce computation cost. DA-NAS [danas] designed a data-adaptive pruning strategy for efficient architecture search.

2.2 Flexible NAS

For flexibility, OFA [onceforall] carefully trains a single full network; sub-networks inherit weights from it and can be directly deployed without training from scratch. An accuracy predictor is trained after supernet training to guide the search for a specialized sub-network. FBNetV3 [fbnetv3] trained a predictor on a proxy dataset; the accuracy predictor estimates the performance of a candidate sub-network. However, it is still time-consuming to train an accuracy predictor.

In this work, we focus on improving the search strategy in terms of both efficiency and flexibility. Note that our search strategy can be incorporated with the methods mentioned above.

3 Searching by Generating NAS

3.1 Background

Given a supernet represented by weights $w$, to find an architecture that achieves the best performance while meeting a specific hardware constraint, we need to find the best sub-network from the supernet which achieves the minimum validation loss $\mathcal{L}_{val}$. Sampling a sub-network from the supernet is a non-differentiable process. To enable optimization by the gradient descent algorithm, DNAS [fbnet][proxylessnas][darts] relaxes the problem to finding a set of continuous architecture parameters $\alpha$, and computes the weighted sum of the outputs of candidate blocks by the Gumbel softmax function [gumbel_softmax]:

$$x_{l+1} = \sum_{i} m_l^i \cdot b_l^i(x_l), \tag{1}$$

$$m_l^i = \frac{\exp\big((\alpha_l^i + g_l^i)/\tau\big)}{\sum_{j}\exp\big((\alpha_l^j + g_l^j)/\tau\big)}, \tag{2}$$

where $x_l$ is the input tensor of the $l$-th layer, $b_l^i$ is the $i$-th block of the $l$-th layer, and thus $b_l^i(x_l)$ denotes the output of the $i$-th block. The term $\alpha_l^i$ is the architecture parameter (weight) of the $i$-th block in the $l$-th layer. The term $g_l^i$ is a random variable sampled from the Gumbel distribution, and $\tau$ is the temperature parameter. The value $m_l^i$ is the weight for the output $b_l^i(x_l)$.

After relaxation, DNAS can be formulated as a bi-level optimization:

$$\alpha^{*} = \arg\min_{\alpha} \mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big), \tag{3}$$

$$\text{s.t.}\;\; w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha), \tag{4}$$

where $\mathcal{L}_{train}$ is the training loss.
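To make the relaxation concrete, the following is a minimal PyTorch-style sketch of Eqs. (1)-(2) for a single layer. The candidate blocks, channel sizes, and temperature are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical candidate blocks for one supernet layer (for illustration only).
candidate_blocks = nn.ModuleList(
    nn.Conv2d(16, 16, k, padding=k // 2) for k in (1, 3, 5)
)
# Architecture parameters alpha for this layer: one logit per candidate block.
alpha = nn.Parameter(torch.zeros(len(candidate_blocks)))

def layer_forward(x, tau=5.0):
    # Eq. (2): Gumbel-softmax weights m_i, differentiable w.r.t. alpha.
    m = F.gumbel_softmax(alpha, tau=tau, hard=False)
    # Eq. (1): weighted sum of candidate-block outputs.
    return sum(m[i] * block(x) for i, block in enumerate(candidate_blocks))

out = layer_forward(torch.randn(2, 16, 32, 32))  # shape: (2, 16, 32, 32)
```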

Because of the bi-level optimization of $w$ and $\alpha$, the best architecture sampled from the supernet is only suitable for one specific hardware constraint. With this search process, for N different hardware constraints, the supernet and the architecture parameters must be retrained N times. This makes DNAS less flexible.

In contrast, single-path methods [uniform_sampling][fairnas][scarlet][mixpath] decouple supernet training from architecture searching. For supernet training, only a single path consisting of one block in each layer is activated and optimized in each iteration to simulate the discrete neural architectures in the search space. We can formulate the process as:

$$w^{*} = \arg\min_{w} \; \mathbb{E}_{a \sim \Gamma(A)}\big[\mathcal{L}_{train}\big(w(a)\big)\big], \tag{5}$$

where $w(a)$ denotes the subset of $w$ corresponding to the sampled architecture $a$, and $\Gamma(A)$ is a prior distribution of the architectures in the search space $A$. The best weights to be determined are the ones yielding the minimum expected training loss. After training, the supernet is treated as a performance estimator for all architectures in the search space. With the pretrained supernet weights $w^{*}$, we can search for the best architecture $a^{*}$:

$$a^{*} = \arg\min_{a \in A} \mathcal{L}_{val}\big(w^{*}(a)\big). \tag{6}$$

Single-path methods are more flexible than DNAS because supernet training and architecture search are decoupled. Once the supernet is trained, only the architecture search needs to be conducted N times for N different constraints.
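The following is a minimal sketch of this uniform single-path training scheme (Eq. (5)). The toy supernet, block choices, and optimizer settings are illustrative assumptions, not the authors' implementation.

```python
import random
import torch
import torch.nn as nn

class ToySinglePathSupernet(nn.Module):
    """Illustrative supernet: each layer holds several candidate blocks and the
    forward pass activates exactly one block per layer (a single path)."""
    def __init__(self, num_layers=4, channels=16):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(nn.Conv2d(channels, channels, k, padding=k // 2)
                          for k in (1, 3, 5))
            for _ in range(num_layers)
        )
        self.head = nn.Linear(channels, 10)

    def forward(self, x, arch):
        # arch[l] is the index of the block activated in layer l.
        for layer, choice in zip(self.layers, arch):
            x = layer[choice](x)
        return self.head(x.mean(dim=(2, 3)))  # global average pooling + classifier

def train_step(supernet, x, y, optimizer, criterion):
    # Uniformly sample a single path (one block per layer), as in Eq. (5).
    arch = [random.randrange(len(layer)) for layer in supernet.layers]
    optimizer.zero_grad()
    loss = criterion(supernet(x, arch), y)
    loss.backward()          # only the weights on the sampled path receive gradients
    optimizer.step()
    return loss.item()

supernet = ToySinglePathSupernet()
opt = torch.optim.SGD(supernet.parameters(), lr=0.05)
train_step(supernet, torch.randn(8, 16, 32, 32), torch.randint(0, 10, (8,)),
           opt, nn.CrossEntropyLoss())
```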

In this work, we propose to decouple supernet training from architecture searching and train supernet as in single-path NAS (Eq. (5)). After supernet training, we search the best architecture by the gradient descent algorithm as in DNAS (Eq. (3)). Instead of training architecture parameters for one specific hardware constraint, we propose a novel search strategy called architecture generator to largely increase flexibility and efficiency.

3.2 Architecture Generator

3.2.1 Essential Idea

Given the target hardware constraint $C$, the architecture generator generates the best architecture parameters for $C$. The process of the architecture generator can be described as $G: C \rightarrow \alpha$ such that $\alpha = G(C)$. With the architecture generator $G$, the objective function of the architecture search in Eq. (3) can be reformulated as:

$$\min_{G} \; \mathcal{L}_{val}\big(w^{*}(G(C)),\, G(C)\big). \tag{7}$$

To make $G$ generate the best architecture parameters for different hardware constraints accurately, we propose the hardware constraint loss:

$$\mathcal{L}_{HC} = \big(\mathrm{Cost}(G(C)) - C\big)^2, \tag{8}$$

where the cost yielded by the generated architecture is estimated by:

$$\mathrm{Cost}(G(C)) = \sum_{l}\sum_{i} m_l^i \cdot c_l^i. \tag{9}$$

The term $c_l^i$ is the constant cost of the $i$-th block in the $l$-th layer, and $m_l^i$ is the block weight described in Eq. (2). The cost is differentiable with respect to $m_l^i$ and thus to the generated architecture parameters, similarly to [fbnet][tfnas]. Note that Eq. (9) is also highly correlated with latency, as mentioned in [fbnet] and [tfnas]. By combining the hardware constraint loss and the cross-entropy loss defined in Eq. (7), the overall loss of the architecture generator is:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \cdot \mathcal{L}_{HC}, \tag{10}$$

where $\lambda$ is a hyper-parameter to trade off the validation loss and the hardware constraint loss.
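Below is a hedged sketch of how Eqs. (8)-(10) could be computed in PyTorch. The squared gap in the constraint loss and the `cost_table` of pre-measured per-block costs are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def expected_cost(alpha, cost_table, tau=5.0):
    """Differentiable cost estimate in the spirit of Eq. (9): per layer, the
    Gumbel-softmax weights are combined with constant per-block costs (e.g., FLOPs)."""
    # alpha, cost_table: (num_layers, num_blocks); cost_table holds pre-measured costs.
    m = F.gumbel_softmax(alpha, tau=tau, hard=False, dim=-1)
    return (m * cost_table).sum()

def generator_loss(logits, targets, alpha, cost_table, target_constraint, lam=3e-4):
    """Overall generator loss in the spirit of Eq. (10): validation cross-entropy
    plus a weighted hardware-constraint penalty (a squared gap is assumed here)."""
    ce = F.cross_entropy(logits, targets)
    hc = (expected_cost(alpha, cost_table) - target_constraint) ** 2  # cf. Eq. (8)
    return ce + lam * hc
```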

3.2.2 Accurate Generation with Random Prior

In practice, we found that the architecture generator easily overfits to a specific hardware constraint. The reason is that it is too difficult to generate complex and high-dimensional architecture parameters based on a single simple integer hardware constraint $C$.

To address this issue, a prior is given as an additional input to stabilize the architecture generator. We randomly sample a neural architecture from the search space and encode it as a one-hot vector to serve as prior knowledge of the architecture parameters. We call it a random prior $P$. Formally, $P = \{p_1, p_2, \dots, p_L\}$, where $p_l$ is the one-hot encoding of the $l$-th layer of the neural architecture randomly sampled from the search space, and $L$ is the total number of layers in the supernet. With the random prior, the architecture generator only needs to learn the residual from the random prior to the best architecture parameters, which makes training the architecture generator more stable and accurate (blue line in Fig. 7(a)). The process of the architecture generator can then be reformulated as $G: (C, P) \rightarrow \alpha$ such that $\alpha = G(C, P)$.
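A minimal sketch of sampling and encoding such a random prior follows; the number of layers and choices per layer are illustrative.

```python
import torch

def sample_random_prior(num_layers=19, num_choices=6):
    """Sample one choice per layer uniformly and encode it as a one-hot prior
    map with the same shape as the architecture parameters (sizes illustrative)."""
    indices = torch.randint(0, num_choices, (num_layers,))
    prior = torch.zeros(num_layers, num_choices)
    prior[torch.arange(num_layers), indices] = 1.0
    return prior

prior = sample_random_prior()  # shape: (19, 6)
```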

3.2.3 The Generator Training Algorithm

We illustrate the training of the architecture generator in Algorithm 1. In each iteration, given the target constraint $C$ and the random prior, the architecture generator generates the architecture parameters $\alpha$ (as illustrated in Fig. 1). With $\alpha$, the corresponding cost can be calculated by Eq. (9), and the prediction can be obtained from the pretrained supernet weighted by $\alpha$. The total loss is then given by Eq. (10). No matter what constraint is given, the architecture generator learns to generate architecture parameters that yield the best prediction results. Therefore, training the generator is equivalent to searching for the best architectures under various constraints in the proposed SGNAS.

3.2.4 The Architecture of the Generator

Fig. 2 illustrates the architecture of the generator. We set the channel size of all convolutional layers to 32 and the stride to 1, making sure the output shape is the same as that of the random prior. Please refer to the supplementary materials for detailed configurations of the architecture generator and the random prior representation.

1: Input: $P$: random prior; $S$: unified supernet; $G$: generator; $[C_{min}, C_{max}]$: pre-defined hardware constraint interval; $D_{val}$: validation dataset; $T$: max iterations
2: for $t = 1, \dots, T$ do
3:     Get a data batch $x$ and $y$ from $D_{val}$
4:     Randomly sample a target constraint $C$ from $[C_{min}, C_{max}]$
5:     $\alpha \leftarrow G(C, P)$
6:     Compute the cost of $\alpha$ by Eq. (9)
7:     Predict $\hat{y} \leftarrow S(x, \alpha)$
8:     Compute the total loss $\mathcal{L}$ by Eq. (10)
9:     Calculate gradients of $\mathcal{L}$ with respect to $G$
10:    Update $G$ from the gradients
11: end for
Algorithm 1 Training Architecture Generator
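For concreteness, here is a PyTorch-style sketch of this training loop. The `generator(C, prior)` and `supernet(x, alpha)` interfaces, the cost table, and the FLOPs interval are illustrative assumptions rather than the authors' code; in practice the constraint would typically be normalized.

```python
import random
import torch
import torch.nn.functional as F

def train_generator(generator, supernet, prior, cost_table, val_loader,
                    constraint_range=(275e6, 365e6), lam=3e-4, max_iters=1000):
    """A sketch of Algorithm 1: only the generator is trained; the pretrained
    supernet is used as a fixed performance estimator."""
    optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)
    data_iter = iter(val_loader)
    for _ in range(max_iters):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(val_loader)
            x, y = next(data_iter)
        # Randomly sample a target hardware constraint C from the predefined interval.
        c = random.uniform(*constraint_range)
        alpha = generator(torch.tensor([[c]]), prior)               # architecture parameters
        logits = supernet(x, alpha)                                 # predict with the pretrained supernet
        m = F.gumbel_softmax(alpha, tau=5.0, hard=False, dim=-1)    # block weights, cf. Eq. (2)
        cost = (m * cost_table).sum()                               # estimated cost, cf. Eq. (9)
        loss = F.cross_entropy(logits, y) + lam * (cost - c) ** 2   # total loss, cf. Eq. (10)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()        # only the generator is updated; the supernet stays fixed
```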
Figure 2: Structure of the architecture generator. Given the input target hardware constraint, the expansion layer expands the input to a tensor with the same shape as the random prior.

3.3 Unified Supernet

Previous single-path NAS methods [fairnas][scarlet][uniform_sampling][greedynas] adopt the MobileNetV2 inverted bottleneck [mobilenetv2] as the basic building block. Given the input tensor $x$ with $C_{in}$ channels, the corresponding output $y$ with $C_{out}$ channels is obtained by

$$y = P_{eC_{in},\, C_{out}}\Big(D_{k}\big(P_{C_{in},\, eC_{in}}(x)\big)\Big), \tag{11}$$

where $P_{c_1, c_2}$ denotes the pointwise convolution with input channel size $c_1$ and output channel size $c_2$, and $D_k$ denotes the depthwise convolution with kernel size $k$. Eq. (11) represents that $x$ with $C_{in}$ channels is first expanded to a tensor $x'$ with $eC_{in}$ channels, which can be described as $x' = P_{C_{in},\, eC_{in}}(x)$, and then a depthwise convolution is conducted. The term $e$ denotes the expansion rate of the inverted bottleneck. After that, the tensor with $eC_{in}$ channels is embedded into the output tensor with $C_{out}$ channels. Because one basic building block can only represent one configuration with one kernel size and one expansion rate, previous single-path NAS methods [proxylessnas][scarlet][fairnas] need to construct blocks of various configurations in each layer, which leads to an exponential increase in the number of parameters and the complexity of the supernet.

In this work, we propose a novel single-path supernet called the unified supernet to improve the efficiency and flexibility of the architecture generator. Only one type of block, i.e., the unified block, is constructed in each layer. The unified block is built with only the maximum expansion rate $e_{max}$ (i.e., $e_{max} = 6$ in our setting).

Fig. 3 illustrates the idea of a unified block. To make the unified block represent all possible configurations, we replace the depthwise convolution with $e_{max}$ sub-blocks, and each sub-block can be implemented by different operations or a skip connection. The output tensor $x'$ of the first pointwise convolution is equally split into $e_{max}$ parts, $x'_1, x'_2, \dots, x'_{e_{max}}$. With the sub-blocks $SB_i$ and the split tensors $x'_i$, we can reformulate the depthwise convolution $D_k(x')$ in Eq. (11) as:

$$D(x') = \mathrm{concat}\big(SB_1(x'_1),\, SB_2(x'_2),\, \dots,\, SB_{e_{max}}(x'_{e_{max}})\big), \tag{12}$$

where $\mathrm{concat}$ denotes the channel concatenation function.

With the sub-blocks implemented by different operations, we can simulate blocks with various expansion rates, as shown in Fig. 4. The unified supernet thus can significantly reduce the parameters and GPU memory consumption. It is interesting that the MixConv in MixNet [mixconv] is a special case of our search space if different sub-blocks are implemented by different kernel sizes.

Figure 3: Illustration of a unified block. Given the input, the first pointwise convolution expands the input channel size by $e_{max}$ times. The channel split layer splits the tensor into $e_{max}$ parts and feeds them to the sub-blocks, respectively.
Figure 4: Different expansion rates can be simulated by sub-blocks with different operations. For example, the expansion rate 6 can be simulated if no skip connection is implemented, and the expansion rate 2 can be simulated if four skip connections are implemented.
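A sketch of how such a unified block could be implemented in PyTorch is shown below; the exact layer configuration (batch normalization, activation, kernel choices) is an assumption for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """Sketch of a unified block (Fig. 3): pointwise expansion to
    max_expansion * C_in channels, a channel split into max_expansion sub-blocks
    (each a depthwise convolution or a skip connection), concatenation (Eq. (12)),
    and a pointwise projection."""
    def __init__(self, c_in, c_out, sub_block_kernels, max_expansion=6):
        super().__init__()
        assert len(sub_block_kernels) == max_expansion
        self.c_in = c_in
        self.expand = nn.Sequential(
            nn.Conv2d(c_in, c_in * max_expansion, 1, bias=False),
            nn.BatchNorm2d(c_in * max_expansion),
            nn.ReLU(inplace=True),
        )
        # One sub-block per split: kernel size k -> depthwise conv; None -> skip connection.
        self.sub_blocks = nn.ModuleList(
            nn.Identity() if k is None else nn.Sequential(
                nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
                nn.BatchNorm2d(c_in),
                nn.ReLU(inplace=True),
            )
            for k in sub_block_kernels
        )
        self.project = nn.Sequential(
            nn.Conv2d(c_in * max_expansion, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        splits = torch.split(self.expand(x), self.c_in, dim=1)   # max_expansion chunks of C_in channels
        outs = [op(t) for op, t in zip(self.sub_blocks, splits)]
        return self.project(torch.cat(outs, dim=1))

# Two skip sub-blocks simulate an expansion rate of 4 (cf. Fig. 4).
block = UnifiedBlock(32, 32, sub_block_kernels=[3, 5, 7, 3, None, None])
y = block(torch.randn(2, 32, 28, 28))   # shape: (2, 32, 28, 28)
```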

3.3.1 Large Variability of BNs Statistics

As in [atomnas], [universlimm], and [aows], we suffer from the problem of unstable running statistics of batch normalization (BN). In the unified supernet, because one unified block represents different expansion rates, the BN scales change more dramatically during training. To address the problem, BN recalibration [atomnas][universlimm][aows] is used to recalculate the running statistics of BNs by forwarding thousands of training samples after training. On the other hand, shadow batch normalization (SBN) [mixpath] or switchable batch normalization [slimmable] can be used to stabilize BN. In this work, we utilize SBN to address the large-variability issue, as illustrated in Fig. 4. In our setting, there are five different expansion rates, i.e., 2, 3, 4, 5, and 6. We thus place five BNs after the second pointwise convolution to capture the BN statistics for the different expansion rates. With SBN, we can capture the different statistics and make supernet training more stable.

3.3.2 Architecture Redundancy

Denote two sub-blocks as $SB_1$ and $SB_2$. In the unified supernet, for example, the case where $SB_1$ uses a depthwise convolution of some kernel size and $SB_2$ uses a skip connection is distinct from the case where $SB_1$ uses a skip connection and $SB_2$ uses the same depthwise convolution. However, these two cases actually correspond to the same sub-network, and thus the architecture redundancy problem arises. This redundancy makes the unified supernet more complex and harder to train. To address this issue, we force skip connections to be used only in the sub-blocks with the highest indices. For example, if we want to train a unified block with expansion rate 3, only the last three sub-blocks can be skip connections. We call this strategy forced sampling (FS). Please refer to the supplementary materials for details of architecture redundancy and forced sampling.
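A small sketch of forced sampling follows; the operation encoding (kernel sizes vs. `None` for a skip connection) is an illustrative convention.

```python
import random

def forced_sample_sub_blocks(num_sub_blocks=6, kernel_choices=(3, 5, 7),
                             expansion_choices=(2, 3, 4, 5, 6)):
    """Forced sampling (FS): pick a target expansion rate, sample kernel sizes for
    the convolutional sub-blocks, and place the skip connections only in the
    highest-indexed sub-blocks so each sub-network has a unique arrangement."""
    expansion = random.choice(expansion_choices)
    num_skips = num_sub_blocks - expansion
    ops = [random.choice(kernel_choices) for _ in range(expansion)]  # convolutional sub-blocks
    ops += [None] * num_skips                                        # skips forced to the tail
    return ops

print(forced_sample_sub_blocks())   # e.g. [5, 3, 7, None, None, None] for expansion rate 3
```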

4 Experiments

4.1 Experimental Settings

We adopt the same macro structure of the supernet (e.g., channel size of each layer and the number of layers) as [fairnas] and [proxylessnas], but utilize the proposed unified blocks to reduce GPU memory consumption and the number of parameters. Each sub-block in a unified block can be implemented with a convolutional kernel size of 3, 5, or 7, or a skip connection. We set the minimum and maximum expansion rates to 2 and 6, respectively. The resulting search space is much larger than those of previous single-path methods (see Sec. 5.2.1). Please refer to the supplementary materials for more details.

For experiments on the ImageNet dataset [imagenet]

, we train the unified supernet for 50 epochs using batch size 256 and adopt the stochastic gradient descent optimizer. The learning rate is decayed with the cosine annealing strategy

[cosine_ann]

from the initial value 0.045. After supernet training, the architecture generator is trained for 50 epochs by the Adam optimizer with the learning rate 0.001. After searching/generating the best architecture under hardware constraints, we adopt the RMSProp optimizer with 0.9 momentum

[mnasnet] to train the searched architecture from scratch. Learning rate is increased from 0 to 0.16 in the first 5 epochs with batch size 256, and then decays 0.03 every 3 epochs.

4.2 Experiments on ImageNet

4.2.1 Comparison with Baselines

Li and Talwalkar [randomsearch] showed that a random search approach usually achieves satisfactory performance. For comparison, we randomly select 1,000 candidate architectures with FLOPs under 320 million (320M) from the unified supernet and pick the architecture yielding the highest top-1 accuracy, as mentioned in [randomsearch]. Besides, we also search for a network with FLOPs under 320M using the evolution algorithm [uniform_sampling] as another baseline.

Table 1 shows the comparison results. As can be seen, with around 320M FLOPs, the proposed SGNAS achieves the highest top-1 accuracy. Both baselines take around 34 GPU hours to complete one search. For N different hardware constraints, the search strategy must be re-executed N times, so the total search time of each baseline is 34N GPU hours. In contrast, SGNAS only takes 5 GPU hours in total for N different hardware constraints, which is much more efficient and flexible.

Search Strategy | Search Time (GPU hrs) | FLOPs (M) | Top-1 (%)
Random search | 34N | 322 | 74.63
Evolution algorithm | 34N | 318 | 74.67
SGNAS | 5 | 324 | 74.87
Table 1: Performance comparison with baselines. Search time is the total for N different hardware constraints.

4.2.2 Comparison with SOTAs

Method | FLOPs (M) | Top-1 (%) | Train time | Search time
MobileNetV2 [mobilenetv2] 300 72.0
EfficientNet B0 [efficientnet] 390 76.3
MixNet-M [mixconv] 360 77.0
MixPath-A [mixpath] 349 76.9 240
AtomNAS-C [atomnas] 363 77.6 0
PC-DARTS [pcdarts:] 597 75.8 0
ScarletNAS-A [scarlet] 365 76.9 240
GreedyNAS-A [greedynas] 366 77.1 168
SGNAS-A (Ours) 373 77.1 280 5
FBNetV2-L1 [fbnetv2] 325 77.2 0
Proxyless-R [proxylessnas] 320 74.6 0
FairNAS-C [fairnas] 325 76.7 240
ScarletNAS-B [scarlet] 329 76.3 240
SPOS [uniform_sampling] 326 74.5 288
GreedyNAS-B [greedynas] 324 76.8 168
SGNAS-B (Ours) 326 76.8 280 5
MobileNetV3-L [mobilenetv3] 219 75.2
ScarletNAS-C [scarlet] 280 75.6 240
GreedyNAS-C [greedynas] 284 76.2 168
SGNAS-C (Ours) 281 76.2 280 5
Table 2: Comparison with the SOTAs for different hardware constraints. : training with AutoAugment [autoaugment]. : searching on a proxy dataset. The unit of search time and train time is GPU hours.

This section compares SGNAS with various SOTA one-shot NAS methods that utilize augmentation techniques (e.g., the Swish activation function [swish] and Squeeze-and-Excitation [se]). We directly modify the searched architectures by replacing all ReLU activations with H-Swish [mobilenetv3] and equipping them with squeeze-and-excitation modules, as in AtomNAS [atomnas].

For comparison, similar to the settings in ScarletNAS [scarlet] and GreedyNAS [greedynas], we search architectures under 275M, 320M, and 365M FLOPs, and denote the searched architectures as SGNAS-C, SGNAS-B, and SGNAS-A, respectively. The comparison results are shown in Table 2. The column “Train time” denotes the time needed to train the supernet, and the column “Search time” denotes the time needed to search for the best architecture based on the pre-trained supernet. Because DNAS couples architecture searching with supernet optimization, we list the time needed for the entire pipeline in the “Search time” column. As can be seen, SGNAS is competitive with the SOTAs in terms of top-1 accuracy under different FLOPs. For example, SGNAS-A achieves 77.1% top-1 accuracy, which outperforms ScarletNAS-A [scarlet] by 0.2%, MixNet-M [mixconv] by 0.1%, and MixPath-A [mixpath] by 0.2%, and is comparable with GreedyNAS-A [greedynas].

More importantly, SGNAS achieves much higher search efficiency. With the architecture generator and the unified supernet, only 5 GPU hours in total are needed on a Tesla V100 GPU, even when architectures under N different hardware constraints are required. In contrast, FairNAS [fairnas], GreedyNAS [greedynas], and ScarletNAS [scarlet] must re-execute their searches, so their search cost grows linearly with N (e.g., 24N GPU hours for GreedyNAS). Supernet retraining is needed for FBNetV2 [fbnetv2] and AtomNAS [atomnas], which makes the search very inefficient.

Note that after finding the best architecture, training from scratch is required by most methods in Table 2 (including SGNAS); AtomNAS [atomnas] is an exception. However, training a supernet that can be directly deployed under many constraints (like AtomNAS) requires expensive computation. Even when the time for training from scratch is included, SGNAS is still more efficient and flexible than AtomNAS.

4.3 Experiments on NAS-Bench-201

Method | Search Time (GPU hrs) | CIFAR-10 Val | CIFAR-10 Test | CIFAR-100 Val | CIFAR-100 Test | ImageNet-16-120 Val | ImageNet-16-120 Test
optimal | N/A | 91.61 | 94.37 | 73.49 | 73.51 | 46.77 | 47.31
RSPS [randomsearch] | 2.6 | 84.16±1.69 | 87.66±1.69 | 59.00±4.60 | 58.33±4.34 | 31.56±3.28 | 31.14±3.88
DARTS [darts] | – | 39.77±0.00 | 54.30±0.00 | 15.03±0.00 | 15.61±0.00 | 16.43±0.00 | 16.32±0.00
SETN [setn] | – | 82.25±5.17 | 86.19±4.63 | 56.86±7.59 | 56.87±7.77 | 32.54±3.63 | 31.90±4.07
GDAS [gdas] | – | 90.00±0.21 | 93.51±0.13 | 71.14±0.27 | 70.61±0.26 | 41.70±1.26 | 41.84±0.90
SGNAS (Ours) | 2.5 | 90.18±0.31 | 93.53±0.12 | 70.28±1.2 | 70.31±1.09 | 44.65±2.32 | 44.98±2.10
Table 3: Performance comparison on different datasets in the NAS-Bench-201 benchmark.

To demonstrate the efficiency and robustness of SGNAS more fairly, we evaluate it on a NAS benchmark called NAS-Bench-201 [NAS-Bench-201:]. NAS-Bench-201 includes 15,625 architectures in total and provides full information (e.g., top-1 accuracy and FLOPs) for each of them on the CIFAR-10, CIFAR-100, and ImageNet-16-120 datasets [iamgenet16].

Based on the search space defined by NAS-Bench-201, we follow SETN [setn] to train the supernet by uniform sampling. After that, the architecture generator is applied to search architectures on the supernet. We search on the CIFAR-10 dataset and look up the ground-truth performance of the searched architectures on CIFAR-10, CIFAR-100, and ImageNet-16-120, respectively. This process is run three times, and the average performance is reported in Table 3. The architectures searched by SGNAS outperform previous methods on both CIFAR-10 and ImageNet-16-120. It is worth noting that, with the same supernet training strategy as SETN [setn], our result greatly surpasses SETN on all three datasets. Moreover, the required search time of SGNAS is only 2.5 GPU hours, even when multiple hardware constraints are considered.

We show the 15,625 architectures of NAS-Bench-201 on each dataset as gray dots in Fig. 5, and draw the architectures searched by the architecture generator under different FLOPs as blue rectangles. After searching once, the architecture generator can generate all blue rectangles directly without re-searching. Moreover, the generated architectures under various constraints approach the best among all choices.

Figure 5: Search results of SGNAS on the CIFAR-10, CIFAR-100, and ImageNet-16-120 datasets. (a) Result on CIFAR-10; (b) Result on CIFAR-100; (c) Result on ImageNet-16-120.

4.4 Performance on Object Detection

Model FLOPs(M) Top-1 (%) mAP (%)
MobileNetV2 [mobilenetv2] 300 72.0 28.3
MixNet-M [mixconv] 360 77.0 31.3
FairNAS-A [fairnas] 392 77.5 32.4
Scarlet-A [scarlet] 365 76.9 31.4
MobileNetV2 300 72.0 29.4
SGNAS-A (Ours) 373 77.1 33.9
Table 4: Performance comparison on COCO object detection. The second MobileNetV2 entry is our implementation result; the other baseline results are reported in [scarlet][fairnas].

To verify the transferability of SGNAS on object detection, we adopt the RetinaNet [retinanet] implemented in MMDetection [mmdetection] to do object detection, but replace its backbone by the network searched by SGNAS. The models are trained and evaluated on the MS COCO dataset [coco] (train2017 and val2017, respectively) for 12 epochs with batch size 16 [fairnas][scarlet]. We use the SGD optimizer with 0.9 momentum and 0.0001 weight decay. The initial learning rate is 0.01, and is multiplied by 0.1 at epochs 8 and 11. Table 4 shows that SGNAS has better transferability than the baselines, especially in terms of mAP.

5 Ablation Studies

5.1 Analysis of SGNAS

In Fig. 6(a), we randomly sample 360 architectures from the search space and illustrate their top-1 validation accuracies as blue dots. Moreover, we draw the architectures searched by SGNAS under different hardware constraints as red rectangles. As can be seen, the architectures searched by SGNAS are almost always the best.

Figure 6: (a) Top-1 validation accuracy of randomly sampled architectures (blue dots) and the architectures searched by SGNAS (red rectangles). Performance of other variants is also shown. (b) The relationship between the number of supernet parameters and the number of candidate operations in each layer.

5.2 Analysis of Unified Supernet

5.2.1 Efficiency of Unified Supernet

Unified Supernet | Search Space | Batch Size | Search Time (GPU hrs) | Memory Cost (GPU)
✗ | – | 32 | 11 | 10.5 GB
✗ | – | 128 | – | 40 GB
✓ | – | 32 | 5 | 9 GB
✓ | – | 128 | – | 28 GB
Table 5: Comparison in terms of GPU memory consumption and search time of the architecture generator.

To show the efficiency of the proposed unified supernet, we report the relationship between the total number of parameters in the (unified) supernet and the number of candidate operations per layer in Fig. 6(b). For a fair comparison, we calculate the number of parameters of the different supernets based on 19 layers. As can be seen, the number of possible operations per layer of the unified supernet is 7 times larger than that of GreedyNAS [greedynas] and ScarletNAS [scarlet], but the number of parameters needed to represent the unified supernet is only about 1/6 of theirs. The number of possible operations is 13 times larger than that of FairNAS [fairnas] and ProxylessNAS [proxylessnas], but the number of supernet parameters for the unified supernet is only about half of theirs. To compare under the same size of search space, we estimate the number of supernet parameters required by previous single-path methods [fairnas][greedynas][scarlet][proxylessnas] when the number of possible operations in each layer increases to 25, 50, 60, and 80, and show them as green squares. Fig. 6(b) shows that the required number of parameters grows significantly as the number of possible operations increases, while the unified supernet avoids this intractability. Under the same size of search space, the number of parameters needed to represent the unified supernet is only about 1/35 of that of the estimated supernets.

Table 5 shows the comparison in terms of GPU memory consumption and search time of the architecture generator when it works on the unified supernet versus the previous single-path supernet [fairnas][proxylessnas]. Based on the unified supernet, the GPU memory consumption is reduced (from 40 GB to 28 GB with batch size 128, and from 10.5 GB to 9 GB with batch size 32), and the search time drops from 11 to 5 GPU hours.

5.2.2 Training Stabilization

Although the unified supernet largely reduces the number of supernet parameters, the large search space makes the supernet hard to train. To study the effect of forced sampling (FS) and shadow batch normalization (SBN) [mixpath] on supernet training, we train the supernet under different settings: baseline, with FS, with SBN, and with both FS and SBN. After training each supernet, we randomly sample 360 architectures from the search space and show the corresponding top-1 accuracies in Fig. 6(a). Without FS and SBN, because of the large variability and complex architecture, the baseline supernet is hard to train. With SBN, the variability can be well characterized and the performance becomes more stable. With FS, the complexity of the supernet is greatly reduced by removing architecture redundancy, and the performance is largely boosted. With both FS and SBN, the unified supernet can more consistently represent architectures with good performance.

5.3 Study of Random Priors

Figure 7: (a) The relationship between target FLOPs and the FLOPs of generated architecture. (b) Performance of the architectures randomly sampled from the unified supernet (blue dots) and those generated by the architecture generator trained based on four random priors.

To enable the generator to generate architectures under various hardware constraints accurately, the random prior is given as an input to the generator. In Fig. 7(a), we show the correlation between the target FLOPs and the FLOPs of the generated architectures. With the random prior, the generator can match the target FLOPs much more accurately: the Kendall tau correlation between the target FLOPs and the FLOPs of the generated architectures is 1, and the Pearson correlation is 0.99, indicating a strongly positive relationship.

We randomly sample four sub-networks A, B, C, and D from the unified supernet as four priors to train the architecture generator. With weights inherited from the unified supernet, the top-1 validation accuracies of these four sub-networks are 58.02%, 63.36%, 66.48%, and 68.48%, respectively. Fig. 7(b) shows that, no matter whether it starts from a good or a bad prior, the trained architecture generator is able to generate architectures yielding the best performance. This shows that the random prior is not meant to improve top-1 accuracy, but to provide a reasonable starting point so that the architecture generator can generate good architectures under the target constraints.

6 Conclusion

To improve the efficiency and flexibility of finding the best sub-networks from the supernet under various hardware constraints, we propose the architecture generator, which searches for the best architecture by generating it. This approach is very efficient and flexible because, compared with previous one-shot NAS methods, only one forward pass is needed to generate a good architecture for each constraint. To ease GPU memory consumption and speed up searching, we propose the unified supernet, which consists of a stack of unified blocks. We show that the proposed one-shot framework, called Searching by Generating NAS (SGNAS), is extremely efficient and flexible compared with state-of-the-art methods. We further comprehensively investigate the impact of the architecture generator and the unified supernet from multiple perspectives. Please refer to the supplementary materials for the limitations of SGNAS.

Acknowledgement This work was funded in part by Qualcomm through a Taiwan University Research Collaboration Project and in part by the Ministry of Science and Technology, Taiwan, under grants 108-2221-E-006-227-MY3, 107-2923-E-006-009-MY3, and 109-2218-E-002-015.

References

Appendix A More Details of Experimental Settings

a.1 Dataset

We perform all experiments on the ImageNet dataset [imagenet]. Following the settings in [proxylessnas][greedynas][scarlet], we randomly sample 50,000 images (50 images per class) from the training set as our validation set, and keep the rest as the training set. The original validation set is taken as the test set to measure the final performance of each model. The resolution of input images is 224×224.

a.2 Supernet Training

We train the unified supernet for 50 epochs using batch size 256 and adopt the stochastic gradient descent optimizer with a momentum of 0.9 and weight decay of . The learning rate is decayed based on the cosine annealing strategy from initial value 0.045. We train the unified supernet with strict fairness [fairnas] so that each operation in all sub-blocks and each expansion rate are trained fairly.

a.3 Generator Training

After supernet training, the architecture generator is trained for 50 epochs using batch size 128 by the Adam optimizer with learning rate 0.001, momentum (0.5, 0.999), and weight decay 0. The temperature $\tau$ of the Gumbel softmax [gumbel_softmax] in Eq. (2) is initially set to 5.0 and annealed by a factor of 0.95 every epoch. The trade-off parameter $\lambda$ in Eq. (10) is set to 0.0003 in our experiments.

Appendix B Details of Search Space

The macro-architecture of our unified supernet is shown in Table 6.

Appendix C More Details of Architecture Generator

A random prior is encoded into one-hot format and then reshaped to match the shape of the architecture parameters to be generated. The output of the architecture generator is a parameter map of the same shape. Motivated by generative adversarial networks, we reshape the random prior so that its shape is the same as the output map. We then feed it to the architecture generator, where 2D convolutions with stride 1 can be applied. Without carefully tuning convolution parameters, we can ensure that the shape of the output map fits different search spaces. This design makes the generator easily adapt to different settings. We have experimented with various structures for the architecture generator (e.g., fully connected layers) and found that convolutional layers yield reliable results.
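A hedged sketch of a generator with this structure follows; the prior shape, the number of convolutional layers, and the use of a linear expansion layer are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ArchitectureGenerator(nn.Module):
    """Sketch of the generator in Fig. 2 / Appendix C: the scalar target constraint
    is expanded to a map with the same shape as the random prior, concatenated with
    the prior, and processed by stride-1 2D convolutions (channel size 32) into an
    architecture-parameter map."""
    def __init__(self, prior_shape=(19, 24), hidden=32, num_layers=3):
        super().__init__()
        self.prior_shape = prior_shape
        self.expand = nn.Linear(1, prior_shape[0] * prior_shape[1])  # expansion layer
        layers, c_in = [], 2
        for _ in range(num_layers - 1):
            layers += [nn.Conv2d(c_in, hidden, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
            c_in = hidden
        layers += [nn.Conv2d(c_in, 1, 3, stride=1, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, constraint, prior):
        # constraint: (B, 1) normalized target; prior: (B, L, K) one-hot map.
        c_map = self.expand(constraint).view(-1, 1, *self.prior_shape)
        x = torch.cat([c_map, prior.unsqueeze(1)], dim=1)   # 2 input channels
        return self.body(x).squeeze(1)                      # (B, L, K) architecture parameters

gen = ArchitectureGenerator()
alpha = gen(torch.tensor([[0.5]]), torch.zeros(1, 19, 24))  # shape: (1, 19, 24)
```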

Input shape Block C N S E
Conv 3×3 32 1 2 -
MBConv 3×3 16 1 1 1
Unified Block 32 2 2 (2, 6, 1)
Unified Block 40 4 2 (2, 6, 1)
Unified Block 80 4 2 (2, 6, 1)
Unified Block 96 4 1 (2, 6, 1)
Unified Block 192 4 2 (2, 6, 1)
Unified Block 320 1 1 (2, 6, 1)
Unified Block 1280 1 1 (2, 6, 1)
Avg pool - 1 1 -
FC 1000 1 - -
Table 6: Macro-architecture of the search space. MBConv denotes the MobileNetV2 [mobilenetv2] block with kernel size 3. Column C denotes the number of output channels of a block. Column N denotes the number of blocks. Column S denotes the stride of the first block when multiple blocks are stacked. Column E denotes the expansion rate of the blocks; a tuple of three values represents the lowest value, the highest value, and the step between options (low, high, step).
Figure 8: Illustration of architecture redundancy and forced sampling (FS).

Appendix D More Details of Architecture Redundancy

We illustrate architecture redundancy on the left of Fig. 8 and forced sampling (FS) on the right of Fig. 8. In the four unified blocks in Fig. 8, four depthwise convolutions with the same kernel size and two skip connections are used in different sub-blocks. However, the three configurations on the left of Fig. 8 are treated as different architectures because of their different arrangements, even though they correspond to the same sub-network. With FS, we enforce the arrangement of the operations to be unique, and thus only the unified block on the right of Fig. 8 can be sampled.

Appendix E Visualization of Searched Architectures

We visualize SGNAS-A, SGNAS-B, and SGNAS-C in Fig. 9. Besides, we also visualize the architectures searched by SGNAS under different hardware constraints in Fig. 10. It is interesting that even if the target hardware constraint is low (e.g., 280M), the expansion rate simulated by sub-blocks is still high in some layers (e.g., layer 1, layer 7, and layer 19).

Figure 9: Visualization of the architectures searched by SGNAS (SGNAS-A, SGNAS-B, and SGNAS-C). "MBE1" denotes the mobile inverted bottleneck convolution layer with expansion rate 1. "KX" denotes a depthwise convolution with kernel size X. The gray blocks are predefined before searching.
Figure 10: Visualization of the architectures searched by SGNAS under different hardware constraints.

Appendix F Limitation

Careful hyper-parameter tuning: In SGNAS, the overall loss function of the architecture generator is defined in Eq. (10). However, in our experiments, the hyper-parameter $\lambda$ must be carefully tuned for different datasets to achieve a good trade-off between hardware constraints and performance.

Architecture of the architecture generator: In SGNAS, we manually design the architecture of the architecture generator. However, we believe a better generator architecture exists, which is worth further study in the future.