Powering One-shot Topological NAS with Stabilized Share-parameter Proxy

05/21/2020 ∙ by Ronghao Guo, et al. ∙ Beihang University

One-shot NAS methods have attracted much interest from the research community due to their remarkable training efficiency and capacity to discover high-performance models. However, the search spaces of previous one-shot works usually rely on hand-crafted design and offer little flexibility in network topology. In this work, we enhance one-shot NAS by exploring high-performing network architectures in our large-scale Topology Augmented Search Space (i.e., over 3.4×10^10 different topological structures). Specifically, the difficulty of architecture search in such a complex space is alleviated by the proposed stabilized share-parameter proxy, which employs Stochastic Gradient Langevin Dynamics to enable fast sampling of the shared parameters, so as to achieve a stabilized measurement of architecture performance even in a search space with complex topological structures. The proposed method, namely Stabilized Topological Neural Architecture Search (ST-NAS), achieves state-of-the-art performance under Multiply-Adds (MAdds) constraints on ImageNet. Our lite model ST-NAS-A achieves 76.4% top-1 accuracy with only 326M MAdds, and our moderate model ST-NAS-B achieves 77.9% top-1 accuracy with just 503M MAdds. Both models offer superior performance in comparison to other concurrent works on one-shot NAS.


1 Introduction

Figure 1: An instance of our topology augmented search space. It contains over 3.4×10^10 different network topologies, which enables us to explore complex network topologies. Solid lines denote the chain-structured stem edges; dotted lines represent branch edges, which connect feature maps with a depth difference of 2 or 3.
(a) SPOS space (b) DARTS space
Figure 2: Two typical search spaces in previous work. Fig. (a) shows a chain-structured space while Fig. (b) shows a cell-based search space.

Significant progress made by convolutional neural networks (CNNs) on challenging computer vision tasks has raised the demand for powerful neural network designs. Instead of manual design, neural architecture search (NAS) has demonstrated great potential in recent years. Early NAS works by Real et al. [29, 28] and Elsken et al. [11] achieved promising results but could only be applied to small datasets due to their large computation expense. To this end, one-shot methods have drawn much interest thanks to their training efficiency and remarkable ability to discover high-performing models. A one-shot method usually utilizes a hyper-network that subsumes all architectures in the search space and uses shared weights to evaluate different architectures.

However, the search spaces of previous works (e.g., those shown in Fig. 2) were usually carefully hand-designed and offered little flexibility in network topology. For example, the chain-structured search space [14], one of the most widely used spaces in the one-shot literature, consists of sequentially connected intermediate feature maps, between which the edges are chosen from a set of computation operations. Networks with better operations can be discovered in this search space, but the network topology remains trivial. Previous works [10, 17] on network architecture design have shown that complex topologies can substantially enhance the performance of deep learning models. We argue that adding complex topological structures to the search space improves the performance of the searched network architectures, as shown in Table 1.

In this work, we are interested in exploring complex network topologies with the one-shot method. We propose a novel network architecture search space, shown in Fig. 1, which contains over 3.4×10^10 different network topologies, enabling the discovery of networks with complex topology. The search space is obtained by introducing numerous computation modules as edges between nodes. A topology-based architecture sampler is also introduced to sample architectures from the hyper-network during the one-shot training stage. However, the great diversity introduced by topologies brings difficulties to the one-shot approach. Specifically, we observe high variance in the performance estimates obtained through the one-shot shared parameters in two cases: estimates through shared parameters at different epochs of a single run, and estimates through shared parameters obtained in different runs. Zhang et al. [43] explore the reason behind this variance of ranking under the weight sharing strategy. As a result, the ranking ability of the shared parameters is compromised.

To eliminate the interference of complex topologies, we estimate the expectation of architecture performance over multiple samples of the shared parameters obtained in additional training epochs of the hyper-network. A fast weight sampling method based on Stochastic Gradient Langevin Dynamics is developed to sample shared parameters efficiently.

The resulting Stabilized Topological Neural Architecture Search (ST-NAS) achieves performance competitive with state-of-the-art NAS methods. The resulting architecture ST-NAS-A obtains 76.4% top-1 accuracy with only 326M MAdds, and a larger architecture, ST-NAS-B, obtains 77.9% top-1 accuracy with around 503M MAdds.

To summarize, our main contributions are as follows:

  1. We introduce a topology augmented neural architecture search space that enables the discovery of efficient architectures with complex topology.

  2. To relieve the interference of complex topology on model ranking, we modify model evaluation to be based on the expectation of performance over sampled shared parameters.

  3. We empirically demonstrate improvements on ImageNet classification under the same MAdds constraints compared with previous work, and show that the searched architectures transfer well to COCO object detection.

2 Related Work

Recently, automated machine learning methods have received a lot of attention due to their ability to design augmentation policies [9, 22], loss functions [19], and network architectures [45, 28, 4, 44, 25, 14, 13, 21, 20].

Early neural architecture search (NAS) works normally involve reinforcement learning [1, 45, 44, 46, 13, 36] or evolutionary algorithms [29, 26] to search for high-performing network architectures. However, these methods are usually computationally expensive, which limits their use in real scenarios.

Recent attention has focused on alleviating the computation cost via weight sharing. This approach usually involves a single training process of an over-parameterized hyper-network that subsumes all candidate models, i.e., the weights of the same operators are shared across different sub-models. Notably, Liu et al. [25] propose a continuous relaxation that enables optimizing network architectures with gradient descent, Cai et al. [5] propose a proxy-less method to search on target datasets directly, and Bender et al. [2] introduce the one-shot method to decouple the training and searching stages. Our work also makes use of a weight-sharing hyper-network but reduces the variance introduced by weight sharing.

Early hand-crafted neural networks [15, 35, 33] tend to stack repeated motifs. The works in [34, 15, 17, 16] introduce different manually designed network topologies and obtain performance gains.

Motivated by manually designed architectures [15, 35, 33], a widely used search space is proposed in [45, 25, 44, 26, 13] to search for such motifs, dubbed cells or blocks, rather than all possible architectures; this is called the cell-based space. Another widely used search space, adopted in [5, 14, 36, 42], is the chain-structured space, which sequentially stacks several operation layers, each feeding its output to the next layer as input; NAS methods then search for the operation at each position. The work in [41] explores randomly wired networks with less human prior and achieves performance comparable to manually designed networks.

3 Approach

NAS methods usually consist of three basic components: a search space, performance estimation, and a search strategy. In this section, we first introduce our novel Topology Augmented Search Space and a new sampling strategy for hyper-network training in this particular space. Second, we present a new model performance estimation approach that relieves the variance of model ranks during hyper-network training. Finally, the evolutionary algorithm used for the network search is described.

3.1 Topology Augmented Search Space

3.1.1 Motivation

To demonstrate the improvement of a complex topology over a sequential structure, we take ResNet-18 as a baseline and show that a subtle change in topology yields an obvious performance boost. Using 3 random seeds, we randomly add 4 residual blocks that connect the feature maps of blocks in ResNet-18's [15] chain structure, and rescale the width to keep the FLOPs unchanged; the results are in Table 1. The 3 complex structures imply the great potential of topology-based structure search.


Architecture | Res-18 | Rand0 | Rand1 | Rand2
Accuracy (%) | 70.2  | 71.5  | 72.0  | 69.6

Table 1: The accuracy of ResNet-18 and three networks with 4 random skip residual blocks added to the baseline. The three networks are scaled to keep the FLOPs the same as ResNet-18. Clear gains can be obtained via exploration in a more complex topology space.

3.1.2 Search Space

A neural network is denoted as a directed acyclic graph (DAG) $G = (V, E)$, where each node $v_i \in V$ indicates a feature map and each edge $e_{i,j} \in E$ represents a CNN operator connecting $v_i$ to $v_j$. The nodes are indexed by the order of computation of their corresponding feature maps.

In our formulation, each edge is a minimal search unit, also referred to as a choice block, which contains a set of candidate computation blocks. A hyper-network is the network that subsumes all sub-architectures in the search space. Following previous works, we divide our search space into several sub-DAGs (stages), each of which downsamples the input by a fixed factor.

Figure 3: Illustration of the candidate operations on stem edges and branch edges.
Figure 4: Detailed rankings of the best and the worst among 20 runs. The ground-truth ranking is shown in the middle. The figure shows that the quality of the ranking differs greatly across runs.

To enable the discovery of architectures with complex topology, a novel topology augmented search space is proposed. In our search space, edges are divided into two categories, stem edges and branch edges, detailed in Fig. 3.

Stem Edges are non-removable edges that always appear in candidate architectures. A stem edge exists between every pair of consecutive nodes $(v_i, v_{i+1})$, so the stem edges are chain-structured and sequentially connect all consecutive nodes in each stage. We use 9 kinds of linear bottlenecks (LB) [31] as the candidate choices for stem edges. Furthermore, on stem edges between feature maps of the same resolution, an identity operation is added as an extra candidate to enhance topological diversity and depth flexibility. Therefore, a stem edge has 9 candidate choices, or 10 when the identity option is available.

Branch Edges are optional and contribute topological diversity to the search space. A branch edge exists between every node pair $(v_i, v_j)$ with $j - i \in \{2, 3\}$. The candidate choices for branch edges are the same as for stem edges; differently, a branch edge can also be dropped (set to none).

When $v_i$ and $v_j$ have different resolutions, the stride of the convolution operation on the edge is automatically adapted to align the feature maps. The number of nodes in each stage is required to define the search space; following previous methods, we set the number of nodes in the stages to 2, 2, 4, 8 and 4.

The proposed search space ensures topological complexity. The network topology in this work is defined as the DAG formed by the nodes and edges. For $n$ nodes, the total number of topologies is $2^{(n-2)+(n-3)} = 2^{2n-5}$, since each branch edge can be independently kept or dropped. The search space used in our experiments contains 20 nodes in total, which corresponds to around $2^{35} \approx 3.4\times10^{10}$ topologies. For comparison, the cell-based search space contains far fewer topologies.
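As a concrete check of this count, the following sketch (our own illustration, assuming every optional branch edge joins nodes whose indices differ by 2 or 3 as in Fig. 1) enumerates the branch edges and the resulting number of topologies:

# Counting the optional branch edges and topologies of the augmented space.
# Assumption: stem edges connect consecutive nodes; each optional branch edge
# joins nodes whose depth differs by 2 or 3, and can be kept or dropped.

def branch_edges(num_nodes):
    """Enumerate optional branch edges as (i, j) node-index pairs."""
    return [(i, j)
            for i in range(num_nodes)
            for j in (i + 2, i + 3)
            if j < num_nodes]

def num_topologies(num_nodes):
    """Each branch edge is independently kept or dropped: 2^(#branch edges)."""
    return 2 ** len(branch_edges(num_nodes))

print(len(branch_edges(20)))   # 35 optional branch edges for 20 nodes
print(num_topologies(20))      # 34359738368, i.e. about 3.4 * 10^10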

3.2 Training the One-shot Hyper-network

The one-shot method uses the hyper-network to estimate the performance of architectures. Since a huge number of architectures exist in the hyper-network concurrently, training the hyper-network as a whole would make the parameters of different architectures correlated with each other. To reduce this correlation, the one-shot method samples a new network architecture at each gradient step and updates only the activated part of the shared parameters:

(1)   $W^{*} = \operatorname{arg\,min}_{W}\; \mathbb{E}_{a \sim \Gamma}\big[\mathcal{L}_{\mathrm{train}}\big(\mathcal{N}(a, W)\big)\big]$

where $\mathcal{N}(a, W)$ makes predictions for the input using the sampled model $a$ and the shared weights $W$; the gradient of parameters unused by $a$ therefore remains zero. The architecture sampling distribution $\Gamma$ is usually set to trivial uniform sampling [14] over the choices for each single edge.

Suppose there are $m$ choices for stem edges and $n$ choices for branch edges other than none. A simple uniform sampling strategy in our search space can be described as:

(2)   $P(e_{\mathrm{stem}} = o_i) = \tfrac{1}{m}, \quad i = 1, \dots, m$
(3)   $P(e_{\mathrm{branch}} = o_j) = \tfrac{1}{n+1}, \quad o_j \in \{o_1, \dots, o_n, \texttt{none}\}$

However, the networks sampled under this strategy in our space tend to have high computational cost, because each of the many branch edges has only a low probability of being none. Consequently, architectures with low computational cost in the hyper-network under-fit, which biases the evaluation stage, so the sampling strategy needs further consideration. The whole training process of the hyper-network can be found in Algo. 1. Suppose that $C_{\mathrm{target}}$ is our target MAdds and $C(a)$ is the MAdds of architecture $a$; the sampling strategy should satisfy:

(4)   $\mathbb{E}_{a \sim \Gamma}\big[C(a)\big] = C_{\mathrm{target}}$

To meet this constraint on the expected computation, the sampling probability of the none choice on branch edges, $p_{\texttt{none}}$, is defined so as to adjust the expected computation cost of the sampled networks:

(5)   $\bar{C}_{\mathrm{stem}} + (1 - p_{\texttt{none}})\,\bar{C}_{\mathrm{branch}} = C_{\mathrm{target}} \;\;\Rightarrow\;\; p_{\texttt{none}} = 1 - \frac{C_{\mathrm{target}} - \bar{C}_{\mathrm{stem}}}{\bar{C}_{\mathrm{branch}}}$

where $\bar{C}_{\mathrm{stem}}$ is the expected MAdds of the stem edges and $\bar{C}_{\mathrm{branch}}$ is the expected MAdds of all branch edges when none of them is dropped.
1:  Inputs: training iterations T, target MAdds C_target, training set D
2:  W = InitializeHyperNetwork()
3:  for t = 1, ..., T do
4:     compute p_none from C_target by Eq. (5)
5:     stem = Sample(stem choices, uniform)
6:     branch = Sample(branch choices, p_none)
7:     a = architecture composed of (stem, branch)
8:     (x, y) = Sample(D)
9:     TrainForOneStep(W, a, (x, y))
10: end for
11: Outputs: trained shared parameters W
Algorithm 1 Hyper-network Training
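For concreteness, here is a minimal Python sketch of this sampling-and-training loop. The candidate lists, the given p_none value, and the hyper-network interface (forward_loss) are illustrative assumptions rather than the authors' implementation, and the identity option is offered on every stem edge for brevity:

import random
from itertools import cycle

# Sketch of Algorithm 1: biased architecture sampling plus one-shot updates.
STEM_CHOICES = ["LB_%d" % k for k in range(9)] + ["identity"]   # illustrative names
BRANCH_CHOICES = ["LB_%d" % k for k in range(9)]

def sample_architecture(stem_edges, branch_edges, p_none):
    """Uniform stem choices; each branch edge becomes `none` with prob. p_none."""
    arch = {e: random.choice(STEM_CHOICES) for e in stem_edges}
    for e in branch_edges:
        arch[e] = "none" if random.random() < p_none else random.choice(BRANCH_CHOICES)
    return arch

def train_hypernetwork(hypernet, optimizer, loader, stem_edges, branch_edges,
                       p_none, num_steps):
    """One-shot training: each step updates only the sampled sub-network."""
    batches = cycle(loader)
    for _ in range(num_steps):
        arch = sample_architecture(stem_edges, branch_edges, p_none)
        x, y = next(batches)
        loss = hypernet.forward_loss(arch, x, y)   # hypothetical hyper-network API
        optimizer.zero_grad()
        loss.backward()                            # gradients flow only through active ops
        optimizer.step()
    return hypernet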

3.3 Stabilizing Performance Estimation

In the search stage, evaluating an architecture through the shared parameters is essential for finding promising results. Previous one-shot works usually measure network performance directly with the fully trained hyper-network weights. In this section, we first present our observation that the ranking of candidate architectures shuffles randomly in our search space, and then introduce our approach to improving ranking stability.

3.3.1 Instability of One-shot NAS

Since the hyper-network is trained for $T$ iterations, the shared parameters obtained after training are denoted as $W_T$. We define an accuracy function $\mathrm{ACC}(a, W)$ which maps a model architecture $a$ and hyper-network weights $W$ to the validation-set accuracy; its value can be estimated by simply loading the weights used by $a$ and testing the model on the validation set. The score function $S(a)$ of previous approaches is simply

(6)   $S(a) = \mathrm{ACC}(a, W_T)$

However, the true score function should be the actual performance of the model on the validation set, $S^{*}(a) = \mathrm{ACC}(a, w_a)$, where $w_a$ denotes the weights obtained by sampling architecture $a$ and training it alone. The one-shot approach takes an approximation by reusing the shared parameters for different architectures. Although this is empirically useful, we observe high variance in the model ranking in two cases: rankings at different epochs and rankings from different runs.

We randomly sample a set of architectures and train them independently to obtain their ground-truth weights. We then rank their performance under the shared parameters on the validation set using checkpoints from the last 20 epochs of hyper-network training. As shown in Fig. 5, the rank produced by a single checkpoint fluctuates strongly during the hyper-network training process and can hardly distinguish the performance of the architectures. We also repeat the hyper-network training 20 times with different random seeds to obtain shared parameters $W^{(1)}, \dots, W^{(20)}$, and quantify the correlation between the rank induced by each $W^{(k)}$ and the ground-truth rank with Kendall's $\tau$ coefficient [18]. The ranking performance of the best and worst runs is shown in Fig. 4.

Figure 5: The ranks of 10 random architectures during the last 20 training epochs, where drastic fluctuations can be observed. The ranks of three architectures at two time steps are highlighted; each of them has a different rank at the two steps.

These two observations imply the necessity of a stabilized evaluation strategy. To present our strategy, a formulation of the instability needs to be introduced. In this paper, we model the randomness of the performance estimate as unbiased noise. Since the shared parameters are fundamentally different from parameters trained independently, we use a function $g$ to model the effect of weight sharing. A general consensus has been reached that the empirical score $\mathrm{ACC}(a, W_T)$ provides inaccurate but useful rankings, which reflects the desired rank-preserving behavior of $g$. In summary, our model of the quantitative relationship is:

(7)   $\mathrm{ACC}(a, W_T) = g\big(S^{*}(a)\big) + \xi$
(8)   $\mathbb{E}[\xi] = 0$

Clearly, the existence of the noise term $\xi$ hurts the model ranking. The most straightforward way to alleviate its negative effect is to train multiple hyper-networks and eliminate the noise by taking an expectation; however, this approach requires several times more computation for hyper-network training.
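A small synthetic simulation (our own illustration with made-up numbers, not the paper's data) shows why averaging over several weight samples helps under the model of Eqs. (7)-(8): the rank-preserving transform keeps the ordering, while the zero-mean noise shrinks as more samples are averaged, so Kendall's tau against the true ranking improves:

import numpy as np
from scipy.stats import kendalltau

# Synthetic illustration of Eqs. (7)-(8): the one-shot estimate is a
# rank-preserving transform g of the true score plus zero-mean noise.
rng = np.random.default_rng(0)
true_acc = rng.uniform(70.0, 78.0, size=12)     # made-up "ground truth" accuracies

def g(s):                                       # any rank-preserving transform
    return 0.5 * s - 20.0

def averaged_estimate(k_samples):
    """Average noisy one-shot estimates over k weight samples."""
    noise = rng.normal(0.0, 1.0, size=(k_samples, true_acc.size))
    return (g(true_acc) + noise).mean(axis=0)

tau_single = kendalltau(true_acc, averaged_estimate(1))[0]
tau_mean = kendalltau(true_acc, averaged_estimate(8))[0]
print("single-sample tau=%.2f, averaged tau=%.2f" % (tau_single, tau_mean))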

3.3.2 SG-MCMC Sampling

The sampling process is described in Algo. 2. In order to obtain high-quality, low-correlation samples of the optimized shared parameters efficiently, we turn to the rich literature on Markov Chain Monte Carlo (MCMC) sampling methods [3]. Recently, a few works have demonstrated that constant-learning-rate stochastic gradient descent can be modified into Stochastic Gradient Langevin Dynamics (SGLD) to realize a stochastic gradient MCMC method under mild assumptions [6, 39]. Here, we apply SGLD [39, 38] to approximate i.i.d. samples from the shared-parameter posterior. The update rule we use is simply

(9)   $W_{t+1} = W_t + \frac{\epsilon}{2}\Big(\nabla_W \log p(W_t) + \frac{N}{n}\sum_{i=1}^{n}\nabla_W \log p(x_i \mid W_t)\Big) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon)$

Here $n$ is the number of data points used to compute the gradient (the batch size) and $N$ is the size of the training set. The step size $\epsilon$ is set to the final learning rate of the sub-network training. To ensure independence, we generate each sample after the SGLD update has iterated over the data for one epoch.

To generate the i.i.d. samples of the shared weights, we load the hyper-network weights after training finishes and set $W^{(0)} = W_T$ as the initial sample. Then, for each $k$, we apply SGLD with the rule in Eq. (9) to obtain the next sample from the parameter posterior. In this way we obtain multiple samples of the hyper-network parameters.

1:  Inputs: trained hyper-network weights W_T, number of samples K, SGLD step size ε, target MAdds C_target, training set D
2:  samples = [], W = W_T
3:  for t = 1, ..., K · (steps per epoch) do
4:     compute p_none from C_target by Eq. (5)
5:     stem = Sample(stem choices, uniform)
6:     branch = Sample(branch choices, p_none)
7:     a = architecture composed of (stem, branch)
8:     (x, y) = Sample(D)
9:     W = SGLDUpdate(W, a, (x, y)) by Eq. (9)
10:    if t is a multiple of (steps per epoch) then
11:       samples.Append(W)
12:    end if
13: end for
14: Outputs: samples
Algorithm 2 Shared Parameter Sampling by SGLD
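A hedged PyTorch-style sketch of the SGLD update in Eq. (9), applied to the active shared parameters, is given below. It assumes the loss is the summed negative log-likelihood over the mini-batch and uses a Gaussian prior in place of weight decay; it is a sketch of the standard Welling-Teh rule, not the authors' released code:

import math
import torch

def sgld_step(active_params, batch_nll, step_size, dataset_size, batch_size,
              prior_precision=1e-4):
    """One SGLD update (sketch of Eq. (9)) on the active shared parameters.

    `batch_nll` is assumed to be the summed negative log-likelihood over the
    mini-batch; the mini-batch gradient is rescaled by N/n and Gaussian noise
    with variance equal to the step size is injected (Welling & Teh, 2011).
    """
    grads = torch.autograd.grad(batch_nll, active_params)
    scale = dataset_size / batch_size
    with torch.no_grad():
        for p, grad_nll in zip(active_params, grads):
            # gradient of the negative log-posterior: likelihood term + Gaussian prior
            drift = scale * grad_nll + prior_precision * p
            noise = torch.randn_like(p) * math.sqrt(step_size)
            p.add_(-0.5 * step_size * drift + noise)

In the paper's procedure, one full pass over the data is run between consecutive parameter samples, so each appended weight sample is separated by an epoch of such updates.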

3.3.3 Average Accuracy and Parameter

We now have $K$ samples $\{W^{(1)}, \dots, W^{(K)}\}$, which approximate the parameters obtained from different runs. To eliminate the effect of the random noise and stabilize the performance estimation, we propose two approaches: score expectation and parameter expectation.

The expectation-over-scores approach defines the score of each model as the expectation of its validation accuracy over the sampled shared weights:

(10)   $S_{\mathrm{acc}}(a) = \frac{1}{K}\sum_{k=1}^{K} \mathrm{ACC}\big(a, W^{(k)}\big)$

The expectation-over-parameters approach takes the average of the sampled shared parameters and uses the averaged parameters to evaluate each model:

(11)   $\bar{W} = \frac{1}{K}\sum_{k=1}^{K} W^{(k)}, \qquad S_{\mathrm{param}}(a) = \mathrm{ACC}\big(a, \bar{W}\big)$
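The two estimators of Eqs. (10) and (11) can be sketched as follows; evaluate(arch, weights) and the state_dict-like weight containers are hypothetical placeholders:

import copy
import torch

def score_expectation(arch, weight_samples, evaluate):
    """Eq. (10) sketch: average validation accuracy over the K weight samples."""
    return sum(evaluate(arch, w) for w in weight_samples) / len(weight_samples)

def parameter_expectation(arch, weight_samples, evaluate):
    """Eq. (11) sketch: average the K weight samples, then evaluate once."""
    averaged = copy.deepcopy(weight_samples[0])          # state_dict-like mapping
    for name in averaged:
        stacked = torch.stack([w[name].float() for w in weight_samples])
        averaged[name] = stacked.mean(dim=0)
    return evaluate(arch, averaged)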

3.3.4 Independent fine-tuning

When evaluating a single architecture, loading its weights from the hyper-network and resuming training of the architecture independently should yield more architecture-relevant weights. We therefore also test this fine-tuning approach in our experiments.

3.4 Evolution Algorithm

Inspired by recent work [26, 36], we apply the evolutionary algorithm NSGA-II as the search agent. In this section, we first introduce some basic concepts of NSGA-II and then discuss how we apply it to our search space.

3.4.1 NSGA-II

We seek model architectures with excellent performance under a constraint on computational expense. NSGA-II is the most popular multi-objective evolutionary method; its core component is non-dominated sorting, which handles the trade-off between conflicting objectives. Our optimization targets are to minimize MAdds and to maximize architecture performance under different computational constraints.
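As an illustration of the non-dominated sorting step (a generic sketch, not the authors' NSGA-II implementation), each individual is scored by a (MAdds, accuracy) pair, with MAdds to be minimized and accuracy to be maximized:

def dominates(a, b):
    """a dominates b: no worse on both objectives and strictly better on one.
    Each individual is a (madds, accuracy) pair; MAdds is minimized, accuracy maximized."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def non_dominated_sort(population):
    """Return Pareto fronts as lists of indices, best front first."""
    remaining = set(range(len(population)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(population[j], population[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Toy population of (MAdds in millions, top-1 accuracy) pairs.
print(non_dominated_sort([(326, 76.4), (503, 77.9), (574, 73.3), (365, 75.0)]))
# -> [[0, 1], [3], [2]]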

3.4.2 Initialization

To reduce manual bias and better explore the search space, we use random initialization for all individuals of the first generation. More specifically, each architecture randomly selects a basic operator for each block in the search space.

3.4.3 Crossover and Mutation

Single-point crossover at a random position is adopted in our evolution algorithm. For two individuals $a = (a_1, \dots, a_L)$ and $b = (b_1, \dots, b_L)$, a single-point crossover at position $k$ results in a new individual $(a_1, \dots, a_k, b_{k+1}, \dots, b_L)$.

We use random-choice mutation to enhance generation diversity. When a mutation happens to an individual, a randomly selected operation block in it is changed to another available choice.
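A minimal sketch of these two operators on a flat list-of-operators encoding (the encoding and operator names are illustrative, not the paper's exact representation):

import random

def single_point_crossover(parent_a, parent_b):
    """Splice parent_a[:k] with parent_b[k:] at a random position k."""
    k = random.randrange(1, len(parent_a))
    return parent_a[:k] + parent_b[k:]

def mutate(individual, choices_per_edge):
    """Change one randomly selected edge to a different available choice."""
    child = list(individual)
    i = random.randrange(len(child))
    alternatives = [c for c in choices_per_edge[i] if c != child[i]]
    child[i] = random.choice(alternatives)
    return child

# Toy 4-edge encodings with illustrative operator names.
ops = ["LB_E3_K3", "LB_E6_K5", "identity", "none"]
child = mutate(single_point_crossover(["LB_E3_K3"] * 4, ["LB_E6_K5"] * 4), [ops] * 4)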

4 Experiments and Results

We verify the effectiveness of our method on a large classification benchmark, ImageNet [30]. In this section, we first describe our implementation details, then present the performance of the searched results on ImageNet together with comparisons to state-of-the-art methods, and finally demonstrate the advantage of our designs via ablation studies.

4.1 Experiments Settings

Datasets We conduct experiments on ImageNet, a standard benchmark for the classification task, which has about 1.28 million training images and 50,000 validation images.

Training Details of the Hyper-network For hyper-network training, we adopt a cosine learning rate schedule with the learning rate initialized to 0.1 and decayed to 2.5e-4 over the training epochs. L2 regularization with weight 1e-4 is used. The optimizer is mini-batch stochastic gradient descent (SGD) with batch size 512; momentum is set to 0 to decouple the gradients of architectures sampled in different batches. The hyper-network is trained on 32 GTX-1080Ti GPUs. We implement the stabilized evaluation of our method by saving checkpoints at the epochs described in Sec. 3.3.2. The fine-tuning strategy mentioned above is conducted with learning rate 2.5e-4.
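The reported optimization settings roughly correspond to the following PyTorch configuration; the module stand-in and the epoch count are placeholders (the epoch count is inferred from the checkpoint indices in Sec. 4.3.2):

import torch

# Sketch of the reported hyper-network optimizer settings. The Linear module
# stands in for the real hyper-network, and EPOCHS = 600 is an assumption
# inferred from the checkpoint indices (epochs 591-600) in Sec. 4.3.2.
hypernet = torch.nn.Linear(8, 8)
EPOCHS = 600

optimizer = torch.optim.SGD(
    hypernet.parameters(),
    lr=0.1,             # initial learning rate
    momentum=0.0,       # decouple gradients of architectures sampled in different batches
    weight_decay=1e-4,  # L2 regularization weight
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS, eta_min=2.5e-4,    # cosine decay down to the final LR
)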

Search Details The evolution agent randomly generates individuals for initialization. It then repeats an exploitation-and-exploration loop in which it generates new individuals via single-point crossover and random mutation and evaluates the sampled models. Finally, we choose the two top-ranked models under different MAdds constraints.


Model | Search space | Params (M) | MAdds (M) | Top-1 acc (%) | Top-5 acc (%)
DARTS [25] | Cell-based | 4.7 | 574 | 73.3 | 91.3
Proxyless-R [5] | Chain-structured | - | 320 | 74.6 | 92.2
Single-path NAS [32] | Chain-structured | 4.3 | 365 | 75.0 | 92.2
FairNAS-A [8] | Chain-structured | 4.6 | 388 | 75.3 | 92.4
FBNet-C [40] | Chain-structured | 5.5 | 375 | 74.9 | -
SPOS [14] | Chain-structured | - | 328 | 74.7 | -
BetaNet-A [12] | Chain-structured | 7.2 | 333 | 75.9 | 92.8
ST-NAS-A (ours) | Topology augmented | 5.2 | 326 | 76.4 | 93.1

Table 2: Performance comparison between ST-NAS and efficient NAS methods on ImageNet. Our model ST-NAS-A achieves the best top-1 accuracy with the least MAdds.

Training Details of the Resulting Architectures For the independent training of the resulting architectures, we use a cosine learning rate schedule. We train each model for 300 epochs with batch size 2048 using the SGD optimizer with Nesterov momentum 0.9. To prevent overfitting, we use L2 regularization with weight 1e-4 and standard augmentations including random crop and color jitter.


Model | Search space | Params (M) | MAdds (M) | Top-1 acc (%) | Top-5 acc (%)
*MNASNet-A1 [36] | Chain-structured | 3.9 | 312 | 75.2 | 92.5
*MNASNet-A2 [36] | Chain-structured | 4.8 | 340 | 75.6 | 92.7
*RCNet-B [42] | Chain-structured | 4.7 | 471 | 74.7 | 92.0
*NASNet-B [46] | Cell-based | 5.3 | 488 | 72.8 | 91.3
*EfficientNet-B0 [37] | Chain-structured | 5.3 | 390 | 76.3 | 93.2
ST-NAS-A (ours) | Topology augmented | 5.2 | 326 | 76.4 | 93.1
1.4-MobileNetV2 [31] | Chain-structured | 6.9 | 585 | 74.7 | 92.5
2.0-ShuffleNetV2 [27] | Chain-structured | 7.4 | 591 | 74.9 | -
*NASNet-C [46] | Cell-based | 4.9 | 558 | 72.5 | 91.0
*NASNet-A [46] | Cell-based | 5.3 | 564 | 74.0 | 91.6
*1.4-MNASNet-A1 [36] | Chain-structured | - | 600 | 77.2 | 93.7
*RENASNet [7] | Cell-based | 5.4 | 580 | 75.7 | 92.6
*PNASNet [24] | Cell-based | 5.1 | 588 | 74.2 | 91.9
ST-NAS-B (ours) | Topology augmented | 7.8 | 503 | 77.9 | 93.8

Table 3: Performance comparison among ST-NAS, manually designed networks, and sample-based NAS methods on ImageNet. Sample-based methods, marked with *, take much more computation resources. The architectures discovered by ST-NAS perform better than both sample-based NAS and manually designed architectures while requiring the fewest MAdds.

4.2 Main Results

ST-NAS searches for models with the objectives of low MAdds and high accuracy. We select two resulting models under small and large MAdds constraints, namely ST-NAS-A and ST-NAS-B. Their architectures and performance compared with state-of-the-art methods are discussed in this subsection.

Performance on ImageNet. We compare the ST-NAS method with efficient NAS methods, including DARTS, ProxylessNAS and FBNet, in Table 2. Our model ST-NAS-A outperforms all of them while requiring the fewest MAdds and a comparable number of parameters.

For architectures obtained at high cost, i.e., manually designed networks and networks obtained by sample-based methods, we compare ST-NAS with them in two groups divided by MAdds, as shown in Table 3. At a much lower search cost, ST-NAS outperforms all the methods in both MAdds groups.


Model | MAdds (backbone, G) | mAP
MobileNetV2 | 0.33 | 31.7
ST-NAS-A | 0.33 | 33.2
ResNet18 | 1.81 | 32.2
ST-NAS-B | 0.50 | 35.3
ResNet50 | 4.09 | 36.9
ST-NAS-B (scaled) | 1.03 | 37.7

Table 4: Performance on the COCO dataset. The channel number of ST-NAS-B is scaled up to obtain the larger ST-NAS-B variant. ST-NAS-A outperforms MobileNetV2 by 1.5% COCO AP with the same MAdds. ST-NAS-B achieves 0.5% higher mAP than ResNet50 while needing only about a quarter of the MAdds.
Figure 6: Resulting architectures. A linear bottleneck contains a group-wise convolution layer between two point-wise convolution layers. The expand ratio is defined as the ratio between the group-wise convolution channels and the point-wise convolution channels. We describe a linear bottleneck by its expand ratio, i.e., the number after "E", and its group-wise convolution kernel size, i.e., the number after "K".

Performance on COCO. Our implementation is based on the feature pyramid network (FPN) [23]. Different models pretrained on ImageNet are utilized as feature extractors. All models are trained for 13 epochs, known as the 1× schedule. The results are shown in Table 4. Our ST-NAS-A backbone outperforms MobileNetV2, and ST-NAS-B performs comparably with ResNet50 at far fewer MAdds.

4.3 Ablation Studies

4.3.1 Rank Fluctuation

To illustrate the importance of our stabilization mechanism, we randomly sample a set of architectures and rank their performance under the shared parameters on the validation set over the last 10 epochs of hyper-network training. As shown in Fig. 5, the rank produced by a single checkpoint fluctuates strongly during hyper-network training and can hardly distinguish the performance of the architectures, implying the necessity of a stabilized evaluation strategy.

Estimation approach | Kendall's τ
Single checkpoint | 0.71
Fine-tune | 0.64
SGLD-param | 0.84
SGLD-acc | 0.81

Table 5: Kendall's τ of the different rank stabilization approaches we propose. The single-checkpoint baseline achieves a Kendall's τ of 0.71, which is 0.07 higher than the fine-tune approach. The parameter-expectation and accuracy-expectation methods perform similarly and outperform the baseline by a clear margin.

4.3.2 Ranking Verification

We further verify this reduction by quantifying the ranking ability of different evaluation strategies via the correlation between the ranks obtained in the hyper-network and the ground-truth ranks, using Kendall's τ coefficient as the metric. We randomly sample 12 networks and train them from scratch to obtain the ground-truth rank. For the single-checkpoint baseline, we use the checkpoints of the last 10 epochs (the 591st to the 600th) to generate 10 ranks, compute 10 correlation coefficients with the ground-truth rank, and report their median for comparison with the other strategies. As shown in Table 5, SGLD consistently achieves higher correlation coefficients than fine-tuning and the single checkpoint, which verifies the effectiveness of SGLD in reducing the parameter variance.
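The ranking verification boils down to Kendall's τ between accuracy vectors; a minimal helper (using scipy, with placeholder inputs) might look like:

import numpy as np
from scipy.stats import kendalltau

def median_checkpoint_tau(ground_truth_acc, checkpoint_accs):
    """Median Kendall's tau over per-checkpoint rankings (placeholder inputs).

    ground_truth_acc: accuracies of the networks trained from scratch.
    checkpoint_accs: one accuracy vector per hyper-network checkpoint
                     (e.g. the epochs 591-600 used for the single-checkpoint baseline).
    """
    taus = [kendalltau(ground_truth_acc, acc)[0] for acc in checkpoint_accs]
    return float(np.median(taus))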

5 Conclusion

We proposed a topology-diverse search space and a novel search method, ST-NAS. In ST-NAS, we improve both the sampling strategy during hyper-network training and the architecture evaluation approach, guided by an analysis of the instability of weight sharing. Extensive experiments demonstrate the effectiveness of our designs and show consistent improvements under different computation cost constraints.

References

  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.
  • [2] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558. Cited by: §2.
  • [3] C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §3.3.2.
  • [4] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §2.
  • [5] H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2, §2, Table 2.
  • [6] C. Chen, D. Carlson, Z. Gan, C. Li, and L. Carin (2016) Bridging the gap between stochastic gradient mcmc and stochastic optimization. In Artificial Intelligence and Statistics, pp. 1051–1060. Cited by: §3.3.2.
  • [7] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, and X. Wang (2018) Reinforced evolutionary neural architecture search. arXiv preprint arXiv:1808.00193. Cited by: Table 3.
  • [8] X. Chu, B. Zhang, R. Xu, and J. Li (2019) FairNAS: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845. Cited by: Table 2.
  • [9] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §2.
  • [10] X. Du, T. Lin, P. Jin, G. Ghiasi, M. Tan, Y. Cui, Q. V. Le, and X. Song (2019) SpineNet: learning scale-permuted backbone for recognition and localization. arXiv preprint arXiv:1912.05027. Cited by: §1.
  • [11] T. Elsken, J. Metzen, and F. Hutter (2017) Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528. Cited by: §1.
  • [12] M. Fang, Q. Wang, and Z. Zhong (2019) BETANAS: balanced training and selective drop for neural architecture search. arXiv preprint arXiv:1912.11191. Cited by: Table 2.
  • [13] M. Guo, Z. Zhong, W. Wu, D. Lin, and J. Yan (2019) Irlas: inverse reinforcement learning for architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9021–9029. Cited by: §2, §2, §2.
  • [14] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420. Cited by: §1, §2, §2, §3.2, Table 2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §2, §3.1.1.
  • [16] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §2.
  • [17] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1, §2.
  • [18] M. G. Kendall (1938) A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93. Cited by: §3.3.1.
  • [19] C. Li, X. Yuan, C. Lin, M. Guo, W. Wu, J. Yan, and W. Ouyang (2019) AM-lfs: automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419. Cited by: §2.
  • [20] X. Li, C. Lin, C. Li, M. Sun, W. Wu, J. Yan, and W. Ouyang (2019) Improving one-shot nas by suppressing the posterior fading. arXiv preprint arXiv:1910.02543. Cited by: §2.
  • [21] F. Liang, C. Lin, R. Guo, M. Sun, W. Wu, J. Yan, and W. Ouyang (2019) Computation reallocation for object detection. arXiv preprint arXiv:1912.11234. Cited by: §2.
  • [22] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin, and W. Ouyang (2019) Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6579–6588. Cited by: §2.
  • [23] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §4.2.
  • [24] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: Table 3.
  • [25] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2, §2, §2, Table 2.
  • [26] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf (2018) NSGA-net: a multi-objective genetic algorithm for neural architecture search. arXiv preprint arXiv:1810.03522. Cited by: §2, §2, §3.4.
  • [27] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: Table 3.
  • [28] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1, §2.
  • [29] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §1, §2.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.
  • [31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §3.1.2, Table 3.
  • [32] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: Table 2.
  • [33] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2, §2.
  • [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.
  • [35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §2, §2.
  • [36] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2, §2, §3.4, Table 3.
  • [37] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: Table 3.
  • [38] Y. W. Teh, A. H. Thiery, and S. J. Vollmer (2016) Consistency and fluctuations for stochastic gradient langevin dynamics. The Journal of Machine Learning Research 17 (1), pp. 193–225. Cited by: §3.3.2.
  • [39] M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688. Cited by: §3.3.2.
  • [40] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: Table 2.
  • [41] S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: §2.
  • [42] Y. Xiong, R. Mehta, and V. Singh (2019) Resource constrained neural network architecture search: will a submodularity assumption help?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1901–1910. Cited by: §2, Table 3.
  • [43] Y. Zhang, Z. Lin, J. Jiang, Q. Zhang, Y. Wang, H. Xue, C. Zhang, and Y. Yang (2020) Deeper insights into weight sharing in neural architecture search. arXiv preprint arXiv:2001.01431. Cited by: §1.
  • [44] Z. Zhong, Z. Yang, B. Deng, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Blockqnn: efficient block-wise neural network architecture generation. arXiv preprint arXiv:1808.05584. Cited by: §2, §2, §2.
  • [45] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §2, §2, §2.
  • [46] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §2, Table 3.