Neural architecture search (NAS), which automates the design of artificial neural networks (ANNs), has received increasing attention in recent years. It is capable of finding ANNs that achieve similar or even better performance than manually designed ones. NAS is essentially a bi-level optimization task, as shown in Fig. 1(a). Let $\mathcal{A}$ denote the set of possible network architectures under a predefined search space. Let $\alpha$ and $w_\alpha$ denote an architecture in $\mathcal{A}$ and its corresponding weights, respectively. The lower-level objective optimizes the weights as
$$ w_\alpha^{*} = \arg\min_{w_\alpha} \mathcal{L}_{\mathrm{train}}\big(\mathcal{N}(\alpha, w_\alpha)\big), \qquad (1) $$
where $\mathcal{L}_{\mathrm{train}}$ is the loss criterion evaluated on the training dataset and $\mathcal{N}(\alpha, w_\alpha)$ denotes the network with architecture $\alpha$ and weights $w_\alpha$. The upper-level objective optimizes the network architecture on the validation dataset with the weights that have been optimized by the lower-level task as
$$ \alpha^{*} = \arg\min_{\alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{val}}\big(\mathcal{N}(\alpha, w_\alpha^{*})\big). \qquad (2) $$
Rather than directly tackling the bi-level problem, some attempts veniat2018learning ; Wu2018FBNetHE ; saxena2016convolutional ; shin2018differentiable ; ahmed2017connectivity ; xie2018snas relax the discrete search space $\mathcal{A}$ to a continuous one, which can be written as $\bar{\mathcal{A}} = r(\mathcal{A})$, where $r$ denotes the continuous relaxation and $\bar{\alpha} \in \bar{\mathcal{A}}$ stands for the topology of the relaxed architecture. The weights and architecture are jointly optimized with a single objective function
$$ \bar{\alpha}^{*}, w^{*} = \arg\min_{\bar{\alpha}, w} \mathcal{L}_{\mathrm{train}}\big(\mathcal{N}(\bar{\alpha}, w)\big). \qquad (3) $$
Then the optimal architecture $\alpha^{*}$ is derived by discretizing the continuous architecture $\bar{\alpha}^{*}$. These methods greatly simplify the optimization problem and enable end-to-end training. However, since the validation set is not involved in Eq. (3), the search results are inevitably biased towards the training dataset.
More recent NAS methods tend to reduce the computational complexity by decoupling the bi-level optimization problem into a sequential one bender2018understanding ; brock2017smash ; guo2019single . Specifically, a super-network (one-shot model) $\mathcal{S}$ is defined and the search space $\mathcal{A}$ is constrained to contain only sub-networks of $\mathcal{S}$. As shown in Fig. 1(b), recent one-shot NAS methods first optimize the weights $W$ for the super-network by solving
$$ W^{*} = \arg\min_{W} \mathcal{L}_{\mathrm{train}}\big(\mathcal{S}(W)\big). \qquad (4) $$
Then a number of sub-networks are sampled from $\mathcal{S}$ and the best-performing sub-network is picked out with
$$ \alpha^{*} = \arg\max_{\alpha \in \mathcal{A}} \mathrm{ACC}_{\mathrm{val}}\big(\mathcal{N}(\alpha, W^{*}_{\alpha})\big), \qquad (5) $$
where $W^{*}_{\alpha}$ denotes the weights of architecture $\alpha$ inherited from $W^{*}$. The core assumption of this one-shot NAS method is that the best-performing sub-network shares weights with the optimal super-network, so that each sampled sub-network does not need to be re-trained during the search. This greatly boosts the efficiency of NAS. However, this assumption does not always hold. Clues can be found in the common practice that previous one-shot methods rely on fine-tuning to further improve the performance of the found best model. Previous research has also pointed out that the mismatch between weights and architectures of sampled sub-networks could jeopardize the subsequent ranking results xie2018snas . Besides, the search process is treated casually with random or uniform sampling. We believe there is large room for improvement in efficiency.
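The decoupled two-stage procedure above can be sketched as follows. This is a toy illustration only; `train_supernet` and `val_accuracy` are hypothetical stand-ins for super-network training and validation-set evaluation.

```python
def train_supernet(supernet, steps=100):
    """Stage 1: optimize the shared weights of the super-network."""
    for _ in range(steps):
        pass  # one SGD step on the training set (omitted in this sketch)
    return supernet

def search(supernet, candidates, val_accuracy):
    """Stage 2: sample sub-networks that inherit the shared weights,
    evaluate each on the validation set, and keep the best one."""
    scored = [(val_accuracy(arch), arch) for arch in candidates]
    best_acc, best_arch = max(scored)  # rank by validation accuracy
    return best_arch, best_acc

supernet = train_supernet(object())
best_arch, best_acc = search(supernet, candidates=[1, 2, 3, 4],
                             val_accuracy=lambda a: -abs(a - 3))
```

Note that nothing in stage 2 adapts the sampling: candidates are drawn from a fixed (e.g. uniform) distribution, which is exactly the inefficiency NASAS targets.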
In this paper, we propose a novel NAS strategy, namely NAS through A posteriori distribution guided Sampling (NASAS). In NASAS, we propose to estimate the a posteriori distribution over the architecture and weight pair $(\alpha, w)$ with a variational distribution $q_\phi$, where $\phi$ denotes the variational parameters. The optimal $q_\phi$, denoted by $q_{\phi^{*}}$, can be found by
$$ \phi^{*} = \arg\min_{\phi} D\big(q_\phi(\alpha, w)\,\|\,p(\alpha, w \mid \mathcal{D}_{\mathrm{train}})\big), \qquad (6) $$
where $D(\cdot\,\|\,\cdot)$ measures the distance between two distributions. Note that finding $q_{\phi^{*}}$ is not a trivial problem and the details will be presented in Section 2. After $q_{\phi^{*}}$ is found, we can look for the optimal architecture by
$$ \alpha^{*} = \arg\max_{(\alpha, w) \sim q_{\phi^{*}}} \mathrm{ACC}_{\mathrm{val}}\big(\mathcal{N}(\alpha, w)\big). \qquad (7) $$
In a nutshell, NASAS leverages the training dataset to estimate a posteriori distribution, based on which sampling is performed, and then uses the validation set for performance evaluation.
The flow chart of NASAS is illustrated in Fig. 1(c). Our work has two main innovations compared with the recently proposed one-shot approaches. First, we greatly improve the efficiency of the network search process by guided sampling. As a result, the search time needed to achieve the best performance can be reduced by orders of magnitude. Second, we approximate the joint distribution over architectures and weights to alleviate the mismatch problem mentioned earlier. This not only improves the reliability of the ranking results, but also allows us to directly output the found best-performing network without fine-tuning. We evaluate NASAS on the image classification task. It achieves 1.98% test error at 11.1 GPU days on Cifar-10, while the best network found by existing NAS methods only achieves 2.07% test error at 200 GPU days. NASAS also achieves state-of-the-art performance with 14.8% test error at 8.7 GPU days on Cifar-100, and 24.80% test error at around 40 GPU days on ImageNet under a relaxed mobile setting.
In this section, we first formulate the target problem of our NASAS, and then propose an end-to-end trainable solution to estimate the joint a posteriori distribution over architectures and weights, followed by an efficient sampling and ranking scheme to facilitate the search process.
2.1 Notation and Problem Formulation
Given a one-shot model $\mathcal{S}$, let $w^{(l,k)}$ denote the convolution weight matrix for layer $l$ with spatial kernel size $k$, and let $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ denote the number of input and output channels, respectively. We use $w^{(l,k)}_i$ to denote the sliced kernel operated on the input channel dimension and use $\mathbf{W}$ to denote the weights of the whole one-shot model. As deriving a sub-network from $\mathcal{S}$ is equivalent to deactivating a set of convolution kernels, a sub-network architecture can be specified by a set of random variables $\mathbf{z} = \{z^{(l,k)}_i\}$, where $z^{(l,k)}_i$ indicates deactivating (zero) or activating (one) convolution kernel $w^{(l,k)}_i$. Later on we will use boldface for random variables.
Although we need a joint a posteriori distribution over $\mathbf{z}$ and $\mathbf{W}$, we do not have to explicitly derive the joint distribution, since deactivating or activating a convolution kernel is also equivalent to multiplying a binary mask to the kernel. Instead, we combine them as a new random variable $\boldsymbol{\theta}$, where $\boldsymbol{\theta} = \mathbf{z} \odot \mathbf{W}$. Thus, the key problem in our NASAS is to estimate the a posteriori distribution over the hybrid network representation $\boldsymbol{\theta}$. Mathematically,
$$ p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}, \qquad (8) $$
where X and Y denote the training samples and labels, respectively. $p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$ is the likelihood, which can be inferred by the network $\mathcal{N}_{\boldsymbol{\theta}}$, the sub-network defined by the hybrid representation $\boldsymbol{\theta}$. $p(\boldsymbol{\theta})$ is the a priori distribution of the hybrid representation. Because the marginalized likelihood in Eq. (8) is intractable, we use a variational distribution $q_\phi(\boldsymbol{\theta})$ to approximate the true a posteriori distribution and reformulate our target problem as
$$ \phi^{*} = \arg\min_{\phi} \mathrm{KL}\big(q_\phi(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\big). \qquad (9) $$
Here we choose the KL divergence and accuracy to instantiate the distance measure $D(\cdot\,\|\,\cdot)$ in Eq. (6) and the evaluation metric in Eq. (7), respectively.
2.2 A Posteriori Distribution Approximation
We employ Variational Inference (VI) to approximate the true a posteriori distribution with $q_\phi(\boldsymbol{\theta})$ by minimizing the negative Evidence Lower Bound (ELBO)
$$ \mathcal{L}_{\mathrm{VI}}(\phi) = \mathrm{KL}\big(q_\phi(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta})\big) - \int q_\phi(\boldsymbol{\theta}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, d\boldsymbol{\theta}. \qquad (10) $$
2.2.1 Approximation by Network Training
We employ the re-parametrization trick kingma2013auto and choose a deterministic and differentiable transformation function $g$ that re-parameterizes $q_\phi(\boldsymbol{\theta})$ as $\boldsymbol{\theta} = g(\phi, \boldsymbol{\epsilon})$, where $p(\boldsymbol{\epsilon})$ is a parameter-free distribution. Taking a uni-variate Gaussian distribution as an example, its re-parametrization can be $\theta = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, where $\mu$ and $\sigma$ are the variational parameters. Gal et al. gal2016uncertainty ; gal2016dropout have shown that when the network weight is re-parameterized with
$$ w^{(l,k)}_i = g\big(m^{(l,k)}_i, z^{(l,k)}_i\big) = z^{(l,k)}_i \cdot m^{(l,k)}_i, \qquad (11) $$
the function draw w.r.t. the variational distribution over network weights can be efficiently implemented via network inference. Concretely, a function draw is equivalent to randomly drawing a masked deterministic weight matrix in the neural network, which is known as the dropout operation srivastava2014dropout . Similarly, we replace $\mathbf{W}$ in our hybrid representation with deterministic variational parameters $\mathbf{M}$, and reformulate $q_\phi(\boldsymbol{\theta})$ as
$$ \boldsymbol{\theta} = g(\mathbf{M}, \mathbf{z}) = \mathbf{z} \odot \mathbf{M}. \qquad (12) $$
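As a minimal illustration of this equivalence (a sketch, not the paper's code): a single function draw from the variational weight distribution of Eq. (11) amounts to masking the deterministic variational weights with Bernoulli noise, i.e. an ordinary dropout forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_weights(m, keep_prob):
    """One draw w = z * m from the variational distribution of Eq. (11).
    m: deterministic kernel weights (variational parameters), one kernel
    per row. z ~ Bernoulli(keep_prob) masks whole kernels, as dropout does."""
    z = rng.binomial(1, keep_prob, size=(m.shape[0], 1))
    return z * m

m = rng.standard_normal((8, 5))   # 8 kernels, 5 weights each (toy sizes)
w = draw_weights(m, keep_prob=0.7)
```

Each draw leaves every kernel either intact or zeroed out, so repeated forward passes with different masks are exactly repeated samples from the variational distribution.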
In Eq. (12), we have an additional random variable $\mathbf{z}$ that controls the activation of kernels, whose distribution is unknown. Here we propose using the marginal probability $p\big(z^{(l,k)}_i = 1\big)$ to characterize its behavior, because the marginal reflects the expected probability of selecting kernel $w^{(l,k)}_i$ given the training dataset. It exactly matches the real behavior if the selections of kernels in a one-shot model are independent.
Since the joint distribution of the network architecture $\mathbf{z}$ is a multivariate Bernoulli distribution, its marginal distribution obeys $z^{(l,k)}_i \sim \mathrm{Bernoulli}\big(p^{(l,k)}\big)$ dai2013multivariate , where $p^{(l,k)}$ is now also a variational parameter that should be optimized. Therefore, we have
$$ \boldsymbol{\theta} = g(\mathbf{M}, \mathbf{z}) = \mathbf{z} \odot \mathbf{M}, \quad z^{(l,k)}_i \sim \mathrm{Bernoulli}\big(p^{(l,k)}\big). \qquad (13) $$
Here we omit the subscript $i$ in $p^{(l,k)}$ because the importance of branches that come from the same kernel-size group and layer should be identical. By replacing the weights with the new variable $\boldsymbol{\theta}$, Eq. (13) has the same form as Eq. (11). Now Eq. (10) can be rewritten as
$$ \mathcal{L}_{\mathrm{VI}}(\mathbf{M}, \mathbf{p}) = \mathrm{KL}\big(q_{\mathbf{M},\mathbf{p}}(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta})\big) - \int q_{\mathbf{M},\mathbf{p}}(\boldsymbol{\theta}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, d\boldsymbol{\theta}, \qquad (14) $$
where the variational parameters $\phi = \{\mathbf{M}, \mathbf{p}\}$ are composed of both the deterministic kernel weights and the distribution of the network architecture. The expected log likelihood (the integral term) in the equation above is usually estimated by Monte Carlo (MC) estimation
$$ \mathcal{L}_{\mathrm{VI}}(\mathbf{M}, \mathbf{p}) \approx \mathrm{KL}\big(q_{\mathbf{M},\mathbf{p}}(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta})\big) - \frac{1}{S}\sum_{s=1}^{S} \log p\big(\mathbf{Y} \mid \mathbf{X}, \hat{\boldsymbol{\theta}}_s\big), \quad \hat{\boldsymbol{\theta}}_s \sim q_{\mathbf{M},\mathbf{p}}(\boldsymbol{\theta}). \qquad (15) $$
Eq. (15) indicates that the (negative) ELBO can be computed very efficiently: it equals the KL term minus the log likelihood inferred by the one-shot network (now re-parameterized as $\mathbf{z} \odot \mathbf{M}$). During each network inference, convolution kernels are randomly deactivated w.r.t. probability $1 - p^{(l,k)}$, which is exactly equivalent to a dropout neural network.
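A minimal numerical sketch of a single-sample MC estimate as in Eq. (15): the tiny softmax classifier and the weight-decay-style surrogate for the KL term below are illustrative assumptions, not the paper's exact terms.

```python
import numpy as np

rng = np.random.default_rng(1)

def neg_elbo(M, p, x, y, weight_decay=1e-2):
    """One-sample MC estimate of the negative ELBO for a mini-batch:
    (KL surrogate) minus (log likelihood under one dropout draw)."""
    z = rng.binomial(1, p, size=(M.shape[0], 1))   # one Bernoulli(p) kernel mask
    logits = x @ (z * M)                           # inference with theta = z * M
    logits -= logits.max(axis=1, keepdims=True)    # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_lik = log_probs[np.arange(len(y)), y].sum()
    kl = weight_decay * p * np.sum(M ** 2)         # KL surrogate (assumption)
    return kl - log_lik

x = rng.standard_normal((4, 6))                    # toy batch: 4 samples, 6 features
M = rng.standard_normal((6, 3))                    # deterministic weights, 3 classes
loss = neg_elbo(M, p=0.8, x=x, y=np.array([0, 1, 2, 0]))
```

Minimizing this loss over `M` (and, in NASAS, over `p` as well) is exactly ordinary dropout training with an extra regularization term.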
Now, approximating a posteriori distribution over the hybrid network representation is converted to optimizing the one-shot model with dropout and a KL regularization term. If the derivative of both terms is tractable, we can efficiently train it in an end-to-end fashion.
2.2.2 Network Optimization
In addition to the variational parameters $\mathbf{M}$, the dropout ratio $\mathbf{p}$ in Eq. (13) should also be optimized (either via grid search gal2016uncertainty or a gradient-based method gal2017concrete ). So we need to compute $\partial \mathcal{L}_{\mathrm{VI}} / \partial \mathbf{M}$ and $\partial \mathcal{L}_{\mathrm{VI}} / \partial \mathbf{p}$. If each convolution kernel is deactivated with a prior probability $p_0$ along with a Gaussian weight prior $\mathcal{N}(\mathbf{0}, \sigma_0^2 \mathbf{I})$, then the a priori distribution of the hybrid representation is exactly a spike-and-slab prior. Following gal2016dropout ; gal2017concrete , the derivatives of Eq. (15) can be computed as
$$ \frac{\partial \mathcal{L}_{\mathrm{VI}}}{\partial \phi} \approx -\frac{\partial}{\partial \phi}\frac{1}{S}\sum_{s=1}^{S} \log p\big(\mathbf{Y} \mid \mathbf{X}, \hat{\boldsymbol{\theta}}_s\big) + \frac{\partial}{\partial \phi}\sum_{l,k}\Big(\frac{p^{(l,k)}\, c^{(l,k)}}{2\sigma_0^{2}}\big\|\mathbf{m}^{(l,k)}\big\|^{2} - c^{(l,k)}\, H\big(p^{(l,k)}\big)\Big), \qquad (16) $$
where $H(p) = -p\log p - (1-p)\log(1-p)$ is the Bernoulli entropy and $c^{(l,k)}$ denotes the number of input channels for the convolution kernel of spatial size $k$ at layer $l$. Please note that the above derivation is obtained by setting the prior drop probability $p_0$ to zero, which indicates that the network architecture prior is set to be the whole one-shot model. The motivation for employing this prior is that a proper architecture prior is usually difficult to acquire or even estimate, whereas the whole over-parameterized network, which has proved effective on many tasks, can be a reasonable one when chosen as our one-shot model. Besides, it provides us a more stable way to optimize the objective gal2016uncertainty . Therefore, we will use one-shot models that are built upon manually designed networks in our experiments.
Since the first term in Eq. (16) involves computing the derivative of a non-differentiable Bernoulli distribution (recall $z^{(l,k)}_i \sim \mathrm{Bernoulli}(p^{(l,k)})$ in Eq. (13)), we employ the Gumbel-softmax jang2016categorical to relax the discrete distribution to a continuous space, and the $\mathbf{z}$ in Eq. (16) and Eq. (13) can be deterministically drawn with
$$ \hat{z} = \mathrm{sigmoid}\Big(t\Big(\log\frac{p}{1-p} + \log\frac{u}{1-u}\Big)\Big), \quad u \sim \mathrm{Uniform}(0, 1), \qquad (17) $$
where $t$ is the temperature that decides how steep the sigmoid function is; if $t$ goes to infinity, the above parametrisation is exactly equivalent to drawing the sample from the Bernoulli distribution. (A similar relaxation is used in gal2017concrete without using Gumbel-softmax.)
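Under the convention stated above (the temperature multiplies the logits, and hard Bernoulli samples are recovered as it grows), the relaxed draw of Eq. (17) can be sketched as follows; this is an illustrative implementation, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def relaxed_bernoulli(p, t, size, eps=1e-9):
    """Gumbel-sigmoid relaxation of z ~ Bernoulli(p) as in Eq. (17):
    z_hat = sigmoid(t * (logit(p) + logit(u))), u ~ Uniform(0, 1).
    The draw is differentiable w.r.t. p; larger t gives harder samples."""
    u = rng.uniform(size=size)
    logit = np.log(p + eps) - np.log(1 - p + eps) \
          + np.log(u + eps) - np.log(1 - u + eps)
    return 1.0 / (1.0 + np.exp(-t * logit))

soft = relaxed_bernoulli(p=0.7, t=1.0, size=10000)    # smooth draws in (0, 1)
hard = relaxed_bernoulli(p=0.7, t=100.0, size=10000)  # nearly binary draws
```

With a large temperature the draw equals 1 exactly when $u > 1 - p$, which happens with probability $p$, recovering the Bernoulli distribution.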
Thus the gradient w.r.t. the dropout ratio $\mathbf{p}$ can be propagated via the chain rule. Combining Eq. (8), Eq. (10) and Eq. (15), one can see that the a posteriori distribution over the hybrid representation can be approximated by simply training the one-shot model in an end-to-end fashion with two additional regularization terms and a learnable dropout ratio $\mathbf{p}$.
2.3 Sampling and Ranking
Once the variational distribution $q_{\phi^{*}}$ is obtained, we sample a group of $N$ network candidates w.r.t. $q_{\phi^{*}}$, where $N$ is the number of samples. According to Eq. (13), the sampling process is performed by activating convolution kernels stochastically with the learned probability $p^{(l,k)}$, which is equivalent to a regular dropout operation. Specifically, each candidate is sampled by randomly dropping convolution kernel $w^{(l,k)}_i$ with probability $1 - p^{(l,k)}$ for every $l$, $k$ and $i$ in the one-shot model. Then the sampled candidates are evaluated and ranked on a held-out validation dataset. Due to the hybrid network representation, we actually sample architecture-weight pairs, which relieves the mismatch problem. At last, the best-performing one is selected by Eq. (7).
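The sampling-and-ranking step can be sketched as follows. `val_accuracy` is a hypothetical stand-in for evaluating a masked sub-network (with inherited weights) on the held-out validation set; the learned probabilities and the toy accuracy function are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_and_rank(keep_probs, val_accuracy, n_candidates=50):
    """Draw candidate architectures by masking kernels with the learned
    per-kernel keep probabilities, then keep the best on validation data."""
    best_acc, best_mask = -np.inf, None
    for _ in range(n_candidates):
        mask = rng.binomial(1, keep_probs)   # one architecture-weight draw
        acc = val_accuracy(mask)             # evaluate with inherited weights
        if acc > best_acc:
            best_acc, best_mask = acc, mask
    return best_mask, best_acc

keep_probs = np.array([0.9, 0.5, 0.1, 0.8])  # learned activation probabilities
target = np.array([1, 1, 0, 1])              # toy "good" architecture
acc_fn = lambda m: 1.0 - np.abs(m - target).mean()
mask, acc = sample_and_rank(keep_probs, acc_fn)
```

Because the draws concentrate on kernels with high learned probability, far fewer candidates are needed than with uniform sampling to find a strong sub-network.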
Please note that our a posteriori distribution guided sampling scheme, though not intentionally designed for it, leads to an adaptive dropout that reflects the importance of different parts of the one-shot model. It thus removes the dependency on the hyper-parameter-sensitive, carefully designed dropout probability of previous one-shot methods bender2018understanding .
To fully investigate the behavior of NASAS, we test it on six one-shot super-networks. Because we use the whole one-shot model as the architecture prior to facilitate Eq. (16), we construct the super-networks based on architecture priors perceived from manually designed networks. We evaluate the performance of NASAS on three databases: Cifar-10, Cifar-100, and ImageNet. For every one-shot super-network, we insert a dropout layer after each convolution layer according to Eq. (17) to facilitate the computation of Eq. (16). This modification introduces negligible overheads in parameters and FLOPs. NASAS is trained in an end-to-end way with Stochastic Gradient Descent (SGD), using a single P40 GPU card for Cifar-10/Cifar-100 and 4 M40 GPU cards for ImageNet. Once a model converges, we sample different convolution kernels w.r.t. the learned dropout ratio to get 1500/5000/1500 candidate architectures for Cifar-10, Cifar-100, and ImageNet, respectively. These candidates are ranked on a held-out validation dataset and the one with the best performance is selected as the final search result.
3.1 Cifar-10 and Cifar-100
One-shot Model and Hyper-parameters. We test our NASAS with four super-networks, namely SupNet-M/MI and SupNet-E/EI, on Cifar-10 and Cifar-100. They are based on the manually designed multi-branch ResNet gastaldi2017shake and the architecture obtained by ENAS pham2018efficient , respectively. Please refer to the supplementary material for more details of the one-shot models and all hyper-parameter settings used in this paper.
| Method | Error (%) | GPU Days | Params (M) | Search Method |
| --- | --- | --- | --- | --- |
| shake-shake + cutout devries2017improved | 2.56 | - | 26.2 | - |
| NAS + more filters zoph2016neural | 3.65 | 22400 | 37.4 | RL |
| NASNET-A + cutout zoph2018learning | 2.65 | 1800 | 3.3 | RL |
| Micro NAS + Q-Learning zhong2018practical | 3.60 | 96 | - | RL |
| PathLevel EAS + cutout cai2018path | 2.30 | 8.3 | 13.0 | RL |
| ENAS + cutout pham2018efficient | 2.89 | 0.5 | 4.6 | RL |
| EAS (DenseNet) cai2018efficient | 3.44 | 10 | 10.7 | RL |
| AmoebaNet-A + cutout real2018regularized | 3.34 | 3150 | 3.2 | evolution |
| Hierarchical Evo liu2017hierarchical | 3.63 | 300 | 61.3 | evolution |
| DARTS + cutout liu2018darts | 2.83 | 4 | 3.4 | gradient-based |
| SNAS + cutout xie2018snas | 2.85 | 1.5 | 2.8 | gradient-based |
| NAONet + cutout luo2018neural | 2.07 | 200 | 128 | gradient-based |
| One-Shot Top bender2018understanding | 3.70 | - | 45.3 | gradient-based |
| Method | Error (%) | GPU Days | Params (M) | Search Method |
| --- | --- | --- | --- | --- |
| NAONet + cutout luo2018neural | 14.36 | 200 | 128 | gradient-based |
Comparison with State-of-the-arts. Table 1 shows the comparison results on Cifar-10. Here NASAS-X denotes the performance of NASAS on the super-network SupNet-X. From top to bottom, the first group consists of state-of-the-art manually designed architectures on Cifar-10; the following three groups list the related NAS methods adopting different search algorithms, i.e. RL, evolution, and gradient descent; the last group exhibits the performance of our NASAS. It shows that NASAS is capable of finding advanced architectures in a much more efficient and effective way, e.g. it finds the architecture with the lowest error, 1.98%, in only 11.1 GPU days.
We also list in Table 1 the two networks, Multi-branch ResNet gastaldi2017shake and ENAS pham2018efficient , that inspired our design of the super-networks. Our NASAS-E and NASAS-M outperform "ENAS + cutout" and "shake-shake + cutout" by 0.16% and 0.36%, respectively, at smaller model sizes. In the inflated cases, NASAS-MI/EI find architectures with even higher performance. Compared with the sampling-based one-shot method "One-Shot Top", which achieves a competitive 3.7% classification error by randomly sampling 20000 network architectures, our NASAS attains much higher performance by sampling only 1500 network architectures, thanks to the a posteriori distribution guided sampling.
Table 2 further demonstrates the performance of NASAS on the more challenging Cifar-100 dataset. NASAS achieves a good trade-off between efficiency and accuracy: it reaches a 14.8% error rate with only 8.7 GPU days, which is very competitive in terms of both performance and search time.
Please note that the results of NASAS are achieved during the search process without any additional fine-tuning of the weights of the searched architectures, while those of the other methods are obtained by fine-tuning the searched models. We will discuss this point further in the following ablation study.
Ablation Study and Parameter Analysis. We first evaluate the effect of our a posteriori distribution guided sampling method in Table 3(a). Compared with the baseline "Random" sampling, implemented with the predefined dropout strategy discussed in bender2018understanding , "NASAS" successfully finds better sub-networks, bringing a relative 14%-23% gain. Evidently, the a posteriori distribution guided sampling is much more effective, which validates that our approach learns a meaningful distribution for efficient architecture search. Besides, as can be seen in the table, there is usually a huge performance gap between architectures searched with a predefined distribution with and without fine-tuning, which reveals the mismatch problem.
Table 3(b) discusses the weight prior in Eq. (17). We find that a good prior usually makes the corresponding term in Eq. (16) fall into a commonly used weight-decay range, so we choose it by grid search. As shown in the table, the weight prior affects both the error rate and the model size: the higher it is, the smaller the number of parameters. Since the objective of NAS is to maximize performance rather than to minimize the number of parameters, we choose the one with the minimal error rate.
Table 3(c) shows the impact of the temperature value in Eq. (17). A smaller value leads to a lower error, which is consistent with the analysis regarding Eq. (17). The corresponding fine-tuned result of NASAS provides only marginal improvement, which further demonstrates the reliability of NASAS in sampling both architectures and weights.
We further evaluate the impact of the number of samples in Table 3(d). The performance improves as the number of samples, and thus the number of GPU days, increases. We choose 1500 sampled architectures as a trade-off between complexity and accuracy. Please also note that, compared with other sampling-based NAS methods, our scheme achieves a 2.17% error rate by sampling only 50 architectures with the assistance of the estimated a posteriori distribution. This further reveals that the estimated distribution provides essential information about the distribution of architectures and thus significantly facilitates the sampling process in terms of both efficiency and accuracy.
We further evaluate NASAS on ImageNet with two super-networks based on ResNet50 he2016deep and DenseNet121 huang2017densely , respectively. Please find the detailed experimental settings in the supplementary material. Rather than transferring architectures searched on a smaller dataset, the efficiency and flexibility of our method enable us to directly search architectures on ImageNet within a few days.
We first provide test results of NASAS on ImageNet in Table 4, using a relatively small search space obtained by inflating ResNet50 without limiting the model size. The hyper-parameters and training processes for the three models are identical for a fair comparison. It can be observed that NASAS-R-50 outperforms ResNet50 by 1.23% with a similar number of parameters. Table 5 shows the comparison with the state-of-the-art results on ImageNet. In this test, we control the size of the searched architecture to be comparable to those of other NAS methods in the mobile setting. Still, our NASAS outperforms them. Please note that the size control limits our choice of the weight prior; as shown in Table 3(b), it may prevent us from finding better architectures with higher performance.
| Method | Error (%) (Top1/Top5) | GPU Days | Params (M) | Search Method |
| --- | --- | --- | --- | --- |
Weight Sharing. Weight sharing is a popular method adopted by one-shot models to greatly boost the efficiency of NAS, but it is not well understood why it is effective elsken2018neural ; bender2018understanding . In NASAS, as discussed in subsection 2.2, we find that weight sharing can be viewed as a re-parametrization that enables us to estimate the a posteriori distribution via end-to-end network training.
Limitations and Future Works. One limitation of NASAS is that it cannot explicitly choose non-parametric operations such as pooling. Another is that NASAS requires prior knowledge of architectures, which is hard to obtain; here we approximate the prior with manually designed networks. Our future work may therefore include 1) enabling selection of non-parametric operations (e.g. assigning a 1x1 convolution after each pooling operation as a surrogate to decide whether the pooling branch is needed), and 2) investigating the robustness of NASAS to different prior architectures.
In this paper, we propose a new one-shot NAS approach, NASAS, which explicitly approximates the a posteriori distribution of network architectures and weights via network training to facilitate a more efficient search process. It enables candidate architectures to be sampled w.r.t. the a posteriori distribution approximated on the training dataset rather than a uniform or predefined distribution. It also alleviates the mismatch problem between architectures and shared weights by sampling architecture-weight pairs, which makes the ranking results more reliable. The proposed NASAS is efficiently implemented and optimized in an end-to-end way, and thus can be easily extended to other large-scale tasks.
-  Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
-  Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
-  Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
-  Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
-  Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
-  Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423–2432, 2018.
-  Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906, 2018.
-  Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823, 2017.
-  Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.
-  Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, volume 15, pages 3460–3468, 2015.
-  Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016.
-  Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
-  Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
-  Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
-  Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
-  Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
-  Tom Véniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3500, 2018.
-  Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.
-  Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pages 4053–4061, 2016.
-  Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. 2018.
-  Karim Ahmed and Lorenzo Torresani. Connectivity learning in multi-branch networks. arXiv preprint arXiv:1709.09582, 2017.
-  Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
-  Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
-  Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
-  Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  Bin Dai, Shilin Ding, Grace Wahba, et al. Multivariate bernoulli distribution. Bernoulli, 19(4):1465–1483, 2013.
-  Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
-  Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
-  Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
-  Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
-  Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639, 2018.
-  Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. AAAI, 2018.
-  Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7826–7837, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
-  Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.