In the past several years, Neural Architecture Search (NAS), which automates the process of model design, has been gaining ground. Computer vision tasks (e.g. image classification [1, 26], semantic segmentation [14, 19, 25]) can be solved by NAS with surprising performance. However, early approaches [25, 26, 20] suffer from inefficiency. To solve this issue, several one-shot approaches [1, 19, 6, 14, 11] have been proposed. Generally speaking, one-shot NAS approaches sample cells, a micro search space presented in [26], from a family of predefined candidate operations according to a policy, and treat the sampled cells as the building blocks of a deep architecture, i.e. the child model, whose performance is used to update the policy’s parameters. These one-shot approaches avoid retraining each candidate deep architecture from scratch, so that high efficiency can be achieved.
In particular, Efficient Neural Architecture Search (ENAS) [19] delivers state-of-the-art efficiency, 0.45 GPU day, using parameter sharing and reinforcement learning. However, to guarantee the performance of the learned architecture, ENAS has to use a deep child model, whose propagation is slow, as its search paradigm. Therefore, ENAS suffers greatly from the slow propagation speed of a search model (child model) with deep topology.
In this paper, we propose a Broad version of ENAS (BENAS), an automatic architecture search approach with state-of-the-art efficiency. Unlike other NAS approaches, BENAS discovers an elaborately designed Broad Convolutional Neural Network (BCNN), instead of a deep one, in a one-shot model by parameter sharing and reinforcement learning, which addresses the aforementioned limitation of ENAS. In particular, we propose a new paradigm of search model, BCNN, which can obtain satisfactory performance with a shallow topology, i.e. with fast forward and backward propagation. The proposed BCNN extracts multi-scale features and enhancement representations, and feeds them into a global average pooling layer to yield more reasonable and comprehensive representations, which supports the performance of BCNN. Our contributions can be summarized as follows:
- We propose a broad version of ENAS named BENAS, which further improves the efficiency of ENAS by replacing the search model with BCNN, elaborately designed to provide satisfactory performance and fast propagation simultaneously.
- We achieve about 2x lower search cost than ENAS [19] (0.23 day with a single GeForce GTX 1080Ti GPU on CIFAR-10). Furthermore, through extensive experiments on CIFAR-10, we show that the architecture learned by BENAS can be applied to models of different sizes with state-of-the-art performance, in particular for small-size models.
- We show not only the strong transferability of the architecture learned by BENAS but also the multi-scale feature extraction capacity of BCNN. The BCNN built from the learned cells achieves 25.7% top-1 error on ImageNet using just 3.9 million parameters.
The remainder of this paper is organized as follows. In Section 2, we review work related to this paper. Subsequently, the proposed approach is presented in Section 3. Next, experiments on two datasets are performed, and qualitative and quantitative analyses are given in Section 4. Finally, we draw conclusions in Section 5.
2 Related Work
The proposed BENAS is related to previous work on the Broad Learning System (BLS) and NAS, both of which are introduced below.
2.1 Broad Learning System
BLS [2] is a developed model of the Random Vector Functional-Link Neural Network (RVFLNN) [17, 18], which takes the input data directly and builds enhancement nodes. Unlike the RVFLNN, BLS first establishes a set of mapped features from the input data to achieve better performance.
BLS consists of two parts: feature mapping nodes and enhancement nodes. First, the nonlinear transformation functions of the feature mapping nodes are applied to generate the mapped features of the input data. Subsequently, the mapped features are enhanced by the enhancement nodes, with randomly generated weights, to generate enhancement features. Finally, all the mapped features and enhancement features are used to deliver the final result. Chen et al. [3] introduce several variants of BLS, e.g. the cascade of feature mapping nodes broad learning system, and the cascade of feature mapping nodes with its last group connected to the enhancement nodes broad learning system. Below, the Cascade of Convolution Feature mapping nodes Broad Learning System (CCFBLS), which inspired our work, is introduced in detail.
Feature mapping nodes and enhancement nodes make up the CCFBLS [3]. In the feature mapping nodes, the mapped features are generated by a cascade of convolution and pooling operations. Then, these mapped features are enhanced by a nonlinear activation function to obtain a series of enhancement features. Finally, all of the mapped features and enhancement features are connected directly to the desired output. As described above, CCFBLS is not only broad but also deep. As a result, CCFBLS can extract multi-scale features and deep representations, which are more reasonable and comprehensive than those of models with only a deep structure.
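The two-part structure shared by BLS and CCFBLS (mapped features first, enhancement features second, both connected to the output) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [2] or [3]: plain dense maps with random weights stand in for the convolution and pooling cascade, `tanh` is an arbitrary choice of nonlinearity, the function names are hypothetical, and the output weights (which BLS would fit on the concatenated features) are omitted.

```python
import math
import random


def dense(x, weights, biases, activation):
    """Apply activation(W x + b) to a single input vector x."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_out, n_in, rng):
    """Randomly generated weights and biases, as used by the enhancement nodes."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def bls_features(x, n_mapped=4, n_enhance=6, seed=0):
    """Return the concatenation of mapped features and enhancement features."""
    rng = random.Random(seed)
    w_map, b_map = random_layer(n_mapped, len(x), rng)
    w_enh, b_enh = random_layer(n_enhance, n_mapped, rng)
    mapped = dense(x, w_map, b_map, math.tanh)         # feature mapping nodes
    enhanced = dense(mapped, w_enh, b_enh, math.tanh)  # enhancement nodes
    return mapped + enhanced  # both groups feed the output layer


features = bls_features([0.1, 0.2, 0.3])
```

The point of the sketch is the wiring: the enhancement nodes see only the mapped features, while the output sees both groups side by side.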
2.2 Neural Architecture Search
As a powerful tool for solving the architecture engineering problem in artificial intelligence tasks, especially computer vision tasks, NAS has achieved remarkable performance in the past several years. However, this unprecedented success depends on a prohibitive amount of computational resources.
Recent efforts have introduced various methods to improve the search efficiency of NAS [12, 13, 8, 7]. For example, based on a Sequential Model-Based Optimization (SMBO) strategy, Progressive Neural Architecture Search (PNAS) [13] searches the structure of convolutional neural networks in order of increasing complexity. In LEMONADE [8], a multi-objective evolutionary algorithm is proposed to improve the efficiency of architecture search. However, these approaches are still not efficient enough because each child model needs to be retrained from scratch.
A great number of one-shot approaches [1, 19, 14], which define all possible candidate architectures in a one-shot model to avoid retraining each child model from scratch, have been presented to further improve the efficiency of NAS. SMASH [1] uses a hypernetwork to generate the weights of a designed architecture so that the search process can be accelerated greatly. Furthermore, Liu et al. [14] propose Differentiable ARchiTecture Search (DARTS), which discovers the computation cells within a continuous domain to formulate NAS in a differentiable way. DARTS is three orders of magnitude less expensive than previous approaches [25, 26]. In particular, ENAS [19], a NAS approach with remarkable efficiency (0.45 day on a single GeForce GTX 1080Ti GPU, which is 3x faster than DARTS), has been presented. To improve efficiency, ENAS uses parameter sharing to avoid retraining each candidate deep architecture from scratch.
3 The Proposed Approach
3.1 Efficient Neural Architecture Search
In the reinforcement learning based ENAS, a Long Short-Term Memory (LSTM) [9] controller is trained in a loop: the LSTM first generates two types of cells, the Normal cell and the Reduction cell (more details can be found in previous works [26, 19]), as a list of tokens according to a sampling policy; the cells are stacked into a relatively deep child model, and the child model, whose weights are inherited from the one-shot model, is then trained for a single step to measure its validation accuracy on the desired task. Subsequently, the validation accuracy is treated as the reward of reinforcement learning to guide the LSTM controller towards discovering cells with better performance. ENAS asks the LSTM controller to maximize the expected reward, and a policy gradient algorithm, REINFORCE [22], is applied to compute the policy gradient with respect to the controller’s parameters.
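Written out in the notation commonly used for REINFORCE-based NAS, the expected reward and its policy gradient take the following form. The symbols are assumptions introduced here for illustration: $\theta$ denotes the controller parameters, $\pi(m;\theta)$ the sampling policy over child models $m$, $\omega$ the shared weights of the one-shot model, $R(m,\omega)$ the validation accuracy used as reward, $K$ the number of sampled child models, and $b$ an optional baseline that reduces variance without biasing the estimate.

```latex
J(\theta) = \mathbb{E}_{m \sim \pi(m;\theta)}\bigl[ R(m,\omega) \bigr],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{m \sim \pi(m;\theta)}\bigl[ R(m,\omega)\, \nabla_\theta \log \pi(m;\theta) \bigr]
  \approx \frac{1}{K} \sum_{k=1}^{K} \bigl( R(m_k,\omega) - b \bigr)\, \nabla_\theta \log \pi(m_k;\theta)
```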
After many iterations of this loop, cells with satisfactory performance can be found.
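The controller loop described above can be illustrated with a toy REINFORCE example. This is a sketch under strong simplifying assumptions, not the ENAS controller: a softmax over three discrete choices replaces the LSTM, fixed numbers replace the validation accuracy of a trained child model, and a moving-average baseline is used for variance reduction. All names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards, steps=2000, lr=0.1, decay=0.9, seed=0):
    """Train a softmax policy over len(rewards) choices with REINFORCE."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]  # stands in for validation accuracy
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - probs[i]
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        # Moving-average baseline, updated after the policy step
        baseline = decay * baseline + (1.0 - decay) * reward
    return softmax(logits)


# Three hypothetical "architectures" with rewards 0.2, 0.5 and 0.9
probs = reinforce([0.2, 0.5, 0.9])
```

After training, most of the probability mass sits on the choice with the highest reward, which mirrors how the controller is steered towards cells with better validation accuracy.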
3.2 Problem Analysis
ENAS suffers from the slow propagation speed of a child model with deep topology. Below, we discuss the reasons in detail.
First of all, some prior knowledge should be stated. Two deep neural networks with the same number of parameters but different depths propagate at different speeds: the shallower one is faster than the deeper one. Moreover, a neural network with deep topology generally performs better than a shallow one.
Furthermore, ENAS has to employ a child model with deep topology as its search paradigm. Loosely speaking, ENAS has two phases: architecture search and architecture deriving. On the one hand, in the architecture search phase, the cells sampled by the LSTM are stacked as the building blocks of a child model with a given number of layers. On the other hand, in the architecture deriving phase, a deeper model is stacked from the sampled cells. Without loss of generality, the number of layers of the derived model is set as large as possible to achieve high accuracy. Meanwhile, the number of layers of the search model should be set to a relatively large value to reduce the difference between the models constructed in the two phases, i.e. to ensure the stability and rationality of ENAS.
From the above, we conclude that reducing the depth of the child model in the architecture search phase of ENAS can improve efficiency but leads to performance loss. To solve this issue, we propose a broad version of ENAS named BENAS, in which a novel paradigm of child model named BCNN is elaborately designed as the search model of ENAS.
3.3 Broad Convolutional Neural Network
In BENAS, we propose BCNN, which delivers satisfactory performance and fast propagation simultaneously with a broad topology, as the search paradigm and child model for automatic architecture design. Moreover, a two-layer LSTM controller, reinforcement learning and parameter sharing (more details can be found in ENAS [19]) are adopted for architecture sampling, updating the controller’s parameters and accelerating the architecture search process, respectively. As aforementioned, the proposed BCNN is developed from CCFBLS [3], which is not only broad but also deep. For intuitive understanding, the structure of BCNN and its two important components, the convolution and enhancement blocks, are depicted in Figure 1.
BCNN consists of convolution blocks and enhancement blocks, which are used for feature extraction and enhancement, respectively. Each convolution block contains a number of deep cells and a single broad cell, which are used for deep and broad feature extraction, respectively. The number of convolution blocks is determined by the size of the input images; for example, we use two convolution blocks for the experiments on CIFAR-10 with 32×32 pixels. The other two parameters, the number of deep cells per convolution block and the number of enhancement blocks, need to be defined by the user. For convenience, these three parameters of a BCNN are written in a simple @-separated notation: number of convolution blocks @ number of deep cells per block @ number of enhancement blocks. For instance, 3@2@2 means that there are 3 convolution blocks, each containing 2 deep cells, and 2 enhancement blocks in the BCNN.
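The @-separated notation can be made concrete with a small parser. The sketch below only encodes what the notation states; the class and function names are hypothetical, and allowing zero deep cells is an assumption based on the 2@0@2 topology used in Section 4.1.

```python
from typing import NamedTuple


class BCNNTopology(NamedTuple):
    conv_blocks: int         # number of convolution blocks
    deep_cells: int          # deep cells per convolution block
    enhancement_blocks: int  # number of enhancement blocks


def parse_topology(spec: str) -> BCNNTopology:
    """Parse a string such as '3@2@2' into its three integer parameters."""
    parts = spec.split("@")
    if len(parts) != 3:
        raise ValueError(f"expected three @-separated integers, got {spec!r}")
    conv_blocks, deep_cells, enhancement_blocks = (int(p) for p in parts)
    if conv_blocks < 1 or deep_cells < 0 or enhancement_blocks < 0:
        raise ValueError(f"invalid topology {spec!r}")
    return BCNNTopology(conv_blocks, deep_cells, enhancement_blocks)


def count_convolution_cells(topology: BCNNTopology) -> int:
    """Each convolution block holds its deep cells plus a single broad cell."""
    return topology.conv_blocks * (topology.deep_cells + 1)


example = parse_topology("3@2@2")
```

For 3@2@2 this gives 3 × (2 + 1) = 9 convolution cells in total, alongside the 2 enhancement blocks.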
In each convolution block, the deep cells and the broad cell have the same topology but different strides: one for the deep cells and two for the broad cell. To extract broad features from the output of the final deep cell, the broad cell returns feature maps with half the width, half the height and double the channels, i.e. broad features. Each enhancement block contains a single enhancement cell with stride one and a topology different from that of the convolution cells. The proposed BCNN cascades the convolution blocks one after another and feeds the output of the final convolution block into each enhancement block to obtain enhancement feature representations. The convolution and enhancement features from every convolution and enhancement block are all connected to the global average pooling layer to yield more reasonable and comprehensive representations, which supports the performance of the proposed BCNN. The computation of BCNN is described in more detail below.
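The stride rules above determine how the feature-map shape evolves through a BCNN. The following sketch traces (height, width, channels) through the blocks. It assumes that deep cells and enhancement cells keep the channel count unchanged and that the input has already passed through a stem with a hypothetical number of channels; neither detail is stated in the text.

```python
def bcnn_feature_shapes(height, width, channels,
                        conv_blocks, deep_cells, enhancement_blocks):
    """Trace feature-map shapes (H, W, C) through a BCNN.

    deep_cells is accepted for completeness; under the assumptions above,
    stride-1 deep cells do not change the shape.
    """
    shapes = {"conv": [], "enhancement": []}
    h, w, c = height, width, channels
    for _ in range(conv_blocks):
        deep_out = (h, w, c)                # deep cells: stride 1
        h, w, c = h // 2, w // 2, c * 2     # broad cell: stride 2, double channels
        shapes["conv"].append({"deep": deep_out, "broad": (h, w, c)})
    for _ in range(enhancement_blocks):
        # Every enhancement block receives the output of the final convolution block
        shapes["enhancement"].append((h, w, c))
    return shapes


# 2@1@1 on 32x32 inputs with a hypothetical 16-channel stem
shapes = bcnn_feature_shapes(32, 32, 16, conv_blocks=2, deep_cells=1,
                             enhancement_blocks=1)
```

With these inputs the two convolution blocks produce broad features of shape (16, 16, 32) and (8, 8, 64), and the enhancement block operates at (8, 8, 64).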
For each convolution block, the deep feature mapping and the broad feature mapping are computed by the deep cells and the broad cell of that block, each with its own weight and bias matrices. Each cell applies a set of transformations (e.g. depthwise-separable convolution [4], pooling, skip connection) and uses the outputs of its previous two cells as inputs in order to combine various features. For the first and second convolution cells, however, the outputs of the previous two cells are not defined. Therefore, a convolution is inserted at the front of BCNN to provide the input information for the first and second convolution cells.
For each enhancement block, the enhancement feature representations are computed by its enhancement cell, which has its own weight and bias matrices and applies its own set of transformations to the output of the final convolution block.
To aggregate as many convolution and enhancement features as possible into more reasonable and comprehensive representations, all outputs of the convolution and enhancement blocks are connected directly to the global average pooling (GAP) layer. Here, the output of the last deep cell in each convolution block is connected, so that features of all scales are fed into the GAP layer. The final output of the GAP layer is obtained by a combination of convolution, concatenation and global average pooling. Here, a priori knowledge is incorporated into BCNN. Based on a large number of experiments, we find that low-resolution feature maps are more important than high-resolution feature maps for achieving high performance. In other words, to design a BCNN with good performance, more deep and broad feature maps from the later convolution blocks, whose resolution is lower, than from the earlier ones should be fed into the GAP layer. To incorporate this prior knowledge into BCNN, a group of convolutions is employed in each connection between a convolution block and the GAP layer. These convolutions accept the feature representations from the final deep cell in each convolution block and output groups of feature maps with different importance. The importance is represented by the number of output channels: the larger the number, the more important the features. Furthermore, these convolutions have different strides so that all input feature maps can be concatenated at the same size.
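The role of the connecting convolutions can be illustrated by computing, for each convolution block, a stride that brings its feature maps to a common size and an output channel count that grows for lower-resolution blocks. The sketch is illustrative only: aligning to the smallest size and doubling the channels per block are assumptions chosen for the example, and the function name is hypothetical.

```python
def gap_connection_plan(feature_sizes, base_channels):
    """Plan the convolution on each block-to-GAP connection.

    feature_sizes: spatial size of the last deep cell output of each
    convolution block, ordered from the first block to the last.
    """
    target = min(feature_sizes)  # assumption: align to the smallest map
    plan = []
    for depth, size in enumerate(feature_sizes):
        if size % target != 0:
            raise ValueError("sizes must be integer multiples of the smallest size")
        plan.append({
            "stride": size // target,                   # equalize spatial size
            "out_channels": base_channels * 2 ** depth  # later blocks get more channels
        })
    return plan


# Three convolution blocks whose deep cells output 32x32, 16x16 and 8x8 maps
plan = gap_connection_plan([32, 16, 8], base_channels=8)
```

Here the first block is downsampled with stride 4 into 8 channels, while the last block keeps stride 1 and contributes 32 channels, so the low-resolution features dominate the concatenated input to the GAP layer.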
Because of the above, the proposed BCNN can achieve high performance with a shallow topology, so that the extremely fast forward and backward propagation needed by ENAS can be obtained.
3.4 Training Strategy
An overview of the training strategy of BENAS can be found in Algorithm 1. We need to train not only the controller, so that it generates better BCNNs, but also the child models with different architectures. The training of BENAS is a dual optimization problem due to the interrelation between the complete model and the controller. To reduce the computational complexity of this optimization problem, we divide the training procedure of BENAS into two interleaving phases.
In the first phase, the parameters of the LSTM are fixed. Each child model is sampled and trained on 45000 images of CIFAR-10, and the trained weights of the child model are stored into the complete model so that the next child model can restore them. In the second phase, the parameters of the complete model are fixed. The LSTM predicts a list of tokens, which can be regarded as a list of actions representing a cell. The sampled cell is then stacked as the building block of a child model following the paradigm of BCNN shown in Figure 1, and the child model’s weights are restored from the complete model. Finally, the accuracy of the model on 5000 validation images of CIFAR-10 is used as the reward signal for the LSTM, and a policy gradient algorithm is applied to optimize the controller’s parameters.
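The interleaving of the two phases, together with the storing and restoring of shared weights, can be sketched with a toy loop. Everything here is a stand-in chosen for illustration and is not Algorithm 1: update counters replace real weights, a saturating function replaces validation accuracy, a simple preference increment replaces the policy gradient step, and all names are hypothetical.

```python
import random

OPS = ["op_a", "op_b", "op_c"]  # hypothetical candidate operations


def toy_validation_accuracy(weight):
    """Stand-in for accuracy on the validation split: improves with training."""
    return weight / (weight + 1.0)


def train_interleaved(epochs=3, child_steps=4, controller_steps=4, seed=0):
    rng = random.Random(seed)
    shared = {op: 0 for op in OPS}  # complete (one-shot) model
    prefs = [1.0] * len(OPS)        # controller parameters
    log = []
    for _ in range(epochs):
        # Phase 1: controller fixed, sampled child models are trained
        for _ in range(child_steps):
            op = OPS[rng.choices(range(len(OPS)), weights=prefs)[0]]
            weight = shared[op]      # restore from the complete model
            shared[op] = weight + 1  # one training step, then store back
        log.append("child")
        # Phase 2: complete model fixed, controller updated from the reward
        for _ in range(controller_steps):
            idx = rng.choices(range(len(OPS)), weights=prefs)[0]
            reward = toy_validation_accuracy(shared[OPS[idx]])
            prefs[idx] += reward     # placeholder for a policy gradient step
        log.append("controller")
    return shared, prefs, log


shared, prefs, log = train_interleaved()
```

The key property is that each phase modifies only one of the two parameter sets: `shared` changes only in the first phase and `prefs` only in the second.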
4 Experiments and Analysis
4.1 Architecture Search on CIFAR-10
Following ENAS [19], CIFAR-10 is chosen as the search dataset, with a series of standard data augmentation techniques (see ENAS [19] for details). In BENAS, we choose five candidate operations (two depthwise-separable convolutions with different kernel sizes, max pooling, average pooling and skip connection) as the components of the convolution cell and the enhancement cell, each with 7 nodes.
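A cell drawn from this search space can be represented as a flat list of tokens. The encoding below is an assumption modeled on the ENAS micro search space, where the first two of the 7 nodes are inputs and every later node picks two earlier nodes and one operation for each; the operation names are hypothetical labels, and uniform random sampling stands in for the LSTM controller.

```python
import random

# Hypothetical labels for the five candidate operations
CANDIDATE_OPS = ["sep_conv_small", "sep_conv_large", "max_pool", "avg_pool", "skip"]
NUM_NODES = 7
NUM_INPUT_NODES = 2  # assumption: the first two nodes are the cell inputs


def sample_cell(rng):
    """Sample one cell as a token list: (input, op, input, op) per computed node."""
    tokens = []
    for node in range(NUM_INPUT_NODES, NUM_NODES):
        for _ in range(2):
            tokens.append(rng.randrange(node))                # index of an earlier node
            tokens.append(rng.randrange(len(CANDIDATE_OPS)))  # operation index
    return tokens


cell = sample_cell(random.Random(0))
```

Under this encoding a cell is described by 5 × 4 = 20 tokens, and each token can be read as one action of the controller.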
In the architecture search phase, to train the broad model with a topology of 2@0@2 (see Section 3.3 for the definition of this notation), Nesterov momentum is adopted and the learning rate follows the cosine schedule [16] with a maximum of 0.05, a minimum of 0.0005, an initial period of 10 epochs and a period multiplier of 2. Furthermore, the experiment runs for 150 epochs with batch size 128. To update the parameters of the LSTM, the Adam optimizer with a learning rate of 0.0035 is applied.
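The cosine schedule with warm restarts can be written as a short function. This sketch follows the SGDR formula of [16] and assumes that the four values quoted above are, in order, the maximum learning rate, the minimum learning rate, the initial period and the period multiplier, and that the rate is updated once per epoch; the parameter names are chosen here for illustration.

```python
import math


def sgdr_learning_rate(epoch, lr_max=0.05, lr_min=0.0005, t_0=10, t_mul=2):
    """Cosine annealing with warm restarts, evaluated at an integer epoch."""
    period = t_0
    t_cur = epoch
    # Skip over completed periods; each period is t_mul times longer than the last
    while t_cur >= period:
        t_cur -= period
        period *= t_mul
    cosine = math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


schedule = [sgdr_learning_rate(e) for e in range(150)]
```

With these settings the periods last 10, 20, 40 and 80 epochs, which add up to exactly the 150 epochs of the search, and the rate jumps back to its maximum at epochs 10, 30 and 70.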
The top-performing convolution cell and enhancement cell discovered by BENAS are shown in Figure 2. Based on the learned cells, a family of BCNNs with the same topology of 2@1@1 but different numbers of parameters, obtained by changing the number of channels, is constructed. Comparisons of BENAS with other NAS approaches on CIFAR-10 for models of different sizes under identical training conditions are shown in Table 1. Moreover, a popular data augmentation technique, Cutout [5], is applied for BENAS in the architecture deriving phase but not in the search phase.
4.2 Transferability of Learned Architecture on ImageNet
A large-scale image classification model named BENASNet, stacked from the learned cells, is built for ImageNet 2012. This experiment is performed not only to verify the transferability of the architecture discovered by BENAS, but also to demonstrate the multi-scale feature extraction capacity of the proposed BCNN.
As in the experiments on CIFAR-10, some data augmentation techniques, for instance random cropping and flipping, are applied to the input images. In this experiment, BENASNet consists of five convolution blocks and a single enhancement block. Moreover, there is only one deep cell in each convolution block, i.e. the topology of BENASNet is 5@1@1. We train BENASNet for 150 epochs with batch size 256 using the SGD optimizer with momentum 0.9 and weight decay. The initial learning rate is set to 0.1 and decayed by a factor of 0.1 at epochs 70, 100 and 130. Other hyperparameters, e.g. label smoothing and gradient clipping bounds, can be found in DARTS [14].
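The step decay used for ImageNet training can be expressed in a few lines. The sketch assumes that the decay takes effect at the start of each milestone epoch, with epochs counted from zero; the function name is hypothetical.

```python
def step_learning_rate(epoch, base_lr=0.1, milestones=(70, 100, 130), gamma=0.1):
    """Multiply the base learning rate by gamma once per milestone already reached."""
    reached = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** reached


rates = [step_learning_rate(e) for e in range(150)]
```

The rate is therefore 0.1 for epochs 0 to 69, 0.01 for epochs 70 to 99, 0.001 for epochs 100 to 129 and 0.0001 for the remaining epochs, up to floating-point rounding.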
Table 2 summarizes the results in terms of accuracy and number of parameters, and compares BENASNet with other state-of-the-art image classifiers on ImageNet.
|ShuffleNet (2x) ||29.1||10.2||10|
4.3 Results Analysis
For the experiments on CIFAR-10, three models with the same topology but different numbers of parameters (0.5, 1.1 and 4.1 million) are constructed based on the architecture learned by BENAS. In the first and second blocks of Table 1, DPP-Net [7] and LEMONADE [8] are chosen as the NAS approaches for comparison. BENAS delivers small-size BCNNs with the best accuracy for the small-scale image classification task. In particular, for the models with 0.5 million parameters, BENAS outperforms the compared NAS methods by almost 1%, which is a considerable improvement. Furthermore, in the third block of Table 1, a large-size model is constructed and several state-of-the-art NAS approaches, AmoebaNet [20], NASNet [26], DARTS [14] and ENAS [19], are chosen for comparison with the proposed method. BENAS achieves a competitive result of 2.95% test error with 4.1 million parameters.
Furthermore, two aspects, accuracy and number of parameters, are compared for the experiment on ImageNet. We choose not only NAS approaches (second block of Table 2) but also manually designed models (first block of Table 2) for comparison. In terms of accuracy, BENASNet achieves 25.7% top-1 test error, which is only 0.2% worse than the state-of-the-art model designed by NAS, AmoebaNet-A [20]. This supports the transferability of the learned architecture and the multi-scale feature extraction capacity of BCNN for the large-scale image classification task. In terms of parameters, BENASNet obtains the above competitive accuracy with 3.9 million parameters, which is state-of-the-art for NAS approaches. Here, the multi-scale features extracted by BCNN are fused to yield more reasonable and comprehensive representations, so that BENASNet can make more accurate decisions for image classification with few parameters.
In addition to the above discussion, we find an interesting phenomenon in the learned cells: they consist entirely of convolution operations and skip connections, without any pooling operations, as shown in Figure 2. One possible reason is that convolution and skip connection are more suitable for the broad topology, where each block needs more convolution operations to extract multi-scale features.
The extremely fast search speed of BENAS, 0.23 day on a single GeForce GTX 1080Ti GPU, is state-of-the-art for NAS.
As shown in Table 1, BENAS is about 14000x and 8000x faster than AmoebaNet [20] and NASNet [26], respectively, i.e. roughly four orders of magnitude. Compared with the relatively efficient NAS methods Hierarchical Evo [12], PNAS [13] and LEMONADE [8], BENAS uses about 1300x, 1000x and 350x less computational resources, respectively. Furthermore, several state-of-the-art efficient NAS approaches, DPP-Net [7], SMASH [1], DARTS [14] and ENAS [19], are compared in detail with the proposed BENAS below.
First of all, BENAS is compared with DPP-Net and SMASH. BENAS is about 17x and 6.5x faster than these two approaches, respectively. Moreover, the performance of BENAS is better, as aforementioned. SMASH suffers from a low-rank restriction discussed in [19], so the architecture discovered by SMASH cannot outperform BENAS. Compared with DARTS, a gradient-based NAS approach, BENAS is about 6.5x and 17x faster than DARTS with first-order and second-order approximation, respectively. In terms of performance, BENAS outperforms DARTS with first-order approximation but not DARTS with second-order approximation, which uses 17x more computational resources than our approach.
In particular, the search cost of BENAS is about 2x less than that of ENAS. As aforementioned, BENAS also uses an LSTM controller, reinforcement learning and parameter sharing for architecture sampling, updating the controller’s parameters and accelerating the architecture search process, respectively. As a result, we conclude that the proposed BCNN can help improve the efficiency of cell-based NAS approaches in general, not merely ENAS.
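The speedup factors quoted in this section are simple ratios of search cost, and their size in orders of magnitude is the base-10 logarithm of the ratio. The short check below uses only the two search costs stated in the text (0.23 and 0.45 GPU day) and the quoted factors of 14000x and 8000x; the function names are hypothetical.

```python
import math

BENAS_GPU_DAYS = 0.23  # search cost reported for BENAS
ENAS_GPU_DAYS = 0.45   # search cost reported for ENAS


def speedup(baseline_gpu_days, ours_gpu_days=BENAS_GPU_DAYS):
    """How many times cheaper our search is than the baseline."""
    return baseline_gpu_days / ours_gpu_days


def orders_of_magnitude(factor):
    """Size of a speedup factor expressed as a power of ten."""
    return math.log10(factor)


print(round(speedup(ENAS_GPU_DAYS), 2))      # 1.96
print(round(orders_of_magnitude(14000), 2))  # 4.15
print(round(orders_of_magnitude(8000), 2))   # 3.9
```

The ratio 0.45 / 0.23 ≈ 1.96 is what is rounded to "about 2x", and factors of 14000 and 8000 both correspond to about four orders of magnitude.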
5 Conclusions
In this paper, we propose a broad version of ENAS named BENAS. The core idea is to design a novel BCNN that replaces the deep search model in ENAS and thereby further accelerates the search process. In terms of efficiency, our approach takes 0.23 GPU day on CIFAR-10, about 2x less than ENAS. In terms of performance, our approach achieves state-of-the-art performance on both small- and large-scale image classification tasks, in particular for small-size models on CIFAR-10.
In this paper, we only develop one broad learning system, CCFBLS, into the search paradigm of BENAS. However, other structural variations of BLS, which may achieve better performance, are also presented in [3]. In the future, we will extend BENAS to these variations of BLS to obtain better NAS approaches.
This work was supported in part by the Youth Research Fund of the State Key Laboratory of Complex Systems Management and Control under No. 20190213, in part by the International Partnership Program of the Chinese Academy of Sciences under No. GJHZ1849, in part by the Program of Huawei Technologies Co. Ltd. under No. FA2018111061SOW12, and in part by Noah’s Ark Lab, Huawei Technologies.
- Brock et al. [2017] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
- Chen and Liu [2017] CL Philip Chen and Zhulin Liu. Broad learning system: An effective and efficient incremental learning system without the need for deep architecture. IEEE transactions on neural networks and learning systems, 29(1):10–24, 2017.
- Chen et al. [2018] CL Philip Chen, Zhulin Liu, and Shuang Feng. Universal approximation capability of broad learning system and its structural variations. IEEE transactions on neural networks and learning systems, (99):1–14, 2018.
- Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
- DeVries and Taylor [2017] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
- Ding et al. [2019] Zixiang Ding, Yaran Chen, Nannan Li, and Dongbin Zhao. Simplified space based neural architecture search. In The 2019 IEEE Symposium Series on Computational Intelligence, 2019.
- Dong et al. [2018] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. Dpp-net: Device-aware progressive search for pareto-optimal neural architectures. In Proceedings of the European Conference on Computer Vision (ECCV), pages 517–531, 2018.
- Elsken et al. [2018] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural architecture search via lamarckian evolution. arXiv preprint arXiv:1804.09081, 2018.
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Li et al. [2019] Nannan Li, Yaran Chen, Zixiang Ding, and Dongbin Zhao. Light-weight neural architecture search for resource-constrainted device. In 2019 Chinese Automation Congress, 2019.
- Liu et al. [2017] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
- Liu et al. [2018a] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
- Liu et al. [2018b] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
- Liu et al. [2019] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985, 2019.
- Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Pao and Takefuji [1992] Y-H Pao and Yoshiyasu Takefuji. Functional-link net computing: theory, system architecture, and functionalities. Computer, 25(5):76–79, 1992.
- Pao et al. [1994] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J Sobajic. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, 1994.
- Pham et al. [2018] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
- Real et al. [2018] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
- Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Wu et al. [2019] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
- Zhang et al. [2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
- Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
- Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.