1 Introduction
Neural architecture search (NAS) Zoph et al. (2018) automates the neural architecture design process. In general, NAS approaches can be divided into the following categories: reinforcement-learning-based Zoph et al. (2018), evolutionary-algorithm-guided Real et al. (2019), Bayesian-optimization-driven Kandasamy et al. (2018), graph-generator-based Xie et al. (2019), and differentiable Liu et al. (2019); Luo et al. (2018); Liang et al. (2019); Chen et al. (2019) methods. Because neural architecture design choices are discrete, the approaches in the first four categories search directly over the discrete architecture space. However, these methods must evaluate a large number of candidate architectures in a high-dimensional discrete sampling space and thus suffer from long search times and high computational costs.
Gradient-based and differentiable methods such as NAO Luo et al. (2018) and DARTS Liu et al. (2019); Liang et al. (2019); Chen et al. (2019) were proposed to reduce these costs. Instead of searching directly in the discrete architecture space, differentiable methods map the architecture into a continuous space and optimize over the projected continuous space. The advantages are twofold. First, the projected low-dimensional continuous space is more representative, which makes gradient-based optimization easier to apply; second, optimizing in a continuous space is more efficient because the simplified space carries less redundant, irrelevant information. Specifically, DARTS Liu et al. (2019) relaxes the optimization objective by encoding node operations as a weighted combination. NAO Luo et al. (2018) employs an encoder-decoder structure integrated with an accuracy predictor: neural architectures are first encoded into a continuous latent space with LSTM cells, the predictor then takes the continuous representation and predicts its accuracy, and finally the decoder maps the continuous representation back to a neural architecture.
Recent works Xie et al. (2019); Wortsman et al. (2019) demonstrate the importance of graph topology in neural architecture design. However, existing differentiable methods Liu et al. (2019); Luo et al. (2018) cannot precisely preserve graph topology when mapping an architecture into a continuous space. This may result in an inaccurate representation and/or reduced representation capacity of the projected continuous latent space Luo et al. (2018). For example, distances in the latent space may not appropriately reflect graph distances in the discrete space, so a cell that is optimal in the latent space may not be optimal in the discrete graph space. In addition, existing differentiable methods do not support flexible inner-cell search. For example, NAO Luo et al. (2018) restricts a cell to two input nodes and one output node, and also fixes the number of tensor aggregations in a cell. Such restrictions severely limit the representation capacity of the found cell.
We propose NASGEM (Neural Architecture Search via Graph Embedding Method) to overcome the above limitations of existing differentiable methods. By introducing a graph kernel, a similarity measure, and efficiency prediction, NASGEM carefully encodes graphs into a latent space and enables the search for cells with high representation capacity. Our contributions are as follows: 1) NASGEM preserves the network structure when constructing a graphically meaningful latent space for neural architecture search. 2) NASGEM employs an efficiency score predictor to model the relationship between a cell structure and its performance. With the pretrained graph embedding, our predictor can accurately estimate model performance from the graph-vector representation of a cell structure. 3) The exploration of optimal cell structures is further improved by bootstrap optimization, which guarantees the feasibility of graph vectors in the latent embedding space. 4) GEMNet performs better than models obtained by other differentiable methods, thanks to the more efficient cell design offered by NASGEM. Specifically, GEMNet achieves 62% and 8% parameter reduction and 20.7% and 19.3% Multiply-Accumulate (MAC) reduction compared to NAO and DARTS, respectively. Evaluation using NAS-Bench-101 Ying et al. (2019) further verifies the effectiveness of graph embedding.
2 Related Work
Graph Embedding. Graph embedding Goyal and Ferrara (2018); Grover and Leskovec (2016) maps a topological graph structure into a continuous latent space. Traditional vertex graph embedding maps each node to a low-dimensional feature vector while preserving the connection relationships between vertices. Factorization-based methods Roweis and Saul (2000), random-walk-based methods Perozzi et al. (2014), and deep-learning-based methods Wang et al. (2016) are the popular approaches to traditional vertex graph embedding. However, it is challenging to apply existing graph embedding methods to deep neural networks (DNNs) for two reasons: (1) similarities among DNN cell structures cannot be explicitly derived from traditional graph embeddings; (2) sophisticated deep-learning-based methods such as DNGR Cao et al. (2016) and GCN Kipf and Welling (2016); Defferrard et al. (2016); Li et al. (2020) require complex mechanisms for training on structural data. NASGEM addresses cell structure similarity by computing the cosine similarity of two embedded graph vectors. It also simplifies the training of the graph embedding by employing an encoding structure.
Neural Architecture Search. Some studies adopt a cell-based approach that builds on existing handcrafted architecture motifs (e.g., MobileNetV2 Sandler et al. (2018)) and focus on the optimization algorithm, such as random search Li and Talwalkar (2019), evolutionary Real et al. (2017); Elsken et al. (2019), progressive Liu et al. (2018), gradient-based Liu et al. (2019); Pham et al. (2018), and Bayesian optimization Kandasamy et al. (2018); Zhou et al. (2019) methods. However, because the design space of the cell structures remains inherently unchanged, the capacity of the architectures these solutions can find is still limited. Recent works Luo et al. (2018); Liu et al. (2019) define a differentiable search space and use gradient-based methods to optimize the architecture representation. However, such methods usually lack topological consideration. For example, DARTS Liu et al. (2019) and its variants Chen et al. (2019); Xu et al. (2019) take only the optimal combination of operations as the optimization goal. Although NAO Luo et al. (2018) performs optimization within the graph embedding space, it applies strong constraints to the search space. First, NAO uses only 5 nodes when constructing the search space for cell structures, far fewer than the 30 nodes in NASGEM. In addition, NAO requires every node to connect to at most two previous nodes. In comparison, NASGEM allows neural architectures to be explored with more freedom. It is worth noting that several techniques have been proposed to further improve differentiable search, such as early stopping and partial channel connections Liang et al. (2019); Xu et al. (2019). These improvements are orthogonal to the key design of NASGEM and can be combined with our approach for better performance, which we leave as future work.
Network Wiring Exploration. Recently, neural architectures with sophisticated wiring have superseded traditional chain-like neural architectures at similar computation cost Xie et al. (2019). Random wiring Xie et al. (2019) shows that using random graphs as priors when generating neural architectures achieves a significant performance boost on various tasks. However, unconstrained random wiring may suffer from the following problems: (1) Simply tuning the hyperparameters of the random prior generator may inevitably constrain the search space, which could otherwise be explored (e.g., by NASGEM). This limitation of the search space may reduce the quality of the cell structures explored by random wiring. (2) Unconstrained aggregation of tensors leads to high memory consumption within wired neural architectures. NASGEM addresses the memory problem by constraining the maximum number of concatenations within each cell structure during bootstrap optimization.
3 Methodology
3.1 An Overview of NASGEM
NASGEM applies a differentiable method to conduct efficient architecture search based on a continuous vectorized representation of graphs. NASGEM represents DNN architectures as a combination of Directed Acyclic Graphs (DAGs) and targets the optimization of the inner cell structures. NASGEM has three main components: graph embedding, an efficiency score predictor, and bootstrap optimization.
NASGEM employs graph embedding to project candidate graphs into a latent space. Specifically, we introduce a graph encoder to project the graphs from a discrete space into a continuous latent space. Under the supervision of a graph kernel, our mapping has physical meanings in a topology space, i.e., the distance of graph vectors in the embedding space reflects the similarity of graph structures. To model the relationship between the graph and the performance of its derived DNN architecture, an efficiency score predictor is adopted and trained through example graphs along the search iterations. We employ bootstrap optimization to explore the optimal cell with the efficiency score predictor.
Fig. 1 depicts the three-step workflow of NASGEM. In the first step, we construct a graph encoder to efficiently derive graph vectors from candidate graphs. The encoder is trained to represent graphs under the supervision of the Weisfeiler-Lehman (WL) kernel Shervashidze et al. (2011). A large number of graph pairs randomly sampled from the supernet are used to train the graph encoder, as shown in Fig. 1(a). In the second step, we use the pretrained graph encoder and conduct architecture search to explore the relationship between graph vectors and their corresponding performance on a small proxy dataset. An efficiency score predictor is introduced and updated during the exploration process, as shown in Fig. 1(b). Finally, we obtain the target optimal cell structure by applying bootstrap optimization within a large sample space. The cell structure with the highest score given by the efficiency score predictor is adopted as the optimal cell structure of the NASGEM search, as shown in Fig. 1(c).
3.2 Graph Embedding
We adopt graph embedding to model the similarity of different vectorized representations of DNN cell structures. Specifically, we use a graph vector to represent the topological graph structure in a continuous space, capturing the node information (i.e., DNN operations) and the neighboring connections (i.e., the distribution of tensors).
To connect graph representations in the discrete space and the continuous space, a graph encoder is introduced to map discrete graph representations (i.e., adjacency matrices) to continuous vectors. Graph theory offers various approaches to measuring the similarity of graphs in the discrete topological space, such as the WL kernel Shervashidze et al. (2011). In the continuous graph embedding space, cosine similarity is widely used as a bounded measure of the similarity of graph vectors. The encoded vectors aim to preserve graph similarity, as measured by cosine similarity in the continuous space, as shown in Fig. 2. Therefore, we train the graph encoder with the objective of minimizing the difference between the WL kernel value in the discrete graph space and the cosine similarity value in the continuous space.
Weisfeiler-Lehman (WL) Kernel. We adopt the WL kernel Shervashidze et al. (2011) as a similarity measure of two input graphs as follows:
(1) 
where are the adjacency matrices for graph , . denotes the iteration times of computing WL kernel. For graphs with nodes, the WL kernel can be computed with iterations in . Compared with other graph similarity metrics, WL kernel is able to measure large and complex graphs with relatively low computational complexity.
The procedure for measuring the similarity between two input graphs is as follows. To start, each node is labeled with its degree. In each iteration, every node's label is extended into a long string label by appending the sorted labels of its neighbors. A lookup table is constructed to map each long string label to a unique integer; it is initialized by assigning integers, in order, to each distinct label of the labeled graphs. Counting the occurrences of each label yields a feature vector for each graph, and the kernel value is obtained by computing the dot product of the feature vectors of the two graphs. Finally, the kernel value is used as the measurement of graph similarity.
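The labeling procedure above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `wl_kernel` and `normalized_wl`, the choice to ignore edge direction when collecting neighbors, and the normalization to a bounded value are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def wl_kernel(adj_a, adj_b, iterations=2):
    """Dot product of WL label-count vectors of two graphs (0/1 adjacency matrices)."""
    graphs = [adj_a, adj_b]
    # Assumption of this sketch: neighbourhoods ignore edge direction.
    neighbours = [
        [[j for j in range(len(adj)) if adj[i][j] or adj[j][i]] for i in range(len(adj))]
        for adj in graphs
    ]
    table = {}  # shared lookup table: long label -> unique integer

    def compress(long_label):
        return table.setdefault(long_label, len(table))

    # Initial labels are the node degrees.
    labels = [[compress(("deg", len(nb))) for nb in graph_nb] for graph_nb in neighbours]
    counts = [Counter(graph_labels) for graph_labels in labels]
    for _ in range(iterations):
        # Long label = own label plus the sorted labels of the neighbours.
        labels = [
            [
                compress((labels[g][i], tuple(sorted(labels[g][j] for j in neighbours[g][i]))))
                for i in range(len(graphs[g]))
            ]
            for g in range(2)
        ]
        for g in range(2):
            counts[g].update(labels[g])
    return sum(counts[0][label] * counts[1][label] for label in counts[0])


def normalized_wl(adj_a, adj_b, iterations=2):
    """Kernel value scaled by the self-similarities (an assumed normalization)."""
    denom = sqrt(wl_kernel(adj_a, adj_a, iterations) * wl_kernel(adj_b, adj_b, iterations))
    return wl_kernel(adj_a, adj_b, iterations) / denom if denom else 0.0


chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # 0 -> 1 -> 2
dense = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]   # every forward edge present
similarity = normalized_wl(chain, dense)
```

Sharing one lookup table between the two graphs keeps identical long labels mapped to the same integer, which is what makes the dot product of the count vectors meaningful.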
Similarity measurement of graph vectors. We adopt cosine similarity to measure the similarity of the vectorized graph representations (i.e., graph vectors). The cosine similarity measurement on the embedding space is defined as:
$$D_c(g_1, g_2) = \frac{g_1 \cdot g_2}{\lVert g_1 \rVert \, \lVert g_2 \rVert} \qquad (2)$$
where $g_1$ and $g_2$ are the vectorized representations of graphs $G_1$ and $G_2$. The direction of a graph vector reflects the proximity of the original graphs. Compared with the standard Euclidean distance in the embedding space, cosine similarity is scale-invariant. In addition, since the range of cosine similarity is bounded, it is comparable with the similarity value given by the WL kernel. Thus we can supervise the training of the embedding function with the difference between the cosine similarity value and the WL kernel value.
Graph encoder. We employ a graph encoder to map discrete graph structures to a continuous graph vector space. We extract adjacency matrix from the graph as the most important feature for mapping, denoted as for a graph with n nodes. The graph encoder learns a mapping function , which takes in the adjacency matrix and projects the graph into a dimension real space. As graph similarities are always measured pairwise, the encoder is trained and evaluated on a large number of graph pairs (,
). To formulate a meaningful mapping to represent topological graph structure in a continuous space, the cosine similarity of the encoded graph vectors shall represent the similarity of the corresponding original graph in the original discrete topological space. To satisfy the above training objectives, we define the loss function with respect to a pair of input graph: (
, ), followed by the corresponding adjacency matrix (, ) as:(3) 
Therefore, the optimal mapping function needs to satisfy the condition below:
(4) 
Following common practice in autoencoders Hinton and Zemel (1994), we adopt a fully-connected feed-forward network to model the target mapping function. The encoder is trained independently before the search procedure. We randomly generate a number of graph structure pairs with the same number of nodes and apply the WL kernel to provide ground-truth labels.
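To make the training signal concrete, the sketch below evaluates a pairwise loss for one graph pair. Several pieces are assumptions for illustration only: the encoder is reduced to a single tanh layer with random weights, the loss is taken to be the squared difference between the cosine similarity and the WL kernel value, and the names `init_encoder`, `encode`, and `pair_loss` are hypothetical. The gradient step that would actually update the weights is omitted.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def init_encoder(num_nodes, dim, seed=0):
    """Random weights for a one-layer stand-in encoder (dim x num_nodes^2)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(num_nodes * num_nodes)] for _ in range(dim)]


def encode(weights, adj):
    """Flatten the adjacency matrix and apply one tanh layer to obtain a graph vector."""
    flat = [float(x) for row in adj for x in row]
    return [math.tanh(sum(w * x for w, x in zip(row, flat))) for row in weights]


def pair_loss(weights, adj_a, adj_b, wl_value):
    """Assumed loss form: squared gap between cosine similarity and the WL target."""
    diff = cosine_similarity(encode(weights, adj_a), encode(weights, adj_b)) - wl_value
    return diff * diff


weights = init_encoder(num_nodes=3, dim=4)
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
dense = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
loss = pair_loss(weights, chain, dense, wl_value=0.15)  # 0.15 is a made-up target
```

In a full training loop, the target `wl_value` would come from the WL kernel on each sampled pair, and the loss would be averaged over many pairs before each weight update.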
3.3 Optimal Cell Exploration
Like most previous works Liu et al. (2019); Pham et al. (2018), we map DAGs to cell structures, which are used as the building blocks of DNNs. Each node in the DAG stands for a valid DNN operation (e.g., 1×1 convolution, 3×3 separable convolution, etc.), and each edge stands for the flow of tensors from one node to another. When mapping cell structures to DNN architectures, nodes with no input connections (i.e., zero in-degree) are dropped, while a concatenation operation is inserted at each node with in-degree larger than one. The output of a building block is constructed from the leaf nodes with zero out-degree: concatenating these leaf nodes along the last dimension gives the output feature maps.
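The mapping rules above can be illustrated with a small planning function over an adjacency matrix. This is a sketch under stated assumptions rather than the paper's code: node 0 is treated as the cell input and is always kept, dropping a node is applied repeatedly so that nodes fed only by dropped nodes are removed as well, and the name `plan_cell` is hypothetical.

```python
def plan_cell(adj):
    """Decide which DAG nodes survive, which need a concat, and which form the output.

    adj[i][j] == 1 means a tensor flows from node i to node j.
    """
    kept = set(range(len(adj)))
    changed = True
    while changed:
        changed = False
        for node in sorted(kept):
            # Assumption: node 0 is the cell input, so it is exempt from dropping.
            if node != 0 and not any(adj[src][node] for src in kept):
                kept.discard(node)
                changed = True
    order = sorted(kept)
    inputs = {node: [src for src in order if adj[src][node]] for node in order}
    concat_nodes = [node for node in order if len(inputs[node]) > 1]
    output_nodes = [node for node in order if not any(adj[node][dst] for dst in order)]
    return {"kept": order, "inputs": inputs, "concat": concat_nodes, "outputs": output_nodes}


# Edges: 0->1, 0->3, 1->3, 2->4, 3->4. Node 2 has no inputs and is dropped.
example = [
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
plan = plan_cell(example)
```

Here node 3 receives tensors from nodes 0 and 1 and therefore gets a concatenation, while node 4 is the only leaf and provides the cell output.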
Performance evaluation of graph vectors. To evaluate the DNN architectures formed from the cell structures that graph vectors represent, we use an efficiency score instead of accuracy as our search metric, so that both performance and efficiency are taken into consideration. The efficiency score for a candidate graph is formulated as:
(5) 
where is the DNN constructed with the cell represented by . is the validation accuracy on proxy dataset. is the number of MultiplyAdd operations of measured in Millions. is a penalty coefficient. We use MAC as the penalty term since it can be precisely measured across search iterations. Most compact models Howard et al. (2017); Sandler et al. (2018); Tan et al. (2019) have hundred millions of MACs while large models Szegedy et al. (2015) can have billions of MACs. Here we choose to control the MAC penalty in the same magnitude with accuracy for both small and large models. This penalty function can urge the search iteration towards improving the performance of small models or deflating the complexity of large models, and thus strike a balance between complexity and performance.
Efficiency score predictor. The efficiency score predictor maps a graph vector to a real value that indicates the performance of the DNN architecture built from this graph vector. For a candidate neural architecture, we use its graph vector as the feature and its efficiency score computed on the proxy dataset as the label for training the efficiency score predictor.
For simplicity, we adopt a two-layer neural network as the backbone of the efficiency score predictor, which gives a differentiable function that maps an architecture to its predicted efficiency score. The predictor is trained iteratively during the search process. We maintain a set of pairs, each consisting of a graph vector and its efficiency score measured on the proxy dataset. In each iteration, we add the currently selected graph vector/score pair to the set and train the predictor on the enlarged set. Algorithm 1 provides an overview of training the efficiency score predictor.
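The iterative update described here can be sketched as a short loop. The example below is only an illustration under several assumptions: a linear model fitted by gradient descent stands in for the two-layer network, the next graph vector is chosen as the best-predicted candidate from a small random pool (the selection rule is not specified in the text above), and `fit_linear`, `predict`, `run_predictor_search`, and `measure_score` are hypothetical names.

```python
import random


def fit_linear(samples, dim, epochs=200, lr=0.05):
    """Least-squares fit by per-sample gradient descent (stand-in for the predictor)."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(model, x):
    """Predicted efficiency score of graph vector x under the linear stand-in."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def run_predictor_search(measure_score, candidates, iterations, dim, seed=0):
    """Grow the (graph vector, score) set one pair at a time and refit each round."""
    rng = random.Random(seed)
    history = []                       # the maintained set of (vector, score) pairs
    model = ([0.0] * dim, 0.0)
    for _ in range(iterations):
        pool = rng.sample(candidates, k=min(8, len(candidates)))
        chosen = max(pool, key=lambda g: predict(model, g))  # assumed selection rule
        history.append((chosen, measure_score(chosen)))      # evaluate on the proxy task
        model = fit_linear(history, dim)                      # retrain on the enlarged set
    return model, history


rng = random.Random(1)
vectors = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(30)]
toy_score = lambda g: 2 * g[0] - g[1]   # toy stand-in for a measured efficiency score
model, history = run_predictor_search(toy_score, vectors, iterations=5, dim=2)
```

In the real search, `measure_score` would train the candidate network on the proxy dataset, which is the expensive step that the predictor is meant to amortize.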
The predictor becomes more accurate as the efficiency scores of new samples are used for fine-tuning. More importantly, the predictor in NASGEM is built on top of a smoother differentiable space, which is constructed through the graph embedding of unrestricted DAGs. Such accurate prediction makes it possible to explore a wider search space (30 nodes) at an extremely small search cost (0.4 GPU days).
Bootstrap Optimization. After the predictor is trained using a large number of neural architectures and their corresponding efficiency scores, our goal of finding the optimal cell structure in the topological graph space is equivalent to finding the graph vector that has the highest score according to the efficiency score predictor, formulated as:
$$g^{*} = \arg\max_{g = E(A)} P(g) \qquad (6)$$
where $g = E(A)$ is the embedded graph vector obtained by passing the adjacency matrix $A$ into the pretrained graph encoder $E$, and $P$ is the efficiency score predictor that estimates the efficiency score of a given cell from its graph vector.
NASGEM explores a continuous immense search space consisting of hypercomplex families of cell structures. For cells with nodes, the search space is , e.g., if , there are possible topologies to explore. For efficient exploration, we introduce an estimation method based on two empirical beliefs: (1) optimal cell structures within the search space is not unique as various architectures of the similar isomorphism can yield equally competitive results. (2) Finding the optimal graph vector in the continuous search space and then decoding may not discover a valid architecture, as the mapping from graph vectors to discrete topological cell structures may not be injective.
Bootstrap optimization addresses the above issues by sampling cell structures with replacement from a large sample space $S$ and picking the best one as our post-search approximation. With the pretrained graph encoder $E$, we randomly sample cell structures $A$ from the sample space and approximate the best candidate cell structure by predicting the efficiency score with the efficiency score predictor $P$:
$$A^{*} \approx \arg\max_{A \in S} P(E(A)) \qquad (7)$$
Bootstrap optimization serves a similar function to the decoder in NAO Luo et al. (2018). However, a decoder is not always feasible when mapping the optimal embedding back to an adjacency matrix. Since NASGEM does not rely on a decoder, the performance of the optimal cell structure found by NASGEM is sensitive to the size of the sample space. Empirically, a sample size of 50,000 is sufficient to obtain an optimal cell structure.
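The selection step of bootstrap optimization amounts to scoring sampled cells and keeping the best one. The following is a minimal sketch of that idea, assuming the encoder and predictor are available as plain callables; the name `bootstrap_optimize` and the toy encoder and predictor in the usage lines are illustrative and not taken from the paper.

```python
import random


def bootstrap_optimize(sample_space, encode, predict, num_samples, seed=0):
    """Sample cells with replacement and return the one with the highest predicted score."""
    rng = random.Random(seed)
    best_cell, best_score = None, float("-inf")
    for _ in range(num_samples):
        cell = rng.choice(sample_space)        # sampling with replacement
        score = predict(encode(cell))          # encoder, then efficiency score predictor
        if score > best_score:
            best_cell, best_score = cell, score
    return best_cell, best_score


# Toy usage: integers stand in for cells, and the predictor peaks at 7.
space = list(range(10))
best, score = bootstrap_optimize(space, encode=lambda a: a, predict=lambda g: -(g - 7) ** 2, num_samples=1000)
```

Because every returned cell is drawn from the sample space, the result is always a valid architecture, which is the property a learned decoder cannot guarantee.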
4 Experimental Evaluation
We apply NASGEM to search and evaluate efficient mobile neural architectures for different computation regimes on image classification tasks. We initiate the architecture search with a complete DAG so that a rich source of cells can be constructed. Each node can choose either 1×1 convolution or 3×3 depthwise separable convolution as its candidate DNN operation, and each convolution operation adopts a Convolution-BatchNorm-ReLU triplet. We use a portion of the CIFAR-10 dataset as the proxy dataset for training the efficiency score predictor. It is worth noting that enabling a proxyless Cai et al. (2019) setting on the ImageNet-1K dataset may yield better cells, but the search cost would increase drastically.
4.1 Learning Dynamics of NASGEM
We illustrate the learning dynamics of NASGEM by plotting the efficiency score surface with respect to latent vectors in the graph embedding space. The efficiency score surface is obtained by sampling 50,000 adjacency matrices, mapping them to the continuous embedding space, and passing them through the efficiency score predictor.
Fig. 3 illustrates the efficiency score surfaces of efficiency score predictors with and without graph embedding as the latent vector. Under the guidance of graph kernel embedding, the efficiency score surface is smoother, which allows a local optimum to be found efficiently. This optimum can be interpreted as the embedding of the optimal cell structure in the continuous embedding space. In contrast, without graph kernel guided embedding, the efficiency score surface is not smooth, which makes the optimization process more difficult and less efficient.
4.2 ImageNet Results
Table 1 summarizes the key performance metrics on the ImageNet dataset. For a fair comparison, we compare with the most relevant NAS works that adopt a differentiable method under similar search cost and search space (e.g., without a predefined MobileNetV2 backbone). Note that our reported search cost includes all phases: encoder training, searching, and bootstrap optimization. GEMNet-A is 1.4% more accurate than models crafted using another differentiable method, DARTS Liu et al. (2019), while having 8% fewer parameters and 19.3% fewer MACs. Compared with NAONet, GEMNet-A achieves higher top-1 accuracy with 62% fewer parameters and 20.7% fewer MACs.
Compared with other NAS models, both GEMNet-A and GEMNet-B have higher accuracy but significantly fewer parameters and MACs. It is worth noting that although NAS models with more MACs achieve higher accuracy, they are difficult to deploy on edge and mobile devices because of scarce computational resources and energy-efficiency considerations. In addition, because GEMNet does not require two inputs from previous cells as in DARTS Liu et al. (2019) and NAO Luo et al. (2018), it is a preferable option for mobile applications.
4.3 Evaluation on NAS-Bench-101
To facilitate the reproducibility of architecture search and the evaluation of search strategies, NAS-Bench-101 Ying et al. (2019) was introduced to provide abundant results for various neural architectures (about 423k) in a large number of search spaces. To fairly validate the effectiveness of NASGEM, we further evaluate our search on NAS-Bench-101. We randomly sample a fixed number of candidate architectures (100 to 2,000) from NAS-Bench-101 within a fixed operation list. Then, we train the efficiency score predictor on these sampled architectures, with and without our proposed graph embedding. Finally, we select the best architecture through bootstrap optimization, using the trained efficiency score predictor to evaluate a total of 50k architectures and picking the best one.
We measure the dispersion between NASGEM and NAS-Bench-101 by the global prediction bias, which is defined in terms of the global accuracy (the best accuracy in the whole search space given in NAS-Bench-101) and the predicted accuracy (the accuracy achieved by NASGEM). When the predicted accuracy is very close to the global accuracy, the global prediction bias approaches zero. Therefore, a smaller bias reflects a more accurate prediction.
Topology-aware NASGEM search with graph embedding predicts the global accuracy better than topology-agnostic search methods, as the embedded graph vectors contain extra information from the topological space. As shown in Fig. 4, with the guidance of graph kernel embedding, there is around a 0.2% gain in predicting the global accuracy on the NAS-Bench-101 dataset. Such topological information can facilitate the training of the efficiency score predictor when data is insufficient. We also observe that the smaller search space in NAS-Bench-101 facilitates the training of the efficiency score predictor. However, when expanding to a DAG with 30 nodes, only a very small portion of the architectures within the search space can be sampled to build the efficiency score predictor, which leaves insufficient examples for methods that use raw adjacency matrices. Thus, graph embedding is a very powerful mechanism for boosting the performance of the efficiency score predictor. NASGEM also produces more stable results, as the structural knowledge represented by graph embedding generalizes better than binary adjacency matrices. Thus, the performance of NASGEM is less sensitive to the number of evaluated samples in the search space, leading to a better trade-off between search cost and performance.
5 Conclusion
In this work, we propose NASGEM, a differentiable neural architecture search method based on graph embedding. NASGEM is the first differentiable NAS method of its kind that can search for an efficient neural architecture cell topology in an unrestricted wide search space (e.g., 30 nodes in a cell). NASGEM tackles the limitations of graph topology exploration in existing differentiable search methods with the following ideas: (i) construct a topologically meaningful differentiable space through WL kernel guided graph embedding; (ii) employ an efficiency score predictor to precisely model the relationship between neural architectures and performance; (iii) use bootstrap optimization to explore the optimal cell structure. These components work together to enable NASGEM to find a more efficient neural architecture in an unrestricted wide search space within a short time. Compared with neural architectures produced by existing differentiable methods, GEMNet crafted by NASGEM consistently achieves higher top-1 accuracy while saving up to 62% of parameters and 20.7% of MACs. NASGEM is highly adaptable to different search spaces. Given a similar number of candidate nodes in the search space as RandWire Xie et al. (2019), NASGEM provides 0.5% higher accuracy with a 19% reduction in the number of parameters. On NAS-Bench-101, the predictor with our proposed graph embedding achieves more precise and stable prediction than the version without graph embedding.
References
[1] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In Proceedings of the International Conference on Learning Representations. Cited by: §4.
[2] (2016) Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §2.
[3] (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303. Cited by: §1, §1, §2.
[4] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3844–3852. Cited by: §2.
[5] (2019) Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761–1770. Cited by: Table 1.
[6] (2019) Efficient multi-objective neural architecture search via lamarckian evolution. In Proceedings of the International Conference on Learning Representations. Cited by: §2.
[7] (2018) Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78–94. Cited by: §2.
[8] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.
[9] (2020) Milenas: efficient neural architecture search via mixed-level reformulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: Table 1.
[10] (1994) Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3–10. Cited by: §3.2.
[11] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §3.3.
[12] (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §1, §2.
[13] (2016) Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations. Cited by: §2.
[14] (2019) Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638. Cited by: §2.
[15] (2020) Neural graph embedding for neural architecture search. In The AAAI Conference on Artificial Intelligence. Cited by: §2, Table 1.
[16] (2019) Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §1, §1, §2.
[17] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §2.
[18] (2019) DARTS: differentiable architecture search. In Proceedings of the International Conference on Learning Representations. Cited by: §1, §1, §1, §2, §3.3, §4.2, §4.2, Table 1.
[19] (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems, pp. 7827–7838. Cited by: §1, §1, §1, §2, §3.3, §4.2, Table 1.
[20] (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp. 701–710. ISBN 9781450329569. Cited by: §2.
[21] (2018) Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pp. 4092–4101. Cited by: §2, §3.3.
[22] (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1.
[23] (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2902–2911. Cited by: §2.
[24] (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §2.
[25] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2, §3.3.
[26] (2011) Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §3.1, §3.2, §3.2.
[27] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §3.3.
[28] (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §3.3.
[29] (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §2.
[30] (2019) Discovering neural wirings. arXiv preprint arXiv:1906.00586. Cited by: §1.
[31] (2019) Exploring randomly wired neural networks for image recognition. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §1, §1, §2, Table 1, §5.
[32] (2019) SNAS: stochastic neural architecture search. In International Conference on Learning Representations. Cited by: Table 1.
[33] (2019) PC-DARTS: partial channel connections for memory-efficient architecture search. In International Conference on Learning Representations. Cited by: §2, Table 1.
[34] (2019) NAS-Bench-101: towards reproducible neural architecture search. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 7105–7114. Cited by: §1, §4.3.
[35] (2019) BayesNAS: a bayesian approach for neural architecture search. In ICML. Cited by: §2, Table 1.
[36] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §1.