NASGEM: Neural Architecture Search via Graph Embedding Method

Neural Architecture Search (NAS) automates and advances the design of neural networks. Recent studies show that mapping the discrete neural architecture search space into a continuous space that is more compact, more representative, and easier to optimize can significantly reduce the exploration cost. However, existing differentiable methods cannot preserve the graph information when projecting a neural architecture into a continuous space, causing inaccuracy and/or reduced representation capability in the mapped space. Moreover, existing methods can explore only a very limited inner-cell search space due to limitations of the cell representation or poor scalability. To enable quick search of more sophisticated neural architectures while preserving graph information, we propose NASGEM, which stands for Neural Architecture Search via Graph Embedding Method. NASGEM is driven by a novel graph embedding method integrated with similarity estimation to capture the inner-cell information in the discrete space. Thus, NASGEM is able to search a wider space (e.g., 30 nodes in a cell). By precisely estimating the graph distance, NASGEM can efficiently explore a large number of candidate cells to enable a more flexible cell design while still keeping the search cost low. GEMNet, a set of networks discovered by NASGEM, achieves higher accuracy with fewer parameters (up to 62% fewer) than networks crafted by existing differentiable search methods. Our ablation study on NASBench-101 further validates the effectiveness of the proposed graph embedding method, which is complementary to many existing NAS approaches and can be combined with them to achieve better performance.


1 Introduction

Neural architecture search (NAS) Zoph et al. (2018) automates the neural architecture design process. In general, NAS approaches can be divided into the following categories: reinforcement learning based Zoph et al. (2018), evolutionary algorithm guided Real et al. (2019), Bayesian optimization driven Kandasamy et al. (2018), graph generator based Xie et al. (2019), and differentiable Liu et al. (2019); Luo et al. (2018); Liang et al. (2019); Chen et al. (2019) methods. Given that neural architecture design choices are discrete, the approaches in the first four categories search directly over the discrete neural architecture space. However, these methods must evaluate a large number of candidate neural architectures in a high-dimensional discrete sampling space and thus suffer from high search time and computational cost.

Gradient-based differentiable methods such as NAO Luo et al. (2018) and DARTS Liu et al. (2019); Liang et al. (2019); Chen et al. (2019) were proposed to reduce the high search time and computational cost. Instead of searching directly in the discrete architecture space, differentiable methods map the architecture into a continuous space and optimize over the projected continuous space. The advantages are two-fold. First, the projected low-dimensional continuous space is more representative, which makes gradient-based optimization easier to apply; second, optimizing in a continuous space is more efficient because the simplified space removes redundant, irrelevant information. Specifically, DARTS Liu et al. (2019) makes the optimization objective continuous by encoding the node operations as a weighted combination. NAO Luo et al. (2018) employs an encoder-decoder structure integrated with an accuracy predictor. Neural architectures are first encoded into a continuous latent space with LSTM cells. The predictor then takes the continuous representation and predicts its accuracy. Finally, the decoder maps the continuous representation back to a neural architecture.

Recent works Xie et al. (2019); Wortsman et al. (2019) demonstrate the importance of graph topology in neural architecture design. However, existing differentiable methods Liu et al. (2019); Luo et al. (2018) cannot precisely preserve graph topology when mapping the architecture into a continuous space. This may result in an inaccurate representation and/or reduced representation capacity of the projected continuous latent space Luo et al. (2018); e.g., the graph distance in the latent space may not appropriately reflect the graph distance in the discrete space, and thus the optimal cell found in the latent space may not be optimal in the discrete graph space. In addition, existing differentiable methods do not support flexible inner-cell search. For example, NAO Luo et al. (2018) restricts a cell to have two input nodes and one output node. NAO also fixes the number of tensor aggregations in a cell. Such restrictions severely limit the representation capacity of the found cell.

We propose NASGEM (Neural Architecture Search via Graph Embedding Method) to overcome the above limitations of existing differentiable methods. By introducing a graph kernel, a similarity measure, and efficiency prediction, NASGEM carefully encodes graphs into a latent space and enables searching for cells with high representation capacity. Our contributions are as follows: 1) NASGEM preserves the network structure when constructing a graphically meaningful latent space for neural architecture search. 2) NASGEM employs an efficiency score predictor to model the relationship between a cell structure and its performance. With the pretrained graph embedding, our predictor can accurately estimate model performance from the graph-vector representation of cell structures. 3) The exploration of optimal cell structures is further improved by bootstrap optimization, which guarantees the feasibility of graph vectors in the latent embedding space. 4) GEMNet performs better than models obtained by other differentiable methods, thanks to the more efficient cell design offered by NASGEM. Specifically, GEMNet achieves 62% and 8% parameter reductions and 20.7% and 19.3% Multiply-Accumulate (MAC) reductions compared to NAO and DARTS, respectively. Evaluation using NASBench-101 Ying et al. (2019) further verifies the effectiveness of graph embedding.

2 Related Work

Graph Embedding. Graph embedding Goyal and Ferrara (2018); Grover and Leskovec (2016) maps a topological graph structure into a continuous latent space. Traditional vertex graph embedding maps each node to a low-dimensional feature vector while preserving the connection relationships between vertices. Factorization-based methods Roweis and Saul (2000), random-walk based methods Perozzi et al. (2014), and deep-learning based methods Wang et al. (2016) are the popular approaches in traditional vertex graph embedding. However, it is challenging to apply existing graph embedding methods to DNNs for the following two reasons: (1) similarities among cell structures of deep neural networks (DNNs) cannot be explicitly derived from traditional graph embeddings; (2) sophisticated deep-learning based methods like DNGR Cao et al. (2016) and GCN Kipf and Welling (2016); Defferrard et al. (2016); Li et al. (2020) require complex mechanisms for training on structural data. NASGEM addresses cell structure similarity by computing the cosine similarity of two embedded graph vectors. It also facilitates training of the graph embedding by employing an encoder structure.

Neural Architecture Search. Some studies adopt a cell-based approach that builds on existing hand-crafted architecture motifs (e.g., MobileNetV2 Sandler et al. (2018)) and focus on the optimization algorithm, such as random Li and Talwalkar (2019), evolutionary Real et al. (2017); Elsken et al. (2019), progressive Liu et al. (2018), gradient-based Liu et al. (2019); Pham et al. (2018), and Bayesian optimization Kandasamy et al. (2018); Zhou et al. (2019) search. However, as the design space for the cell structures is inherently unchanged, the resulting architectures remain limited in capacity. Recent works Luo et al. (2018); Liu et al. (2019) define a differentiable search space and utilize gradient-based methods to optimize the architecture representation. However, such methods usually lack topological consideration. For example, DARTS Liu et al. (2019) and its variants Chen et al. (2019); Xu et al. (2019) consider only the optimal combination of operations as their optimization goal. Although NAO Luo et al. (2018) considers the optimization process within the graph embedding space, strong constraints are applied to the search space. First, NAO uses only 5 nodes when constructing the search space for cell structures, which is much smaller than the 30 nodes in NASGEM. In addition, NAO restricts every node to connect with at most two previous nodes. In comparison, NASGEM allows the exploration of neural architectures with more freedom. It is worth noting that several techniques have been proposed to further improve differentiable search, such as early stopping and partial channel connections Liang et al. (2019); Xu et al. (2019). These improvements are orthogonal to the key design of NASGEM and can be combined with our approach for better performance, which we leave as future work.

Network Wiring Exploration. Recently, neural architectures with sophisticated wiring have superseded traditional chain-like neural architectures at similar computation cost Xie et al. (2019). Random wiring Xie et al. (2019) shows that using random graphs as priors when generating neural architectures achieves a significant performance boost on various tasks. However, unconstrained random wiring may suffer from the following problems: (1) Simply tuning hyperparameters for random prior generation may inevitably constrain the search space that could otherwise be explored (e.g., by NASGEM). This limitation on the search space may reduce the quality of the cell structures explored by random wiring. (2) Unconstrained aggregation of tensors leads to high memory consumption within wired neural architectures. NASGEM solves this problem by constraining the maximum number of concatenations within each cell structure during bootstrap optimization.

3 Methodology

Figure 1: Workflow of NASGEM. (a) Encoder Training: the encoder learns to map graphs into a continuous embedding space. (b) Searching: within the constructed embedding space, NASGEM builds an efficiency score predictor to learn the relationship between neural network topologies and their estimated scores. (c) Bootstrap Optimization: we utilize the trained efficiency score predictor to identify the best cell topology from a large number of sampled graphs derived from the supernet.

3.1 An Overview of NASGEM

NASGEM applies a differentiable method to conduct efficient architecture search based on a continuous vectorized representation of graphs. NASGEM represents DNN architectures as a combination of Directed Acyclic Graphs (DAGs) and targets the optimization of the inner cell structures. NASGEM has three main components: graph embedding, efficiency score predictor, and bootstrap optimization.

NASGEM employs graph embedding to project candidate graphs into a latent space. Specifically, we introduce a graph encoder to project the graphs from a discrete space into a continuous latent space. Under the supervision of a graph kernel, our mapping has physical meanings in a topology space, i.e., the distance of graph vectors in the embedding space reflects the similarity of graph structures. To model the relationship between the graph and the performance of its derived DNN architecture, an efficiency score predictor is adopted and trained through example graphs along the search iterations. We employ bootstrap optimization to explore the optimal cell with the efficiency score predictor.

Fig. 1 depicts the 3-step workflow of NASGEM. In the first step, we construct a graph encoder to efficiently derive graph vectors from candidate graphs. The encoder is trained for representing graphs with the supervision of Weisfeiler-Lehman (WL) kernel Shervashidze et al. (2011). A large number of randomly sampled graph pairs from the supernet are adopted for training the graph encoder as shown in Fig. 1(a). In the second step, we utilize the pretrained graph encoder and conduct architecture search to explore the relationship between graph vectors and their corresponding performance on a small proxy dataset. An efficiency score predictor is introduced and updated during the exploration process as shown in Fig. 1(b). Finally, we obtain the target optimal cell structure by applying bootstrap optimization within a large sample space. The cell structure with the highest score given by the efficiency score predictor is adopted as the optimal cell structure of our NASGEM search as shown in Fig. 1(c).

3.2 Graph Embedding

Figure 2: Our goal is to construct an embedding space such that the distance between graphs in the embedding space reflects their graph similarity. Graphs that are more similar (e.g., G3 and G4) lie closer together in the embedding space.

We adopt graph embedding to model the similarity of different vectorized representations of DNN cell structures. Specifically, we use a graph vector to represent the topological graph structure in a continuous space, capturing the node information (i.e., DNN operations) and the neighboring connections (i.e., the distribution of tensors).

To connect graph representations in the discrete space and the continuous space, a graph encoder is introduced to map discrete graph representations (i.e., adjacency matrices) to continuous vectors. Graph theory offers various approaches for measuring the similarity of graphs in the discrete topological space, such as the WL kernel Shervashidze et al. (2011). In the continuous graph embedding space, cosine similarity, which lies within the range $[-1, 1]$, is widely used to measure the similarity of graph vectors. The encoded vectors aim to preserve graph similarity, measured by cosine similarity in the continuous space, as shown in Fig. 2. Therefore, we train the graph encoder with the objective of minimizing the difference between the WL kernel value in the discrete graph space and the cosine similarity value in the continuous space.

Weisfeiler-Lehman (WL) Kernel. We adopt WL kernel Shervashidze et al. (2011) as a similarity measure of two input graphs as follows:


where are the adjacency matrices for graph , . denotes the iteration times of computing WL kernel. For graphs with nodes, the WL kernel can be computed with iterations in . Compared with other graph similarity metrics, WL kernel is able to measure large and complex graphs with relatively low computational complexity.

The procedure for measuring the similarity between two input graphs is as follows. To start, each node is labeled with its degree. In each iteration, the label of every node is extended with the sorted labels of its neighbors, forming a long string label. A lookup table is constructed to map each long string label to a unique integer; the lookup table is filled by assigning integers, in order, to each distinct label of the labeled graphs. Each graph is then represented by a vector that counts the occurrences of each compressed label. The kernel value is obtained by computing the dot product of the corresponding vectors of the two graphs. Finally, the kernel value is used as the measurement of graph similarity.
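The labeling procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `wl_features` and `wl_kernel`, the choice to ignore edge direction when collecting neighbors, the default of three iterations, and the normalization of the kernel value are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def wl_features(adj, iterations, table):
    """Count WL labels of one graph; `table` is a lookup table shared by both graphs."""
    n = len(adj)
    # Assumption: edge direction is ignored when collecting neighbours.
    nbrs = [[j for j in range(n) if adj[i][j] or adj[j][i]] for i in range(n)]
    # Iteration 0: every node is labelled with its degree.
    labels = [table.setdefault(("deg", len(nbrs[i])), len(table)) for i in range(n)]
    counts = Counter(labels)
    for _ in range(iterations):
        # Long label = own label followed by the sorted labels of the neighbours.
        long_labels = [
            (labels[i], tuple(sorted(labels[j] for j in nbrs[i]))) for i in range(n)
        ]
        # Compress every distinct long label to a unique integer.
        labels = [table.setdefault(ll, len(table)) for ll in long_labels]
        counts.update(labels)
    return counts


def wl_kernel(adj1, adj2, iterations=3):
    """Normalised WL subtree kernel between two adjacency matrices."""
    table = {}
    f1 = wl_features(adj1, iterations, table)
    f2 = wl_features(adj2, iterations, table)

    def dot(a, b):
        # A Counter returns 0 for labels that never occurred.
        return sum(a[k] * b[k] for k in a)

    # Assumption: normalise so that identical graphs score exactly 1.
    return dot(f1, f2) / sqrt(dot(f1, f1) * dot(f2, f2))


chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
triangle = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(wl_kernel(chain, chain))     # identical graphs
print(wl_kernel(chain, triangle))  # partially overlapping label counts
```

Sharing one lookup table across both graphs is what makes the two count vectors comparable, since the same integer always denotes the same neighborhood pattern.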

Similarity measurement of graph vectors. We adopt cosine similarity to measure the similarity of the vectorized graph representations (i.e., graph vectors). The cosine similarity measurement on the embedding space is defined as:


$$\cos(g_1, g_2) = \frac{g_1 \cdot g_2}{\lVert g_1 \rVert \, \lVert g_2 \rVert}$$

where $g_1$ and $g_2$ are the vectorized representations of graphs $G_1$ and $G_2$. The direction of a vector reflects the proximity of the original graphs. Compared with the standard Euclidean distance, cosine similarity is scale-invariant. In addition, since the range of cosine similarity is bounded, it is comparable with the similarity value given by the WL kernel. Thus we can supervise the training of the embedding function with the difference between the cosine similarity value and the WL kernel value.
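The scale invariance mentioned here is easy to check numerically. The snippet below is a small sketch with illustrative names (`cosine_similarity`, `euclidean_distance`) and made-up vectors; it is not taken from the paper.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]  # same direction as g1, twice the length

print(cosine_similarity(g1, g2))   # close to 1: the direction is unchanged
print(euclidean_distance(g1, g2))  # clearly non-zero: the lengths differ
```

Rescaling a graph vector leaves its cosine similarity to every other vector unchanged, whereas its Euclidean distances all change.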

Input: a supernet with n nodes, a pretrained graph encoder, an efficiency score predictor P, search iterations T

  for i = 1, 2, …, T do
     randomly sample a subgraph from the supernet;
     map the sampled subgraph to a continuous graph vector by passing its adjacency matrix A through the graph encoder;
     build a candidate DNN architecture from the sampled subgraph;
     train the candidate on the proxy dataset and compute its efficiency score;
     update the predictor P using the embedded graph vector and the measured efficiency score;
  end for


Algorithm 1: Training the efficiency score predictor in NASGEM

Graph encoder. We employ a graph encoder to map discrete graph structures to a continuous graph vector space. We extract the adjacency matrix from the graph as the most important feature for mapping, denoted as $A$ for a graph with $n$ nodes. The graph encoder learns a mapping function that takes in the adjacency matrix and projects the graph into a fixed-dimensional real space. As graph similarities are always measured pairwise, the encoder is trained and evaluated on a large number of graph pairs ($G_1$, $G_2$). To formulate a meaningful mapping that represents topological graph structure in a continuous space, the cosine similarity of the encoded graph vectors shall represent the similarity of the corresponding original graphs in the original discrete topological space. To satisfy the above training objective, we define the loss function with respect to a pair of input graphs ($G_1$, $G_2$) with corresponding adjacency matrices ($A_1$, $A_2$) as:


Therefore, the optimal mapping function needs to satisfy the condition below:


Following common practice in auto-encoders Hinton and Zemel (1994), we adopt a fully-connected feedforward network to model the target mapping function. The encoder is trained independently before the search procedure. We randomly generate a number of graph structure pairs with the same number of nodes and apply the WL kernel to provide ground-truth labels.
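The training signal for one graph pair can be sketched as follows. The text only states that the encoder is trained to minimize the difference between the cosine similarity of two embeddings and the WL kernel value, so the squared form of the loss is an assumption here. The one-layer `make_encoder`, its random weights, the embedding size of 4, and the example kernel value 0.8 are hypothetical stand-ins chosen to keep the example self-contained.

```python
import random
from math import sqrt, tanh


def make_encoder(n_nodes, dim, seed=0):
    """Toy fully-connected encoder over the flattened adjacency matrix (untrained)."""
    rng = random.Random(seed)
    n_in = n_nodes * n_nodes
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(dim)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    def encode(adj):
        flat = [float(x) for row in adj for x in row]
        return [
            tanh(b + sum(w * x for w, x in zip(ws, flat)))
            for ws, b in zip(weights, biases)
        ]

    return encode


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / (norms + 1e-12)


def pair_loss(encode, adj1, adj2, kernel_value):
    """Assumed squared gap between embedding similarity and graph-kernel similarity."""
    return (cosine(encode(adj1), encode(adj2)) - kernel_value) ** 2


encode = make_encoder(n_nodes=3, dim=4)
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
fork = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
print(pair_loss(encode, chain, fork, kernel_value=0.8))  # 0.8 is a made-up label
```

In practice the encoder weights would be fitted by minimizing this loss over many sampled graph pairs with an automatic differentiation framework; the sketch only evaluates the loss.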

3.3 Optimal Cell Exploration

Like most previous works Liu et al. (2019); Pham et al. (2018), we map DAGs to cell structures, which are used as building blocks of DNNs. Each node in the DAG stands for a valid DNN operation (e.g., 1×1 convolution, 3×3 separable convolution, etc.), and each edge stands for the flow of tensors from one node to another. When mapping cell structures to DNN architectures, nodes with no input connections (i.e., zero in-degree) are dropped, while a concatenation operation is inserted in front of each node with in-degree larger than one. The output of a building block is constructed from the leaf nodes with zero out-degree: concatenating these leaf nodes along the last dimension gives the output feature maps.

Performance evaluation of graph vectors. To evaluate the performance of DNN architectures formed by the corresponding cell structures of graph vectors, we use an efficiency score instead of accuracy as our search metric so that both performance and efficiency are taken into consideration. The efficiency score for a candidate graph is formulated as:


In this formulation, the network is the DNN constructed with the cell represented by the candidate graph; the accuracy term is the validation accuracy on the proxy dataset; the MAC term is the number of multiply-accumulate (MAC) operations of the network, measured in millions; and a penalty coefficient weights the MAC term. We use MACs as the penalty term since they can be precisely measured across search iterations. Most compact models Howard et al. (2017); Sandler et al. (2018); Tan et al. (2019) have hundreds of millions of MACs, while large models Szegedy et al. (2015) can have billions of MACs. The penalty is chosen to keep the MAC penalty in the same order of magnitude as the accuracy for both small and large models. This penalty function steers the search towards improving the performance of small models or reducing the complexity of large models, and thus strikes a balance between complexity and performance.

Efficiency score predictor. The efficiency score predictor maps a graph vector to a real value that indicates the performance of DNN architectures built upon this graph vector. For a candidate neural architecture, we use its graph vector as features and the efficiency score computed on the proxy dataset as the label for training the efficiency score predictor.

For simplicity, we adopt a two-layer neural network as the backbone of the efficiency score predictor, which creates a differentiable function that maps architectures to their predicted efficiency scores. The predictor is trained iteratively during the search process. We maintain a set of pairs, each consisting of a graph vector and its efficiency score measured on the proxy dataset. In each iteration, we add the currently selected graph vector/score pair to the set and train the predictor with this enlarged set. Algorithm 1 provides an overview of training the efficiency score predictor.
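The loop in Algorithm 1 can be written compactly once the expensive steps are passed in as callables. The sketch below is a structural outline under stated assumptions: `train_predictor`, `random_dag`, and the four callables are hypothetical names, and the lambdas in the demo are cheap placeholders for the real encoder, proxy training, and two-layer predictor.

```python
import random


def train_predictor(sample_subgraph, encode, evaluate, fit, iterations):
    """Outline of Algorithm 1 over caller-supplied callables (all hypothetical)."""
    dataset = []  # growing set of (graph vector, efficiency score) pairs
    for _ in range(iterations):
        adj = sample_subgraph()   # random subgraph of the supernet
        vector = encode(adj)      # pretrained graph encoder
        score = evaluate(adj)     # build the DNN, train on proxy data, score it
        dataset.append((vector, score))
        fit(dataset)              # retrain the predictor on the enlarged set
    return dataset


def random_dag(n_nodes, rng, edge_prob=0.5):
    """Random upper-triangular adjacency matrix, i.e. a DAG on ordered nodes."""
    return [
        [1 if j > i and rng.random() < edge_prob else 0 for j in range(n_nodes)]
        for i in range(n_nodes)
    ]


rng = random.Random(0)
seen_sizes = []
data = train_predictor(
    sample_subgraph=lambda: random_dag(5, rng),
    encode=lambda adj: [float(sum(row)) for row in adj],  # placeholder embedding
    evaluate=lambda adj: -float(sum(map(sum, adj))),      # placeholder score
    fit=lambda ds: seen_sizes.append(len(ds)),            # placeholder training step
    iterations=4,
)
print(seen_sizes)  # [1, 2, 3, 4]: the training set grows by one pair per iteration
```

Keeping the full history and refitting on it, rather than updating on the newest pair alone, mirrors the enlarged-set training described in the text.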

The predictor becomes more accurate as the efficiency scores of new samples are used for fine-tuning. More importantly, the predictor in NASGEM is built on top of a smoother differentiable space, which is constructed through the graph embedding of unrestricted DAGs. Such accurate prediction allows NASGEM to explore a wider search space (30 nodes) at an extremely small search cost (0.4 GPU days).

Bootstrap Optimization. After the predictor is trained using a large number of neural architectures and their corresponding efficiency scores, our goal of finding the optimal cell structure in the topological graph space is equivalent to finding the graph vector that has the highest score according to the efficiency score predictor, formulated as:


Here, the embedded graph vector is obtained by passing the adjacency matrix A into the pretrained graph encoder, and P is the efficiency score predictor, which estimates the efficiency score of a given cell.

NASGEM explores an immense continuous search space consisting of hyper-complex families of cell structures, and the number of possible topologies grows rapidly with the number of nodes in a cell. For efficient exploration, we introduce an estimation method based on two empirical beliefs: (1) The optimal cell structure within the search space is not unique, as various architectures that are similar up to isomorphism can yield equally competitive results. (2) Finding the optimal graph vector in the continuous search space and then decoding it may not yield a valid architecture, as the mapping from graph vectors to discrete topological cell structures may not be injective.

Bootstrap optimization addresses the above issues by sampling cell structures with replacement from a large sample space S and picking the best one as our post-search approximation. With the pretrained graph encoder, we randomly sample cell structures from the sample space and approximate the best candidate cell structure by predicting the efficiency score with the efficiency score predictor P:


Bootstrap optimization serves a similar function to the decoder in NAO Luo et al. (2018). However, a decoder is not always feasible when mapping the optimal embedding back to an adjacency matrix. Since NASGEM does not rely on a decoder, the performance of the optimal cell structure found by NASGEM is sensitive to the size of the sample space. Empirically, a sample size of 50,000 is sufficient to obtain an optimal cell structure.
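Because no decoder is involved, bootstrap optimization reduces to scoring many sampled cells and keeping the best one. The following sketch assumes a uniform edge sampler over upper-triangular adjacency matrices; `sample_dag`, `bootstrap_optimize`, and the toy `encode` and `predict` functions in the demo are hypothetical stand-ins for the trained encoder and predictor.

```python
import random


def sample_dag(n_nodes, rng, edge_prob=0.5):
    """Random upper-triangular adjacency matrix (assumed sampling scheme)."""
    return [
        [1 if j > i and rng.random() < edge_prob else 0 for j in range(n_nodes)]
        for i in range(n_nodes)
    ]


def bootstrap_optimize(encode, predict, n_nodes, n_samples, seed=0):
    """Return the sampled cell with the highest predicted efficiency score."""
    rng = random.Random(seed)
    best_adj, best_score = None, float("-inf")
    for _ in range(n_samples):
        adj = sample_dag(n_nodes, rng)   # independent draws, so repeats are allowed
        score = predict(encode(adj))     # no proxy training needed at this stage
        if score > best_score:
            best_adj, best_score = adj, score
    return best_adj, best_score


# Toy stand-ins: embed a cell as its per-node out-degrees, score it by edge count.
toy_encode = lambda adj: [float(sum(row)) for row in adj]
toy_predict = lambda vec: sum(vec)

cell, score = bootstrap_optimize(toy_encode, toy_predict, n_nodes=4, n_samples=200)
print(score)
for row in cell:
    print(row)
```

Every returned candidate is a valid cell by construction, which is the feasibility property the text contrasts with decoding an arbitrary point of the latent space.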

4 Experimental Evaluation

We apply NASGEM to search and evaluate efficient mobile neural architectures for different computation regimes on image classification tasks. We initiate architecture search with a complete DAG of 30 nodes so that a rich source of cells can be constructed. Each node can choose either 1×1 convolution or 3×3 depthwise separable convolution as its candidate DNN operation, and each convolution operation adopts a Convolution-BatchNorm-ReLU triplet. We use a subset of the CIFAR-10 dataset as the proxy dataset for training the efficiency score predictor. It is worth noting that enabling the proxyless Cai et al. (2019) setting on the ImageNet-1K dataset may yield better cells, but the search cost would increase drastically.

(a) With graph kernel guided embedding.
(b) Without graph kernel guided embedding.
Figure 3: Efficiency score surface of the efficiency score predictor, based on DNN performance on the proxy dataset and computational cost in terms of MACs. We use PCA to project the graph vectors into a 2-dimensional space with axes x and y and plot the efficiency score of each graph vector as a heatmap (a) with graph kernel guided embedding and (b) without graph kernel guided embedding.

4.1 Learning Dynamics of NASGEM

| Architecture | N | Top-1 Test Err. (%) | Top-5 Test Err. (%) | #Params (M) | MACs (M) | Search Cost (GPU days) |
|---|---|---|---|---|---|---|
| NAONet Luo et al. (2018) | 5 | 25.7 | 8.2 | 11.35 | 584 | 200 |
| DARTS Liu et al. (2019) | 5 | 26.7 | 8.7 | 4.7 | 574 | 4.0 |
| SNAS Xie et al. (2019) | 5 | 27.3 | 9.2 | 4.3 | 522 | 1.5 |
| PC-DARTS Xu et al. (2019) | 5 | 25.1 | 7.8 | 5.3 | 586 | 0.1 |
| GDAS Dong and Yang (2019) | 5 | 26.0 | 8.5 | 5.3 | 581 | 0.21 |
| BayesNAS Zhou et al. (2019) | 5 | 26.5 | 8.9 | 3.9 | / | 0.2 |
| NGE Li et al. (2020) | 5 | 25.3 | 7.9 | 5.0 | 563 | 0.1 |
| MiLeNAS He et al. (2020) | 5 | 25.4 | 7.9 | 4.9 | 570 | 0.3 |
| RandWire-WS Xie et al. (2019) | 32 | 25.3±0.25 | 7.8±0.15 | 5.6 | 583 | / |
| GEMNet-A (Ours) | 30 | 25.3 | 7.9 | 4.3 | 463 | 0.4 |
| GEMNet-B (Ours) | 30 | 24.8 | 7.8 | 4.5 | 563 | 0.4 |

Table 1: ImageNet results with different computation budgets. For fair comparison, the input image resolution is fixed at 224×224. We sort different architectures based on MACs. N denotes the maximum number of DNN node operations explored within a cell structure. Note that the Pareto front of reducing MACs and the number of parameters (#Params) is not linear. Further reducing computation cost under a small computation regime is very challenging, as redundancy is already low.

We illustrate the learning dynamics of NASGEM by plotting the efficiency score surface with respect to latent vectors in the graph embedding space. The efficiency score surface is obtained by sampling 50,000 adjacency matrices, mapping them to the continuous embedding space, and passing them through the efficiency score predictor.

Fig. 3 illustrates the efficiency score surface of efficiency score predictors with and without using graph embedding as a latent vector. Under the guidance of graph kernel embedding, the efficiency score surface is smoother, which allows a local optimum to be found efficiently. This optimum can be interpreted as the embedding of the optimal cell structure in the continuous embedding space. In contrast, without graph kernel guided embedding, the efficiency score surface cannot be constructed smoothly, which makes the optimization process more difficult and less efficient.

4.2 ImageNet Results

Table 1 summarizes the key performance metrics on the ImageNet dataset. For a fair comparison, we compare with the most relevant NAS works that adopt a differentiable method under a similar search cost and search space (e.g., without a predefined MobileNetV2 backbone). Note that our reported search cost includes all phases: encoder training, searching, and bootstrap optimization. GEMNet-A is 1.4% more accurate than the model crafted by another differentiable method, DARTS Liu et al. (2019), while reducing parameters and MACs by 8% and 19.3%, respectively. Compared with NAONet, GEMNet-A achieves higher top-1 accuracy while reducing parameters and MACs by 62% and 20.7%, respectively.

Compared with other NAS models, both GEMNet-A and GEMNet-B have higher accuracy but significantly fewer parameters and MACs. It is worth noting that although NAS models with more MACs have higher accuracy, it is difficult to deploy them on edge and mobile devices due to scarce computational resources and energy efficiency considerations. In addition, as GEMNet does not require inputs from two previous cells as in DARTS Liu et al. (2019) and NAO Luo et al. (2018), it is a preferable option for mobile applications.

4.3 Evaluation on NASBench-101

To facilitate the reproducibility of architecture search and evaluate the selection of search strategy, NASBench-101 Ying et al. (2019) was introduced to provide abundant results for various neural architectures (about 423k) in a large number of search spaces. To fairly assess the effectiveness of NASGEM, we further corroborate our search on NASBench-101. We randomly sample a fixed number of candidate architectures (100 to 2000) from NASBench-101 within a fixed operation list. Then, we train the efficiency score predictor with and without our proposed graph embedding on these sampled architectures. Finally, we select the best architecture through bootstrap optimization, using the trained efficiency score predictor to evaluate a total of 50k architectures and picking the best one.

We measure the dispersion between NASGEM and NASBench-101 by the Global Prediction Bias, the gap between the global accuracy and the predicted accuracy. The global accuracy is the best accuracy in the whole search space given in NASBench-101. The predicted accuracy is the accuracy achieved by NASGEM. When the predicted accuracy is very close to the global accuracy, the global prediction bias approaches zero. Therefore, a smaller global prediction bias reflects a more accurate prediction.

Figure 4: Efficiency score predictor’s performance on NASBench-101 dataset with and without graph kernel embedding.

Topology-aware NASGEM search with graph embedding can achieve better prediction of global accuracy than topology-agnostic search methods, as the embedded graph vectors contain extra information from the topological space. As shown in Fig. 4, with the guidance of graph kernel embedding, there is around a 0.2% gain in predicting global accuracy on the NASBench-101 dataset. Such topological information can facilitate the training of the efficiency score predictor given insufficient data. We also observe that the smaller search space in NASBench-101 facilitates the training of the efficiency score predictor. However, when expanding to a DAG with 30 nodes, only a very small portion of architectures within this search space can be sampled to build the efficiency score predictor, which leads to insufficient examples for methods that utilize raw adjacency matrices. Thus, graph embedding is a very powerful mechanism for boosting the performance of the efficiency score predictor. NASGEM also produces more stable results, as structural knowledge represented by graph embedding generalizes better than binary adjacency matrices. Thus, the performance of NASGEM is less sensitive to the number of evaluated samples in the search space, leading to a better trade-off between search cost and performance.

5 Conclusion

In this work, we propose NASGEM, a differentiable neural architecture search method based on graph embedding. NASGEM is the first differentiable NAS method of its kind that can search an efficient neural architecture cell topology from an unrestricted wide search space (e.g., 30 nodes in a cell). NASGEM tackles the limitations of graph topology exploration in existing differentiable search methods with the following ideas: (i) construct a topologically meaningful differentiable space by WL kernel guided graph embedding; (ii) employ an efficiency score predictor to precisely model the relationship between neural architectures and performance; (iii) use bootstrap optimization to explore the optimal cell structure. All these components work together to enable NASGEM to search a more efficient neural architecture from an unrestricted wide search space within a short time. Compared with neural architectures produced by existing differentiable methods, GEMNet crafted by NASGEM consistently achieves higher top-1 accuracy while saving up to 62% of parameters and 20.7% of MACs. NASGEM is highly adaptable to different search spaces. Given a similar number of candidate nodes in the search space as RandWire Xie et al. (2019), NASGEM provides 0.5% higher accuracy with a 19% reduction in the number of parameters. When our proposed graph embedding is evaluated on NASBench-101, it achieves a more precise and stable prediction than the version without graph embedding.


  • [1] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In Proceedings of the International Conference on Learning Representations, Cited by: §4.
  • [2] S. Cao, W. Lu, and Q. Xu (2016) Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [3] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303. Cited by: §1, §1, §2.
  • [4] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3844–3852. Cited by: §2.
  • [5] X. Dong and Y. Yang (2019) Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761–1770. Cited by: Table 1.
  • [6] T. Elsken, J. H. Metzen, and F. Hutter (2019) Efficient multi-objective neural architecture search via lamarckian evolution. In Proceedings of the International Conference on Learning Representations, Cited by: §2.
  • [7] P. Goyal and E. Ferrara (2018) Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78–94. Cited by: §2.
  • [8] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.
  • [9] C. He, H. Ye, L. Shen, and T. Zhang (2020) Milenas: efficient neural architecture search via mixed-level reformulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 1.
  • [10] G. E. Hinton and R. S. Zemel (1994) Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3–10. Cited by: §3.2.
  • [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §3.3.
  • [12] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §1, §2.
  • [13] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Cited by: §2.
  • [14] L. Li and A. Talwalkar (2019) Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638. Cited by: §2.
  • [15] W. Li, S. Gong, and Z. Xiatian (2020) Neural graph embedding for neural architecture search. In The AAAI Conference on Artificial Intelligence, Cited by: §2, Table 1.
  • [16] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li (2019) Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §1, §1, §2.
  • [17] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §2.
  • [18] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §1, §1, §2, §3.3, §4.2, §4.2, Table 1.
  • [19] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems, pp. 7827–7838. Cited by: §1, §1, §1, §2, §3.3, §4.2, Table 1.
  • [20] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp. 701–710. ISBN 978-1-4503-2956-9. Cited by: §2.
  • [21] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pp. 4092–4101. Cited by: §2, §3.3.
  • [22] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1.
  • [23] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §2.
  • [24] S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. Cited by: §2.
  • [25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2, §3.3.
  • [26] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §3.1, §3.2, §3.2.
  • [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §3.3.
  • [28] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §3.3.
  • [29] D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §2.
  • [30] M. Wortsman, A. Farhadi, and M. Rastegari (2019) Discovering neural wirings. arXiv preprint arXiv:1906.00586. Cited by: §1.
  • [31] S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2, Table 1, §5.
  • [32] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In International Conference on Learning Representations, Cited by: Table 1.
  • [33] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong (2019) PC-darts: partial channel connections for memory-efficient architecture search. In International Conference on Learning Representations, Cited by: §2, Table 1.
  • [34] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter (2019) NAS-bench-101: towards reproducible neural architecture search. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 7105–7114. Cited by: §1, §4.3.
  • [35] H. Zhou, M. Yang, J. Wang, and W. Pan (2019) BayesNAS: a bayesian approach for neural architecture search. In ICML, Cited by: §2, Table 1.
  • [36] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §1.