I Introduction
Graph neural networks (GNNs) [battaglia2018relational, gori2005new] have been extensively researched in the past five years, and show promising results on various graph-based tasks, e.g., node classification [hamilton2017inductive, velivckovic2017graph, wang2019heterogeneous], recommendation [ying2018graph, wang2018billion, xiao2019beyond, li2020hierarchical, zheng2020price], fraud detection [zhang2020fakedetector], chemistry [gilmer2017neural], and travel data analysis [yang2018did]. In the literature, various GNN architectures [kipf2016semi, hamilton2017inductive, velivckovic2017graph, gao2018large, xu2018powerful, xu2018representation, liu2019geniepath] have been designed for different tasks, and most of these approaches rely on a neighborhood aggregation (or message passing) schema [gilmer2017neural] (see the example in Figure 1(a) and (b)). Despite the success of GNN models, they face a major challenge: no single model performs best on all tasks, and there is no single optimal model across datasets even for the same task (see the experimental results in Table VI). Thus, given a new task or dataset, huge computational and expert resources must be invested to find a good GNN architecture, which limits the application of GNN models. Moreover, existing GNN models do not make full use of best practices in architecture design. For example, existing GNN models tend to stack multiple layers with the same aggregation function to aggregate hidden features of neighbors; however, it remains to be seen whether different combinations of aggregation functions can further improve the performance. In short, it is left to be explored whether we can obtain data-specific GNN architectures beyond existing ones. We dub this problem the architecture challenge.
To address this architecture challenge, researchers turn to neural architecture search (NAS) [zoph2016neural, baker2016designing], which has been a hot topic since it shows promising results in automatically designing novel and better neural architectures beyond human-designed ones. For example, in computer vision, architectures searched by NAS can beat the state-of-the-art human-designed ones on the CIFAR-10 and ImageNet datasets by a large margin [cai2018proxylessnas, tan2019efficientnet]. Motivated by this success, two recent preliminary works, GraphNAS [gao2019graphnas] and AutoGNN [zhou2019auto], made the first attempt to tackle the architecture challenge in GNN with NAS approaches. However, it is non-trivial to apply NAS to GNN. One of the key components of NAS approaches [elsken2018neural, Bender18oneshot] is the design of the search space, i.e., defining what to search, which directly affects the effectiveness and efficiency of the search algorithms. A natural search space includes all hyperparameters related to GNN models, e.g., the hidden embedding size, aggregation functions, and number of layers, as done in GraphNAS and AutoGNN. However, this straightforward design of the search space has two problems. First, various GNN architectures, e.g., GeniePath [liu2019geniepath] or JK-Network [xu2018representation], are not included, thus the best achievable performance may be limited. Second, incorporating too many hyperparameters into the search space makes the architecture search process in GraphNAS/AutoGNN too expensive. In the NAS literature, it remains a challenging problem to design a proper search space, which should be expressive (large) yet compact (small) enough, so that a good balance between accuracy and efficiency can be achieved.

Besides, existing NAS approaches for GNNs face an inherent challenge: they are extremely expensive due to their trial-and-error nature, i.e., one has to train from scratch and evaluate as many candidate architectures over the search space as possible before obtaining a good one [zoph2016neural, liu2018darts, elsken2018neural]. Even on small graphs, on which most existing human-designed GNN architectures are tuned, the search cost of NAS approaches, i.e., GraphNAS and AutoGNN, can be quite expensive. We dub this challenge the computational challenge.
In this work, to address the architecture and computational challenges, we propose a novel NAS framework that tries to Search to Aggregate NEighborhood (SANE) for automatic architecture search in GNN. By revisiting extensive GNN models, we define a novel and expressive search space, which can emulate more human-designed GNN architectures than existing NAS approaches, i.e., GraphNAS and AutoGNN. To accelerate the search process, we adopt the advanced one-shot NAS paradigm [liu2018darts] and design a differentiable search algorithm, which trains a supernet subsuming all candidate architectures, thus greatly reducing the computational cost. We further conduct extensive experiments on three types of tasks, including transductive, inductive, and database (DB) tasks, to demonstrate the effectiveness and efficiency of the proposed framework. To summarize, the contributions of this work are as follows:


In this work, to address the architecture challenge in GNN models, we propose the SANE framework based on NAS. By designing a novel and expressive search space, SANE can emulate more human-designed GNN architectures than existing NAS approaches.

To address the computational challenge, we propose a differentiable architecture search algorithm, which is efficient in nature compared to trial-and-error-based NAS approaches. To the best of our knowledge, this is the first differentiable NAS approach for GNN.

Extensive experiments on five real-world datasets are conducted to compare SANE with human-designed GNN models and NAS approaches. The experimental results demonstrate the superiority of SANE in terms of effectiveness and efficiency compared to all baseline methods.
Notations. Formally, let $G = (V, E)$ be a simple graph with node features $\mathbf{X} \in \mathbb{R}^{N \times d}$, where $V$ and $E$ represent the node and edge sets, respectively. $N$ represents the number of nodes, and $d$ is the dimension of the node features. We use $N(v)$ to represent the first-order neighbors of a node $v$ in $G$, i.e., $N(v) = \{u \in V \mid (u, v) \in E\}$. In the literature, we further create a new set $\tilde{N}(v)$, which is the neighbor set including $v$ itself, i.e., $\tilde{N}(v) = \{v\} \cup N(v)$. A new graph $\tilde{G}$ is always created by adding a self-loop to every $v \in V$.
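To make the notation concrete, here is a minimal pure-Python sketch of the two neighbor sets; all names are our own illustrative choices, not code from the paper:

```python
# Illustrative sketch of the notation: a graph G = (V, E) with first-order
# neighbor sets N(v), and self-loop-augmented sets N~(v) = {v} ∪ N(v).

def neighbor_sets(num_nodes, edges):
    """Return N(v) for every node v of an undirected simple graph."""
    nbrs = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def with_self_loops(nbrs):
    """Return N~(v) = {v} ∪ N(v), i.e., add a self-loop to every node."""
    return {v: ns | {v} for v, ns in nbrs.items()}

edges = [(0, 1), (1, 2)]
N = neighbor_sets(3, edges)      # N(1) = {0, 2}
N_tilde = with_self_loops(N)     # N~(1) = {0, 1, 2}
```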
II Related Works
II-A Graph Neural Network (GNN)
GNN was first proposed in [gori2005new], and many of its variants [kipf2016semi, hamilton2017inductive, velivckovic2017graph, gao2018large, xu2018powerful, xu2018representation, liu2019geniepath] have been proposed in the past five years. Generally, these GNN models can be unified by a neighborhood aggregation or message passing schema [gilmer2017neural], where the representation of each node is learned by iteratively aggregating the embeddings ("messages") of its neighbors. A typical $L$-layer GNN in the neighborhood aggregation schema can be written as follows (see the illustrative example in Figure 1(a) and (b)): the $l$-th layer updates the embedding of each node $v$ as
$\mathbf{h}_v^{(l)} = \sigma\big(\mathbf{W}^{(l)} \cdot \text{AGG}_{\text{node}}(\{\mathbf{h}_u^{(l-1)}, \forall u \in \tilde{N}(v)\})\big)$,  (1)
where $\mathbf{h}_v^{(l)} \in \mathbb{R}^{d_l}$ represents the hidden features of node $v$ learned by the $l$-th layer, and $d_l$ is the corresponding dimension. $\mathbf{W}^{(l)}$ is a trainable weight matrix shared by all nodes in the graph, and $\sigma$ is a nonlinear activation function, e.g., sigmoid or ReLU.
$\text{AGG}_{\text{node}}$ is the key component, i.e., a pre-defined aggregation function, which varies across different GNN models. For example, a weighted summation function is designed in [kipf2016semi], and different functions, e.g., mean and max-pooling, are proposed as node aggregators in [hamilton2017inductive]. Further, to weigh the importance of different neighbors, an attention mechanism is incorporated to design the node aggregators in [velivckovic2017graph]. For more details of different node aggregators, we refer readers to Table XI in Appendix B.

Motivated by the success of residual networks [he2016deep], residual mechanisms have been incorporated to improve the performance of GNN models. In [chen2020simple], two simple residual connections are designed to improve the performance of the vanilla GCN model. In [xu2018representation], skip-connections are used to propagate messages from intermediate layers to the last layer, and the final representation of each node is computed by a layer aggregator as $\mathbf{z}_v = \text{AGG}_{\text{layer}}\big(\mathbf{h}_v^{(1)}, \cdots, \mathbf{h}_v^{(L)}\big)$, where $\text{AGG}_{\text{layer}}$ can be different operations, e.g., max-pooling or concatenation. In this way, neighbors in long ranges are used together to learn the final representation of each node, and a performance gain is reported with the layer aggregator [xu2018representation]. With the node and layer aggregators, we point out the two key components of existing GNN models, i.e., the neighborhood aggregation function and the range of the neighborhood, which are tuned depending on the tasks [xu2018representation]. In Section III-A, we will introduce the search space of SANE based on these two components.
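To make the two aggregator types concrete, here is a minimal pure-Python sketch combining a mean node aggregator (as in GraphSAGE) with a JK-style layer aggregator; scalar "features" and all names are illustrative, not the paper's code:

```python
# Hedged sketch of the two key components: a node aggregator applied per
# layer, and a layer aggregator combining the outputs of all layers.

def mean_node_aggregator(h, neighbors, v):
    """Aggregate previous-layer features of N~(v) = N(v) ∪ {v} by mean."""
    vals = [h[u] for u in neighbors[v] | {v}]
    return sum(vals) / len(vals)

def layer_aggregator(layer_outputs, mode="max"):
    """Combine the per-layer embeddings of one node (JK-Network style)."""
    if mode == "max":      # element-wise max-pooling (scalars here)
        return max(layer_outputs)
    if mode == "concat":   # concatenation keeps all layers
        return list(layer_outputs)
    raise ValueError(mode)

# A 3-node path graph with scalar features; two rounds of mean aggregation.
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
h = [0.0, 3.0, 6.0]
per_layer = []
for _ in range(2):
    h = [mean_node_aggregator(h, neighbors, v) for v in range(3)]
    per_layer.append(h)

# Final representation of node 1 from its two layer-wise embeddings.
z1 = layer_aggregator([layer[1] for layer in per_layer], mode="max")
```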
II-B Neural Architecture Search (NAS)
Neural architecture search (NAS) [baker2016designing, zoph2016neural, elsken2018neural, yao2018taking] aims to automatically find unseen and better architectures compared to expert-designed ones, and has shown promising results in searching for convolutional neural networks (CNNs) [baker2016designing, zoph2016neural, elsken2018neural, liu2018darts, zoph2018learning, tan2019efficientnet, yao2019differentiable]. Early NAS approaches follow a trial-and-error pipeline: first sample a candidate architecture from a pre-defined search space, then train it from scratch, and finally obtain its validation accuracy. This process is repeated many times before an architecture with satisfying performance is obtained. Representative methods are reinforcement learning (RL) algorithms [baker2016designing, zoph2016neural], which are inherently time-consuming since thousands of candidate architectures must be trained during the search process. To address this efficiency problem, a series of methods adopt a weight sharing strategy to reduce the computational cost. To be specific, instead of training thousands of separate models from scratch one by one, one can train a single large network (supernet) capable of emulating any architecture in the search space. Each architecture then inherits its weights from the supernet, and the best architecture can be obtained much more efficiently. This paradigm is also referred to as "one-shot NAS", and representative methods include [pham2018efficient, Bender18oneshot, liu2018darts, cai2018proxylessnas, yao2019differentiable, xie2018snas, zhou2019bayesnas, akimoto2019adaptive].
Recently, several NAS-based works have been proposed to obtain data-specific GNN architectures. GraphNAS [gao2019graphnas] and AutoGNN [zhou2019auto] made the first attempts to introduce NAS into GNN. In [peng2019learning], an evolution-based search algorithm is proposed to search architectures on top of the Graph Convolutional Network (GCN) [kipf2016semi] model for the action recognition problem; in this work, however, we focus on the node representation learning problem, which is an important one in the GNN literature. In [lai2020policy], an RL-based method is proposed to search for node-specific layer numbers given a GNN model, e.g., GCN or GAT, and in [ding2020propagation], propagation matrices in the message passing framework are searched. These two works can be regarded as orthogonal to our framework. GraphNAS and AutoGNN are RL-based methods and thus very expensive in nature. Besides, the search space of the proposed SANE is more expressive than those of GraphNAS and AutoGNN. In [zhao2020simplifying], the search space of GraphNAS is further simplified by introducing node and layer aggregators; however, the search method is still RL-based. Further, to address the computational challenge, we design a differentiable search algorithm based on the one-shot method [liu2018darts]. To the best of our knowledge, this is the first differentiable NAS approach for architecture search in GNN.
III The Proposed Framework
III-A The Search Space Design
Table I: Candidate operations in the search space of SANE.
Node aggregators ($\mathcal{O}_n$): SAGE-SUM, SAGE-MEAN, SAGE-MAX, GCN, GAT, GAT-SYM, GAT-COS, GAT-LINEAR, GAT-GEN-LINEAR, GIN, GeniePath
Layer aggregators ($\mathcal{O}_l$): CONCAT, MAX, LSTM
Skip operations ($\mathcal{O}_s$): IDENTITY, ZERO
Table II: Node and layer aggregators of human-designed GNN architectures and SANE.
Model | Node aggregators | Layer aggregators
GCN [kipf2016semi] | GCN | --
SAGE [hamilton2017inductive] | SAGE-SUM/MEAN/MAX | --
GAT [velivckovic2017graph] | GAT, GAT-SYM/COS/LINEAR/GEN-LINEAR | --
GIN [xu2018powerful] | GIN | --
LGCN [gao2018large] | CNN | --
GeniePath [liu2019geniepath] | GeniePath | --
JK-Network [xu2018representation] | depends on the base GNN | CONCAT/MAX/LSTM
SANE (NAS) | learned combination of aggregators | learned combination of aggregators
In the literature [Bender18oneshot, elsken2018neural], designing a good search space is very important for NAS approaches. On the one hand, a good search space should be large and expressive enough to emulate various existing GNN models, thus ensuring competitive performance (see Table VI). On the other hand, the search space should be small and compact enough for the sake of computational resources, i.e., searching time (see Table VII). In the well-established work [xu2018powerful], the authors show that the expressive capability of GNNs depends on the properties of the aggregation functions. Thus, to design an expressive yet compact search space, we focus on two key components, node and layer aggregators, introduced in the following:


Node aggregators: We choose the node aggregators listed in Table I, which cover representative GNN models, and denote the node aggregator set by $\mathcal{O}_n$.

Layer aggregators: We choose 3 layer aggregators as shown in Table I. Besides, we have two more operations, IDENTITY and ZERO, related to skip-connections. Instead of requiring skip-connections between all intermediate layers and the last layer as in JK-Network, we generalize this option by searching for the existence of a skip-connection between each intermediate layer and the last layer: to connect, we choose IDENTITY, and ZERO otherwise. We denote the layer aggregator set by $\mathcal{O}_l$ and the skip operation set by $\mathcal{O}_s$.
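The resulting search space can be written down as plain data. The following sketch uses the operation names of Table I; the counting scheme (one node aggregator per layer, one skip choice per intermediate layer, one layer aggregator) is our illustrative assumption, not a count stated by the paper:

```python
# Sketch of a SANE-style discrete search space and its enumeration.
from itertools import product

NODE_AGGS = ["SAGE-SUM", "SAGE-MEAN", "SAGE-MAX", "GCN", "GAT", "GAT-SYM",
             "GAT-COS", "GAT-LINEAR", "GAT-GEN-LINEAR", "GIN", "GeniePath"]
LAYER_AGGS = ["CONCAT", "MAX", "LSTM"]
SKIP_OPS = ["IDENTITY", "ZERO"]

def enumerate_architectures(num_layers):
    """Yield every discrete architecture for a GNN with `num_layers` layers."""
    for nodes in product(NODE_AGGS, repeat=num_layers):
        for skips in product(SKIP_OPS, repeat=num_layers):
            for layer_agg in LAYER_AGGS:
                yield {"node": nodes, "skip": skips, "layer": layer_agg}

# Under this counting, a 3-layer GNN yields 11^3 * 2^3 * 3 candidates.
n = sum(1 for _ in enumerate_architectures(3))
```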
To show the expressive capability of the designed search space, here we further give a detailed comparison between SANE and existing GNN models in Table II, from which we can see that SANE can emulate existing models. Besides, we also discuss the connections between SANE and more recent advanced GNN baselines in Appendix A.
III-B Differentiable Architecture Search
In this part, we first introduce how to represent the search space of SANE as a supernet, which is a directed acyclic graph (DAG) in Figure 1(c), and then how to use gradient descent for the architecture search.
III-B1 Continuous Relaxation of the Search Space
Assume we use an $L$-layer GNN with JK-Network as the backbone ($L = 3$ in Figure 1(c)). Each node of the supernet is a latent representation, e.g., the input features of a graph node, or the embeddings in the intermediate layers. Each directed edge $(i, j)$ is associated with an operation $o^{(i,j)}$ that transforms the representation of node $i$, e.g., a GAT aggregator. Without loss of generality, we have one input node, one output node, and one node representing the set of all skipped intermediate layers, thus the supernet has $L + 3$ nodes in total. The task is then transformed into finding a proper operation on each edge, which leads to a discrete search space and is difficult in nature.
Motivated by the differentiable architecture search in [liu2018darts], we relax the categorical choice of a particular operation to a softmax over all possible operations:
$\bar{o}^{(i,j)}(\mathbf{x}) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} \, o(\mathbf{x})$,  (2)
where the operation mixing weights for a pair of nodes $(i,j)$ are parameterized by a vector $\bm{\alpha}^{(i,j)}$, and $\mathcal{O}$ is chosen from the three operation sets $\mathcal{O}_n$, $\mathcal{O}_l$, and $\mathcal{O}_s$ introduced in Section III-A. Then we have the corresponding $\bm{\alpha}_n$, $\bm{\alpha}_l$, and $\bm{\alpha}_s$. $\mathbf{x}$ represents the input hidden features of a GNN layer, e.g., $\mathbf{h}_v^{(l-1)}$.
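Numerically, the relaxation in (2) is just a softmax-weighted mixture of all candidate operations on an edge. A minimal sketch (scalar inputs and toy operations, purely illustrative):

```python
# Sketch of the continuous relaxation: the edge output is the
# softmax(alpha)-weighted sum of every candidate operation's output.
import math

def mixed_op(x, ops, alpha):
    """Return sum_o softmax(alpha)_o * op_o(x)."""
    mx = max(alpha)                          # stabilized softmax
    exps = [math.exp(a - mx) for a in alpha]
    z = sum(exps)
    return sum((e / z) * op(x) for e, op in zip(exps, ops))

ops = [lambda x: x,        # e.g., an IDENTITY-like operation
       lambda x: 0.0]      # e.g., a ZERO-like operation
out = mixed_op(2.0, ops, alpha=[0.0, 0.0])   # equal weights
```

As the weight of one operation dominates, the mixture approaches that operation's output, which is what makes the later argmax discretization sensible.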
Let $\bar{o}_n$, $\bar{o}_l$, and $\bar{o}_s$ be the mixed operations obtained from $\mathcal{O}_n$, $\mathcal{O}_l$, and $\mathcal{O}_s$ based on (2), respectively; we remove the superscript $(i,j)$ for simplicity when there is no ambiguity. Then, given a node $v$ in the graph, the neighborhood aggregation process of SANE is
$\mathbf{h}_v^{(l)} = \bar{o}_n\big(\{\mathbf{h}_u^{(l-1)}, \forall u \in \tilde{N}(v)\}\big)$,  (3)
where the weight matrix of each node aggregator is shared by all candidate architectures from the search space. Then, for the last layer, the embedding of node $v$ can be computed by
$\mathbf{s}_v = \big[\bar{o}_s(\mathbf{h}_v^{(1)}), \cdots, \bar{o}_s(\mathbf{h}_v^{(L)})\big]$,  (4)
$\mathbf{z}_v = \bar{o}_l(\mathbf{s}_v)$,  (5)
where $[\cdot]$ represents that we stack all the embeddings from intermediate layers for the last layer. From the above equations, we can see that the computing process is a summation over all operations, i.e., aggregators or skips, from the corresponding set, which is what a "supernet" means. After obtaining the final representation $\mathbf{z}_v$ of node $v$, we can feed it into different types of losses depending on the given task. Thus, SANE solves the following bilevel optimization problem:
$\min_{\bm{\alpha}} \; \mathcal{L}_{\text{val}}\big(\mathbf{w}^*(\bm{\alpha}), \bm{\alpha}\big)$,  (6)
$\text{s.t.} \;\; \mathbf{w}^*(\bm{\alpha}) = \arg\min_{\mathbf{w}} \mathcal{L}_{\text{train}}(\mathbf{w}, \bm{\alpha})$,  (7)
where $\mathcal{L}_{\text{train}}$ and $\mathcal{L}_{\text{val}}$ are the training and validation loss, respectively. $\bm{\alpha}$ represents a network architecture, and $\mathbf{w}^*(\bm{\alpha})$ the corresponding weights after training. In the experiments, we focus on the node classification task, thus the cross-entropy loss is used. The task of architecture search by SANE is therefore to learn the three continuous variables $\bm{\alpha} = \{\bm{\alpha}_n, \bm{\alpha}_l, \bm{\alpha}_s\}$, each a vector over the corresponding operation set.
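The bilevel structure of (6)-(7) can be illustrated with a toy alternating-gradient loop: one gradient step on the weights under the training loss, then one gradient step on the architecture variable under the validation loss. The quadratic losses below are stand-ins we invented for illustration, not the paper's objectives:

```python
# Toy first-order alternation for a bilevel problem:
#   inner (7): w minimizes L_train(w, a) = (w - a)^2
#   outer (6): a minimizes L_val(w, a)  = (a - 1)^2 + (a - w)^2

def train_loss_grad_w(w, a):
    return 2.0 * (w - a)                 # dL_train/dw

def val_loss_grad_a(w, a):
    return 2.0 * (a - 1.0) + 2.0 * (a - w)   # dL_val/da

w, a = 0.0, 0.0
lr_w, lr_a = 0.1, 0.1
for _ in range(500):
    w -= lr_w * train_loss_grad_w(w, a)  # one inner step, cf. (7)
    a -= lr_a * val_loss_grad_a(w, a)    # one outer step, cf. (6)
# Both variables converge near the joint optimum w = a = 1.
```

In the real algorithm, `w` is the supernet's weights and `a` the architecture parameters $\bm{\alpha}$; the single inner step mirrors the one-step approximation discussed in Section III-B2.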
Table III: Comparison between SANE and existing NAS methods for GNNs.
Method | Searches node aggregators | Searches layer aggregators | Search algorithm
GraphNAS, AutoGNN | yes | no | RL
PolicyGNN | no | no | RL
SANE | yes | yes | Differentiable
III-B2 Optimization by Gradient Descent
We can observe that the SANE problem is a bilevel optimization problem, where the architecture parameters $\bm{\alpha}$ are optimized on validation data (i.e., (6)), and the network weights $\mathbf{w}$ are optimized on training data (i.e., (7)). With the above continuous relaxation, the advantage is that the recently developed one-shot NAS approach [liu2018darts] can be applied.
The optimization details are given in Algorithm 1. Specifically, following [liu2018darts, yao2019differentiable], we use a gradient-based approximation to update the architecture parameters, i.e.,
$\nabla_{\bm{\alpha}} \mathcal{L}_{\text{val}}\big(\mathbf{w} - \xi \nabla_{\mathbf{w}} \mathcal{L}_{\text{train}}(\mathbf{w}, \bm{\alpha}), \bm{\alpha}\big)$,  (8)
where $\mathbf{w}$ is the current weight, and $\xi$ is the learning rate for a step of inner optimization of $\mathbf{w}$ in (7). Thus, instead of obtaining the fully optimized $\mathbf{w}^*(\bm{\alpha})$, we only approximate it by adapting $\mathbf{w}$ with a single training step. After training, we retain the top-$k$ strongest operations, i.e., those with the largest weights according to (2), and form the complete architecture with the searched operations. The searched architecture is then retrained from scratch and tuned on the validation data to obtain the best hyperparameters. Note that in the experiments, we set $k = 1$ for simplicity, which means a discrete architecture is obtained by replacing each mixed operation with the operation of the largest weight, i.e., $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.
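The discretization step at the end of search is a per-edge argmax over the learned architecture weights. A small sketch with invented edge and operation names:

```python
# Sketch of deriving the final discrete architecture (k = 1): for each
# edge, keep the single operation with the largest architecture weight.

def derive_architecture(alphas):
    """alphas: {edge_name: {op_name: weight}} -> {edge_name: best op}."""
    return {edge: max(ops, key=ops.get) for edge, ops in alphas.items()}

alphas = {
    "layer1": {"GCN": 0.2, "GAT": 0.7, "GIN": 0.1},
    "layer2": {"GCN": 0.5, "GAT": 0.3, "GIN": 0.2},
    "skip1": {"IDENTITY": 0.8, "ZERO": 0.2},
    "layer_agg": {"CONCAT": 0.4, "MAX": 0.35, "LSTM": 0.25},
}
arch = derive_architecture(alphas)
# arch == {"layer1": "GAT", "layer2": "GCN",
#          "skip1": "IDENTITY", "layer_agg": "CONCAT"}
```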
III-C Comparison with Existing NAS Methods
In this part, we further compare SANE with GraphNAS [gao2019graphnas], AutoGNN [zhou2019auto], and PolicyGNN [lai2020policy], which are the latest NAS methods for GNNs in node-level representation learning. The results are shown in Table III, where the advantages of the expressive search space and the differentiable search algorithm are evident. Since PolicyGNN focuses on searching for the number of layers given a GNN backbone and can thus be regarded as orthogonal to SANE, we focus the following discussion on the comparison between GraphNAS/AutoGNN and SANE.
The key difference is that SANE does not include parameters like the hidden embedding size or the number of attention heads, which tend to be regarded as hyperparameters of GNN models. The underlying philosophy is that the expressive capability of GNN models mainly relies on the properties of the aggregation functions, as shown in the well-established work [xu2018powerful]; thus we focus on including more node aggregators to make the searched architectures as powerful as possible. The layer aggregators are further included to alleviate the over-smoothing problem in deeper GNN models [xu2018representation]. Moreover, this simplified and compact search space has a side advantage: the search space is smaller by orders of magnitude, and thus the cost of architecture search is reduced by orders of magnitude. For example, for the $L$-layer GNN considered in our experiments, the total number of architectures in our search space is orders of magnitude smaller than the number of candidate architectures to be searched in AutoGNN [zhou2019auto]. Finally, since the hyperparameters are tuned by retraining the GNN architectures derived by Algorithm 1, which is also standard practice in CNN architecture search [liu2018darts, xie2018snas], SANE decouples architecture search and hyperparameter tuning, while GraphNAS/AutoGNN mix them up. In Section IV-E3, we show that by running GraphNAS over the search space of SANE, the performance can be improved, which means that better architectures can be obtained given the same time budget, demonstrating the advantages of this decoupling as well as of the simplified and compact search space.
IV Experiment
In this section, we conduct extensive experiments to demonstrate the superiority of the proposed SANE on three tasks: the transductive, inductive, and DB tasks.
IV-A Experimental Settings
IV-A1 Datasets and Tasks
Here, we introduce the details of the different tasks and their corresponding datasets (Table IV). Note that the transductive and inductive tasks are standard ones in the literature [kipf2016semi, hamilton2017inductive, xu2018representation]. We further add one popular database (DB) task, entity alignment in a cross-lingual knowledge base (KB), to show the capability of SANE in broader domains.
Table IV: Statistics of the datasets (N: #nodes, E: #edges, F: #features, C: #classes).
Task | Dataset | N | E | F | C
Transductive | Cora | 2,708 | 5,278 | 1,433 | 7
Transductive | CiteSeer | 3,327 | 4,552 | 3,703 | 6
Transductive | PubMed | 19,717 | 44,324 | 500 | 3
Inductive | PPI | 56,944 | 818,716 | 121 | 50
Transductive Task. Only a subset of the nodes in one graph is accessible as training data, and the remaining nodes are used as validation and test data. For this setting, we use three benchmark datasets: Cora, CiteSeer, and PubMed. They are all citation networks, provided by [sen2008collective]. Each node represents a paper, and each edge represents a citation relation between two papers. Each dataset contains bag-of-words features for each paper (node), and the task is to classify papers into different subjects based on the citation network. We split the nodes in all graphs into 60%/20%/20% for training, validation, and test.
Inductive Task. In this task, we use a number of graphs as training data, and other, completely unseen graphs as validation/test data. For this setting, we use the PPI dataset, provided by [hamilton2017inductive], on which the task is to classify protein functions. PPI consists of 24 graphs, each corresponding to a human tissue. Each node has positional gene sets, motif gene sets, and immunological signatures as features, and gene ontology sets as labels. 20 graphs are used for training, 2 graphs for validation, and the rest for test.
DB Task. For the DB task, we choose cross-lingual entity alignment, which matches entities referring to the same instances in different languages across two KBs. In the literature [wang2018cross, xu2019cross], GNN methods have been applied to this task to make use of the structural information underlying the cross-lingual KBs. We use the DBP15K dataset built by [sun2017cross], which was generated from DBpedia, a large-scale multilingual KB containing rich inter-language links between different language versions. We choose the Chinese-English subset in our experiments, and the statistics of the dataset are given in Table V. For the experimental setting, we follow [wang2018cross] and use 30% of the inter-language links for training, 10% for validation, and the remaining 60% for test. For the evaluation metric, we use Hits@k to evaluate the performance of SANE. For the sake of space, we refer readers to [wang2018billion] for more details.
IV-A2 Compared Methods
We compare SANE with two groups of state-of-the-art methods: human-designed GNN architectures and NAS approaches.
Human-designed GNNs. As shown in Table II, the human-designed GNN architectures are: GCN [kipf2016semi], GraphSAGE [hamilton2017inductive], GAT [velivckovic2017graph], GIN [xu2018powerful], LGCN [gao2018large], and GeniePath [liu2019geniepath]. For models with variants, like the different aggregators in GraphSAGE, we report the best performance across the variants. Besides, we add JK-Network to all models except LGCN, and obtain 5 more baselines: GCN-JK, GraphSAGE-JK, GAT-JK, GIN-JK, and GeniePath-JK. For LGCN, we use the code released by the authors (https://github.com/HongyangGao/LGCN), and for the other baselines, we use the popular open-source library PyTorch Geometric (PyG) [fey2019fast], which implements various GNN models. For all baselines, we train each from scratch with the best hyperparameters obtained on the validation datasets, and report the test performance. We repeat this process 5 times, and report the final mean accuracy with standard deviation.
Table V: Statistics of the DBP15K (ZH-EN) dataset.
Language | #Entities | #Relations | #Attributes | #Rel. triples | #Attr. triples
Chinese | 66,469 | 2,830 | 8,113 | 153,929 | 379,684
English | 98,125 | 2,317 | 7,173 | 237,674 | 567,755
Table VI: Performance comparison on the transductive (Cora, CiteSeer, PubMed) and inductive (PPI) tasks.
Group | Method | Cora | CiteSeer | PubMed | PPI
Human-designed architectures | GCN | 0.8811 (0.0101) | 0.7666 (0.0202) | 0.8858 (0.0030) | 0.6500 (0.0000)
Human-designed architectures | GCN-JK | 0.8820 (0.0118) | 0.7763 (0.0136) | 0.8927 (0.0037) | 0.8078 (0.0000)
Human-designed architectures | GraphSAGE | 0.8741 (0.0159) | 0.7599 (0.0094) | 0.8834 (0.0044) | 0.6504 (0.0000)
Human-designed architectures | GraphSAGE-JK | 0.8841 (0.0015) | 0.7654 (0.0054) | 0.8942 (0.0066) | 0.8019 (0.0000)
Human-designed architectures | GAT | 0.8719 (0.0163) | 0.7518 (0.0145) | 0.8573 (0.0066) | 0.9414 (0.0000)
Human-designed architectures | GAT-JK | 0.8726 (0.0086) | 0.7527 (0.0128) | 0.8674 (0.0055) | 0.9749 (0.0000)
Human-designed architectures | GIN | 0.8600 (0.0083) | 0.7340 (0.0139) | 0.8799 (0.0046) | 0.8724 (0.0002)
Human-designed architectures | GIN-JK | 0.8699 (0.0103) | 0.7651 (0.0133) | 0.8878 (0.0054) | 0.9467 (0.0000)
Human-designed architectures | GeniePath | 0.8670 (0.0123) | 0.7594 (0.0137) | 0.8846 (0.0039) | 0.7138 (0.0000)
Human-designed architectures | GeniePath-JK | 0.8776 (0.0117) | 0.7591 (0.0116) | 0.8868 (0.0037) | 0.9694 (0.0000)
Human-designed architectures | LGCN | 0.8687 (0.0075) | 0.7543 (0.0221) | 0.8753 (0.0012) | 0.7720 (0.0020)
NAS approaches | Random | 0.8594 (0.0072) | 0.7062 (0.0042) | 0.8866 (0.0010) | 0.9517 (0.0032)
NAS approaches | Bayesian | 0.8835 (0.0072) | 0.7335 (0.0006) | 0.8801 (0.0033) | 0.9583 (0.0082)
NAS approaches | GraphNAS | 0.8840 (0.0071) | 0.7762 (0.0061) | 0.8896 (0.0024) | 0.9692 (0.0128)
NAS approaches | GraphNAS-WS | 0.8808 (0.0101) | 0.7613 (0.0156) | 0.8842 (0.0103) | 0.9584 (0.0415)
One-shot NAS | SANE | 0.8926 (0.0123) | 0.7859 (0.0108) | 0.9047 (0.0091) | 0.9856 (0.0120)
NAS approaches for GNN. We consider the following methods: (i) Random search (denoted as "Random") [bergstra2012random]: a simple NAS baseline that uniformly samples architectures from the search space; (ii) Bayesian optimization (denoted as "Bayesian", https://github.com/hyperopt/hyperopt) [bergstra2011algorithms]: a popular sequential model-based global optimization method for hyperparameter optimization, which uses a tree-structured Parzen estimator to measure expected improvement; (iii) GraphNAS (https://github.com/GraphNAS/GraphNAS) [gao2019graphnas]: an RL-based NAS approach for GNN, which has two variants depending on the adoption of the weight sharing mechanism; we denote the variant using weight sharing as GraphNAS-WS. Note that AutoGNN [zhou2019auto] is not compared, for three reasons: 1) the search spaces of AutoGNN and GraphNAS are actually the same; 2) both works use an RL-based method; 3) the code of AutoGNN is not publicly available.

Random and Bayesian search over the designed search space of SANE: a GNN architecture is sampled from the search space and trained till convergence to obtain the validation performance. 200 models are sampled in total, and the architecture with the best validation performance is trained from scratch, with some hyperparameter tuning on the validation dataset, to obtain the test performance. For GraphNAS, we set the number of epochs for training the RL-based controller to 200; in each epoch, a GNN architecture is sampled and trained for enough epochs (depending on the dataset) to update the parameters of the RL-based controller. In the end, we sample 10 architectures and collect the top 5 that achieve the best validation accuracy. The best architecture is then trained from scratch, again with some hyperparameter tuning on the validation dataset, and the best test performance is reported. Note that we repeat the retraining of the architecture five times, and report the final mean accuracy with standard deviation.
IV-A3 Implementation Details of SANE
Our experiments are run with PyTorch (version 1.2) [paszke2019pytorch] on a 2080Ti GPU (memory: 12GB, CUDA version: 10.2). We implement SANE on top of the code provided by DARTS (https://github.com/quark0/darts) and PyG (version 1.2, https://github.com/rusty1s/pytorch_geometric). More implementation details are given in Appendix C. Note that we set $\xi = 0$ in (8) in our experiments, which means we use the first-order approximation introduced in [liu2018darts]; it is more efficient, and its performance is good enough in our experiments. For all tasks, we run the search process 5 times with different random seeds, and retrieve the top-1 architecture each time. After selecting the best of the 5 top-1 architectures on the validation datasets, we repeat 5 times the process of retraining it from scratch, fine-tuning hyperparameters on the validation data, and reporting the test performance. Again, the final mean accuracy with standard deviation is reported.
IV-B Performance on Transductive and Inductive Tasks
The results of transductive and inductive tasks are given in Table VI, and we give detailed analyses in the following.
IV-B1 Transductive Task
Overall, we can see that SANE consistently outperforms all baselines on the three datasets, which demonstrates the effectiveness of the architectures searched by SANE. Looking at the results of the human-designed architectures, we first observe that GCN and GraphSAGE outperform more complex models, e.g., GAT or GeniePath, which is similar to the observation in the GeniePath paper [liu2019geniepath]. We attribute this to the fact that these three graphs are not that large, so the more complex aggregators may be prone to overfitting. Besides, there is no absolute winner among the human-designed architectures, which further verifies the need to search for data-specific architectures. Another interesting observation is that adding JK-Network to the base GNN architectures consistently increases performance, which aligns with the experimental results of JK-Network [xu2018representation]. This demonstrates the importance of introducing layer aggregators into the search space of SANE.
On the other hand, looking at the performance of the NAS approaches, the superiority of SANE is also clear from the performance gains. Recall that Random, Bayesian, and GraphNAS all search in a discrete search space, while SANE searches in a differentiable one enabled by Equation (2). This suggests that a differentiable search space makes it easier for search algorithms to find a better local optimum. A similar observation was previously made in [liu2018darts, yao2019differentiable], which search for CNNs in a differentiable space.
IV-B2 Inductive Task
We can see a similar trend on the inductive task: SANE consistently performs better than all baselines. However, among the human-designed architectures, the best two are GAT and GeniePath with JK-Network, which differs from the transductive task. This further shows the importance of searching for data-specific GNN architectures.
IV-B3 Searched Architectures
We visualize the top-1 architectures searched by SANE on the different datasets in Figure 2. First, these architectures are data-dependent and new to the literature. Second, searching for skip-connections indeed makes a difference, as the last GNN layer in Figure 2(b) and the middle layer in Figure 2(c) do not connect to the output layer. Finally, since attention-based node aggregators are more expressive than non-attentive ones, GAT (and its variants) are selected more often.
IV-C Search Efficiency
In this part, we compare the efficiency of SANE and the NAS baselines by showing the test accuracy w.r.t. the running time on the transductive and inductive tasks. From Figure 3, we can observe that the efficiency improvements are in orders of magnitude, which aligns with the experiments in previous one-shot NAS approaches, like DARTS [liu2018darts] and NASP [yao2019differentiable]. Besides, as in Section IV-B, we can see that SANE obtains architectures with higher test accuracy than Random, Bayesian, and GraphNAS.
To further show the efficiency improvements, we record the search time each method needs to obtain an architecture, where the number of epochs of SANE and GraphNAS is set to 200, and the number of trial-and-error steps of Random and Bayesian is set to 200 as well, i.e., each explores 200 candidate architectures; the results are given in Table VII. The search time of SANE is two orders of magnitude smaller than those of the NAS baselines, which further demonstrates the superiority of SANE in terms of search efficiency.
Table VII: Search time (in seconds) to obtain an architecture.

            Transductive task                 Inductive task
Method      Cora      CiteSeer  PubMed        PPI
Random      1,500     2,694     3,174         13,934
Bayesian    1,631     2,895     4,384         14,543
GraphNAS    3,240     3,665     5,917         15,940
SANE        14        35        54            298
Table VIII: Results on the DB (entity alignment) task, Hits@k (%).

            ZH-EN                        EN-ZH
Method      @1      @10     @50          @1      @10     @50
JAPE        33.32   69.28   86.40        33.02   66.92   85.15
GCN-Align   41.25   74.38   86.23        36.49   69.94   82.45
SANE        42.10   74.51   88.12        38.41   70.23   85.43
Table IX: Test accuracy of GraphNAS variants over different search spaces given the same time budget.

Methods                          Cora             CiteSeer         PubMed           PPI
GraphNAS                         0.8840 (0.0071)  0.7762 (0.0061)  0.8896 (0.0024)  0.9698 (0.0128)
GraphNAS-WS                      0.8808 (0.0101)  0.7613 (0.0156)  0.8842 (0.0103)  0.9584 (0.0415)
GraphNAS (SANE search space)     0.8826 (0.0023)  0.7707 (0.0064)  0.8877 (0.0012)  0.9887 (0.0010)
GraphNAS-WS (SANE search space)  0.8895 (0.0051)  0.7695 (0.0069)  0.8942 (0.0010)  0.9875 (0.0006)
IV-D DB Task
Since the DB task differs from the benchmark tasks, we adjust the settings of SANE following GCN-Align [wang2018cross], which uses two 2-layer GCNs in its experiments. To be specific, we set the number of layers to 2, which differs from the transductive task, and remove the layer aggregator, because in our experiments the performance decreases when simply adding the layer aggregator to the GCN architecture in [wang2018cross]. Therefore, we use SANE to search for different combinations of node aggregators for the entity alignment task, and the results are shown in Table VIII. We can see that SANE outperforms GCN-Align and JAPE, which demonstrates the effectiveness of SANE on the entity alignment task. We further emphasize the following observations:
- The performance gains of SANE over GCN-Align are evident, which demonstrates the advantage of combining different node aggregators. The searched architecture is "GAT-GeniePath", and more hyperparameters are given in Appendix C.
- The experimental results show the capability of SANE in broader domains. A take-away message is that SANE can further improve the performance of any task where a regular GNN model, e.g., GCN, works.
- We notice that there are follow-up works of GCN-Align, e.g., [xu2019cross, cao2019multi]; however, their modifications are orthogonal to node aggregators, thus they can be integrated with SANE to further improve the performance. We leave this for future work.
IV-E Ablation Study
IV-E1 The influence of differentiable search
In Algorithm 1 in Section III-B, during the search process, the architecture parameters are updated w.r.t. the validation loss and used to update the weights of the mixed operations in (3) to (4). Here we introduce a random explore parameter ε ∈ [0, 1], which is the probability of randomly sampling a single operation on each edge of the supernet and only updating the weights corresponding to the sampled operations. When ε = 0, the algorithm is the same as Algorithm 1, and when ε = 1, it is equivalent to random search with weight sharing. Thus we can show the influence of the differentiable search algorithm by varying ε. In this part, we conduct experiments with varying ε, and show the performance trend in Figure 4(a) on the transductive and inductive tasks. From Figure 4(a), on all four datasets, we can see that the test accuracy decreases as ε increases and is worst at ε = 1. This means that the gradient-descent method outperforms random sampling for architecture search; in other words, it demonstrates the effectiveness of the proposed differentiable search algorithm in Section III-B.
IV-E2 The influence of K
In Section IV, we choose a 3-layer GNN (K = 3) as the backbone in our experiments for its empirically good performance, since in this work we focus on searching for shallow GNN architectures. In this part, we conduct experiments with varying K, and show the performance trend in Figure 4(b). We can see that as K increases, the test accuracy first increases and then decreases, which aligns with the motivation of setting K = 3 in the experiments in Section IV-B.
IV-E3 The efficacy of the designed search space
In Section III-C, we discuss the advantages of SANE's search space over that of GraphNAS/AutoGNN. In this part, we conduct experiments to further show these advantages. To be specific, we run GraphNAS over its own and SANE's search space given the same time budget (20 hours), and compare the final test accuracy of the searched architectures in Table IX. From Table IX, we can see that despite the simplicity of SANE's search space, it obtains better or at least comparable accuracy, which means better architectures can be found given the same time budget, thus demonstrating the efficacy of the designed search space.
IV-E4 Failure of searching for universal approximators
In this part, we further show the performance of searching for Multi-Layer Perceptrons (MLPs) as node aggregators, since an MLP is a universal function approximator: any continuous function on a compact set can be approximated by a large enough MLP [leshno1993multilayer]. Moreover, in [xu2018powerful], Xu et al. show that using MLP aggregators can be as powerful as the Weisfeiler-Lehman (WL) test under some conditions, which upper bounds the expressive capability of existing GNN models. However, in practice, it is very challenging to obtain satisfying performance with MLPs without prior knowledge, since it is too general a problem to design an MLP of a suitable structure for a given task. This motivates us to incorporate various existing GNN models as node aggregators in our search space (Table I), so that the performance of any search algorithm can be guaranteed. To show the challenges of directly searching for MLPs as node aggregators, we search for a specific MLP node aggregator with NAS approaches, where the parameter space of the MLP covers its hidden embedding size (width) and its depth. For simplicity, we adopt Random and Bayesian as the NAS approaches and search on the four benchmark datasets, with the same settings as in Section IV-B. The performance is shown in Table X, from which we can see that the performance gap between searching for MLPs and SANE is large. This observation demonstrates the difficulty of searching for MLPs as node aggregators despite their expressive capability. On the other hand, it demonstrates the necessity of the designed search space of SANE, which includes existing human-designed GNN models, so that the performance can be guaranteed in practice.
Table X: Test accuracy when searching for MLP node aggregators (Random, Bayesian) vs. SANE.

Dataset    Random           Bayesian         SANE
Cora       0.8698 (0.0011)  0.8470 (0.0032)  0.8926 (0.0123)
CiteSeer   0.7298 (0.0078)  0.7103 (0.0057)  0.7859 (0.0108)
PubMed     0.8662 (0.0030)  0.8699 (0.0065)  0.9047 (0.0091)
PPI        0.8166 (0.0089)  0.8685 (0.0017)  0.9856 (0.0120)
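The "Random" baseline in this ablation amounts to sampling (width, depth) MLP configurations and keeping the best one under validation performance. A minimal hedged sketch of that loop follows; the candidate grids, the function name, and the `evaluate` callback are illustrative placeholders, since the paper's exact parameter ranges are not restated here.

```python
import random

def random_search_mlp(evaluate, widths, depths, trials=20, rng=random):
    """Random search over MLP aggregator configurations.

    evaluate: callback mapping a (width, depth) tuple to a validation score
              (higher is better); in practice this would train a GNN whose
              node aggregator is an MLP with that width and depth.
    widths, depths: candidate grids for the MLP's hidden size and depth.
    """
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = (rng.choice(widths), rng.choice(depths))  # sample a candidate MLP
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Even with such a loop, each trial pays the full cost of training one architecture, and nothing constrains the sampled MLPs toward structures that aggregate neighborhoods well, which is consistent with the large gap to SANE reported in Table X.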
V Conclusion
In this work, to address the architecture and computational challenges facing existing GNN models and NAS approaches, we propose to Search to Aggregate NEighborhood (SANE) for graph neural architecture search. By reviewing various human-designed GNN architectures, we define an expressive search space including node and layer aggregators, which can express unseen GNN architectures beyond existing ones. A differentiable architecture search algorithm is further proposed, which is more efficient than existing NAS methods for GNNs. Extensive experiments are conducted on five real-world datasets in transductive, inductive, and DB tasks. The experimental results demonstrate the superiority of SANE compared with GNN and NAS baselines in terms of effectiveness and efficiency.
For future directions, we will explore more advanced NAS approaches to further improve the performance of SANE. Besides, we can move beyond node classification to more graph-based tasks, e.g., whole-graph classification [hu2020open], in which case different graph pooling methods can be searched for whole-graph representations.
VI Acknowledgments
This work is supported by the National Key R&D Program of China (2019YFB1705100). We also thank Lanning Wei for implementing several experiments in this work. We further thank all anonymous reviewers for their constructive comments, which helped us improve the quality of this manuscript.
GNN models                         Symbol in the paper             Key explanations
GCN [kipf2016semi]                 GCN                             Mean aggregation with symmetric degree normalization.
GraphSAGE [hamilton2017inductive]  SAGE-MEAN, SAGE-MAX, SAGE-SUM   Apply a mean, max, or sum operation to the neighbors' features.
GAT [velivckovic2017graph]         GAT                             Attention score computed from the transformed features of each node pair.
                                   GAT-SYM                         Symmetrized variant of the GAT attention score.
                                   GAT-COS                         Cosine-similarity attention score.
                                   GAT-LINEAR                      Linear attention score.
                                   GAT-GEN-LINEAR                  Generalized linear attention score.
GIN [xu2018powerful]               GIN                             Sum aggregation followed by an MLP.
LGCN [gao2018large]                CNN                             Use a 1-D CNN as the aggregator, equivalent to a weighted-summation aggregator.
GeniePath [liu2019geniepath]       GeniePath                       Composition of GAT- and LSTM-based aggregators.
JK-Network [xu2018representation]  —                               Depends on the base GNN above.
Hyperparameter          Cora      CiteSeer  PubMed    PPI       DB
Head number             8         2         2         4         4
Hidden embedding size   256       64        64        512       512
Learning rate           4.150e-4  5.937e-3  2.408e-3  1.002e-3  2.039e-3
Norm                    1.125e-4  2.007e-5  8.850e-5  0         3.215e-4
Activation function     relu      relu      relu      relu      relu