1 Introduction
Recently, a large number of research efforts have been dedicated to applying deep learning methods to graphs, known as graph neural networks (GNNs)
[Kipf and Welling, 2016, Veličković et al., 2017], achieving great success in modeling non-structured data, e.g., social networks and recommendation systems. Learning an effective low-dimensional embedding to represent each node in the graph is arguably the most important task in GNN learning, wherein the node embedding is obtained by aggregating information passed through GNN layers from its direct and indirect neighboring nodes [Gilmer et al., 2017]. Earlier GNN works usually aggregate neighboring nodes within a short range. For many graphs, this may cause the so-called under-reaching issue [Alon and Yahav, 2020] – informative yet distant nodes are not involved, leading to unsatisfactory results. Consequently, many techniques that try to aggregate high-order neighbors have been proposed by deepening or widening the network [Li et al., 2018, Xu et al., 2018a,b, Li et al., 2019, Zhu and Koniusz, 2021]. However, when we aggregate information from too many neighbors, especially high-order ones, the so-called over-smoothing problem Li et al. [2018] may occur, making nodes less distinguishable [Chen et al., 2020]. To alleviate this problem, several hop-aware GNN aggregation schemes have been proposed in the literature [Abu-El-Haija et al., 2019, Zhang et al., 2020, Zhu et al., 2019, Wang et al., 2019].
While showing promising results, most existing GNN representation learning works mingle the messages passed among nodes. From a communication perspective, mixing information from clean sources (mostly low-order neighbors) with that from noisy sources (mostly high-order neighbors) inevitably makes it difficult for the receiver (i.e., the target node) to extract information. Motivated by this observation, we propose a simple yet effective ladder-style GNN architecture, namely LadderGNN. Specifically, the contributions of this paper include:

We take a communication perspective on GNN message passing. That is, we regard the target node for representation learning as the receiver and group the set of neighboring nodes with the same distance to it as a transmitter that carries both information and noise. The dimension of the message can be regarded as the capacity of the communication channel. Then, aggregating neighboring information from multiple hops becomes a multi-source communication problem with multiple transmitters over the communication channel.

To simplify node representation learning, we propose to separate the messages from different transmitters (i.e., the neighbors at each hop), each occupying a proportion of the communication channel (i.e., disjoint message dimensions). As the information-to-noise ratio of high-order neighbors is usually lower than that of low-order neighbors, the resulting hop-aware representation is usually unbalanced, with more dimensions allocated to low-order neighbors, leading to a ladder-style aggregation scheme.

We propose a reinforcement learning-based neural architecture search (NAS) strategy to determine the dimension allocation for neighboring nodes from different hops. Based on the search results, we propose an approximate hop-dimension relation function, which generates results close to the NAS solution without the compute-expensive search.
To verify the effectiveness of the proposed simple hop-aware representation learning solution, we evaluate it on several semi-supervised node classification datasets. Experimental results show that the proposed LadderGNN solution achieves state-of-the-art performance on most of them.
2 Related Work and Motivation
Graph neural networks adopt message passing to learn node embeddings, which involves two steps for each node: neighbor aggregation and linear transformation [Gilmer et al., 2017]. The following formula presents the mathematical form of message passing in a graph convolutional network (GCN) [Kipf and Welling, 2016]. Given an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes and adjacency matrix $A$, we can aggregate node features at the $l$-th layer as:

$$X^{(l+1)} = \sigma\big(\hat{A}\, X^{(l)} W^{(l)}\big), \qquad \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}}, \quad \tilde{A} = A + I_N, \tag{1}$$

where $\hat{A}$ is the augmented normalized adjacency matrix of $\mathcal{G}$, $I_N$ is the identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. $W^{(l)}$ is the trainable weight matrix at the $l$-th layer used to update node embeddings, and $\sigma(\cdot)$ is an activation function.
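As a concrete illustration of Eq. (1), the propagation step can be sketched in a few lines of NumPy; the toy path graph, random weights, and `tanh` activation below are illustrative choices, not the paper's setup:

```python
import numpy as np

def gcn_layer(A, X, W, activation=np.tanh):
    """One GCN propagation step: sigma(A_hat @ X @ W), where A_hat is the
    augmented normalized adjacency D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # augmented node degrees
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return activation(A_hat @ X @ W)

# toy 3-node path graph with one-hot features and a random weight matrix
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)
W = np.random.default_rng(0).normal(size=(3, 2))
H = gcn_layer(A, X, W)  # updated 2-dimensional node embeddings
```

Each row of `H` is a node embedding that already mixes the node's own features with those of its 1-hop neighbors, which is the per-layer building block the remainder of this section builds on.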
The original GCN work does not differentiate the messages passed from direct neighbors. Graph attention network (GAT) [Veličković et al., 2017] applies a multi-head self-attention mechanism to compute the importance of neighboring nodes during aggregation and achieves better results than GCN in many cases. Later, considering that high-order nodes also carry relevant information, some GNN architectures [Li et al., 2019, Xu et al., 2018a,b] stack deeper networks to retrieve information from high-order neighbors recursively. To mitigate the possible over-fitting (due to model complexity) when aggregating high-order neighbors, SGC [Wu et al., 2019] removes the redundant nonlinear operations and directly aggregates node representations from multiple hops. Aggregating high-order neighbors, however, may cause the so-called over-smoothing problem that results in less discriminative node representations (due to over-mixing) Li et al. [2018]. Consequently, various hop-aware aggregation solutions have been proposed in the literature. Some of them (e.g., High-Order Morris et al. [2019], MixHop [Abu-El-Haija et al., 2019], N-GCN Abu-El-Haija et al. [2020], GB-GNN Oono and Suzuki [2020]) employ multiple convolutional branches to aggregate neighbors from different hops. Others (e.g., AM-GCN Wang et al. [2020], HWGCN Liu et al. [2019a], MultiHop Zhu et al. [2019]) try to learn adaptive attention scores when aggregating neighboring nodes from different hops.
Figure 1: The homophily ratio (the percentage of neighbors sharing the target node's label) at different hops on the Pubmed dataset.
Undoubtedly, the critical issue in GNN representation learning is how to retrieve information effectively while suppressing noise during message passing. In Figure 1, we plot the homophily ratio for nodes in the Pubmed dataset. As can be observed, as the distance increases, the percentage of neighboring nodes with the same label decreases, indicating a diminishing information-to-noise ratio for messages ranging from low-order neighbors to high-order neighbors. However, existing GNN architectures mix the information from clean sources and noisy sources in representation learning (even the hop-aware aggregation solutions), making discriminative feature extraction challenging. These observations motivate the proposed LadderGNN solution, as detailed in the following section.
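The hop-wise homophily ratio behind Figure 1 is easy to reproduce on any labeled graph; below is a minimal sketch (dense adjacency, exact-k-hop neighbors found via matrix powers) rather than the paper's measurement code:

```python
import numpy as np

def hop_homophily(A, labels, max_hop=4):
    """Fraction of exactly-k-hop neighbors that share the target node's label."""
    n = len(labels)
    reach = np.eye(n, dtype=bool)  # nodes already within < k hops (incl. self)
    Ak = np.eye(n)
    ratios = []
    for _ in range(max_hop):
        Ak = Ak @ A
        at_k = (Ak > 0) & ~reach   # nodes at exactly k hops from each node
        same = total = 0
        for i in range(n):
            nbrs = np.flatnonzero(at_k[i])
            total += len(nbrs)
            same += int((labels[nbrs] == labels[i]).sum())
        ratios.append(same / total if total else 0.0)
        reach |= at_k
    return ratios

# toy path graph 0 - 1 - 2 with labels [0, 0, 1]
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
ratios = hop_homophily(A, np.array([0, 0, 1]), max_hop=2)
```

On real graphs such as Pubmed, a curve of `ratios` against the hop index decays with distance, which is exactly the diminishing information-to-noise ratio argued above.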
3 Approach
In Sec. 3.1, we take a communication perspective on GNN message passing and representation learning. Then, we give an overview of the proposed LadderGNN framework in Sec. 3.2. Next, we explore the dimensions of different hops with an RL-based NAS strategy in Sec. 3.3. Finally, to reduce the computational complexity of the NAS-based solution, we propose an approximate hop-dimension relation function in Sec. 3.4.
3.1 GNN Representation Learning from a Communication Perspective
In GNN representation learning, messages are passed from neighboring nodes to the target node to update its embedding. Figure 2 presents a communication perspective on GNN message passing, wherein we regard the target node as the receiver. Considering that neighboring nodes from different hops tend to contribute unequally (see Figure 1), we group the set of neighboring nodes at the same distance as one transmitter, and hence we have $K$ transmitters if we would like to aggregate up to $K$ hops. The dimension of the message can be regarded as the communication channel capacity. Then, GNN message passing becomes a multi-source communication problem.
Some existing GNN message-passing schemes (e.g., SGC Wu et al. [2019], JKNet Xu et al. [2018b], and S²GC Zhu and Koniusz [2021]) aggregate neighboring nodes before transmission, as shown in Figure 2(b), which mixes clean and noisy information sources directly. Other hop-aware GNN message-passing schemes (e.g., AM-GCN Wang et al. [2020], MultiHop Zhu et al. [2019], and MixHop Abu-El-Haija et al. [2019]), as shown in Figure 2(c), first conduct aggregation within each hop (i.e., using separate weight matrices) before transmission over the communication channel, but the messages are again mixed afterward.
Different from a conventional communication system that employs a well-developed encoder for the information source, one of the primary tasks in GNN representation learning is to learn an effective encoder that extracts useful information with the help of supervision. Consequently, the mixing of clean information sources (mostly low-order neighbors) and noisy information sources (mostly high-order neighbors) makes the extraction of discriminative features challenging.
The above motivates us to perform GNN message passing without mixing up messages from different hops, as shown in Figure 2(d). At the receiver, we concatenate the messages from various hops, and such disentangled representations facilitate extracting useful information from various hops with little impact on each other. Moreover, dimensionality significantly impacts the generalization and representation capabilities of any neural network [Srivastava et al., 2014, Liu and Gillies, 2016, Alon and Yahav, 2020, Bartlett et al., 2020], as it controls the amount of quality information learned from data. In GNN message passing, the information-to-noise ratio of low-order neighbors is usually higher than that of high-order neighbors. Therefore, we tend to allocate more dimensions to close neighbors than to distant ones, leading to a ladder-style aggregation scheme.
3.2 The Proposed Ladder-Aggregation Framework
With the above, Figure 3 shows the node representation update procedure in the proposed LadderGNN architecture. For a particular target node (the center node in the figure), we first conduct node aggregation within each hop, which can be performed with existing GNN aggregation methods (e.g., GCN or GAT). Next, we determine the dimensions for the aggregated messages from different hops and then concatenate them (instead of mixing them up) for inter-hop aggregation. Finally, we perform a linear transformation to generate the updated node representation.
Specifically, given the graph $\mathcal{G}$ for representation learning, let $K$ be the maximum number of neighboring hops for node aggregation. For each group of neighboring nodes at hop $k$, we determine their respective optimal dimensions and then concatenate their embeddings into $H$ as follows:

$$H = \big\Vert_{k=1}^{K}\, \hat{A}_k X W_k, \tag{2}$$

where $\hat{A}_k$ is the normalized adjacency matrix of the $k$-th hop, $X$ is the input feature, and $\Vert$ denotes concatenation. A learnable matrix $W_k$ controls the output dimension $d_k$ of the $k$-th hop. Encoding messages from different hops with distinct $W_k$ avoids the over-mixing of neighbors, thereby alleviating the impact of noisy information sources on clean information sources during GNN message passing.
Accordingly, $H$ is a hop-aware disentangled representation of the target node. Then, with the classifier after the linear layer $W_{out}$, we have:

$$\hat{Y} = \mathrm{softmax}\big(H W_{out}\big), \tag{3}$$

where $\hat{Y}$ contains the output softmax values. Given the supervision of some nodes, we can use a cross-entropy loss to calculate gradients and optimize the above weights in an end-to-end manner.
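A minimal NumPy sketch of the forward pass in Eqs. (2)–(3) — per-hop propagation with hop-specific output dimensions, concatenation, and a softmax classifier; the random weights and toy normalized adjacencies below are stand-ins for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def ladder_forward(A_hops, X, dims, n_classes):
    """Per-hop propagation A_k @ X @ W_k with output dimension dims[k],
    concatenation into a hop-aware embedding H, then softmax(H @ W_out)."""
    parts = []
    for A_k, d_k in zip(A_hops, dims):
        W_k = rng.normal(size=(X.shape[1], d_k)) * 0.1  # hop-specific projection
        parts.append(A_k @ X @ W_k)
    H = np.concatenate(parts, axis=1)                   # disentangled representation
    W_out = rng.normal(size=(H.shape[1], n_classes)) * 0.1
    Z = H @ W_out
    Z -= Z.max(axis=1, keepdims=True)                   # numerically stable softmax
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)
    return H, P

# toy graph: row-normalized 1-hop and 2-hop adjacencies, 8-dim input features
A = (rng.random((5, 5)) < 0.5).astype(float)
np.fill_diagonal(A, 0)
A1 = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
H, P = ladder_forward([A1, A1 @ A1], rng.normal(size=(5, 8)),
                      dims=[8, 4], n_classes=3)
```

Note the ladder-style allocation in `dims`: the 1-hop message keeps the full width while the 2-hop message is squeezed into fewer dimensions before concatenation.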
3.3 HopAware Dimension Search
Neural architecture search (NAS) aims to automatically design deep neural networks with comparable or even higher performance than manual designs by experts, and it has been extensively researched in recent years (e.g., [Bello et al., 2017, Tan et al., 2019, Zoph et al., 2018, Liu et al., 2018, 2019b]). Existing NAS works for GNNs [Gao et al., 2020, Zhou et al., 2019, Shi et al., 2020] search graph architectures (e.g., 1-hop aggregators, activation functions, aggregation types, attention types, etc.) and hyperparameters to reach better performance. However, they do not consider aggregating multi-hop neighbors, let alone the dimensionality of each hop. In this section, we propose to automatically search for an optimal dimension for each hop in our LadderGNN architecture.
Search Space: Different from previous NAS works for GNNs [Gao et al., 2020, Zhou et al., 2019, You et al., 2020], our search space is the dimension of each hop, called hop-dimension combinations. To limit the search space for hop-dimension combinations, we apply two sampling strategies: (1) exponential sampling $\{2^0, 2^1, \ldots, 2^q, \ldots, 2^{q_{\max}}\}$; (2) even sampling $\{s, 2s, \ldots, is, \ldots, ns\}$, where $q_{\max}$ and $s$ are hyperparameters representing the maximum exponent and the sampling granularity used to cover the possible dimensions. For each strategy, the search space also covers the dimension of the initial input feature $d_{in}$.
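The two sampling strategies can be sketched as follows; the hyperparameter names (`q_max`, `step`, `n_even`) and their defaults are illustrative, not the paper's exact settings:

```python
def dimension_search_space(d_in, q_max=7, step=32, n_even=8):
    """Candidate per-hop dimensions: exponential sampling {2^0, ..., 2^q_max}
    and even sampling {step, 2*step, ..., n_even*step}; both sets also
    include the input feature dimension d_in."""
    exponential = sorted({2 ** q for q in range(q_max + 1)} | {d_in})
    even = sorted({step * i for i in range(1, n_even + 1)} | {d_in})
    return exponential, even

exp_space, even_space = dimension_search_space(d_in=100)
```

Exponential sampling keeps the candidate list short while still reaching the single-digit dimensions that, as Sec. 4.1 shows, high-order hops tend to receive.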
Basic Search Algorithm: Given the search space, we target finding the best model to maximize the expected validation accuracy. We choose a reinforcement learning strategy since its reward is easy to customize for our problem. As shown in Figure 4, the basic strategy uses an LSTM controller with parameters $\theta$ to generate a sequence of actions of length $K$, where each hop dimension $d_k$ is sampled from the search space mentioned above. Then, we build the model described in Sec. 3.2 and train it with a cross-entropy loss function. Next, we test it on the validation set to obtain an accuracy $R$, use this accuracy as a reward signal, and perform a policy gradient algorithm to update the parameters $\theta$, so that the controller generates better hop-dimension combinations iteratively. The objective function of the model is:

$$J(\theta) = \mathbb{E}_{P(d_{1:K};\,\theta)}[R]. \tag{4}$$
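A toy REINFORCE loop for the objective in Eq. (4); an independent softmax policy per hop stands in for the paper's LSTM controller, and a synthetic reward stands in for validation accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_dimension_search(reward, n_choices, n_hops, steps=300, lr=0.5):
    """Sample one dimension index per hop, score the combination with the
    reward, and ascend grad log-prob * advantage (moving-average baseline)."""
    theta = np.zeros((n_hops, n_choices))  # policy parameters
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(theta[h]) for h in range(n_hops)]
        actions = [rng.choice(n_choices, p=p) for p in probs]
        R = reward(actions)
        baseline = 0.9 * baseline + 0.1 * R
        for h, a in enumerate(actions):
            grad = -probs[h]
            grad[a] += 1.0                 # d log pi / d theta = onehot - probs
            theta[h] += lr * (R - baseline) * grad
    return [int(theta[h].argmax()) for h in range(n_hops)]

# synthetic reward: fraction of hops whose sampled index matches a target
target = [2, 0]
best = reinforce_dimension_search(
    lambda acts: sum(int(a == t) for a, t in zip(acts, target)) / len(target),
    n_choices=3, n_hops=2)
```

In the real setting the reward call would train and validate a model from Sec. 3.2, which is exactly why reducing the number of candidate combinations (next paragraph) matters.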
Conditionally Progressive Search Algorithm: The search space of the basic search algorithm is extremely large, e.g., its size grows exponentially with the number of hops $K$ under exponential sampling. This makes it difficult to find the right combinations with limited computational resources. To improve search efficiency, and observing that a large number of actions in our search space are redundant, we propose a conditionally progressive search algorithm. That is, instead of searching the entire space all at once, we divide the search process into multiple phases, starting with a relatively small number of hops. After obtaining the NAS results, we only keep those hop-dimension combinations (regarded as the conditional search space) with high validation accuracy. Next, we conduct the hop-dimension search for the next hop based on the conditional search space filtered in the last step, and again keep those combinations with high accuracy. This procedure is conducted progressively until aggregating more hops cannot boost performance. With this algorithm, we largely prune the redundant search space and enhance search efficiency.
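The conditionally progressive search can be sketched as follows; `evaluate` is a hypothetical stand-in for training a model with the given hop dimensions and returning its validation accuracy:

```python
import itertools

def progressive_search(evaluate, space, max_hops, keep=3, start_hops=2):
    """Search hops 1..start_hops exhaustively, then repeatedly extend only
    the top-`keep` combinations (the conditional search space) by one hop,
    stopping once an extra hop no longer improves the best score."""
    combos = [list(c) for c in itertools.product(space, repeat=start_hops)]
    scored = sorted(combos, key=evaluate, reverse=True)[:keep]
    best = evaluate(scored[0])
    for _ in range(start_hops, max_hops):
        extended = [c + [d] for c in scored for d in space]
        scored = sorted(extended, key=evaluate, reverse=True)[:keep]
        new_best = evaluate(scored[0])
        if new_best <= best:  # aggregating more hops stopped helping
            break
        best = new_best
    return scored[0], best

# toy objective: accuracy = number of hops matching a "true" ladder [8,4,2,1]
truth = [8, 4, 2, 1]
dims, score = progressive_search(
    lambda c: sum(int(d == t) for d, t in zip(c, truth)),
    space=[1, 2, 4, 8], max_hops=4)
```

Each phase evaluates only `keep × |space|` candidates instead of the full `|space|^K` product, which is where the claimed efficiency gain comes from.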
3.4 The Approximate Hop-Dimension Relation Function
The computational resources required to conduct NAS are still substantial, even with the proposed progressive search algorithm. After analyzing the hop-dimension combinations returned by NAS in Sec. 4.1, we find that most of the satisfactory combinations show rather consistent results.
This motivates us to propose a simple yet effective hop-dimension relation function to approximate the NAS solutions. The output dimension of hop $k$ is:

$$d_k = \begin{cases} d_{in}, & k \le k_0, \\ d_{in} \cdot r^{\,k-k_0}, & k > k_0, \end{cases} \tag{5}$$

where $r$ is the dimension compression ratio, $d_{in}$ is the dimension of the input feature, and $k_0$ denotes the hop up to which the initial dimension is kept. With such an approximate function, there is only one hyperparameter $r$ to determine, significantly reducing the computational cost. Moreover, under the approximate solution, the low-order neighbors within $k_0$ hops are directly aggregated with the initial feature dimension, while high-order neighbors are associated with exponentially decreasing dimensions.
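Under these assumptions, Eq. (5) amounts to a one-line function; the threshold hop `k0` and the default ratio below are illustrative values, not the paper's searched ones:

```python
def hop_dimension(k, d_in, r=0.125, k0=2):
    """Hops k <= k0 keep the input dimension d_in; beyond k0 the dimension
    shrinks geometrically by the compression ratio r per extra hop."""
    return max(1, round(d_in * r ** max(0, k - k0)))

ladder = [hop_dimension(k, d_in=512) for k in range(1, 6)]  # ladder-style profile
```

The resulting profile (full width for close hops, rapidly shrinking width for distant ones) is the "ladder" that gives the architecture its name.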
Combination with Other Node Aggregation Schemes: The proposed hop-aware aggregation strategy is orthogonal to existing node-wise aggregation methods within each hop. For example, the low-order neighbors can be aggregated with the SGC aggregation scheme while high-order neighbors are aggregated with the proposed ladder-style scheme. This formulation is similar to the searched framework, where low-order neighbors tend to retain their dimensions; we call this architecture LadderSGC. Similarly, we can employ the GAT aggregation mechanism as the Hop-1 aggregator while the remaining hops are aggregated with the proposed ladder-style scheme; we call this architecture LadderGAT.
4 Experiment
In this section, we validate the effectiveness of LadderGNN on several widely-used semi-supervised node classification datasets. We first analyze the NAS results in Sec. 4.1. Then, we compare with existing GNN representation learning works in Sec. 4.2. Last, we present an ablation study on the proposed hop-dimension relation function in Sec. 4.3.
Data description: For the semi-supervised node classification task, we evaluate our method on six datasets: Cora Yang et al. [2016], Citeseer Yang et al. [2016], Pubmed Yang et al. [2016], Flickr Meng et al. [2019], Arxiv Hu et al. [2020], and Products Hu et al. [2020]. We split the training, validation, and test sets following earlier works. We also conduct experiments on metapath-based heterogeneous graphs, wherein we differentiate attributes and aggregate them in a similar disentangled manner. Due to page limits, more details about dataset descriptions, the data preprocessing procedure, and experimental settings, along with the heterogeneous graph results, are listed in the Appendix.
4.1 Results from Neural Architecture Search
There exist a number of NAS approaches for GNN models, including random search (e.g., AGNN Zhou et al. [2019]), reinforcement learning-based solutions (e.g., GraphNAS Gao et al. [2020] and AGNN Zhou et al. [2019]), and evolutionary algorithms (e.g., Genetic-GNN Shi et al. [2020]). In this work, instead of searching GNN model structures, NAS is used to search for appropriate dimensions allocated to each hop.

Method  Cora  Citeseer  Pubmed

Random (w/o share) Zhou et al. [2019]  81.4  72.9  77.9
Random (with share) Zhou et al. [2019]  82.3  69.9  77.9
GraphNAS (w/o share) Gao et al. [2020]  82.7  73.5  78.8
GraphNAS (with share) Gao et al. [2020]  83.3  72.4  78.1
AGNN (w/o share) Zhou et al. [2019]  83.6  73.8  79.7
AGNN (with share) Zhou et al. [2019]  82.7  72.7  79.0
Genetic-GNN Shi et al. [2020]  83.8  73.5  79.2
Ours (w/o cond.)  82.0  72.9  79.6
Ours (with cond.)  83.5  74.8  80.9
Ours (approx.)  83.3  74.7  80.0
In particular, we search the hop-dimension combinations on the Cora, Citeseer, and Pubmed datasets and show the experimental results in Table 1. Compared with existing NAS methods, our method achieves better results with the conditionally progressive search algorithm on Citeseer and Pubmed, improving over Genetic-GNN by 1.4% and 2.1%, respectively. Meanwhile, we achieve comparable accuracy on Cora by considering only the hop-dimension combinations. Moreover, compared with the variant without the conditional strategy (w/o cond.), the conditionally progressive search brings up to 2.6% improvement. We leave the analysis of the conditional strategy to the Appendix.
Figure 5 presents the histogram of the potential dimension assignments for different hops. We can observe: (i) for low-order neighbors, most of the found solutions with high accuracy keep the initial feature dimension; (ii) most of the candidate dimensions for higher hops are only in single digits, which inspires the conditional strategy to greatly reduce the search space; (iii) the dimensionality tends to be reduced for high-order neighbors, and solutions approximating exponentially decreasing dimensions occupy a relatively large proportion. Accordingly, we can use the approximate relation function to find proper hop-dimension combinations with performance comparable to the NAS solution (see Table 1).
4.2 Comparison with Related Work
Table 2 compares LadderGNN with three groups of existing methods (general GNNs, GNNs with modified graph structures, and hop-aware GNNs) in terms of Top-1 accuracy (%) on six datasets. The results of Ours on {Cora, Citeseer, Pubmed} are obtained with the conditionally progressive search, and we use the proposed hop-dimension relation function for the other datasets and configurations. Our methods achieve state-of-the-art performance in most cases and surpass all hop-aware GNNs on all datasets. Specifically, our method shows the largest improvement on Flickr (9.1%). Flickr has a higher edge-node ratio and a large feature size, and thus contains more noise and requires more dimension compression to denoise, especially for high-order neighbors. A greater degree of noise can be reduced by the proposed LadderGNN, resulting in a greater enhancement. Meanwhile, we improve over GXN by 1.5% on Citeseer, DisenGCN by 0.5% on Pubmed, GAT by 0.4% on Arxiv, and GAT by 1.6% on Products. Compared with Cora and Pubmed, Citeseer has a lower graph homophily rate Pei et al. [2020], which makes high-order neighbors noisier and information harder to extract. Our method brings relatively larger gains on Citeseer, which demonstrates its effectiveness in handling noisy graphs. Last, although the best GNN models for different tasks vary significantly You et al. [2020], our simple hop-dimension relation function consistently outperforms other hop-aware methods on all datasets, indicating its effectiveness.
Method  Cora  Citeseer  Pubmed  Flickr  Arxiv  Products  

General GNNs  ChebNet Defferrard et al. [2016]  81.2  69.8  74.4  23.3     
GCN Kipf and Welling [2016]  81.5  70.3  79.0  41.1  71.7  75.6  
GraphSage Hamilton et al. [2017]  81.3  70.6  75.2  57.4  71.5  78.3  
GAT* Veličković et al. [2017]  78.9  71.2  79.0  46.9  73.6  79.5  
JKNet Xu et al. [2018b]  80.2  67.6  78.1  56.7  72.2    
SGC Wu et al. [2019]  81.0  71.9  78.9  67.3  68.9  68.9  
APPNP Klicpera et al. [2018]  81.8  72.6  79.8    71.4    
ALaGCN Xie et al. [2020]  82.9  70.9  79.6        
MCN Lee et al. [2018]  83.5  73.3  79.3        
DisenGCN Ma et al. [2019]  83.7  73.4  80.5        
FAGCN Bo et al. [2021]  84.1  72.7  79.4        
Modified structures  DropEdge-GCN Rong et al. [2019]  82.0  71.8  79.6  61.4     
DropEdge-IncepGCN Rong et al. [2019]  82.9  72.7  79.5        
AdaEdge-GCN Chen et al. [2020]  82.3  69.7  77.4        
PTDNet-GCN Luo et al. [2021]  82.8  72.7  79.8        
GXN Zhao et al. [2020]  83.2  73.7  79.6        
SPAGAN Yang et al. [2021]  83.6  73.0  79.6        
Hop-aware GNNs  High-Order Morris et al. [2019]  76.6  64.2  75.0       
MixHop Abu-El-Haija et al. [2019]  80.5  69.8  79.3  39.6      
GB-GNN Oono and Suzuki [2020]  80.8  70.8  79.2        
HWGCN Liu et al. [2019a]  81.7  70.7  79.3        
MultiHop Zhu et al. [2019]  82.4  71.5  79.4        
HLHG Lei et al. [2019]  82.7  71.5  79.3        
AM-GCN Wang et al. [2020]  82.6  73.1  79.6        
N-GCN Abu-El-Haija et al. [2020]  83.0  72.2  79.5        
S²GC Zhu and Koniusz [2021]  83.5  73.6  80.2    72.0  76.8  

Ours  83.5  74.8  80.9  73.4  72.1  78.7 
LadderGAT  82.6  73.8  80.6  71.4  73.9  80.8 
Our method focuses on hop-aware aggregation; hence, it can easily be combined with any node-wise aggregation scheme, such as GAT or SGC. We investigate combining two such node-wise aggregation frameworks with our proposed ladder-style scheme to boost performance and robustness when aggregating high-order neighbors.
Improving the high-order aggregation of GAT. GAT Veličković et al. [2017] distinguishes the relationships among direct neighbors, showing better performance than GCN on many graphs. For GAT, we first explore whether GAT itself can aggregate higher-order neighbors, using eight attention heads and four channel sizes {1, 4, 8, 16} per head in its self-attention scheme. To aggregate multi-hop neighbors, both a deeper structure (stacking layers), denoted DGAT (blue lines), and a wider network (one layer that computes and aggregates multiple-order neighbors), denoted WGAT (orange lines), are compared in Figure 6(a). As we can see, DGAT suddenly suffers from the over-smoothing problem beyond a certain depth, especially for larger channel sizes. Moreover, WGAT degrades gradually due to over-fitting. Thus, both aggregations lose performance as the number of hops increases. On the contrary, the proposed LadderGAT (purple line) is robust to the increasing number of hops, since the ladder-style scheme relieves the above problems when aggregating high-order hops.
Improving the high-order aggregation of SGC. SGC Wu et al. [2019] is effective when aggregating low-order hops. In Figure 6(b), we compare SGC with LadderSGC, where the low-order hops are aggregated without dimension reduction and the dimensions of higher-order hops are compressed. The purple line shows an interesting setting: only the last hop is compressed to a lower dimension. Under the same number of hops, LadderSGC achieves consistent improvement over SGC (orange line).
To explain these results, we find that the weights updated in SGC are affected by (i) the information-to-noise ratios within different hops and (ii) the over-squashing phenomenon Alon and Yahav [2020] – information from the exponentially-growing receptive field is compressed into fixed-length node vectors, making it difficult for distant neighbors to play a role. This observation suggests that compressing the dimension of the last hop mitigates the over-squashing problem in SGC, which thus consistently improves the performance of high-order aggregation.

4.3 Ablation Study
To analyze the effects of the hop-dimension function in Eq. (5), we conduct experiments varying its two dominant hyperparameters: the furthest hop $L$ and the dimension compression rate $r$. Although we design a NAS to explore priors for these hyperparameters, the ablation study in this section helps understand the impact of each one. Due to space limitations, we only present the results on Citeseer; results on other datasets are included in the Appendix.
In Table 3, the compression rate $r$ is varied over a wide range for comparison. We can observe that (i) when increasing the furthest hop $L$ with fixed $r$, the performance increases until saturation. This is because aggregating additional hops adds little further information once $L$ is large, while LadderGNN learns to suppress the added noise via dimension compression. (ii) When decreasing the decay rate $r$ with fixed $L$, the performance first increases and then drops in most situations. The initial increase arises because a smaller compression rate maps distant nodes to a lower and more suitable dimension space, suppressing their interference. However, there is an upper bound to these improvements: when $r$ becomes too small, the reduced dimension is too low to preserve the overall structural information, leading to worse performance in most cases. (iii) The effective rates are mainly {0.125, 0.0625}, which achieve better results for most $L$; under the best setting, we obtain an accuracy of 74.7%. (iv) Note that there are significant improvements with dimension compression compared to dimension increase, which validates the basic principle of dimension compression.
=  60.7  67.6  64.8  63.0  61.2  58.9  55.8  50.2 

=  65.5  71.5  72.3  72.8  73.0  73.1  73.2  73.7 
=  68.8  72.9  73.3  73.9  73.6  73.5  73.4  73.0 
=  71.0  73.5  74.1  74.0  74.2  74.3  74.0  72.8 
=  69.3  73.1  73.6  74.7  74.3  74.3  73.8  74.2 
=  67.2  71.4  72.7  73.0  73.2  73.8  73.5  73.3 

5 Conclusion
In this work, we propose a simple yet effective ladder-style GNN architecture, namely LadderGNN. Specifically, we take a communication perspective on the GNN representation learning problem, which motivates us to separate messages from different hops and assign different dimensions to them before concatenating them to obtain the node representation. The resulting ladder-style aggregation scheme facilitates extracting discriminative features effectively compared with existing solutions. Experimental results on various semi-supervised node classification tasks show that the proposed simple LadderGNN solution achieves state-of-the-art performance on most datasets.
References

Abu-El-Haija et al. [2019] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pages 21–29. PMLR, 2019. 
Abu-El-Haija et al. [2020] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee. N-GCN: Multi-scale graph convolution for semi-supervised node classification. In Uncertainty in Artificial Intelligence, pages 841–851. PMLR, 2020.  Alon and Yahav [2020] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205, 2020.

Bartlett et al. [2020] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.  Bello et al. [2017] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le. Neural optimizer search with reinforcement learning. ArXiv, abs/1709.07417, 2017.
 Bo et al. [2021] D. Bo, X. Wang, C. Shi, and H. Shen. Beyond low-frequency information in graph convolutional networks. arXiv preprint arXiv:2101.00797, 2021.
 Chen et al. [2020] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In AAAI, pages 3438–3445, 2020.
 Defferrard et al. [2016] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
 Gao et al. [2020] Y. Gao, H. Yang, P. Zhang, C. Zhou, and Y. Hu. Graph neural architecture search. In Proceedings of the TwentyNinth International Joint Conference on Artificial Intelligence, IJCAI20, pages 1403–1409, 7 2020. URL https://doi.org/10.24963/ijcai.2020/195.
 Gilmer et al. [2017] J. Gilmer, S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. ArXiv, abs/1704.01212, 2017.
 Hamilton et al. [2017] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
 Hu et al. [2020] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
 Kipf and Welling [2016] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Klicpera et al. [2018] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
 Lee et al. [2018] J. B. Lee, R. A. Rossi, X. Kong, S. Kim, E. Koh, and A. Rao. Higher-order graph convolutional networks. arXiv preprint arXiv:1809.07697, 2018.
 Lei et al. [2019] F. Lei, X. Liu, Q. Dai, B. W.-K. Ling, H. Zhao, and Y. Liu. Hybrid low-order and higher-order graph convolutional networks. arXiv preprint arXiv:1908.00673, 2019.

Li et al. [2019] G. Li, M. Muller, A. Thabet, and B. Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In Proceedings of the IEEE International Conference on Computer Vision, pages 9267–9276, 2019.  Li et al. [2018] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv preprint arXiv:1801.07606, 2018.
 Liu et al. [2018] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 Liu and Gillies [2016] R. Liu and D. F. Gillies. Overfitting in linear feature extraction for classification of highdimensional image data. Pattern Recognition, 53:73–86, 2016.
 Liu et al. [2019a] S. Liu, L. Chen, H. Dong, Z. Wang, D. Wu, and Z. Huang. Higher-order weighted graph convolutional networks. arXiv preprint arXiv:1911.04129, 2019a.
 Liu et al. [2019b] Y. Liu, J. Liu, A. Zeng, and X. Wang. Differentiable kernel evolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019b.
 Luo et al. [2021] D. Luo, W. Cheng, W. Yu, B. Zong, J. Ni, H. Chen, and X. Zhang. Learning to drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 779–787, 2021.
 Ma et al. [2019] J. Ma, P. Cui, K. Kuang, X. Wang, and W. Zhu. Disentangled graph convolutional networks. In International Conference on Machine Learning, pages 4212–4221. PMLR, 2019.
 Meng et al. [2019] Z. Meng, S. Liang, H. Bao, and X. Zhang. Co-embedding attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 393–401, 2019.
 Morris et al. [2019] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4602–4609, 2019. doi: 10.1609/aaai.v33i01.33014602. URL https://ojs.aaai.org/index.php/AAAI/article/view/4384.
 Oono and Suzuki [2020] K. Oono and T. Suzuki. Optimization and generalization analysis of transduction through gradient boosting and application to multi-scale graph neural networks. arXiv preprint arXiv:2006.08550, 2020.
 Pei et al. [2020] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang. Geom-gcn: Geometric graph convolutional networks. arXiv preprint arXiv:2002.05287, 2020.
 Rong et al. [2019] Y. Rong, W. Huang, T. Xu, and J. Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2019.
 Shi et al. [2020] M. Shi, D. A. Wilson, X. Zhu, Y. Huang, Y. Zhuang, J. Liu, and Y. Tang. Evolutionary architecture search for graph neural networks. arXiv preprint arXiv:2009.10199, 2020.
 Srivastava et al. [2014] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Tan et al. [2019] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
 Veličković et al. [2017] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 Wang et al. [2019] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022–2032, 2019.
 Wang et al. [2020] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei. Am-gcn: Adaptive multi-channel graph convolutional networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1243–1253, 2020.
 Wu et al. [2019] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
 Xie et al. [2020] Y. Xie, S. Li, C. Yang, R. C.-W. Wong, and J. Han. When do gnns work: Understanding and improving neighborhood aggregation. In C. Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 1303–1309. International Joint Conferences on Artificial Intelligence Organization, 7 2020. doi: 10.24963/ijcai.2020/181. URL https://doi.org/10.24963/ijcai.2020/181. Main track.
 Xu et al. [2018a] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018a.
 Xu et al. [2018b] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018b.
 Yang et al. [2021] Y. Yang, X. Wang, M. Song, J. Yuan, and D. Tao. Spagan: Shortest path graph attention network. arXiv preprint arXiv:2101.03464, 2021.
 Yang et al. [2016] Z. Yang, W. Cohen, and R. Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pages 40–48. PMLR, 2016.
 You et al. [2020] J. You, Z. Ying, and J. Leskovec. Design space for graph neural networks. Advances in Neural Information Processing Systems, 33, 2020.
 Zhang et al. [2020] L. Zhang, Y. Ge, and H. Lu. Hop-hop relation-aware graph neural networks. CoRR, abs/2012.11147, 2020.
 Zhao et al. [2020] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah. Data augmentation for graph neural networks. arXiv preprint arXiv:2006.06830, 2020.
 Zhou et al. [2019] K. Zhou, Q. Song, X. Huang, and X. Hu. Auto-gnn: Neural architecture search of graph neural networks. ArXiv, abs/1909.03184, 2019.
 Zhu and Koniusz [2021] H. Zhu and P. Koniusz. Simple spectral graph convolution. In International Conference on Learning Representations, 2021.
 Zhu et al. [2019] Q. Zhu, B. Du, and P. Yan. Multi-hop convolutions on weighted graphs. arXiv preprint arXiv:1911.04978, 2019.
 Zoph et al. [2018] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.