Hop-Aware Dimension Optimization for Graph Neural Networks

In Graph Neural Networks (GNNs), the embedding of each node is obtained by aggregating information from its direct and indirect neighbors. As the messages passed among nodes contain both information and noise, the critical issue in GNN representation learning is how to retrieve information effectively while suppressing noise. Generally speaking, interactions with distant nodes usually introduce more noise for a particular node than those with close nodes. However, in most existing works, the messages being passed among nodes are mingled together, which is inefficient from a communication perspective. Mixing the information from clean sources (low-order neighbors) and noisy sources (high-order neighbors) makes discriminative feature extraction challenging. Motivated by the above, we propose a simple yet effective ladder-style GNN architecture, namely LADDER-GNN. Specifically, we separate messages from different hops and assign different dimensions to them before concatenating them to obtain the node representation. Such disentangled representations facilitate extracting information from messages passed from different hops, and their corresponding dimensions are determined with a reinforcement learning-based neural architecture search strategy. The resulting hop-aware representations generally contain more dimensions for low-order neighbors and fewer dimensions for high-order neighbors, leading to a ladder-style aggregation scheme. We verify the proposed LADDER-GNN on several semi-supervised node classification datasets. Experimental results show that the proposed simple hop-aware representation learning solution achieves state-of-the-art performance on most datasets.

1 Introduction

Recently, a large number of research efforts have been dedicated to applying deep learning methods to graphs, known as graph neural networks (GNNs) [Kipf and Welling, 2016, Veličković et al., 2017], achieving great success in modeling non-structured data, e.g., social networks and recommendation systems.

Learning an effective low-dimensional embedding to represent each node in the graph is arguably the most important task in GNN learning, wherein the node embedding is obtained by aggregating information from its direct and indirect neighboring nodes passed through GNN layers [Gilmer et al., 2017]. Earlier GNN works usually aggregate only neighboring nodes within a short range. For many graphs, this may cause the so-called under-reaching issue [Alon and Yahav, 2020] – informative yet distant nodes are not involved, leading to unsatisfactory results. Consequently, many techniques that aggregate high-order neighbors by deepening or widening the network have been proposed [Li et al., 2018, Xu et al., 2018a, b, Li et al., 2019, Zhu and Koniusz, 2021]. However, when we aggregate information from too many neighbors, especially high-order ones, the so-called over-smoothing problem Li et al. [2018] may occur, making nodes less distinguishable [Chen et al., 2020]. To alleviate this problem, several hop-aware GNN aggregation schemes have been proposed in the literature [Abu-El-Haija et al., 2019, Zhang et al., 2020, Zhu et al., 2019, Wang et al., 2019].

While showing promising results, most existing GNN representation learning works mingle the messages passed among nodes. From a communication perspective, mixing the information from clean sources (mostly low-order neighbors) with that from noisy sources (mostly high-order neighbors) inevitably makes it difficult for the receiver (i.e., the target node) to extract information. Motivated by this observation, we propose a simple yet effective ladder-style GNN architecture, namely Ladder-GNN. Specifically, the contributions of this paper include:

  • We take a communication perspective on GNN message passing. That is, we regard the target node for representation learning as the receiver and group the set of neighboring nodes with the same distance to it as a transmitter that carries both information and noise. The dimension of the message can be regarded as the capacity of the communication channel. Then, aggregating neighboring information from multiple hops becomes a multi-source communication problem with multiple transmitters over the communication channel.

  • To simplify node representation learning, we propose to separate the messages from different transmitters (i.e., Hop-$k$ neighbors), each occupying a proportion of the communication channel (i.e., disjoint message dimensions). As the information-to-noise ratio from high-order neighbors is usually lower than that from low-order neighbors, the resulting hop-aware representation is usually unbalanced, with more dimensions allocated to low-order neighbors, leading to a ladder-style aggregation scheme.

  • We propose a reinforcement learning-based neural architecture search (NAS) strategy to determine the dimension allocation for neighboring nodes from different hops. Based on the search results, we propose an approximate hop-dimension relation function, which generates results close to the NAS solution without performing the computationally expensive search.

To verify the effectiveness of the proposed simple hop-aware representation learning solution, we evaluate it on several semi-supervised node classification datasets. Experimental results show that the proposed Ladder-GNN solution achieves state-of-the-art performance on most of them.

2 Related Work and Motivation

Graph neural networks adopt message passing to learn node embeddings, which involves two steps for each node: neighbor aggregation and linear transformation [Gilmer et al., 2017]. The following formula presents the mathematical form of message passing in a graph convolutional network (GCN) [Kipf and Welling, 2016]. Given an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with adjacency matrix $A$, we can aggregate node features at the $l$-th layer as:

$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right) \tag{1}$$

where $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ is the augmented normalized adjacency matrix of $\mathcal{G}$, $I$ is the identity matrix, $\tilde{D}$ is the degree matrix of $A + I$, $W^{(l)}$ is the trainable weight matrix at the $l$-th layer used to update node embeddings, and $\sigma(\cdot)$ is an activation function.
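For concreteness, the following is a minimal PyTorch sketch of one such GCN aggregation step (Eq. (1)); the function name and the dense-tensor interface are illustrative, not taken from any particular library.

import torch

def gcn_layer(a, h, w, act=torch.relu):
    """One GCN aggregation step as in Eq. (1).

    a: (N, N) adjacency matrix (float tensor), h: (N, d_in) node features,
    w: (d_in, d_out) weight matrix (a plain tensor in this sketch).
    """
    a_tilde = a + torch.eye(a.size(0))             # add self-loops: A + I
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)      # diagonal of D^{-1/2}
    a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]   # D^{-1/2} (A + I) D^{-1/2}
    return act(a_hat @ h @ w)                      # aggregate, transform, activate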

The original GCN work does not differentiate the messages passed from direct neighbors. The graph attention network (GAT) [Veličković et al., 2017] applies a multi-head self-attention mechanism to compute the importance of neighboring nodes during aggregation and achieves better solutions than GCN in many cases. Later, considering that high-order nodes also carry relevant information, some GNN architectures [Li et al., 2019, Xu et al., 2018b, a] stack deeper networks to retrieve information from high-order neighbors recursively. To mitigate the possible over-fitting (due to model complexity) when aggregating high-order neighbors, SGC [Wu et al., 2019] removes the redundant nonlinear operations and directly aggregates node representations from multiple hops. Aggregating high-order neighbors, however, may cause the so-called over-smoothing problem that results in less discriminative node representations (due to over-mixing) Li et al. [2018]. Consequently, various hop-aware aggregation solutions have been proposed in the literature. Some of them (e.g., HighOrder Morris et al. [2019], MixHop [Abu-El-Haija et al., 2019], N-GCN Abu-El-Haija et al. [2020], GB-GNN Oono and Suzuki [2020]) employ multiple convolutional branches to aggregate neighbors from different hops. Others (e.g., AM-GCN Wang et al. [2020], HWGCN Liu et al. [2019a], MultiHop Zhu et al. [2019]) try to learn adaptive attention scores when aggregating neighboring nodes from different hops.

Figure 1: Histograms of the number of nodes (y-axis) with different homophily ratios Pei et al. [2020], i.e., the percentage of neighbors sharing the same label as the target node at Hop-$k$, for $k \in \{1, 2, 4, 8\}$, on the Pubmed dataset.

Undoubtedly, the critical issue in GNN representation learning is how to retrieve information effectively while suppressing noise during message passing. In Figure 1, we plot the homophily ratio for nodes in the Pubmed dataset. As can be observed, with increasing distance, the percentage of neighboring nodes with the same label decreases, indicating a diminishing information-to-noise ratio for messages ranging from low-order neighbors to high-order neighbors. However, existing GNN architectures mix the information from clean sources and noisy sources in representation learning (even for hop-aware aggregation solutions), making discriminative feature extraction challenging. These observations motivate the proposed Ladder-GNN solution, detailed in the following section.
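To make the homophily ratio concrete, the following is a minimal PyTorch sketch that computes the per-node Hop-k ratio plotted in Figure 1; the function name and the dense-adjacency interface are our own assumptions.

import torch

def hop_homophily(adj, labels, k):
    """Per-node homophily ratio at Hop-k: the fraction of neighbors at shortest-path
    distance exactly k that share the node's label. adj: (N, N) binary adjacency
    (float tensor); labels: (N,) integer class labels."""
    a_loop = adj + torch.eye(adj.size(0))
    within_k = torch.matrix_power(a_loop, k) > 0          # reachable within k steps
    within_prev = torch.matrix_power(a_loop, k - 1) > 0   # reachable within k-1 steps
    exact_k = within_k & ~within_prev                     # at distance exactly k
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    num = (exact_k & same_label).sum(dim=1).float()
    return num / exact_k.sum(dim=1).float()               # NaN where a node has no Hop-k neighbor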

3 Approach

In Sec. 3.1, we take a communication perspective on GNN message passing and representation learning. Then, we give an overview of the proposed Ladder-GNN framework in Sec. 3.2. Next, we explore the dimensions of different hops with an RL-based NAS strategy in Sec. 3.3. Finally, to reduce the computational complexity of the NAS-based solution, we propose an approximate hop-dimension relation function in Sec. 3.4.

Figure 2: An illustration of GNN message passing and representation learning from a communication perspective. (a) A communication system contains transmitters that encode source information, a communication channel, and receivers that decode the original information; (b) GNN representation learning with an existing node aggregation scheme; (c) GNN representation learning with an existing hop-aware aggregation scheme; (d) GNN representation learning with the proposed ladder-style aggregation scheme.

3.1 GNN Representation Learning from a Communication Perspective

In GNN representation learning, messages are passed from neighboring nodes to the target node to update its embedding. Figure 2 presents a communication perspective on GNN message passing, wherein we regard the target node as the receiver. Considering that neighboring nodes from different hops tend to contribute unequally (see Figure 1), we group the set of neighboring nodes at the same distance as one transmitter, and hence we have $K$ transmitters if we would like to aggregate up to $K$ hops. The dimension of the message can be regarded as the communication channel capacity. Then, GNN message passing becomes a multi-source communication problem.

Some existing GNN message-passing schemes (e.g., SGC Wu et al. [2019], JKNet Xu et al. [2018b], and S²GC Zhu and Koniusz [2021]) aggregate neighboring nodes before transmission, as shown in Figure 2(b), which directly mixes clean and noisy information sources. Other hop-aware GNN message-passing schemes (e.g., AM-GCN Wang et al. [2020], MultiHop Zhu et al. [2019], and MixHop Abu-El-Haija et al. [2019]), as shown in Figure 2(c), first conduct aggregation within each hop (i.e., using separate weight matrices) before transmission over the communication channel, but the messages are again mixed afterward.

Different from a conventional communication system that employs a well-developed encoder for the information source, one of the primary tasks in GNN representation learning is to learn an effective encoder that extracts useful information with the help of supervision. Consequently, the mixing of clean information sources (mostly low-order neighbors) and noisy information sources (mostly high-order neighbors) makes the extraction of discriminative features challenging.

The above motivates us to perform GNN message passing without mixing up messages from different hops, as shown in Figure 2(d). At the receiver, we concatenate the messages from various hops, and such disentangled representations facilitate extracting useful information from various hops with little impact on each other. Moreover, dimensionality significantly impacts any neural networks’ generalization and representation capabilities [Srivastava et al., 2014, Liu and Gillies, 2016, Alon and Yahav, 2020, Bartlett et al., 2020], as it controls the amount of quality information learned from data. In GNN message passing, the information-to-noise ratio of low-order neighbors is usually higher than that of high-order neighbors. Therefore, we tend to allocate more dimensions to close neighbors than distant ones, leading to a ladder-style aggregation scheme.

3.2 The Proposed Ladder-Aggregation Framework

Based on the above, Figure 3 shows the node representation update procedure in the proposed Ladder-GNN architecture. For a particular target node (the center node in the figure), we first conduct node aggregation within each hop, which can be performed with existing GNN aggregation methods (e.g., GCN or GAT). Next, we determine the dimensions for the aggregated messages from different hops and then concatenate them (instead of mixing them up) for inter-hop aggregation. Finally, we perform a linear transformation to generate the updated node representation.

Figure 3: Illustration of the Ladder-GNN update process for one target node. The height of each bar represents the embedding dimension contributed by the corresponding node; colors indicate hops.

Specifically, given the graph $\mathcal{G}$ for representation learning, let $K$ be the maximum number of neighboring hops for node aggregation. For each group of neighboring nodes at Hop-$k$, we determine their respective optimal dimensions and then concatenate their embeddings into $H_c$ as follows:

$$H_c = \big\Vert_{k=1}^{K} \hat{A}_k X W_k \tag{2}$$

where $\Vert$ denotes concatenation along the feature dimension, $\hat{A}_k$ is the normalized adjacency matrix of the $k$-th hop, and $X$ is the input feature matrix. A learnable matrix $W_k$ controls the output dimension $d_k$ of the $k$-th hop. Encoding messages from different hops with distinct $W_k$ avoids the over-mixing of neighbors, thereby alleviating the impact of noisy information sources on clean information sources during GNN message passing.

Accordingly, $H_c$ is a hop-aware disentangled representation of the target node. Then, with the softmax classifier after the linear layer $W_o$, we have:

$$Z = \mathrm{softmax}(H_c W_o) \tag{3}$$

where $Z$ contains the output softmax values. Given the supervision of some nodes, we can use a cross-entropy loss to calculate gradients and optimize the above weights in an end-to-end manner.
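To illustrate Eqs. (2)-(3), below is a minimal PyTorch sketch of the ladder-style aggregation, in which the Hop-k message is approximated by repeated multiplication with a normalized adjacency matrix; the class and argument names (LadderGNN, hop_dims, a_hat) are ours, and the model returns logits so that the softmax and cross-entropy loss can be applied outside.

import torch
import torch.nn as nn

class LadderGNN(nn.Module):
    """Sketch of the ladder-style aggregation in Eqs. (2)-(3)."""

    def __init__(self, in_dim, num_classes, hop_dims):
        super().__init__()
        # One learnable projection W_k per hop; hop_dims[k-1] is the dimension
        # allocated to Hop-k (more for low-order hops, fewer for high-order ones).
        self.hop_projs = nn.ModuleList(nn.Linear(in_dim, d, bias=False) for d in hop_dims)
        self.out = nn.Linear(sum(hop_dims), num_classes)   # linear layer W_o

    def forward(self, a_hat, x):
        # a_hat: (N, N) normalized adjacency with self-loops; x: (N, in_dim) features.
        h_k, messages = x, []
        for proj in self.hop_projs:
            h_k = a_hat @ h_k               # one more propagation step: roughly A_hat^k X
            messages.append(proj(h_k))      # encode the Hop-k message into its own dimensions
        h_c = torch.cat(messages, dim=-1)   # concatenate instead of mixing (Eq. 2)
        return self.out(h_c)                # logits; softmax/cross-entropy applied outside (Eq. 3)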

3.3 Hop-Aware Dimension Search

Neural architecture search (NAS) aims to automatically design deep neural networks with comparable or even higher performance than manual designs by experts, and it has been extensively researched in recent years (e.g., [Bello et al., 2017, Tan et al., 2019, Zoph et al., 2018, Liu et al., 2018, 2019b]). Existing NAS works for GNNs [Gao et al., 2020, Zhou et al., 2019, Shi et al., 2020] search the graph architectures (e.g., 1-hop aggregators, activation function, aggregation type, attention type, etc.) and hyper-parameters to reach better performance. However, they do not consider aggregating multi-hop neighbors, let alone the dimensionality of each hop. In this section, we propose to automatically search an optimal dimension for each hop in our Ladder-GNN architecture.

Search Space: Different from previous NAS works for GNNs [Gao et al., 2020, Zhou et al., 2019, You et al., 2020], our search space is the dimension of each hop, called hop-dimension combinations. To limit the search space for hop-dimension combinations, we apply two sampling strategies: (1) exponential sampling, with candidate dimensions $\{1, 2, 4, \ldots, 2^i, \ldots, 2^m\}$; (2) even sampling, with candidate dimensions $\{s, 2s, 3s, \ldots, is, \ldots, ms\}$. Here, $m$ and $s$ are hyper-parameters representing the index bound and sampling granularity used to cover the possible dimensions. For each strategy, the search space also covers the dimension of the initial input feature $d_{in}$.
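As a concrete illustration of the two sampling strategies, the candidate dimensions can be enumerated as follows; the values of d_in, m, and s are placeholders rather than the paper's settings.

# Candidate hop dimensions under the two sampling strategies.
# d_in, m, and s are illustrative values, not the paper's exact settings.
d_in, m, s = 1433, 10, 128

exponential = [2 ** i for i in range(m + 1)] + [d_in]   # 1, 2, 4, ..., 2^m, plus d_in
even = [i * s for i in range(1, m + 1)] + [d_in]        # s, 2s, ..., m*s, plus d_in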

Basic Search Algorithm: Given the search space $\mathcal{S}$, we aim to find the best model that maximizes the expected validation accuracy. We choose a reinforcement learning strategy since its reward is easy to customize for our problem. As shown in Figure 4, the basic strategy uses an LSTM controller with parameters $\theta$ to generate a sequence of actions $a_{1:K}$ of length $K$, where each hop dimension $d_k$ ($1 \le k \le K$) is sampled from the search space mentioned above. Then, we build the model described in Sec. 3.2 and train it with a cross-entropy loss function. Next, we test it on the validation set $D_{val}$ to get an accuracy $R$. We use this accuracy as a reward signal and perform a policy gradient algorithm to update the parameters $\theta$, so that the controller generates better hop-dimension combinations iteratively. The objective function of the controller is:

$$J(\theta) = \mathbb{E}_{P(a_{1:K};\,\theta)}\left[R\right] \tag{4}$$
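The following is a compact policy-gradient sketch of Eq. (4). For brevity, independent per-hop categorical logits stand in for the LSTM controller, and reward_fn is a hypothetical callable that trains the model of Sec. 3.2 for a given hop-dimension combination and returns its validation accuracy.

import torch

def reinforce_dimension_search(search_space, num_hops, reward_fn, steps=200, lr=0.1):
    """Policy-gradient sketch of Eq. (4). Per-hop categorical logits stand in for
    the LSTM controller; reward_fn maps a hop-dimension list to validation accuracy."""
    logits = torch.zeros(num_hops, len(search_space), requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    baseline = 0.0
    for _ in range(steps):
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()                                      # one dimension index per hop
        dims = [search_space[i] for i in actions.tolist()]
        reward = reward_fn(dims)                                     # validation accuracy R
        baseline = 0.9 * baseline + 0.1 * reward                     # moving-average baseline
        loss = -(reward - baseline) * dist.log_prob(actions).sum()   # REINFORCE loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return [search_space[i] for i in logits.argmax(dim=-1).tolist()]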

Conditionally Progressive Search Algorithm: The search space of the basic search algorithm is extremely large, as its size grows exponentially with the number of hops, making it difficult to find the right combinations with limited computational resources. To improve search efficiency, and observing that a large number of actions in our search space are redundant, we propose a conditionally progressive search algorithm. That is, instead of searching the entire space all at once, we divide the search process into multiple phases, starting with a relatively small number of hops. After obtaining the NAS results, we only keep those hop-dimension combinations with high validation accuracy (regarded as the conditional search space). Next, we conduct the hop-dimension search for one additional hop on top of the conditional search space filtered from the last step, and again keep those combinations with high validation accuracy. This procedure is conducted progressively until aggregating more hops cannot boost performance. With this algorithm, we largely prune the redundant search space and enhance search efficiency.
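The conditionally progressive filtering can be sketched as follows, with exhaustive scoring within each phase standing in for the RL controller; train_and_eval and keep_top are assumed names, and the initial phase uses two hops for illustration.

import itertools

def progressive_dimension_search(search_space, max_hops, train_and_eval, keep_top=5):
    """Sketch of the conditionally progressive search; exhaustive scoring within
    each phase replaces the RL controller for brevity.

    search_space: candidate dimensions for a single hop, e.g. [1, 2, 4, ..., d_in]
    train_and_eval: callable mapping a hop-dimension tuple to validation accuracy
    """
    best_acc, best_comb = float("-inf"), None
    # Phase 1: enumerate combinations for a small initial number of hops (two here).
    candidates = list(itertools.product(search_space, repeat=2))
    while candidates and len(candidates[0]) <= max_hops:
        scored = sorted(((train_and_eval(c), c) for c in candidates), reverse=True)
        top_acc, top_comb = scored[0]
        if top_acc <= best_acc:            # aggregating one more hop no longer helps
            break
        best_acc, best_comb = top_acc, top_comb
        survivors = [c for _, c in scored[:keep_top]]    # conditional search space
        # Next phase: extend only the surviving combinations by one additional hop.
        candidates = [c + (d,) for c in survivors for d in search_space]
    return best_comb, best_acc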

Figure 4: An illustration of RL-based NAS for the hop-dimension exploration. A recurrent network (controller) generates descriptions of the dimension for each hop. Once the controller generates a framework, it is trained on the training set and then tested on the validation set. The validation accuracy is taken as the reward to update the recurrent controller.

3.4 The Approximate Hop-Dimension Relation Function

Conducting NAS is computationally expensive, even with the proposed progressive search algorithm. After analyzing the hop-dimension combinations returned by NAS in Sec. 4.1, we find that most of the satisfactory combinations are rather consistent.

This motivates us to propose a simple yet effective hop-dimension relation function to approximate the NAS solutions. The output dimension of Hop-$k$ is:

$$d_k = r^{\,\max(0,\ k-L)} \cdot d_{in} \tag{5}$$

where $r$ is the dimension compression ratio, $d_{in}$ is the dimension of the input feature, and $L$ denotes the number of low-order hops that keep the input dimension. With such an approximate function, there is only one hyper-parameter $r$ to determine, significantly reducing the computational cost. Moreover, under the approximate solution, the low-order neighbors within $L$ hops are directly aggregated with the initial feature dimension, while high-order neighbors are assigned exponentially decreasing dimensions.
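Under the reconstruction of Eq. (5) given above, the per-hop dimensions can be computed as follows; low_hops (the number of hops that keep the input dimension) is an assumed parameter, and the Citeseer-sized input below is only an example.

def hop_dimension(k, d_in, r, low_hops=2):
    """One reading of Eq. (5): keep the input dimension d_in for the first
    low_hops hops, then shrink it geometrically by the compression ratio r.
    The value of low_hops here is an assumption, not the paper's setting."""
    return max(1, int(d_in * r ** max(0, k - low_hops)))

# Example: Citeseer-sized input features with r = 0.125
dims = [hop_dimension(k, d_in=3703, r=0.125) for k in range(1, 7)]
# -> [3703, 3703, 462, 57, 7, 1]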

Combination with Other Node Aggregation Schemes: The proposed hop-aware aggregation strategy is orthogonal to existing node-wise aggregation methods within each hop. For example, the low-order neighbors can be aggregated with the SGC aggregation scheme while high-order neighbors are aggregated with the proposed Ladder-GNN. This formulation is similar to the searched framework, where low-order neighbors tend to retain their dimensions. We call this architecture Ladder-SGC. Similarly, we could employ the GAT aggregation mechanism as the Hop-1 aggregator while the remaining hops are aggregated with the proposed Ladder-GNN. We call this architecture Ladder-GAT.

4 Experiment

In this section, we validate the effectiveness of Ladder-GNN on widely-used semi-supervised node classification datasets. We first analyze the NAS results in Sec. 4.1. Then, we compare our results with existing GNN representation learning works in Sec. 4.2. Last, we present an ablation study on the proposed hop-dimension relation function in Sec. 4.3.

Data description: For the semi-supervised node classification task, we evaluate our method on six datasets: Cora Yang et al. [2016], Citeseer Yang et al. [2016], Pubmed [Yang et al., 2016], Flickr Meng et al. [2019], Arxiv Hu et al. [2020] and Products Hu et al. [2020]. We split the training, validation and test sets following earlier works.

We also conduct experiments on meta-path-based heterogeneous graphs, wherein we differentiate attributes and aggregate them in a similar disentangled manner. Due to page limits, more details about the dataset descriptions, data pre-processing procedure, and experimental settings, along with the heterogeneous graph results, are given in the Appendix.

4.1 Results from Neural Architecture Search

There exist a number of NAS approaches for GNN models, including random search (e.g., AGNN Zhou et al. [2019]), reinforcement learning-based solutions (e.g., GraphNAS Gao et al. [2020] and AGNN Zhou et al. [2019]), and evolutionary algorithms (e.g., Genetic-GNN Shi et al. [2020]). In this work, instead of searching GNN model structures, NAS is used to search the appropriate dimensions allocated to each hop.

Method Cora Citeseer Pubmed
Random-w/o share Zhou et al. [2019] 81.4 72.9 77.9
Random-with share Zhou et al. [2019] 82.3 69.9 77.9
GraphNAS-w/o share Gao et al. [2020] 82.7 73.5 78.8
GraphNAS-with share Gao et al. [2020] 83.3 72.4 78.1
AGNN-w/o share Zhou et al. [2019] 83.6 73.8 79.7
AGNN-with share Zhou et al. [2019] 82.7 72.7 79.0
Genetic-GNN Shi et al. [2020] 83.8 73.5 79.2
Ours-w/o cond. 82.0 72.9 79.6
Ours-with cond. 83.5 74.8 80.9
Ours-Approx 83.3 74.7 80.0
Table 1: The accuracy (%) comparison for different NAS strategies. Ours-with cond. means using the conditional search strategy. Ours-Approx indicates the result obtained with the hop-dimension relation function. The best result is in bold. The second-place result is underlined.

In particular, we search the hop-dimension combinations on the Cora, Citeseer, and Pubmed datasets and show the experimental results in Table 1. Compared with existing NAS methods, our method with the conditionally progressive search algorithm achieves better results on Citeseer and Pubmed, improving over Genetic-GNN by 1.4% and 2.1%, respectively. Meanwhile, we achieve comparable accuracy on the Cora dataset by considering only the hop-dimension combinations. Moreover, the conditionally progressive search brings up to a 2.6% improvement over the unconditional variant (w/o cond.). We leave the analysis of the conditional strategy to the Appendix.

Figure 5 presents the histogram of the potential dimension assignments for different hops. We observe: (i) for low-order neighbors, most of the found solutions with high accuracy keep the initial feature dimension; (ii) most of the candidate dimensions selected for higher-order hops are in the single digits, which inspires us to design the conditional strategy to greatly reduce the search space; (iii) the dimensionality tends to be reduced for high-order neighbors, and approximating it with exponentially decreasing dimensions covers a relatively large proportion of the solutions. Accordingly, we can use the approximate relation function to find proper hop-dimension combinations with performance comparable to the NAS solution (see Table 1).

Figure 5: The histogram of the filtered dimensions for different hops on the Cora dataset. The x-axis shows Hop-$k$ with its potential dimension assignments, while the y-axis shows the rate (%) of each dimension for the corresponding hop. The colored bars represent the sizes of the different dimensions, as shown in the upper legend.

4.2 Comparison with Related Work

Table 2 compares Ladder-GNN with three groups of existing methods (general GNNs, GNNs with modified graph structures, and hop-aware GNNs) in terms of Top-1 accuracy (%) on six datasets. The results in the Ours row for Cora, Citeseer, and Pubmed are obtained with the conditionally progressive search, and we use the proposed hop-dimension relation function for the other datasets. Our methods achieve state-of-the-art performance in most cases and surpass all hop-aware GNNs on all datasets. Specifically, our method shows the largest improvement on Flickr (9.1%). Flickr has a high edge-node ratio and a large feature size, thus containing more noise and requiring more dimension compression to denoise, especially for high-order neighbors. A greater degree of noise can be reduced with the proposed Ladder-GNN, resulting in a greater enhancement. Meanwhile, we improve over GXN by 1.5% on Citeseer, over DisenGCN by 0.5% on Pubmed, over GAT by 0.4% on Arxiv, and over GAT by 1.6% on Products. Compared with Cora and Pubmed, Citeseer has a lower graph homophily rate Pei et al. [2020], which makes high-order neighbors noisier and their information harder to extract. Our method brings relatively larger gains on Citeseer, which demonstrates its effectiveness in handling noisy graphs. Last, although the best GNN models for different tasks vary significantly You et al. [2020], our simple hop-dimension relation function consistently outperforms other hop-aware methods on all datasets, which indicates its effectiveness.

Method Cora Citeseer Pubmed Flickr Arxiv Products

General GNNs
ChebNet Defferrard et al. [2016] 81.2 69.8 74.4 23.3 - -
GCN Kipf and Welling [2016] 81.5 70.3 79.0 41.1 71.7 75.6
GraphSage Hamilton et al. [2017] 81.3 70.6 75.2 57.4 71.5 78.3
GAT* Veličković et al. [2017] 78.9 71.2 79.0 46.9 73.6 79.5
JKNet Xu et al. [2018b] 80.2 67.6 78.1 56.7 72.2 -
SGC Wu et al. [2019] 81.0 71.9 78.9 67.3 68.9 68.9
APPNP Klicpera et al. [2018] 81.8 72.6 79.8 - 71.4 -
ALaGCN Xie et al. [2020] 82.9 70.9 79.6 - - -
MCN Lee et al. [2018] 83.5 73.3 79.3 - - -
DisenGCN Ma et al. [2019] 83.7 73.4 80.5 - - -
FAGCN Bo et al. [2021] 84.1 72.7 79.4 - - -

GNNs with modified graph structures
DropEdge-GCN Rong et al. [2019] 82.0 71.8 79.6 61.4 - -
DropEdge-IncepGCN Rong et al. [2019] 82.9 72.7 79.5 - - -
AdaEdge-GCN Chen et al. [2020] 82.3 69.7 77.4 - - -
PDTNet-GCN Luo et al. [2021] 82.8 72.7 79.8 - - -
GXN Zhao et al. [2020] 83.2 73.7 79.6 - - -
SPAGAN Yang et al. [2021] 83.6 73.0 79.6 - - -

Hop-aware GNNs
HighOrder Morris et al. [2019] 76.6 64.2 75.0 - - -
MixHop Abu-El-Haija et al. [2019] 80.5 69.8 79.3 39.6 - -
GB-GNN Oono and Suzuki [2020] 80.8 70.8 79.2 - - -
HWGCN Liu et al. [2019a] 81.7 70.7 79.3 - - -
MultiHop Zhu et al. [2019] 82.4 79.4 71.5 - - -
HLHG Lei et al. [2019] 82.7 71.5 79.3 - - -
AM-GCN Wang et al. [2020] 82.6 73.1 79.6 - - -
N-GCN Abu-El-Haija et al. [2020] 83.0 72.2 79.5 - - -
S²GC Zhu and Koniusz [2021] 83.5 73.6 80.2 - 72.0 76.8

Ours 83.5 74.8 80.9 73.4 72.1 78.7
Ladder-GAT 82.6 73.8 80.6 71.4 73.9 80.8
Table 2: The accuracy (%) comparison with existing methods.

Our method focuses on hop-aware aggregation; hence, it can easily be combined with any node-wise aggregation scheme, such as GAT or SGC. We investigate combining these two node-wise aggregation frameworks with the proposed Ladder-GNN to boost performance and robustness when aggregating high-order neighbors.

Improving the high-order aggregation of GAT. GAT Veličković et al. [2017] distinguishes the relationships among direct neighbors, showing better performance than GCN on many graphs. We first explore whether GAT itself can aggregate high-order neighbors, using eight attention heads and four channel sizes {1, 4, 8, 16} per head in its self-attention scheme. To aggregate multi-hop neighbors, we compare a deeper structure that stacks layers, named D-GAT (blue lines), and a wider network in which one layer computes and aggregates multiple-order neighbors, named W-GAT (orange lines), in Figure 6(a). As we can see, D-GAT suddenly suffers from the over-smoothing problem beyond a certain number of hops, especially for larger channel sizes, while W-GAT degrades gradually due to over-fitting. Thus, both aggregation schemes lose accuracy as the number of hops increases. On the contrary, the proposed Ladder-GAT (purple line) is robust to the increasing number of hops, since the proposed Ladder-GNN relieves the above problems when aggregating high-order hops.

Improving the high-order aggregation of SGC. SGC Wu et al. [2019] is effective when aggregating low-order hops. In Figure 6(b), we compare SGC with Ladder-SGC, where the low-order hops are aggregated without dimension reduction and the dimensions of the hops beyond them (up to the furthest hop) are compressed. The purple line shows an interesting setting: only the last hop is compressed to a lower dimension. For the same number of hops on the horizontal axis, Ladder-SGC achieves consistent improvement over SGC (orange line).

To explain these results, we note that the weights updated in SGC are affected by (i) the information-to-noise ratios of different hops and (ii) the over-squashing phenomenon Alon and Yahav [2020] – information from an exponentially-growing receptive field is compressed into fixed-length node vectors, making it difficult for distant hop neighbors to play a role. This observation suggests that compressing the dimension of the last hop can mitigate the over-squashing problem in SGC, which thus consistently improves the performance of high-order aggregation.

Figure 6: Accuracy comparison of different hop-level aggregation methods on (a) the Cora dataset and (b) the Pubmed dataset.

4.3 Ablation Study

To analyze the effects of the hop-dimension function in Eq. (5), we conduct experiments varying its two dominant hyper-parameters: the furthest hop $K$ and the dimension compression rate $r$. Although we design a NAS procedure to explore priors for these hyper-parameters, the ablation study in this section helps to understand the impact of each one. Due to space limitations, we only present the results on Citeseer; results on other datasets are included in the Appendix.

In Table 3, the compression rate $r$ varies over a wide range for comparison. We observe that (i) when increasing the furthest hop $K$ with $r$ fixed, the performance increases until it saturates; this is because farther neighbors contribute progressively less new information as $K$ increases, while Ladder-GNN learns to suppress the accompanying noise through dimension compression. (ii) When decreasing the compression rate $r$ with $K$ fixed, the performance first increases and then drops in most situations. The initial increase occurs because a smaller compression rate maps distant nodes to a lower and more suitable dimension space, suppressing their interference. However, there is an upper bound for these improvements: when $r$ becomes too small, the reduced dimension is too low to preserve the overall structural information, leading to worse performance in most cases. (iii) The effective rates are mainly {0.125, 0.0625}, which achieve better results for most $K$; the best setting yields an accuracy of 74.7%. (iv) There are significant improvements with dimension compression compared to dimension increase, which validates the basic principle of dimension compression.

r =  60.7  67.6  64.8  63.0  61.2  58.9  55.8  50.2
r =  65.5  71.5  72.3  72.8  73.0  73.1  73.2  73.7
r =  68.8  72.9  73.3  73.9  73.6  73.5  73.4  73.0
r =  71.0  73.5  74.1  74.0  74.2  74.3  74.0  72.8
r =  69.3  73.1  73.6  74.7  74.3  74.3  73.8  74.2
r =  67.2  71.4  72.7  73.0  73.2  73.8  73.5  73.3

Table 3: Accuracy (%) on Citeseer of the proposed Ladder-GNN for different compression rates r (rows) under different furthest hops K (columns).

5 Conclusion

In this work, we propose a simple yet effective ladder-style GNN architecture, namely Ladder-GNN. Specifically, we take a communication perspective on the GNN representation learning problem, which motivates us to separate messages from different hops and assign different dimensions to them before concatenating them to obtain the node representation. The resulting ladder-style aggregation scheme facilitates extracting discriminative features more effectively than existing solutions. Experimental results on various semi-supervised node classification tasks show that the proposed simple Ladder-GNN solution achieves state-of-the-art performance on most datasets.

References