Graph Symbiosis Learning

06/10/2021
by Liang Zeng, et al.
Tsinghua University

We introduce a framework for learning from multiple generated graph views, named graph symbiosis learning (GraphSym). In GraphSym, graph neural networks (GNNs) developed on multiple generated graph views can adaptively exchange parameters with each other and fuse the information stored in linkage structures and node features. Specifically, we propose a novel adaptive exchange method that iteratively substitutes redundant channels in the weight matrix of one GNN with informative channels of another GNN in a layer-by-layer manner. GraphSym does not rely on specific methods to generate multiple graph views or on specific GNN architectures. Thus, existing GNNs can be seamlessly integrated into our framework. On 3 semi-supervised node classification datasets, GraphSym outperforms previous single-graph and multiple-graph GNNs without knowledge distillation, and achieves new state-of-the-art results. We also conduct a series of experiments on 15 public benchmarks, 8 popular GNN models, and 3 graph tasks (node classification, graph classification, and edge prediction), and show that GraphSym consistently outperforms existing popular GNNs and their ensembles by 1.9%∼3.9% on average. Extensive ablation studies and experiments in the few-shot setting also demonstrate the effectiveness of GraphSym.


1 Introduction

Graphs, composed of objects (i.e., nodes) and their complex interactions (i.e., edges), are widely used to model real-life applications such as social networks, knowledge graphs, and biological networks easley2012networks; barabasi2013network; hamilton2020graph. In real-world scenarios, numerous systems are organized in the form of multiple graph views, where linkage structures and node features often convey information from multiple perspectives hamilton2020graph; battaglia2018relational; jing2021hdmi; wu2020comprehensive; liang2020multi. It is common practice to construct multiple graph views generated from the original graph to provide complementary information about the graph hassani2020contrastive; zhu2020graph; feng2020graph. Multiple generated graph views with different edge connections or node features contain rich information from different aspects, and aggregating information across these generated views is instrumental in learning better graph representations jing2021hdmi; hassani2020contrastive; feng2020graph; you2020graph; poole2019variational; velivckovic2018deep; sun2019infograph; fan2020one2multi; xie2020mgat; ma2020multi.

Recently, there have been two main threads of research on multiple-graph GNNs (in this paper, multiple-graph GNNs refer to GNNs trained on multiple generated graph views). The first thread utilizes graph augmentations, such as dropping edges and masking node features, to corrupt positive samples and generate negative ones for contrastive learning on graphs hassani2020contrastive; you2020graph; poole2019variational; velivckovic2018deep; sun2019infograph; qiu2020gcc. The second aims to jointly model multiple graph views by co-regularization or architectural designs feng2020graph; fan2020one2multi; xie2020mgat; ma2020multi; hu2020multi; li2017multi; adaloglou2020multi; cheng2020multi; geng2019spatiotemporal; wang2020gcn. Both require specific domain knowledge, either to determine which samples are positive or to manually design the network architecture and regularization terms. In this work, we instead posit that the integration of information from multiple generated graph views should not rely on specialized architectural designs or extra regularization terms: powerful GNNs can first automatically and effectively learn the information from each single graph view and then exchange information with each other adaptively.

Figure 1: Symbiosis relationship in biology and multiple-graph GNNs. The moth and the flower are closely related because they share the same living environment, whereas GNNs are trained on closely related views of the same graph. Ideally, multiple-graph GNNs benefit from the mutual interaction of exchanging parameters, just as symbiotic interactions bring higher fitness to the participating species.

Symbiosis is a relationship between species living in a closely related environment, where participating species promote each other's fitness through mutual interactions margulis1971symbiosis; bolnick2018annual; smith2010mycorrhizal. For example, the moth and the flower both live on the marshland. They benefit from each other through nectar collecting and pollination (their mutual interaction), and thus constitute a symbiosis relationship. GNNs are trained on multiple generated graph views and are supposed to capture graph information from different perspectives. Taking them as different species, one way to promote their ability to fuse information across multiple graphs is to mimic this mutual interaction and enable a symbiosis relationship among them, as shown in Figure 1. However, just as symbiosis relationships in biology are complex, it is of vital importance to exchange the parameters of GNNs trained on multiple generated graph views adaptively and effectively.

In this paper, we propose a novel framework for learning from multiple generated graph views, named graph symbiosis learning (GraphSym), which contains an individual phase and a symbiotic phase. In the individual phase, following previous work zhu2020graph; velivckovic2018deep; sun2019infograph, we first leverage graph augmentations to automatically generate two views of a graph. Then, we develop two GNNs with the same architecture, each optimized on a single graph view. In the symbiotic phase, we quantitatively measure the information of one layer in a GNN using entropy, and design a channel-wise adaptive exchange approach that replaces redundant channels in one network with informative channels from the other network in a layer-by-layer manner. Furthermore, we extend the framework to the case of multiple-graph GNNs, where each GNN can incorporate information from all the other GNNs trained on multiple generated graph views.

The key contribution of this work is a novel graph symbiosis training framework that enables adaptive information exchange to effectively integrate knowledge from multiple generated graph views. GraphSym does not rely on specific methods to generate multiple graph views. Moreover, GraphSym does not need to modify the original training configurations such as the learning rate, number of layers, and weight decay, requiring only information exchange in the model parameters. Thus, existing GNN architectures can be seamlessly integrated into GraphSym. Furthermore, after exchanging parameters, there is no extra calculation overhead in the inference stage of GraphSym. On three semi-supervised node classification datasets, our method outperforms previous single-graph and multiple-graph GNNs that do not use knowledge distillation, and achieves new state-of-the-art results. We further evaluate the effectiveness of GraphSym on 15 public benchmark datasets, 8 popular GNN models, and 3 graph tasks: node classification, edge prediction, and graph classification. GraphSym consistently outperforms baseline GNN models and their ensembles by 1.9%∼3.9% (absolute improvements) on average across node classification, graph classification, and edge prediction tasks. In addition, our extensive ablation studies and analyses also demonstrate the superiority of GraphSym.

2 Related Work

Contrastive learning of multiple-graph GNNs. Contrastive learning is a discriminative approach that performs augmentations on a graph to obtain different views and maximizes the agreement between their representations le2020contrastive. There is a large body of literature on contrastive multi-view representation learning in GNNs hassani2020contrastive; zhu2020graph; you2020graph; poole2019variational; velivckovic2018deep; sun2019infograph; qiu2020gcc; qiu2018network. DGI velivckovic2018deep captures similarities using mutual information between node and graph summary vectors. GCA zhu2020graph uses an adaptive augmentation to generate informative views, which preserves important structural and attribute information of a graph. Although these approaches have achieved extensive success on various tasks, they need specialized graph augmentations to generate positive or negative samples. In contrast, our work can make use of broader augmentations and enables the exploitation of the inner relationships among multiple generated graph views.

Joint learning of multiple-graph GNNs. Multiple-graph joint learning aims to jointly model multiple generated graph views to improve generalization performance feng2020graph; fan2020one2multi; xie2020mgat; ma2020multi; hu2020multi; li2017multi; adaloglou2020multi; cheng2020multi; geng2019spatiotemporal; wang2020gcn. Most works leverage graph augmentations to generate multiple graph views from the original graph, and then either design regularization terms that force the GNN to optimize prediction consistency or design specialized architectures that directly model the relations between multiple graphs. Co-regularization based algorithms feng2020graph; fan2020one2multi; hu2020multi regard the consistency between two graphs as a regularization term in the objective function. GRAND feng2020graph designs a random propagation strategy to perform graph augmentations and utilizes consistency regularization to enforce the model to generate similar feature maps on different views of the same data. Besides, a number of architectural solutions have been proposed for joint learning of multiple-graph GNNs xie2020mgat; ma2020multi; li2017multi; adaloglou2020multi; cheng2020multi; geng2019spatiotemporal; wang2020gcn. MGAT xie2020mgat explores an attention-based architecture to learn node representations from multiple views of a graph. AM-GCN wang2020gcn proposes an adaptive multi-channel GNN and constructs topology and feature graphs to integrate topological structures and node features. In contrast to these works, our work retains the benefits of jointly modeling multiple generated graph views via the symbiosis mechanism while requiring neither extra regularization terms nor specialized architectural designs.

Grafting. Grafting meng2020filter inspires us to construct interactions among multiple networks. It improves network performance by grafting external information (weights) trained on the same data source in computer vision tasks. Grafting calculates the weighted sum of the parameters of two networks trained with different learning rates (hyper-parameters) to reactivate invalid layers, and leverages layer-wise integration to achieve the best performance. In contrast, our work aims to adaptively integrate knowledge from GNNs trained on different generated graph views (data) to improve the generalization ability of the model. Rather than using a weighted sum of parameters for the whole layer, we design a fine-grained approach that exchanges part of the resources of each graph neural network by iteratively replacing redundant channels with informative channels in the weight matrix.

3 Methodology

Figure 2: The illustrative schematic diagram of GraphSym. (a) Generating two graph views (i.e., masking node features and dropping edges). (b) Adaptive channel-wise exchange (in the first layer for illustration). (c) GraphSym with multiple networks (best viewed in color).

In Section 3.1, we introduce the notations. In Section 3.2, we discuss the inspiration and formulate the GraphSym framework. Next, we introduce GraphSym between two networks trained on two generated graph views, which is divided into two phases: the individual phase (Section 3.3) and the symbiotic phase (Section 3.4). In Section 3.5, we extend GraphSym to multiple GNNs.

3.1 Notations

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges between nodes. $X \in \mathbb{R}^{N \times d}$ represents the node feature matrix and $x_v \in \mathbb{R}^{d}$ is the feature vector of node $v$, where $d$ is the number of channels of the feature matrix $X$. The adjacency matrix $A \in \{0,1\}^{N \times N}$ is defined by $A_{uv} = 1$ if $(u, v) \in \mathcal{E}$, and $A_{uv} = 0$ otherwise. A different view of the original graph is obtained by $\tilde{\mathcal{G}} = t(\mathcal{G})$, where $t(\cdot)$ is an augmentation function, such as masking node features or dropping edges zhu2020graph; velivckovic2018deep.

3.2 Inspiration & Formulation

Symbiosis is a common ecological relationship in which organisms promote each other, by exchanging resources or services, to adapt to the environment better than they would living independently margulis1971symbiosis; bolnick2018annual; smith2010mycorrhizal. Formally, we denote a parametrized GNN as $f(\cdot\,; W_0): \mathcal{X} \rightarrow \mathcal{Y}$ with initial parameters $W_0$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input space and output space. In the individual phase, given paired training data $(x, y) \in (\mathcal{X}, \mathcal{Y})$, the network is optimized with a supervised loss as follows:

$$W = \arg\min_{W_0} \; \mathcal{L}\big(f(x; W_0),\, y\big), \tag{1}$$

where $W$ is the weight of the GNN after optimization. The symbiotic phase then imitates the process of exchanging resources or services between two organisms in ecology. We take the optimized weights $W^1$ and $W^2$ of the two corresponding neural networks as input, exchange information between them, and produce $\hat{W}^1$ and $\hat{W}^2$ as output. In the final step, we re-train from the output parameters $\hat{W}^1$ and $\hat{W}^2$ to obtain the final parameters. The pipeline of parameter updating in GraphSym with two GNNs trained on two generated graph views is illustrated in Figure 3. We then proceed to introduce the two phases of GraphSym.

Figure 3: The illustration of the pipeline of parameter updating in GraphSym with two GNNs.

3.3 The Individual Phase

Generating multiple graph views. All organisms rely on a diverse environment, such as sunlight and soil, to survive. In these environments, organisms use the available resources and interact with and benefit from each other. To provide diverse environments for GNNs, following previous works velivckovic2018deep; zhu2020graph, we apply stochastic augmentation functions to generate multiple graph views and then feed them into the GNNs. Once the GNNs fit these different generated graph views (environments), they can further improve performance by exchanging their knowledge. We leverage four augmentation functions to generate multiple graph views in symbiosis learning: masking node features, corrupting node features, dropping edges, and extracting subgraphs. The details of the graph augmentations can be found in Appendix B. Note that GraphSym does not rely on specific methods to generate multiple graph views, so other generating methods can be readily incorporated into our framework.

Training GNNs. GraphSym does not rely on specific GNN models and does not need to modify the original training configurations such as the number of layers and the learning rate. Thus, existing modern GNNs can be directly applied in our framework.
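
To make the individual phase concrete, the following is a minimal PyTorch sketch of training one copy of the same GNN per generated view with identical, unmodified training configurations; the `forward(A_hat, X)` signature, the optimizer settings, and the helper name `individual_phase` are illustrative assumptions rather than the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def individual_phase(model, graph_views, labels, train_mask, epochs=200, lr=0.01):
    """Train one copy of the same GNN on each generated graph view.

    graph_views: list of (A_hat, X) pairs produced by graph augmentations.
    The training configuration (epochs, lr, weight decay) is shared by all copies,
    mirroring the claim that GraphSym keeps the original training setup unchanged.
    """
    models = [copy.deepcopy(model) for _ in graph_views]
    for net, (A_hat, X) in zip(models, graph_views):
        optimizer = torch.optim.Adam(net.parameters(), lr=lr, weight_decay=5e-4)
        for _ in range(epochs):
            optimizer.zero_grad()
            logits = net(A_hat, X)                          # any GNN with forward(A_hat, X)
            loss = F.cross_entropy(logits[train_mask], labels[train_mask])
            loss.backward()
            optimizer.step()
    return models                                           # weights W^1, W^2, ... for the symbiotic phase
```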

3.4 The Symbiotic Phase

After the individual phase, GNNs have learned the knowledge from multiple generated graph views and can take a further step to interact for mutual benefit. GNNs, developed from different views, should perform knowledge exchange with each other in a complementary way. To this end, we should first decide what information is valuable to exchange.

We use entropy to measure the information in one layer of a GNN. Let $W_l$ be the weight of the $l$-th layer in a GNN, where $C_l$ is the number of channels in the $l$-th layer. Following the approach introduced in meng2020filter; shwartz2017opening; cheng2019utilizing, we calculate the entropy by dividing the range of values in $W_l$ into $B$ different bins. Denote the number of values that fall into the $b$-th bin as $n_b$. We use $p_b = n_b / \sum_{b'=1}^{B} n_{b'}$ to approximate the probability of the $b$-th bin. Then, the entropy of $W_l$ can be calculated as follows:

$$H(W_l) = -\sum_{b=1}^{B} p_b \log p_b. \tag{2}$$

A larger value of $H(W_l)$ indicates richer information in the weight of the $l$-th layer of the GNN, and vice versa. For example, if every element of $W_l$ takes the same value (so the entropy is 0), $W_l$ cannot discriminate which part of the input is more important.
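
As a reference, the following is a minimal NumPy sketch of this histogram-based entropy estimate (Eq. 2); the number of bins is not specified above, so num_bins=64 is an illustrative choice.

```python
import numpy as np

def weight_entropy(W, num_bins=64):
    """Approximate the entropy of a weight matrix by binning its values (Eq. 2)."""
    counts, _ = np.histogram(np.asarray(W).ravel(), bins=num_bins)
    probs = counts / counts.sum()        # empirical probability of each bin
    probs = probs[probs > 0]             # drop empty bins to avoid log(0)
    return float(-(probs * np.log(probs)).sum())
```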

Input: Two GNNs $f^1$ and $f^2$, whose inputs are $\mathcal{G}^1$ and $\mathcal{G}^2$, respectively; $w^n_{l,c}$ denotes the weight of the $c$-th channel in the $l$-th layer of network $f^n$; the number of layers $L$ of a GNN; iteration number $T$; the number of exchange channels $K$.
1  The individual phase:
2  Update the weights $W^1$ of GNN $f^1$ and $W^2$ of GNN $f^2$ according to the corresponding losses.
3  The symbiotic phase:
4  Get the source and target GNNs: $W^s \leftarrow W^1$, $W^t \leftarrow W^2$.
5  for $\tau = 1, \dots, T$ do
6        for $l = 1, \dots, L$ do
7              Calculate the Pearson correlation among all possible pairs of the channels in $W^t_l$.
8              Find the pair of channels indexed by $i$ and $j$ with the highest correlation.
9              for $k = 1, \dots, K$ do
10                   Substitute each channel of the source network for $w^t_{l,i}$ or $w^t_{l,j}$ of the target network to find an informative channel as in Eq. 3.
14 Reverse the source and target networks ($W^s \leftrightarrow W^t$) and repeat lines 5-10.
Output $\hat{W}^1$ and $\hat{W}^2$.
Algorithm 1 Graph symbiosis learning of two GNNs

Adaptive exchange. Given the quantitative measurement of information in each layer of a GNN, we then consider how to exchange it in a complementary style. We propose an adaptive exchange strategy to exchange weights between two GNNs. GNNs follow the message passing framework to iteratively aggregate information from neighboring nodes; specifically, the $l$-th layer makes use of the subtree structures of height $l$ rooted at every node. Thus, we only exchange information within the same layer to preserve the consistency of information between the two GNNs. Denote the weight in the $l$-th layer of a GNN trained on the $n$-th graph view as $W^n_l = [w^n_{l,1}, \dots, w^n_{l,C_l}]$, where the input channel is $C_{l-1}$, the output channel is $C_l$, and $w^n_{l,c} \in \mathbb{R}^{C_{l-1}}$ is an output channel vector. Denote the parameters of the source GNN as $W^s_l$ and those of the target GNN as $W^t_l$. In each exchange step, our target is to adaptively exchange a channel in $W^t_l$ with a channel in $W^s_l$ for each layer. To find the redundant channel to substitute, we first calculate the Pearson correlation among the channels of $W^t_l$ and obtain the pair of redundant channels, i.e., $w^t_{l,i}$ and $w^t_{l,j}$, that have the highest correlation. Then we choose a channel from $W^s_l$ to maximize the entropy of the new weight matrix. Formally, let $\Phi(W, c, w)$ be the operation that substitutes the $c$-th column of the matrix $W$ with the vector $w$. We find the informative channel $w^s_{l,c}$ such that

$$H\big(\Phi(W^t_l,\, j,\, w^s_{l,c})\big) \tag{3}$$

is maximized, where $j$ denotes the redundant channel being replaced (the substitution is evaluated at both positions $i$ and $j$, as in Algorithm 1). By repeating the above exchange step $K$ times, the target network obtains part of the information ($K$ channels) from the source network while retaining the useful information of its original weights. We also pictorially illustrate the procedure in Figure 2 (b): the GNNs trained on the two views perform adaptive channel-wise weight exchange in the first layer as described above, and as a result the target network substitutes the second channel in its weight matrix with the first channel from the source network. Through this procedure, both networks contain information from the two graph views. The complete algorithm is summarized in Algorithm 1. Finally, we re-train the two GNNs with the same number of epochs introduced in Section 3.3 to obtain the output predictions.
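
A minimal NumPy sketch of one adaptive exchange step, under our reconstruction of Eq. 3, is shown below; the histogram bin count and the exhaustive search over both redundant slots are illustrative assumptions rather than the exact released procedure.

```python
import numpy as np

def exchange_one_channel(W_src, W_tgt, num_bins=64):
    """One adaptive exchange step between the l-th layer weights of two GNNs.

    W_src, W_tgt: arrays of shape (in_channels, out_channels).
    Returns a copy of W_tgt in which one redundant output channel (one of the most
    correlated pair) has been replaced by the source channel that maximizes the
    entropy of the resulting weight matrix (Eq. 3).
    """
    def entropy(W):
        counts, _ = np.histogram(W.ravel(), bins=num_bins)
        p = counts / counts.sum()
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    # 1. Find the most correlated (redundant) pair of output channels in the target.
    corr = np.corrcoef(W_tgt.T)              # (C_out, C_out) Pearson correlations
    np.fill_diagonal(corr, -np.inf)          # ignore self-correlation
    i, j = np.unravel_index(np.argmax(corr), corr.shape)

    # 2. Try every source channel in slot i or j and keep the substitution
    #    that yields the highest-entropy weight matrix.
    best_entropy, best_W = -np.inf, W_tgt
    for c in range(W_src.shape[1]):
        for slot in (i, j):
            W_new = W_tgt.copy()
            W_new[:, slot] = W_src[:, c]
            h = entropy(W_new)
            if h > best_entropy:
                best_entropy, best_W = h, W_new
    return best_W
```

Repeating this step K times per layer and then swapping the source and target roles corresponds to lines 5-14 of Algorithm 1.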

3.5 Extending Symbiosis Learning to Multiple GNNs

Symbiosis learning can be easily extended to the case of multiple-graph GNNs, as illustrated in Figure 2 (c). In each iteration of the symbiotic phase, each network receives knowledge from another network trained on a different generated view. After a certain number of iterations, each network contains information from all the other networks trained on the multiple generated graph views. The complete algorithm of GraphSym with multiple-graph GNNs can be found in Appendix A.

4 Experiments

4.1 Setup

 

                                      Cora       CiteSeer   PubMed
Single-graph GNNs
  GCN kipf2016semi                    81.5       70.3       79.0
  GAT velivckovic2017graph            83.0±0.7   72.5±0.7   79.0±0.3
  GraphSAGE hamilton2017inductive     78.9±0.8   67.4±0.7   77.8±0.6
  APPNP klicpera2018predict           83.8±0.3   71.6±0.5   79.7±0.3
  Graph U-Net gao2019graph            84.4±0.6   73.2±0.5   79.6±0.2
  MixHop abu2019mixhop                81.9±0.4   71.4±0.8   80.8±0.6
  SGC wu2019simplifying               82.0±0.0   71.9±0.1   78.9±0.0
  GraphMix verma2019graphmix          83.9±0.6   74.5±0.6   81.0±0.6
  GCNII chen2020simple                85.5±0.5   73.4±0.6   80.2±0.4
Multiple-graph GNNs
  MAGCN cheng2020multi                75.1       71.1       69.1
  DGI velivckovic2018deep             82.3±0.6   71.8±0.7   76.8±0.6
  GRAND feng2020graph                 85.4±0.4   75.5±0.4   82.7±0.6
  Graft meng2020filter                85.5±0.5   75.5±0.4   82.6±0.4
Ours
  Ensemble                            85.6       75.1       82.6
  GraphSym                            85.8±0.2   75.7±0.3   82.8±0.7

Table 1: Results of accuracy (%) on Cora, CiteSeer, and PubMed.

Dataset. Following previous approaches to learning from multiple generated graph views hassani2020contrastive; xie2020mgat; feng2020graph, we utilize single-view graph datasets and generate multiple graph views via graph augmentations. To show the generality of the proposed GraphSym framework, we conduct experiments on 15 public benchmark datasets, including 9 datasets for node classification, 3 datasets for edge prediction, 5 datasets for graph classification, and 1 dataset for large-scale node classification (scalability). Statistics of each dataset are summarized in Table 2 and Table 3. More details are included in Appendix D.


  • Node classification: Citation network sen2008collective; namata2012query; yang2016revisiting: Cora, CiteSeer, and PubMed; Wikipedia network rozemberczki2021multi: Chameleon, and Squirrel; Actor co-occurrence network tang2009social: Actor; WebKB pei2020geom: Cornell, Texas, and Wisconsin.

  • Edge prediction: Citation network sen2008collective; namata2012query; yang2016revisiting: Cora, CiteSeer, and PubMed.

  • Graph classification: Chemical compounds Morris+2020: DD, NCI1, and PROTEINS; Social networks Morris+2020: IMDB-BINARY and REDDIT-BINARY.

  • Scalability: Academic citation network in the open graph benchmark (OGB) hu2020open: ogbn-arxiv.

Implementations. As a training scheme (rather than a specific GNN architecture), GraphSym is implemented on top of a baseline GNN model by first training the GNN model on the multiple generated graph views in the individual phase and then exchanging information among them in the symbiotic phase. For generating multiple views, we use masking node features, corrupting node features, dropping edges, and extracting subgraphs, introduced in Appendix B, as the graph augmentation methods. Data preparation follows the standard experimental settings, including feature preprocessing and data splitting. We conduct experiments on transductive node classification, edge prediction pei2020geom; kipf2016semi; ma2021improving, and inductive graph classification errica2019fair. All reported accuracies (%) are averaged over 100 runs, except for 10 runs on the large-scale ogbn-arxiv dataset. Since each GNN in GraphSym interacts with all other GNNs, the performance of the different GNNs is similar after symbiotic interaction. Thus, in what follows, we always report the performance of the first GNN.

Baselines. To validate the effectiveness of our proposed method, we implement GraphSym based on the following GNN models (we follow the settings in their original papers; see Appendix E for details):


  • Node classification: GCN kipf2016semi, GAT velivckovic2017graph, APPNP klicpera2018predict, JKNET with concatenation and maximum aggregation scheme xu2018representation, GRAND feng2020graph, and a recent deep GCN model GCNII chen2020simple.

  • Edge prediction: GCN kipf2016semi.

  • Graph classification: GCN kipf2016semi, and GIN xu2018powerful.

Besides, we test 3 variants of each GNN model for ablation analysis. To exclude the influence of longer training epochs, we conduct experiments, named further training (FT), that train the baseline models for the same number of epochs as GraphSym. Multiple-graph GNNs ensemble (Ensemble) first trains each GNN on an individual generated view of the graph and then assembles their outputs by majority voting. Multiple-graph GNNs ensemble + further training (Ensemble+FT) is similar to Ensemble but trains each GNN for the same number of epochs as GraphSym. The original GNN models of the baselines are denoted by their names directly.

                 Cora    CiteSeer  PubMed   Chameleon  Squirrel  Actor   Cornell  Texas   Wisconsin
 # Nodes:        2,708   3,327     19,717   2,277      5,201     7,600   183      183     251
 # Edges:        5,278   4,676     44,327   31,421     198,493   26,752  280      295     466
 # Features (d): 1,433   3,703     500      2,325      2,089     931     1,703    1,703   1,703
 # Classes:      7       6         3        5          5         5       5        5       5
GCN
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 3.8):   1.3     0.3       0.9      4.1        4.6       3.4     0.8      13.0    5.6
GAT
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 1.9):   0.4     0.1       0.3      1.2        2.4       6.5     2.1      0.5     3.5
APPNP
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 3.2):   0.7     0.6       0.5      4.9        3.2       1.5     5.0      5.2     7.1
JKNET-CAT
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 3.2):   0.5     0.3       0.2      3.7        0.6       1.0     0.3      8.1     5.4
JKNET-MAX
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 2.2):   0.5     0.3       0.2      2.6        1.7       0.6     1.9      10.2    5.6
GCNII
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 2.6):   0.4     0.3       0.2      2.4        1.9       1.3     1.6      12.7    2.1
Table 2: Results of accuracy (%) on node classification tasks. Δ denotes the absolute difference between the original GNN model and GraphSym on each dataset; the average absolute improvement of GraphSym over each GNN model is given in brackets.
                    DD       NCI1     PROTEINS  IMDB-BINARY  REDDIT-BINARY
# Graphs:           1,178    4,110    1,113     1,000        2,000
# Average nodes:    284.32   29.87    39.06     19.77        429.63
# Average edges:    715.66   32.30    72.82     96.53        497.75
# Labels:           2        2        2         2            2
GCN
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 3.9):      4.5      5.2      7.1       1.9          0.9
GIN
  + FT
  + Ensemble
  + Ensemble + FT
  + GraphSym
  Δ (avg 2.5):      6.5      0.8      3.1       0.6          1.6
Table 3: Results of accuracy (%) on graph classification tasks. Δ denotes the absolute difference between the original GNN model and GraphSym, with the average in brackets. We follow the model selection strategies and data splits of errica2019fair.
Dataset     Δ between GCN and GraphSym
Cora        4.3
CiteSeer    3.5
PubMed      0.8
Table 4: Results of accuracy (%) on edge prediction tasks, comparing GCN, FT, Ensemble, Ensemble + FT, and GraphSym. Δ denotes the absolute improvement of GraphSym over the original GCN; the average improvement is 2.9.

4.2 Experimental Results

Node classification. We implement Ensemble and GraphSym based on GRAND feng2020graph. The comparison with the baselines on the Cora, CiteSeer, and PubMed datasets is reported in Table 1. GraphSym achieves state-of-the-art results among both the single-graph and multiple-graph GNNs, which demonstrates the effectiveness of symbiosis learning in modeling the relationship among multiple generated graph views. The improvement over Ensemble shows that the adaptive integration of multiple generated graph views in our symbiosis learning is superior to a simple ensemble of multiple-graph GNNs. In contrast to Ensemble, which requires simultaneous inference of multiple models, the inference cost of GraphSym is the same as that of a single-graph model. Moreover, we compare our method with Graft meng2020filter by replacing our adaptive exchange scheme with its adaptive weighting method. Our method consistently surpasses Graft, which suggests that our fine-grained approach of exchanging part of the resources of each GNN is more effective for multiple-graph GNNs.

To further evaluate the effectiveness of GraphSym, we implement GraphSym and compare it with the 3 variants of each baseline GNN model. The baseline models include GCN, GAT, APPNP, JKNet, and GCNII kipf2016semi; velivckovic2017graph; klicpera2018predict; xu2018representation; chen2020simple. The baseline results are reproduced based on their official code. Experimental results are reported in Table 2. GraphSym consistently outperforms the baselines by 1.9%∼3.8% (absolute improvements) on average. Its improvements over FT, Ensemble, and Ensemble+FT show that the expressiveness of symbiosis learning comes neither from extra training epochs nor from the larger model capacity of multiple-graph GNNs. Besides, in contrast to the ensemble methods, GraphSym utilizes only one network during inference, which is more computationally efficient.

ogbn-arxiv        GCNII   GCN_res-v2   GCN_DGL   GraphSAGE
Original
 + FT
 + Ensemble
 + Ensemble + FT
 + GraphSym
OOM
OOM
 Δ (avg 0.5):     0.2     0.6          0.6       0.4
Table 5: Results of accuracy (%) on ogbn-arxiv. Δ denotes the absolute difference between the original model and GraphSym, with the average in brackets.

Edge prediction and graph classification. To investigate the generality of GraphSym, we further conduct experiments on two other common graph tasks: edge prediction and graph classification. As shown in Tables 4 and 3, GraphSym consistently outperforms the original GNN models by a large margin. Meanwhile, GraphSym achieves higher accuracy than the 3 variants, further showing that adaptively exchanging parameters among multiple generated graph views is more effective than model ensembling.

Scalability. To show the scalability of GraphSym, we conduct experiments on the large citation dataset ogbn-arxiv. We select four top-ranked GNN models from the OGB leaderboard and then apply GraphSym on top of them with the same GNN architectures and hyperparameters (see Appendix E for more details). As shown in Table 5, our method outperforms the original methods and their ensembles, which demonstrates the scalability of GraphSym.

Figure 4: Training curves on Cora.
Figure 5: Few-shot learning on Cora.

4.3 Experimental Analysis

Over-smoothing. GNNs are known to suffer from the over-smoothing issue, where node features become indistinguishable as the number of feature propagation steps increases zhao2020pairnorm; chen2020measuring; liu2020towards. Thus, it is difficult to extend model capacity for large-scale graphs by deepening GNNs. We report the results of GCN kipf2016semi with an increasing number of propagation steps (layers) and implement GraphSym based on GCN for comparison. As shown in Figure 5, we empirically find that GraphSym can mitigate the over-smoothing issue compared to the original GCN. As the number of layers increases, the accuracy of the original GNN drops dramatically, from 0.8 to 0.1, whereas the accuracy of GraphSym decreases much more slowly. This suggests that, thanks to the adaptive information exchange scheme, GraphSym provides a principled way to extend model capacity to relatively large numbers of layers.

Few-shot. We further evaluate the effectiveness of GraphSym under few-shot settings. Taking Cora as the representative dataset, we vary the number of training nodes per class from 1 to 50 and keep the validation and test sets unchanged. As shown in Figure 5, GraphSym consistently outperforms GCN. Specifically, the relative improvements in classification accuracy are 4.0/3.3/4.2/0.9/2.2 on average for 1/5/10/20/50 labeled nodes per class, which shows that integrating information from multiple generated graph views makes more efficient use of limited supervision.

Figure 6: Hyper-parameter study on Cora.

Hyperparameter study. We study the effects of the hyperparameters of GraphSym through experiments on Cora based on the GCN model. There are two hyperparameters in the symbiotic phase: the iteration number T and the number of exchange channels K. Taking GraphSym on 4 graph views as an example, we first study the iteration number by varying T from 1 to 5 while using the default value of K. As shown in Figure 6 (a), our proposed framework achieves relatively stable performance. We further study the number of exchange channels by varying K from 1 to 15 (the hidden size of the GCN is 16) while fixing T. The best performance is achieved by exchanging part of the channels rather than all the weights in a layer, demonstrating that adaptively exchanging information in a complementary way brings more benefits. Overall, the performance of our framework is relatively stable across different hyperparameters, and thus does not rely on heavy, case-by-case hyperparameter tuning to achieve the best results.

5 Conclusion

In this paper, we propose a new framework for learning from multiple generated graph views, named graph symbiosis learning (GraphSym). In GraphSym, we propose a novel adaptive exchange method that iteratively substitutes redundant channels in the weight matrix of one GNN with informative channels of another GNN in a layer-by-layer manner. GraphSym does not rely on specific GNN architectures or specific methods to generate multiple graph views. Thus, existing GNNs can be readily integrated into our framework. Comprehensive experiments on real datasets show that our proposed training framework outperforms existing popular GNN models and their ensembles. For future work, we will utilize more adaptive graph augmentations to further improve the performance of GraphSym. We hope our work will inspire new ideas in exploring new training mechanisms for multiple-graph GNNs.

References

Appendix A Algorithm

The pseudo-code of GraphSym with multiple generated graph views is outlined in Algorithm 2. The training process involves generating multiple graph views, the individual phase, and the symbiotic phase. At the beginning (Lines 1-2), we generate multiple graph views via the graph augmentations. Then, in the individual phase (Lines 4-6), we train the GNNs with the default hyper-parameter settings and update their weights. Finally, in the symbiotic phase (Lines 10-15), we utilize the entropy criterion to decide what information is valuable to exchange and apply the adaptive exchange strategy to replace redundant channels of one GNN (the target network) with informative channels of another GNN (the source network). We then reverse the source and target networks and repeat the above process (Line 16).

Input: The original graph $\mathcal{G}$; number of graph views $M$, where $\mathcal{G}^m$ denotes the $m$-th generated graph view; GNN $f^m$, whose input is $\mathcal{G}^m$; $W^m_l$ denotes the weight in the $l$-th layer of the GNN corresponding to the $m$-th generated graph view, whose input channel is $C_{l-1}$ and output channel is $C_l$, and $w^m_{l,c}$ is the vector of the $c$-th channel; the number of layers of $f^m$ is $L$; learning rate $\eta$; iteration number $T$; number of exchange channels $K$; the graph augmentation functions $t_1, \dots, t_M$ to get multiple graph views.
Output: Prediction
1  for $m = 1, \dots, M$ do
2        Get the graph view $\mathcal{G}^m$ via the graph augmentation function $t_m$.
4  The individual phase:
5  for $m = 1, \dots, M$ do
6        Train GNN $f^m$ and compute the supervised classification loss $\mathcal{L}^m$.
7        Update the weight by gradient descent: $W^m \leftarrow W^m - \eta \nabla_{W^m} \mathcal{L}^m$.
9  The symbiotic phase:
10 for $\tau = 1, \dots, T$ do
11       for $m = 1, \dots, M$ do
12             Get the source and target GNNs: $W^s$, $W^t$.
13             for $l = 1, \dots, L$ do
14                   Calculate the Pearson correlations among all possible pairs of the channels in $W^t_l$.
15                   Find the pair of channels indexed by $i$ and $j$ with the highest correlation.
16                   for $k = 1, \dots, K$ do
17                         Substitute each channel of the source network for $w^t_{l,i}$ or $w^t_{l,j}$ of the target network to find an informative channel as in Eq. 3.
21       for $m = 1, \dots, M$ do
22             Exchange the roles of the source and target networks.
23             Repeat lines 11-15.
26 Output prediction via re-training $f^1, \dots, f^M$.
Algorithm 2 Graph symbiosis learning with multiple GNNs

Appendix B Graph augmentations

We implement the graph augmentation methods commonly used in previous work zhu2020graph; velivckovic2018deep; feng2020graph to generate multiple graph views. A code sketch of these augmentations is given after the list.

  • Masking node features. Randomly mask a fraction of node attributes with zeros. Formally, the generated node feature matrix $\tilde{X}$ is computed by

    $$\tilde{X} = [x_1 \circ m;\ x_2 \circ m;\ \dots;\ x_N \circ m]^{\top}, \tag{4}$$

    where $m \in \{0, 1\}^{d}$ is a random vector whose entries are drawn from a Bernoulli distribution, $[\cdot\,;\cdot]$ is the concatenation operator, and $\circ$ is the element-wise multiplication.

  • Corrupting node features. Randomly replace a fraction of node attributes with Gaussian noise. Formally, it can be calculated as follows:

    $$\tilde{x}_v = x_v \circ (1 - m) + \epsilon \circ m, \tag{5}$$

    where $m \in \{0, 1\}^{d}$ is a random vector whose entries are drawn from a Bernoulli distribution, $\epsilon$ is a random vector whose entries are drawn independently from a Gaussian distribution centered at $\bar{x}_v$, and $\bar{x}_v$ is the mean value of the vector $x_v$.

  • Dropping edges. Randomly remove edges in the graph. Formally, we sample a modified edge set $\tilde{\mathcal{E}}$ from the original edge set $\mathcal{E}$ with probability defined as follows:

    $$P\{(u, v) \in \tilde{\mathcal{E}}\} = 1 - p_{uv}, \tag{6}$$

    where $(u, v) \in \mathcal{E}$ and $p_{uv}$ is the probability of removing the edge $(u, v)$.

  • Extracting subgraphs. Extract the induced subgraph containing the nodes in a given subset $\tilde{\mathcal{V}} \subset \mathcal{V}$, i.e., keep the edges $\tilde{\mathcal{E}} = \{(u, v) \in \mathcal{E} : u, v \in \tilde{\mathcal{V}}\}$, following the experimental setting in velivckovic2018deep.
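
The following is a minimal NumPy sketch of the first three augmentations under the reconstructed Eqs. 4-6; the per-channel (rather than per-entry) masking, the unit noise scale, and the helper names are illustrative assumptions rather than the exact released implementation.

```python
import numpy as np

def mask_node_features(X, p, seed=0):
    """Masking node features (Eq. 4): zero out each feature channel with probability p."""
    rng = np.random.default_rng(seed)
    keep = rng.random(X.shape[1]) > p          # Bernoulli keep-mask over channels
    return X * keep

def corrupt_node_features(X, p, seed=0):
    """Corrupting node features (Eq. 5): replace a fraction p of channels with Gaussian noise."""
    rng = np.random.default_rng(seed)
    m = rng.random(X.shape[1]) < p             # channels to corrupt
    noise = rng.normal(loc=X.mean(axis=1, keepdims=True), scale=1.0, size=X.shape)
    return np.where(m, noise, X)

def drop_edges(edges, p, seed=0):
    """Dropping edges (Eq. 6): keep each edge independently with probability 1 - p."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(edges)) > p
    return [e for e, k in zip(edges, keep) if k]
```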

Appendix C Preliminaries of GNNs

Graph neural networks.

The training process of modern GNNs follows the message-passing mechanism hamilton2020graph. During each message-passing iteration, the hidden embedding $h_v^{(k)}$ of each node $v$ is updated by aggregating information from $v$'s neighborhood $\mathcal{N}(v)$, which can be expressed as follows:

$$h_v^{(k+1)} = \text{UPDATE}\Big(h_v^{(k)},\ \text{AGGREGATE}\big(\{h_u^{(k)} : u \in \mathcal{N}(v)\}\big)\Big), \tag{7}$$

where UPDATE and AGGREGATE are arbitrary differentiable functions (i.e., neural networks), and the output of AGGREGATE is the "message" aggregated from $v$'s neighborhood $\mathcal{N}(v)$. The initial embedding at $k = 0$ is set to the input features, i.e., $h_v^{(0)} = x_v$. After running $K$ iterations of GNN message passing, we obtain information from each node's $K$-hop neighborhood. Different GNNs are obtained by choosing different UPDATE and AGGREGATE functions. For example, the classical Graph Convolutional Network (GCN) kipf2016semi updates the hidden embeddings as

$$H^{(k+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(k)} W^{(k)}\big), \tag{8}$$

where $H^{(k)}$ is the hidden matrix of the $k$-th layer, $\tilde{A} = A + I$ is the re-normalized adjacency matrix, $\tilde{D}$ is the corresponding degree matrix of $\tilde{A}$, $W^{(k)}$ is the filter matrix of the $k$-th layer whose size is determined by the $k$-th hidden layer, and $\sigma$ is a nonlinear function, e.g., ReLU.
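
For concreteness, a dense NumPy sketch of one GCN propagation step (Eq. 8) follows; real implementations use sparse matrix operations, so this is purely illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Eq. 8): ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    d = A_tilde.sum(axis=1)                      # node degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0)          # ReLU non-linearity
```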

Problem setup.

In this work, we focus on undirected and unweighted graphs; however, it is straightforward to extend this work to directed and weighted graphs. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ is the set of edges between nodes, $X$ represents the node feature matrix, and $x_v$ is the feature vector of node $v$. The adjacency matrix $A$ is defined by $A_{uv} = 1$ if and only if $(u, v) \in \mathcal{E}$, and $A_{uv} = 0$ otherwise. Following the common semi-supervised transductive node classification setting, only a small subset of nodes $\mathcal{V}_L \subset \mathcal{V}$ is labeled, with corresponding labels $\mathcal{Y}_L$, where $y_v$ is the label of node $v$. Our goal is to use this partial information to infer the missing labels of the nodes in $\mathcal{V} \setminus \mathcal{V}_L$. Similarly, for the link prediction task, we are given the node set $\mathcal{V}$ and an incomplete set of observed edges between these nodes, and our goal is to infer the missing edges. For graph classification problems, given a set of graphs and their labels, our goal is to learn a function that predicts the labels of other graphs.

Training targets and loss functions.

Here we briefly summarize the training targets of node classification, link prediction, and graph classification. We denote the node and edge embeddings of the final layer of a GNN as $z_v$ and $z_{(u,v)}$, respectively, and the graph-level embedding produced by a readout function as $z_{\mathcal{G}}$. GNNs can then be trained in an end-to-end manner using a loss function of the form

$$\mathcal{L} = \text{Loss}(z, y), \tag{9}$$

where $z$ is the node, edge, or graph embedding and $y_v$, $y_{(u,v)}$, and $y_{\mathcal{G}}$ are the true labels of the corresponding node, edge, and graph, respectively. For example, Loss can be the cross-entropy loss in the node/graph classification task or the 0/1 loss in the edge prediction problem.
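
As one concrete instance of Eq. 9, the sketch below computes a masked cross-entropy loss over the labeled nodes in the semi-supervised node classification setting; it is an illustrative NumPy version, not the training code used in the experiments.

```python
import numpy as np

def masked_cross_entropy(logits, labels, labeled_mask):
    """Cross-entropy over labeled nodes only (an instance of Eq. 9 for node classification)."""
    logits, labels = logits[labeled_mask], labels[labeled_mask]
    logits = logits - logits.max(axis=1, keepdims=True)                # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```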

Appendix D Datasets

Dataset         # Nodes    # Edges      # Features   # Classes
Cora            2,708      5,278        1,433        7
CiteSeer        3,327      4,676        3,703        6
PubMed          19,717     44,327       500          3
Chameleon       2,277      31,421       2,325        5
Squirrel        5,201      198,493      2,089        5
Actor           7,600      26,752       931          5
Cornell         183        280          1,703        5
Texas           183        295          1,703        5
Wisconsin       251        466          1,703        5
ogbn-arxiv      169,343    1,166,243    128          40

Dataset         # Graphs   # Average nodes   # Average edges   # Classes
DD              1,178      284.32            715.66            2
NCI1            4,110      29.87             32.30             2
PROTEINS        1,113      39.06             72.82             2
IMDB-BINARY     1,000      19.77             96.52             2
REDDIT-BINARY   2,000      429.63            497.75            2
Table 6: Statistics of the real-world datasets.

In this section, we describe the details of the real-world graph datasets used in this paper (nodes, edges, features, and classes for the node/edge datasets; graphs, average nodes, average edges, and classes for the graph datasets). We report the statistics of these real-world datasets in Table 6. The descriptions of each dataset and their data splits are given as follows:


  • Cora, CiteSeer, and PubMed are academic citation networks originally introduced in sen2008collective, namata2012query, and yang2016revisiting, and are the most widely used benchmark datasets for semi-supervised node classification yang2016revisiting. In these networks, nodes represent papers and edges denote the citation of one paper by another. Node features are the bag-of-words representation of each paper, and node labels are the academic topics of the papers. We follow the public fixed split from yang2016revisiting, using 20 samples per class for training, 500 samples for validation, and 1,000 samples for testing.

  • Chameleon and Squirrel are two page-page graphs on specific topics in the Wikipedia network rozemberczki2021multi. In these datasets, nodes represent web pages and edges denote mutual links between pages. Node features are several informative nouns from the Wikipedia pages. We follow the node labels generated by pei2020geom, where nodes are classified into five categories in terms of the average monthly traffic of the web page. Following the method in pei2020geom, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing.

  • Actor is the actor co-occurrence induced subgraph of the film-director-actor-writer network tang2009social. In this network, nodes represent actors and an edge between two nodes denotes co-occurrence on the same Wikipedia page. Node features are generated from keywords on the Wikipedia pages. Following the node labels generated by pei2020geom, nodes are classified into five categories in terms of the words of the corresponding actor's Wikipedia page. Following the method in pei2020geom, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing.

  • Cornell, Texas, and Wisconsin are webpage datasets of the computer science departments of the corresponding universities. In these datasets, nodes represent web pages and edges denote the hyperlinks between them. Node features are the bag-of-words representation of the corresponding web pages. We also follow the node labels generated by pei2020geom, where nodes are classified into five categories according to the identity of the people the pages describe. Following the method in pei2020geom, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing.

  • Ogbn-arxiv is a recently proposed large-scale dataset of paper citation networks hu2020open. Nodes represent arXiv papers and edges denote the citations between two papers. Node features are 128-dimensional feature vectors obtained by averaging the embeddings of words in its title and abstract and node labels are the primary categories of the arXiv papers. We use the public data split based on the publication dates of the papers.

  • DD, NCI1, and PROTEINS are chemical compound datasets introduced in Morris+2020. Nodes represent secondary structure elements (SSEs) and edges between two nodes denote a neighborhood relationship in the amino-acid sequence or in 3D space. We use one-hot embedding vectors as the features of different nodes. Graph labels are categorised into two classes according to chemical properties. Following the method in errica2019fair, we randomly split the graphs of each class into 80%, 10%, and 10% for training, validation, and testing.

  • IMDB-BINARY and REDDIT-BINARY are social networks representing movie collaborations and an online discussion forum, respectively. Nodes represent actors/actresses and users, respectively. An edge between two nodes denotes that the two actors appear in the same movie in IMDB-BINARY, or that one user responds to another's comment in REDDIT-BINARY. We use one-hot embedding vectors as the features of different nodes. Graph labels are categorised into two classes according to their community or subreddit. Following the method in errica2019fair, we randomly split the graphs of each class into 80%, 10%, and 10% for training, validation, and testing.

Appendix E The Experimental Setup

For reproducibility, we provide our experimental environment, baseline GNN models, dataset websites, and implementation details. The implementation details include the experimental settings, the detailed hyper-parameters from the original papers that we follow, and the configurations of the graph augmentation methods.

E.1 Experimental Settings

All experiments are conducted with the following settings:


  • Operating system: Ubuntu Linux 16.04.7 LTS

  • CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

  • GPU: NVIDIA GP102GL [Tesla P40]

  • Software versions: Python 3.7; Pytorch 1.7.0; Numpy 1.16.2; SciPy 1.2.1; Pandas 1.0.5; scikit-learn 0.23.1; PyTorch-geometric 1.6.3; Open Graph Benchmark 1.3.1

E.2 Baseline GNNs and Datasets

We follow the experimental settings in their original paper. Table 7 summarizes URLs and commit numbers of baseline codes. Datasets used in this paper can be found in the following URLs, shown in Table 8.

Methods URL Commit
GCN https://github.com/rusty1s/pytorch_geometric db3bdc2
GAT https://github.com/rusty1s/pytorch_geometric db3bdc2
Node classification APPNP https://github.com/benedekrozemberczki/APPNP fce0d76
JKNET-CAT https://github.com/rusty1s/pytorch_geometric db3bdc2
JKNET-MAX https://github.com/rusty1s/pytorch_geometric db3bdc2
GCNII https://github.com/chennnM/GCNII ca91f56
GRAND https://github.com/THUDM/GRAND ba164c6
Edge prediction GCN-EDGE https://github.com/rusty1s/pytorch_geometric e6b8d6
Graph classification GCN-GRAPH https://github.com/diningphil/gnn-comparison 0e0e9b1
GIN https://github.com/rusty1s/pytorch_geometric db3bdc2
ogbn-arxiv GCN-RES-v2 https://github.com/ytchx1999/GCN_res-CS-v2 a3e7d1f
GCN-DGL https://github.com/Espylapiza/dgl 5feada0
GraphSAGE https://github.com/snap-stanford/ogb be38132
Table 7: Baseline GNNs.
Datasets URL
Cora, CiteSeer, PubMed https://github.com/rusty1s/pytorch_geometric
Chameleon, Squirrel https://github.com/graphdml-uiuc-jlu/geom-gcn
Actor https://github.com/graphdml-uiuc-jlu/geom-gcn
Cornell, Texas, Wisconsin https://github.com/graphdml-uiuc-jlu/geom-gcn
DD, NCI1, PROTEINS https://github.com/rusty1s/pytorch_geometric
IMDB-BINARY, REDDIT-BINARY https://github.com/rusty1s/pytorch_geometric
ogbn-arxiv https://github.com/snap-stanford/ogb
Table 8: Datasets.

E.3 Implementation Details

The code of GraphSym used in Table 1 of the main body is based on the GRAND feng2020graph codebase. All code can be found in the supplementary materials. For the reproducibility of our proposed framework, we also list all the hyper-parameter values used in our framework. Tolerance_num is the number of epochs with no improvement after which training is stopped. Tolerance denotes the quantity monitored by the early stopping mechanism, which has three choices: loss, accuracy, and both. Specifically, loss means training stops when the monitored loss has stopped decreasing, accuracy means training stops when the monitored accuracy has stopped increasing, and both means training stops when either the monitored loss has stopped decreasing or the monitored accuracy has stopped increasing. The details are as follows.


  • GCN kipf2016semi:

    • All node classification datasets: layers: 2; hidden: 16; dropout: 0.5; epochs: 200; lr: 0.01; tolerance: loss; tolerance_num: 10; weight_decay: 5e-4 in the first GCN and 0 in the second GCN.

  • GAT velivckovic2017graph:

    • All node classification datasets except PubMed: layers: 2; hidden: 8; heads: 8; dropout: 0.6; attention_drop:0.6; epochs: 100000; lr: 0.005; tolerance: both; tolerance_num: 100; weight_decay: 0.0005.

    • PubMed: layers: 2; hidden: 8; heads: 8; dropout: 0.6; attention_drop:0.6; epochs: 100000; lr: 0.01; tolerance: both; tolerance_num: 100; weight_decay: 0.001.

  • APPNP klicpera2018predict:

    • All node classification datasets: layers: 2; hidden: 64; dropout: 0.5; epochs: 2000; lr: 0.01; tolerance: accuracy; tolerance_num: 500; weight_decay: 0.

  • JKNET-CAT xu2018representation:

    • All node classification datasets: layers: 2; hidden: 32; dropout: 0.75; epochs: 1500; lr: 0.005; tolerance: loss; tolerance_num: 100; weight_decay: 0.0005; mode: cat.

  • JKNET-MAX xu2018representation:

    • All node classification datasets: layers: 2; hidden: 32; dropout: 0.75; epochs: 1500; lr: 0.005; tolerance: loss; tolerance_num: 100; weight_decay: 0.0005; mode: max.

  • GCNII chen2020simple:

    • Cora: layers: 64; hidden: 64; dropout: 0.6; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.01 for the dense layer and 0.0005 for the convolutional layers; lambda: 0.5; alpha: 0.1.

    • CiteSeer: layers: 32; hidden: 256; dropout: 0.7; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.01 for the dense layer and 0.0005 for the convolutional layers; lambda: 0.6; alpha: 0.1.

    • PubMed: layers: 16; hidden: 256; dropout: 0.5; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.0005 for the dense layer and 0.0005 for the convolutional layers; lambda: 0.4; alpha: 0.1.

    • Chameleon, Squirrel, Actor: layers: 8; hidden: 64; dropout: 0.5; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.0005; lambda: 1.5; alpha: 0.2.

    • Cornell: layers: 16; hidden: 64; dropout: 0.5; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.001; lambda: 1; alpha: 0.5.

    • Texas: layers: 32; hidden: 64; dropout: 0.5; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.0001; lambda: 1.5; alpha: 0.5.

    • Wisconsin: layers: 16; hidden: 64; dropout: 0.5; epochs: 1500; lr: 0.01; tolerance: loss; tolerance_num: 100; weight_decay: 0.0005; lambda: 1; alpha: 0.5.

    • ogbn-arxiv: layers: 16; hidden: 256; dropout: 0.1; epochs: 1000; lr: 0.001; tolerance: accuracy; tolerance_num: 200; weight_decay: 0; lambda: 1; alpha: 0.5.

  • GRAND feng2020graph:

    • Cora: layers: 2; hidden: 32; input_dropout: 0.5; hidden_dropout: 0.5; epochs: 5000; lr: 0.01; tolerance: both; tolerance_num: 200; weight_decay: 0.0005; dropnode_rate: 0.5; lambda: 1.0; temperature: 0.5; order: 8; sample: 4.

    • CiteSeer: layers: 2; hidden: 32; input_dropout: 0.0; hidden_dropout: 0.2; epochs: 5000; lr: 0.01; tolerance: both; tolerance_num: 200; weight_decay: 0.0005; dropnode_rate: 0.5; lambda: 0.7; temperature: 0.3; order: 2; sample: 2.

    • PubMed: layers: 2; hidden: 32; input_dropout: 0.6; hidden_dropout: 0.8; epochs: 5000; lr: 0.01; tolerance: both; tolerance_num: 200; weight_decay: 0.0005; dropnode_rate: 0.5; lambda: 1.0; temperature: 0.2; order: 5; sample: 4; use_batchnorm: True.

  • GCN-EDGE (https://github.com/rusty1s/pytorch_geometric/blob/master/examples/link_pred.py):

    • All edge prediction datasets: layers: 2; hidden: 128; dropout: 0.5; epochs: 100; lr: 0.01; weight_decay: 0.

  • GCN-GRAPH errica2019fair:

    • All graph classification datasets: batch: 50; layers: 2; hidden: 32; dense: 128; dropout: 0.5; epochs: 1000; lr: 0.0001; weight_decay: 0; k: 0.9.

  • GIN xu2018powerful:

    • All graph classification datasets: batch: 32; layers: 5; hidden: 64; dropout: 0.5; epochs: 1000; lr: 0.01; weight_decay: 0; opt_scheduler: step.

  • GCN-RES-v2 (https://github.com/ytchx1999/GCN_res-CS-v2):

    • ogbn-arxiv: layers: 8; hidden: 128; dropout: 0.5; epochs: 500; lr: 0.01; weight_decay: 0.

  • GCN-DGL (https://github.com/Espylapiza/dgl/blob/master/examples/pytorch/ogb/ogbn-arxiv/models.py):

    • ogbn-arxiv: layers: 3; hidden: 256; dropout: 0.75; epochs: 1000; lr: 0.005; weight_decay: 0.

  • GraphSAGE (https://github.com/snap-stanford/ogb/blob/master/examples/nodeproppred/arxiv/gnn.py):

    • ogbn-arxiv: layers: 5; hidden: 1024; dropout: 0.5; epochs: 100; lr: 0.001; weight_decay: 0.

E.4 Configurations of the graph augmentation methods

In this section, we list the probability parameters controlling the sampling process in graph augmentations to generate multiple graph views of all experiments in the main body.


  • GCN kipf2016semi:

    • Cora, CiteSeer, Cornell: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.0.

    • Other node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GAT velivckovic2017graph:

    • PubMed, Squirrel, Actor: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.0.

    • Chameleon, Squirrel, Actor: masking node features: 0.4; Corrupting node features: 0.4; Dropping edges: 0.4; Extracting subgraphs: 0.4.

    • Cornell: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.0; Extracting subgraphs: 0.0.

    • Other node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • APPNP klicpera2018predict:

    • Chameleon: masking node features: 0.5; Corrupting node features: 0.5; Dropping edges: 0.5; Extracting subgraphs: 0.5.

    • Actor: masking node features: 0.0; Corrupting node features: 0.1; Dropping edges: 0.0; Extracting subgraphs: 0.0.

    • Cornell, Wisconsin: masking node features: 0.4; Corrupting node features: 0.4; Dropping edges: 0.4; Extracting subgraphs: 0.4.

    • Texas: masking node features: 0.6; Corrupting node features: 0.6; Dropping edges: 0.6; Extracting subgraphs: 0.6.

    • Other node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • JKNET-CAT xu2018representation:

    • CiteSeer, Chameleon, Squirrel, Cornell: masking node features: 0.0; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.0.

    • PubMed: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.0.

    • Other node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • JKNET-MAX xu2018representation:

    • CiteSeer, PubMed, Squirrel, Cornell: masking node features: 0.0; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.0.

    • Actor: masking node features: 0.2; Corrupting node features: 0.2; Dropping edges: 0.2; Extracting subgraphs: 0.2.

    • Texas: masking node features: 0.2; Corrupting node features: 0.1; Dropping edges: 0.1; Extracting subgraphs: 0.0.

    • Other node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GCNII chen2020simple:

    • Cornell: masking node features: 0.6; Corrupting node features: 0.6; Dropping edges: 0.6; Extracting subgraphs: 0.6.

    • Actor: masking node features: 0.3; Corrupting node features: 0.3; Dropping edges: 0.3; Extracting subgraphs: 0.3.

    • Texas, Wisconsin: masking node features: 0.2; Corrupting node features: 0.2; Dropping edges: 0.2; Extracting subgraphs: 0.2.

    • Other node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GRAND feng2020graph:

    • All node classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GCN-EDGE (https://github.com/rusty1s/pytorch_geometric/blob/master/examples/link_pred.py):

    • All edge prediction datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GCN-GRAPH errica2019fair:

    • All graph classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GIN xu2018powerful:

    • All graph classification datasets: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GCN-RES-v2 (https://github.com/ytchx1999/GCN_res-CS-v2):

    • ogbn-arxiv: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GCN-DGL (https://github.com/Espylapiza/dgl/blob/master/examples/pytorch/ogb/ogbn-arxiv/models.py):

    • ogbn-arxiv: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

  • GraphSAGE (https://github.com/snap-stanford/ogb/blob/master/examples/nodeproppred/arxiv/gnn.py):

    • ogbn-arxiv: masking node features: 0.1; Corrupting node features: 0.0; Dropping edges: 0.1; Extracting subgraphs: 0.1.

Appendix F Additional Experimental Results

Figure 7: Accuracy and loss on ogbn-arxiv.

In this section, we plot the training loss and accuracy curves as a sanity check to investigate whether GraphSym trains better than the original GNN baseline models. Figure 7 shows the accuracy and loss curves of GraphSym and GCNII on ogbn-arxiv, recorded on the training set. The blue line (GraphSym) is above the orange line (GCNII) in Figure 7(a), while the blue line (GraphSym) is below the orange line (GCNII) in Figure 7(b). We can therefore empirically conclude, as a sanity check, that GraphSym indeed trains better.

We also plot the accuracy and loss curves of GCN, GAT, APPNP, JKNET-CAT, JKNET-MAX, and GCNII on the Cora, CiteSeer, and PubMed datasets. Note that some lines in the figures do not reach the final training epoch because of the early stopping mechanism prechelt1998early used in the original papers. In Figure 8, the blue line (multi-view symbiosis) is always above the orange line (original GNN), and in Figure 9 the blue line is always below the orange line. These figures indicate that GraphSym consistently outperforms the original GNN baseline models and suggest that GraphSym empirically improves the training process of the original GNNs.

Figure 8: Accuracy curves on comparison between the baseline GNNs and GraphSym on Cora, CiteSeer, and PubMed.
Figure 9: Loss curves on comparison between the baseline GNNs and GraphSym on Cora, CiteSeer, and PubMed.