Rethinking and Scaling Up Graph Contrastive Learning: An Extremely Efficient Approach with Group Discrimination

Graph contrastive learning (GCL) alleviates the heavy reliance on label information for graph representation learning (GRL) via self-supervised learning schemes. The core idea is to learn by maximising mutual information for similar instances, which requires similarity computation between two node instances. However, this operation can be computationally expensive. For example, the time complexity of two commonly adopted contrastive loss functions (i.e., InfoNCE and the JSD estimator) for a node is O(ND) and O(D), respectively, where N is the number of nodes and D is the embedding dimension. Additionally, GCL normally requires a large number of training epochs to be well-trained on large-scale datasets. Inspired by an observation of a technical defect (i.e., inappropriate usage of the Sigmoid function) commonly found in two representative GCL works, DGI and MVGRL, we revisit GCL and introduce a new learning paradigm for self-supervised GRL, namely Group Discrimination (GD), and propose a novel GD-based method called Graph Group Discrimination (GGD). Instead of similarity computation, GGD directly discriminates two groups of summarised node instances with a simple binary cross-entropy loss. As such, GGD only requires O(1) for the loss computation of a node. In addition, GGD requires far fewer training epochs to obtain competitive performance compared with GCL methods on large-scale datasets. These two advantages make GGD extremely efficient. Extensive experiments show that GGD outperforms state-of-the-art self-supervised methods on 8 datasets. In particular, GGD can be trained in 0.18 seconds (6.44 seconds including data preprocessing) on ogbn-arxiv, which is orders of magnitude (10,000+ times) faster than GCL baselines while consuming much less memory. Trained for 9 hours on ogbn-papers100M, which has over 1 billion edges, GGD outperforms its GCL counterparts in both accuracy and efficiency.


1 Introduction

Graph Neural Networks (GNNs) have been widely adopted for learning representations of graph-structured data. With a message-passing approach that aggregates neighbouring information based on the topology of a graph, GNNs can learn effective low-dimensional node embeddings, which can be used for a variety of downstream tasks such as node classification Jin et al. (2021), link prediction Zhang and Chen (2018), and graph classification Hassani and Khasahmadi (2020). GNNs have been further applied in diverse domains, e.g., graph similarity computation Di et al. (2022), recommendation systems Fan et al. (2019), heterophilic graphs Zheng et al. (2022), trustworthy systems Zhang et al. (2022), and anomaly detection Liu et al. (2021); Zheng et al. (2021b).

However, many GNNs adopt a supervised learning manner and train models with label information, which is expensive and labour-intensive to collect in the real world. To address this issue, a few studies (e.g., DGI Veličković et al. (2019), MVGRL Hassani and Khasahmadi (2020), GMI Peng et al. (2020), GRACE Zhu et al. (2020), and Subg-Con Jiao et al. (2020)) borrow the idea of contrastive learning from computer vision (CV) and introduce Graph Contrastive Learning (GCL) methods for self-supervised GRL. The core idea of these methods is to maximise the mutual information (MI) between an anchor node and its positive counterparts, which share similar semantic information, while doing the opposite for negative counterparts, as shown in Figure 1(a). Nonetheless, such a scheme relies on expensive similarity calculation in contrastive loss computation. Taking two commonly adopted contrastive loss functions, InfoNCE and the JSD estimator, as examples, the time complexity of loss computation for a node is O(ND) and O(D), respectively. Here, N represents the number of nodes, and D is the embedding dimension. Additionally, GCL normally requires a large number of training epochs to be well-trained on large-scale datasets. Thus, when the size of the dataset is large, these methods require a significant amount of time and resources to be well-trained.

Figure 1: The left subfigure shows the GCL learning scheme. The red line indicates MI maximisation between two nodes, each of which is a $D$-dimensional embedding vector, while the blue line indicates the opposite operation. The right subfigure presents Group Discrimination. It discriminates summarised node embeddings, each of which is a single scalar, from different groups.
Method Pre Tr Epo Total(E) Imp(E) Total(T) Imp(T) Acc
GBT(256) 5.52 6.47 300 1,946.52 - 1,941.00 - 70.1
GGD(256) 6.26 0.18 1 6.44 302.25 0.18 10,783.33 70.3
GGD(1,500) 6.26 0.95 1 7.21 269.96 0.95 2,043.16 71.6
Table 1: Training time comparison in seconds between GGD and GBT Zbontar et al. (2021) (i.e., the most efficient GCL baseline as shown in Section 5.1) on ogbn-arxiv. The number in brackets is the hidden size. ‘Pre’, ‘Tr’ and ‘Epo’ indicate preprocessing time, training time per epoch, and the number of epochs for training GNNs. ‘Total(E)’ is the total end-to-end training time (i.e., including preprocessing), which equals Pre + Epo × Tr, and ‘Total(T)’ is the total training time, which equals Epo × Tr. ‘Imp(E)’ and ‘Imp(T)’ indicate how many times GGD improves on ‘Total(E)’ and ‘Total(T)’. ‘Acc’ is the averaged accuracy on the test set over five runs.

Though a few GCL works attempt to alleviate the training load of InfoNCE by removing negative node pairs with specially designed schemes, e.g., BGRL Thakoor et al. (2021) and GBT Bielak et al. (2021), these methods still require O(D) to compute the contrastiveness for a node. Driven by the success of BYOL Grill et al. (2020) in CV, BGRL Thakoor et al. (2021) adopts a bootstrapping scheme, which only contrasts a node from the online network (i.e., updated with gradients) to its corresponding embedding from the target network (i.e., updated with momentum and stop-gradient). Based on Barlow Twins Zbontar et al. (2021), GBT Bielak et al. (2021) utilises a cross-correlation-based loss function to get rid of negative samples. However, these two methods still require O(D) as they still need to conduct similarity computation. Thus, the complexity of these two methods is only on par with the JSD estimator, and they still suffer from inefficiency in model training.

To boost the training efficiency of self-supervised GRL, inspired by an observation of a technical defect (i.e., inappropriate application of the Sigmoid function) in two representative GCL studies, we introduce a novel learning paradigm, namely Group Discrimination (GD). Remarkably, with GD, the time complexity of loss computation for a node is only O(1), making the scheme extremely efficient. Instead of similarity computation, GD directly discriminates a group of positive nodes from a group of negative nodes, as shown in Figure 1(b). Specifically, GD defines summarised node embeddings generated from the original graph as the positive group, where each node is summarised into a single scalar, while embeddings obtained with corrupted topology are regarded as the negative group. Then, GD trains the model by classifying these embeddings into the correct group with a very simple binary cross-entropy loss. By doing so, the model can extract valuable self-supervised signals from learning the edge distribution of a graph. Compared with GCL, GD enjoys numerous merits including extremely fast training, fast convergence (e.g., 1 epoch to be well-trained on large-scale datasets), and high scalability, while achieving SOTA or on-par performance with existing GCL approaches.

Using GD as the backbone, we design a new self-supervised GRL model with a Siamese structure called Graph Group Discrimination (GGD). Firstly, we can optionally augment a given graph with augmentation techniques, e.g., feature and edge dropout. Then, the augmented graph is fed into a GNN encoder and a projector to obtain embeddings for the positive group. After that, the augmented feature is corrupted with node shuffling (i.e., disarranging the order of nodes in the feature matrix) to disrupt the topology of the graph and input to the same network for obtaining embeddings of the opposing group. Finally, the model is trained by discriminating the summarisation of these two groups of nodes. The contributions of this paper are threefold: 1) We introduce a novel and efficient self-supervised GRL paradigm, namely Group Discrimination (GD). Notably, with GD, the time complexity of loss computation for a node is only O(1). 2) Based on GD, we propose a new self-supervised GRL model, GGD, which is fast in training and convergence, and possesses high scalability. 3) We conduct extensive experiments on eight datasets, including an extremely large dataset, ogbn-papers100M, with over 1 billion edges. The experiment results show that our proposed method reaches state-of-the-art performance while consuming much less time and memory than baselines, e.g., 10,783 times faster than the most efficient GCL baseline with its best selected number of epochs Bielak et al. (2021), as shown in Table 1.

2 Rethinking Representative GCL Methods

Figure 2: The architecture of DGI. Cubes indicate node embeddings. Red and blue lines represent MI maximisation and minimisation, respectively. $\mathcal{G}$ and $\tilde{\mathcal{G}}$ denote the original graph and the corrupted graph. $\vec{s}$ is the summary vector.

In this section, we analyse a technical defect observed in two representative GCL methods, DGI Veličković et al. (2019) and MVGRL Hassani and Khasahmadi (2020). Then, based on this defect, we show that the learning mechanism behind these two approaches is not attributable to contrastive learning but to a new paradigm, Group Discrimination. Finally, from the analysis, we provide the definition of this new concept.

2.1 Rethinking DGI

DGI Veličković et al. (2019) is the first work introducing contrastive learning into GRL. However, due to a technical defect observed in its official open-source code, we find that it does not actually work as the authors intended (i.e., learning via MI maximisation). The backbone that truly makes it work is a new paradigm, Group Discrimination.

Constant Summary Vector. As shown in Figure 2, the original idea of DGI is to maximise the MI (i.e., the red line) between a node and the summary vector $\vec{s}$, which is obtained by averaging all node embeddings in a graph $\mathcal{G}$. Also, to regularise the model training, DGI corrupts $\mathcal{G}$ by shuffling the node order of the input feature matrix $\mathbf{X}$ to get $\tilde{\mathcal{G}}$. Then, the embeddings generated from $\tilde{\mathcal{G}}$ serve as negative samples, which are pulled apart from the summary vector via MI minimisation.

Activation Statistics Cora CiteSeer PubMed
ReLU/LReLU/PReLU Mean 0.50 0.50 0.50
ReLU/LReLU/PReLU Std 1.3e-03 1.0e-04 4.0e-04
ReLU/LReLU/PReLU Range 1.4e-03 8.0e-04 1.5e-03
Sigmoid Mean 0.62 0.62 0.62
Sigmoid Std 5.4e-05 2.9e-05 6.6e-05
Sigmoid Range 3.6e-03 3.0e-03 3.2e-03
Table 2: Summary vector statistics on three datasets with different activation functions, including ReLU, LeakyReLU (LReLU), PReLU, and Sigmoid.

Nonetheless, in the implementation of DGI, a Sigmoid function is inappropriately applied to the summary vector generated from a GNN whose weights are initialised with Xavier initialisation. As a result, the elements of the summary vector are all very close to the same value. We have validated this finding on three datasets (i.e., Cora, CiteSeer and PubMed) with different activation functions used in the GNN encoder, including ReLU, Leaky ReLU, PReLU, and Sigmoid. The experiment results are shown in Table 2, which shows that the summary vectors on all datasets are approximately a constant vector $\epsilon \cdot \vec{1}$, where $\epsilon$ is a scalar and $\vec{1}$ is an all-ones vector (e.g., $\epsilon = 0.5$ on these datasets with ReLU, Leaky ReLU or PReLU as the activation).
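This observation can also be checked numerically. Below is a minimal PyTorch sketch with random stand-in data mimicking a Cora-like graph (sparse, row-normalised features and a random adjacency); the exact numbers will differ from Table 2, but the near-constant behaviour of the summary vector should be visible.

```python
import torch

torch.manual_seed(0)
N, D, D_out = 2708, 1433, 512                 # Cora-like sizes (illustrative)

# Sparse, row-normalised bag-of-words-style features: ~20 active dimensions per node.
X = (torch.rand(N, D) < 20 / D).float()
X = X / X.sum(dim=1, keepdim=True).clamp(min=1)

A_hat = (torch.rand(N, N) < 0.002).float() + torch.eye(N)   # toy graph with self-loops
A_norm = A_hat / A_hat.sum(dim=1, keepdim=True)             # row-normalised adjacency

W = torch.empty(D, D_out)
torch.nn.init.xavier_uniform_(W)                            # Xavier initialisation, as in DGI

for act in (torch.relu, torch.sigmoid):
    H = act(A_norm @ X @ W)                                 # one-layer GCN output
    s = torch.sigmoid(H.mean(dim=0))                        # summary vector passed through Sigmoid
    print(f"{act.__name__:7s} "
          f"mean={s.mean().item():.2f} std={s.std().item():.1e} "
          f"range={(s.max() - s.min()).item():.1e}")
```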

To theoretically explain this phenomenon, we present the theorem below:

Theorem 1

Given a feature matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$ and a GCN encoder $g$ whose weight matrix is initialised with Xavier initialisation, we can obtain its embedding $\mathbf{H} = g(\mathbf{X}, \mathbf{A})$. Then, when $D$ increases, the data range of the Sigmoid function taking $\mathbf{H}$ as input (i.e., $\max(\mathrm{Sigmoid}(\mathbf{H})) - \min(\mathrm{Sigmoid}(\mathbf{H}))$) converges to 0.

Based on this theorem, we can see that if the dimension $D$ of the input feature matrix $\mathbf{X}$ is large, these summary vectors lose their variance and become constant vectors. The proof of the theorem is presented in Appendix A.1.

To evaluate the effect of $\epsilon$ on the constant summary vector, we vary the scalar $\epsilon$ (from 0 to 1 in increments of 0.2) to change the constant summary vector and report the model performance (i.e., averaged accuracy over five runs) in Table 3.

Dataset 0 0.2 0.4 0.6 0.8 1.0
Cora 70.3 82.4 82.3 82.5 82.3 82.5
CiteSeer 61.8 71.7 71.9 71.6 71.7 71.6
PubMed 68.3 77.8 77.9 77.7 77.4 77.2
Table 3: The experiment results on three datasets with the value of $\epsilon$ for the summary vector varying from 0 to 1.0.

From this table, we can see that, except for $\epsilon = 0$, the model performance is only trivially affected by the choice of $\epsilon$ for the constant summary vector. When the summary vector is set to 0, the model performance plummets because node embeddings become all zero when multiplied with such a vector, and the model converges to a trivial solution. As the summary vector only has a trivial effect on model training, the hypothesis of DGI Veličković et al. (2019), namely learning via contrastiveness between anchor nodes and the summary instance, does not hold, which raises a question to be investigated: what truly leads to the success of DGI?

Simplifying DGI. To answer this question, we simplify the loss proposed in DGI by using an all-ones vector as the summary vector (i.e., setting $\epsilon = 1$) and simplifying the discriminator (i.e., removing the learnable weight matrix). Then, we rewrite the loss in the following form:

$$\mathcal{L} = \frac{1}{2N}\left(\sum_{i=1}^{N}\log\big(\mathrm{sum}(\mathbf{h}_i)\big) + \sum_{i=1}^{N}\log\big(1-\mathrm{sum}(\tilde{\mathbf{h}}_i)\big)\right) \qquad (1)$$

where $\cdot$ is the vector multiplication operation, $N$ is the number of nodes in a graph, $\mathbf{h}_i$ and $\tilde{\mathbf{h}}_i$ are the original and corrupted embeddings of node $i$, $\mathrm{sum}(\cdot)$ is the summation function, and $\mathcal{D}(\cdot)$ is a discriminator for bilinear transformation, which can be formulated as follows:

$$\mathcal{D}(\mathbf{h}_i, \vec{s}) = \mathbf{h}_i \cdot \mathbf{W} \cdot \vec{s} \qquad (2)$$
Experiment Method Cora CiteSeer PubMed
Accuracy DGI 81.7 71.5 77.3
Accuracy DGI-BCE 82.5 71.7 77.7
Memory DGI 4,189MB 8,199MB 11,471MB
Memory DGI-BCE 1,475MB|64.8% 1,587MB|80.6% 1,629MB|85.8%
Time DGI 0.085s 0.134s 0.158s
Time DGI-BCE 0.010s|8.5 0.021s|6.4 0.015s|10.5
Table 4: Comparison of the original DGI and DGI-BCE (DGI trained with the BCE loss in Equation 3) in terms of accuracy (averaged over five runs), memory consumption (in MB) and training time per epoch (in seconds). The number after | shows how many times DGI-BCE improves over DGI.

where $\mathbf{W}$ is a learnable weight matrix. Specifically, as shown in Equation 2, by removing the weight matrix $\mathbf{W}$, $\mathbf{h}_i$ is directly multiplied with $\vec{s}$. As $\vec{s}$ is a vector containing only ones, the multiplication of $\mathbf{h}_i$ and $\vec{s}$ is equivalent to summing the entries of $\mathbf{h}_i$ directly. From this form, we can see that the multiplication of $\mathbf{h}_i$ and the summary vector only serves as an aggregation function (i.e., summation aggregation) to summarise the information in $\mathbf{h}_i$. To explore the effect of other aggregation functions, we replace the summation function in Equation 1 with other aggregation methods, including mean-, minimum- and maximum-pooling and linear aggregation. We report the experiment results (i.e., averaged accuracy over five runs) in Table 5. The table shows that replacing the summation function with other aggregation methods still works, while summation and linear aggregation achieve comparatively better performance.

Based on Equation 1, we can rewrite it as a very simple binary cross-entropy loss if we also include the corrupted nodes as data samples and let $\hat{y}_i$ denote the summarised (scalar) embedding of node $i$:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{2N}\Big(y_i\log \hat{y}_i + (1-y_i)\log\big(1-\hat{y}_i\big)\Big) \qquad (3)$$
Method Cora CiteSeer PubMed
Sum 82.5 71.7 77.7
Mean 81.8 71.8 76.5
Min 80.4 61.7 70.1
Max 71.4 65.3 70.2
Linear 82.2 72.1 77.9
Table 5: The experiment results on three datasets with different aggregation functions applied to node embeddings.

where $y_i$ denotes the indicator for node $i$ (i.e., $y_i$ is 0 if node $i$ is corrupted and 1 otherwise), and $\hat{y}_i$ represents the summarisation of node $i$'s embedding. As we include corrupted nodes as data samples, the number of nodes to be processed is doubled to $2N$ (i.e., the number of corrupted nodes is equal to the number of original nodes). From the equation above, we can easily observe that what DGI truly does is discriminate a group of summarised original node embeddings from another group of summarised corrupted node embeddings, as shown in Figure 1. We name this self-supervised learning paradigm "Group Discrimination". To validate the effectiveness of this paradigm, we replace the original DGI loss with Equation 3, yielding a variant we denote DGI-BCE, and compare it with DGI on three datasets in terms of training time, memory efficiency and model performance, as shown in Table 4. Here, DGI-BCE adopts the same parameter setting as DGI. From this table, we can observe that DGI-BCE dramatically improves DGI in both memory- and time-efficiency while slightly enhancing its model performance. This is attributable to the removal of multiplication operations between node pairs, which eases the burden of computation and memory consumption in both forward and backward propagation. In Section 4.1, we further explore GD and compare it with existing contrastive losses.
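For concreteness, a minimal PyTorch sketch of the loss in Equation 3 is given below: node embeddings from the original and corrupted graphs are summed into scalars and classified with binary cross-entropy (using the numerically stable logits form). The random embeddings stand in for GNN outputs, and the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def group_discrimination_loss(h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """BCE loss over summarised (scalar) node embeddings, cf. Equation 3."""
    logits = torch.cat([h_pos.sum(dim=1), h_neg.sum(dim=1)])   # sum-aggregation: one scalar per node
    labels = torch.cat([torch.ones(h_pos.size(0)),             # 1 for the original (positive) group
                        torch.zeros(h_neg.size(0))])           # 0 for the corrupted (negative) group
    return F.binary_cross_entropy_with_logits(logits, labels)

# Placeholder embeddings standing in for GNN outputs on the original / corrupted graph.
h = torch.randn(2708, 512, requires_grad=True)
h_tilde = torch.randn(2708, 512, requires_grad=True)
loss = group_discrimination_loss(h, h_tilde)
loss.backward()
print(float(loss))
```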

2.2 Rethinking MVGRL

Figure 3: The architecture of MVGRL. Here, augment means augmentation. $\vec{s}$ is the summary vector based on $\mathcal{G}$, and $\vec{s}^{\alpha}$ is the summary vector based on the augmented graph $\mathcal{G}^{\alpha}$.

Extending the architecture of DGI, MVGRL resorts to multi-view contrastiveness via additional augmentation. Specifically, as shown in Figure 3, it first uses diffusion augmentation to create $\mathcal{G}^{\alpha}$. Then, it corrupts $\mathcal{G}$ and $\mathcal{G}^{\alpha}$ to generate negative samples $\tilde{\mathcal{G}}$ and $\tilde{\mathcal{G}}^{\alpha}$. To build contrastiveness, MVGRL also generates two summary vectors $\vec{s}$ and $\vec{s}^{\alpha}$ by averaging all embeddings in $\mathcal{G}$ and $\mathcal{G}^{\alpha}$, respectively. By the design of MVGRL, model training is driven by mutual information maximisation between an anchor node embedding and its corresponding augmented summary vector. However, MVGRL has the same technical defect in its official JSD-based implementation as DGI, which makes it a group-discrimination-based approach as well.

Similar to Equation 3 of DGI, the proposed loss in MVGRL can also be rewritten as a binary cross entropy loss:

$$\mathcal{L} = -\frac{1}{4N}\sum_{i=1}^{4N}\Big(y_i\log \hat{y}_i + (1-y_i)\log\big(1-\hat{y}_i\big)\Big) \qquad (4)$$

Here, the number of data samples is increased to $4N$ as we include the nodes in $\mathcal{G}^{\alpha}$ and $\tilde{\mathcal{G}}^{\alpha}$ as well. The indicators for $\mathbf{H}$ (embeddings of $\mathcal{G}$) and $\mathbf{H}^{\alpha}$ (embeddings of $\mathcal{G}^{\alpha}$) are 1, while $\tilde{\mathbf{H}}$ and $\tilde{\mathbf{H}}^{\alpha}$ are considered negative samples (i.e., their indicators are 0). To explore why MVGRL achieves better performance than DGI, we replace the original MVGRL loss with Equation 4, yielding a variant we denote MVGRL-BCE, and conduct an ablation study by removing different sets of data samples from Equation 4; the experiment results are reported in Table 6. From the table, the performance of MVGRL-BCE is on par with MVGRL, which reconfirms the effectiveness of using the BCE loss. Also, we can observe that including $\mathbf{H}^{\alpha}$ and $\tilde{\mathbf{H}}^{\alpha}$ is the key to MVGRL surpassing DGI: with $\mathbf{H}^{\alpha}$ and $\tilde{\mathbf{H}}^{\alpha}$, the model performance on Cora is improved from 82.2 to 83.1. We conjecture this is because, with the diffusion augmentation, MVGRL is trained with the additional global information provided by the diffused view $\mathcal{G}^{\alpha}$. However, the diffusion augmentation involves expensive matrix inversion and significantly densifies the given graph, which requires much more memory and time to store and process than the original view. This can hinder the model from extending to large-scale datasets Zheng et al. (2021a).

Method Cora CiteSeer PubMed
MVGRL 82.9 72.6 78.8
MVGRL-BCE 83.1 72.8 79.1
w/o 81.2 52.8 76.6
w/o 82.1 71.8 77.1
w/o 81.1 56.7 74.9
w/o 82.7 72.0 78.6
w/o $\mathbf{H}^{\alpha}$ and $\tilde{\mathbf{H}}^{\alpha}$ 82.2 71.8 77.0
w/o $\mathbf{H}$ and $\tilde{\mathbf{H}}$ 83.1 72.6 78.5
Table 6: The ablation study of MVGRL from the perspective of Group Discrimination.

2.3 Definition of Group Discrimination

As mentioned above, Group Discrimination is a self-supervised GRL paradigm that learns by discriminating different groups of node embeddings. Specifically, the paradigm assigns different indicators to different groups of node embeddings. For example, for binary group discrimination, one group is considered the positive group with class 1 as its indicator, whereas the other group is the negative group, with its indicator assigned as 0. Given a graph $\mathcal{G}$, the positive group usually includes node embeddings generated from the original graph or its augmented views (i.e., similar graph instances of $\mathcal{G}$ created by augmentation). In contrast, the opposing group contains negative samples obtained by corrupting $\mathcal{G}$, e.g., changing its topology structure.

3 Methodology

Figure 4: The architecture of Graph Group Discrimination (GGD). Given a graph $\mathcal{G}$ and a feature matrix $\mathbf{X}$, we can optionally apply augmentation, e.g., edge/node dropout, to generate $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$. Then, we corrupt $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$ to obtain $\tilde{\mathcal{G}}$ and $\tilde{\mathbf{X}}$. Taking $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$ as input to the encoder and projector, the positive group of nodes can be obtained. Similarly, $\tilde{\mathcal{G}}$ and $\tilde{\mathbf{X}}$ are passed through the same encoder and projector to generate the negative group. Model training is driven by discriminating these two groups of summarised node embeddings.

We first define unsupervised node representation learning and then present the architecture of GGD, which extends the GD paradigm with additional augmentation and embedding reinforcement to reach better model performance. Given a graph $\mathcal{G}$ with attributes $\mathbf{X} \in \mathbb{R}^{N \times D}$, where $N$ is the number of nodes in $\mathcal{G}$ and $D$ is the number of dimensions of $\mathbf{X}$, our aim is to train a GNN encoder $f_{\theta}$ without relying on labelling information. Taking $\mathcal{G}$ and $\mathbf{X}$ as input, the trained encoder outputs learned representations $\mathbf{H} \in \mathbb{R}^{N \times D'}$, where $D'$ is the predefined output dimension. $\mathbf{H}$ can then be used in many downstream tasks such as node classification.

3.1 Graph Group Discrimination

Based on the proposed self-supervised GRL paradigm, Group Discrimination, we have designed a novel method, namely GGD, to learn node representations using a Siamese network structure and a binary cross-entropy loss. The architecture of the proposed framework is presented in Figure 4. The framework mainly consists of four components: augmentation, corruption, a Siamese GNN network, and Group Discrimination.

Augmentation. Given a graph $\mathcal{G}$ and feature matrix $\mathbf{X}$, we can optionally augment them with techniques such as edge and feature dropout to create $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$. In practice, we follow the augmentation proposed in GraphCL You et al. (2020). Specifically, edge dropout removes a predefined fraction of edges, while feature dropout masks a predefined proportion of feature dimensions, i.e., assigning 0 to the values in randomly selected dimensions. This step is optional in implementation.

Notably, the motivation for using augmentation in our framework is distinct from that of contrastive learning methods. In our study, augmentation is used to increase the difficulty of the self-supervised training task. With augmentation, $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$ change in every training iteration, which forces the model to lessen its dependence on the fixed pattern (i.e., the unchanged edge and feature distribution) of a monotonous graph. In contrastive learning, by contrast, augmentation creates augmented views sharing similar semantic information for building contrastiveness.
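A minimal sketch of this optional augmentation step is shown below, assuming a COO edge list and a dense feature matrix; the drop rates and the toy graph are placeholders rather than the settings used in our experiments.

```python
import torch

def augment(edge_index: torch.Tensor, x: torch.Tensor,
            drop_edge: float = 0.2, drop_feat: float = 0.2):
    """Optional augmentation: randomly drop edges and mask feature dimensions."""
    # Edge dropout: keep each edge with probability (1 - drop_edge).
    keep = torch.rand(edge_index.size(1)) >= drop_edge
    edge_index_aug = edge_index[:, keep]

    # Feature dropout: zero out a random subset of feature dimensions for all nodes.
    mask = (torch.rand(x.size(1)) >= drop_feat).float()
    x_aug = x * mask.unsqueeze(0)
    return edge_index_aug, x_aug

edge_index = torch.randint(0, 100, (2, 500))   # toy COO edge list
x = torch.rand(100, 32)
edge_index_aug, x_aug = augment(edge_index, x)
```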

Corruption. $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$ are then corrupted to build $\tilde{\mathcal{G}}$ and $\tilde{\mathbf{X}}$ for generating the node embeddings of the negative group. We adopt the same corruption technique used in DGI Veličković et al. (2019) and MVGRL Hassani and Khasahmadi (2020), as shown in Figure 5. The corruption technique destroys the topology structure of $\hat{\mathcal{G}}$ by randomly changing the order of nodes in $\hat{\mathbf{X}}$. The corrupted $\tilde{\mathcal{G}}$ and $\tilde{\mathbf{X}}$ can then be used to produce node representations with incorrect network connections.
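The corruption itself amounts to a row permutation of the feature matrix, as in the minimal sketch below (the adjacency is left untouched):

```python
import torch

def corrupt(x: torch.Tensor) -> torch.Tensor:
    """Corruption by row-wise shuffling: each node keeps the graph's edges
    but receives the features of a random other node, breaking the topology."""
    perm = torch.randperm(x.size(0))
    return x[perm]

x = torch.rand(100, 32)
x_tilde = corrupt(x)        # fed to the same encoder together with the unchanged adjacency
```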

The Siamese GNN. We design a Siamese GNN network to output node representations given a graph and its attributes. The Siamese GNN network consists of two components: a GNN encoder and a projector. The backbone GNN encoder is replaceable with a variety of GNNs, e.g., GCN Kipf and Welling (2017) and GAT Veličković et al. (2018); in our work, we adopt GCN as the backbone. The projector is a flexible network with a user-defined number of linear layers. When generating node embeddings of the positive group, the Siamese network takes $\hat{\mathcal{G}}$ and $\hat{\mathbf{X}}$ as input. Using the same encoder and projector, the Siamese network outputs the negative group with $\tilde{\mathcal{G}}$ and $\tilde{\mathbf{X}}$. These two groups of node embeddings are considered a collection of data samples of size $2N$ for discrimination. Before conducting group discrimination, all data samples are summarised with the same aggregation technique, e.g., sum-, mean-, or linear aggregation.
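A compact sketch of such a Siamese GNN (one dense-adjacency GCN layer plus a linear projector, applied with shared weights to both inputs) could look as follows; the layer sizes and the identity adjacency are illustrative placeholders.

```python
import torch
import torch.nn as nn

class GGDEncoder(nn.Module):
    """One GCN layer (dense adjacency for brevity) followed by a linear projector.
    The same weights process both the augmented and the corrupted input."""
    def __init__(self, in_dim: int, hid_dim: int, n_proj_layers: int = 1):
        super().__init__()
        self.gcn_w = nn.Linear(in_dim, hid_dim, bias=False)
        self.act = nn.PReLU()
        self.proj = nn.Sequential(*[nn.Linear(hid_dim, hid_dim) for _ in range(n_proj_layers)])

    def forward(self, a_norm: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        h = self.act(a_norm @ self.gcn_w(x))    # message passing + feature transformation
        return self.proj(h)

# Shared-weight (Siamese) usage on the positive and corrupted inputs.
encoder = GGDEncoder(in_dim=32, hid_dim=64)
a_norm = torch.eye(100)                         # placeholder normalised adjacency
x_pos, x_neg = torch.rand(100, 32), torch.rand(100, 32)
h_pos, h_neg = encoder(a_norm, x_pos), encoder(a_norm, x_neg)
```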

Group Discrimination. In the group discrimination process, we adopt a very simple binary cross-entropy (BCE) loss to discriminate the two groups of node embeddings, as shown in Equation 3. In our implementation, $y_i$ is 0 and 1 for node embeddings in the negative and positive groups, respectively. During model training, the model is optimised by correctly categorising the node embeddings in the collection of data samples into their corresponding classes. The loss is computed by comparing the summarisation of a node $i$, i.e., a scalar $\hat{y}_i$, with its indicator $y_i$. Owing to the ease of BCE loss computation, the training process of GGD is very fast and memory-efficient.
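Putting the components together, a self-contained toy training step might look as follows (dense adjacency, one GCN-style layer, sum-aggregation); all sizes, the random graph and the optimiser settings are placeholders rather than the configurations reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, D, D_hid = 100, 32, 64
x = torch.rand(N, D)
a_norm = torch.eye(N)                                   # placeholder normalised adjacency

gcn = nn.Linear(D, D_hid, bias=False)                   # one-layer GCN weight
proj = nn.Linear(D_hid, D_hid)                          # projector
opt = torch.optim.Adam(list(gcn.parameters()) + list(proj.parameters()), lr=1e-3)

def forward(feat: torch.Tensor) -> torch.Tensor:
    return proj(torch.relu(a_norm @ gcn(feat)))

for step in range(5):
    x_neg = x[torch.randperm(N)]                        # corruption: shuffle node order
    logits = torch.cat([forward(x).sum(1),              # summarise each node to a scalar
                        forward(x_neg).sum(1)])
    labels = torch.cat([torch.ones(N), torch.zeros(N)]) # positive group = 1, negative = 0
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, float(loss))
```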

3.2 Model Inference

During training, the model is optimised via loss minimisation with Equation 3. The time complexity analysis of GGD is provided in Appendix A.2. With the trained GNN encoder $f_{\theta}$, we can obtain the node embeddings $\mathbf{H}$ from the input $\mathcal{G}$ and $\mathbf{X}$.

Inspired by MVGRL Hassani and Khasahmadi (2020), which strengthens the output embeddings by including additional global information, we adopt a conceptually similar embedding reinforcement approach. Specifically, MVGRL obtains the final embeddings by summing up the embeddings from two views: the original view comprising local information and the diffused view with global information. This operation reinforces the final embeddings and improves model performance. Nonetheless, graph diffusion impairs the scalability of a model Zheng et al. (2021a) and hence cannot be directly applied to our embedding generation. To avoid the diffusion computation, we come up with a workaround that extracts global information by virtue of the graph power. The graph power extends the message-passing scope of $\mathbf{H}$ to the $n$-hop neighbourhood, which encodes global information from distant neighbours. It can be formulated as follows:

$$\mathbf{H}^{g} = \mathbf{A}^{n}\,\mathbf{H} \qquad (5)$$

where $\mathbf{H}^{g}$ is the global embedding, and $\mathbf{A}$ is the adjacency matrix of the graph $\mathcal{G}$. Notably, this operation can be easily decomposed with the associative property of matrix multiplication and is cheap to compute. To show the easiness of this computation, we conduct an experiment reporting its time consumption on various datasets in Appendix A.3. Finally, the final embedding can be obtained as $\mathbf{H}^{final} = \mathbf{H} + \mathbf{H}^{g}$, which can be used for downstream tasks.
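A sketch of Equation 5 that exploits this associativity is shown below: rather than forming $\mathbf{A}^{n}$, the sparse adjacency is applied to $\mathbf{H}$ $n$ times. The sparse tensor and $n=10$ mirror the setting above, while the toy sizes are illustrative.

```python
import torch

def graph_power(adj: torch.Tensor, h: torch.Tensor, n: int = 10) -> torch.Tensor:
    """Compute H_g = A^n H by n successive sparse mat-mults, never materialising A^n."""
    h_g = h
    for _ in range(n):
        h_g = torch.sparse.mm(adj, h_g)
    return h_g

# Toy sparse adjacency and embeddings.
idx = torch.randint(0, 1000, (2, 5000))
adj = torch.sparse_coo_tensor(idx, torch.ones(idx.size(1)), (1000, 1000)).coalesce()
h = torch.rand(1000, 256)
h_final = h + graph_power(adj, h)       # final embedding: local + global information
```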

4 Exploring Group Discrimination

Figure 5: Corruption technique in DGI and MVGRL.

4.1 Corruption

Firstly, we explore the corruption technique used in DGI Veličković et al. (2019) and MVGRL Hassani and Khasahmadi (2020), which is shown in Figure 5. These two studies corrupt the topology of a given graph by shuffling the rows of the feature matrix $\mathbf{X}$. By changing the node order of $\mathbf{X}$, the neighbouring structure of $\mathcal{G}$ is completely changed, e.g., the neighbours of one node become the neighbours of another node.

With the corruption technique, negative samples in the negative group are generated with incorrect edges. Thus, by discriminating the positive group (i.e., nodes generated with ground truth edges) and the negative group, we conjecture the model can distill valuable signals from learning how to identify nodes generated with correct topology and output effective node embeddings.

4.1.1 Benefits of Group Discrimination

Compared with contrastive learning, Group Discrimination enjoys advantages in computation efficiency, memory efficiency and convergence speed. To compare Group Discrimination with GCL, we first analyse the complexity of two commonly adopted contrastive loss functions, InfoNCE and the JSD estimator. For an anchor node $i$, these two contrastive losses can be formulated as follows:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log\frac{\exp\big(\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_i^{+})/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(\mathbf{h}_i, \tilde{\mathbf{h}}_j)/\tau\big)} \qquad (6)$$
$$\mathcal{L}_{\mathrm{JSD}} = -\log\,\mathcal{D}(\mathbf{h}_i, \mathbf{h}_i^{+}) - \log\big(1-\mathcal{D}(\mathbf{h}_i, \tilde{\mathbf{h}}_i)\big) \qquad (7)$$

where $\mathbf{h}_i$ represents the embedding of node $i$, $\tau$ is the temperature coefficient, $\mathcal{D}$ is the discriminator, $\tilde{\mathbf{h}}_j$ is a corrupted negative counterpart, and $\mathbf{h}_i^{+}$ is the positive counterpart sharing similar semantic information with node $i$. In InfoNCE, the loss computation conducts similarity computation once for the positive counterpart (i.e., the numerator of Equation 6) and $N$ times for the negative counterparts (i.e., the denominator). As the time complexity of node similarity computation is $O(D)$ (e.g., vector multiplication between $\mathbf{h}_i$ and $\tilde{\mathbf{h}}_j$), the overall complexity of InfoNCE is $O(ND)$. Though the JSD estimator does not require $N$ negative samples for each node, it still needs $O(D)$ for the loss computation of an anchor node because the discriminator requires vector multiplication between the embeddings of nodes $i$ and $j$.
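To make the cost visible, here is a plain PyTorch sketch of the two losses for a single anchor node: the InfoNCE denominator touches all $N$ negatives ($O(ND)$ overall), while each JSD-style discriminator call is one vector product ($O(D)$). The bilinear discriminator and the random embeddings are illustrative and not the exact estimators of any specific GCL method.

```python
import torch
import torch.nn.functional as F

N, D = 1000, 256
H = F.normalize(torch.randn(N, D), dim=1)          # embeddings of the original graph
H_pos = F.normalize(torch.randn(N, D), dim=1)      # positive counterparts (e.g. another view)
H_neg = F.normalize(torch.randn(N, D), dim=1)      # embeddings of the corrupted graph
W = torch.randn(D, D)                              # bilinear discriminator weight (illustrative)
tau, i = 0.5, 0                                    # temperature and the anchor node index

# InfoNCE for anchor i: one positive similarity, N negative similarities -> O(ND).
pos = torch.exp(H[i] @ H_pos[i] / tau)
neg = torch.exp(H[i] @ H_neg.t() / tau).sum()
loss_infonce = -torch.log(pos / neg)

# JSD-style estimator for anchor i: two discriminator evaluations, each O(D).
def disc(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(a @ W @ b)                # sigma(a^T W b)

loss_jsd = -torch.log(disc(H[i], H_pos[i])) - torch.log(1 - disc(H[i], H_neg[i]))
```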

To alleviate the computation burden of the InfoNCE loss, BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021) attempt to get rid of the reliance on negative samples with customised losses. BGRL Thakoor et al. (2021) utilises a simple loss that maximises the cosine similarity between the online node embedding $\mathbf{h}_i^{(1)}$ and the target node embedding $\mathbf{h}_i^{(2)}$ for node $i$:

$$\mathcal{L}_{\mathrm{BGRL}} = -\frac{\mathbf{h}_i^{(1)}\cdot\mathbf{h}_i^{(2)}}{\big\|\mathbf{h}_i^{(1)}\big\|\,\big\|\mathbf{h}_i^{(2)}\big\|} \qquad (8)$$

where $\mathbf{h}_i^{(1)}$ and $\mathbf{h}_i^{(2)}$ are generated from two augmented views $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$. Different from BGRL Thakoor et al. (2021), GBT Zbontar et al. (2021) adopts a correlation-based loss as follows:

$$\mathcal{L}_{\mathrm{GBT}} = \sum_{a}\big(1-\mathcal{C}_{aa}\big)^{2} + \lambda\sum_{a}\sum_{b\neq a}\mathcal{C}_{ab}^{2}, \quad \text{where } \mathcal{C}_{ab} = \frac{\sum_{i}\mathbf{h}_{i,a}^{(1)}\,\mathbf{h}_{i,b}^{(2)}}{\sqrt{\sum_{i}\big(\mathbf{h}_{i,a}^{(1)}\big)^{2}}\sqrt{\sum_{i}\big(\mathbf{h}_{i,b}^{(2)}\big)^{2}}} \qquad (9)$$

where $a$ and $b$ are two indices over the embedding dimension, $\mathcal{C}$ is the cross-correlation matrix computed between the node embeddings of the two views, and $\lambda$ is a trade-off coefficient.

From the losses shown above, we can see that these two losses still involve vector multiplication between two node embeddings. Thus, even though these two methods reduce the time complexity of InfoNCE (i.e., $O(ND)$) to $O(D)$, they are still only on par with the JSD estimator. Different from the aforementioned contrastive losses, to actuate the Group Discrimination learning paradigm, our method uses an effortless binary cross-entropy loss, as shown in Equation 3. In Group Discrimination, the loss computation of a node requires only $O(1)$, as it conducts multiplication between scalars (i.e., $\hat{y}_i$ and $y_i$) in lieu of vectors. Thus, Group Discrimination has great advantages in computation and memory efficiency. Another merit of the proposed paradigm is fast convergence, as shown in Section 5.2. We conjecture this is attributable to Group Discrimination concentrating on the general edge distribution of graphs instead of node-specific information, as GCL methods do; it is therefore not distracted by overly detailed information.

5 Experiments

We evaluate the effectiveness of our model using eight benchmark datasets of different sizes. These datasets include five small- and medium-scale datasets, Cora, CiteSeer, PubMed Sen et al. (2008), Amazon Computers, and Amazon Photos Shchur et al. (2018), as well as the large-scale datasets ogbn-arxiv, ogbn-products and ogbn-papers100M. Notably, ogbn-papers100M is the largest dataset provided by the Open Graph Benchmark Hu et al. (2020) for node property prediction tasks; it has over 110 million nodes and 1 billion edges. The statistics of these datasets are summarised in Appendix A.4. To ensure reproducibility, the detailed experiment settings and computing infrastructure are summarised in Appendix A.5.

Data Method Cora CiteSeer PubMed Comp Photo
X, A, Y GCN 81.5 70.3 79.0 76.3 87.3
X, A, Y GAT 83.0 72.5 79.0 79.3 86.2
X, A, Y SGC 81.0 71.9 78.9 74.4 86.4
X, A, Y CG3 83.4 73.6 80.2 79.9 89.4
X, A DGI 81.7 71.5 77.3 75.9 83.1
X, A GMI 82.7 73.0 80.1 76.8 85.1
X, A MVGRL 82.9 72.6 79.4 79.0 87.3
X, A GRACE 80.0 71.7 79.5 71.8 81.8
X, A BGRL 80.5 71.0 79.5 89.2 91.2
X, A GBT 81.0 70.8 79.0 88.5 91.1
X, A GGD 83.9 73.0 81.3 90.1 92.5
Table 7: Model performance of node classification on 5 datasets. X, A and Y represent feature, adjacency matrix, and labels. Best performance for each dataset is in bold. Comp and Photo refer to Amazon Computers and Amazon Photos.
Method Cora CiteSeer PubMed Comp Photo
DGI 0.085 0.134 0.158 0.171 0.059
GMI 0.394 0.497 2.285 1.297 0.637
MVGRL 0.123 0.171 0.488 0.663 0.468
GRACE 0.056 0.092 0.893 0.546 0.203
BGRL 0.085 0.094 0.147 0.337 0.273
GBT 0.073 0.072 0.103 0.492 0.173
GGD 0.010 0.021 0.015 0.016 0.009
Improve 7.3-39.4 3.4-23.7 6.9-152.3 10.7-15.3 19.2-70.8
Table 8: Comparison of training time per epoch in seconds between six GCL-based methods and GGD on five datasets. ‘Improve’ means how many times GGD is faster than the baselines; ‘-’ indicates the improvement range.

5.1 Evaluating on Small- and Medium-scale Datasets

We compare GGD with ten baselines, including four supervised GNNs (i.e., GCN Kipf and Welling (2017), GAT Veličković et al. (2018), SGC Wu et al. (2019), and CG3 Wan et al. (2020)) and six GCL methods (i.e., DGI Veličković et al. (2019), GMI Peng et al. (2020), MVGRL Hassani and Khasahmadi (2020), GRACE Zhu et al. (2020), BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021)), on five small- and medium-scale benchmark datasets. In the experiments, we follow the same data splits as Yang et al. (2016) for Cora, CiteSeer and PubMed. For Amazon Computers and Photos, we adopt the semi-supervised experiment setting, which includes 30 randomly selected nodes per class in the training and validation sets; the remaining nodes are used as the test set. Model performance is measured by classification accuracy averaged over five runs along with standard deviations, and reported in Table 7.

Method Cora CiteSeer PubMed Comp Photo
DGI 4,189 8,199 11,471 7,991 4,946
GMI 4,527 5,467 14,697 10,655 5,219
MVGRL 5,381 5,429 6,619 6,645 6,645
GRACE 1,913 2,043 12,597 8,129 4,881
BGRL 1,627 1,749 2,299 5,069 3,303
GBT 1,651 1,799 2,461 5,037 2,641
GGD 1,475 1,587 1,629 1,787 1,637
Improve 10.7-72.6% 11.8-80.6% 27.2-85.8% 64.5-83.2% 38.0-75.4%
Table 9: Comparison of memory consumption in MBs of six GCL baselines and GGD on five datasets.

Accuracy. From Table 7, we can observe that GGD generally outperforms all baselines on all datasets. The only exception is the CiteSeer dataset, where the semi-supervised method CG3 Wan et al. (2020) slightly outperforms GGD, which still provides the second-best performance. In this experiment, we reproduce BGRL Thakoor et al. (2021), GBT Zbontar et al. (2021) and GGD, while the other results are sourced from previous studies Wan et al. (2020); Jin et al. (2021).

Efficiency and Memory Consumption. GGD is substantially more efficient than the other self-supervised baselines in time and memory consumption, as shown in Table 8 and Table 9. Remarkably, compared with the most efficient baseline (i.e., GBT Zbontar et al. (2021)), GGD is 19.2 times faster in training time per epoch on Amazon Photos and consumes 64.5% less memory on Amazon Computers. The dramatic boost in time and memory efficiency of GGD is attributable to the exclusion of similarity computation from self-supervised signal extraction, which enables model training without multiplication between node embeddings.

5.2 Evaluating on Large-scale datasets

To evaluate the scalability of GGD, we choose three large-scale datasets from Open Graph Benchmark Hu et al. (2020), which are ogbn-arxiv, ogbn-products, and ogbn-papers100M. ogbn-papers100M is the most challenging large-scale graph available in Open Graph Benchmark for node property prediction with over 1 billion edges and 110 million nodes. Extending to extremely large graphs (i.e., ogbn-products and ogbn-papers100M), we adopt a Neighbourhood Sampling strategy, which is described in Appendix A.5.

ogbn-arxiv & ogbn-products. For ogbn-arxiv, we compare GGD against four self-supervised baselines (i.e., DGI Veličković et al. (2019), GRACE Zhu et al. (2020), BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021)), whereas BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021) are selected for comparison on ogbn-products.

Method Valid Test Memory Time Total
Supervised GCN 73.0 71.7 - - -
MLP 57.7 55.5 - - -
Node2vec 71.3 70.1 - - -
DGI 71.3 70.3 - - -
GRACE(10k epos) 72.6 71.5 - - -
BGRL(10k epos) 72.5 71.6 OOM (Full-graph) / /
GBT(300 epos) 71.0 70.1 14,959MB 6.47 1,941.00
GGD(1 epo) 72.7 71.6 4,513MB|69.8% 0.18 0.18|10,783
Table 10: Node classification results and efficiency comparison on ogbn-arxiv. ‘epo’ means epoch. ‘Time’ is the training time per epoch (in seconds). ‘Total’ is the total training time (number of epochs × ‘Time’). OOM indicates out-of-memory on an Nvidia A40 (48GB).

In addition, we include the performance of MLP, Node2vec Grover and Leskovec (2016), and supervised GCN Kipf and Welling (2017) sourced from Hu et al. (2020) in Table 10 and Table 11. For the memory and training time comparison, we only compare GGD with the two most efficient baselines (i.e., BGRL and GBT, according to Tables 8 and 9). On ogbn-arxiv, we reproduce BGRL Thakoor et al. (2021) and find that it fails to process ogbn-arxiv in full batch. Thus, on this dataset we only compare GGD and GBT, both of which can successfully train in full-graph processing mode.

Method Valid Test Memory Time Total
Supervised GCN 92.0 75.6 - -
MLP 75.5 61.1 - -
Node2vec 70.0 68.8 - -
BGRL (100 epos) 78.1 64.0 29,303MB 53m16s 5,326m40s
GBT (100 epos) 85.0 70.5 20,419MB 48m38s 4,863m20s
GGD(1 epo) 90.9 75.7 4,391MB|78.5% 12m46s 12m46s|381
Table 11: Node classification result and efficiency comparison on ogbn-products.

From Table 10 and Table 11, we can see that GGD remarkably achieves state-of-the-art performance with only one training epoch. As a result, GGD is 10,783 times faster than the most efficient baseline, i.e., GBT Zbontar et al. (2021), in total training time to reach the desired performance on ogbn-arxiv. Note that the number of epochs used in our experiment is consistent with the optimal choice of this hyperparameter specified in GBT Zbontar et al. (2021). On ogbn-products, we are 381 times faster than GBT Zbontar et al. (2021) in total training time. Notably, with only one training epoch our performance is significantly higher than that of the GCL baselines using 100 epochs (i.e., 6% and 5.2% improvement over GBT Zbontar et al. (2021) on the validation and test sets). In addition, we compare the convergence speed among GGD, BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021) on ogbn-arxiv and ogbn-products, as shown in Figure 6. For ogbn-arxiv, BGRL Thakoor et al. (2021) is run with batched processing and neighbour sampling. This figure shows the preeminence of GGD in convergence speed, as GGD can be well-trained with only one epoch (i.e., reaching peak model performance in the first epoch and staying stable with increased epochs). In contrast, the other two baselines require comparatively many more epochs to gradually improve their performance. Compared with GCL baselines, GGD achieves much faster convergence via Group Discrimination. We conjecture this is because the GD-based method focuses on the general edge distribution of graphs instead of node-specific information; conversely, GCL methods can suffer from convergence inefficiency as they may easily be distracted by overly detailed node-specific information during training.

Figure 6: Convergence speed comparison among GGD, BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021). The X-axis is the number of epochs, while the Y-axis is the accuracy on the test set.
Method Validation Test Memory Time
Supervised SGC 63.3 66.5 - -
MLP 47.2 49.6 - -
Node2vec 55.6 58.1 - -
BGRL (1 epoch) 59.3 62.1 14,057MB 26h28m
GBT (1 epoch) 58.9 61.5 13,185MB 24h38m
GGD(1 epoch) 60.2 63.5 4,105MB|68.9% 9h15m|2.7
Table 12: Node classification result and efficiency comparison on ogbn-papers100M.

ogbn-papers100M. We further compare GGD with BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021) on ogbn-papers100M, the largest OGB dataset with billion-scale edges. Other self-supervised learning algorithms, such as DGI Veličković et al. (2019) and GMI Peng et al. (2020), fail to scale to such a large graph with a reasonable batch size (i.e., 256). We only report the performance of each algorithm after a single epoch of training in Table 12 due to the extreme scale of the dataset and the limitation of our available resources. From the table, we can observe that GGD outperforms its two GCL counterparts, BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021), in both accuracy and efficiency. Specifically, GGD achieves 60.2 on the validation set, while BGRL and GBT reach 59.3 and 58.9, respectively. With only one epoch, these two algorithms may not be well trained; however, training each epoch of these two methods requires over one day, and training them for 100 epochs would require 100+ GPU days, which is prohibitively impractical for general practitioners. In contrast, GGD can be trained in about 9 hours to achieve a good result on this dataset, which is more appealing in practice.

6 Related Work

Graph Neural Networks (GNNs) are generalised deep neural networks for graph-structured data. GNNs mainly fall into two categories: spectral-based GNNs and spatial-based GNNs. Spectral GNNs use eigen-decomposition to obtain spectral-based representations of graphs, whereas spatial GNNs focus on using the spatial neighbours of nodes for message passing. Extending spectral-based methods to the spatial domain, GCN Kipf and Welling (2017) utilises first-order Chebyshev polynomial filters to approximate spectral-based graph convolution. Taking the weights of spatial neighbours into consideration, GAT Veličković et al. (2018) improves GCN by introducing an attention module in message passing. To decouple message passing from neural networks, SGC Wu et al. (2019) simplifies GCN by removing the non-linearity and weight matrices in graph convolution layers. However, these studies cannot handle datasets with limited or no labels. Graph contrastive learning has recently been exploited to address this issue.

Graph Contrastive Learning (GCL) aims to alleviate the reliance on labelling information in model training based on the concept of mutual information (MI). Specifically, GCL approaches maximise the MI between instances with similar semantic information and minimise the MI between dissimilar instances. For example, DGI Veličković et al. (2019) builds contrastiveness between node embeddings and a summary vector (i.e., a graph-level embedding obtained by averaging all node embeddings) with a JSD estimator. To improve DGI, MVGRL Hassani and Khasahmadi (2020) and GMI Peng et al. (2020) extend its idea by introducing multi-view contrastiveness with diffusion augmentation, and by focusing on a local scope with the first-order neighbourhood, respectively. Adopting the InfoNCE loss, GRACE Zhu et al. (2020) applies augmentation techniques to create two augmented views and injects contrastiveness between them. Though these GCL methods have successfully outperformed some supervised baselines on benchmark datasets, they suffer from significant limitations, including time-consuming loss computation, a large number of training epochs, and poor scalability. For example, InfoNCE and the JSD estimator require $O(ND)$ and $O(D)$, respectively, for the loss computation of a node. In contrast, our method GGD reduces the time complexity of loss computation for a node to only $O(1)$ and converges fast.

Scalable GNNs. Efficiency is a bottleneck for most existing GNNs in handling large graphs. To address this challenge, there are mainly three categories of approaches: layer-wise sampling (e.g., GraphSage Hamilton et al. (2017)), graph sampling methods such as Cluster-GCN Chiang et al. (2019) and GraphSAINT Klicpera et al. (2019), and linear models, e.g., SGC Wu et al. (2019) and PPRGo Bojchevski et al. (2020). GraphSage Hamilton et al. (2017) introduces a neighbour-sampling approach, which creates fixed-size subgraphs for each node. Underpinned by graph sampling, Cluster-GCN Chiang et al. (2019) decomposes a large-scale graph into multiple subgraphs based on clustering, while GraphSAINT Klicpera et al. (2019) utilises light-weight graph samplers along with a normalisation technique to eliminate biases in mini-batches. The linear models SGC Wu et al. (2019) and PPRGo Bojchevski et al. (2020) decouple graph convolution from embedding transformation (i.e., matrix multiplication with weight matrices), and leverage Personalised PageRank to encode the multi-hop neighbourhood, respectively. However, all these methods only focus on supervised learning on graphs; for unsupervised/self-supervised settings where no labelled supervision signal is available, these frameworks are not applicable. The closest works to ours in handling large-scale graph datasets under self-supervised settings are BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021). They try to reduce the time complexity of contrastive losses by removing negative samples. However, they still require $O(D)$ to calculate the loss for a node, in comparison with $O(1)$ for our method GGD.

7 Future Work

In this paper, we have introduced a new self-supervised GRL paradigm, Group Discrimination, which achieves the same level of performance as GCL methods with much less resource consumption (i.e., training time and memory). A limitation of this work is that several questions about GD remain unexplored. For example, can we extend the current binary Group Discrimination scheme (i.e., classifying two groups of summarised node embeddings) to discrimination among multiple groups? Are there other corruption techniques that create a more difficult negative group for discrimination? More importantly, given its extremely efficient nature, GD has the potential to be deployed in various real-world applications, e.g., recommendation systems, which have limited labelling information and demand fast computation with limited resources.

References

  • [1] P. Bielak, T. Kajdanowicz, and N. V. Chawla (2021) Graph barlow twins: a self-supervised representation learning framework for graphs. arXiv preprint arXiv:2106.02466. Cited by: §1, §1.
  • [2] A. Bojchevski, J. Klicpera, B. Perozzi, A. Kapoor, M. Blais, B. Rózemberczki, M. Lukasik, and S. Günnemann (2020) Scaling graph neural networks with approximate pagerank. In KDD, pp. 2464–2473. Cited by: §6.
  • [3] W. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C. Hsieh (2019) Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks. In KDD, pp. 257–266. Cited by: §6.
  • [4] J. Di, L. Wang, Y. Zheng, X. Li, F. Jiang, W. Lin, and S. Pan (2022) CGMN: a contrastive graph matching network for self-supervised graph similarity learning. In IJCAI, Cited by: §1.
  • [5] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin (2019) Graph neural networks for social recommendation. In WWW, pp. 417–426. Cited by: §1.
  • [6] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. NIPS. Cited by: §1.
  • [7] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD, pp. 855–864. Cited by: §5.2.
  • [8] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, pp. 1025–1035. Cited by: §A.5, §6.
  • [9] K. Hassani and A. H. Khasahmadi (2020) Contrastive multi-view representation learning on graphs. In ICML, pp. 4116–4126. Cited by: §1, §1, §2, §3.1, §3.2, §4.1, §5.1, §6.
  • [10] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020)

    Open graph benchmark: datasets for machine learning on graphs

    .
    NIPS. Cited by: §A.4, §5.2, §5.2, §5.
  • [11] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu (2020) Sub-graph contrast for scalable self-supervised graph representation learning. In ICDM, pp. 222–231. Cited by: §1.
  • [12] M. Jin, Y. Zheng, Y. Li, C. Gong, C. Zhou, and S. Pan (2021) Multi-scale contrastive siamese networks for self-supervised graph representation learning. IJCAI. Cited by: §1, §5.1.
  • [13] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: §A.2, §3.1, §5.1, §5.2, §6.
  • [14] J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. ICLR. Cited by: §6.
  • [15] Y. Liu, Z. Li, S. Pan, C. Gong, C. Zhou, and G. Karypis (2021) Anomaly detection on attributed networks via contrastive self-supervised learning. TNNLS. Cited by: §1.
  • [16] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang (2020) Graph representation learning via graphical mutual information maximization. In WWW, pp. 259–270. Cited by: §1, §5.1, §5.2, §6.
  • [17] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §5.
  • [18] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §5.
  • [19] S. Thakoor, C. Tallec, M. G. Azar, R. Munos, P. Veličković, and M. Valko (2021) Bootstrapped representation learning on graphs. ICLR2021. Cited by: §1, §4.1.1, §4.1.1, Figure 6, §5.1, §5.1, §5.2, §5.2, §5.2, §5.2, §6.
  • [20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. ICLR. Cited by: §3.1, §5.1, §6.
  • [21] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. ICLR. Cited by: §A.1, §1, §2.1, §2.1, §2, §3.1, §4.1, §5.1, §5.2, §5.2, §6.
  • [22] S. Wan, S. Pan, J. Yang, and C. Gong (2020)

    Contrastive and generative graph convolutional networks for graph-based semi-supervised learning

    .
    AAAI. Cited by: §5.1, §5.1.
  • [23] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In ICML, pp. 6861–6871. Cited by: §5.1, §6, §6.
  • [24] Z. Yang, W. Cohen, and R. Salakhudinov (2016) Revisiting semi-supervised learning with graph embeddings. In ICML, pp. 40–48. Cited by: §5.1.
  • [25] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2020) Graph contrastive learning with augmentations. NIPS 33, pp. 5812–5823. Cited by: §3.1.
  • [26] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. In ICML, pp. 12310–12320. Cited by: Table 1, §1, §4.1.1, §4.1.1, Figure 6, §5.1, §5.1, §5.1, §5.2, §5.2, §5.2, §6.
  • [27] H. Zhang, B. Wu, X. Yuan, S. Pan, H. Tong, and J. Pei (2022) Trustworthy graph neural networks: aspects, methods and trends. arXiv preprint arXiv:2205.07424. Cited by: §1.
  • [28] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. NIPS 31. Cited by: §1.
  • [29] X. Zheng, Y. Liu, S. Pan, M. Zhang, D. Jin, and P. S. Yu (2022) Graph neural networks for graphs with heterophily: a survey. arXiv preprint arXiv:2202.07082. Cited by: §1.
  • [30] Y. Zheng, M. Jin, S. Pan, Y. Li, H. Peng, M. Li, and Z. Li (2021) Towards graph self-supervised learning with contrastive adjusted zooming. arXiv preprint arXiv:2111.10698. Cited by: §2.2, §3.2.
  • [31] Y. Zheng, V. Lee, Z. Wu, and S. Pan (2021) Heterogeneous graph attention network for small and medium-sized enterprises bankruptcy prediction. In PAKDD, pp. 140–151. Cited by: §1.
  • [32] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang (2020) Deep graph contrastive representation learning. ICML Workshop on Graph Representation Learning and Beyond. Cited by: §1, §5.1, §5.2, §6.

Appendix A

A.1 Proof of Theorem 1

Proof. To prove Theorem 1, given a graph $\mathcal{G}$ with feature matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$ and a GNN encoder $g$, for simplicity we consider $g$ as a one-layer GCN and apply the following normalisation to $\mathbf{X}$ so that its values lie within the range $[0, 1]$:

$$\mathbf{Z}_{ij} = \frac{\mathbf{X}_{ij} - \min(\mathbf{X}_{:,j})}{\max(\mathbf{X}_{:,j}) - \min(\mathbf{X}_{:,j})} \qquad (10)$$

where $\mathbf{Z}$ is the normalised $\mathbf{X}$, $i$ denotes the row index of the matrix, $j$ indicates the column index of the matrix, $\min(\cdot)$ is the function returning the minimum value, and $\max(\cdot)$ is the function returning the maximum value.

Then, we input $\mathbf{Z}$ to $g$, whose weight matrix $\mathbf{W}$ is initialised with Xavier initialisation:

$$\mathbf{H} = \sigma\big(\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}\,\mathbf{Z}\,\mathbf{W}\big) \qquad (11)$$

where $\mathbf{H}$ is the output embedding, $\sigma$ is a non-linear activation function, $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, $\hat{\mathbf{D}}$ is the degree matrix of $\hat{\mathbf{A}}$, and $\mathbf{W}$ is the learnable weight matrix. As multiplying by the normalised adjacency matrix does not change the output data range, the output of $\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}\mathbf{Z}$ is still within the range $[0, 1]$. Then, as the elements of the Xavier-initialised $\mathbf{W}$ lie between $-\sqrt{\tfrac{6}{D+D'}}$ and $\sqrt{\tfrac{6}{D+D'}}$, we can derive that the data range of the matrix multiplication of $\mathbf{Z}$ and $\mathbf{W}$ stays in the same range as $\mathbf{W}$, i.e., $[-a, a]$ with $a = \sqrt{\tfrac{6}{D+D'}}$.

After that, we apply $\sigma$ to the multiplication output. Here, we analyse four commonly adopted non-linear activation functions: Sigmoid, ReLU, LeakyReLU and PReLU.

We first consider the case where $\sigma$ is the Sigmoid function. As the Sigmoid function is monotonically increasing, we can derive that the data range of $\mathbf{H}$ is $[\mathrm{Sigmoid}(-a), \mathrm{Sigmoid}(a)]$. Then, we apply the Sigmoid function again, as DGI Veličković et al. (2019) does, to $\mathbf{H}$ and obtain:

$$\mathrm{Sigmoid}(\mathbf{H}) = \frac{1}{1 + e^{-\mathbf{H}}} \qquad (12)$$

Similarly, as the Sigmoid function is monotonically increasing, we can easily derive that the data range of $\mathrm{Sigmoid}(\mathbf{H})$ is in $[\mathrm{Sigmoid}(\mathrm{Sigmoid}(-a)), \mathrm{Sigmoid}(\mathrm{Sigmoid}(a))]$. Then, when $D \to \infty$, for the lower bound $\mathrm{Sigmoid}(\mathrm{Sigmoid}(-a))$, we can obtain:

$$\lim_{D \to \infty}\mathrm{Sigmoid}\Big(\mathrm{Sigmoid}\big(-\sqrt{\tfrac{6}{D+D'}}\big)\Big) = \mathrm{Sigmoid}(0.5) \qquad (13)$$

Also, for the upper bound $\mathrm{Sigmoid}(\mathrm{Sigmoid}(a))$, when $D \to \infty$, we can obtain:

$$\lim_{D \to \infty}\mathrm{Sigmoid}\Big(\mathrm{Sigmoid}\big(\sqrt{\tfrac{6}{D+D'}}\big)\Big) = \mathrm{Sigmoid}(0.5) \qquad (14)$$

Finally, we can easily observe that the lower and upper bounds converge to the same value, so the data range converges to 0, which proves Theorem 1 when $\sigma$ is the Sigmoid function.

For the other three non-linear activation functions, ReLU, Leaky ReLU and PReLU (with learnable parameter $\beta$), we can similarly derive that the data ranges of $\mathbf{H}$ are $[0, a]$, $[-0.01a, a]$ and $[-\beta a, a]$, respectively. Here, we can see these functions share the same upper bound $a$; the only difference in their lower bounds is the coefficient of $-a$, i.e., 0 for ReLU, 0.01 for Leaky ReLU and $\beta$ for PReLU.

By inputting $\mathbf{H}$ with these activation functions into the Sigmoid function, we can derive that the output data range of the function is $[\mathrm{Sigmoid}(-ca), \mathrm{Sigmoid}(a)]$, where $c$ is the corresponding coefficient. For the lower bound, we can obtain:

$$\lim_{D \to \infty}\mathrm{Sigmoid}\big(-c\sqrt{\tfrac{6}{D+D'}}\big) = \mathrm{Sigmoid}(0) = 0.5 \qquad (15)$$

For the upper bound, we can obtain:

$$\lim_{D \to \infty}\mathrm{Sigmoid}\big(\sqrt{\tfrac{6}{D+D'}}\big) = \mathrm{Sigmoid}(0) = 0.5 \qquad (16)$$

Finally, we can easily observe that the lower and upper bounds converge to the same value, so the data range converges to 0, which proves Theorem 1 when $\sigma$ is ReLU, Leaky ReLU or PReLU.
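As a quick numerical illustration, the following script evaluates the bounds in Equations (13)–(16) for growing $D$ (assuming the Xavier-uniform bound $a = \sqrt{6/(D+D')}$), showing both cases collapsing towards the constants 0.62 and 0.50 observed in Table 2.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

D_out = 512
for D in [128, 1433, 100_000]:
    a = math.sqrt(6.0 / (D + D_out))          # Xavier-uniform bound on the entries of W
    # Sigmoid activation: bounds from Eq. (13)/(14)
    lo_sig, hi_sig = sigmoid(sigmoid(-a)), sigmoid(sigmoid(a))
    # ReLU-style activations: bounds from Eq. (15)/(16) (lower-bound coefficient 0 for ReLU)
    lo_relu, hi_relu = sigmoid(0.0), sigmoid(a)
    print(f"D={D:6d}  Sigmoid range={hi_sig - lo_sig:.2e} (≈{hi_sig:.2f})  "
          f"ReLU range={hi_relu - lo_relu:.2e} (≈{hi_relu:.2f})")
```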

A.2 Complexity Analysis

The time complexity of our method consists of two components: the Siamese GNN and the loss computation. Existing self-supervised baselines share a similar time complexity for the first component. In GGD, given a graph with $N$ nodes and $E$ edges in sparse format, and taking a GCN Kipf and Welling (2017) encoder as an example, the time complexity of the encoder is $O(EDD')$, where $D$ and $D'$ are the input and hidden dimensions. As we need to process both the augmented graph $\hat{\mathcal{G}}$ and the corrupted graph $\tilde{\mathcal{G}}$, GGD requires the encoder computation twice. Then, the projector network (i.e., an MLP) with linear layers is applied to the encoder output, which takes $O(ND'^2)$ per layer. Here, $D'$ is a parameter defining the hidden size. Before group discrimination, we aggregate the generated embeddings with simple summation, consuming $O(ND')$.

For the loss computation, we use the BCE loss, i.e., Equation 3, to categorise the summarised node embeddings, i.e., scalars. The time complexity of this final step is $O(N)$ (i.e., processing all data samples from the positive and negative groups). Ignoring the computation cost of the augmentation, the overall time complexity of GGD for processing a graph is $O(EDD' + ND'^2 + ND' + N)$, which is mainly contributed by the Siamese GNN. More importantly, we compress the self-supervised loss computation to $O(N)$ for a whole graph, i.e., $O(1)$ per node.

A.3 Graph Power

To show the easiness of graph power computation, we conduct an experiment to evaluate its time consumption on eight datasets, whose statistics are shown in Appendix A.4. Specifically, we set the hidden size of $\mathbf{H}$ to 256, and $n$ is fixed to 10 for all datasets. The experiment results are shown below:

Cora Cite PubMed Comp Photo Arxiv Products Papers
5.4e-3 7.3e-3 9.8e-3 1.2e-2 8.5e-3 2.2e-2 24.5 208.8
Table 13: Graph power computation time in seconds on eight benchmark datasets. The experiment is conducted using CPU: Intel Xeon Gold 5320. ‘Cite’ , ‘Comp’, ‘Photo’, ‘Arxiv’, ‘Products’, ‘Papers’ means Citeseer, Amazon Computer, Amazon Photo, ogbn-arxiv, ogbn-products and ogbn-papers100M.

This table shows that the computation of the graph power is very cheap on small and medium-size graphs; e.g., ogbn-arxiv, which has over a million edges, consumes only 0.022 seconds. Extending to an extremely large graph, ogbn-papers100M, which has over 1 billion edges and 111 million nodes, the computation only requires 209 seconds (i.e., around three and a half minutes), which is acceptable considering the sheer size of the dataset.

A.4 Dataset Statistics

The following table presents the statistics of the eight benchmark datasets, including five small- to medium-scale datasets and three large-scale datasets from the Open Graph Benchmark Hu et al. (2020).

Dataset Nodes Edges Features Classes
Cora 2,708 5,429 1,433 7
CiteSeer 3,327 4,732 3,703 6
PubMed 19,717 44,338 500 3
Amazon Computers 13,752 245,861 767 10
Amazon Photo 7,650 119,081 745 8
ogbn-arxiv 169,343 1,166,243 128 40
ogbn-products 2,449,029 61,859,140 100 47
ogbn-papers-100M 111,059,956 1,615,685,872 100 172
Table 14: The statistics of eight benchmark datasets.

A.5 Experiment Settings & Computing Infrastructure

Extending to Extremely Large Datasets. To extend to extremely large graphs (i.e., ogbn-products and ogbn-papers100M), we adopt a simple neighbourhood sampling strategy introduced in GraphSage Hamilton et al. (2017) to decouple model training from the sheer size of the graphs. Specifically, we create a fixed-size subgraph for each node by sampling a predefined number of neighbours in each convolution layer for the sampled nodes. The same approach is employed in the testing phase to obtain the final embeddings.
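A library-agnostic sketch of this fixed-size neighbour sampling is shown below (uniformly sampling up to a fixed fanout per node per hop); practical implementations would rely on the samplers shipped with graph libraries, and the toy adjacency list, fanout and hop count here are illustrative.

```python
import torch

def sample_neighbors(adj_list, seed_nodes, fanout=12, hops=3):
    """Uniformly sample up to `fanout` neighbours per node for `hops` hops,
    returning the node set forming the seeds' computation sub-graph."""
    frontier, visited = list(seed_nodes), set(seed_nodes)
    for _ in range(hops):
        nxt = []
        for v in frontier:
            neigh = adj_list[v]
            if len(neigh) > fanout:                         # down-sample large neighbourhoods
                idx = torch.randperm(len(neigh))[:fanout]
                neigh = [neigh[int(i)] for i in idx]
            nxt.extend(n for n in neigh if n not in visited)
        visited.update(nxt)
        frontier = nxt
    return visited

# Toy adjacency list and a mini-batch of 4 seed nodes.
adj_list = {v: [(v + k) % 100 for k in range(1, 30)] for v in range(100)}
subgraph_nodes = sample_neighbors(adj_list, seed_nodes=[0, 1, 2, 3])
print(len(subgraph_nodes))
```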

General Parameter Settings. In our experiments, we mainly tune four parameters for GGD, which are the learning rate, the hidden size, the number of convolution layers in the GNN encoder, and the number of linear layers in the projector. For simplicity, we fix the graph power $n$ for final embedding generation to 10 for all datasets. The parameter setting for each dataset is shown below:

Dataset lr hidden num-conv num-proj
Cora 1e-3 256 1 1
CiteSeer 1e-5 512 1 1
PubMed 1e-3 512 1 1
Amazon Computers 1e-3 256 1 1
Amazon Photo 1e-3 1024 1 1
ogbn-arxiv 5e-5 1500 3 1
ogbn-products 1e-4 1024 4 4
ogbn-papers-100M 1e-3 256 3 1
Table 15: Parameter settings on eight datasets. ‘num-conv’ and ‘num-proj’ represent number of convolution layers in GNNs and number of linear layers in projector, respectively.

Large-scale Datasets Parameter Settings. To decouple model training from the scale of graphs, we adopt the neighbouring sampling technique, which has three parameters: batch size, sample size, number of hops to be sampled. Batch size refers to the number of nodes to be processed in one parameter optimisation step. Sample size means the number of nodes to be sampled in each convolution layer, and number of hops determines the scope of the neighbourhood for sampling. In GGD implementation, the batch size, sample size, and number of hops are fixed to 2048, 12 and 3, respectively.

Memory and Training Time Comparison. Memory and training time are very sensitive to hyper-parameters related to the structure of GNNs, including the hidden size, the number of convolution layers, and batch-processing settings for large-scale datasets, e.g., the batch size and the number of neighbours sampled in each layer. Thus, for a fair memory and training time comparison, we set all these parameters to be the same for all baselines and GGD. The specific parameter setting for each dataset is shown below:

Dataset hidden num-conv batch num-neigh
Cora 512 1 - -
CiteSeer 512 1 - -
PubMed 256 1 - -
Amazon Computers 256 1 - -
Amazon Photo 256 1 - -
ogbn-arxiv 256 3 - -
ogbn-products 256 3 512 10
ogbn-papers-100M 128 3 512 10
Table 16: Parameter settings on eight datasets for memory and training time comparison.

Computing Infrastructure. The experiments in Sections 3, 5 and 6.1 are conducted using an Nvidia GRID T4 (16GB memory) and an Intel Xeon Platinum 8260 with 8 cores. For experiments on large-scale datasets (i.e., ogbn-arxiv, ogbn-products and ogbn-papers100M), we use an NVIDIA A40 (48GB memory) and an Intel Xeon Gold 5320 with 13 cores.