1 Introduction
Graph Neural Networks (GNNs) have been widely adopted in learning representations for graph-structured data. With a message-passing approach that aggregates neighbouring information based on the topology of a graph, GNNs can learn effective low-dimensional node embeddings, which can be used for a variety of downstream tasks such as node classification Jin et al. (2021), link prediction Zhang and Chen (2018), and graph classification Hassani and Khasahmadi (2020). GNNs have been further applied in diverse domains, e.g., graph similarity computation Di et al. (2022), recommendation systems Fan et al. (2019), heterophilic graphs Zheng et al. (2022), trustworthy systems Zhang et al. (2022), and anomaly detection Liu et al. (2021); Zheng et al. (2021b). However, many GNNs are trained in a supervised manner with label information, which is expensive and labour-intensive to collect in the real world. To address this issue, a few studies (e.g., DGI Veličković et al. (2019), MVGRL Hassani and Khasahmadi (2020), GMI Peng et al. (2020), GRACE Zhu et al. (2020), and SubgCon Jiao et al. (2020)) borrow the idea of contrastive learning from computer vision (CV) and introduce Graph Contrastive Learning (GCL) methods for self-supervised graph representation learning (GRL). The core idea of these methods is to maximise the mutual information (MI) between an anchor node and its positive counterparts, which share similar semantic information, while doing the opposite for negative counterparts, as shown in Figure
1(a). Nonetheless, such a scheme relies on expensive similarity calculation in contrastive loss computation. Taking two commonly adopted contrastive loss functions, InfoNCE and the JSD estimator, as examples, the time complexity of loss computation for a node is O(ND) and O(D), respectively. Here, N represents the number of nodes, and D is the embedding dimension. Additionally, GCL normally requires a large number of training epochs to be well-trained on large-scale datasets. Thus, when the size of the dataset is large, these methods require a significant amount of time and resources to be well-trained.

Method  Pre  Tr  Epo  Total(E)  Imp(E)  Total(T)  Imp(T)  Acc
GBT(256)  5.52  6.47  300  1,946.52  -  1,941.00  -  70.1
GGD(256)  6.26  0.18  1  6.44  302.25×  0.18  10,783.33×  70.3
GGD(1,500)  6.26  0.95  1  7.21  269.96×  0.95  2,043.16×  71.6
Though a few GCL works attempt to alleviate the training load of InfoNCE by removing negative node pairs with specially designed schemes, e.g., BGRL Thakoor et al. (2021) and GBT Bielak et al. (2021), these methods still require O(D) time to compute the contrastiveness for a node. Driven by the success of BYOL Grill et al. (2020) in CV, BGRL Thakoor et al. (2021) adopts a bootstrapping scheme, which only contrasts a node from the online network (i.e., updated with gradient) with its corresponding embedding from the target network (i.e., updated via momentum with stop-gradient). Based on Barlow Twins Zbontar et al. (2021), GBT Bielak et al. (2021) utilises a cross-correlation-based loss function to get rid of negative samples. However, these two methods still require O(D) per node, as they still need to conduct similarity computation. Thus, the complexity of these two methods is only on par with the JSD estimator, and they still suffer from inefficiency in model training.
To boost the training efficiency of self-supervised GRL, inspired by an observation of a technical defect (i.e., an inappropriate application of the Sigmoid function) in two representative GCL studies, we introduce a novel learning paradigm, namely, Group Discrimination (GD). Remarkably, with GD, the time complexity of loss computation for a node is only O(1), making the scheme extremely efficient. Instead of similarity computation, GD directly discriminates a group of positive nodes from a group of negative nodes, as shown in Figure 1
(b). Specifically, GD defines summarised node embeddings generated from the original graph as the positive group, where each node is summarised into a single scalar, while embeddings obtained with corrupted topology are regarded as the negative group. Then, GD trains the model by classifying these embeddings into the correct group with a very simple binary cross-entropy loss. By doing so, the model can extract valuable self-supervised signals from learning the edge distribution of a graph. Compared with GCL, GD enjoys numerous merits including extremely fast training, fast convergence (e.g., one epoch to be well-trained on large-scale datasets), and high scalability, while achieving SOTA or on-par performance compared with existing GCL approaches.
Using GD as the backbone, we design a new self-supervised GRL model with a Siamese structure called Graph Group Discrimination (GGD). Firstly, we can optionally augment a given graph with augmentation techniques, e.g., feature and edge dropout. Then, the augmented graph is fed into a GNN encoder and a projector to obtain embeddings for the positive group. After that, the augmented feature matrix is corrupted with node shuffling (i.e., disarranging the order of nodes in the feature matrix) to disrupt the topology of the graph and input to the same network to obtain embeddings of the opposing group. Finally, the model is trained by discriminating the summarisation of these two groups of nodes. The contributions of this paper are threefold: 1) We introduce a novel and efficient self-supervised GRL paradigm, namely, Group Discrimination (GD). Notably, with GD, the time complexity of loss computation for a node is only O(1). 2) Based on GD, we propose a new self-supervised GRL model, GGD, which is fast in training and convergence, and possesses high scalability. 3) We conduct extensive experiments on eight datasets, including an extremely large dataset, ogbn-papers100M, with over one billion edges. The experiment results show that our proposed method reaches state-of-the-art performance while consuming much less time and memory than baselines, e.g., 10,783× faster than the most efficient GCL baseline with its best selected number of epochs Bielak et al. (2021), as shown in Table 1.
2 Rethinking Representative GCL Methods
In this section, we analyse a technical defect observed in two representative GCL methods, DGI Veličković et al. (2019) and MVGRL Hassani and Khasahmadi (2020). Then, based on this defect, we show that the learning ability of these two approaches is not attributable to contrastive learning, but to a new paradigm, Group Discrimination. Finally, from this analysis, we provide the definition of this new concept.
2.1 Rethinking DGI
DGI Veličković et al. (2019)
is the first work introducing contrastive learning into GRL. However, due to a technical defect observed in its official open-source code, we find that it does not essentially work as the authors intended (i.e., learning via MI maximisation). The backbone that truly makes it work is a new paradigm, Group Discrimination.
Constant Summary Vector. As shown in Figure 2, the original idea of DGI is to maximise the MI (i.e., the red line) between a node and the summary vector s, which is obtained by averaging all node embeddings in a graph G. Also, to regularise model training, DGI corrupts G by shuffling the node order of the input feature matrix to get Ĝ. Then, the embeddings generated from Ĝ serve as negative samples, which are pushed away from the summary vector via MI minimisation.
Activation  Statistics  Cora  CiteSeer  PubMed
ReLU/LReLU/PReLU  Mean  0.50  0.50  0.50
ReLU/LReLU/PReLU  Std  1.3e-03  1.0e-04  4.0e-04
ReLU/LReLU/PReLU  Range  1.4e-03  8.0e-04  1.5e-03
Sigmoid  Mean  0.62  0.62  0.62
Sigmoid  Std  5.4e-05  2.9e-05  6.6e-05
Sigmoid  Range  3.6e-03  3.0e-03  3.2e-03
Summary vector statistics on three datasets with different activation functions, including ReLU, LeakyReLU (LReLU), PReLU, and Sigmoid.
Nonetheless, in the implementation of DGI, a Sigmoid function is inappropriately applied to the summary vector generated from a GNN whose weights are initialised with Xavier initialisation. As a result, elements in the summary vector are all very close to the same value. We have validated this finding on three datasets (i.e., Cora, CiteSeer, and PubMed) with different activation functions used in the GNN encoder, including ReLU, Leaky ReLU, PReLU, and Sigmoid. The experiment result is shown in Table 2, which shows that summary vectors in all datasets are approximately a constant vector α·1, where α is a scalar and 1 is an all-ones vector (e.g., α = 0.5 in these datasets with ReLU, Leaky ReLU, or PReLU as activation).
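This collapse is easy to reproduce outside any graph library. The sketch below is a toy setup under stated assumptions: a single Xavier-initialised linear layer with ReLU stands in for the GCN encoder, and random Gaussian features stand in for a real dataset; the summary vector is the sigmoid of the mean-pooled embeddings.

```python
import numpy as np

# Toy reproduction of the near-constant summary vector (not the paper's exact setup).
rng = np.random.default_rng(0)
N, D = 1000, 512                                   # nodes, feature/embedding dims
X = rng.normal(size=(N, D))                        # stand-in for node features

limit = np.sqrt(6.0 / (D + D))                     # Xavier/Glorot uniform bound
W = rng.uniform(-limit, limit, size=(D, D))        # stand-in for the GCN weight

H = np.maximum(X @ W, 0.0)                         # ReLU "embeddings"
s = 1.0 / (1.0 + np.exp(-H.mean(axis=0)))          # sigmoid of the mean-pooled summary

print(f"summary mean={s.mean():.3f}, std={s.std():.1e}, range={s.max() - s.min():.1e}")
```

Across the 512 entries, the standard deviation is orders of magnitude smaller than that of the raw embeddings, mirroring the pattern in Table 2.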
To theoretically explain this phenomenon, we present the theorem below:
Theorem 1
Given X ∈ R^{N×D} and a GCN encoder whose weight matrix is initialised with Xavier initialisation, we can obtain its embedding H. Then, as D increases, the data range of the Sigmoid function taking H as input (i.e., σ(H)) converges to 0.
Based on this theorem, we can see that if the dimension of the input feature matrix X is large, these summary vectors lose variance and become a constant vector. The proof for the theorem is presented in Appendix A.1. To evaluate the effect of α on the constant summary vector, we vary the scalar α (from 0 to 1 in increments of 0.2) to change the constant summary vector and report the model performance (i.e., accuracy averaged over five runs) in Table 3.
Dataset  α=0  α=0.2  α=0.4  α=0.6  α=0.8  α=1.0
Cora  70.3  82.4  82.3  82.5  82.3  82.5
CiteSeer  61.8  71.7  71.9  71.6  71.7  71.6
PubMed  68.3  77.8  77.9  77.7  77.4  77.2
From this table, we can see that, except for α = 0, the model performance is only trivially affected by the value of α in the constant summary vector. When the summary vector is set to 0, the model performance plummets because node embeddings become all zeros when multiplied with such a vector, and the model converges to a trivial solution. As the summary vector only has a trivial effect on model training, the hypothesis of DGI Veličković et al. (2019) on learning via contrastiveness between anchor nodes and the summary instance does not hold, which raises a question to be investigated: what truly leads to the success of DGI?
Simplifying DGI. To answer the question, we simplify the loss proposed in DGI by using an all-ones vector as the summary vector (i.e., setting α = 1) and simplifying the discriminator (i.e., removing the learnable weight). Then, we rewrite the loss in the following form:
$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\big[\log \sigma(\mathrm{sum}(h_i)) + \log(1 - \sigma(\mathrm{sum}(\hat{h}_i)))\big]$   (1)
where · is the vector multiplication operation, N is the number of nodes in a graph, h_i and ĥ_i are the original and corrupted embeddings for node i, and sum(·) is the summation function, which is equivalent to multiplication with the all-ones summary vector. In the original loss, a discriminator 𝒟 for bilinear transformation takes the place of sum(·); it can be formulated as follows:
$\mathcal{D}(h_i, s) = \sigma(h_i^{\top} W s)$   (2)
Experiment  Method  Cora  CiteSeer  PubMed
Accuracy  DGI  81.7  71.5  77.3
Accuracy  DGI (Eq. 3)  82.5  71.7  77.7
Memory  DGI  4,189MB  8,199MB  11,471MB
Memory  DGI (Eq. 3)  1,475MB (-64.8%)  1,587MB (-80.6%)  1,629MB (-85.8%)
Time  DGI  0.085s  0.134s  0.158s
Time  DGI (Eq. 3)  0.010s (8.5×)  0.021s (6.4×)  0.015s (10.5×)
where W is the learnable weight. Specifically, as shown in Equation 2, by removing the weight W, h_i is directly multiplied with s. As s is a vector containing only ones, the multiplication of h_i and s is equivalent to directly summing the entries of h_i. From this form, we can see that the multiplication of h_i and the summary vector only serves as an aggregation function (i.e., summation aggregation) to summarise the information in h_i. To explore the effect of other aggregation functions, we replace the summation function in Equation 1 with other aggregation methods, including mean, minimum, and maximum pooling, and linear aggregation. We report the experiment results (i.e., accuracy averaged over five runs) in Table 5. The table shows that replacing the summation function with other aggregation methods still works, while summation and linear aggregation achieve comparatively better performance.
Based on Equation 1, we can rewrite it as a very simple binary cross-entropy loss if we also include corrupted nodes as data samples and set ŷ_i to the summarisation of node i's embedding:
$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{2N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$   (3)
Method  Cora  CiteSeer  PubMed
Sum  82.5  71.7  77.7
Mean  81.8  71.8  76.5
Min  80.4  61.7  70.1
Max  71.4  65.3  70.2
Linear  82.2  72.1  77.9
where y_i is the indicator for node i (i.e., if node i is corrupted, y_i is 0; otherwise it is 1), and ŷ_i represents the summarisation of node i's embedding. As we include corrupted nodes as data samples, the number of nodes to be processed is doubled to 2N (i.e., the number of corrupted nodes equals the number of original nodes). From the equation above, we can easily observe that what DGI truly does is discriminate a group of summarised original node embeddings from another group of summarised corrupted node embeddings, as shown in Figure 1. We name this self-supervised learning paradigm "Group Discrimination". To validate the effectiveness of this paradigm, we replace the original DGI loss with Equation 3 and compare the resulting variant with DGI on three datasets in terms of training time, memory efficiency, and model performance, as shown in Table 4. Here, the variant adopts the same parameter settings as DGI. From this table, we can observe that it dramatically improves DGI in both memory and time efficiency while slightly enhancing model performance. This is attributable to the removal of multiplication operations between node pairs, which eases the burden of computation and memory consumption in both forward and backward propagation. In Section 4.1, we further explore GD and compare it with existing contrastive losses.
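As a concrete illustration, Equation 3 fits in a few lines. The sketch below assumes sum aggregation followed by a sigmoid to map each node's summary scalar into (0, 1); the matrix sizes and toy embeddings are illustrative only.

```python
import numpy as np

def group_discrimination_loss(H_pos, H_neg):
    """Binary cross-entropy over summarised node embeddings (a sketch of Eq. 3).

    Each embedding is summarised to one scalar via sum aggregation and squashed
    with a sigmoid; label 1 = original graph, 0 = corrupted graph.
    """
    logits = np.concatenate([H_pos.sum(axis=1), H_neg.sum(axis=1)])   # shape (2N,)
    y = np.concatenate([np.ones(len(H_pos)), np.zeros(len(H_neg))])   # group indicators
    p = 1.0 / (1.0 + np.exp(-logits))                                 # O(1) per node
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Embeddings whose groups are well separated should incur a lower loss than
# embeddings with no group structure at all.
rng = np.random.default_rng(0)
good = group_discrimination_loss(rng.normal(2.0, 0.1, (100, 8)), rng.normal(-2.0, 0.1, (100, 8)))
bad = group_discrimination_loss(rng.normal(0.0, 0.1, (100, 8)), rng.normal(0.0, 0.1, (100, 8)))
```

Note that no pairwise similarity is computed anywhere: after summarisation, each node contributes a single scalar to the loss.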
2.2 Rethinking MVGRL
Extending the architecture of DGI, MVGRL resorts to multi-view contrastiveness via additional augmentation. Specifically, as shown in Figure 3, it first uses diffusion augmentation to create a diffused view G^d. Then, it corrupts G and G^d to generate negative samples Ĝ and Ĝ^d. To build contrastiveness, MVGRL also generates two summary vectors s and s^d by averaging all embeddings in H and H^d, respectively. Based on the design of MVGRL, model training is driven by mutual information maximisation between an anchor node embedding and its corresponding augmented summary vector. However, MVGRL has the same technical error in its official JSD-based implementation as DGI, which makes it, too, a group-discrimination-based approach.
Similar to Equation 3 of DGI, the proposed loss in MVGRL can also be rewritten as a binary cross-entropy loss:
$\mathcal{L} = -\frac{1}{4N}\sum_{i=1}^{4N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$   (4)
Here the number of data samples is increased to 4N, as we include nodes in H^d and Ĥ^d as data samples. The indicators for H and H^d are 1, while Ĥ and Ĥ^d are considered negative samples (i.e., their indicators are 0). To explore why MVGRL can achieve better performance than DGI, we replace the original MVGRL loss with Equation 4 and conduct an ablation study by removing different sets of data samples in Equation 4, and report the experiment results in Table 6. From the table, the performance of the Equation 4 variant is on par with the original MVGRL, which reconfirms the effectiveness of using the BCE loss. Also, we can observe that including H^d and Ĥ^d is the key to MVGRL surpassing DGI. With H^d and Ĥ^d, the model performance on Cora is improved from 82.2 to 83.1. We conjecture this is because, with the diffusion augmentation, MVGRL is trained with the additional global information provided by the diffused view G^d. However, the diffusion augmentation involves expensive matrix inversion computation and significantly densifies the given graph, which requires much more memory and time to store and process than the original view. This can hinder the model from extending to large-scale datasets Zheng et al. (2021a).
Method  Cora  CiteSeer  PubMed
MVGRL  82.9  72.6  78.8
MVGRL (Eq. 4)  83.1  72.8  79.1
w/o H  81.2  52.8  76.6
w/o Ĥ  82.1  71.8  77.1
w/o H^d  81.1  56.7  74.9
w/o Ĥ^d  82.7  72.0  78.6
w/o H^d and Ĥ^d  82.2  71.8  77.0
w/o H and Ĥ  83.1  72.6  78.5
2.3 Definition of Group Discrimination
As mentioned above, Group Discrimination is a self-supervised GRL paradigm that learns by discriminating different groups of node embeddings. Specifically, the paradigm assigns different indicators to different groups of node embeddings. For example, in binary group discrimination, one group is considered the positive group with class 1 as its indicator, whereas the other group is the negative group, with its indicator assigned as 0. Given a graph G, the positive group usually includes node embeddings generated with the original graph or its augmented views (i.e., similar graph instances of G created by augmentation). In contrast, the opposing group contains negative samples obtained by corrupting G, e.g., by changing its topology structure.
3 Methodology
We first define unsupervised node representation learning and then present the architecture of GGD, which extends GD with additional augmentation and embedding reinforcement to reach better model performance. Given a graph G with attributes X ∈ R^{N×D}, where N is the number of nodes in G and D is the number of dimensions of X, our aim is to train a GNN encoder without reliance on labelling information. Taking G and X as input, the trained encoder can output learned representations H ∈ R^{N×D'}, where D' is the predefined output dimension. H can then be used in many downstream tasks such as node classification.
3.1 Graph Group Discrimination
Based on the proposed self-supervised GRL paradigm, Group Discrimination, we design a novel method, namely GGD, to learn node representations using a Siamese network structure and a binary cross-entropy loss. The architecture of the proposed framework is presented in Figure 4. The framework mainly consists of four components: augmentation, corruption, a Siamese GNN network, and group discrimination.
Augmentation. Given a graph G and feature matrix X, we can optionally augment them with augmentation techniques such as edge and feature dropout to create G̃ and X̃. In practice, we follow the augmentation proposed in GraphCL You et al. (2020). Specifically, edge dropout removes a predefined fraction of edges, while feature dropout masks a predefined proportion of feature dimensions, i.e., assigning 0 to replace the values in randomly selected dimensions. This step is optional in implementation.
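A minimal sketch of this optional step, assuming edges are stored as a (2, E) index array; the dropout rates below are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

def augment(edge_index, X, p_edge=0.2, p_feat=0.2, rng=None):
    """Edge dropout + feature-dimension masking (an optional-augmentation sketch).

    edge_index: (2, E) array of edges; X: (N, D) feature matrix.
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(edge_index.shape[1]) >= p_edge    # drop a fraction of edges
    masked = rng.random(X.shape[1]) < p_feat            # choose whole feature dims
    X_aug = X.copy()
    X_aug[:, masked] = 0.0                              # zero out the masked dims
    return edge_index[:, keep], X_aug
```

Because the random masks are redrawn on every call, the model sees a slightly different graph each training iteration, which is exactly the effect described above.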
Notably, the motivation for using augmentation in our framework is distinct from that of contrastive learning methods. In our study, augmentation is used to increase the difficulty of the self-supervised training task. With augmentation, G̃ and X̃ change in every training iteration, which forces the model to lessen its dependence on the fixed pattern (i.e., unchanged edge and feature distribution) of a monotonous graph. In contrastive learning, by contrast, augmentation creates augmented views sharing similar semantic information for building contrastiveness.
Corruption. G̃ and X̃ are then corrupted to build Ĝ and X̂ for the generation of node embeddings in the negative group. We adopt the same corruption technique used in DGI Veličković et al. (2019) and MVGRL Hassani and Khasahmadi (2020), as shown in Figure 5. The corruption technique devastates the topology structure of G̃ by randomly changing the order of nodes in X̃. The corrupted Ĝ and X̂ can then be used to produce node representations with incorrect network connections.
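The corruption itself is a one-liner, assuming features are stored as an (N, D) matrix whose row order defines the node order.

```python
import numpy as np

def corrupt(X, rng=None):
    """Node-shuffling corruption as used by DGI and MVGRL: permute the rows of
    the feature matrix X so every node keeps its edges in the graph but is
    paired with another node's features, destroying the true topology-feature
    correspondence."""
    rng = rng or np.random.default_rng()
    return X[rng.permutation(X.shape[0])]
```

Because only the row order changes, the corrupted view has exactly the same feature distribution as the original; the only signal separating the two groups is whether features sit on the correct nodes.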
The Siamese GNN. We design a Siamese GNN network to output node representations given a graph and its attributes. The Siamese GNN is made up of two components: a GNN encoder and a projector. The backbone GNN encoder is replaceable with a variety of GNNs, e.g., GCN Kipf and Welling (2017) and GAT Veličković et al. (2018). In our work, we adopt GCN as the backbone. The projector is a flexible network with a user-defined number of linear layers. When generating node embeddings of the positive group, the Siamese network takes G̃ and X̃ as input. Using the same encoder and projector, the Siamese network outputs the negative group with Ĝ and X̂. These two groups of node embeddings are considered a collection of data samples with a size of 2N for discrimination. Before conducting group discrimination, all data samples are summarised with the same aggregation technique, e.g., sum, mean, or linear aggregation.
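A minimal numpy sketch of the shared-weight forward pass, assuming a one-layer GCN encoder with symmetric normalisation and a single linear projector (the depths, activations, and initialisation in GGD are configurable, so treat these details as assumptions).

```python
import numpy as np

def normalise_adj(A):
    """Symmetrically normalised adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def siamese_forward(A, X, X_corrupt, W_enc, W_proj):
    """One shared-weight pass for both groups: a single GCN layer + linear
    projector, applied to the original and the corrupted features."""
    A_norm = normalise_adj(A)
    relu = lambda Z: np.maximum(Z, 0.0)
    H_pos = relu(A_norm @ X @ W_enc) @ W_proj          # positive group (original features)
    H_neg = relu(A_norm @ X_corrupt @ W_enc) @ W_proj  # negative group (shuffled features)
    return H_pos, H_neg
```

The key point is weight sharing: both groups pass through the same encoder and projector, so the only difference between them is the corruption of the input.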
Group Discrimination. In the group discrimination process, we adopt a very simple binary cross-entropy (BCE) loss to discriminate the two groups of node embeddings, as shown in Equation 3. In our implementation, y_i is 0 and 1 for node embeddings in the negative and positive groups, respectively. During model training, the model is optimised by correctly categorising node embeddings in the collection of data samples into their corresponding classes. The loss is computed by comparing the summarisation ŷ_i of a node i, i.e., a scalar, with its indicator y_i. With the ease of BCE loss computation, the training process of GGD is very fast and memory-efficient.
3.2 Model Inference
During training, the model is optimised via loss minimisation with Equation 3. The time complexity analysis of GGD is provided in Appendix A.2. With the trained GNN encoder, we can obtain the node embeddings H with the input G and X.
Inspired by MVGRL Hassani and Khasahmadi (2020), which strengthens the output embeddings by including additional global information, we adopt a conceptually similar embedding reinforcement approach. Specifically, MVGRL obtains the final embeddings by summing up embeddings from two views: the original view comprising local information and the diffused view with global information. This operation reinforces the final embeddings and leads to model performance improvement. Nonetheless, graph diffusion impairs the scalability of a model Zheng et al. (2021a) and hence cannot be directly applied in our embedding generation. To avoid the diffusion computation, we have come up with a workaround by virtue of graph power to extract global information. Graph power can extend the message-passing scope to the n-hop neighbourhood, which encodes global information from distant neighbours. It can be formulated as follows:
$H^{global} = A^{n} H$   (5)
where H^global is the global embedding, and A is the adjacency matrix of the graph G. It is notable that this operation can be easily decomposed with the associative property of matrix multiplication and is easy to compute. To show the easiness of such computation, we conduct an experiment showing its time consumption on various datasets in Appendix A.3. Finally, the final embedding can be obtained as H^final = H + H^global, which can be used for downstream tasks.
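A sketch of Equation 5 exploiting associativity: n successive products with A (sparse-friendly) instead of materialising the dense power A^n; here n is a hyperparameter and the default value is illustrative.

```python
import numpy as np

def reinforce_embeddings(A, H, n=2):
    """Embedding reinforcement via graph power (Equation 5): H_global = A^n H,
    evaluated right-to-left so the dense matrix A^n is never materialised."""
    H_global = H.copy()
    for _ in range(n):              # associativity: A^n H = A @ (A @ ... (A @ H))
        H_global = A @ H_global
    return H + H_global             # H_final = H + H_global
```

Each step is a single (sparse) matrix-times-embedding product, so the cost grows linearly with n rather than with the density of A^n.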
4 Exploring Group Discrimination
4.1 Corruption
Firstly, we explore the corruption technique used in DGI Veličković et al. (2019) and MVGRL Hassani and Khasahmadi (2020), which is shown in Figure 5. These two studies corrupt the topology of a given graph by shuffling the feature matrix X. This is because, by changing the node order of X, the neighbouring structure of the graph is completely changed, e.g., the neighbours of one node become the neighbours of another.
With the corruption technique, negative samples in the negative group are generated with incorrect edges. Thus, by discriminating the positive group (i.e., nodes generated with ground truth edges) and the negative group, we conjecture the model can distill valuable signals from learning how to identify nodes generated with correct topology and output effective node embeddings.
4.1.1 Benefits of Group Discrimination
Compared with contrastive learning, Group Discrimination enjoys advantages in computation efficiency, memory efficiency, and convergence speed. To compare Group Discrimination with GCL, we first analyse the complexity of two commonly adopted contrastive loss functions, InfoNCE and the JSD estimator. These two contrastive losses for an anchor node can be formulated as follows:
$\mathcal{L}_{\mathrm{InfoNCE}}(i) = -\log \frac{\exp(h_i \cdot h_i^{+}/\tau)}{\sum_{j=1}^{N} \exp(h_i \cdot \tilde{h}_j/\tau)}$   (6)

$\mathcal{L}_{\mathrm{JSD}}(i) = -\log \mathcal{D}(h_i, h_i^{+}) - \log\big(1 - \mathcal{D}(h_i, \tilde{h}_i)\big)$   (7)
where h_i represents the node embedding of node i, τ is the temperature coefficient, 𝒟 is the discriminator, h̃_j is a corrupted negative counterpart, and h_i^+ is the positive counterpart sharing similar semantic information with node i. In InfoNCE, the loss computation conducts similarity computation for positive (i.e., the numerator of Equation 6) and negative counterparts (i.e., the denominator) once and N times, respectively. As the time complexity of node similarity computation is O(D) (e.g., vector multiplication between h_i and h_i^+), the overall complexity of InfoNCE is O(ND). Though the JSD estimator does not require N negative samples for each node, it still needs O(D) for the loss computation of an anchor node, because the discriminator requires vector multiplication between the embeddings of nodes i and j.
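To make the O(ND) cost concrete, here is a per-anchor InfoNCE sketch. One assumption departs from the equation above: the positive pair is included in the denominator, a common stabilising convention, so details may differ from any specific GCL implementation.

```python
import numpy as np

def infonce_anchor_loss(h_i, h_pos, H_neg, tau=0.5):
    """InfoNCE loss for a single anchor node (a sketch).

    One positive similarity plus N negative similarities, each an O(D) dot
    product, giving O(ND) work per anchor.
    """
    sim_pos = h_i @ h_pos / tau          # one O(D) dot product
    sim_neg = H_neg @ h_i / tau          # N dot products: O(ND)
    logits = np.concatenate(([sim_pos], sim_neg))
    m = logits.max()                     # log-sum-exp for numerical stability
    return float(m + np.log(np.exp(logits - m).sum()) - sim_pos)
```

Contrast this with Group Discrimination, where each node contributes a single pre-summarised scalar and no dot products appear in the loss at all.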
To alleviate the computation burden of the InfoNCE loss, BGRL Thakoor et al. (2021) and GBT Bielak et al. (2021) attempt to get rid of the reliance on negative samples with customised losses. BGRL Thakoor et al. (2021) utilises a simple loss to maximise the cosine similarity between the online node embedding h_i and the target node embedding ĥ_i for node i:

$\mathcal{L}_{\mathrm{BGRL}}(i) = -\frac{h_i \cdot \hat{h}_i}{\lVert h_i \rVert \, \lVert \hat{h}_i \rVert}$   (8)
where h_i and ĥ_i are generated from two augmented views of the input graph. Different from BGRL Thakoor et al. (2021), GBT Bielak et al. (2021) adopts a cross-correlation-based loss as follows:
$\mathcal{L}_{\mathrm{GBT}} = \sum_{d}\big(1 - C_{dd}\big)^2 + \lambda \sum_{d}\sum_{d' \neq d} C_{dd'}^2$   (9)
where d and d' are two indices over the embedding dimensions of the cross-correlation matrix C computed between the embeddings of the two views.
From the losses shown above, we can see that these two losses still involve vector multiplication between two node embeddings. Thus, even though these two methods reduce the time complexity of InfoNCE (i.e., O(ND)) to O(D), they are still only on par with the JSD estimator. Different from the aforementioned contrastive learning losses, to actuate the Group Discrimination learning paradigm, our method uses an effortless binary cross-entropy loss, as shown in Equation 3. In Group Discrimination, the loss computation for a node requires only O(1), as it conducts multiplication between scalars (i.e., ŷ_i and y_i) in lieu of vectors. Thus, Group Discrimination has great advantages in computation and memory efficiency. Another merit of the proposed paradigm is fast convergence, as shown in Section 5.2. We conjecture this is attributable to the concentration of Group Discrimination on the general edge distribution of graphs instead of node-specific information, as GCL methods use; thus, it is not distracted by overly detailed information.
5 Experiments
We evaluate the effectiveness of our model using eight benchmark datasets of different sizes. These datasets include five small- and medium-scale datasets: Cora, CiteSeer, PubMed Sen et al. (2008), Amazon Computers, and Amazon Photos Shchur et al. (2018), as well as the large-scale datasets ogbn-arxiv, ogbn-products, and ogbn-papers100M. Notably, ogbn-papers100M is the largest dataset provided by Open Graph Benchmark Hu et al. (2020) for node property prediction tasks. It has over 110 million nodes and 1 billion edges. The statistics of these datasets are summarised in Appendix A.4. To ensure reproducibility, the detailed experiment settings and computing infrastructure are summarised in Appendix A.5.
Data  Method  Cora  CiteSeer  PubMed  Comp  Photo 
X, A, Y  GCN  81.5  70.3  79.0  76.3  87.3 
X, A, Y  GAT  83.0  72.5  79.0  79.3  86.2 
X, A, Y  SGC  81.0  71.9  78.9  74.4  86.4 
X, A, Y  CG3  83.4  73.6  80.2  79.9  89.4 
X, A  DGI  81.7  71.5  77.3  75.9  83.1 
X, A  GMI  82.7  73.0  80.1  76.8  85.1 
X, A  MVGRL  82.9  72.6  79.4  79.0  87.3 
X, A  GRACE  80.0  71.7  79.5  71.8  81.8 
X, A  BGRL  80.5  71.0  79.5  89.2  91.2 
X, A  GBT  81.0  70.8  79.0  88.5  91.1 
X, A  GGD  83.9  73.0  81.3  90.1  92.5 
Method  Cora  CiteSeer  PubMed  Comp  Photo 
DGI  0.085  0.134  0.158  0.171  0.059 
GMI  0.394  0.497  2.285  1.297  0.637 
MVGRL  0.123  0.171  0.488  0.663  0.468 
GRACE  0.056  0.092  0.893  0.546  0.203 
BGRL  0.085  0.094  0.147  0.337  0.273 
GBT  0.073  0.072  0.103  0.492  0.173 
GGD  0.010  0.021  0.015  0.016  0.009 
Improve  7.3×–39.4×  3.4×–23.7×  6.9×–152.3×  10.7×–15.3×  19.2×–70.8×
5.1 Evaluating on Small- and Medium-scale Datasets
We compare GGD with ten baselines, including four supervised GNNs (i.e., GCN Kipf and Welling (2017), GAT Veličković et al. (2018), SGC Wu et al. (2019), and CG3 Wan et al. (2020)) and six GCL methods (i.e., DGI Veličković et al. (2019), GMI Peng et al. (2020), MVGRL Hassani and Khasahmadi (2020), GRACE Zhu et al. (2020), BGRL Thakoor et al. (2021), and GBT Bielak et al. (2021)) on five small- and medium-scale benchmark datasets. In the experiment, we follow the same data splits as Yang et al. (2016) for Cora, CiteSeer, and PubMed. For Amazon Computers and Photos, we adopt the semi-supervised experiment setting, which includes 30 randomly selected nodes per class in the training and validation sets. The remaining nodes are used as the test set. The model performance is measured using the classification accuracy averaged over five runs along with standard deviations and reported in Table
7.

Method  Cora  CiteSeer  PubMed  Comp  Photo
DGI  4,189  8,199  11,471  7,991  4,946 
GMI  4,527  5,467  14,697  10,655  5,219 
MVGRL  5,381  5,429  6,619  6,645  6,645 
GRACE  1,913  2,043  12,597  8,129  4,881 
BGRL  1,627  1,749  2,299  5,069  3,303 
GBT  1,651  1,799  2,461  5,037  2,641 
GGD  1,475  1,587  1,629  1,787  1,637 
Improve  10.7–72.6%  11.8–80.6%  27.2–85.8%  64.5–83.2%  38.0–75.4%
Accuracy. From Table 7, we can observe that GGD generally outperforms all baselines on all datasets. The only exception is the CiteSeer dataset, where the semi-supervised method CG3 Wan et al. (2020) slightly outperforms GGD, which still provides the second-best performance. In this experiment, we reproduce BGRL Thakoor et al. (2021), GBT Bielak et al. (2021), and GGD, while the other results are sourced from previous studies Wan et al. (2020); Jin et al. (2021).
Efficiency and Memory Consumption. GGD is substantially more efficient than the other self-supervised baselines in time and memory consumption, as shown in Table 8 and Table 9. Remarkably, compared with the most efficient baseline (i.e., GBT Bielak et al. (2021)), GGD is 19.2 times faster in training time per epoch on Amazon Photos and consumes 64.5% less memory on Amazon Computers. The dramatic boost in the time and memory efficiency of GGD is attributable to the exclusion of similarity computation in self-supervised signal extraction, which enables model training without multiplication of node embeddings.
5.2 Evaluating on Large-scale Datasets
To evaluate the scalability of GGD, we choose three large-scale datasets from Open Graph Benchmark Hu et al. (2020): ogbn-arxiv, ogbn-products, and ogbn-papers100M. ogbn-papers100M is the most challenging large-scale graph available in Open Graph Benchmark for node property prediction, with over 1 billion edges and 110 million nodes. Extending to extremely large graphs (i.e., ogbn-products and ogbn-papers100M), we adopt a neighbourhood sampling strategy, which is described in Appendix A.5.
ogbn-arxiv & ogbn-products. For ogbn-arxiv, we compare GGD against four self-supervised baselines (i.e., DGI Veličković et al. (2019), GRACE Zhu et al. (2020), BGRL Thakoor et al. (2021), and GBT Bielak et al. (2021)), whereas BGRL Thakoor et al. (2021) and GBT Bielak et al. (2021) are selected for comparison on ogbn-products.
Method  Valid  Test  Memory  Time  Total 
Supervised GCN  73.0  71.7       
MLP  57.7  55.5       
Node2vec  71.3  70.1       
DGI  71.3  70.3       
GRACE (10k epochs)  72.6  71.5
BGRL (10k epochs)  72.5  71.6  OOM (full-graph)  /  /
GBT (300 epochs)  71.0  70.1  14,959MB  6.47  1,941.00
GGD (1 epoch)  72.7  71.6  4,513MB (-69.8%)  0.18  0.18 (10,783×)
In addition, we include the performance of MLP, Node2vec Grover and Leskovec (2016), and supervised GCN Kipf and Welling (2017) sourced from Hu et al. (2020) in Table 10 and Table 11. For the memory and training time comparison, we only compare GGD with the two most efficient baselines (i.e., BGRL and GBT, according to Tables 8 and 9). On ogbn-arxiv, we reproduce BGRL Thakoor et al. (2021) and find that it fails to process ogbn-arxiv in full batch. Thus, on this dataset we only compare GGD with GBT, which can be successfully trained in full-graph processing mode.
Method  Valid  Test  Memory  Time  Total 
Supervised GCN  92.0  75.6      
MLP  75.5  61.1      
Node2vec  70.0  68.8      
BGRL (100 epochs)  78.1  64.0  29,303MB  53m16s  5,326m40s
GBT (100 epochs)  85.0  70.5  20,419MB  48m38s  4,863m20s
GGD (1 epoch)  90.9  75.7  4,391MB (-78.5%)  12m46s  12m46s (381×)
From Table 10 and Table 11, we can see that GGD remarkably achieves state-of-the-art performance using only one training epoch. As a result, GGD is 10,783 times faster than the most efficient baseline, i.e., GBT Bielak et al. (2021), in total training time to reach the desired performance on ogbn-arxiv. Note that the number of epochs in our experiment is consistent with the optimal choice of this hyperparameter specified in GBT Bielak et al. (2021). For ogbn-products, we are 381× faster than GBT Bielak et al. (2021) in total training time. Notably, with only one epoch of training on this dataset, our performance is significantly higher than the GCL baselines trained for 100 epochs (i.e., 6% and 5.2% improvement over GBT Bielak et al. (2021) on the validation and test sets, respectively). In addition, we compare the convergence speed among GGD, BGRL Thakoor et al. (2021), and GBT Bielak et al. (2021) on ogbn-arxiv and ogbn-products, as shown in Figure 6. For ogbn-arxiv, BGRL Thakoor et al. (2021) is run using batched processing with neighbour sampling. The figure shows the preeminence of GGD in convergence speed, as GGD can be well-trained with only one epoch (i.e., reaching peak model performance in the first epoch and staying stable with increased epochs). In contrast, the other two baselines require comparatively many more epochs to gradually improve their performance. Compared with GCL baselines, GGD achieves much faster convergence via Group Discrimination. We conjecture this is because the GD-based method focuses on the general edge distribution of graphs instead of node-specific information. Conversely, GCL methods can suffer from convergence inefficiency, as they may be easily distracted by overly detailed node-specific information during training.

Method  Validation  Test  Memory  Time
Supervised SGC  63.3  66.5  –  –
MLP  47.2  49.6  –  –
Node2vec  55.6  58.1  –  –
BGRL (1 epoch)  59.3  62.1  14,057MB  26h28m
GBT (1 epoch)  58.9  61.5  13,185MB  24h38m
GGD (1 epoch)  60.2  63.5  4,105MB (−68.9%)  9h15m (2.7×)
ogbn-papers100M. We further compare GGD with BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021) on ogbn-papers100M, the largest OGB dataset, with billion-scale edges. Other self-supervised learning algorithms, such as DGI Veličković et al. (2019) and GMI Peng et al. (2020), fail to scale to such a large graph with a reasonable batch size (i.e., 256). We only report the performance of each algorithm after a single epoch of training (in the table above) due to the extreme scale of the dataset and the limitation of our available resources. From the table, we can observe that GGD outperforms its two GCL counterparts, BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021), in both accuracy and efficiency. Specifically, GGD achieves 60.2 in validation accuracy, while BGRL and GBT reach 59.3 and 58.9, respectively. With only one epoch, these two algorithms may not be well trained. However, each training epoch of these two methods requires over a day, and training them for 100 epochs would need 100+ GPU days, which is prohibitively impractical for general practitioners. In contrast, GGD can be trained in about 9 hours to achieve a good result on this dataset, which is far more appealing in practice.
6 Related Work
Graph Neural Networks (GNNs) are generalised deep neural networks for graph-structured data. GNNs fall mainly into two categories: spectral-based GNNs and spatial-based GNNs. Spectral GNNs use eigendecomposition to obtain a spectral representation of graphs, whereas spatial GNNs rely on the spatial neighbours of nodes for message passing. Extending spectral-based methods to the spatial domain, GCN Kipf and Welling (2017) utilises first-order Chebyshev polynomial filters to approximate spectral-based graph convolution. Taking the weights of spatial neighbours into consideration, GAT Veličković et al. (2018) improves GCN by introducing an attention module in message passing. To decouple message passing from neural networks, SGC Wu et al. (2019) simplifies GCN by removing the non-linearity and weight matrices in graph convolution layers. However, these studies cannot handle datasets with limited or no labels. Graph contrastive learning has recently been exploited to address this issue.
Graph Contrastive Learning (GCL) aims to alleviate the reliance on labelling information in model training, based on the concept of mutual information (MI). Specifically, GCL approaches maximise the MI between instances with similar semantic information, and minimise the MI between dissimilar instances. For example, DGI Veličković et al. (2019) builds contrastiveness between node embeddings and a summary vector (i.e., a graph-level embedding obtained by averaging all node embeddings) with a JSD estimator. To improve DGI, MVGRL Hassani and Khasahmadi (2020) and GMI Peng et al. (2020) extend its idea by introducing multi-view contrastiveness with diffusion augmentation, and by focusing on a local scope with the first-order neighbourhood, respectively. Adopting the InfoNCE loss, GRACE Zhu et al. (2020) applies augmentation techniques to create two augmented views and injects contrastiveness between them. Though these GCL methods have successfully outperformed some supervised baselines on benchmark datasets, they suffer from significant limitations, including time-consuming loss computation, a large number of training epochs and poor scalability. For example, the InfoNCE loss and the JSD estimator require O(ND) and O(D), respectively, in the loss computation for a node, where N is the number of nodes and D is the embedding dimension. In contrast, our method GGD presented in this paper reduces the time complexity of loss computation for a node to only O(1) and converges fast.
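To make the per-node cost concrete, the sketch below computes the InfoNCE loss for a single anchor node. It is an illustration only: random embeddings stand in for the two augmented views, and the function name and setup are ours, not any particular GCL implementation.

```python
import numpy as np

def infonce_node_loss(z_anchor, Z_view, idx, tau=0.5):
    """InfoNCE loss for one anchor node.

    The anchor is compared against all N nodes of the other view, so a
    single node already costs O(N*D) -- the term Group Discrimination avoids.
    """
    sims = Z_view @ z_anchor / tau   # (N,) similarities: the O(N*D) step
    sims -= sims.max()               # shift for numerical stability
    exp = np.exp(sims)
    # Positive pair is the same node index in the other view.
    return -np.log(exp[idx] / exp.sum())

rng = np.random.default_rng(0)
Z1 = rng.normal(size=(50, 16))            # view 1 embeddings
Z2 = Z1 + 0.01 * rng.normal(size=(50, 16))  # view 2: slightly perturbed
loss = infonce_node_loss(Z1[0], Z2, idx=0)
```

The inner product against the whole view matrix is what a GD-style loss removes, since it only needs one scalar per node.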
Scalable GNNs. Efficiency is a bottleneck for most existing GNNs in handling large graphs. To address this challenge, there are mainly three categories of approaches: layer-wise sampling (e.g., GraphSage Hamilton et al. (2017)), graph sampling methods such as Cluster-GCN Chiang et al. (2019) and GraphSAINT Klicpera et al. (2019), and linear models, e.g., SGC Wu et al. (2019) and PPRGo Bojchevski et al. (2020). GraphSage Hamilton et al. (2017) introduces a neighbour-sampling approach, which creates fixed-size subgraphs for each node. Underpinned by graph sampling, Cluster-GCN Chiang et al. (2019) decomposes a large-scale graph into multiple subgraphs based on clustering, while GraphSAINT Klicpera et al. (2019) utilises lightweight graph samplers along with a normalisation technique for bias elimination in mini-batches. The linear models SGC Wu et al. (2019) and PPRGo Bojchevski et al. (2020) decouple graph convolution from embedding transformation (i.e., matrix multiplication with weight matrices), and leverage Personalised PageRank to encode the multi-hop neighbourhood, respectively. However, all these methods focus only on supervised learning on graphs. They are not applicable in unsupervised/self-supervised settings, where no labelled supervision signal is available. The works closest to ours in handling large-scale graph datasets under self-supervised settings are BGRL Thakoor et al. (2021) and GBT Zbontar et al. (2021), which reduce the time complexity of contrastive losses by removing negative samples. However, they still require O(D) to compute the loss for a node, in comparison with O(1) for our method GGD.
7 Future Work
In this paper, we have introduced a new self-supervised GRL paradigm, Group Discrimination (GD), which achieves the same level of performance as GCL methods with much less resource consumption (i.e., training time and memory). A limitation of this work is that some questions about GD remain unexplored. For example, can we extend the current binary Group Discrimination scheme (i.e., classifying two groups of summarised node embeddings) to discrimination among multiple groups? Are there other corruption techniques that create a more difficult negative group for discrimination? More importantly, given its extreme efficiency, GD has the potential to be deployed in various real-world applications, e.g., recommendation systems, which have limited labelling information and require fast computation with limited resources.
References
 [1] (2021) Graph Barlow Twins: a self-supervised representation learning framework for graphs. arXiv preprint arXiv:2106.02466. Cited by: §1, §1.
 [2] (2020) Scaling graph neural networks with approximate PageRank. In KDD, pp. 2464–2473. Cited by: §6.
 [3] (2019) Cluster-GCN: an efficient algorithm for training deep and large graph convolutional networks. In KDD, pp. 257–266. Cited by: §6.
 [4] (2022) CGMN: a contrastive graph matching network for self-supervised graph similarity learning. In IJCAI. Cited by: §1.
 [5] (2019) Graph neural networks for social recommendation. In WWW, pp. 417–426. Cited by: §1.
 [6] (2020) Bootstrap your own latent: a new approach to self-supervised learning. NIPS. Cited by: §1.
 [7] (2016) Node2vec: scalable feature learning for networks. In KDD, pp. 855–864. Cited by: §5.2.
 [8] (2017) Inductive representation learning on large graphs. In NIPS, pp. 1025–1035. Cited by: §A.5, §6.
 [9] (2020) Contrastive multi-view representation learning on graphs. In ICML, pp. 4116–4126. Cited by: §1, §1, §2, §3.1, §3.2, §4.1, §5.1, §6.

 [10] (2020) Open graph benchmark: datasets for machine learning on graphs. NIPS. Cited by: §A.4, §5.2, §5.2, §5.
 [11] (2020) Subgraph contrast for scalable self-supervised graph representation learning. In ICDM, pp. 222–231. Cited by: §1.
 [12] (2021) Multi-scale contrastive Siamese networks for self-supervised graph representation learning. IJCAI. Cited by: §1, §5.1.
 [13] (2017) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: §A.2, §3.1, §5.1, §5.2, §6.
 [14] (2019) Predict then propagate: graph neural networks meet personalized PageRank. ICLR. Cited by: §6.
 [15] (2021) Anomaly detection on attributed networks via contrastive self-supervised learning. TNNLS. Cited by: §1.
 [16] (2020) Graph representation learning via graphical mutual information maximization. In WWW, pp. 259–270. Cited by: §1, §5.1, §5.2, §6.
 [17] (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93. Cited by: §5.
 [18] (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §5.
 [19] (2021) Bootstrapped representation learning on graphs. ICLR 2021. Cited by: §1, §4.1.1, §4.1.1, Figure 6, §5.1, §5.1, §5.2, §5.2, §5.2, §5.2, §6.
 [20] (2018) Graph attention networks. ICLR. Cited by: §3.1, §5.1, §6.
 [21] (2019) Deep graph infomax. ICLR. Cited by: §A.1, §1, §2.1, §2.1, §2, §3.1, §4.1, §5.1, §5.2, §5.2, §6.

 [22] (2020) Contrastive and generative graph convolutional networks for graph-based semi-supervised learning. AAAI. Cited by: §5.1, §5.1.
 [23] (2019) Simplifying graph convolutional networks. In ICML, pp. 6861–6871. Cited by: §5.1, §6, §6.
 [24] (2016) Revisiting semi-supervised learning with graph embeddings. In ICML, pp. 40–48. Cited by: §5.1.
 [25] (2020) Graph contrastive learning with augmentations. NIPS 33, pp. 5812–5823. Cited by: §3.1.
 [26] (2021) Barlow Twins: self-supervised learning via redundancy reduction. In ICML, pp. 12310–12320. Cited by: Table 1, §1, §4.1.1, §4.1.1, Figure 6, §5.1, §5.1, §5.1, §5.2, §5.2, §5.2, §6.
 [27] (2022) Trustworthy graph neural networks: aspects, methods and trends. arXiv preprint arXiv:2205.07424. Cited by: §1.
 [28] (2018) Link prediction based on graph neural networks. NIPS 31. Cited by: §1.
 [29] (2022) Graph neural networks for graphs with heterophily: a survey. arXiv preprint arXiv:2202.07082. Cited by: §1.
 [30] (2021) Towards graph self-supervised learning with contrastive adjusted zooming. arXiv preprint arXiv:2111.10698. Cited by: §2.2, §3.2.
 [31] (2021) Heterogeneous graph attention network for small and medium-sized enterprises bankruptcy prediction. In PAKDD, pp. 140–151. Cited by: §1.
 [32] (2020) Deep graph contrastive representation learning. ICML Workshop on Graph Representation Learning and Beyond. Cited by: §1, §5.1, §5.2, §6.
Appendix A
A.1 Proof of Theorem 1
Proof. To prove Theorem 1, given a graph $\mathcal{G}$ with feature matrix $\mathbf{X}$ and adjacency matrix $\mathbf{A}$, and a GNN encoder $f$, for simplicity, we consider $f$ as a one-layer GCN and apply the following normalisation to $\mathbf{X}$ so that its values lie within the range $[0, 1]$:

$\hat{\mathbf{X}}_{ij} = \dfrac{\mathbf{X}_{ij} - \min(\mathbf{X}_{:j})}{\max(\mathbf{X}_{:j}) - \min(\mathbf{X}_{:j})}$  (10)

where $\hat{\mathbf{X}}$ is the normalised $\mathbf{X}$, $i$ denotes the row index of the matrix, $j$ indicates the column index of the matrix, $\min(\cdot)$ is the function to get the minimum value, and $\max(\cdot)$ is the function for getting the maximum value.
Then, we input $\hat{\mathbf{X}}$ to $f$, whose weight matrix $\mathbf{W}$ is initialised with Xavier initialisation:

$\mathbf{H} = \sigma\big(\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{X}} \mathbf{W}\big)$  (11)

where $\mathbf{H}$ is the output embedding, $\sigma(\cdot)$ is a non-linear activation function, $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, $\hat{\mathbf{D}}$ is the degree matrix of $\hat{\mathbf{A}}$, and $\mathbf{W}$ is the learnable weight matrix. As multiplying by the normalised adjacency matrix does not change the output data range, the propagated features are still within the range $[0, 1]$. Then, as the elements of the Xavier-initialised $\mathbf{W}$ lie between $-1$ and $1$, we can derive that the output of the matrix multiplication of the propagated features and $\mathbf{W}$ stays in the same range as $\mathbf{W}$, i.e., $(-1, 1)$.
After that, we need to apply $\sigma(\cdot)$ to the multiplication output. Here, we analyse four commonly adopted non-linear activation functions: Sigmoid, ReLU, Leaky ReLU and PReLU.
We first consider the Sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$. As it is a monotonically increasing function, we can derive that the data range of $\mathbf{H}$ is $(\sigma(-1), \sigma(1))$. Then, we apply the Sigmoid function again, as DGI Veličković et al. (2019) did, to $\mathbf{H}$ and obtain:

$\mathbf{H}' = \sigma(\mathbf{H})$  (12)

Similarly, as the Sigmoid function is a monotonically increasing function, we can easily derive that the data range of $\mathbf{H}'$ is in $\big(\sigma(\sigma(-1)), \sigma(\sigma(1))\big)$. Then, for the lower bound $\sigma(\sigma(-1))$, we can obtain:

$\sigma(\sigma(-1)) = \sigma\big(\tfrac{1}{1+e}\big) \approx \sigma(0.269) \approx 0.567$  (13)

Also, for the upper bound $\sigma(\sigma(1))$, we can obtain:

$\sigma(\sigma(1)) = \sigma\big(\tfrac{e}{1+e}\big) \approx \sigma(0.731) \approx 0.675$  (14)

Finally, we can easily observe that the data range $(0.567, 0.675)$ is a narrow interval, and thus prove Theorem 1 when $\sigma$ is the Sigmoid function.
For the other three non-linear activation functions, ReLU $\sigma(x) = \max(0, x)$, Leaky ReLU $\sigma(x) = \max(0.01x, x)$ and PReLU $\sigma(x) = \max(\gamma x, x)$, where $\gamma$ is a learnable parameter, we can also derive the data ranges of $\mathbf{H}$ for these functions: $[0, 1)$, $(-0.01, 1)$ and $(-\gamma, 1)$, respectively. Here, we can see these functions share the same upper bound $1$. The only difference in their lower bounds is the coefficient of $-1$, i.e., $0$ for ReLU, $0.01$ for Leaky ReLU and $\gamma$ for PReLU.
By inputting $\mathbf{H}$ with these activation functions to the Sigmoid function, we can derive that the output data range is $\big(\sigma(l), \sigma(1)\big)$, where $l$ is the corresponding lower bound of $\mathbf{H}$. For the lower bound $\sigma(l)$, we can obtain:

$\sigma(0) = 0.5, \quad \sigma(-0.01) \approx 0.498, \quad \sigma(-\gamma) = \tfrac{1}{1+e^{\gamma}}$  (15)

For the upper bound $\sigma(1)$, we can obtain:

$\sigma(1) = \tfrac{e}{1+e} \approx 0.731$  (16)

Finally, we can easily observe that the resulting data range is again a narrow interval around $(0.5, 0.731)$, and thus prove Theorem 1 when $\sigma$ is ReLU, Leaky ReLU or PReLU.
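As a quick numeric sanity check of the steps in this proof, the sketch below verifies that min-max normalisation maps features into [0, 1] and evaluates the doubled-Sigmoid bounds, assuming the pre-activation range (-1, 1); the helper names are ours.

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Step 1: column-wise min-max normalisation maps features into [0, 1].
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
X_hat = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
assert X_hat.min() == 0.0 and X_hat.max() == 1.0

# Step 2: with pre-activations in (-1, 1), a Sigmoid activation followed by
# the second Sigmoid (as applied in DGI) confines outputs to a narrow band.
lower = sigmoid(sigmoid(-1.0))  # ~0.567
upper = sigmoid(sigmoid(1.0))   # ~0.675
assert upper - lower < 0.11     # the interval is indeed narrow

# Step 3: ReLU / Leaky ReLU / PReLU share the upper bound sigmoid(1) ~ 0.731;
# e.g., the ReLU case gives the interval (sigmoid(0), sigmoid(1)).
relu_lower, relu_upper = sigmoid(0.0), sigmoid(1.0)
assert relu_lower == 0.5 and relu_upper - relu_lower < 0.24
```

The narrowness of these intervals is what makes summarised embeddings nearly constant, which is the property the theorem exploits.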
A.2 Complexity Analysis
The time complexity of our method consists of two components: the siamese GNN and the loss computation. Existing self-supervised baselines share a similar time complexity for the first component. In GGD, given a graph with N nodes and E edges in sparse format, and taking a GCN Kipf and Welling (2017) encoder as an example, the time complexity of the encoder is O(ED + ND^2), where D is the embedding dimension. As we need to process both the augmented graph and the corrupted graph, GGD requires the encoder computation twice. Then, the projector network (i.e., an MLP with linear layers) is applied to the encoder output, which takes O(N D_h^2) per layer. Here D_h is a parameter defining the hidden size. Before group discrimination, we aggregate the generated embeddings with a simple summation consuming O(N D_h).
For the loss computation, we use the BCE loss, i.e., Equation 3, to categorise the summarised node embeddings, i.e., 2N scalars. The time complexity of this final step is O(N) (i.e., processing all data samples from the positive and negative groups). Ignoring the computation cost of the augmentation, the overall time complexity of GGD for computing a graph is O(ED + ND^2 + N D_h^2), where we can see the time complexity is mainly contributed by the siamese GNN. More importantly, we compress the self-supervised learning loss computation to O(N) for a whole graph.
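The loss computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: random arrays stand in for the projector outputs of the original (positive) and corrupted (negative) graphs, and the names `H_pos`, `H_neg` and `ggd_loss` are ours.

```python
import numpy as np

def ggd_loss(H_pos, H_neg):
    """BCE over summarised node embeddings (Group Discrimination).

    Each node embedding is summed into a single scalar, so the loss only
    touches 2N scalars: O(N) for the whole graph, O(1) per node.
    """
    s_pos = H_pos.sum(axis=1)          # (N,) positive-group scores
    s_neg = H_neg.sum(axis=1)          # (N,) negative-group scores
    scores = np.concatenate([s_pos, s_neg])
    labels = np.concatenate([np.ones_like(s_pos), np.zeros_like(s_neg)])
    p = 1.0 / (1.0 + np.exp(-scores))  # sigmoid probabilities
    eps = 1e-12                        # guard against log(0)
    return -np.mean(labels * np.log(p + eps)
                    + (1 - labels) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
H_pos = rng.normal(0.5, 0.1, size=(100, 64))    # stand-in projector outputs
H_neg = rng.normal(-0.5, 0.1, size=(100, 64))
loss = ggd_loss(H_pos, H_neg)
```

Note that no pairwise similarity between nodes is ever computed, which is where the O(ND)-per-node cost of InfoNCE-style losses disappears.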
A.3 Graph Power
To show the easiness of graph power computation, we conduct an experiment to evaluate its time consumption on eight datasets, whose statistics are shown in Appendix A.4. Specifically, we set the hidden size to 256, and n is fixed to 10 for all datasets. The experiment results (in seconds) are shown below:
Cora  CiteSeer  PubMed  Comp  Photo  Arxiv  Products  Papers
5.4e-3  7.3e-3  9.8e-3  1.2e-2  8.5e-3  2.2e-1  24.5  208.8
This table shows that the computation of the graph power is trivial on small- and medium-size graphs: e.g., on ogbn-arxiv, which has over a million edges, it consumes only 0.22 seconds. Extending to an extremely large graph, ogbn-papers100M, which has over 1 billion edges and 111 million nodes, the computation requires only about 209 seconds (i.e., around three and a half minutes), which is acceptable considering the sheer size of the dataset.
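A minimal dense sketch of this computation is given below, assuming the graph power amounts to n successive multiplications of the embedding matrix by the symmetrically normalised adjacency (in practice a sparse matrix would be used, and the names are ours).

```python
import numpy as np

def graph_power(H, A, n=10):
    """Compute A_norm^n @ H by n successive multiplications.

    A_norm is the symmetrically normalised adjacency with self-loops.
    Multiplying step by step avoids ever materialising A_norm^n itself;
    with a sparse A, each step is one cheap sparse-dense product.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    out = H
    for _ in range(n):
        out = A_norm @ out
    return out

# Toy 3-node path graph; H starts as the identity for illustration.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3)
H_power = graph_power(H, A, n=10)
```

Because the normalised adjacency has spectral radius at most 1, the iteration stays numerically stable even for large n.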
A.4 Dataset Statistics
The following table presents the statistics of the eight benchmark datasets, including five small-to-medium-scale datasets and three large-scale datasets from the Open Graph Benchmark Hu et al. (2020).
Dataset  Nodes  Edges  Features  Classes 
Cora  2,708  5,429  1,433  7 
CiteSeer  3,327  4,732  3,703  6 
PubMed  19,717  44,338  500  3 
Amazon Computers  13,752  245,861  767  10 
Amazon Photo  7,650  119,081  745  8 
ogbn-arxiv  169,343  1,166,243  128  40
ogbn-products  2,449,029  61,859,140  100  47
ogbn-papers100M  111,059,956  1,615,685,872  100  172
A.5 Experiment Settings & Computing Infrastructure
Extending to Extremely Large Datasets. To extend to extremely large graphs (i.e., ogbn-products and ogbn-papers100M), we adopt a simple Neighbourhood Sampling strategy introduced in GraphSage Hamilton et al. (2017) to decouple model training from the sheer size of the graphs. Specifically, we create a fixed-size subgraph for each node by sampling a predefined number of neighbours in each convolution layer for the sampled nodes. The same approach is employed in the testing phase to obtain the final embeddings.
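A minimal sketch of this sampling scheme is shown below, assuming an adjacency-list representation; the `adj` dict and helper name are illustrative, not the paper's implementation.

```python
import random

def sample_subgraph(adj, seed, num_hops=3, sample_size=12, rng=None):
    """Sample a fixed-size multi-hop neighbourhood around `seed`.

    adj: dict mapping node -> list of neighbours.
    At each hop, at most `sample_size` neighbours are kept per frontier
    node, so the cost of one training step no longer depends on the full
    graph size.
    """
    rng = rng or random.Random(0)
    visited = {seed}
    frontier = [seed]
    for _ in range(num_hops):
        next_frontier = []
        for node in frontier:
            neighbours = adj.get(node, [])
            if len(neighbours) > sample_size:
                neighbours = rng.sample(neighbours, sample_size)
            for nb in neighbours:
                if nb not in visited:
                    visited.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return visited

# Toy adjacency list: 0-1, 0-2, 1-3, 3-4.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
nodes = sample_subgraph(adj, seed=0, num_hops=2, sample_size=2)
# Two hops from node 0 reach {0, 1, 2, 3} here.
```

The node set returned induces the subgraph on which one batched forward pass is run, mirroring the fixed batch size / sample size / hop count parameters listed below.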
General Parameter Settings. In our experiment, we mainly tune four parameters for GGD, which are the learning rate, the hidden size, the number of convolution layers in the GNN encoder, and the number of linear layers in the projector. For simplicity, the power n used for final embedding generation is fixed to 10 for all datasets. The parameter setting for each dataset is shown below:
Dataset  lr  hidden  numconv  numproj
Cora  1e-3  256  1  1
CiteSeer  1e-5  512  1  1
PubMed  1e-3  512  1  1
Amazon Computers  1e-3  256  1  1
Amazon Photo  1e-3  1024  1  1
ogbn-arxiv  5e-5  1500  3  1
ogbn-products  1e-4  1024  4  4
ogbn-papers100M  1e-3  256  3  1
Large-scale Dataset Parameter Settings. To decouple model training from the scale of the graphs, we adopt the neighbourhood sampling technique, which has three parameters: batch size, sample size, and the number of hops to be sampled. Batch size refers to the number of nodes to be processed in one parameter optimisation step, sample size is the number of nodes to be sampled in each convolution layer, and the number of hops determines the scope of the neighbourhood for sampling. In the GGD implementation, the batch size, sample size, and number of hops are fixed to 2048, 12 and 3, respectively.
Memory and Training Time Comparison. Memory and training time are very sensitive to hyperparameters related to the structure of GNNs, including the hidden size, the number of convolution layers, and the batch-processing settings for large-scale datasets (e.g., batch size and the number of neighbours sampled in each layer). Thus, for a fair memory and training time comparison, we set all these parameters to be the same for all baselines and GGD. The specific parameter setting for each dataset is shown below:
Dataset  hidden  numconv  batch  numneigh
Cora  512  1  –  –
CiteSeer  512  1  –  –
PubMed  256  1  –  –
Amazon Computers  256  1  –  –
Amazon Photo  256  1  –  –
ogbn-arxiv  256  3  –  –
ogbn-products  256  3  512  10
ogbn-papers100M  128  3  512  10
Computing Infrastructure. The experiments in Sections 3, 5 and 6.1 are conducted using an Nvidia GRID T4 (16GB memory) and an Intel Xeon Platinum 8260 with 8 cores. For experiments on the large-scale datasets (i.e., ogbn-arxiv, ogbn-products and ogbn-papers100M), we use an NVIDIA A40 (48GB memory) and an Intel Xeon Gold 5320 with 13 cores.