AutoGCL: Automated Graph Contrastive Learning via Learnable View Generators

09/21/2021 ∙ by Yihang Yin, et al. ∙ Penn State University ∙ Nanyang Technological University ∙ Baidu, Inc.

Contrastive learning has been widely applied to graph representation learning, where the view generators play a vital role in generating effective contrastive samples. Most existing contrastive learning methods employ pre-defined view generation methods, e.g., node drop or edge perturbation, which usually cannot adapt to input data or preserve the original semantic structures well. To address this issue, we propose a novel framework named Automated Graph Contrastive Learning (AutoGCL) in this paper. Specifically, AutoGCL employs a set of learnable graph view generators orchestrated by an auto augmentation strategy, where every graph view generator learns a probability distribution of graphs conditioned on the input. While the graph view generators in AutoGCL preserve the most representative structures of the original graph when generating every contrastive sample, the auto augmentation learns policies to introduce adequate augmentation variance into the whole contrastive learning procedure. Furthermore, AutoGCL adopts a joint training strategy to train the learnable view generators, the graph encoder, and the classifier in an end-to-end manner, resulting in topological heterogeneity yet semantic similarity in the generated contrastive samples. Extensive experiments on semi-supervised learning, unsupervised learning, and transfer learning demonstrate the superiority of our AutoGCL framework over state-of-the-art methods in graph contrastive learning. In addition, the visualization results further confirm that the learnable view generators can deliver more compact and semantically meaningful contrastive samples compared with the existing view generation methods.







1 Introduction

Graph neural networks (GNNs) [18, 40, 42, 12] are gaining increasing attention in the realm of graph representation learning. By generally following a recursive neighborhood aggregation scheme, GNNs have shown impressive representational power in various domains, such as point clouds [34], social networks [7], chemical analysis [6], and so on. Most existing GNN models are trained in an end-to-end supervised fashion, which relies on a high volume of fine-annotated data. However, labeling graph data requires a huge amount of effort from professional annotators with domain knowledge. To alleviate this issue, GAE [19] and GraphSAGE [12] have been proposed to exploit a naive unsupervised pretraining strategy that reconstructs the vertex adjacency information. Some recent works [16, 46] introduce self-supervised pretraining strategies to GNNs, which further improve the generalization performance.

More recently, with the developments of contrastive multi-view learning in computer vision [15, 3, 37] and natural language processing [44, 24], some self-supervised pretraining approaches perform as well as (or even better than) supervised methods. In general, contrastive methods generate training views using data augmentations, where views of the same input (positive pairs) are pulled together in the representation space while views of different inputs (negative pairs) are pushed apart. To work on graphs, DGI [41] has been proposed to treat the graph-level and node-level representations of the same graph as positive pairs, pursuing consistent representations from local and global features. CMRLG [13] achieves a similar goal by grouping the adjacency matrix (local features) and its diffusion matrix (global features) as positive pairs. GCA [49] generates positive view pairs through sub-graph sampling with structural priors and randomly masked node attributes. GraphCL [45] offers even more strategies for augmentation, such as node dropping and edge perturbation. While the above attempts incorporate contrastive learning into graphs, they usually fail to generate views that respect the semantics of the original graphs or to adapt the augmentation policies to specific graph learning tasks.

Blessed by the invariance of image semantics under various transformations, image data augmentation has been widely used [5] to generate contrastive views. However, graph data augmentation might be ineffective here, as transformations on a graph can severely disrupt its semantics and properties for learning. Meanwhile, InfoMin [38] improves contrastive learning for vision tasks by replacing image data augmentation with a flow-based generative model for contrastive view generation. Thus, learning a probability distribution of contrastive views conditioned on an input graph might be an alternative to simple data augmentation for graph contrastive learning, but it still requires non-trivial effort, as the performance and scalability of common graph generative models are poor in real-world scenarios.

| Property | CMRLG | GRACE | GraphCL | GCA | Ours |
|---|---|---|---|---|---|
| Topological | ✓ | ✓ | ✓ | ✓ | ✓ |
| Node Feature | - | ✓ | ✓ | ✓ | ✓ |
| Label-preserving | - | - | - | - | ✓ |
| Adaptive | - | - | - | ✓ | ✓ |
| Variance | - | ✓ | ✓ | ✓ | ✓ |
| Differentiable | - | - | - | - | ✓ |
| Efficient BP | - | - | - | - | ✓ |
Table 1: An overview of graph augmentation methods. The explanation of these properties can be found in Section 3.1.

In this work, we propose a learnable graph view generation method, namely AutoGCL, to address the above issues by learning a probability distribution over node-level augmentations. While conventional pre-defined view generation methods, such as random dropout or graph node masking, may inevitably change the semantic labels of graphs and ultimately hurt contrastive learning, AutoGCL adapts to the input graph so that it can well preserve the semantic labels of the graph. In addition, thanks to the gumbel-softmax trick [17], AutoGCL is end-to-end differentiable while providing sufficient variance for contrastive sample generation. We further propose a joint training strategy to train the learnable view generators, the graph encoder, and the classifier in an end-to-end manner. The strategy includes the view similarity loss, the contrastive loss, and the classification loss. It drives the view generators to generate augmented graphs that carry similar semantic information but different topological properties. In Table 1, we summarize the properties of existing graph augmentation methods, where AutoGCL dominates the comparisons.

We conduct extensive graph classification experiments under semi-supervised learning, unsupervised learning, and transfer learning settings to evaluate the effectiveness of AutoGCL. The results show that AutoGCL improves the state-of-the-art graph contrastive learning performance on most of the datasets. In addition, we visualize the generated graphs on the MNIST-Superpixel dataset [27] and reveal that AutoGCL preserves the semantic structures of the input data better than the existing pre-defined view generators.

Our contributions can be summarized as follows.

  • We propose a graph contrastive learning framework with learnable graph view generators embedded in an auto augmentation strategy. To the best of our knowledge, this is the first work that builds learnable generative augmentation policies for graph contrastive learning.

  • We propose a joint training strategy for training the graph view generators, the graph encoder, and the graph classifier under the context of graph contrastive learning in an end-to-end manner.

  • We extensively evaluate the proposed method on a variety of graph classification datasets with semi-supervised, unsupervised, and transfer learning settings. The t-SNE and view visualization results also demonstrate the effectiveness of our method.

2 Related Work

2.1 Graph Neural Networks

Denote a graph as $G=(V,E)$, where the node features are $x_v \in \mathbb{R}^d$ for $v \in V$. In this paper, we focus on the graph classification task using Graph Neural Networks (GNNs). GNNs generate node-level embeddings by aggregating the node features of each node's neighbors. Each layer of a GNN serves as one iteration of aggregation, such that the node embedding after the $k$-th layer aggregates the information within its $k$-hop neighborhood. The $k$-th layer of a GNN can be formulated as

$$a_v^{(k)} = \text{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big), \qquad h_v^{(k)} = \text{COMBINE}^{(k)}\big(h_v^{(k-1)}, a_v^{(k)}\big),$$

where $h_v^{(k)}$ is the embedding of node $v$ after the $k$-th layer and $h_v^{(0)} = x_v$.

For downstream tasks such as graph classification, the graph-level representation $h_G$ after $K$ layers is obtained via a READOUT function and MLP layers as

$$h_G = \text{MLP}\big(\text{READOUT}\big(\{h_v^{(K)} : v \in V\}\big)\big).$$
In this work we follow the existing graph contrastive learning literature to employ two state-of-the-art GNNs, i.e., GIN [42] and ResGCN [2], as our backbone GNNs.
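As a concrete reference, the aggregation and readout described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact backbone configuration: the sum aggregator, the $(1+\epsilon)$ self-term, the two-layer ReLU MLP, and the weight shapes are all illustrative stand-ins.

```python
import numpy as np

def gin_layer(H, A, W1, W2, eps=0.0):
    """One GIN-style layer. H: (n, d) node embeddings, A: (n, n) adjacency."""
    agg = (1.0 + eps) * H + A @ H          # combine self-feature with neighbor sum
    return np.maximum(agg @ W1, 0.0) @ W2  # 2-layer MLP with ReLU

def readout(H):
    return H.sum(axis=0)                   # sum READOUT -> graph-level vector

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)     # a toy 3-node graph
X = rng.normal(size=(3, 4))                # node features
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
h = readout(gin_layer(X, A, W1, W2))
print(h.shape)  # (4,)
```

Stacking $K$ such layers before the readout gives each node a $K$-hop receptive field, matching the formulation above.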

2.2 Pre-training Graph Neural Networks

Pre-training GNNs on graph datasets remains a challenging task, since the semantics of graphs are not straightforward, and the annotation of graphs (proteins, chemicals, etc.) usually requires professional domain knowledge. It is very costly to collect large-scale and fine-annotated graph datasets like ImageNet [20]. An alternative is to pre-train GNNs in an unsupervised manner. GAE [19] first explored unsupervised GNN pre-training by reconstructing the graph topological structure. GraphSAGE [12] proposed an inductive way of unsupervised node embedding by learning the neighborhood aggregation function. Pretrain-GNN [16] conducted the first systematic large-scale investigation of strategies for pre-training GNNs under the transfer learning setting, proposing self-supervised pre-training strategies to learn both local and global features of graphs. However, the benefits of graph transfer learning may be limited and can even lead to negative transfer [30], as graphs from different domains differ considerably in their structures, scales, and node/edge attributes. Therefore, many follow-up works started to explore an alternative approach, i.e., contrastive learning, for GNN pre-training.

2.3 Contrastive Learning

In recent years, contrastive learning (CL) has received considerable attention among self-supervised learning approaches, and a series of CL methods including SimCLR [3] and MoCo-v2 [4] even outperform supervised baselines. By minimizing the contrastive loss [11], the views generated from the same input (i.e., positive view pairs) are pulled close in the representation space, while the views of different inputs (i.e., negative view pairs) are pushed apart. Most existing CL methods [15, 47, 3, 9] generate views using data augmentation, which is still challenging and under-explored for graph data. Instead of data augmentation, DGI [41] treated the graph-level and node-level representations of the same graph as positive view pairs. CMRLG [13] achieved an analogous goal by treating the adjacency matrix (local features) and the diffusion matrix (global features) as positive pairs. More recently, the GraphCL framework [45] employed four types of graph augmentations, including node dropping, edge perturbation, sub-graph sampling, and node attribute masking, enabling the most diverse augmentations by far for graph view generation. GCA [49] used sub-graph sampling and node attribute masking as augmentations and introduced a prior augmentation probability based on node centrality measures, enabling more adaptiveness than GraphCL [45]. However, these graph augmentation methods are not label-preserving. Moreover, the augmentation intensity needs to be manually tuned, and the augmentation policy is not adaptive to different tasks. In this work, we propose to learn the optimal augmentation policy from the graph data.

2.4 Learnable Data Augmentation

As mentioned above, data augmentation is a significant component of CL. The existing literature [3, 45] has revealed that the optimal augmentation policies are task-dependent and that the choice of augmentations makes a considerable difference to CL performance. Researchers have explored automatically discovering the optimal policy for image augmentation in the computer vision field. For instance, AutoAugment [5] first optimized the combination of augmentation functions through reinforcement learning. Faster-AA [14] and DADA [22] proposed differentiable augmentation optimization frameworks following the DARTS [23] style.

However, learnable data augmentation methods are barely explored for CL except for the InfoMin framework [38], which claims that good views for CL should maintain the label information while minimizing the mutual information of positive view pairs. InfoMin employs a flow-based generative model as the view generator for data augmentation and trains the view generator in a semi-supervised manner. However, transferring this idea to graph CL is a non-trivial task, since current graph generative models are either of limited generation quality [19] or designed for specific tasks such as molecular data [6, 25]. To overcome this issue, in this work we build a learnable graph view generator that learns a probability distribution over node-level augmentations. Compared to existing graph CL methods, our method well preserves the semantic structures of the original graphs. Moreover, it is end-to-end differentiable and can be efficiently trained.

3 Methodology

3.1 What Makes a Good Graph View Generator?

Our goal is to design a learnable graph view generator that learns to generate augmented graph views in a data-driven manner. Although various graph data augmentation methods have been proposed, there has been little discussion of what makes a good graph view generator. From our perspective, an ideal graph view generator for data augmentation and contrastive learning should satisfy the following properties:

  • It supports both the augmentations of the graph topology and the node feature.

  • It is label-preserving, i.e., the augmented graph should maintain the semantic information in the original graph.

  • It is adaptive to different data distributions and scalable to large graphs.

  • It provides sufficient variances for contrastive multi-view pre-training.

  • It is end-to-end differentiable and efficient enough for fast gradient computation via back-propagation (BP).

Here we provide an overview of the augmentation methods proposed in the existing graph contrastive learning literature in Table 1. CMRLG [13] applies a diffusion kernel to the adjacency matrix to obtain different topological structures. GRACE [48] uses random edge dropping and node attribute masking (randomly masking the attributes of a certain ratio of nodes). GCA [49] uses node dropping and node attribute masking along with a structural prior. Among all the previous works, GraphCL [45] enables the most flexible set of graph data augmentations so far, as it includes node dropping, edge perturbation (randomly replacing a certain ratio of edges with random edges), sub-graph sampling (randomly selecting a connected subgraph of a certain size), and attribute masking. We provide a detailed ablation study and analysis of GraphCL augmentations with different augmentation ratios in Section 1.1 of the supplementary.

In this work, we propose a learnable view generator to address all the above issues. Our view generator includes both node dropping and attribute masking, but it is much more flexible, since the two augmentations can be simultaneously employed in a node-wise manner without the need to tune an "aug ratio". Besides the concern of model performance, another reason for not incorporating edge perturbation in our view generator is that generating edges with learnable methods (e.g., VGAE [19]) requires predicting the full adjacency matrix, which contains $O(N^2)$ elements and is a heavy burden for back-propagation when dealing with large-scale graphs.
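For contrast, a pre-defined GraphCL-style node-dropping augmentation with a fixed "aug ratio" might look like the following sketch. The function and variable names are ours, not taken from any released codebase; the point is that the ratio is a global hyper-parameter and the choice of dropped nodes ignores the input semantics, which is exactly what the learnable generator replaces.

```python
import random

def drop_nodes(nodes, edges, aug_ratio=0.2, seed=None):
    """Drop a fixed ratio of nodes uniformly at random, removing incident edges."""
    rnd = random.Random(seed)
    n_drop = int(len(nodes) * aug_ratio)
    dropped = set(rnd.sample(nodes, n_drop))
    kept_nodes = [v for v in nodes if v not in dropped]
    kept_edges = [(u, v) for (u, v) in edges
                  if u not in dropped and v not in dropped]
    return kept_nodes, kept_edges

nodes = list(range(10))
edges = [(i, i + 1) for i in range(9)]   # a 10-node path graph
new_nodes, new_edges = drop_nodes(nodes, edges, aug_ratio=0.2, seed=0)
print(len(new_nodes))  # 8
```

Whatever ratio is chosen applies uniformly to every graph in the dataset, so key nodes are as likely to be dropped as peripheral ones.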

3.2 Learnable Graph View Generator

Figure 1: The architecture of our learnable graph view generator. The GNN layers embed the original graph to generate a distribution for each node. The augmentation choice of each node is sampled from this distribution using the gumbel-softmax.

Figure 1 illustrates the scheme of our proposed learnable graph view generator. We use GIN [42] layers to obtain node embeddings from the node attributes. For each node, the embedded node feature is used to predict the probability of selecting a certain augmentation operation. The augmentation pool for each node is {drop, keep, mask}. We employ the gumbel-softmax [17] to sample from these probabilities and assign an augmentation operation to each node. Formally, suppose we use $K$ GIN layers as the embedding layers; we denote $h_i^{(k)}$ as the hidden state of node $i$ at the $k$-th layer and $h_i = h_i^{(K)}$ as the embedding of node $i$ after the $K$-th layer. For node $i$, we have the node feature $x_i$, the augmentation choice $c_i$, and the function $f(\cdot)$ for applying the augmentation. Then the augmented feature of node $i$ is obtained via

$$c_i \sim \text{Gumbel-Softmax}\big(\text{softmax}(h_i)\big), \qquad x_i' = f(x_i, c_i).$$

The dimension of the last-layer embedding $h_i$ is set to the number of possible augmentations for each node, so that $\text{softmax}(h_i)$ denotes the probability distribution for selecting each kind of augmentation. $c_i$ is a one-hot vector sampled from this distribution via the gumbel-softmax [17], and it is differentiable thanks to the reparameterization trick. The augmentation applying function $f$ combines the node attribute $x_i$ and $c_i$ using differentiable operations (e.g., multiplication), so the gradients of the weights of the view generator are kept in the augmented node features and can be computed using back-propagation. For the augmented graph, the edge table is updated according to $c_i$ for all nodes, where the edges connected to any dropped node are removed. As the edge table is only the guidance for node feature aggregation and does not participate in the gradient computation, it does not need to be updated in a differentiable manner. Therefore, our view generator is end-to-end differentiable. The GIN embedding layers and the gumbel-softmax can be efficiently scaled up for larger graph datasets and more augmentation choices.
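The per-node sampling step can be sketched as follows. This is an illustrative NumPy version: in practice the logits come from the GIN embedding (random stand-ins here), and the soft sample is rounded straight-through during training so gradients flow through the soft probabilities.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft sample from a categorical distribution using Gumbel noise;
    a straight-through estimator would round this to one-hot in the forward pass."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    return y / y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))       # 5 nodes x {drop, keep, mask} scores
probs = gumbel_softmax(logits, tau=0.5, rng=rng)
choices = probs.argmax(axis=-1)        # hard per-node choice (0=drop, 1=keep, 2=mask)

X = rng.normal(size=(5, 4))            # node features
X_aug = X.copy()
X_aug[choices != 1] = 0.0              # dropped/masked nodes get zeroed features;
                                       # dropped nodes would also lose incident edges
print(X_aug.shape)  # (5, 4)
```

Lowering `tau` makes the soft samples closer to one-hot; the node-wise choices give the generator per-node control that a single global "aug ratio" cannot express.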

3.3 Contrastive Pre-training Strategy

Since contrastive learning requires multiple views to form a positive view pair, our framework has two view generators and one classifier. According to the InfoMin principle [38], a good positive view pair for contrastive learning should maximize the label-related information while minimizing the mutual information (similarity) between the views. To achieve this, our framework uses two separate graph view generators and trains them together with the classifier in a joint manner.

Figure 2: The proposed AutoGCL framework is composed of three parts: (1) two view generators that generate different views of the original graph, (2) a graph encoder that extracts the features of graphs and (3) a classifier that provides the graph outputs.

3.3.1 Loss Function Definition

Here we define three loss functions: the contrastive loss $\mathcal{L}_{cl}$, the similarity loss $\mathcal{L}_{sim}$, and the classification loss $\mathcal{L}_{cls}$. For the contrastive loss, we follow the previous works [3, 45] and use the normalized temperature-scaled cross-entropy loss (NT-Xent) [35]. We formulate the similarity function as

$$\text{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}.$$
Suppose we have a data batch made up of $N$ graphs. We pass the batch to the two view generators to obtain $2N$ graph views and regard the two augmented views from the same input graph as a positive view pair. We use $\mathbb{1}_{[k \neq i]}$ to denote the indicator function, $\ell(i, j)$ the contrastive loss for a positive pair of samples $(i, j)$, $\mathcal{L}_{cl}$ the contrastive loss of the data batch, and $\tau$ the temperature parameter; then we have

$$\ell(i, j) = -\log \frac{\exp\big(\text{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\text{sim}(z_i, z_k)/\tau\big)}, \qquad \mathcal{L}_{cl} = \frac{1}{2N} \sum_{k=1}^{N} \big[\ell(2k{-}1, 2k) + \ell(2k, 2k{-}1)\big],$$

where $z_i$ denotes the encoded representation of view $i$.
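A minimal NumPy sketch of the NT-Xent loss over a batch of $2N$ view embeddings, assuming rows $2k$ and $2k{+}1$ form a positive pair (the embeddings below are random stand-ins for encoder outputs):

```python
import numpy as np

def nt_xent(Z, tau=0.5):
    """NT-Xent over 2N view embeddings Z; rows 2k and 2k+1 are a positive pair."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # cosine sim via dot products
    S = Z @ Z.T / tau
    np.fill_diagonal(S, -np.inf)                       # exclude self-similarity (k != i)
    pos = np.arange(len(Z)) ^ 1                        # partner index: 1,0,3,2,...
    log_softmax = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(Z)), pos].mean()

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))   # N = 4 graphs -> 2N = 8 views
loss = nt_xent(Z)
print(loss > 0)  # True
```

Averaging the per-row terms over all $2N$ rows is equivalent to the symmetric sum over pairs in the formula above.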
The similarity loss is used to minimize the mutual information between the views generated by the two view generators. During the view generation process, we have a sampled state matrix indicating each node's corresponding augmentation operation (see Figure 1). For a graph $G$, we denote the sampled augmentation choice matrices of the two view generators as $S_1(G)$ and $S_2(G)$; then we formulate the similarity loss as

$$\mathcal{L}_{sim} = \text{sim}\big(S_1(G), S_2(G)\big).$$
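To make this concrete, here is a toy sketch in which hard one-hot choice matrices stand in for the sampled (soft) ones; the similarity reduces to a cosine similarity between the two generators' flattened per-node choices:

```python
import numpy as np

def sim_loss(S1, S2):
    """Cosine similarity between two flattened augmentation-choice matrices."""
    a, b = S1.ravel(), S2.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

S1 = np.eye(3)[[0, 1, 2]]   # generator 1 chose: drop, keep, mask
S2 = np.eye(3)[[1, 1, 2]]   # generator 2 chose: keep, keep, mask
print(round(sim_loss(S1, S2), 3))  # 0.667  (they agree on 2 of 3 nodes)
```

Minimizing this value pushes the two generators toward disagreeing augmentation choices, i.e., more diverse positive views.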
Finally, for the classification loss, we directly use the cross-entropy loss $\ell_{ce}$. For a graph sample $G$ with class label $y$, we denote the augmented views as $G_1$ and $G_2$ and the classifier as $F(\cdot)$. Then the classification loss is formulated as

$$\mathcal{L}_{cls} = \ell_{ce}\big(F(G_1), y\big) + \ell_{ce}\big(F(G_2), y\big).$$
The classification loss is employed in the semi-supervised training task to encourage the view generators to generate label-preserving augmentations.

3.3.2 Naive Training Strategy

For unsupervised learning and transfer learning tasks, we use a naive training strategy (naive-strategy). Since the labels are unknown in the pre-training stage, $\mathcal{L}_{cls}$ cannot be applied, and $\mathcal{L}_{sim}$ is not used because it does not make sense to encourage the views to be different without keeping the label-related information; this could lead to useless or even harmful view samples. We simply train the view generators and the classifier jointly to minimize $\mathcal{L}_{cl}$ in the pre-training stage.

Also, we note that the quality of the generated views will not be as good as the original data. Therefore, instead of just minimizing $\mathcal{L}_{cl}$ between the two augmented views as in GraphCL [45], we also make use of the original data. By pulling the original data and the augmented views close in the embedding space, the view generators are more likely to preserve the label-related information. The details of the naive training strategy are described in Algorithm 1.

1: Initialize the weights of the two view generators $G_1$, $G_2$
2: Initialize the weights of the classifier $F$
3: while not reached maximum epochs do
4:     for mini-batch $G$ from unlabeled data do
5:         Get augmentation distributions $G_1(G)$, $G_2(G)$
6:         Sample two views $v_1$, $v_2$ from $G_1(G)$, $G_2(G)$
7:         Compute $\mathcal{L}_{cl}$ among $v_1$, $v_2$, and $G$
8:         Update the weights of $G_1$, $G_2$, $F$ to minimize $\mathcal{L}_{cl}$
9: while not reached maximum epochs do
10:     for mini-batch $G$ from labeled data do
11:         Compute $\mathcal{L}_{cls}$ on $G$
12:         Update the weights of $F$ to minimize $\mathcal{L}_{cls}$
Algorithm 1: Naive training strategy (naive-strategy).

3.3.3 Joint Training Strategy

For semi-supervised learning tasks, we propose a joint training strategy that performs contrastive training and supervised training alternately. This strategy generates label-preserving augmentations and outperforms the naive-strategy; the experimental results and detailed analysis are presented in Section 4.1.3 and Section 4.3.

For the joint-strategy, during the unsupervised training stage, we fix the view generators and train the classifier by contrastive learning using unlabeled data. During the supervised training stage, we jointly train the view generators with the classifier using labeled data. By simultaneously optimizing $\mathcal{L}_{cls}$ and $\mathcal{L}_{sim}$, the two view generators are encouraged to generate label-preserving augmentations that are nevertheless different enough from each other. The unsupervised and supervised training stages are repeated alternately. This is very different from previous graph contrastive learning methods: previous works like GraphCL [45] use a pre-training/fine-tuning strategy, which first minimizes the contrastive loss $\mathcal{L}_{cl}$ until convergence using the unlabeled data and then fine-tunes with the labeled data.

1: Initialize the weights of $G_1$, $G_2$, $F$
2: while not reached maximum epochs do
3:     for mini-batch $G$ from unlabeled data do
4:         Fix the weights of $G_1$, $G_2$
5:         Get augmentation distributions $G_1(G)$, $G_2(G)$
6:         Sample two views $v_1$, $v_2$ from $G_1(G)$, $G_2(G)$
7:         Compute $\mathcal{L}_{cl}$ among $v_1$, $v_2$, and $G$
8:         Update the weights of $F$ to minimize $\mathcal{L}_{cl}$
9:     for mini-batch $G$ from labeled data do
10:         Get augmentation distributions and sample two views $v_1$, $v_2$
11:         Compute $\mathcal{L}_{cls}$ and $\mathcal{L}_{sim}$
12:         Update the weights of $G_1$, $G_2$, $F$ to minimize $\mathcal{L}_{cls} + \mathcal{L}_{sim}$
Algorithm 2: Joint training strategy (joint-strategy).

However, we found that for graph contrastive learning, the pre-training/fine-tuning strategy is more likely to cause over-fitting in the fine-tuning stage, and minimizing $\mathcal{L}_{cl}$ too much may have a negative effect on the fine-tuning stage (see Section 4.3). We speculate that over-minimizing $\mathcal{L}_{cl}$ pushes data points near the decision boundary too close to each other, making it more difficult for the classifier to separate them. No matter how well we train the GNN classifier, there will still be mis-classified samples due to the natural overlap between the data distributions of different classes, yet in the contrastive pre-training stage the classifier is not aware of whether the samples being pulled together are really from the same class.

| Model | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M-5K |
|---|---|---|---|---|---|---|---|---|
| GL | 81.66±2.11 | - | - | - | - | 65.87±0.98 | 77.34±0.18 | 41.01±0.17 |
| WL | 80.72±3.00 | 72.92±0.56 | - | 80.01±0.50 | - | 72.30±3.44 | 68.82±0.41 | 46.06±0.21 |
| DGK | 87.44±2.72 | 73.30±0.82 | - | *80.31±0.46* | - | 66.96±0.56 | 78.04±0.39 | 41.27±0.18 |
| node2vec | 72.63±10.20 | 57.49±3.57 | - | 54.89±1.61 | - | - | - | - |
| sub2vec | 61.05±15.80 | 53.03±5.55 | - | 52.84±1.47 | - | 55.26±1.54 | 71.48±0.41 | 36.68±0.42 |
| graph2vec | 83.15±9.25 | 73.30±2.05 | - | 73.22±1.81 | - | 71.10±0.54 | 75.78±1.03 | 47.86±0.26 |
| InfoGraph | **89.01±1.13** | *74.44±0.31* | 72.85±1.78 | 76.20±1.06 | *70.65±1.13* | *73.03±0.87* | 82.50±1.42 | 53.46±1.03 |
| GraphCL | 86.80±1.34 | 74.39±0.45 | **78.62±0.40** | 77.87±0.41 | **71.36±1.15** | 71.14±0.44 | **89.53±0.84** | *55.99±0.28* |
| Ours | *88.64±1.08* | **75.80±0.36** | *77.57±0.60* | **82.00±0.29** | 70.12±0.68 | **73.30±0.40** | *88.58±1.49* | **56.75±0.18** |
Table 2: Comparison with the existing methods for unsupervised learning. Bold numbers denote the best performance and numbers in italics denote the second best performance.
| Model | BBBP | Tox21 | ToxCast | SIDER | ClinTox | MUV | HIV | BACE |
|---|---|---|---|---|---|---|---|---|
| No Pretrain | 65.8±4.5 | 74.0±0.8 | 63.4±0.6 | 57.3±1.6 | 58.0±4.4 | 71.8±2.5 | 75.3±1.9 | 70.1±5.4 |
| Infomax | 68.8±0.8 | 75.3±0.5 | 62.7±0.4 | 58.4±0.8 | 69.9±3.0 | 75.3±2.5 | 76.0±0.7 | 75.9±1.6 |
| EdgePred | 67.3±2.4 | *76.0±0.6* | *64.1±0.6* | 60.4±0.7 | 64.1±3.7 | 74.1±2.1 | 76.3±1.0 | *79.9±0.9* |
| AttrMasking | 64.3±2.8 | **76.7±0.4** | **64.2±0.5** | 61.0±0.7 | 71.8±4.1 | 74.7±1.4 | 77.2±1.1 | 79.3±1.6 |
| ContextPred | 68.0±2.0 | 75.7±0.7 | 63.9±0.6 | *60.9±0.6* | 65.9±3.8 | *75.8±1.7* | 77.3±1.0 | 79.6±1.2 |
| GraphCL | *69.68±0.67* | 73.87±0.66 | 62.40±0.57 | 60.53±0.88 | *75.99±2.65* | 69.80±2.66 | **78.47±1.22** | 75.38±1.44 |
| Ours | **73.36±0.77** | 75.69±0.29 | 63.47±0.38 | **62.51±0.63** | **80.99±3.38** | **75.83±1.30** | *78.35±0.64* | **83.26±1.13** |
Table 3: Comparison with the existing methods for transfer learning. Bold numbers denote the best performance and numbers in italics denote the second best performance.

Therefore, we propose a new semi-supervised training strategy, namely the joint-strategy, which alternately minimizes $\mathcal{L}_{cl}$ and $\mathcal{L}_{cls} + \mathcal{L}_{sim}$. Minimizing $\mathcal{L}_{sim}$ is inspired by InfoMin [38], so as to make the two view generators keep the label-related information while sharing less mutual information. However, since we only have a small portion of labeled data to train our view generators, it is still beneficial to use the original data, just like in the naive-strategy. Since we need to minimize $\mathcal{L}_{cls}$ and $\mathcal{L}_{sim}$ simultaneously, a weight $\lambda$ can be applied to balance the two terms, but we found that simply setting $\lambda = 1$ works well in the experiments in Section 4.1. The detailed training strategy is described in Algorithm 2, and an overview of the whole framework is shown in Figure 2.
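The control flow of the alternating strategy can be summarized with stand-in callables; every name below is hypothetical and serves only to show which losses update which modules, not the authors' actual API.

```python
def joint_epoch(unlabeled, labeled, gen1, gen2, l_cl, l_cls, l_sim,
                step_classifier, step_all):
    """One epoch of the alternating joint strategy."""
    for g in unlabeled:                  # contrastive stage: generators fixed
        v1, v2 = gen1(g), gen2(g)
        step_classifier(l_cl(v1, v2, g))        # update the classifier only
    for g, y in labeled:                 # supervised stage: train everything
        v1, v2 = gen1(g), gen2(g)
        step_all(l_cls(v1, v2, y) + l_sim(v1, v2))  # lambda = 1 weighting

# Toy stubs just to exercise the control flow:
log = []
joint_epoch([1, 2], [(3, 0)],
            gen1=lambda g: g, gen2=lambda g: -g,
            l_cl=lambda a, b, g: 0.0, l_cls=lambda a, b, y: 1.0,
            l_sim=lambda a, b: 0.5,
            step_classifier=lambda loss: log.append(("cls-only", loss)),
            step_all=lambda loss: log.append(("all", loss)))
print(log)  # [('cls-only', 0.0), ('cls-only', 0.0), ('all', 1.5)]
```

The key difference from pre-training/fine-tuning is visible in the loop structure: both stages run inside every epoch, so the generators are continually steered by the labeled data.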

4 Experiment

4.1 Comparison with State-of-the-Art Methods

4.1.1 Unsupervised Learning

For the unsupervised graph classification task, we contrastively train a representation model using unlabeled data, then fix the representation model and train the classifier using labeled data. Following GraphCL [45], we use a 5-layer GIN with a hidden size of 128 as our representation model and an SVM as our classifier. We train the GIN with a batch size of 128 and a learning rate of 0.001, with 30 epochs of contrastive pre-training under the naive-strategy. We perform a 10-fold cross validation on every dataset. For each fold, we employ 90% of the total data as unlabeled data for contrastive pre-training and 10% as labeled testing data. We repeat every experiment 5 times with different random seeds.

We compare with kernel-based methods, i.e., the graphlet kernel (GL) [33], the Weisfeiler-Lehman sub-tree kernel (WL) [32], and the deep graph kernel (DGK) [43]; other unsupervised graph representation methods, i.e., node2vec [10], sub2vec [1], and graph2vec [29]; and the contrastive learning methods InfoGraph [36] and GraphCL [45]. Table 2 shows the comparison among different models for unsupervised learning. Our proposed model achieves the best results on the PROTEINS, NCI1, IMDB-binary, and REDDIT-Multi-5K datasets and the second best performance on the MUTAG, DD, and REDDIT-binary datasets, outperforming the current state-of-the-art contrastive learning method GraphCL.

4.1.2 Transfer Learning

We also evaluate the transfer learning performance of the proposed method. A strong baseline method for graph transfer learning is Pretrain-GNN [16]. The network backbone of Pretrain-GNN, GraphCL, and our method is a variant of GIN [42], which incorporates the edge attribute. We perform 100 epochs of supervised pre-training on the pre-processed ChEMBL dataset ([26, 8]), which contains 456K molecules with 1,310 kinds of diverse and extensive biochemical assays.

We perform 30 epochs of fine-tuning on the 8 chemistry evaluation subsets. We use a hidden size of 300 for the classifier and a hidden size of 128 for the view generators. We train the model with a batch size of 256 and a learning rate of 0.001. The results in Table 3 are the mean±std of the ROC-AUC scores over 10 runs. Infomax, EdgePred, AttrMasking, and ContextPred are the manually designed pre-training strategies from Pretrain-GNN [16].

Table 3 presents the comparison among different methods. Our proposed method achieves the best performance on most datasets, namely BBBP, SIDER, ClinTox, MUV, and BACE. Compared with the current state-of-the-art model, GraphCL [45], our method performs considerably better; for example, on the BBBP dataset the ROC-AUC rises from 69.68±0.67 to 73.36±0.77. Across all datasets, the average gain of our proposed method is around 3.42. Interestingly, AttrMasking achieves the best performance on Tox21 and ToxCast, slightly better than our method. One possible reason is that node attributes are particularly important for classification on the Tox21 and ToxCast datasets.

4.1.3 Semi-Supervised Learning

| Strategy | PROTEINS | DD | NCI1 | COLLAB | GITHUB | IMDB-B | REDDIT-B | REDDIT-M-5K |
|---|---|---|---|---|---|---|---|---|
| Full Data | 78.25±1.61 | 80.73±3.78 | 83.65±1.16 | 83.44±0.77 | 66.89±1.04 | 76.60±4.20 | 89.95±2.06 | 55.59±2.24 |
| 10% Data | 69.72±6.71 | 74.36±5.86 | **75.16±2.07** | 74.34±2.00 | 61.05±1.57 | 64.80±4.92 | 76.75±5.60 | *49.71±3.20* |
| 10% GCA | 73.85±5.56 | 76.74±4.09 | 68.73±2.36 | 74.32±2.30 | 59.24±3.21 | **73.70±4.88** | 77.15±6.96 | 32.95±10.89 |
| 10% GraphCL Aug Only | 70.71±5.63 | 76.48±4.12 | 70.97±2.08 | 73.56±2.52 | 59.80±1.94 | 71.10±5.11 | 76.45±4.83 | 47.33±4.02 |
| 10% GraphCL CL | 74.21±4.50 | 76.65±5.12 | 73.16±2.90 | 75.50±2.15 | **63.51±1.02** | 68.10±5.15 | 78.05±2.65 | 48.09±1.74 |
| 10% Our Aug Only | *75.49±5.15* | *77.16±4.53* | 73.33±2.86 | 75.92±1.93 | 60.65±1.04 | *71.90±2.88* | *79.65±2.84* | 47.97±2.22 |
| 10% Our CL Naive | 74.57±3.29 | 75.55±4.76 | 73.22±2.48 | *76.60±2.15* | 60.95±1.32 | 71.00±2.91 | 79.10±4.38 | 46.71±2.64 |
| 10% Our CL Joint ($\mathcal{L}_{cl}$) | 74.66±2.58 | 76.57±5.08 | 71.78±1.61 | 75.38±2.15 | 60.39±1.50 | 70.60±4.17 | 78.90±3.11 | 46.89±3.13 |
| 10% Our CL Joint ($\mathcal{L}_{cl}$+$\mathcal{L}_{sim}$) | 75.12±3.35 | 76.23±3.57 | 72.55±2.72 | 75.60±2.08 | 60.18±1.75 | 71.70±3.86 | 79.25±2.88 | 47.51±2.51 |
| 10% Our CL Joint ($\mathcal{L}_{cl}$+$\mathcal{L}_{cls}$) | 74.75±3.35 | 76.82±3.85 | 73.07±2.31 | 76.18±2.46 | 61.75±1.30 | 71.50±5.32 | 78.35±4.21 | 47.73±2.69 |
| 10% Our CL Joint ($\mathcal{L}_{cl}$+$\mathcal{L}_{cls}$+$\mathcal{L}_{sim}$) | **75.65±2.40** | **77.50±4.41** | *73.75±2.25* | **77.16±1.48** | *62.46±1.51* | *71.90±4.79* | **79.80±3.47** | **49.91±2.70** |
Table 4: Comparison with existing methods and different strategies for semi-supervised learning. Bold numbers denote the best performance and numbers in italics denote the second best performance. The last row is our default setting for the joint training strategy.

We perform the semi-supervised graph classification task on TUDataset [28]. For our view generator, we use a 5-layer GIN with a hidden size of 128 as the embedding model. We use ResGCN [2] with a hidden size of 128 as the classifier. For GraphCL, we use its default random policy, which randomly selects two augmentations from node dropping, edge perturbation, subgraph sampling, and attribute masking for every mini-batch. For all augmentations, a node or edge is dropped or perturbed with the default probability used in GraphCL [45].

We employ a 10-fold cross validation on each dataset. For each fold, we use 80% of the total data as unlabeled data, 10% as labeled training data, and 10% as labeled testing data. For the augmentation-only (Aug Only) experiments, we only perform 30 epochs of supervised training with augmentations using the labeled data. For the contrastive learning experiments of GraphCL and our naive-strategy, we perform 30 epochs of contrastive pre-training followed by 30 epochs of supervised training. For our joint-strategy, there are 30 epochs of joint contrastive and supervised training.

Table 4 compares the performances obtained by different training strategies: augmentation only (Aug Only), naive strategy (CL Naive), and joint strategy (CL Joint), together with an ablation study of our joint loss function. The proposed CL Joint approach achieves relatively high accuracy on most datasets: on DD, COLLAB, REDDIT-B, and REDDIT-M-5K, the joint strategy obtains the best performance, with an average gain of around 0.31 over the second best. On the other datasets, the joint strategy achieves the second best performance. Comparing Aug Only, CL Naive, and CL Joint, CL Joint is superior to the other two approaches, in particular to CL Naive.

4.2 Effectiveness of Learnable View Generators

In this section, we demonstrate the superiority of learnable graph augmentation policies over fixed ones. Since graph datasets are usually difficult to classify and visualize manually, we trained a view generator on the MNIST-Superpixel dataset [27] to verify that our graph view generator captures the semantic information in graphs more effectively than GraphCL [45]; MNIST-Superpixel graphs have clear semantics that require no domain knowledge to interpret. The visualization results are shown in Figure 3.

Here we jointly trained the view generators with the classifier until the test accuracy (evaluated on the generated views) converged. Since node dropping is our only topological augmentation, we compare against GraphCL's node dropping augmentation with its default aug ratio of 20%. Figure 3 shows that our view generator is more likely to keep the key nodes of the original graph, preserving its semantic features while still providing enough variance for contrastive learning. Details of the MNIST-Superpixel dataset and more visualization examples are given in Section 1.2 of the supplementary.

Figure 3: View visualization on the MNIST-Superpixel dataset. Nodes with non-zero attributes are colored red and other nodes blue; redder nodes indicate larger attribute values.

4.3 Analysis for Joint Training Strategy

We compared the naive strategy (Algorithm 1) with the joint strategy (Algorithm 2) on the COLLAB [31] dataset, which contains 5,000 social network graphs in 3 classes, with an average of 74.49 nodes and 2,457.78 edges per graph. We use a 5-layer GIN [42] as the backbone for both the view generator and the classifier. The naive strategy runs 30 epochs of contrastive pre-training on the 80% unlabeled data followed by 30 epochs of fine-tuning on the 10% labeled data; the joint strategy runs 30 epochs of joint training. The learning curves are shown in Section 1.3 of the supplementary. The results show that the joint strategy considerably alleviates over-fitting, and that our label-preserving view generator is effective. We also visualize the embedding learning process of each strategy using t-SNE [39] in the supplementary. The joint training strategy learns better representations much faster because labeled data provides supervision, and this supervision signal also benefits the learning of the view generators.
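The difference between the two training schedules can be sketched as follows. The stub step functions stand in for the real gradient updates on the contrastive and classification losses; this is an illustrative skeleton, not the paper's implementation.

```python
def naive_strategy(contrastive_step, supervised_step,
                   pretrain_epochs=30, finetune_epochs=30):
    """Naive strategy: contrastive pre-training first, then supervised
    fine-tuning as a separate second phase."""
    log = []
    for _ in range(pretrain_epochs):
        log.append(("cl", contrastive_step()))
    for _ in range(finetune_epochs):
        log.append(("cls", supervised_step()))
    return log

def joint_strategy(contrastive_step, supervised_step, epochs=30):
    """Joint strategy: alternate contrastive and supervised updates within
    every epoch, so label supervision also shapes the view generators
    throughout training."""
    log = []
    for _ in range(epochs):
        log.append(("cl", contrastive_step()))
        log.append(("cls", supervised_step()))
    return log
```

The interleaving in `joint_strategy` is what lets the supervised signal regularize the view generators continuously, instead of only after pre-training has finished.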

5 Conclusion

In this paper, we presented a learnable data augmentation approach for graph contrastive learning, in which GIN-based view generators produce different views of the original graphs while preserving their semantic labels. We developed a joint learning strategy that alternately optimizes the view generators, the graph encoder, and the classifier. Extensive experiments on a number of datasets and tasks, including semi-supervised learning, unsupervised learning, and transfer learning, demonstrate that the proposed method outperforms its counterparts on most datasets and tasks. In addition, visualizations of the generated graph views show that they preserve the discriminative structures of the input graphs, benefiting classification. Finally, the t-SNE visualization illustrates that the proposed joint training strategy is a better choice for semi-supervised graph representation learning.

Appendix A More Analysis

A.1 An Insight into GraphCL Augmentations

Here we show that the augmentation selection policy and the intensity of the augmentations really matter to the final results. Among all previous works, GraphCL [45] enables the most flexible set of graph data augmentations so far, including node dropping, edge perturbation, sub-graph sampling, and attribute masking:

  • Node dropping randomly removes a certain ratio of nodes.

  • Edge perturbation first randomly removes a certain ratio of edges, then randomly adds the same number of new edges.

  • Sub-graph sampling randomly selects a connected subgraph by first choosing a random center node, then gradually adding its neighbor nodes until a certain ratio of the total nodes is reached.

  • Node attribute masking randomly masks the attributes of a certain ratio of nodes.
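For concreteness, the four augmentations can be sketched on a simple edge-list graph representation. These are plain-Python illustrations rather than GraphCL's actual implementation; the function names and the edge-list format are our own.

```python
import random

def node_drop(num_nodes, edges, aug_ratio=0.2, rng=random):
    """Drop a random `aug_ratio` of nodes; keep edges between surviving nodes."""
    drop = set(rng.sample(range(num_nodes), int(num_nodes * aug_ratio)))
    kept = [n for n in range(num_nodes) if n not in drop]
    relabel = {old: new for new, old in enumerate(kept)}
    new_edges = [(relabel[u], relabel[v]) for u, v in edges
                 if u not in drop and v not in drop]
    return len(kept), new_edges

def edge_perturb(num_nodes, edges, aug_ratio=0.2, rng=random):
    """Remove a random `aug_ratio` of edges, then add the same number of random edges."""
    k = int(len(edges) * aug_ratio)
    kept = rng.sample(edges, len(edges) - k)
    kept += [(rng.randrange(num_nodes), rng.randrange(num_nodes)) for _ in range(k)]
    return kept

def subgraph(num_nodes, edges, aug_ratio=0.2, rng=random):
    """Grow a connected node set from a random center until `aug_ratio`
    of the total nodes is reached; return the selected node set."""
    adj = {n: set() for n in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    center = rng.randrange(num_nodes)
    nodes, frontier = {center}, set(adj[center])
    target = max(1, int(num_nodes * aug_ratio))
    while len(nodes) < target and frontier:
        n = rng.choice(sorted(frontier))
        frontier.remove(n)
        nodes.add(n)
        frontier |= adj[n] - nodes
    return nodes

def attr_mask(features, aug_ratio=0.2, mask_value=0.0, rng=random):
    """Mask the attributes of a random `aug_ratio` of nodes with `mask_value`."""
    masked = list(features)
    for i in rng.sample(range(len(features)), int(len(features) * aug_ratio)):
        masked[i] = mask_value
    return masked
```

In each case the single hyper-parameter `aug_ratio` controls the augmentation intensity, which is exactly the quantity the ablation below varies.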

We note that the only augmentation selection policy in all existing works is uniform sampling, and that all the augmentation methods require a hyper-parameter, the "aug ratio", which controls the portion of nodes/edges selected for augmentation. The aug ratio is set to a constant in every experiment (e.g., 20% by GraphCL's default). We perform an ablation study of these augmentation methods, shown in Table 5 and Table 6, and conclude that:

  • The positive contributions of edge perturbation and subgraph augmentation to graph contrastive learning are very limited, or even negative.

  • The subgraph augmentation is contained in the augmentation space of node dropping. For instance, the potential view space of dropping 80% of the nodes contains the potential view space of selecting a connected subgraph with 20% of the nodes.

  • The choice of aug ratio has a considerable effect on the final performance, so it is inappropriate to apply the same aug ratio to different augmentations, datasets, and tasks.
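The containment claimed in the second point can be checked concretely on a toy example: on a path graph, every connected 2-node subgraph view is also reachable by dropping the complementary nodes, while node dropping can additionally produce disconnected views. The `is_connected` helper below is our own illustration, not code from the paper.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Check whether the induced subgraph on `nodes` is connected (BFS/DFS)."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(adj[n] - seen)
    return seen == nodes

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]           # path graph on 5 nodes
node_drop_views = list(combinations(range(5), 2))  # drop 3 nodes -> keep any 2
subgraph_views = [v for v in node_drop_views if is_connected(v, edges)]
```

Here node dropping yields all 10 two-node subsets, of which only the 4 connected ones are also subgraph views, so the subgraph view space is a strict subset of the node-dropping view space.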

Augmentation     NCI1          PROTEINS      DD
Full Data        83.28 ± 1.84  77.18 ± 2.71  79.80 ± 2.65
10% Data         72.99 ± 1.28  67.50 ± 7.81  72.91 ± 4.42
10% Random4      72.94 ± 3.23  71.08 ± 5.03  75.29 ± 2.74
10% NodeDrop     72.55 ± 2.43  71.98 ± 4.29  76.66 ± 2.33
10% EdgePerturb  71.85 ± 3.45  70.72 ± 5.60  73.34 ± 3.22
10% Subgraph     72.70 ± 2.60  62.99 ± 4.92  68.33 ± 4.77
10% AttrMask     72.87 ± 2.08  70.72 ± 3.72  74.96 ± 3.46
Table 5: Ablation Study of GraphCL Augmentations.

Dataset  Aug Ratio  Node Dropping  Edge Perturbation  Subgraph      Attribute Masking
NCI1     0.0        74.48 ± 1.91   74.48 ± 1.91       74.48 ± 1.91  74.48 ± 1.91
NCI1     0.1        75.01 ± 2.91   72.94 ± 2.41       74.77 ± 2.78  75.96 ± 2.28
NCI1     0.2        74.40 ± 2.84   72.07 ± 2.74       75.09 ± 2.46  75.57 ± 2.34
NCI1     0.3        74.57 ± 2.14   71.87 ± 2.14       74.18 ± 2.94  75.11 ± 2.24
NCI1     0.4        73.94 ± 2.32   70.29 ± 2.08       74.31 ± 2.48  75.13 ± 2.52
NCI1     0.5        73.70 ± 2.43   71.44 ± 2.24       74.55 ± 1.90  74.70 ± 2.01
Table 6: Ablation Study of the Aug Ratio of GraphCL Augmentations.

A.2 The Effectiveness of Our Learnable Graph Augmentations

Figure 4: View visualization on the MNIST-Superpixel dataset. The nodes with non-zero node attribute are colored in red, other nodes are colored in blue. Redder nodes indicate larger value of the node attribute.
Figure 5: t-SNE visualization of our naive strategy and joint strategy. (a) is the contrastive pre-training stage of the naive strategy; (b) is the fine-tuning stage of the naive strategy; (c) is the training stage of the joint strategy. cl, cls, and sim denote the contrastive, classification, and view-similarity losses, respectively; test denotes the test accuracy.

Here we demonstrate the superiority of learnable graph augmentation policies over fixed ones. Since graph datasets are usually difficult to classify and visualize manually, we trained a view generator on the MNIST-Superpixel dataset [27] to verify that our graph view generator captures the semantic information in graphs more effectively than GraphCL [45]. The visualization results are shown in Figure 4.

The MNIST-Superpixel dataset [27] consists of super-pixel graphs derived from the MNIST dataset [21]. It contains 60,000 training samples and 10,000 testing samples, and each graph has 75 nodes. The node attribute can be understood as the intensity of each super-pixel.

Here we jointly trained the view generators with the classifier until the test accuracy (evaluated on the generated views) converged. Since node dropping is our only topological augmentation, we compare against GraphCL's node dropping augmentation with its default aug ratio of 20%. Figure 4 shows that our view generator is more likely to keep the key nodes of the original graph, preserving its semantic features while still providing enough variance for contrastive learning.

A.3 Analysis for Joint Training Strategy

Here we compared the naive strategy (Algorithm 1 in the paper) with the joint strategy (Algorithm 2 in the paper) on the COLLAB [31] dataset, which contains 5,000 social network graphs in 3 classes, with an average of 74.49 nodes and 2,457.78 edges per graph. We use a 5-layer GIN [42] as the backbone for both the view generator and the classifier. The naive strategy runs 30 epochs of contrastive pre-training on the 80% unlabeled data followed by 30 epochs of fine-tuning on the 10% labeled data; the joint strategy runs 30 epochs of joint training.

We compare the learning curves in Figure 6, where both contrastive losses are rescaled to fit in the figure. The contrastive loss of the naive strategy drops much faster than that of the joint strategy. However, the test accuracy of the naive strategy is lower than that of the joint strategy and shows a downward tendency, indicating over-fitting. The joint strategy considerably alleviates this over-fitting effect, which also shows the effectiveness of our label-preserving view generator.

Figure 6: Accuracy comparison between the naive-strategy and the joint-strategy.

We also visualize the embedding learning process of each strategy using t-SNE [39] in Figure 5. Figure 5(a) shows that during contrastive pre-training, graphs with the same semantic label gradually cluster together, but it is still difficult to recognize a decision boundary for classification; fine-tuning the model with labeled data (Figure 5(b)) yields much better graph representations for classification. This indicates that contrastive learning alone benefits classification to some extent but remains far from supervised learning. Figure 5(c) presents the joint training process: with label supervision, the model learns good representations within a few epochs. Moreover, the sim and cl loss values both decrease, indicating that the views of an input graph become more different while their representations remain close; the view generators thus learn to generate diverse views while preserving the semantic label of the input graph.

Figure 7: Loss comparison between naive-strategy and the joint-strategy.


  • [1] B. Adhikari, Y. Zhang, N. Ramakrishnan, and B. A. Prakash (2018) Sub2vec: feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 170–182. Cited by: §4.1.1.
  • [2] T. Chen, S. Bian, and Y. Sun (2019) Are powerful graph neural nets necessary? a dissection on graph classification. arXiv preprint arXiv:1905.04579. Cited by: §2.1, §4.1.3.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. Cited by: §1, §2.3, §2.4, §3.3.1.
  • [4] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §2.3.
  • [5] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §1, §2.4.
  • [6] N. De Cao and T. Kipf (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973. Cited by: §1, §2.4.
  • [7] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin (2019) Graph neural networks for social recommendation. In The World Wide Web Conference, pp. 417–426. Cited by: §1.
  • [8] A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic acids research 40 (D1), pp. D1100–D1107. Cited by: §4.1.2.
  • [9] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2.3.
  • [10] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §4.1.1.
  • [11] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §2.3.
  • [12] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: §1, §2.2.
  • [13] K. Hassani and A. H. Khasahmadi (2020) Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, pp. 4116–4126. Cited by: §1, §2.3, §3.1.
  • [14] R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama (2020) Faster autoaugment: learning augmentation strategies using backpropagation. In European Conference on Computer Vision, pp. 1–16. Cited by: §2.4.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.3.
  • [16] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2019) Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265. Cited by: §1, §2.2, §4.1.2, §4.1.2.
  • [17] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §1, §3.2.
  • [18] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1.
  • [19] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §1, §2.2, §2.4, §3.1.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §2.2.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §A.2.
  • [22] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson, and Y. Yang (2020) Differentiable automatic data augmentation. In European Conference on Computer Vision, pp. 580–595. Cited by: §2.4.
  • [23] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.4.
  • [24] L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893. Cited by: §1.
  • [25] K. Madhawa, K. Ishiguro, K. Nakago, and M. Abe (2019) Graphnvp: an invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600. Cited by: §2.4.
  • [26] A. Mayr, G. Klambauer, T. Unterthiner, M. Steijaert, J. K. Wegner, H. Ceulemans, D. Clevert, and S. Hochreiter (2018) Large-scale comparison of machine learning methods for drug target prediction on chembl. Chemical science 9 (24), pp. 5441–5451. Cited by: §4.1.2.
  • [27] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5115–5124. Cited by: §A.2, §A.2, §1, §4.2.
  • [28] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann (2020) TUDataset: a collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), External Links: 2007.08663, Link Cited by: §4.1.3.
  • [29] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §4.1.1.
  • [30] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich (2005) To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, Vol. 898, pp. 1–4. Cited by: §2.2.
  • [31] R. A. Rossi and N. K. Ahmed (2015) The network data repository with interactive graph analytics and visualization. In AAAI, External Links: Link Cited by: §A.3, §4.3.
  • [32] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels.. Journal of Machine Learning Research 12 (9). Cited by: §4.1.1.
  • [33] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial intelligence and statistics, pp. 488–495. Cited by: §4.1.1.
  • [34] W. Shi and R. Rajkumar (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1711–1719. Cited by: §1.
  • [35] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 1857–1865. Cited by: §3.3.1.
  • [36] F. Sun, J. Hoffmann, V. Verma, and J. Tang (2019) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000. Cited by: §4.1.1.
  • [37] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.
  • [38] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §1, §2.4, §3.3.3, §3.3.
  • [39] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: §A.3, §4.3.
  • [40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1.
  • [41] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §1, §2.3.
  • [42] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §A.3, §1, §2.1, §3.2, §4.1.2, §4.3.
  • [43] P. Yanardag and S. Vishwanathan (2015) Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1365–1374. Cited by: §4.1.1.
  • [44] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
  • [45] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2020) Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33. Cited by: §A.1, §A.2, §1, §2.3, §2.4, §3.1, §3.3.1, §3.3.2, §3.3.3, §4.1.1, §4.1.1, §4.1.2, §4.1.3, §4.2.
  • [46] Y. You, T. Chen, Z. Wang, and Y. Shen (2020) When does self-supervision help graph convolutional networks?. In International Conference on Machine Learning, pp. 10871–10880. Cited by: §1.
  • [47] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230. Cited by: §2.3.
  • [48] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang (2020) Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131. Cited by: §3.1.
  • [49] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang (2020) Graph contrastive learning with adaptive augmentation. arXiv preprint arXiv:2010.14945. Cited by: §1, §2.3, §3.1.