Graph Contrastive Learning with Augmentations

10/22/2020 ∙ by Yuning You, et al. ∙ Google, The University of Texas at Austin, Texas A&M University, USTC

Generalizable, transferable, and robust representation learning on graph-structured data remains a challenge for current graph neural networks (GNNs). Unlike what has been developed for convolutional neural networks (CNNs) for image data, self-supervised learning and pre-training are less explored for GNNs. In this paper, we propose a graph contrastive learning (GraphCL) framework for learning unsupervised representations of graph data. We first design four types of graph augmentations to incorporate various priors. We then systematically study the impact of various combinations of graph augmentations on multiple datasets, in four different settings: semi-supervised, unsupervised, and transfer learning, as well as adversarial attacks. The results show that, even without tuning augmentation extents or using sophisticated GNN architectures, our GraphCL framework can produce graph representations of similar or better generalizability, transferability, and robustness compared to state-of-the-art methods. We also investigate the impact of parameterized graph augmentation extents and patterns, and observe further performance gains in preliminary experiments. Our code is available at https://github.com/Shen-Lab/GraphCL.


1 Introduction

Graph neural networks (GNNs) Kipf and Welling (2016a); Veličković et al. (2017); Xu et al. (2018), following a neighborhood aggregation scheme, are increasingly popular for graph-structured data. Numerous variants of GNNs have been proposed to achieve state-of-the-art performances in graph-based tasks, such as node or link classification Kipf and Welling (2016a); Veličković et al. (2017); You et al. (2020a); Liu et al. (2020a); Zou et al. (2019), link prediction Zhang and Chen (2018) and graph classification Ying et al. (2018); Xu et al. (2018). Intriguingly, in most scenarios of graph-level tasks, GNNs are trained end-to-end under supervision. For GNNs, there is little exploration (except Hu et al. (2019)) of (self-supervised) pre-training, a technique commonly used as a regularizer in training deep architectures that suffer from gradient vanishing/explosion Erhan et al. (2009); Glorot and Bengio (2010). The reasons behind the intriguing phenomena could be that most studied graph datasets, as shown in Dwivedi et al. (2020), are often limited in size and GNNs often have shallow architectures to avoid over-smoothing Li et al. (2018) or “information loss” Oono and Suzuki (2019).

We however argue for the necessity of exploring GNN pre-training schemes. Task-specific labels can be extremely scarce for graph datasets (e.g. in biology and chemistry labeling through wet-lab experiments is often resource- and time-intensive) Zitnik et al. (2018); Hu et al. (2019), and pre-training can be a promising technique to mitigate the issue, as it does in convolutional neural networks (CNNs) Goyal et al. (2019); Kolesnikov et al. (2019); Chen et al. (2020b). As to the conjectured reasons for the lack of GNN pre-training: first, real-world graph data can be huge and even benchmark datasets are recently getting larger Dwivedi et al. (2020); Hu et al. (2020a); second, even for shallow models, pre-training could initialize parameters in a “better" attraction basin around a local minimum associated with better generalization Glorot and Bengio (2010). Therefore, we emphasize the significance of GNN pre-training.

Compared to CNNs for images, there are unique challenges in designing GNN pre-training schemes for graph-structured data. Unlike geometric information in images, rich structured information of various contexts exists in graph data Veličković et al. (2018); Sun et al. (2019), as graphs are abstracted representations of raw data of diverse nature (e.g. molecules made of chemically-bonded atoms and networks of socially-interacting people). It is thus difficult to design a GNN pre-training scheme generically beneficial to down-stream tasks. A naïve GNN pre-training scheme for graph-level tasks is to reconstruct the vertex adjacency information (e.g. GAE Kipf and Welling (2016b) and GraphSAGE Hamilton et al. (2017) in network embedding). This scheme can be very limited (as seen in Veličković et al. (2018) and our Sec. 5) because it over-emphasizes proximity, which is not always beneficial Veličković et al. (2018), and could hurt structural information Ribeiro et al. (2017). Therefore, a well-designed pre-training framework is needed to capture the highly heterogeneous information in graph-structured data.

Recently, in visual representation learning, contrastive learning has seen a renewed surge of interest Wu et al. (2018); Ye et al. (2019); Ji et al. (2019); Chen et al. (2020b); He et al. (2020). Self-supervision with handcrafted pretext tasks Noroozi and Favaro (2016); Carlucci et al. (2019); Trinh et al. (2019); Chen et al. (2020a) relies on heuristics to design, which could limit the generality of the learned representations. In comparison, contrastive learning aims to learn representations by maximizing feature consistency under differently augmented views, which exploit data- or task-specific augmentations Herzig et al. (2019) to inject the desired feature invariance. If extended to pre-training GNNs, this framework can potentially overcome the aforementioned limitations of proximity-based pre-training methods Kipf and Welling (2016b); Hamilton et al. (2017); You et al. (2020b); Jin et al. (2020); Zhu et al. (2020); Zhang et al. (2020); Hu et al. (2020b); Liu et al. (2020b). However, it cannot be directly applied outside visual representation learning and demands significant extensions to graph representation learning, leading to our innovations below.

Our Contributions. In this paper, we develop contrastive learning with augmentations for GNN pre-training to address the challenge of data heterogeneity in graphs. (i) Since data augmentations are the prerequisite for contrastive learning but are under-explored for graph data Verma et al. (2019), we first design four types of graph data augmentations, each of which imposes a certain prior over graph data and is parameterized for its extent and pattern. (ii) Utilizing them to obtain correlated views, we propose a novel graph contrastive learning framework (GraphCL) for GNN pre-training, so that representations invariant to specialized perturbations can be learned for diverse graph-structured data. Moreover, we show that GraphCL in fact performs mutual information maximization, and we draw the connection between GraphCL and recently proposed contrastive learning methods by demonstrating that GraphCL can be rewritten as a general framework unifying a broad family of contrastive learning methods on graph-structured data. (iii) A systematic study is performed to assess the performance of contrasting different augmentations on various types of datasets, revealing the rationale behind the performances and providing guidance for adopting the framework on specific datasets. (iv) Experiments show that GraphCL achieves state-of-the-art performance in the settings of semi-supervised learning, unsupervised representation learning, and transfer learning. It additionally boosts robustness against common adversarial attacks.

2 Related Work

Graph neural networks. In recent years, graph neural networks (GNNs) Kipf and Welling (2016a); Veličković et al. (2017); Xu et al. (2018) have emerged as a promising approach for analyzing graph-structured data. They follow an iterative neighborhood aggregation (or message passing) scheme to capture the structural information within nodes' neighborhoods. Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ denote an undirected graph, with $\boldsymbol{X} \in \mathbb{R}^{|\mathcal{V}| \times N}$ as the feature matrix where $\boldsymbol{X}_n$ is the $N$-dimensional attribute vector of the node $v_n \in \mathcal{V}$. Considering a $K$-layer GNN $f(\cdot)$, the propagation of the $k$-th layer is represented as:

$$\boldsymbol{h}_n^{(k)} = \mathrm{COMBINE}^{(k)}\Big(\boldsymbol{h}_n^{(k-1)},\ \mathrm{AGGREGATE}^{(k)}\big(\{\boldsymbol{h}_{n'}^{(k-1)} : n' \in \mathcal{N}(n)\}\big)\Big), \qquad (1)$$

where $\boldsymbol{h}_n^{(k)}$ is the embedding of the vertex $v_n$ at the $k$-th layer with $\boldsymbol{h}_n^{(0)} = \boldsymbol{X}_n$, $\mathcal{N}(n)$ is the set of vertices adjacent to $v_n$, and $\mathrm{AGGREGATE}^{(k)}(\cdot)$ and $\mathrm{COMBINE}^{(k)}(\cdot)$ are component functions of the GNN layer. After the $K$-layer propagation, the output embedding for $\mathcal{G}$ is summarized over the layer embeddings through the READOUT function. Then a multi-layer perceptron (MLP) is adopted for the graph-level downstream task (classification or regression):

$$f(\mathcal{G}) = \mathrm{MLP}\Big(\mathrm{READOUT}\big(\{\boldsymbol{h}_n^{(k)} : v_n \in \mathcal{V},\ k \in K\}\big)\Big). \qquad (2)$$

Various GNNs have been proposed Kipf and Welling (2016a); Veličković et al. (2017); Xu et al. (2018), achieving state-of-the-art performance in graph tasks.
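To make (1) and (2) concrete, the following is a minimal PyTorch sketch of a $K$-layer GNN with sum aggregation, a concatenate-then-linear COMBINE, and mean-pooling READOUT over all layers; the class name and these particular component choices are illustrative assumptions rather than the GCN/GIN encoders used in the experiments of Sec. 5.

```python
import torch
import torch.nn as nn

class SimpleGNN(nn.Module):
    """Minimal K-layer message-passing GNN following Eq. (1)-(2).

    AGGREGATE: sum over neighbors (dense adjacency matmul);
    COMBINE: linear layer + ReLU on [h_v, aggregated message];
    READOUT: mean pooling of node embeddings at every layer.
    """
    def __init__(self, in_dim, hid_dim, out_dim, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hid_dim] * num_layers
        self.layers = nn.ModuleList(
            [nn.Linear(2 * dims[k], dims[k + 1]) for k in range(num_layers)]
        )
        # MLP head for the graph-level task, Eq. (2)
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hid_dim), nn.ReLU(), nn.Linear(hid_dim, out_dim)
        )

    def forward(self, x, adj):
        # x: [num_nodes, in_dim] node attributes; adj: [num_nodes, num_nodes] adjacency
        h, readouts = x, [x.mean(dim=0)]
        for layer in self.layers:
            agg = adj @ h                                        # AGGREGATE, Eq. (1)
            h = torch.relu(layer(torch.cat([h, agg], dim=-1)))   # COMBINE, Eq. (1)
            readouts.append(h.mean(dim=0))                       # per-layer READOUT
        graph_emb = torch.cat(readouts, dim=-1)
        return self.mlp(graph_emb)                               # Eq. (2)
```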

Graph data augmentation. Augmentation for graph-structured data still remains under-explored, with some work along these lines but requiring prohibitive additional computation cost Verma et al. (2019). Traditional self-training methods Verma et al. (2019); Li et al. (2018) utilize the trained model to annotate unlabelled data; Ding et al. (2018) proposes to train a generator-classifier network in the adversarial learning setting to generate fake nodes; and Deng et al. (2019); Feng et al. (2019) generate adversarial perturbations to node features over the graph structure.

Pre-training GNNs. Although (self-supervised) pre-training is a common and effective scheme for convolutional neural networks (CNNs) Goyal et al. (2019); Kolesnikov et al. (2019); Chen et al. (2020b), it is rarely explored for GNNs. One exception, Hu et al. (2019), is restricted to studying pre-training strategies in the transfer learning setting. We argue that a pre-trained GNN is not easy to transfer, due to the diverse fields that graph-structured data are sourced from. During transfer, substantial domain knowledge is required for both pre-training and downstream tasks; otherwise it might lead to negative transfer Hu et al. (2019); Rosenstein et al. (2005).

Contrastive learning. The main idea of contrastive learning is to make representations agree with each other under proper transformations, raising a recent surge of interest in visual representation learning Becker and Hinton (1992); Wu et al. (2018); Ye et al. (2019); Ji et al. (2019); Chen et al. (2020b). On a parallel note, for graph data, traditional methods trying to reconstruct the adjacency information of vertices Kipf and Welling (2016b); Hamilton et al. (2017) can be treated as a kind of “local contrast”, while over-emphasizing the proximity information at the expense of the structural information Ribeiro et al. (2017). Motivated by Belghazi et al. (2018); Hjelm et al. (2018), Ribeiro et al. (2017); Sun et al. (2019); Peng et al. (2020a) propose to perform contrastive learning between local and global representations to better capture structure information. However, graph contrastive learning has not been explored from the perspective of enforcing perturbation invariance as Ji et al. (2019); Chen et al. (2020b) have done.

3 Methodology

3.1 Data Augmentation for Graphs

Data augmentation aims at creating novel and realistically rational data through applying certain transformations without affecting the semantic label. It still remains under-explored for graphs, except for some approaches with expensive computation cost (see Sec. 2). We focus on graph-level augmentations. Given a graph $\mathcal{G}$ in the dataset, we formulate the augmented graph $\hat{\mathcal{G}}$ as satisfying $\hat{\mathcal{G}} \sim q(\hat{\mathcal{G}} \mid \mathcal{G})$, where $q(\cdot \mid \mathcal{G})$ is the augmentation distribution conditioned on the original graph, which is pre-defined and represents the human prior for the data distribution. For instance, for image classification, applying rotation and cropping encodes the prior that people will acquire the same classification-based semantic knowledge from a rotated image or its local patches Xie et al. (2019); Berthelot et al. (2019).

When it comes to graphs, the same spirit could be followed. However, one challenge as stated in Sec. 1 is that graph datasets are abstracted from diverse fields and therefore there may not be universally appropriate data augmentation as those for images. In other words, for different categories of graph datasets some data augmentations might be more desired than others. We mainly focus on three categories: biochemical molecules (e.g. chemical compounds, proteins) Hu et al. (2019), social networks Kipf and Welling (2016a) and image super-pixel graphs Dwivedi et al. (2020). Next, we propose four general data augmentations for graph-structured data and discuss the intuitive priors that they introduce.

Data augmentation Type Underlying Prior
Node dropping Nodes, edges Vertex missing does not alter semantics.
Edge perturbation Edges Semantic robustness against connectivity variations.
Attribute masking Nodes Semantic robustness against losing partial attributes per node.
Subgraph Nodes, edges Local structure can hint at the full semantics.
Table 1: Overview of data augmentations for graphs.

Node dropping. Given the graph $\mathcal{G}$, node dropping randomly discards a certain portion of vertices along with their connections. The underlying prior enforced by it is that missing part of the vertices does not affect the semantic meaning of $\mathcal{G}$. Each node's dropping probability follows a default i.i.d. uniform distribution (or any other distribution).

Edge perturbation. It perturbs the connectivities in $\mathcal{G}$ through randomly adding or dropping a certain ratio of edges. It implies that the semantic meaning of $\mathcal{G}$ has certain robustness to variations in the edge connectivity pattern. We also follow an i.i.d. uniform distribution to add/drop each edge.

Attribute masking. Attribute masking prompts models to recover masked vertex attributes using their context information, i.e., the remaining attributes. The underlying assumption is that missing partial vertex attributes does not affect the model predictions much.

Subgraph. This one samples a subgraph from $\mathcal{G}$ using random walk (the algorithm is summarized in Appendix A). It assumes that the semantics of $\mathcal{G}$ can be largely preserved in its (partial) local structure.

The default augmentation (dropping, perturbation, masking and subgraph) ratio is set at 0.2.
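To illustrate the four operations and the default ratio of 0.2 concretely, below is a minimal NumPy sketch operating on a graph given as a dense adjacency matrix adj and a feature matrix x. The function names, the dense representation, and the reading of the subgraph ratio as the fraction of nodes left out are assumptions made for illustration only, not the released implementation.

```python
import numpy as np

def node_drop(adj, x, ratio=0.2):
    """Randomly discard a portion of vertices along with their connections."""
    keep = np.random.rand(adj.shape[0]) >= ratio            # i.i.d. uniform dropping
    return adj[np.ix_(keep, keep)], x[keep]

def edge_perturb(adj, ratio=0.2):
    """Randomly drop a ratio of existing edges and add the same number of new ones."""
    adj, n = adj.copy(), adj.shape[0]
    rows, cols = np.where(np.triu(adj, k=1) > 0)
    num = int(ratio * len(rows))
    drop = np.random.choice(len(rows), num, replace=False)  # drop existing edges
    adj[rows[drop], cols[drop]] = adj[cols[drop], rows[drop]] = 0
    for _ in range(num):                                     # add random new edges
        i, j = np.random.randint(n, size=2)
        if i != j:
            adj[i, j] = adj[j, i] = 1
    return adj

def attr_mask(x, ratio=0.2):
    """Mask (zero out) the attribute vectors of a random subset of vertices."""
    x = x.copy()
    x[np.random.rand(x.shape[0]) < ratio] = 0.0
    return x

def subgraph(adj, x, ratio=0.2, seed=0):
    """Random-walk subgraph; `ratio` is read here as the fraction of nodes removed."""
    n = adj.shape[0]
    target = max(1, int((1.0 - ratio) * n))
    visited, current = {seed}, seed
    while len(visited) < target:
        neighbors = np.flatnonzero(adj[current])
        current = np.random.choice(neighbors) if len(neighbors) else np.random.randint(n)
        visited.add(current)
    keep = sorted(visited)
    return adj[np.ix_(keep, keep)], x[keep]
```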

Figure 1: A framework of graph contrastive learning. Two graph augmentations $q_i(\cdot \mid \mathcal{G})$ and $q_j(\cdot \mid \mathcal{G})$ are sampled from an augmentation pool and applied to the input graph $\mathcal{G}$. A shared GNN-based encoder $f(\cdot)$ and a projection head $g(\cdot)$ are trained to maximize the agreement between representations $z_i$ and $z_j$ via a contrastive loss.

3.2 Graph Contrastive Learning

Motivated by recent contrastive learning developments in visual representation learning (see Sec. 2), we propose a graph contrastive learning framework (GraphCL) for (self-supervised) pre-training of GNNs. In graph contrastive learning, pre-training is performed through maximizing the agreement between two augmented views of the same graph via a contrastive loss in the latent space as shown in Fig. 1. The framework consists of the following four major components:

(1) Graph data augmentation. The given graph $\mathcal{G}$ undergoes graph data augmentations to obtain two correlated views $\hat{\mathcal{G}}_i$ and $\hat{\mathcal{G}}_j$ as a positive pair, where $\hat{\mathcal{G}}_i \sim q_i(\cdot \mid \mathcal{G})$ and $\hat{\mathcal{G}}_j \sim q_j(\cdot \mid \mathcal{G})$ independently. For different domains of graph datasets, how to strategically select data augmentations matters (Sec. 4).

(2) GNN-based encoder. A GNN-based encoder $f(\cdot)$ (defined in (2)) extracts graph-level representation vectors $\boldsymbol{h}_i, \boldsymbol{h}_j$ for the augmented graphs $\hat{\mathcal{G}}_i, \hat{\mathcal{G}}_j$. Graph contrastive learning does not apply any constraint on the GNN architecture.

(3) Projection head. A non-linear transformation $g(\cdot)$ named projection head maps the augmented representations to another latent space where the contrastive loss is calculated, as advocated in Chen et al. (2020b). In graph contrastive learning, a two-layer perceptron (MLP) is applied to obtain $z_i, z_j$.

(4) Contrastive loss function. A contrastive loss function $\mathcal{L}(\cdot)$ is defined to enforce maximizing the consistency between positive pairs $z_i, z_j$ compared with negative pairs. Here we utilize the normalized temperature-scaled cross entropy loss (NT-Xent) Sohn (2016); Wu et al. (2018); Oord et al. (2018).

During GNN pre-training, a minibatch of $N$ graphs is randomly sampled and processed through contrastive learning, resulting in $2N$ augmented graphs and the corresponding contrastive loss to optimize. Negative pairs are not explicitly sampled but generated from the other augmented graphs within the same minibatch, as in Chen et al. (2017, 2020b). Denoting the cosine similarity function as $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\lVert z_i \rVert \lVert z_j \rVert)$, the NT-Xent loss for the $n$-th positive pair of examples is defined as:

$$\ell_n = -\log \frac{\exp\big(\mathrm{sim}(z_{n,i}, z_{n,j}) / \tau\big)}{\sum_{n'=1, n' \neq n}^{N} \exp\big(\mathrm{sim}(z_{n,i}, z_{n',j}) / \tau\big)}, \qquad (3)$$

where $\tau$ denotes the temperature parameter. The final loss is computed across all positive pairs in the minibatch. The proposed graph contrastive learning algorithm is summarized in Appendix A.
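For clarity, below is a minimal PyTorch sketch of the NT-Xent loss in (3), assuming z1 and z2 stack the projected representations of the two augmented views of the minibatch in matching row order; it is a direct transcription of the definition above rather than the released code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent of Eq. (3): row n of z1 and z2 are the two views of graph n.

    The positive pair is (z_{n,i}, z_{n,j}); the negatives are the other N-1
    augmented graphs z_{n',j} (n' != n) within the same minibatch.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / temperature                 # [N, N] cosine similarities / tau
    pos = sim.diag()                                # sim(z_{n,i}, z_{n,j}) / tau
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # denominator of Eq. (3): sum over n' != n (diagonal masked out)
    denom = torch.logsumexp(sim.masked_fill(diag, float('-inf')), dim=1)
    return (denom - pos).mean()                     # average of the -log ratios
```

In the framework of Fig. 1, z1 and z2 would be obtained by applying the projection head to the encoder outputs of the two augmented views of every graph in the minibatch.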

Discussion. We first show that GraphCL can be viewed as one way of performing mutual information maximization between the latent representations of two kinds of augmented graphs. The full derivation is in Appendix F, with the loss form rewritten as below:

(4)

The above loss essentially maximizes a lower bound of the mutual information between the latent representations of the two views, where the compositions of augmentations determine our desired views of graphs. Furthermore, we draw the connection between GraphCL and recently proposed contrastive learning methods by demonstrating that GraphCL can be rewritten as a general framework unifying a broad family of contrastive learning methods on graph-structured data, through reinterpreting (4). In our implementation, we generate the two views through data augmentation, while various other choices of view compositions result in (4) instantiating other specific contrastive learning algorithms, including Velickovic et al. (2019); Ren et al. (2019); Park et al. (2020); Sun et al. (2019); Peng et al. (2020b); Hassani and Khasahmadi (2020); Qiu et al. (2020), as shown in Appendix F.

4 The Role of Data Augmentation in Graph Contrastive Learning

Datasets Category Graph Num. Avg. Node Num. Avg. Degree
NCI1 Biochemical Molecules 4110 29.87 1.08
PROTEINS Biochemical Molecules 1113 39.06 1.86
COLLAB Social Networks 5000 74.49 32.99
RDT-B Social Networks 2000 429.63 1.15
Table 2: Datasets statistics.

In this section, we assess and rationalize the role of data augmentation for graph-structured data in our GraphCL framework. Various pairs of augmentation types are applied, as illustrated in Fig. 2, to three categories of graph datasets (Table 2, and we leave the discussion on superpixel graphs in Appendix C). Experiments are performed in the semi-supervised setting, following the pre-training & finetuning approach Chen et al. (2020b). Detailed settings are in Appendix B.

4.1 Data Augmentations are Crucial. Composing Augmentations Benefits.

We first examine whether and when applying (different) data augmentations helps graph contrastive learning in general. We summarize the results in Fig. 2 using the accuracy gain compared to training from scratch (no pre-training), and we list the following observations.

Figure 2: Semi-supervised learning accuracy gain (%) when contrasting different augmentation pairs, compared to training from scratch, on four datasets: NCI1, PROTEINS, COLLAB, and RDT-B. Pairing “Identical” stands for a no-augmentation baseline for contrastive learning, where the positive pair diminishes and the negative pair consists of two non-augmented graphs. Warmer colors indicate better performance gains. The baseline training-from-scratch accuracies are 60.72%, 70.40%, 57.46%, and 86.63% for the four datasets, respectively.

Obs. 1. Data augmentations are crucial in graph contrastive learning. Without any data augmentation, graph contrastive learning is not helpful and is often worse than training from scratch, judging from the accuracy losses in the upper-right corners of Fig. 2. In contrast, composing an original graph and its appropriate augmentation can benefit the downstream performance. Judging from the top rows or the right-most columns in Fig. 2, graph contrastive learning with the single best augmentation achieved considerable improvement without exhaustive hyper-parameter tuning: 1.62% for NCI1, 3.15% for PROTEINS, 6.27% for COLLAB, and 1.66% for RDT-B.

The observation matches our intuition. Without augmentation, GraphCL simply compares two original samples as a negative pair (with the positive-pair loss becoming zero), which homogeneously pushes all graph representations away from each other and is hard to justify. Importantly, when appropriate augmentations are applied, the corresponding priors on the data distribution are instilled, enforcing the model to learn representations invariant to the desired perturbations through maximizing the agreement between a graph and its augmentation.

Figure 3: Contrastive loss curves for different augmentation pairs. In the two figures on the left, attribute masking is contrasted with other augmentations; in the two on the right, edge perturbation is. Contrasting the same augmentation always leads to the fastest loss descent.

Obs. 2. Composing different augmentations benefits more. Composing augmentation pairs of a graph, rather than contrasting the graph with its augmentation, further improves the performance: the maximum accuracy gain was 2.10% for NCI1, 3.15% for PROTEINS, 7.11% for COLLAB, and 1.85% for RDT-B. Interestingly, applying augmentation pairs of the same type (see the diagonals of Fig. 2) does not usually lead to the best performance (except for node dropping), compared with augmentation pairs of different types (off-diagonals). Similar observations were made in visual representation learning Chen et al. (2020b). As conjectured in Chen et al. (2020b), composing different augmentations prevents the learned features from trivially overfitting low-level “shortcuts”, making the features more generalizable.

Here we make a similar conjecture that contrasting isogenous graph pairs augmented in different types presents a harder, albeit more useful, task for graph representation learning. We thus plot the contrastive loss curves of composing various augmentations (except subgraph) together with attribute masking or edge perturbation for NCI1 and PROTEINS. Fig. 3 shows that, with augmentation pairs of different types, the contrastive loss always descends more slowly than with pairs of the same type, when the optimization procedure remains the same. This result indicates that composing augmentation pairs of different types does correspond to a “harder” contrastive prediction task. We will explore in Sec. 4.3 how to quantify a “harder” task in some cases and whether it always helps.

4.2 The Types, the Extent, and the Patterns of Effective Graph Augmentations

We then note that the (most) beneficial combinations of augmentation types can be dataset-specific, which matches our intuition as graph-structured data are of highly heterogeneous nature (see Sec. 1). We summarize our observations and derive insights below, and we further analyze the impact of the extent and/or the pattern of given types of graph augmentations.

Obs. 3. Edge perturbation benefits social networks but hurts some biochemical molecules. Edge perturbation as one of the paired augmentations improves the performances for the social-network data COLLAB and RDT-B as well as the biomolecule data PROTEINS, but hurts the other biomolecule data NCI1. We hypothesize that, compared to the case of social networks, the “semantemes” of some biomolecule data are more sensitive to individual edges. Specifically, a single-edge change in NCI1 corresponds to the removal or addition of a covalent bond, which can drastically change the identity and even the validity of a compound, let alone its property for the down-stream semantemes. In contrast, the semantemes of social networks are more tolerant of individual edge perturbations Dai et al. (2018); Zügner et al. (2018). Therefore, for chemical compounds, edge perturbation imposes a prior that is conceptually incompatible with the domain knowledge and empirically unhelpful for down-stream performance.

We further examine whether the extent or strength of edge perturbation affects the conclusion above. We evaluate the downstream performances on the representative examples NCI1 and COLLAB, using the combination of the original graph (“identical”) and edge perturbation of various ratios in our GraphCL framework. Fig. 4A shows that edge perturbation worsens the NCI1 performances regardless of augmentation strength, confirming that our earlier conclusion is insensitive to the extent of edge perturbation. Fig. 4B suggests that edge perturbation could improve the COLLAB performances more with increasing augmentation strength.

Obs. 4. Applying attribute masking achieves better performance in denser graphs. For the social network datasets, composing the identical graph and attribute masking achieves 5.12% improvement for COLLAB (with higher average degree) while only 0.17% for RDT-B. Similar observations are made for the denser PROTEINS versus NCI1. To assess the impact of augmentation strength on this observation, we perform similar experiments on RDT-B and COLLAB, by composing the identical graph and its attributes masked to various extents. Fig. 4C and D show that, masking less for the very sparse RDT-B does not help, although masking more for the very dense COLLAB does.

Figure 4: Performance versus augmentation strength. The left two figures apply edge perturbation with different ratios; the right two apply attribute masking with different masking ratios.

We further hypothesize that masking patterns also matter, and that masking more hub nodes with high degrees benefits denser graphs, because GNNs cannot reconstruct the missing information of isolated nodes according to the message passing mechanism Gilmer et al. (2017). To test the hypothesis, we perform an experiment that masks nodes with more connections with higher probability on the denser graphs PROTEINS and COLLAB. Specifically, we adopt a masking distribution proportional to $\deg(v_n)^{\alpha}$ rather than the uniform distribution, where $\deg(v_n)$ is the degree of vertex $v_n$ and $\alpha$ is the control factor. A positive $\alpha$ indicates more masking for high-degree nodes. Fig. 5C and D show that, for the very dense COLLAB, there is an apparent upward tendency in performance when masking nodes with more connections.
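Below is a minimal sketch of this degree-biased attribute masking, assuming the masking probability is taken proportional to $\deg(v_n)^{\alpha}$ as described above (the exact normalization and implementation details may differ); the function name is hypothetical.

```python
import numpy as np

def degree_biased_mask(adj, x, mask_ratio=0.2, alpha=1.0):
    """Mask vertex attributes with probability proportional to deg(v)^alpha.

    alpha > 0 masks hub (high-degree) nodes more often, alpha < 0 favors
    low-degree nodes, and alpha = 0 recovers the uniform masking scheme.
    """
    x = x.copy()
    deg = adj.sum(axis=1) + 1.0            # +1 keeps isolated nodes well-defined
    prob = deg ** alpha
    prob = prob / prob.sum()
    num_mask = max(1, int(mask_ratio * adj.shape[0]))
    masked = np.random.choice(adj.shape[0], size=num_mask, replace=False, p=prob)
    x[masked] = 0.0
    return x
```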

Obs. 5. Node dropping and subgraph are generally beneficial across datasets. Node dropping and subgraph, especially the latter, seem to be generally beneficial in our studied datasets. Node dropping emphasizes the prior that missing certain vertices (e.g. some hydrogen atoms in chemical compounds or edge users in social networks) does not alter the semantic information, which intuitively matches our cognition. For subgraph, previous works Veličković et al. (2018); Sun et al. (2019) show that enforcing consistency between local (the subgraphs we extract) and global information is helpful for representation learning, which explains the observation. Even for chemical compounds in NCI1, subgraphs can represent structural and functional “motifs” important for the down-stream semantemes.

We similarly examined the impact of node-dropping patterns by adopting the non-uniform distribution described above for attribute-masking patterns. Fig. 5B shows that, for the dense social-network COLLAB graphs, more GraphCL improvement was observed when dropping hub nodes more, in the range considered. Fig. 5A shows that, for the not-so-dense PROTEINS graphs, changing the node-dropping distribution away from uniform does not necessarily help.

Figure 5: Performance versus augmentation patterns. Node dropping and attribute masking are performed with various control factors (negative to positive: dropping/masking more low-degree vertices to high-degree ones).

4.3 Unlike “Harder” Ones, Overly Simple Contrastive Tasks Do Not Help.

As discussed in Obs. 2, “harder” contrastive learning might benefit more, where the “harder” task is achieved by composing augmentations of different types. In this section we further explore quantifiable difficulty in relationship to parameterized augmentation strengths/patterns and assess the impact of the difficulty on performance improvement.

Intuitively, larger dropping/masking ratios or larger control factors $\alpha$ lead to harder contrastive tasks, which did result in better COLLAB performances (Fig. 4 and 5) in the range considered. In contrast, very small ratios or negative $\alpha$, corresponding to overly simple tasks, do not help. We also design subgraph variants of increasing difficulty levels and reach similar conclusions. More details are in Appendix D.

Summary. In total, we choose the augmentation pools for Section 5 as follows: node dropping and subgraph for biochemical molecules; all augmentations for dense social networks; and all except attribute masking for sparse social networks. Strengths and patterns are kept at their defaults, even though varying them could help more.

5 Comparison with the State-of-the-art Methods

In this section, we compare our proposed (self-supervised) pre-training framework, GraphCL, with state-of-the-art methods (SOTAs) in the settings of semi-supervised, unsupervised Sun et al. (2019) and transfer learning Hu et al. (2019) on graph classification (for node classification experiments please refer to Appendix G). Dataset statistics and training details for the specific settings are in Appendix E.

Semi-supervised learning. We first evaluate our proposed framework in the semi-supervised learning setting on graph classification Chen et al. (2019); Xu et al. (2018) on the benchmark TUDataset Morris et al. (2020). Since pre-training & finetuning for semi-supervised graph-level tasks has not been explored before, we take two conventional network embedding methods as pre-training tasks for comparison: adjacency information reconstruction (we refer to GAE Kipf and Welling (2016b) for implementation) and local & global representation consistency enforcement (we refer to Infomax Sun et al. (2019) for implementation). Besides, the performances of training from scratch and of training with augmentations (without contrasting) are also reported. We adopt the graph convolutional network (GCN) with the default setting in Chen et al. (2019) as the GNN-based encoder, which achieves comparable SOTA performance in the fully-supervised setting. Table 3 shows that GraphCL outperforms traditional pre-training schemes.

Dataset NCI1 PROTEINS DD COLLAB RDT-B RDT-M5K GITHUB MNIST CIFAR10
1% baseline 60.72±0.45 - - 57.46±0.25 - - 54.25±0.22 60.39±1.95 27.36±0.75
1% Aug. 60.49±0.46 - - 58.40±0.97 - - 56.36±0.42 67.43±0.36 27.39±0.44
1% GAE 61.63±0.84 - - 63.20±0.67 - - 59.44±0.44 57.58±2.07 21.09±0.53
1% Infomax 62.72±0.65 - - 61.70±0.77 - - 58.99±0.50 63.24±0.78 27.86±0.43
1% GraphCL 62.55±0.86 - - 64.57±1.15 - - 58.56±0.59 83.41±0.33 30.01±0.84
10% baseline 73.72±0.24 70.40±1.54 73.56±0.41 73.71±0.27 86.63±0.27 51.33±0.44 60.87±0.17 79.71±0.65 35.78±0.81
10% Aug. 73.59±0.32 70.29±0.64 74.30±0.81 74.19±0.13 87.74±0.39 52.01±0.20 60.91±0.32 83.99±2.19 34.24±2.62
10% GAE 74.36±0.24 70.51±0.17 74.54±0.68 75.09±0.19 87.69±0.40 53.58±0.13 63.89±0.52 86.67±0.93 36.35±1.04
10% Infomax 74.86±0.26 72.27±0.40 75.78±0.34 73.76±0.29 88.66±0.95 53.61±0.31 65.21±0.88 83.34±0.24 41.07±0.48
10% GraphCL 74.63±0.25 74.17±0.34 76.17±1.37 74.23±0.21 89.11±0.19 52.55±0.45 65.81±0.79 93.11±0.17 43.87±0.77
Table 3: Semi-supervised learning with pre-training & finetuning. Red numbers indicate the best performance and those that overlap with the standard deviation of the best performance (comparable ones). 1% or 10% is the label rate; baseline and Aug. represent training from scratch without and with augmentations, respectively.

Unsupervised representation learning. Furthermore, GraphCL is evaluated in unsupervised representation learning following Narayanan et al. (2017); Sun et al. (2019), where unsupervised methods generate graph embeddings that are fed into a down-stream SVM classifier Sun et al. (2019). Aside from SOTA graph kernel methods, namely the graphlet kernel (GL), Weisfeiler-Lehman sub-tree kernel (WL) and deep graph kernel (DGK), we also compare with four unsupervised graph-level representation learning methods: node2vec Grover and Leskovec (2016), sub2vec Adhikari et al. (2018), graph2vec Narayanan et al. (2017) and InfoGraph Sun et al. (2019). We adopt the graph isomorphism network (GIN) with the default setting in Sun et al. (2019) as the GNN-based encoder, which is SOTA in representation learning. Table 4 shows that GraphCL outperforms the compared methods in most cases, except on datasets with small graph size (e.g. MUTAG and IMDB-B consist of graphs with an average node number of less than 20).

Dataset NCI1 PROTEINS DD MUTAG COLLAB RDT-B RDT-M5K IMDB-B
GL - - - 81.66±2.11 - 77.34±0.18 41.01±0.17 65.87±0.98
WL 80.01±0.50 72.92±0.56 - 80.72±3.00 - 68.82±0.41 46.06±0.21 72.30±3.44
DGK 80.31±0.46 73.30±0.82 - 87.44±2.72 - 78.04±0.39 41.27±0.18 66.96±0.56
node2vec 54.89±1.61 57.49±3.57 - 72.63±10.20 - - - -
sub2vec 52.84±1.47 53.03±5.55 - 61.05±15.80 - 71.48±0.41 36.68±0.42 55.26±1.54
graph2vec 73.22±1.81 73.30±2.05 - 83.15±9.25 - 75.78±1.03 47.86±0.26 71.10±0.54
InfoGraph 76.20±1.06 74.44±0.31 72.85±1.78 89.01±1.13 70.65±1.13 82.50±1.42 53.46±1.03 73.03±0.87
GraphCL 77.87±0.41 74.39±0.45 78.62±0.40 86.80±1.34 71.36±1.15 89.53±0.84 55.99±0.28 71.14±0.44
Table 4: Comparing classification accuracy on top of graph representations learned from graph kernels, SOTA representation learning methods, and GIN pre-trained with GraphCL. The compared numbers are from the corresponding papers under the same experiment setting.

Transfer learning. Lastly, experiments are performed on transfer learning for molecular property prediction in chemistry and protein function prediction in biology following Hu et al. (2019), which pre-trains and finetunes the model on different datasets to evaluate the transferability of the pre-training scheme. We adopt GIN with the default setting in Hu et al. (2019) as the GNN-based encoder, which is SOTA in transfer learning. Experiments are repeated 10 times, with the mean and standard deviation of ROC-AUC scores (%) reported as in Hu et al. (2019). Although there is no universally beneficial pre-training scheme, especially for the out-of-distribution scenario in transfer learning (Sec. 1), Table 5 shows that GraphCL still achieves SOTA performance on 5 of 9 datasets compared to the previous best schemes.

Dataset BBBP Tox21 ToxCast SIDER ClinTox MUV HIV BACE PPI
No Pre-Train 65.8±4.5 74.0±0.8 63.4±0.6 57.3±1.6 58.0±4.4 71.8±2.5 75.3±1.9 70.1±5.4 64.8±1.0
Infomax 68.8±0.8 75.3±0.5 62.7±0.4 58.4±0.8 69.9±3.0 75.3±2.5 76.0±0.7 75.9±1.6 64.1±1.5
EdgePred 67.3±2.4 76.0±0.6 64.1±0.6 60.4±0.7 64.1±3.7 74.1±2.1 76.3±1.0 79.9±0.9 65.7±1.3
AttrMasking 64.3±2.8 76.7±0.4 64.2±0.5 61.0±0.7 71.8±4.1 74.7±1.4 77.2±1.1 79.3±1.6 65.2±1.6
ContextPred 68.0±2.0 75.7±0.7 63.9±0.6 60.9±0.6 65.9±3.8 75.8±1.7 77.3±1.0 79.6±1.2 64.4±1.3
GraphCL 69.68±0.67 73.87±0.66 62.40±0.57 60.53±0.88 75.99±2.65 69.80±2.66 78.47±1.22 75.38±1.44 67.88±0.85
Table 5: Transfer learning comparison with different manually designed pre-training schemes, where the compared numbers are from Hu et al. (2019).

Adversarial robustness. In addition to generalizability, we claim that GNNs also gain robustness using GraphCL. The experiments are performed on synthetic data to classify the component number in graphs, facing the RandSampling, GradArgmax and RL-S2V attacks following the default setting in Dai et al. (2018). Structure2vec Dai et al. (2016) is adopted as the GNN-based encoder as in Dai et al. (2018). Table 6 shows that GraphCL boosts GNN robustness compared with training from scratch, under three evasion attacks.

Two-Layer Three-Layer Four-Layer
Methods No Pre-Train GraphCL No Pre-Train GraphCL No Pre-Train GraphCL
Unattack 93.20 94.73 98.20 98.33 98.87 99.00
RandSampling 78.73 80.68 92.27 92.60 95.13 97.40
GradArgmax 69.47 69.26 64.60 89.33 95.80 97.00
RL-S2V 42.93 42.20 41.93 61.66 70.20 84.86
Table 6: Adversarial performance under three adversarial attacks for GNN with different depth (following the protocol in Dai et al. (2018)). Red numbers indicate the best performance.

6 Conclusion

In this paper, we perform an explicit study exploring contrastive learning for GNN pre-training, facing the unique challenges in graph-structured data. First, several graph data augmentations are proposed, with a discussion of how each introduces a certain human prior about the data distribution. Along with the new augmentations, we propose a novel graph contrastive learning framework (GraphCL) for GNN pre-training to facilitate invariant representation learning, together with rigorous theoretical analysis. We systematically assess and analyze the influence of data augmentations in our proposed framework, revealing their rationale and guiding the choice of augmentations. Experimental results verify the state-of-the-art performance of our proposed framework in both generalizability and robustness.

Broader Impact

Empowering deep learning for reasoning and predicting over graph-structured data is of broad interest and has wide applications, such as recommendation systems, neural architecture search, and drug discovery. The proposed graph contrastive learning framework with augmentations contributes a general framework that can potentially benefit the effectiveness and efficiency of graph neural networks through model pre-training. The numerical results and analyses would also inspire the design of proper augmentations toward positive knowledge transfer on downstream tasks.

References

  • [1] B. Adhikari, Y. Zhang, N. Ramakrishnan, and B. A. Prakash (2018) Sub2vec: feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 170–182. Cited by: §5.
  • [2] S. Becker and G. E. Hinton (1992) Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355 (6356), pp. 161–163. Cited by: §2.
  • [3] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062. Cited by: §2.
  • [4] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050–5060. Cited by: §3.1.
  • [5] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238. Cited by: §1.
  • [6] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang (2020) Adversarial robustness: from self-supervised pre-training to fine-tuning. arXiv preprint arXiv:2003.12862. Cited by: §1.
  • [7] T. Chen, S. Bian, and Y. Sun (2019) Are powerful graph neural nets necessary? a dissection on graph classification. arXiv preprint arXiv:1905.04579. Cited by: §5.
  • [8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §1, §2, §2, §3.2, §3.2, §4.1, §4.
  • [9] T. Chen, Y. Sun, Y. Shi, and L. Hong (2017) On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 767–776. Cited by: §3.2.
  • [10] H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pp. 2702–2711. Cited by: §5.
  • [11] H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song (2018) Adversarial attack on graph structured data. arXiv preprint arXiv:1806.02371. Cited by: §4.2, Table 6, §5.
  • [12] Z. Deng, Y. Dong, and J. Zhu (2019) Batch virtual adversarial training for graph convolutional networks. arXiv preprint arXiv:1902.09192. Cited by: §2.
  • [13] M. Ding, J. Tang, and J. Zhang (2018) Semi-supervised learning on graphs with generative adversarial nets. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 913–922. Cited by: §2.
  • [14] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982. Cited by: §1, §1, §3.1.
  • [15] D. Erhan, P. Manzagol, Y. Bengio, S. Bengio, and P. Vincent (2009) The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pp. 153–160. Cited by: §1.
  • [16] F. Feng, X. He, J. Tang, and T. Chua (2019) Graph adversarial training: dynamically regularizing based on graph structure. IEEE Transactions on Knowledge and Data Engineering. Cited by: §2.
  • [17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §4.2.
  • [18] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §1, §1.
  • [19] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235. Cited by: §1, §2.
  • [20] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §5.
  • [21] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §1, §1, §2.
  • [22] K. Hassani and A. H. Khasahmadi (2020) Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582. Cited by: §3.2.
  • [23] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1.
  • [24] R. Herzig, A. Bar, H. Xu, G. Chechik, T. Darrell, and A. Globerson (2019) Learning canonical representations for scene graph to image generation. arXiv preprint arXiv:1912.07414. Cited by: §1.
  • [25] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.
  • [26] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: §1.
  • [27] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2019) Pre-training graph neural networks. arXiv preprint arXiv:1905.12265. Cited by: §1, §1, §2, §3.1, Table 5, §5, §5.
  • [28] Z. Hu, Y. Dong, K. Wang, K. Chang, and Y. Sun (2020) GPT-gnn: generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857–1867. Cited by: §1.
  • [29] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874. Cited by: §1, §2.
  • [30] W. Jin, T. Derr, H. Liu, Y. Wang, S. Wang, Z. Liu, and J. Tang (2020) Self-supervised learning on graphs: deep insights and new direction. arXiv preprint arXiv:2006.10141. Cited by: §1.
  • [31] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2, §2, §3.1.
  • [32] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §1, §1, §2, §5.
  • [33] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005. Cited by: §1, §2.
  • [34] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
  • [35] M. Liu, H. Gao, and S. Ji (2020) Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 338–348. Cited by: §1.
  • [36] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang (2020) Self-supervised learning: generative or contrastive. arXiv preprint arXiv:2006.08218. Cited by: §1.
  • [37] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann (2020) TUDataset: a collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), External Links: 2007.08663, Link Cited by: §5.
  • [38] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §5.
  • [39] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §1.
  • [40] K. Oono and T. Suzuki (2019) Graph neural networks exponentially lose expressive power for node classification. arXiv preprint cs.LG/1905.10947. Cited by: §1.
  • [41] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.2.
  • [42] C. Park, D. Kim, J. Han, and H. Yu (2020) Unsupervised attributed multiplex network embedding.. In AAAI, pp. 5371–5378. Cited by: §3.2.
  • [43] Z. Peng, Y. Dong, M. Luo, X. Wu, and Q. Zheng (2020) Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604. Cited by: §2.
  • [44] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang (2020) Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pp. 259–270. Cited by: §3.2.
  • [45] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang (2020) GCC: graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1150–1160. Cited by: §3.2.
  • [46] Y. Ren, B. Liu, C. Huang, P. Dai, L. Bo, and J. Zhang (2019) Heterogeneous deep graph infomax. arXiv preprint arXiv:1911.08538. Cited by: §3.2.
  • [47] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo (2017) Struc2vec: learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394. Cited by: §1, §2.
  • [48] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich (2005) To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, Vol. 898, pp. 1–4. Cited by: §2.
  • [49] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems, pp. 1857–1865. Cited by: §3.2.
  • [50] F. Sun, J. Hoffmann, and J. Tang (2019) InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000. Cited by: §1, §2, §3.2, §4.2, §5, §5, §5.
  • [51] T. H. Trinh, M. Luong, and Q. V. Le (2019) Selfie: self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940. Cited by: §1.
  • [52] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §2, §2.
  • [53] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §1, §4.2.
  • [54] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax.. In ICLR (Poster), Cited by: §3.2.
  • [55] V. Verma, M. Qu, A. Lamb, Y. Bengio, J. Kannala, and J. Tang (2019) GraphMix: regularized training of graph neural networks for semi-supervised learning. arXiv preprint arXiv:1909.11715. Cited by: §1, §2.
  • [56] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §2, §3.2.
  • [57] Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation. arXiv preprint arXiv:1904.12848. Cited by: §3.1.
  • [58] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §1, §2, §2, §5.
  • [59] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §1, §2.
  • [60] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: §1.
  • [61] Y. You, T. Chen, Z. Wang, and Y. Shen (2020) L-gcn: layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2127–2135. Cited by: §1.
  • [62] Y. You, T. Chen, Z. Wang, and Y. Shen (2020) When does self-supervision help graph convolutional networks?. arXiv preprint arXiv:2006.09136. Cited by: §1.
  • [63] J. Zhang, H. Zhang, L. Sun, and C. Xia (2020) Graph-bert: only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140. Cited by: §1.
  • [64] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: §1.
  • [65] Q. Zhu, B. Du, and P. Yan (2020) Self-supervised training of graph convolutional networks. arXiv preprint arXiv:2006.02380. Cited by: §1.
  • [66] M. Zitnik, J. Leskovec, et al. (2018) Prioritizing network communities. Nature communications 9 (1), pp. 1–9. Cited by: §1.
  • [67] D. Zou, Z. Hu, Y. Wang, S. Jiang, Y. Sun, and Q. Gu (2019) Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 11249–11259. Cited by: §1.
  • [68] D. Zügner, A. Akbarnejad, and S. Günnemann (2018) Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2847–2856. Cited by: §4.2.