[NeurIPS 2020] "Graph Contrastive Learning with Augmentations" by Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, Yang Shen
Generalizable, transferrable, and robust representation learning on graph-structured data remains a challenge for current graph neural networks (GNNs). Unlike what has been developed for convolutional neural networks (CNNs) for image data, self-supervised learning and pre-training are less explored for GNNs. In this paper, we propose a graph contrastive learning (GraphCL) framework for learning unsupervised representations of graph data. We first design four types of graph augmentations to incorporate various priors. We then systematically study the impact of various combinations of graph augmentations on multiple datasets, in four different settings: semi-supervised, unsupervised, and transfer learning as well as adversarial attacks. The results show that, even without tuning augmentation extents nor using sophisticated GNN architectures, our GraphCL framework can produce graph representations of similar or better generalizability, transferrability, and robustness compared to state-of-the-art methods. We also investigate the impact of parameterized graph augmentation extents and patterns, and observe further performance gains in preliminary experiments. Our codes are available at https://github.com/Shen-Lab/GraphCL.
Graph neural networks (GNNs) Kipf and Welling (2016a); Veličković et al. (2017); Xu et al. (2018), following a neighborhood aggregation scheme, are increasingly popular for graph-structured data. Numerous variants of GNNs have been proposed to achieve state-of-the-art performances in graph-based tasks, such as node or link classification Kipf and Welling (2016a); Veličković et al. (2017); You et al. (2020a); Liu et al. (2020a); Zou et al. (2019), link prediction Zhang and Chen (2018) and graph classification Ying et al. (2018); Xu et al. (2018). Intriguingly, in most scenarios of graph-level tasks, GNNs are trained end-to-end under supervision. For GNNs, there is little exploration (except Hu et al. (2019)) of (self-supervised) pre-training, a technique commonly used as a regularizer in training deep architectures that suffer from gradient vanishing/explosion Erhan et al. (2009); Glorot and Bengio (2010). The reasons behind the intriguing phenomena could be that most studied graph datasets, as shown in Dwivedi et al. (2020), are often limited in size and GNNs often have shallow architectures to avoid over-smoothing Li et al. (2018) or “information loss” Oono and Suzuki (2019).
We however argue for the necessity of exploring GNN pre-training schemes. Task-specific labels can be extremely scarce for graph datasets (e.g. in biology and chemistry labeling through wet-lab experiments is often resource- and time-intensive) Zitnik et al. (2018); Hu et al. (2019), and pre-training can be a promising technique to mitigate the issue, as it does in convolutional neural networks (CNNs) Goyal et al. (2019); Kolesnikov et al. (2019); Chen et al. (2020b). As to the conjectured reasons for the lack of GNN pre-training: first, real-world graph data can be huge and even benchmark datasets are recently getting larger Dwivedi et al. (2020); Hu et al. (2020a); second, even for shallow models, pre-training could initialize parameters in a “better” attraction basin around a local minimum associated with better generalization Glorot and Bengio (2010). Therefore, we emphasize the significance of GNN pre-training.
Compared to CNNs for images, there are unique challenges in designing GNN pre-training schemes for graph-structured data. Unlike the geometric information in images, rich structured information of various contexts exists in graph data Veličković et al. (2018); Sun et al. (2019), as graphs are abstracted representations of raw data with diverse nature (e.g. molecules made of chemically-bonded atoms and networks of socially-interacting people). It is thus difficult to design a GNN pre-training scheme generically beneficial to down-stream tasks. A naïve GNN pre-training scheme for graph-level tasks is to reconstruct the vertex adjacency information (e.g. GAE Kipf and Welling (2016b) and GraphSAGE Hamilton et al. (2017) in network embedding). This scheme can be very limited (as seen in Veličković et al. (2018) and our Sec. 5) because it over-emphasizes proximity, which is not always beneficial Veličković et al. (2018) and could hurt structural information Ribeiro et al. (2017). Therefore, a well-designed pre-training framework is needed to capture the highly heterogeneous information in graph-structured data.
Recently, in visual representation learning, contrastive learning has renewed a surge of interest Wu et al. (2018); Ye et al. (2019); Ji et al. (2019); Chen et al. (2020b); He et al. (2020). Self-supervision with handcrafted pretext tasks Noroozi and Favaro (2016); Carlucci et al. (2019); Trinh et al. (2019); Chen et al. (2020a)
relies on heuristics for its design, which could limit the generality of the learned representations. In comparison, contrastive learning aims to learn representations by maximizing feature consistency under differently augmented views that exploit data- or task-specific augmentations Herzig et al. (2019) to inject the desired feature invariance. If extended to pre-training GNNs, this framework can potentially overcome the aforementioned limitations of proximity-based pre-training methods Kipf and Welling (2016b); Hamilton et al. (2017); You et al. (2020b); Jin et al. (2020); Zhu et al. (2020); Zhang et al. (2020); Hu et al. (2020b); Liu et al. (2020b). However, it cannot be directly applied outside visual representation learning and demands significant extensions to graph representation learning, leading to our innovations below.
Our Contributions. In this paper, we develop contrastive learning with augmentations for GNN pre-training to address the challenge of data heterogeneity in graphs. (i) Since data augmentations are a prerequisite for contrastive learning but are under-explored for graph data Verma et al. (2019), we first design four types of graph data augmentations, each imposing a certain prior over graph data and parameterized in its extent and pattern. (ii) Utilizing them to obtain correlated views, we propose a novel graph contrastive learning framework (GraphCL) for GNN pre-training, so that representations invariant to specialized perturbations can be learned for diverse graph-structured data. Moreover, we show that GraphCL in fact performs mutual information maximization, and we draw a connection between GraphCL and recently proposed contrastive learning methods by demonstrating that GraphCL can be rewritten as a general framework unifying a broad family of contrastive learning methods on graph-structured data. (iii) A systematic study is performed to assess the performance of contrasting different augmentations on various types of datasets, revealing the rationale behind the performances and providing guidance for adopting the framework on specific datasets. (iv) Experiments show that GraphCL achieves state-of-the-art performance in the settings of semi-supervised learning, unsupervised representation learning and transfer learning. It additionally boosts robustness against common adversarial attacks.
Graph neural networks. In recent years, graph neural networks (GNNs) Kipf and Welling (2016a); Veličković et al. (2017); Xu et al. (2018) have emerged as a promising approach for analyzing graph-structured data. They follow an iterative neighborhood aggregation (or message passing) scheme to capture the structural information within nodes' neighborhoods. Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ denote an undirected graph, with $\boldsymbol{X} \in \mathbb{R}^{|\mathcal{V}| \times N}$ as the feature matrix where $\boldsymbol{x}_n = \boldsymbol{X}[n, :]^{\top}$ is the $N$-dimensional attribute vector of the node $v_n \in \mathcal{V}$. Considering a $K$-layer GNN $f(\cdot)$, the propagation of the $k$-th layer is represented as:

$$\boldsymbol{a}_n^{(k)} = \mathrm{AGGREGATION}^{(k)}\big(\{\boldsymbol{h}_{n'}^{(k-1)} : n' \in \mathcal{N}(n)\}\big), \quad \boldsymbol{h}_n^{(k)} = \mathrm{COMBINE}^{(k)}\big(\boldsymbol{h}_n^{(k-1)}, \boldsymbol{a}_n^{(k)}\big), \quad (1)$$

where $\boldsymbol{h}_n^{(k)}$ is the embedding of the vertex $v_n$ at the $k$-th layer with $\boldsymbol{h}_n^{(0)} = \boldsymbol{x}_n$, $\mathcal{N}(n)$ is the set of vertices adjacent to $v_n$, and $\mathrm{AGGREGATION}^{(k)}(\cdot)$ and $\mathrm{COMBINE}^{(k)}(\cdot)$ are component functions of the GNN layer. After the $K$-layer propagation, the output embedding for $\mathcal{G}$ is summarized over layer embeddings through the $\mathrm{READOUT}$ function, and a multi-layer perceptron (MLP) is adopted for the graph-level downstream task (classification or regression):

$$f(\mathcal{G}) = \mathrm{READOUT}\big(\{\boldsymbol{h}_n^{(k)} : v_n \in \mathcal{V},\, k \in K\}\big), \quad \boldsymbol{z}_{\mathcal{G}} = \mathrm{MLP}\big(f(\mathcal{G})\big). \quad (2)$$
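To make the aggregation/combine/readout scheme above concrete, here is a minimal pure-Python sketch. It is illustrative only: the mean aggregation, the averaging combine, the sum readout, and the name `gnn_embed` are our assumptions, and a trained GNN would interleave learned weight matrices and non-linearities at each layer.

```python
from typing import Dict, List

def gnn_embed(adj: Dict[int, List[int]],
              x: Dict[int, List[float]],
              num_layers: int) -> List[float]:
    """K-layer message passing: AGGREGATE (mean over neighbors),
    COMBINE (average with the node's own embedding), then a sum
    READOUT over final-layer node embeddings."""
    h = {v: list(x[v]) for v in adj}  # h^(0) = input node attributes
    dim = len(next(iter(x.values())))
    for _ in range(num_layers):
        h_new = {}
        for v, neighbors in adj.items():
            # AGGREGATE: mean of neighbor embeddings from the previous layer
            agg = [0.0] * dim
            for u in neighbors:
                agg = [a + hu for a, hu in zip(agg, h[u])]
            if neighbors:
                agg = [a / len(neighbors) for a in agg]
            # COMBINE: average the node's own embedding with the aggregate
            h_new[v] = [(hv + av) / 2.0 for hv, av in zip(h[v], agg)]
        h = h_new
    # READOUT: sum over node embeddings yields the graph-level vector
    return [sum(col) for col in zip(*h.values())]
```

On a triangle graph whose nodes all carry the feature `[1.0]`, every embedding is a fixed point of this scheme, so the readout is simply the node count.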
Graph data augmentation. Augmentation for graph-structured data still remains under-explored, with some work along these lines but requiring prohibitive additional computation cost Verma et al. (2019). Traditional self-training methods Verma et al. (2019); Li et al. (2018) utilize the trained model to annotate unlabelled data; Ding et al. (2018)
proposes to train a generator-classifier network in the adversarial learning setting to generate fake nodes; and Deng et al. (2019); Feng et al. (2019) generate adversarial perturbations to node features over the graph structure.
Pre-training GNNs. Although (self-supervised) pre-training is a common and effective scheme for convolutional neural networks (CNNs) Goyal et al. (2019); Kolesnikov et al. (2019); Chen et al. (2020b), it is rarely explored for GNNs. One exception Hu et al. (2019) is restricted to studying pre-training strategies in the transfer learning setting. We argue that a pre-trained GNN is not easy to transfer, due to the diverse fields that graph-structured data are sourced from. During transfer, substantial domain knowledge is required for both pre-training and downstream tasks, otherwise it might lead to negative transfer Hu et al. (2019); Rosenstein et al. (2005).
Contrastive learning. The main idea of contrastive learning is to make representations agree with each other under proper transformations, raising a recent surge of interest in visual representation learning Becker and Hinton (1992); Wu et al. (2018); Ye et al. (2019); Ji et al. (2019); Chen et al. (2020b). On a parallel note, for graph data, traditional methods trying to reconstruct the adjacency information of vertices Kipf and Welling (2016b); Hamilton et al. (2017) can be treated as a kind of “local contrast”, while over-emphasizing the proximity information at the expense of the structural information Ribeiro et al. (2017). Motivated by Belghazi et al. (2018); Hjelm et al. (2018), Ribeiro et al. (2017); Sun et al. (2019); Peng et al. (2020a) propose to perform contrastive learning between local and global representations to better capture structure information. However, graph contrastive learning has not been explored from the perspective of enforcing perturbation invariance as Ji et al. (2019); Chen et al. (2020b) have done.
Data augmentation aims at creating novel and realistically rational data via certain transformations that do not affect the semantic labels. It still remains under-explored for graphs except some work with expensive computation cost (see Sec. 2). We focus on graph-level augmentations. Given a graph $\mathcal{G}$ in the dataset of graphs, we formulate the augmented graph $\hat{\mathcal{G}}$ as satisfying $\hat{\mathcal{G}} \sim q(\hat{\mathcal{G}} \mid \mathcal{G})$, where $q(\cdot \mid \mathcal{G})$ is the augmentation distribution conditioned on the original graph, which is pre-defined, representing the human prior for the data distribution. For instance, for image classification, the applications of rotation and cropping encode the prior that people will acquire the same classification-based semantic knowledge from the rotated image or its local patches Xie et al. (2019); Berthelot et al. (2019).
When it comes to graphs, the same spirit can be followed. However, one challenge, as stated in Sec. 1, is that graph datasets are abstracted from diverse fields, and therefore there may be no universally appropriate data augmentation as there is for images. In other words, for different categories of graph datasets, some data augmentations might be more desirable than others. We mainly focus on three categories: biochemical molecules (e.g. chemical compounds, proteins) Hu et al. (2019), social networks Kipf and Welling (2016a) and image super-pixel graphs Dwivedi et al. (2020). Next, we propose four general data augmentations for graph-structured data and discuss the intuitive priors they introduce.
| Data augmentation | Type | Underlying Prior |
| --- | --- | --- |
| Node dropping | Nodes, edges | Vertex missing does not alter semantics. |
| Edge perturbation | Edges | Semantic robustness against connectivity variations. |
| Attribute masking | Nodes | Semantic robustness against losing partial attributes per node. |
| Subgraph | Nodes, edges | Local structure can hint the full semantics. |
Node dropping. Given the graph $\mathcal{G}$, node dropping randomly discards a certain portion of vertices along with their connections. The underlying prior enforced by it is that missing part of the vertices does not affect the semantic meaning of $\mathcal{G}$. Each node's dropping probability follows a default i.i.d. uniform distribution.
Edge perturbation. It perturbs the connectivities in $\mathcal{G}$ by randomly adding or dropping a certain ratio of edges. The underlying prior is that the semantic meaning of $\mathcal{G}$ has certain robustness to variations in the edge connectivity pattern. We also follow an i.i.d. uniform distribution to add/drop each edge.
Attribute masking. Attribute masking prompts models to recover masked vertex attributes using their context information, i.e., the remaining attributes. The underlying assumption is that missing partial vertex attributes does not affect the model predictions much.
Subgraph. This one samples a subgraph from $\mathcal{G}$ using random walk (the algorithm is summarized in Appendix A). The underlying prior is that the semantics of $\mathcal{G}$ can be largely preserved in its (partial) local structure.
The default augmentation (dropping, perturbation, masking and subgraph) ratio is set at 0.2.
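As a concrete illustration, the four augmentations can be sketched on a graph given as node and edge lists. This is a hedged sketch: the function names, the bounded-attempt loop in edge perturbation, and the walk-with-restart in the subgraph sampler are our illustrative choices, not the authors' exact implementation (which is in the linked repository).

```python
import random

def node_dropping(nodes, edges, ratio=0.2, rng=random):
    """Randomly discard a portion of vertices along with their connections."""
    keep = set(rng.sample(nodes, max(1, int(len(nodes) * (1 - ratio)))))
    return sorted(keep), [e for e in edges if e[0] in keep and e[1] in keep]

def edge_perturbation(nodes, edges, ratio=0.2, rng=random):
    """Drop a ratio of edges and add the same number of new random edges."""
    n_change = int(len(edges) * ratio)
    kept = rng.sample(edges, len(edges) - n_change)
    existing = {frozenset(e) for e in edges}
    added = []
    for _ in range(100 * max(n_change, 1)):  # bounded attempts at non-edges
        if len(added) >= n_change:
            break
        u, v = rng.sample(nodes, 2)
        if frozenset((u, v)) not in existing:
            added.append((u, v))
            existing.add(frozenset((u, v)))
    return list(nodes), kept + added

def attribute_masking(feats, ratio=0.2, rng=random):
    """Zero out the attribute vectors of a random subset of vertices."""
    masked = {v: list(f) for v, f in feats.items()}
    for v in rng.sample(list(feats), int(len(feats) * ratio)):
        masked[v] = [0.0] * len(feats[v])
    return masked

def subgraph(nodes, edges, ratio=0.2, rng=random):
    """Sample a subgraph via a random walk, restarting at isolated nodes."""
    target = max(1, int(len(nodes) * (1 - ratio)))
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    frontier = rng.choice(nodes)
    visited = {frontier}
    while len(visited) < target:
        if not adj[frontier]:          # stuck at an isolated node: restart
            frontier = rng.choice(nodes)
        else:
            frontier = rng.choice(adj[frontier])
        visited.add(frontier)
    return sorted(visited), [e for e in edges
                             if e[0] in visited and e[1] in visited]
```

Each function takes the same 0.2 default ratio as the text; composing two of them (or one with the identity) yields the two correlated views used later in the framework.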
Motivated by recent contrastive learning developments in visual representation learning (see Sec. 2), we propose a graph contrastive learning framework (GraphCL) for (self-supervised) pre-training of GNNs. In graph contrastive learning, pre-training is performed through maximizing the agreement between two augmented views of the same graph via a contrastive loss in the latent space as shown in Fig. 1. The framework consists of the following four major components:
(1) Graph data augmentation. The given graph $\mathcal{G}$ undergoes graph data augmentations to obtain two correlated views $\hat{\mathcal{G}}_i, \hat{\mathcal{G}}_j$ as a positive pair, where $\hat{\mathcal{G}}_i \sim q_i(\cdot \mid \mathcal{G})$ and $\hat{\mathcal{G}}_j \sim q_j(\cdot \mid \mathcal{G})$ independently. For different domains of graph datasets, how to strategically select data augmentations matters (Sec. 4).
(2) GNN-based encoder. A GNN-based encoder $f(\cdot)$ (defined in (2)) extracts graph-level representation vectors $\boldsymbol{h}_i, \boldsymbol{h}_j$ for the augmented graphs $\hat{\mathcal{G}}_i, \hat{\mathcal{G}}_j$. Graph contrastive learning does not apply any constraint on the GNN architecture.
(3) Projection head.
A non-linear transformation $g(\cdot)$ named projection head maps the augmented representations to another latent space where the contrastive loss is calculated, as advocated in Chen et al. (2020b). In graph contrastive learning, a two-layer perceptron (MLP) is applied to obtain $\boldsymbol{z}_i = g(\boldsymbol{h}_i)$ and $\boldsymbol{z}_j = g(\boldsymbol{h}_j)$.
(4) Contrastive loss function.
A contrastive loss function $\mathcal{L}(\cdot)$ is defined to enforce maximizing the consistency between positive pairs $\boldsymbol{z}_i, \boldsymbol{z}_j$ compared with negative pairs. Here we utilize the normalized temperature-scaled cross entropy loss (NT-Xent) Sohn (2016); Wu et al. (2018); Oord et al. (2018).
During GNN pre-training, a minibatch of $N$ graphs is randomly sampled and processed through contrastive learning, resulting in $2N$ augmented graphs and a corresponding contrastive loss to optimize, where we re-annotate the projected representations of the $n$-th graph's two views as $\boldsymbol{z}_{n,i}, \boldsymbol{z}_{n,j}$. Negative pairs are not explicitly sampled but generated from the other augmented graphs within the same minibatch as in Chen et al. (2017, 2020b). Denoting the cosine similarity function as $\mathrm{sim}(\boldsymbol{z}_{n,i}, \boldsymbol{z}_{n,j}) = \boldsymbol{z}_{n,i}^{\top} \boldsymbol{z}_{n,j} / (\|\boldsymbol{z}_{n,i}\| \|\boldsymbol{z}_{n,j}\|)$, NT-Xent for the $n$-th positive pair of examples is defined as:

$$\ell_n = -\log \frac{\exp\big(\mathrm{sim}(\boldsymbol{z}_{n,i}, \boldsymbol{z}_{n,j}) / \tau\big)}{\sum_{n'=1, n' \neq n}^{N} \exp\big(\mathrm{sim}(\boldsymbol{z}_{n,i}, \boldsymbol{z}_{n',j}) / \tau\big)}, \quad (3)$$

where $\tau$ denotes the temperature parameter. The final loss is computed across all positive pairs in the minibatch. The proposed graph contrastive learning algorithm is summarized in Appendix A.
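The per-pair NT-Xent computation can be sketched in pure Python as follows. This is an illustrative sketch (`nt_xent` and the list-of-lists embedding layout are our assumptions); a practical implementation would vectorize this over the whole minibatch in a tensor framework.

```python
import math

def cosine(a, b):
    """Cosine similarity sim(z, z') = z^T z' / (||z|| ||z'||)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(p * p for p in b))
    return dot / (na * nb)

def nt_xent(z_i, z_j, n, tau=0.5):
    """NT-Xent loss for the n-th positive pair (z_i[n], z_j[n]);
    the other second-view embeddings z_j[n'], n' != n, act as negatives."""
    pos = math.exp(cosine(z_i[n], z_j[n]) / tau)
    neg = sum(math.exp(cosine(z_i[n], z_j[m]) / tau)
              for m in range(len(z_j)) if m != n)
    return -math.log(pos / neg)
```

Lowering the loss requires the positive pair's similarity to dominate the negatives', which is exactly the agreement-maximization described above.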
Discussion. We first show that GraphCL can be viewed as one way of mutual information maximization between the latent representations of two kinds of augmented graphs. The full derivation is in Appendix F, with the loss form rewritten as below:
$$\mathcal{L} = -\mathbb{E}_{\mathbb{P}_{\hat{\mathcal{G}}_i}}\Big[\mathbb{E}_{\mathbb{P}_{(\hat{\mathcal{G}}_j \mid \hat{\mathcal{G}}_i)}} T\big(f_1(\hat{\mathcal{G}}_i), f_2(\hat{\mathcal{G}}_j)\big) - \log\Big(\mathbb{E}_{\mathbb{P}_{\hat{\mathcal{G}}_j}} e^{T(f_1(\hat{\mathcal{G}}_i), f_2(\hat{\mathcal{G}}_j))}\Big)\Big]. \quad (4)$$

The above loss essentially maximizes a lower bound of the mutual information between the latent representations of the two augmented views, where the compositions of augmentations determine our desired views of graphs. Furthermore, we draw the connection between GraphCL and recently proposed contrastive learning methods by demonstrating that GraphCL can be rewritten as a general framework unifying a broad family of contrastive learning methods on graph-structured data, through reinterpreting (4). In our implementation, we choose the critic $T$ as the (temperature-scaled) projected similarity and generate $\hat{\mathcal{G}}_i, \hat{\mathcal{G}}_j$ through data augmentation, while various other choices of the critic and of the compositions result in (4) instantiating as other specific contrastive learning algorithms including Velickovic et al. (2019); Ren et al. (2019); Park et al. (2020); Sun et al. (2019); Peng et al. (2020b); Hassani and Khasahmadi (2020); Qiu et al. (2020), as also shown in Appendix F.
[Table 2: dataset statistics — Category, Graph Num., Avg. Node, Avg. Degree per dataset.]
In this section, we assess and rationalize the role of data augmentation for graph-structured data in our GraphCL framework. Various pairs of augmentation types are applied, as illustrated in Fig. 2, to three categories of graph datasets (Table 2, and we leave the discussion on superpixel graphs in Appendix C). Experiments are performed in the semi-supervised setting, following the pre-training & finetuning approach Chen et al. (2020b). Detailed settings are in Appendix B.
We first examine whether and when applying (different) data augmentations helps graph contrastive learning in general. We summarize the results in Fig. 2 using the accuracy gain compared to training from scratch (no pre-training), and make the following observations.
Obs. 1. Data augmentations are crucial in graph contrastive learning. Without any data augmentation graph contrastive learning is not helpful and often worse compared with training from scratch, judging from the accuracy losses in the upper right corners of Fig. 2. In contrast, composing an original graph and its appropriate augmentation can benefit the downstream performance. Judging from the top rows or the right-most columns in Fig. 2, graph contrastive learning with single best augmentations achieved considerable improvement without exhaustive hyper-parameter tuning: 1.62% for NCI1, 3.15% for PROTEINS, 6.27% for COLLAB, and 1.66% for RDT-B.
The observation meets our intuition. Without augmentation, GraphCL simply compares two original samples as a negative pair (with the positive-pair loss becoming zero), homogeneously pushing all graph representations away from each other, which is hard to justify. Crucially, when appropriate augmentations are applied, the corresponding priors on the data distribution are instilled, enforcing the model to learn representations invariant to the desired perturbations through maximizing the agreement between a graph and its augmentation.
Obs. 2. Composing different augmentations benefits more. Composing augmentation pairs of a graph rather than the graph and its augmentation further improves the performance: the maximum accuracy gain was 2.10% for NCI1, 3.15% for PROTEINS, 7.11% for COLLAB, and 1.85% for RDT-B. Interestingly, applying augmentation pairs of the same type (see the diagonals of Fig. 2) does not usually lead to the best performance (except for node dropping), compared with augmentation pairs of different types (off-diagonals). Similar observations were made in visual representation learning Chen et al. (2020b). As conjectured in Chen et al. (2020b), composing different augmentations avoids the learned features trivially overfitting low-level “shortcuts”, making features more generalizable.
Here we make a similar conjecture: contrasting isogenous graph pairs augmented in different types presents a harder albeit more useful task for graph representation learning. We thus plot the contrastive loss curves composing various augmentations (except subgraph) together with attribute masking or edge perturbation for NCI1 and PROTEINS. Fig. 3 shows that, with augmentation pairs of different types, the contrastive loss always descends more slowly than it does with pairs of the same type, when the optimization procedure remains the same. This result indicates that composing augmentation pairs of different types does correspond to a “harder” contrastive prediction task. We will explore in Sec. 4.3 how to quantify a “harder” task in some cases and whether it always helps.
We then note that the (most) beneficial combinations of augmentation types can be dataset-specific, which matches our intuition, as graph-structured data are of highly heterogeneous nature (see Sec. 1). We summarize our observations and derive insights below, and further analyze the impact of the extent and/or pattern of given types of graph augmentations.
Obs. 3. Edge perturbation benefits social networks but hurts some biochemical molecules. Edge perturbation as one of the paired augmentations improves the performances for the social-network data COLLAB and RDT-B as well as the biomolecule data PROTEINS, but hurts the other biomolecule data NCI1. We hypothesize that, compared to the case of social networks, the “semantemes” of some biomolecule data are more sensitive to individual edges. Specifically, a single-edge change in NCI1 corresponds to a removal or addition of a covalent bond, which can drastically change the identity and even the validity of a compound, let alone its property for the down-stream semantemes. In contrast, the semantemes of social networks are more tolerant to individual edge perturbations Dai et al. (2018); Zügner et al. (2018). Therefore, for chemical compounds, edge perturbation corresponds to a prior that is conceptually incompatible with the domain knowledge and empirically unhelpful for down-stream performance.
We further examine whether the extent or strength of edge perturbation affects the conclusion above. We evaluate the downstream performances on the representative examples NCI1 and COLLAB, using the combination of the original graph (“identical”) and edge perturbation of various ratios in our GraphCL framework. Fig. 4A shows that edge perturbation worsens the NCI1 performances regardless of augmentation strength, confirming that our earlier conclusion is insensitive to the extent of edge perturbation. Fig. 4B suggests that edge perturbation could improve the COLLAB performances more with increasing augmentation strength.
Obs. 4. Applying attribute masking achieves better performance in denser graphs. For the social network datasets, composing the identical graph and attribute masking achieves 5.12% improvement for COLLAB (with higher average degree) while only 0.17% for RDT-B. Similar observations are made for the denser PROTEINS versus NCI1. To assess the impact of augmentation strength on this observation, we perform similar experiments on RDT-B and COLLAB, by composing the identical graph and its attributes masked to various extents. Fig. 4C and D show that, masking less for the very sparse RDT-B does not help, although masking more for the very dense COLLAB does.
We further hypothesize that masking patterns also matter: masking more hub nodes with high degrees should benefit denser graphs, because according to the message passing mechanism Gilmer et al. (2017) GNNs cannot reconstruct the missing information of isolated nodes. To test the hypothesis, we perform an experiment that masks nodes with more connections with higher probability on the denser graphs PROTEINS and COLLAB. Specifically, rather than the uniform distribution, we adopt a masking distribution proportional to $\deg(v_n)^{\alpha}$, where $\deg(v_n)$ is the degree of vertex $v_n$ and $\alpha$ is the control factor; a positive $\alpha$ indicates more masking for high-degree nodes. Fig. 5C and D show that, for the very dense COLLAB, there is an apparent upward tendency in performance when masking nodes with more connections.
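The degree-biased masking pattern can be sketched as follows. This is a hedged illustration: the power-law weighting with a control factor follows the description above, but the function names and the sequential weighted sampling without replacement are our own choices.

```python
import random

def biased_mask_probs(degrees, alpha):
    """P(v) proportional to max(deg(v), 1) ** alpha; alpha = 0 recovers the
    uniform pattern, alpha > 0 masks high-degree (hub) nodes more often."""
    w = {v: max(d, 1) ** alpha for v, d in degrees.items()}
    total = sum(w.values())
    return {v: wv / total for v, wv in w.items()}

def sample_mask_nodes(degrees, n_mask, alpha, rng=random):
    """Draw n_mask distinct nodes, weighted by the biased distribution."""
    remaining = biased_mask_probs(degrees, alpha)
    chosen = []
    for _ in range(min(n_mask, len(degrees))):
        r = rng.random() * sum(remaining.values())
        for v, p in remaining.items():
            r -= p
            if r <= 0:
                break  # v is the weighted pick (or the last key on rounding)
        chosen.append(v)
        del remaining[v]
    return chosen
```

Negative `alpha` inverts the bias toward low-degree nodes, matching the control-factor sweep reported in Fig. 5.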
Obs. 5. Node dropping and subgraph are generally beneficial across datasets. Node dropping and subgraph, especially the latter, seem to be generally beneficial in our studied datasets. For node dropping, the prior that missing certain vertices (e.g. some hydrogen atoms in chemical compounds or edge users for social networks) does not alter the semantic information is emphasized, intuitively fitting for our cognition. For subgraph, previous works Veličković et al. (2018); Sun et al. (2019) show that enforcing local (the subgraphs we extract) and global information consistency is helpful for representation learning, which explains the observation. Even for chemical compounds in NCI1, subgraphs can represent structural and functional “motifs” important for the down-stream semantemes.
We similarly examined the impact of node-dropping patterns by adopting the non-uniform distribution mentioned above for attribute-masking patterns. Fig. 5B shows that, for the dense social-network COLLAB graphs, more GraphCL improvement was observed when dropping hub nodes more in the range considered. Fig. 5A shows that, for the not-so-dense PROTEINS graphs, changing the node-dropping distribution away from uniform does not necessarily help.
As discussed in Obs. 2, “harder” contrastive learning might benefit more, where the “harder” task is achieved by composing augmentations of different types. In this section we further explore quantifiable difficulty in relationship to parameterized augmentation strengths/patterns and assess the impact of the difficulty on performance improvement.
Intuitively, larger dropping/masking ratios or control factors $\alpha$ lead to harder contrastive tasks, which did result in better COLLAB performances (Fig. 4 and 5) in the range considered, whereas very small ratios or negative $\alpha$ correspond to overly simple tasks that help less. We also design subgraph variants of increasing difficulty levels and reach similar conclusions. More details are in Appendix D.
Summary. Overall, we decide the augmentation pools for Section 5 as follows: node dropping and subgraph for biochemical molecules; all four for dense social networks; and all except attribute masking for sparse social networks. Strengths and patterns are kept at their defaults, even though varying them could help more.
In this section, we compare our proposed (self-supervised) pre-training framework, GraphCL, with state-of-the-art methods (SOTAs) in the settings of semi-supervised, unsupervised Sun et al. (2019) and transfer learning Hu et al. (2019) on graph classification (for node classification experiments please refer to Appendix G). Dataset statistics and training details for the specific settings are in Appendix E.
Semi-supervised learning. We first evaluate our proposed framework in the semi-supervised learning setting on graph classification Chen et al. (2019); Xu et al. (2018) on the benchmark TUDataset Morris et al. (2020). Since pre-training & finetuning in semi-supervised learning for graph-level tasks has not been explored before, we take two conventional network embedding methods as pre-training tasks for comparison: adjacency information reconstruction (we refer to GAE Kipf and Welling (2016b) for implementation) and local & global representation consistency enforcement (we refer to Infomax Sun et al. (2019) for implementation). Besides, the performances of training from scratch and of training with augmentations (without contrasting) are also reported. We adopt the graph convolutional network (GCN) with the default setting in Chen et al. (2019) as the GNN-based encoder, which achieves comparable SOTA performance in the fully-supervised setting. Table 3 shows that GraphCL outperforms traditional pre-training schemes.
Bold numbers indicate the best performance and numbers that overlap with the standard deviation of the best performance (comparable ones); 1% or 10% is the label rate; baseline and Aug. represent training from scratch without and with augmentations, respectively.
Unsupervised representation learning. Furthermore, GraphCL is evaluated on unsupervised representation learning following Narayanan et al. (2017); Sun et al. (2019), where unsupervised methods generate graph embeddings that are fed into a down-stream SVM classifier Sun et al. (2019). Aside from SOTA graph kernel methods such as the graphlet kernel (GL), the Weisfeiler-Lehman sub-tree kernel (WL) and the deep graph kernel (DGK), we also compare with four unsupervised graph-level representation learning methods: node2vec Grover and Leskovec (2016), sub2vec Adhikari et al. (2018), graph2vec Narayanan et al. (2017) and InfoGraph Sun et al. (2019). We adopt the graph isomorphism network (GIN) with the default setting in Sun et al. (2019) as the GNN-based encoder, which is SOTA in representation learning. Table 4 shows that GraphCL outperforms in most cases, except on datasets with small graph sizes (e.g. MUTAG and IMDB-B consist of graphs with average node numbers less than 20).
Transfer learning. Lastly, experiments are performed on transfer learning on molecular property prediction in chemistry and protein function prediction in biology following Hu et al. (2019), which pre-trains and finetunes the model on different datasets to evaluate the transferability of the pre-training scheme. We adopt GIN with the default setting in Hu et al. (2019) as the GNN-based encoder, which is SOTA in transfer learning. Experiments are performed 10 times, with the mean and standard deviation of ROC-AUC scores (%) reported as in Hu et al. (2019). Although there is no universally beneficial pre-training scheme, especially for the out-of-distribution scenario in transfer learning (Sec. 1), Table 5 shows that GraphCL still achieves SOTA performance on 5 of 9 datasets compared to the previous best schemes.
Adversarial robustness. In addition to generalizability, we claim that GNNs also gain robustness using GraphCL. The experiments are performed on synthetic data to classify the component number in graphs, facing the RandSampling, GradArgmax and RL-S2V attacks following the default setting in Dai et al. (2018). Structure2vec Dai et al. (2016) is adopted as the GNN-based encoder as in Dai et al. (2018). Table 6 shows that GraphCL boosts GNN robustness compared with training from scratch, under three evasion attacks.
[Table 6: classification accuracy without pre-training (No Pre-Train) versus with GraphCL under each of the three attacks.]
In this paper, we perform an explicit study exploring contrastive learning for GNN pre-training, facing the unique challenges in graph-structured data. First, several graph data augmentations are proposed, each discussed for the human prior it introduces over the data distribution. Along with the new augmentations, we propose a novel graph contrastive learning framework (GraphCL) for GNN pre-training to facilitate invariant representation learning, together with rigorous theoretical analysis. We systematically assess and analyze the influence of data augmentations in our proposed framework, revealing their rationale and guiding the choice of augmentations. Experimental results verify the state-of-the-art performance of our proposed framework in both generalizability and robustness.
Empowering deep learning for reasoning and predicting over graph-structured data is of broad interest with wide applications, such as recommendation systems, neural architecture search, and drug discovery. The proposed graph contrastive learning framework with augmentations contributes a general framework that can potentially benefit the effectiveness and efficiency of graph neural networks through model pre-training. The numerical results and analyses may also inspire the design of proper augmentations for positive knowledge transfer on downstream tasks.
Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. (2018). MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062. Cited by: §2.
Dai, H., Dai, B., and Song, L. (2016). Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711. Cited by: §5.
Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. (2017). graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §5.