
Adversarial Cross-View Disentangled Graph Contrastive Learning

Graph contrastive learning (GCL) is prevalent for tackling the supervision shortage issue in graph learning tasks. Many recent GCL methods have been proposed with various manually designed augmentation techniques, aiming to implement challenging augmentations on the original graph to yield robust representations. Although many of them achieve remarkable performance, existing GCL methods still struggle to improve model robustness without risking the loss of task-relevant information, because they ignore the fact that the augmentation-induced latent factors can be highly entangled with the original graph, making it more difficult to discriminate the task-relevant information from the irrelevant information. Consequently, the learned representation is either brittle or unilluminating. In light of this, we introduce Adversarial Cross-View Disentangled Graph Contrastive Learning (ACDGCL), which follows the information bottleneck principle to learn minimal yet sufficient representations from graph data. To be specific, our proposed model elicits the augmentation-invariant and augmentation-dependent factors separately. In addition to the conventional contrastive loss, which guarantees the consistency and sufficiency of the representations across different contrastive views, we introduce a cross-view reconstruction mechanism to pursue representation disentanglement. Besides, an adversarial view is added as the third view in the contrastive loss to enhance model robustness. We empirically demonstrate that our proposed model outperforms the state-of-the-art methods on the graph classification task over multiple benchmark datasets.



1 Introduction

Graph representation learning (GRL) has attracted significant attention due to its widespread applications in real-world interaction systems, such as social, molecular, biological, and citation networks [GRL_survey]. The current state-of-the-art supervised GRL methods are mostly based on Graph Neural Networks (GNNs) [GCN; GAT; hamilton_inductive_2017; GIN], which require a large amount of task-specific supervision. Despite their remarkable performance, they are usually limited by the deficiency of label supervision in real-world graph data: unlabeled graphs are usually easy to collect, but obtaining enough annotated labels can be very costly, especially in certain fields like biochemistry. Therefore, many recent works [GCC; ContrastMultiView; InfoGraph] have studied how to fully utilize the unlabeled information on graphs, stimulating the application of self-supervised learning (SSL) to GRL, where only limited or even no labels are needed.

As a prevalent and effective strategy of SSL, contrastive learning follows the mutual information maximization principle (InfoMax) [DGI] to maximize the agreement of positive pairs while minimizing that of negative pairs in the embedding space. In particular, graph contrastive learning (GCL) methods [GCC; ContrastMultiView; GraphCL] usually take two different augmentation views of the same graph as the positive pair and maintain high consistency between the learned representations of the views, so as to preserve the invariant graph properties while ignoring the nuances. However, InfoMax has been shown to be risky because it can push encoders to capture redundant information, which is useless for label identification and may lead to brittle representations [on_mutual_info]. Recent works [AD-GCL; JOAO; GASSL] try to implement challenging augmentation views or add perturbations on the original graph to improve representation robustness. But such aggressive transformations may break an assumption of GCL that the data distribution shift induced by augmentation does not affect label information. Intuitively, each commonly-used augmentation technique transforms the original data in a certain way and thereby injects distinct yet meaningless factors into the original graph data. The injected misleading factors can be highly entangled with the label-relevant factors, making the augmented graph less informative. Therefore, simply adding regularizations [AD-GCL; GASSL] to eliminate redundant information within the highly entangled graph may lead to another extreme, where the learned representations are not informative enough to identify different graphs. To mitigate this issue, a method that can disentangle the augmentation view-invariant and view-dependent latent factors without sacrificing the information sufficiency of the learned representation is urgently needed.

In this paper, we address this challenge by proposing a novel cross-view disentangled graph contrastive learning model with adversarial training, named ACDGCL. Specifically, ACDGCL consists of a graph encoder followed by two feature extractors that disentangle the representations particular to the essential and augmentation-induced latent factors, respectively. We follow the mutual information maximization principle to maximize the correspondence between the essential representations of the two augmentation views. To enforce the disentanglement constraint, we propose a reconstruction-based representation learning scheme, including intra-view and inter-view reconstructions, to explicitly disentangle the augmentation-relevant and -irrelevant factors of the learned representation. Besides, a perturbed adversarial graph is added as a third contrastive view next to the two augmentation views, to further remove the redundant information from the learned essential representation. We provide theoretical analysis to show that ACDGCL is capable of learning a minimal sufficient representation with the designs above. Finally, we conduct experiments to validate the effectiveness of ACDGCL on commonly-used graph benchmark datasets. The experimental results show that ACDGCL achieves significant performance gains over different datasets and settings compared with state-of-the-art baselines.

To sum up, the main contributions of this work are threefold: (i) we propose ACDGCL to learn augmentation-disentangled representations with a cross-view reconstruction mechanism; (ii) we add adversarial samples as the third contrastive view to improve the robustness of the learned representation; (iii) we conduct thorough experiments to demonstrate that ACDGCL significantly outperforms the state-of-the-art baselines on multiple graph classification benchmark datasets.

2 Preliminaries

2.1 Graph Representation Learning

In this work, we focus on graph-level tasks. Let $\mathcal{D} = \{G_1, \dots, G_N\}$ denote a graph dataset with $N$ graphs, where $\mathcal{V}_n$ and $\mathcal{E}_n$ are the node set and edge set of graph $G_n$, respectively. We use $x_v$ and $x_{uv}$ to denote the attribute vectors of each node $v \in \mathcal{V}_n$ and each edge $(u, v) \in \mathcal{E}_n$. Each graph $G$ is associated with a label, denoted as $y$. The goal of graph representation learning is to learn an encoder $f$ so that the learned representation $z = f(G)$ is sufficient to identify $y$ in the downstream task. We define sufficiency as $z$ containing the same amount of information as $G$ for label identification [sufficiency], formulated as:

$$I(z; y) = I(G; y), \qquad (1)$$

where $I(\cdot\,;\cdot)$ denotes the mutual information between two variables. We illustrate the general optimization result of classical representation learning in Figure 1(a).

2.2 Contrastive Learning

Contrastive Learning (CL) is a self-supervised representation learning method which leverages instance-level identity as supervision. During the training phase, each instance $x$ first goes through proper data transformations to generate two data augmentation views $v_1 = t_1(x)$ and $v_2 = t_2(x)$, where $t_1$ and $t_2$ are transformation functions. Then, the CL method learns an encoder $f$ (a backbone network plus a projection layer) which maps $v_1$ and $v_2$ closer in the hidden space, so that the learned representations $z_1 = f(v_1)$ and $z_2 = f(v_2)$ retain all the information shared by $v_1$ and $v_2$. The encoder is usually optimized by a contrastive loss, such as the NCE loss [NCE], InfoNCE loss [InfoNCE], or NT-Xent loss [SimCLR]. We provide the general optimization result of CL in Figure 1(b). In Graph Contrastive Learning (GCL), GNNs such as GCN [GCN] and GIN [GIN] usually serve as the backbone networks, together with the commonly-used graph data augmentation operators [GraphCL], such as node dropping, edge perturbation, subgraph sampling, and attribute masking.
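To make the augmentation operators concrete, the following is a minimal sketch of node dropping on a PyG-style COO edge list; the function name, signature, and default drop ratio are ours for illustration and not part of any specific library.

```python
import torch

def drop_nodes(edge_index: torch.Tensor, num_nodes: int, drop_ratio: float = 0.2):
    """Node dropping: randomly remove a subset of nodes and every edge
    incident to a removed node. `edge_index` is a [2, E] COO edge list."""
    keep = torch.rand(num_nodes) >= drop_ratio             # True for surviving nodes
    edge_mask = keep[edge_index[0]] & keep[edge_index[1]]  # keep edges whose endpoints survive
    return edge_index[:, edge_mask], keep

# Two IID applications of the operator yield the two contrastive views:
# view1_edges, _ = drop_nodes(edge_index, num_nodes)
# view2_edges, _ = drop_nodes(edge_index, num_nodes)
```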

All CL-based methods are built on the assumption that augmentations do not change the label-relevant information. Here, we follow [robust_rep] to clarify the definition of redundancy: $v_1$ is redundant to $v_2$ for $y$ iff $v_1$ and $v_2$ share the same label-relevant information, i.e., $I(v_1; y \mid v_2) = 0$. In CL, $v_1$ and $v_2$ are supposed to be mutually redundant, and we define the mutual redundancy as:

$$I(v_1; y \mid v_2) = 0 = I(v_2; y \mid v_1). \qquad (2)$$
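As a concrete reference for the contrastive objectives above, here is a minimal PyTorch sketch of the InfoNCE loss, where the i-th rows of the two batched view embeddings form the positive pair and all other rows serve as in-batch negatives; the temperature value and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE over a batch: z1[i] and z2[i] are embeddings of the two views
    of instance i; off-diagonal pairs act as negatives."""
    z1 = F.normalize(z1, dim=1)                # cosine similarity via unit vectors
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature         # [B, B] similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    # Symmetrized: each view retrieves its counterpart within the batch.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```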

2.3 Adversarial Training

Deep neural networks have been demonstrated to be vulnerable to adversarial attacks [AT_intro]. Among the approaches proposed against adversarial attacks, Adversarial Training (AT) achieves remarkable robustness. Specifically, AT introduces a perturbation variable $\delta$ and improves model robustness by training the network on the adversarial samples $x + \delta$. During the training phase, the model is optimized to minimize the training loss, while $\delta$ is optimized within the radius $\epsilon$ to maximize the loss. The supervised setting of AT is defined as:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|\delta\| \le \epsilon} \mathcal{L}\big(f_{\theta}(x + \delta), y\big) \Big], \qquad (3)$$

where $x$ and $y$ are the data feature and label sampled from the training set $\mathcal{D}$, respectively, and $\mathcal{L}$ denotes the supervised training objective, such as the cross-entropy loss. Besides improving model robustness to adversarial attacks, AT is also capable of reducing overfitting and further increasing generalization performance [AdvProp; FreeLB]. One possible reason behind this phenomenon is that AT follows the Information Bottleneck principle [IB; IB_AT], in which the optimal representation contains only minimal yet sufficient information. Depending on whether it is relevant to the label or not, the mutual information between the representation $z$ and $x$ can be decomposed into two parts:

$$I(x; z) = I(z; y) + I(x; z \mid y), \qquad (4)$$

which follows from the chain rule of mutual information, since $z$ is a function of $x$ and hence $I(z; y \mid x) = 0$. Ideally, the representation learned through AT is expected to keep all the label-relevant information of $x$ intact while discarding the other, futile information; we plot this ideal situation in Figure 1(c).
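A minimal PyTorch sketch of the min-max objective in Equation 3 follows, with the inner maximization approximated by a few steps of projected gradient ascent; the step size, radius, and names are illustrative assumptions.

```python
import torch

def pgd_delta(model, loss_fn, x, y, eps=0.03, alpha=0.01, steps=10):
    """Inner maximization of Eq. 3: search for a perturbation delta within an
    L-inf ball of radius eps that maximizes the training loss."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend on the loss, then project back into the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach()

# Outer minimization: the model is then trained on the adversarial samples.
# delta = pgd_delta(model, F.cross_entropy, x, y)
# loss = F.cross_entropy(model(x + delta), y); loss.backward(); optimizer.step()
```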

Figure 1: Illustration of the relations between the feature $x$, label $y$, and representation $z$ in terms of information entropy under the three scenarios above. (a) The ideal optimization result of classical graph representation learning: the green area becomes null when Equation 1 is satisfied. (b) The green area is optimized to null in CL, and $v_1$ and $v_2$ share the same intersection with $y$ when the assumption in Equation 2 holds. (c) The green area becomes null when AT achieves its optimal result, in which the second term in Equation 4 is minimized to 0.

3 Proposed Model

In this section, we introduce the proposed ACDGCL, whose framework is shown in Figure 2. Before diving into the details of ACDGCL, we briefly analyze the sufficiency and robustness in GCL and provide an illustration of the disentanglement hypothesis.

Figure 2: The illustration of the proposed ACDGCL. (1) Graph augmentations are applied to the input graph to produce two augmented graphs, which are then fed into the shared graph encoder to generate two graph representations $h_1$ and $h_2$. (2) $h_1$ and $h_2$ are used as the inputs of the two feature extractors to generate two pairs of disentangled graph representations $(z_i^a, z_i^e)$, where $z_i^a$ is specific to the augmentation-induced factors and $z_i^e$ captures the essential factors. Then we use the two pairs of representations to reconstruct $h_1$ and $h_2$ both intra-view and inter-view. (3) An adversarial sample generated by the optimized perturbation $\delta$ goes through the same procedure to generate $z_{adv}^e$. We take it as the third view besides $z_1^e$ and $z_2^e$ in CL and maximize their consistency with each other.

3.1 Motivation of ACDGCL

To minimize the redundant information in graph representations, the principle of the information bottleneck (IB) has been introduced into graph representation learning [yu_graph_2020], and the learned model is empirically shown to be more robust to adversarial attacks. In the setting of graph contrastive learning (GCL), labels are not accessible to guide the optimization process, so it is more challenging to discern predictive information from redundant information. For each graph $G$, GCL methods generally pick two augmentation operators $t_1$ and $t_2$, IID sampled from the same augmentation family $\mathcal{T}$, to generate two augmented graphs $\tilde{G}_1 = t_1(G)$ and $\tilde{G}_2 = t_2(G)$ for contrast. As stated in Section 2.2, all GCL models are optimized under the assumption that the two contrastive views are mutually redundant with respect to the label information. However, this assumption does not necessarily hold, especially when aggressive augmentations are applied to the graph data. Intuitively, $t_1$ and $t_2$ induce completely different distributions, yet the factors related to them can become highly entangled through the data augmentation process, shifting the distributions away from the original graph data to some extent. In this case, aggressive data augmentation operators [AD-GCL] or perturbations [GASSL] could overly enlarge the distribution shift so that the augmentation views fail to meet the redundancy assumption in Section 2.2, and thereby the learned representation will not satisfy the sufficiency requirement in Section 2.1.

To address this dilemma, we propose to learn an augmentation-disentangled representation to improve robustness without sacrificing information sufficiency. Given an augmented graph $\tilde{G}_i$, we aim to learn a pair of disentangled representations $(z_i^a, z_i^e)$, where $z_i^a$ is expected to be specific to the augmentation information, while $z_i^e$ is optimized to elicit all the essential factors (those of the original graph distribution) from $\tilde{G}_i$. We provide an illustration of the optimal representation disentanglement in Figure 3(a) and (b): (1) $(z_i^a, z_i^e)$ is sufficient for its corresponding graph view with respect to $y$, and the union of $z_i^a$ and $z_i^e$ covers all the information in $\tilde{G}_i$; (2) $z_i^a$ and $z_i^e$ are mutually exclusive (disentangled), and only $z_i^e$ is relevant to the label information. To reach this optimal point, we need corresponding designs that guarantee the sufficiency and the disentanglement of the representation. Next, we introduce our framework details to explain how we achieve these two constraints, respectively.

Figure 3: Illustration of the relations between the augmented graphs $\tilde{G}_1, \tilde{G}_2$, label $y$, augmentation-dependent representations $z_1^a, z_2^a$, and augmentation-invariant representations $z_1^e, z_2^e$ in terms of information entropy under the optimal situation. The green areas in the three figures become null when the optimal situation is achieved. (a) The union of $z_i^a$ and $z_i^e$ covers all the information of its corresponding augmentation view $\tilde{G}_i$. (b) $z_i^a$ and $z_i^e$ are mutually exclusive. $z_1^e$ and $z_2^e$ share all the information, including the label-relevant (white) and label-irrelevant (shaded) information. $z_1^a$ and $z_2^a$ are specific to their corresponding augmentation information; they are naturally independent because $t_1$ and $t_2$ are IID sampled from $\mathcal{T}$. (c) $z_1^e$ and $z_2^e$ only contain the label-relevant information in $\tilde{G}_1$ and $\tilde{G}_2$, and the label-irrelevant information is minimized to 0.

3.2 Disentanglement by Cross-View Reconstruction

In GCL, a graph encoder is usually leveraged to aggregate the features of graph data into its representation. There are multiple choices of graph encoders in GCL, including GCN [GCN] and GIN [GIN], etc. In this work, we adopt GIN as the backbone network for simplicity; note that any other commonly-used graph encoder can also be applied in our model. Given two augmentation views $\tilde{G}_1$ and $\tilde{G}_2$, we first use the encoder $f$ to map them into a lower-dimensional hidden space to generate two embeddings $h_1 = f(\tilde{G}_1)$ and $h_2 = f(\tilde{G}_2)$. Instead of directly maximizing the agreement between the two entangled representations $h_1$ and $h_2$, we further feed them into a pair of feature extractors $\phi_a$ and $\phi_e$ (both MLP-based networks) to learn the disentangled embeddings:

$$z_i^a = \phi_a(h_i), \quad z_i^e = \phi_e(h_i), \quad i \in \{1, 2\}, \qquad (5)$$

where we generate a pair of disentangled embeddings $(z_i^a, z_i^e)$ for each of $\tilde{G}_1$ and $\tilde{G}_2$ through the procedure above. Ideally, the mutual redundancy assumption between $z_1^e$ and $z_2^e$ can thus be guaranteed, because $\tilde{G}_1$ and $\tilde{G}_2$ are augmented from the same original graph and naturally share the same essential factors, including those relevant to label identification.
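A minimal sketch of the encoder-plus-extractors structure in Equation 5 follows; the class name, the head names (`phi_e`, `phi_a` for $\phi_e$, $\phi_a$), and the layer sizes are our illustrative assumptions.

```python
import torch.nn as nn

class DisentangledHeads(nn.Module):
    """A shared graph encoder f followed by two MLP feature extractors that
    split each view embedding h_i into an essential part z_i^e and an
    augmentation-dependent part z_i^a (Eq. 5)."""
    def __init__(self, encoder: nn.Module, hidden_dim: int, out_dim: int):
        super().__init__()
        self.encoder = encoder  # e.g., a GIN producing graph-level embeddings
        def mlp():
            return nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))
        self.phi_e = mlp()      # augmentation-invariant (essential) factors
        self.phi_a = mlp()      # augmentation-induced factors

    def forward(self, g):
        h = self.encoder(g)     # entangled view embedding h_i
        return self.phi_e(h), self.phi_a(h)   # (z_i^e, z_i^a)
```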

Here, we clarify the lower bound of the mutual information between one augmentation view and the learned augmentation-invariant representation of the other augmentation view in Theorem 1.

Theorem 1

Suppose $f$ is a GNN encoder as powerful as the 1-WL test. Let $\phi_a$ elicit only the augmentation information from $h_i = f(\tilde{G}_i)$, while $\phi_e$ extracts the essential factors of $G$ from $h_1$ and $h_2$. Then the mutual information $I(\tilde{G}_i; z_j^e)$, $i \ne j$, admits a lower bound, whose precise form and detailed proof are provided in Section C of the Appendix.

Therefore, we can maximize the consistency between the representations of the two views by maximizing the mutual information between $z_1^e$ and $z_2^e$, and derive our objective to ensure view invariance as follows:

$$\mathcal{L}_{cl} = \ell\big(z_1^e, z_2^e\big), \qquad (6)$$

where $\ell(\cdot, \cdot)$ denotes the contrastive objective, for which we adopt the InfoNCE loss [InfoNCE] in this work. Meanwhile, to ensure the feature sufficiency and disentanglement stated above, we propose a cross-view reconstruction mechanism that pursues both objectives. To be specific, we use the representation pairs $(z_j^e, z_i^a)$ both within and across the augmentation views to recover the original input, so that the two objectives can be guaranteed simultaneously. Because graph data is non-Euclidean structured data that cannot be represented in Euclidean space like the raw data in the computer vision domain, we instead infer the output of the encoder $f$ based on the representation pairs. First, we perform the reconstruction within each augmentation view, namely mapping $(z_i^e, z_i^a)$ to $h_i$, where $i \in \{1, 2\}$ indexes the augmentation view. The optimal result of this step is shown in Figure 3(a), where the union of $z_i^a$ and $z_i^e$ covers all the information in the corresponding augmented graph view $\tilde{G}_i$. Since only $z_i^e$ is involved in the contrastive loss in Equation 6, we thereby avoid forcing the graph encoder to become less powerful than the 1-WL test, which could otherwise happen when aggressive augmentation or perturbation is applied. Then, we treat $(z_j^e, z_i^a)$ with $i \ne j$ as a cross-view representation pair and repeat the reconstruction procedure on it to predict $h_i$, aiming to ensure that $z_i^a$ and $z_i^e$ are disentangled and that $z_i^a$ is specific to the augmentation-induced factors. The optimal disentanglement result is illustrated in Figure 3(b), where $z_i^a$ and $z_i^e$ share no information while $z_1^e$ and $z_2^e$ share all the essential information. Here, we formulate the reconstruction procedures as:

$$\hat{h}_i = r\big(z_j^e \odot z_i^a\big), \quad i, j \in \{1, 2\}, \qquad (7)$$

where $r$ is the parameterized reconstruction model and $\odot$ is a pre-defined fusion operator, such as the element-wise product. The reconstruction procedures are optimized by minimizing the conditional entropy $H(h_i \mid z_j^e, z_i^a)$, $i, j \in \{1, 2\}$. Ideally, we reach the optimal situation demonstrated in Figure 3(a) and (b) iff $H(h_i \mid z_j^e, z_i^a) = 0$, i.e., $h_i$ is exactly recovered given its augmentation-dependent representation and the augmentation-invariant representation of either view. Nevertheless, the conditional probability $p(h_i \mid z_j^e, z_i^a)$ is unknown to us; we hence use a variational distribution approximated by $r$, denoted as $q(h_i \mid z_j^e, z_i^a)$. We provide the upper bound of $H(h_i \mid z_j^e, z_i^a)$ in Theorem 2.

Theorem 2

Assume the variational distribution $q(h_i \mid z_j^e, z_i^a)$ is Gaussian and $r$ is the parameterized reconstruction model that infers $h_i$ from $(z_j^e, z_i^a)$. Then $H(h_i \mid z_j^e, z_i^a)$ admits an upper bound, whose precise form and detailed proof are demonstrated in Section C of the Appendix.

Since we adopt two augmentation views, the objective function constraining representation sufficiency and disentanglement can be formulated as:

$$\mathcal{L}_{rec} = \sum_{i \in \{1,2\}} \sum_{j \in \{1,2\}} \big\| r\big(z_j^e \odot z_i^a\big) - h_i \big\|_2^2. \qquad (8)$$
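A minimal sketch of Equation 8's intra- and inter-view reconstruction terms follows, assuming the fusion operator is the element-wise product and that, under the Gaussian assumption of Theorem 2, minimizing the conditional entropy reduces to a mean-squared reconstruction error; the model `r` and all names are illustrative.

```python
import torch.nn.functional as F

def cross_view_recon_loss(r, z_e, z_a, h):
    """z_e, z_a: lists [z_1^e, z_2^e] and [z_1^a, z_2^a]; h: encoder outputs
    [h_1, h_2]; r: the parameterized reconstruction model (e.g., an MLP).
    Sums the four reconstruction terms of Eq. 8 over (i, j) in {1, 2}^2."""
    loss = 0.0
    for i in range(2):
        for j in range(2):
            # Fuse the essential factors of view j with the augmentation
            # factors of view i, then try to recover view i's embedding.
            loss = loss + F.mse_loss(r(z_e[j] * z_a[i]), h[i])
    return loss
```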

3.3 Adversarial Contrastive View

With the cross-view reconstruction mechanism above, we disentangle the essential factors from the two augmentation views, and thus $z_1^e$ and $z_2^e$ can be considered mutually redundant for $y$, because both of them are similarly distributed to the original graph $G$, i.e., $I(z_1^e; y \mid z_2^e) = 0$ and $I(z_2^e; y \mid z_1^e) = 0$. However, not all of the information in the original graph is label-relevant, and redundant information may still exist in $z_i^e$, as illustrated in Figure 3(b). To further remove the redundant information and enhance the robustness of the learned representation, we introduce adversarial training into our model by adding crafted perturbations. Many existing works on adversarial contrastive learning [RoCL] add crafted perturbations to the augmented samples and optimize the perturbation to maximize the contrastive loss. Despite its generalization ability, this Adversarial-to-Adversarial training strategy aggressively demands minimizing the "worst-case" consistency and is therefore unsuitable to apply to the original graph view [ADVCL]. We thereby build the adversarial sample over the original graph $G$ rather than an augmented sample $\tilde{G}_i$, and denote it as $G_{adv}$. The adversarial objective is defined as

$$\mathcal{L}_{adv} = \ell\big(z_{adv}^e, z_1^e\big) + \ell\big(z_{adv}^e, z_2^e\big), \qquad (9)$$

where the essential representation $z_{adv}^e$ of the adversarial sample $G_{adv}$ is employed as another positive pair with each of the two augmentation views. We also clarify the lower bound of the mutual information between $z_i^e$ and the adversarial view in Theorem 3.

Theorem 3

Assume $f$ and $\phi_e$ are the optimal encoder and feature extractor stated in Theorem 1, and $\delta$ is the perturbation optimized in Equation 10 within the radius $\epsilon$. Then the mutual information between $z_i^e$ and the adversarial view admits a lower bound, whose precise form and detailed proof are in Section C of the Appendix.

Our implementation of crafting the perturbation is inspired by the recent work GASSL [GASSL], which adds the perturbation to the output of the first hidden layer, as this is empirically shown to generate a more challenging view than perturbing the initial node features. With the lower bound stated in Theorem 3, we can utilize a min-max training strategy to further improve representation robustness, where the inner maximization is solved by projected gradient descent (PGD) [PGD]. The full adversarial training objective is then defined as:

$$\min_{\theta} \; \max_{\|\delta\| \le \epsilon} \mathcal{L}_{adv}, \qquad (10)$$

where $\theta$ denotes the parameters of the encoder and feature extractors.
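A sketch of crafting the adversarial view of Equation 10 in the GASSL style, perturbing the output of the first hidden layer; the split into `first_layer` / `rest` and all hyperparameter values are our illustrative assumptions.

```python
import torch

def adversarial_view(first_layer, rest, g_inputs, cl_loss_fn,
                     eps=0.01, alpha=0.004, steps=3):
    """Inner maximization of Eq. 10: perturb the first hidden layer's output
    (rather than the raw node features) to maximize the contrastive loss."""
    h = first_layer(g_inputs).detach()          # fixed while crafting delta
    delta = torch.zeros_like(h).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):                      # PGD ascent on the loss
        loss = cl_loss_fn(rest(h + delta))
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return rest(h + delta.detach())             # z_adv^e for the outer step
```

In the outer minimization, the full model would be run again with gradients enabled so that Equation 10 updates the encoder and extractor parameters on the crafted view.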

3.4 The Joint Objective

We design the joint objective of ACDGCL by combining all of the objectives above. Given the graph $G$, the graph encoder $f$ and the feature extractors $\phi_e$ and $\phi_a$ can be optimized with the objective below:

$$\mathcal{L} = \mathcal{L}_{cl} + \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{adv}, \qquad (11)$$

where $\lambda_1$ and $\lambda_2$ are the coefficients balancing the magnitude of each loss term. With this joint objective, our proposed model is able to learn the optimal representation illustrated in Figure 3(c).
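Putting the pieces together, a schematic sketch of one training step under the joint objective of Equation 11, reusing the illustrative helpers sketched above; $\lambda_1$ and $\lambda_2$ appear as `lambda1` and `lambda2`.

```python
def acdgcl_loss(l_cl, l_rec, l_adv, lambda1: float, lambda2: float):
    """Joint objective of Eq. 11: contrastive term plus weighted
    reconstruction and adversarial terms."""
    return l_cl + lambda1 * l_rec + lambda2 * l_adv

# One training step (schematic):
# l_cl  = info_nce(z_e[0], z_e[1])                           # Eq. 6
# l_rec = cross_view_recon_loss(r, z_e, z_a, h)              # Eq. 8
# l_adv = info_nce(z_adv, z_e[0]) + info_nce(z_adv, z_e[1])  # Eq. 9
# acdgcl_loss(l_cl, l_rec, l_adv, lambda1, lambda2).backward()
```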

4 Experiments

In this section, we present the empirical evaluation of ACDGCL on public graph benchmark datasets. An ablation study and a robustness analysis are conducted to evaluate the effectiveness of the designs in ACDGCL. We provide the dataset statistics, training details, and further analysis of the hyperparameters in the Appendix.

4.1 Experimental Setups

Datasets.

We evaluate our model on five graph benchmark datasets from the field of bioinformatics (MUTAG, PTC-MR, NCI1, DD, and PROTEINS) and five from the field of social networks (COLLAB, IMDB-B, RDT-B, RDT-M5K, and IMDB-M) on the task of graph-level property classification. Additionally, we use ogbg-molhiv from the Open Graph Benchmark [OGB] to demonstrate our model's advantages on a large-scale dataset. More details about the dataset statistics are included in Section A of the Appendix.

Baselines. Under the unsupervised representation learning setting, we compare ACDGCL with seven SOTA self-supervised learning methods, GraphCL [GraphCL], InfoGraph [InfoGraph], MVGRL [ContrastMultiView], AD-GCL [AD-GCL], GASSL [GASSL], InfoGCL [InfoGCL], and DGCL [DGCL], as well as four classical unsupervised representation learning methods: node2vec [node2vec], sub2vec [sub2vec], graph2vec [graph2vec], and VGAE [VGAE].

Evaluation Protocol. We follow the evaluation protocols of previous works [InfoGraph; GraphCL; DGCL] to verify the effectiveness of our model. The learned representations are fed into a linear SVM classifier for task-specific prediction. We report the mean test accuracy evaluated by 10-fold cross-validation, with the standard deviation over five random seeds, as the final performance. In addition, we follow the semi-supervised representation learning setting of GraphCL on the ogbg-molhiv dataset, with fine-tuning label rates of 1%, 10%, and 20%. The final performance is reported as the mean ROC-AUC over five random initialization seeds.
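A sketch of this evaluation protocol, assuming frozen embeddings `Z` and labels `y` as NumPy arrays; the use of scikit-learn's `LinearSVC` and the specific `C` value are our illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def evaluate(Z: np.ndarray, y: np.ndarray, seeds=(0, 1, 2, 3, 4)):
    """Fit a linear SVM on the frozen graph embeddings and report the mean
    10-fold cross-validation accuracy, with std taken over random seeds."""
    means = []
    for seed in seeds:
        clf = LinearSVC(C=1.0, random_state=seed, max_iter=10000)
        means.append(cross_val_score(clf, Z, y, cv=10, scoring="accuracy").mean())
    return float(np.mean(means)), float(np.std(means))
```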

Implementation Details.

We implement our framework with PyTorch and employ the data augmentation functions provided by the PyGCL library [PyGCL]. We choose GIN [GIN] as the backbone graph encoder, and the model is optimized with the Adam optimizer. There are two specific hyperparameters in our model, namely $\lambda_1$ and $\lambda_2$, each tuned over a pre-defined search space. More implementation details are provided in Section B of the Appendix. All of the experiments are conducted on an Nvidia GeForce RTX 2080 Ti GPU.

4.2 Overall Performance Comparison

Unsupervised representation learning. The overall performance comparison is shown in Table 1. From the results, we make three observations: (1) The GCL-based methods generally yield higher performance than the classical unsupervised learning methods, indicating the effectiveness of utilizing instance-level supervision; (2) InfoGCL and GASSL achieve better performance than GraphCL, which empirically supports the conclusion that the InfoMax objective can suffer from overwhelming redundant information, so that more challenging augmentations or perturbations are needed to produce robust representations; (3) Our proposed ACDGCL and DGCL consistently outperform the other baselines, proving the advantage of disentangled representations. More importantly, ACDGCL achieves state-of-the-art results on most of the datasets, which further demonstrates the success of our model in learning minimal yet sufficient representations.

Method | MUTAG | PTC-MR | COLLAB | NCI1 | PROTEINS | IMDB-B | RDT-B | IMDB-M | RDT-M5K | DD
node2vec [node2vec] | 72.6±10.2 | 58.6±8.0 | - | 54.9±1.6 | 57.5±3.6 | - | - | - | - | -
sub2vec [sub2vec] | 61.1±15.8 | 60.0±6.4 | - | 52.8±1.5 | 53.0±5.6 | 55.3±1.5 | 71.5±0.4 | 36.7±0.8 | 36.7±0.4 | -
graph2vec [graph2vec] | 83.2±9.3 | 60.2±6.9 | - | 73.2±1.8 | 73.3±2.1 | 71.1±0.5 | 75.8±1.0 | 50.4±0.9 | 47.9±0.3 | -
InfoGraph [InfoGraph] | 89.0±1.1 | 61.7±1.4 | 70.7±1.1 | 76.2±1.1 | 74.4±0.3 | 73.0±0.9 | 82.5±1.4 | 49.7±0.5 | 53.5±1.0 | 72.9±1.8
VGAE [VGAE] | 87.7±0.7 | 61.2±1.8 | - | - | - | 70.7±0.7 | 87.1±0.1 | 49.3±0.4 | 52.8±0.2 | -
MVGRL [ContrastMultiView] | 89.7±1.1 | 62.5±1.7 | - | - | - | 74.2±0.7 | 84.5±0.6 | 51.2±0.5 | - | -
GraphCL [GraphCL] | 86.8±1.3 | 63.6±1.8 | 71.4±1.2 | 77.9±0.4 | 74.4±0.5 | 71.1±0.4 | 89.5±0.8 | - | 56.0±0.3 | 78.6±0.4
InfoGCL [InfoGCL] | 91.2±1.3 | 63.5±1.5 | 80.0±1.3 | 80.2±0.6 | - | 75.1±0.9 | - | 51.4±0.8 | - | -
DGCL [DGCL] | 92.1±0.8 | 65.8±1.5 | 81.2±0.3 | 81.9±0.2 | 76.4±0.5 | 75.9±0.7 | 91.8±0.2 | 51.9±0.4 | 56.1±0.2 | -
AD-GCL [AD-GCL] | 89.7±1.0 | - | 73.3±0.6 | 69.7±0.5 | 73.8±0.5 | 72.3±0.6 | 85.5±0.8 | 49.9±0.7 | 54.9±0.4 | 75.1±0.4
GASSL [GASSL] | 90.9 | 64.6±6.1 | 78.0 | 80.2 | - | 74.2 | - | 51.7 | - | -
ACDGCL (ours) | 92.6±0.9 | 67.4±1.3 | 80.5±0.5 | 82.0±1.0 | 77.3±0.4 | 76.7±0.5 | 92.4±0.9 | 52.2±0.5 | 57.2±0.4 | 80.5±0.5

Table 1: Overall comparison on multiple graph classification benchmarks. Results are reported as mean±std%; the best performance per dataset is achieved by the bolded entry and the runner-up is underlined in the original rendering. "-" indicates the result is not reported in the original papers.
Figure 4: Performance comparison of semi-supervised learning on ogbg-molhiv.

Semi-supervised representation learning. The semi-supervised representation learning results on ogbg-molhiv are shown in Figure 4. Our model clearly gains significant improvements under all three label-rate fine-tuning settings. We also notice that as the label rate increases, the amount of improvement increases as well (1%, 1.8%, and 4.4% for label rates 1%, 10%, and 20%, respectively). A possible explanation is that as more trainable data is included in fine-tuning when the label rate increases, so does the accompanying redundant information, which in turn deteriorates the performance even more; removing redundant information therefore yields a higher performance boost.

4.3 Ablation Study

To further verify the effectiveness of the different modules in ACDGCL, we perform ablation studies on each module by creating the model variants described below. The comparison results are shown in Table 2.

  • w/o Intra-view Recon. Reconstruction is only executed across views, i.e., only the cross-view pairs $(z_j^e, z_i^a)$ with $i \ne j$ are used to reconstruct $h_i$.

  • w/o Inter-view Recon. Reconstruction is only executed within the same view, i.e., only the intra-view pairs $(z_i^e, z_i^a)$ are used to reconstruct $h_i$.

  • w/o Adv. Training. The adversarial view is discarded from the contrastive loss.

From Table 2 we can see that our model, with the combination of the cross-view reconstruction and adversarial training modules, outperforms all of the variants. Discarding either reconstruction direction causes a failure to reach the optimal situation illustrated in Figure 3: the representation disentanglement cannot be guaranteed if we skip the inter-view reconstruction, and the sufficiency assumption may not hold if we abandon the intra-view reconstruction. Either way, the augmentation-invariant representations may suffer from severe information loss during contrastive learning, which further leads to performance deterioration. Compared with our full model, the variant w/o Adv. Training may bring too much redundant information to the downstream classifier, thereby creating more confusion. The relatively larger performance deterioration of the two variants w/o Intra-view Recon and w/o Inter-view Recon suggests a "better redundant than missing" rule: retaining some redundant information is better than losing part of the task-relevant information.

Variant | MUTAG | PTC-MR | COLLAB | NCI1 | PROTEINS | IMDB-B | RDT-B | IMDB-M | RDT-M5K | DD
w/o Intra Recon | 91.5±1.2 | 65.8±1.3 | 78.4±0.7 | 79.6±0.7 | 75.6±0.5 | 75.4±0.8 | 92.0±0.4 | 51.5±0.4 | 55.8±0.6 | 79.3±0.7
w/o Inter Recon | 91.0±0.9 | 64.7±1.4 | 78.0±0.8 | 78.7±1.2 | 74.9±0.7 | 75.0±0.6 | 91.1±0.7 | 50.8±0.2 | 55.6±0.4 | 79.0±0.8
w/o Adv. Training | 92.1±0.6 | 66.8±0.5 | 80.3±0.5 | 81.2±0.9 | 77.0±0.3 | 76.4±0.6 | 92.2±1.0 | 52.0±0.4 | 56.8±0.5 | 80.1±0.6
ACDGCL | 92.6±0.9 | 67.4±0.5 | 80.5±0.5 | 82.0±1.0 | 77.3±0.4 | 76.7±0.5 | 92.5±0.9 | 52.2±0.5 | 57.2±0.4 | 80.5±0.5

Table 2: Overall comparison of the model variants’ performance. Results are reported as mean±std%, the best performance is bolded.

4.4 Robustness Analysis

In this section, we conduct additional experiments on the ogbg-molhiv dataset to evaluate the effectiveness of our design in ensuring representation robustness under aggressive augmentation and perturbation. The results are shown in Figure 5. In the left two subplots, we plot accuracy versus edge perturbation and attribute masking strengths, respectively. Specifically, we keep GraphCL and our proposed ACDGCL under the same hyperparameter setting and set the $\lambda_1$ and $\lambda_2$ of ACDGCL to 5.0 and 0.5, respectively. From the results, we can see that ACDGCL not only consistently outperforms GraphCL but is also less affected by larger augmentation strengths. A similar observation can be made in the right two subplots, where we compare our method with GASSL under different perturbation bounds and attack steps to demonstrate its robustness against adversarial attacks. Since both our model and GASSL use GIN as the backbone network, we also include the performance of GIN as a baseline. Although aggressive adversarial attacks can largely deteriorate the performance, our proposed ACDGCL still achieves more robust performance than GASSL.

Figure 5: Performance versus augmentation strengths, perturbation bound and attack step.

5 Related Work

Graph contrastive learning. Contrastive learning was first proposed in the computer vision field [SimCLR] and has raised a surge of interest in self-supervised graph representation learning over the past few years. The principle behind contrastive learning is to utilize instance-level identity as supervision and maximize the consistency between positive pairs in the hidden space through a designed contrast mode. Previous graph contrastive learning works generally rely on various graph augmentation (transformation) techniques [DGI; GCC; MVGRL; GraphCL; InfoGraph] to generate positive pairs from the original data as similar samples. Recent works in this field try to improve the effectiveness of graph contrastive learning by finding more challenging views [AD-GCL; InfoGCL; JOAO] or adding adversarial perturbations [GASSL]. However, most of the existing methods contrast over entangled embeddings, where the complex intertwined information may pose obstacles to extracting useful information for downstream tasks. Our model is spared from this issue by contrasting over disentangled representations.

Disentangled representation learning on graphs. Disentangled representation learning arose in the computer vision field [hsieh_learning_2018; zhao_learning_2021] to disentangle the heterogeneous latent factors of representations, thereby making the representations more robust and interpretable [RLReview]. This idea has now been widely adopted in graph representation learning. IPGDN [IPGDN] and DisenGCN [DisenGCN] utilize a neighborhood routing mechanism to identify the latent factors in node representations, while other generative models [VGAE; GraphVAE] utilize variational autoencoders to balance reconstruction and disentanglement. The recent work DGCL [DGCL] extends the application of disentangled representation learning to self-supervised graph learning by contrasting the factorized representations. Although these methods gain significant benefits from representation disentanglement, the underlying excessive information could still overload the model, resulting in limited capacity. Our model targets this issue by removing the redundant information that is irrelevant to the graph property.

Graph information bottleneck. The information bottleneck (IB) [IB] has been widely adopted as a critical principle of representation learning. A representation that contains minimal yet sufficient information is considered to comply with the IB principle, and many works [VIB; blackbox; robust_rep] have empirically and theoretically shown that representations agreeing with the IB principle are both informative and robust. Recently, the IB principle has also been borrowed to guide representation learning on graph-structured data. Current methods [GIB; InfoGCL; AD-GCL] usually propose different regularization designs to learn compressed yet informative representations in accordance with the IB principle. In this work, we follow the information bottleneck principle to learn expressive and robust representations from disentangled information.

6 Conclusion

In this paper, we study graph representation learning in light of the information bottleneck. To reach the optimum we illustrate, we propose a novel model, namely ACDGCL, which is designed to disentangle the essential factors from the augmented graphs through a cross-view reconstruction mechanism, so that the information entanglement brought by augmentations does not cause the loss of predictive information during contrastive learning. We also add an adversarial view as the third view of contrastive learning to further remove redundant information and enhance representation robustness. In addition, we theoretically analyze the effectiveness of each component in our model and derive the objective based on the analysis. Extensive experiments on multiple graph benchmark datasets and under different settings prove the ability of ACDGCL to learn robust and informative graph representations. In the future, we plan to explore practical objectives that further decrease the upper bound of the mutual information between the disentangled representations, and more efficient training strategies that make the proposed model less time-consuming on large-scale graphs.

References

  • (1) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. JMLR, 2018.
  • (2) Bijaya Adhikari, Yao Zhang, Naren Ramakrishnan, and B. Aditya Prakash. Sub2Vec: Feature Learning for Subgraphs. In KDD, 2018.
  • (3) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep Variational Information Bottleneck. ICLR, 2017.
  • (4) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. TPAMI, 2013.
  • (5) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In ICML, 2020.
  • (6) Lijie Fan, Sijia Liu, Pin-Yu Chen, Gaoyuan Zhang, and Chuang Gan. When does contrastive learning preserve adversarial robustness from pretraining to finetuning? In NeurIPS, 2021.
  • (7) Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. In ICLR, 2020.
  • (8) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2014.
  • (9) Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. In KDD, 2016.
  • (10) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NeurIPS, 2017.
  • (11) William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
  • (12) Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive Multi-View Representation Learning on Graphs. In ICML, 2020.
  • (14) Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to Decompose and Disentangle Representations for Video Prediction. In NeurIPS, 2018.
  • (15) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In NeurIPS, 2020.
  • (16) Minseon Kim, Jihoon Tack, and Sung Ju Hwang. Adversarial self-supervised contrastive learning. In NeurIPS, 2020.
  • (17) Thomas N. Kipf and Max Welling. Variational Graph Auto-Encoders. In NeurIPS, 2016.
  • (18) Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, 2017.
  • (19) Haoyang Li, Xin Wang, Ziwei Zhang, Zehuan Yuan, Hang Li, and Wenwu Zhu. Disentangled Contrastive Learning on Graphs. In NeurIPS, 2021.
  • (20) Yanbei Liu, Xiao Wang, Shu Wu, and Zhitao Xiao. Independence promoted graph disentangled networks. In AAAI, 2020.
  • (21) Jianxin Ma, Peng Cui, Kun Kuang, Xin Wang, and Wenwu Zhu. Disentangled Graph Convolutional Networks. In ICML, 2019.
  • (22) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
  • (23) Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. Invariant representations without adversarial training. NeurIPS, 2018.
  • (24) A. Narayanan, Mahinthan Chandramohan, R. Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. graph2vec: Learning Distributed Representations of Graphs. arXiv preprint, 2017.
  • (25) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In KDD, 2020.
  • (26) Ravid Shwartz-Ziv and Naftali Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv preprint arXiv:1703.00810, 2017.
  • (27) Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In ICLR, 2018.
  • (28) Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR, 2019.
  • (29) Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. Adversarial Graph Augmentation to Improve Graph Contrastive Learning. In NeurIPS, 2021.
  • (30) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • (31) Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In ICLR, 2019.
  • (32) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv e-prints, 2018.
  • (33) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018.
  • (34) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep Graph Infomax. In ICLR, 2019.
  • (35) Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. Graph Information Bottleneck. In NeurIPS. Curran Associates, Inc., 2020.
  • (36) Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In CVPR, 2018.
  • (37) Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and Quoc V Le. Adversarial examples improve image recognition. In CVPR, 2020.
  • (38) Dongkuan Xu, Wei Cheng, Dongsheng Luo, Haifeng Chen, and Xiang Zhang. InfoGCL: Information-Aware Graph Contrastive Learning. In NeurIPS, 2021.
  • (39) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In ICLR, 2019.
  • (40) Longqi Yang, Liangliang Zhang, and Wenjing Yang. Graph Adversarial Self-Supervised Learning. In NeurIPS, 2021.
  • (41) Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. Graph contrastive learning automated. In ICLR, 2021.
  • (42) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph Contrastive Learning with Augmentations. In NeurIPS, 2020.
  • (43) Junchi Yu, Tingyang Xu, Yu Rong, Yatao Bian, Junzhou Huang, and Ran He. Graph Information Bottleneck for Subgraph Recognition. In ICLR, 2020.
  • (44) Long Zhao, Yuxiao Wang, Jiaping Zhao, Liangzhe Yuan, Jennifer J. Sun, Florian Schroff, Hartwig Adam, Xi Peng, Dimitris Metaxas, and Ting Liu. Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization. In CVPR, 2021.
  • (45) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. In ICLR, 2020.
  • (46) Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An Empirical Study of Graph Contrastive Learning. arXiv.org, 2021.