Group Contrastive Self-Supervised Learning on Graphs

07/20/2021 · by Cheng Deng, et al.

We study self-supervised learning on graphs using contrastive methods. A general scheme of prior methods is to optimize two-view representations of input graphs. In many studies, a single graph-level representation is computed as one of the contrastive objectives, capturing limited characteristics of graphs. We argue that contrasting graphs in multiple subspaces enables graph encoders to capture more abundant characteristics. To this end, we propose a group contrastive learning framework in this work. Our framework embeds the given graph into multiple subspaces, where each representation is prompted to encode specific characteristics of graphs. To learn diverse and informative representations, we develop principled objectives that enable us to capture the relations among both intra-space and inter-space representations in groups. Under the proposed framework, we further develop an attention-based representor function to compute representations that capture different substructures of a given graph. Built upon our framework, we extend two current methods into GroupCL and GroupIG, equipped with the proposed objective. Comprehensive experimental results show that our framework achieves a promising boost in performance on a variety of datasets. In addition, our qualitative results show that features generated from our representor successfully capture various specific characteristics of graphs.


1 Introduction

With the advances of deep learning, graph neural networks (GNNs) [19, 8, 34, 21] have been developed for learning from graph-structured data such as molecules and proteins. GNNs have achieved great success on various graph learning tasks [34, 8, 9, 48, 28, 39]. However, such success hinges on large amounts of labeled data, which are expensive to obtain and sometimes unavailable. To mitigate the dependence on labels, self-supervised learning (SSL) [41, 38, 46, 3, 26, 6] has been proposed to use supervision from the graphs themselves. SSL was initially investigated for unsupervised image tasks [3, 26, 6, 41] and later successfully applied to unsupervised sequence tasks [5, 37, 45]. Inspired by its success in both image and sequence domains, a variety of SSL methods based on GNNs have been proposed [46, 31, 32, 49, 15, 27, 35, 11, 36, 18]. Among them, contrastive learning (CL) methods [46, 31, 32, 49, 15, 27, 35] are becoming the mainstream approaches in this field. CL methods train models on pretext tasks that encode the agreement between two views of representations. These two views can be global-local pairs [31, 12, 33, 13] or differently transformed graph data [46, 12, 38, 28]. The learning goal is to make the two-view representations similar if they are from the same graph and dissimilar if they are from different graphs.

Existing CL methods usually compute graph representations from a single perspective as components of the contrastive objectives [46, 31, 49, 32]. However, we argue that contrasting two graphs in multiple subspaces has the potential of capturing more abundant characteristics of graphs. In the domains of proteins and images, studying data from multiple perspectives has been shown to effectively capture powerful features. For example, previous studies [29, 20] extract image features by progressively investigating the similarities of various local parts. Other methods [47, 16] study the properties of protein sequences by exploring diverse gene sub-sequences. Based on this motivation, we propose a group contrastive learning framework for graphs in this work. Different from existing methods that perform contrastive learning in a single space, our framework embeds a given graph into representations in various subspaces and performs contrastive learning in each subspace. For simplicity, we refer to a group as a set of representations of different graph views within the same subspace. Note that when the number of groups is set to one, our framework reduces to prior methods without group contrast.

We investigate the agreement within each group, which encourages each representation to encode one specific characteristic of the given graph. We design our objective based on mutual information (MI) to capture both intra-space and inter-space relations. More specifically, we propose to maximize the MI between the two views of representations in the same group while minimizing the MI between the representations of one view across different groups. To enable the optimization, we derive the MI bounds under both parametric and nonparametric cases. Under the proposed framework, we further develop an attention-based representor function to compute multiple representations, each of which is encouraged to focus on some specific nodes, thereby encoding one specific substructure. Note that the idea of using groups has been shown to be effective in the image domain [2, 24, 17, 44], and here we successfully apply it to the graph SSL task.

Built upon our framework, we extend two previous methods into GroupCL and GroupIG. We evaluate the effectiveness of these two methods on both unsupervised graph classification and transfer learning tasks. Comprehensive quantitative results demonstrate that our methods achieve new state-of-the-art performance on a majority of datasets compared with previous methods. In particular, GroupCL and GroupIG consistently show superior performance compared with the corresponding non-grouping methods across different datasets and tasks. Furthermore, we conduct a visualization experiment using the attention weights learned by our representor function; the qualitative results intuitively illustrate the substructures captured by the various representations.

2 Background and Related Work

2.1 Notations and Problem Setup

Let $G = (V, E)$ denote a graph with the node set $V$ and the edge set $E$, where $n = |V|$ denotes the number of nodes in this graph. The node set is initially represented by a node feature matrix $X \in \mathbb{R}^{n \times d}$, where $d$ denotes the feature dimension. The edge set is represented by an adjacency matrix $A \in \{0, 1\}^{n \times n}$, of which the element $A_{ij}$ is determined by whether $(v_i, v_j) \in E$. We are interested in the unsupervised graph representation learning task. Given graph data $G$, the goal is to learn a graph-level encoder $f$ to encode the graph into a high-level representation of dimension $d'$, formulated as $z = f(G) \in \mathbb{R}^{d'}$. In particular, a graph-level encoder usually consists of a node encoder $f_n$ and a readout function $r$, where $p$ denotes the dimension of node embeddings. The node encoder computes the node embedding matrix $H \in \mathbb{R}^{n \times p}$ from $G$, i.e., $H = f_n(G)$. The readout function then summarizes the node embeddings into the desired graph-level embedding, i.e., $z = r(H)$.

2.2 Graph Neural Networks

Graph neural networks (GNNs) [43, 19, 34] have demonstrated their effectiveness in learning representations of graph-structured data such as molecules and social networks. GNNs iteratively update the representation of each node by aggregating information from its neighbors, aiming to capture local structural information. For each node, a single GNN layer aggregates information from its $1$-hop neighborhood; stacking $K$ aggregation layers hence enables each node representation to capture information within the $K$-hop neighborhood. To be concrete, the update of the $k$-th layer in a $K$-layer GNN can be described as

$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\Big(h_v^{(k-1)},\ \mathrm{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big)\Big), \quad (1)$$

where $h_v^{(k)}$ denotes the feature vector of node $v$ at the $k$-th layer, the initial $h_v^{(0)}$ is set to the input node features $x_v$, $h_v^{(K)}$ denotes the ultimate representation of node $v$, and $\mathcal{N}(v)$ is the set of vertices that connect to node $v$. There are different types of $\mathrm{AGGREGATE}$ and $\mathrm{COMBINE}$ functions. For instance, GIN [43] integrates the two functions as

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big(\big(1 + \epsilon^{(k)}\big) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big), \quad (2)$$

where $\epsilon^{(k)}$ can be a learnable parameter or a fixed scalar. For graph-level tasks, a $\mathrm{READOUT}$ function is employed to summarize the ultimate node embeddings, and an optional projection head $g(\cdot)$, usually a multi-layer perceptron (MLP), performs a projection to generate the graph-level embedding. Mathematically,

$$z = g\Big(\mathrm{READOUT}\big(\{h_v^{(K)} : v \in V\}\big)\Big). \quad (3)$$
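To make the update rule concrete, the following is a minimal PyTorch sketch of the GIN layer in Equation (2) operating on a dense adjacency matrix. It is an illustrative re-implementation under our own naming (`GINLayer`), not the authors' code; a practical implementation would use sparse message passing, e.g., via PyTorch Geometric.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN update (Equation (2)) on a dense adjacency matrix."""
    def __init__(self, in_dim, out_dim, eps=0.0):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps))  # learnable epsilon
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, h, adj):
        # h: (n, in_dim) node features; adj: (n, n) binary adjacency matrix
        neighbor_sum = adj @ h  # sums features over N(v) for every node v
        return self.mlp((1 + self.eps) * h + neighbor_sum)
```

A sum readout over the final node embeddings, followed by the projection head $g$, then yields the graph-level embedding of Equation (3).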

2.3 Graph Contrastive Learning

We take advantage of the contrastive learning (CL) technique to address the unsupervised learning problem. We describe the graph contrastive learning framework in a two-view case. Contrastive learning is performed between the representations of two views $\alpha$ and $\beta$. It aims to enlarge the agreement between representations of positive pairs, i.e., two views associated with the same graph instance, and weaken that of negative pairs, i.e., two views associated with different instances. Views $\alpha$ and $\beta$ of a graph $G$ are usually generated by data transformation functions, denoted by $\mathcal{T}_\alpha(\cdot)$ and $\mathcal{T}_\beta(\cdot)$. Previous studies implement $\mathcal{T}_\alpha$ and $\mathcal{T}_\beta$ in diverse manners; we introduce two concrete approaches to generating views in Sections 3.5 and 3.6, respectively.

3 Group Contrastive Learning on Graphs

In self-supervised learning, the contrastive learning technique has demonstrated its capability of learning representations without labels. Some previous studies perform the contrastive process between the representations of two augmented views or of a global-local pair. However, they share a common drawback: they contrast two objectives in a single space, which captures limited characteristics. For images and proteins, previous approaches study the objectives by contrasting various local parts and sub-sequences, respectively. Thus, we argue that contrasting two graphs in multiple subspaces has the potential of capturing more abundant characteristics of graphs. To this end, we propose a graph group contrastive learning framework. Overall, our framework embeds the given graph into various subspaces, resulting in multiple groups of representations for the two views. We investigate the agreement of representations group-wise, which aims at enabling these representations to encode various characteristics of the given graph.

3.1 The Proposed Group Contrastive Learning Framework

Our proposed framework considers the two views $\alpha$ and $\beta$ as the main view and the auxiliary view, respectively, in which case the processing procedures may be asymmetric. For each view branch, our framework performs data transformations on the given graph to obtain the view and employs an individual GNN-based encoder to compute the corresponding multiple representations. In particular, the main branch computes multiple representations based on different learnable parameters, formulated as

$$\big\{z_\alpha^{(1)}, \dots, z_\alpha^{(m)}\big\} = f_\alpha\big(\mathcal{T}_\alpha(G)\big), \quad (4)$$

where $m$ denotes the number of representations and $z_\alpha^{(i)}$ is the $i$-th graph-level embedding. Here, $f_\alpha$ is a multi-representation encoder which seeks to encode multiple characteristics of the given graph. In contrast, there are two approaches to computing the multiple representations for the auxiliary view $\beta$. One can employ an encoder that computes only one representation, and then duplicate it $m$ times to obtain $m$ feature vectors, given by

$$z_\beta = f_\beta\big(\mathcal{T}_\beta(G)\big), \quad (5)$$
$$z_\beta^{(1)} = z_\beta^{(2)} = \cdots = z_\beta^{(m)} = z_\beta, \quad (6)$$

where $f_\beta$ is the encoder of view $\beta$. Alternatively, one can use the same type of encoder as the $\alpha$ branch and produce multiple feature vectors directly, given by

$$\big\{z_\beta^{(1)}, \dots, z_\beta^{(m)}\big\} = f_\beta\big(\mathcal{T}_\beta(G)\big). \quad (7)$$

Given the representations of the two views, we formulate $m$ groups of representations, i.e., $\big\{\big(z_\alpha^{(i)}, z_\beta^{(i)}\big)\big\}_{i=1}^{m}$, and perform $m$ contrastive learning studies between the representations within each group.

During prediction, we input one graph into the learned multi-representation encoder of the main branch without performing transformations, resulting in $m$ feature vectors. We then combine these vectors to jointly represent the graph. Here, we adopt the simple concatenation operation. Formally,

$$\big\{z^{(1)}, \dots, z^{(m)}\big\} = f_\alpha(G), \quad (8)$$
$$z = \mathrm{CONCAT}\big(z^{(1)}, \dots, z^{(m)}\big), \quad (9)$$

where $z$ is a vector of size $d'$, computed by concatenating the $m$ vectors of size $d'/m$.

3.2 Intra-Space Objective Function

As introduced in Section 3.1, the output of our group contrastive learning framework is $m$ groups of representations regarding the two views, i.e., $\big\{\big(z_\alpha^{(i)}, z_\beta^{(i)}\big)\big\}_{i=1}^{m}$. Our goal is to optimize the encoders such that these vectors encode diverse characteristics of the input graph. To achieve this goal, we employ two essential objectives based on mutual information (MI), regarding intra-space representation pairs and inter-space representation pairs, respectively.

For the intra-space objective, we seek to maximize the mutual information between the representations of the two views within each group. The intra-space MI maximization enables the learning of an informative representation for each individual group. In particular, the paired representations are $\big(z_\alpha^{(i)}, z_\beta^{(i)}\big)$, $i = 1, \dots, m$. We hence formulate the intra-space objective as

$$\max_{\theta_\alpha, \theta_\beta} \sum_{i=1}^{m} I\big(z_\alpha^{(i)}; z_\beta^{(i)}\big), \quad (10)$$

where $\theta_\alpha$ and $\theta_\beta$ are the parameters of the encoders in the main-view branch and the auxiliary-view branch, respectively. As the mutual information becomes intractable when the distributions of $z_\alpha^{(i)}$ and $z_\beta^{(i)}$ are unknown, a common substitute for maximizing MI is to maximize a lower-bound estimate based on sampled examples [42, 23]. Among existing MI lower bounds, we adopt the Jensen-Shannon estimator of MI, which is computed as

$$I^{\mathrm{JS}}\big(z_\alpha^{(i)}; z_\beta^{(i)}\big) = \mathbb{E}_{p(z_\alpha^{(i)}, z_\beta^{(i)})}\Big[-\mathrm{sp}\big(-T(z_\alpha^{(i)}, z_\beta^{(i)})\big)\Big] - \mathbb{E}_{p(z_\alpha^{(i)})\,p(z_\beta^{(i)})}\Big[\mathrm{sp}\big(T(z_\alpha^{(i)}, z_\beta^{(i)})\big)\Big], \quad (11)$$

where $\mathrm{sp}(x) = \log(1 + e^x)$ denotes the softplus function and $T(\cdot, \cdot)$ is a discriminator that takes the representations of the two views as inputs and scores the agreement between them. We simply instantiate the discriminator as the dot product between the two representations, i.e., $T\big(z_\alpha^{(i)}, z_\beta^{(i)}\big) = z_\alpha^{(i)\top} z_\beta^{(i)}$.
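As an illustration, the Jensen-Shannon estimator of Equation (11) with the dot-product discriminator can be sketched as follows; the batching convention (row $i$ of the two views forms the positive pair, all other rows serve as negative samples) is our assumption for the sketch, not a detail stated above.

```python
import torch
import torch.nn.functional as F

def jsd_mi_lower_bound(z_a, z_b):
    """Jensen-Shannon MI estimator (Equation (11)), dot-product discriminator.
    z_a, z_b: (batch, dim) representations of the two views; row i of z_a
    and row i of z_b are assumed to come from the same graph."""
    scores = z_a @ z_b.t()  # pairwise agreement scores T(z_a, z_b)
    pos_mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    joint = -F.softplus(-scores[pos_mask])    # -sp(-T) on positive (joint) pairs
    marginal = F.softplus(scores[~pos_mask])  # sp(T) on negative (marginal) pairs
    return joint.mean() - marginal.mean()     # maximize this lower bound
```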

3.3 Inter-Space Objective Function

In addition to the intra-space objective, we also constrain the pairwise relations across different groups of the same view to enforce inter-space diversity. In particular, we propose to employ an inter-space optimization objective based on mutual information minimization. The objective prompts any two representations in the same view to capture different characteristics of the given graph. The inter-space objective focuses on each pair of representations across different groups within view $\alpha$, i.e., $\big(z_\alpha^{(i)}, z_\alpha^{(j)}\big)$ with $i \ne j$, and can be formulated as

$$\min_{\theta_\alpha} \sum_{i \ne j} I\big(z_\alpha^{(i)}; z_\alpha^{(j)}\big). \quad (12)$$

In order to minimize the MI, we introduce an upper bound of MI as an efficient estimate. The upper bound we adopt here is based on the contrastive log-ratio upper bound [4] (CLUB). Concretely, for two random variables $x$ and $y$, the upper bound of MI is defined as

$$I^{\mathrm{CLUB}}(x; y) = \mathbb{E}_{p(x, y)}\big[\log p(y \mid x)\big] - \mathbb{E}_{p(x)}\,\mathbb{E}_{p(y)}\big[\log p(y \mid x)\big]. \quad (13)$$

To enable the computation of CLUB, a key challenge is to model the intractable conditional distribution $p(y \mid x)$. We propose two approaches to model $p(y \mid x)$, i.e., a non-parameterized estimation and a parameterized estimation, based on whether the same dimensions of $x$ and $y$ correspond to each other across different representations. For both cases, we follow [4] and assume that the distribution of $y$ conditional on $x$ is a Gaussian distribution, of which the mean and variance are determined depending on the concrete situation.

Non-parameterized estimation. We first consider the case where a correspondence exists between each dimension of $x$ and $y$, two random vectors serving as different representations of a given graph. In other words, the same dimensions of the two vectors $x$ and $y$ are associated with the same specific feature. In this case, we introduce an assumption to simplify the computation: the expectation of $y$ conditional on $x$ equals $x$. Such an assumption commonly holds in many scenarios such as noise models, where $y$ is considered as observed values and $x$ as the corresponding signals. We further assume that the covariance is a diagonal matrix with the same values on its diagonal, i.e., each dimension of $y$ only depends on the corresponding dimension of $x$ and has equal variance. Concretely, for two representations $z^{(i)}$ and $z^{(j)}$, the distribution of $z^{(j)}$ conditional on $z^{(i)}$ is subject to

$$p\big(z^{(j)} \mid z^{(i)}\big) = \prod_{k=1}^{d} \mathcal{N}\big(z_k^{(j)} \,\big|\, z_k^{(i)},\, \sigma^2\big), \quad (14)$$

where $\sigma^2$ denotes the variance shared by all dimensions, $z_k^{(i)}$ and $z_k^{(j)}$ denote the $k$-th dimensions of $z^{(i)}$ and $z^{(j)}$, respectively, and $d$ denotes the dimension of every representation. In Section 4, we further empirically demonstrate the effectiveness of models under the above assumptions. Given the assumptions, we are able to simplify the CLUB objective for a more efficient and specific computation of the MI upper bound. We first rewrite Equation (13) into

$$I^{\mathrm{CLUB}}\big(z^{(i)}; z^{(j)}\big) = \frac{1}{2\sigma^2}\Big(\mathbb{E}_{p(z^{(i)})\,p(z^{(j)})}\big[\|z^{(j)} - z^{(i)}\|^2\big] - \mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\|z^{(j)} - z^{(i)}\|^2\big]\Big). \quad (15)$$

Next, we apply $\ell_2$-normalization to each representation and rewrite Equation (15) with further simplification, formulated as

$$I^{\mathrm{CLUB}}\big(z^{(i)}; z^{(j)}\big) = \frac{1}{\sigma^2}\Big(\mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\hat{z}^{(i)\top} \hat{z}^{(j)}\big] - \mathbb{E}_{p(z^{(i)})\,p(z^{(j)})}\big[\hat{z}^{(i)\top} \hat{z}^{(j)}\big]\Big), \quad (16)$$

where $\hat{z} = z / \|z\|_2$ denotes the $\ell_2$-normalized representation. Based on this formulation, we can develop Equation (16) as

$$I^{\mathrm{CLUB}}\big(z^{(i)}; z^{(j)}\big) \le \frac{1}{\sigma^2}\Big(\mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\hat{z}^{(i)\top} \hat{z}^{(j)}\big] + 1\Big), \quad (17)$$

where the inequality is derived based on $\hat{z}^{(i)\top} \hat{z}^{(j)} \ge -1$ for unit vectors. Our goal of minimizing $I^{\mathrm{CLUB}}$ is hence equivalent to minimizing $\mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\hat{z}^{(i)\top} \hat{z}^{(j)}\big]$. To be consistent with Equation (11), we also employ the softplus (sp) function and obtain the final objective

$$\mathcal{L}_{\mathrm{inter}}\big(z^{(i)}, z^{(j)}\big) = \mathbb{E}_{p(z^{(i)}, z^{(j)})}\Big[\mathrm{sp}\big(\hat{z}^{(i)\top} \hat{z}^{(j)}\big)\Big]. \quad (18)$$

Note that in Equation (17), we apply a constant upper bound to the term $-\mathbb{E}_{p(z^{(i)})\,p(z^{(j)})}\big[\hat{z}^{(i)\top} \hat{z}^{(j)}\big]$. This is because retaining this term potentially results in a reversal effect against the intra-space objective in Equation (11): Equation (11) enlarges the agreement between $z_\alpha^{(i)}$ and $z_\beta^{(i)}$ under the joint distribution, while minimizing the retained term would enlarge the agreement between $z_\alpha^{(i)}$ and $z_\alpha^{(j)}$ under the product of marginals. These two goals encourage the distribution of $z_\alpha^{(i)}$ to be similar to two different distributions, being mutually incompatible.
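Under the assumptions above, the resulting inter-space loss of Equation (18) reduces to a softplus of dot products between $\ell_2$-normalized group representations. A minimal sketch, assuming each group is given as a (batch, dim) tensor:

```python
import torch.nn.functional as F

def inter_space_loss(groups):
    """Non-parameterized CLUB surrogate (Equation (18)) over all group pairs.
    groups: list of m tensors of shape (batch, dim) from the same view."""
    zs = [F.normalize(z, dim=-1) for z in groups]  # l2-normalize each representation
    m, loss = len(zs), 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                # softplus of the dot product between jointly sampled pairs
                loss = loss + F.softplus((zs[i] * zs[j]).sum(dim=-1)).mean()
    return loss / (m * (m - 1))
```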

Parameterized estimation. We then consider the second case, where there is no correspondence between the dimensions of $z^{(i)}$ and $z^{(j)}$. In this case, we are required to estimate the mean and variance of the conditional Gaussian distribution by learning a parameterized variational distribution $q_\theta\big(z^{(j)} \mid z^{(i)}\big)$ [4]. Concretely, we employ two independent multi-layer perceptrons (MLPs) to generate the mean and the variance, respectively. Then Equation (14) can be rewritten as

$$q_\theta\big(z^{(j)} \mid z^{(i)}\big) = \prod_{k=1}^{d} \mathcal{N}\Big(z_k^{(j)} \,\Big|\, f_\mu\big(z^{(i)}\big)_k,\ f_\sigma\big(z^{(i)}\big)_k^2\Big), \quad (19)$$

where $f_\mu$ and $f_\sigma$ are two MLPs generating the mean and the standard deviation, respectively, and $\theta_\mu$ and $\theta_\sigma$ are the parameters of $f_\mu$ and $f_\sigma$, respectively. For simplicity, we denote $f_\mu\big(z^{(i)}\big)$ as $\mu$ and $f_\sigma\big(z^{(i)}\big)$ as $\sigma$. We rewrite Equation (19) in its log-form as

$$\log q_\theta\big(z^{(j)} \mid z^{(i)}\big) = -\sum_{k=1}^{d} \Bigg(\frac{\big(z_k^{(j)} - \mu_k\big)^2}{2\sigma_k^2} + \log \sigma_k\Bigg) + C, \quad (20)$$

where $z_k^{(j)}$, $\mu_k$, and $\sigma_k$ are the $k$-th dimensions of $z^{(j)}$, $\mu$, and $\sigma$, respectively. Furthermore, $C = -\frac{d}{2}\log(2\pi)$ is a constant. Based on this formulation, we can further derive the CLUB as follows,

$$I^{\mathrm{CLUB}}\big(z^{(i)}; z^{(j)}\big) = \mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\log q_\theta\big(z^{(j)} \mid z^{(i)}\big)\big] - \mathbb{E}_{p(z^{(i)})}\,\mathbb{E}_{p(z^{(j)})}\big[\log q_\theta\big(z^{(j)} \mid z^{(i)}\big)\big]. \quad (21)$$

We remove the second term in this equation in the same manner as used in Equation (17), so the MI minimization problem is equivalent to

$$\min_{\theta_\alpha}\ \mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\log q_\theta\big(z^{(j)} \mid z^{(i)}\big)\big]. \quad (22)$$

To determine the parameters used to generate the mean and variance, i.e., $\theta_\mu$ and $\theta_\sigma$, we maximize the likelihood by

$$\max_{\theta_\mu, \theta_\sigma}\ \mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\log q_\theta\big(z^{(j)} \mid z^{(i)}\big)\big]. \quad (23)$$

To sum up, the parameters involved in the conditional distribution and the encoder are trained adversarially by

$$\min_{\theta_\alpha}\ \max_{\theta_\mu, \theta_\sigma}\ \mathbb{E}_{p(z^{(i)}, z^{(j)})}\big[\log q_\theta\big(z^{(j)} \mid z^{(i)}\big)\big]. \quad (24)$$
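A sketch of the parameterized estimator follows; the two-layer MLPs, the hidden width, and the batch-shuffling estimate of the marginal term are our assumptions, following the general recipe of CLUB [4] rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class VariationalCLUB(nn.Module):
    """Parameterized CLUB (Equations (19)-(24)): MLPs model q(z_j | z_i)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.f_mu = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.f_logvar = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def log_q(self, z_i, z_j):
        # log q(z_j | z_i) up to the constant C = -d/2 * log(2*pi), Equation (20)
        mu, logvar = self.f_mu(z_i), self.f_logvar(z_i)
        return (-(z_j - mu) ** 2 / (2 * logvar.exp()) - logvar / 2).sum(dim=-1)

    def encoder_loss(self, z_i, z_j):
        # joint term of Equation (22): minimized w.r.t. the encoder parameters
        return self.log_q(z_i, z_j).mean()

    def estimator_loss(self, z_i, z_j):
        # negative likelihood of Equation (23): minimized w.r.t. f_mu, f_logvar
        return -self.log_q(z_i, z_j).mean()
```

In training, the two losses are applied alternately, realizing the adversarial scheme of Equation (24).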
Fig. 1: The pipelines of GraphCL and GroupCL. The two methods share the same trunk, in which the data are successively processed by the data augmentation and the GNN encoder. GroupCL employs a representor function to generate multiple graph-level embeddings for the two views. The MI-maximizing objectives are applied to representations in the same group from different views; the MI-minimizing objectives are applied to representations in different groups from the same view. GraphCL adopts a sum pooling and a projection head to generate one graph embedding.

3.4 The Overall Objective Function

We combine the intra-space and inter-space objectives and obtain the final objective as

$$\min_{\theta_\alpha, \theta_\beta}\ -\sum_{i=1}^{m} I\big(z_\alpha^{(i)}; z_\beta^{(i)}\big) + \lambda \sum_{i \ne j} I\big(z_\alpha^{(i)}; z_\alpha^{(j)}\big), \quad (25)$$

which is approximated by the following objectives based on the corresponding bounds. We can either adopt the non-parameterized upper bound to form the final objective as

$$\mathcal{L} = -\sum_{i=1}^{m} I^{\mathrm{JS}}\big(z_\alpha^{(i)}; z_\beta^{(i)}\big) + \lambda \sum_{i \ne j} \mathbb{E}_{p(z_\alpha^{(i)}, z_\alpha^{(j)})}\Big[\mathrm{sp}\big(\hat{z}_\alpha^{(i)\top} \hat{z}_\alpha^{(j)}\big)\Big], \quad (26)$$

or the parameterized bound to form the final objective as

$$\mathcal{L} = -\sum_{i=1}^{m} I^{\mathrm{JS}}\big(z_\alpha^{(i)}; z_\beta^{(i)}\big) + \lambda \sum_{i \ne j} \mathbb{E}_{p(z_\alpha^{(i)}, z_\alpha^{(j)})}\Big[\log q_\theta\big(z_\alpha^{(j)} \mid z_\alpha^{(i)}\big)\Big], \quad (27)$$

where $\lambda$ is a hyperparameter that balances the influences of the intra-space and inter-space objectives.
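Reusing the sketches above, the non-parameterized overall objective of Equation (26) can be assembled as a single loss; the averaging over groups is our choice for the sketch.

```python
def group_contrastive_loss(za_groups, zb_groups, lam=0.5):
    """Overall objective (Equation (26)): JSD intra-space terms plus the
    lambda-weighted non-parameterized inter-space penalty on the main view.
    za_groups, zb_groups: lists of m (batch, dim) tensors, one per group."""
    intra = sum(jsd_mi_lower_bound(za, zb) for za, zb in zip(za_groups, zb_groups))
    return -intra / len(za_groups) + lam * inter_space_loss(za_groups)
```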

3.5 GroupCL: GraphCL with Group Contrast

Built upon our framework, we extend the previous approach GraphCL into GroupCL, equipped with the proposed group contrastive objective. Figure 1 compares the frameworks of the original GraphCL and the extended variant. The two frameworks share the same data augmentation approach. Different from the graph-level encoder of GraphCL that only computes a single representation vector, the GroupCL encoder generates multiple vectors as representations. In particular, to control the computational cost, the generation of multiple representations shares the same node encoder and involves a parameterized representor function that computes multiple graph representations from node embeddings of a given graph.

We first recap the computation of node embeddings, which is a common part shared by GraphCL and GroupCL. For both methods, the computing procedures of views $\alpha$ and $\beta$ are symmetric. We therefore introduce the procedure for view $\alpha$; that for view $\beta$ is similar. For view $\alpha$, GroupCL first performs a random data augmentation $\mathcal{T}_\alpha$ on the input graph $G$ to generate the view $G_\alpha$. The node encoder $f_n$ then encodes each node of $G_\alpha$ into the node embeddings $H_\alpha$. These two steps are summarized as

$$H_\alpha = f_n\big(\mathcal{T}_\alpha(G)\big). \quad (28)$$

Given the node embeddings $H_\alpha$, we take advantage of the attention mechanism to capture the information from different node combinations, where each graph representation depends greatly on the heavily attended nodes with respect to different queries. We hence instantiate the proposed representor function as an attention-based function $R(\cdot)$ to capture information and encode diversified substructures of the graph. Figure 2 shows the complete computation pipeline of the proposed representor function $R$. It employs two independent linear projections to map the node embedding matrix $H$ into two matrices, the key matrix $K$ and the value matrix $V$. Mathematically,

$$K = H W_K, \qquad V = H W_V, \quad (29)$$

where $W_K \in \mathbb{R}^{p \times d_K}$ and $W_V \in \mathbb{R}^{p \times d_V}$ denote the projection matrices corresponding to the key and value, respectively, and $d_K$ and $d_V$ are their output dimensions.

Fig. 2: The computation pipeline of $R$. It takes the node embedding matrix as input and outputs multiple graph-level embeddings. The node embedding matrix is first projected into the Key and Value matrices by two independent linear functions. Each vector in the Query matrix takes a dot product with the Key, yielding a vector of attention weights; the attention vector then performs a weighted sum over the Value, generating one embedding.

The representor function seeks to combine the nodes differently and thereupon capture different graph substructures. To this end, we introduce a set of query vectors $Q = [q_1, \dots, q_m]^\top \in \mathbb{R}^{m \times d_K}$, where each query induces one specific representation and the number of query vectors $m$ determines the number of groups in the proposed objective. In addition, the dimension of each query equals the number of columns of $K$, i.e., $d_K$. This matrix is composed of trainable parameters, which are randomly initialized and trained along with the parameters in the node encoder. Each query $q_i$ is then employed to attend over the key $K$, producing the node-wise attention weights used for the computation of representations. A larger attention weight indicates a more informative node with respect to the corresponding query vector. We normalize the attention weights along the node dimension and finally perform a weighted summation over the value $V$ based on the attention weights to obtain the graph embedding. Mathematically,

$$a_i = \mathrm{softmax}\big(K q_i\big), \quad (30)$$
$$z^{(i)} = V^\top a_i, \quad (31)$$

where we let $d_V = d'/m$, as $d_V$ determines the dimension of each embedding.
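The representor admits a compact implementation; the following sketch mirrors Equations (29)-(31), with bias-free linear layers and a randomly initialized query matrix as described above (the class name and exact initialization are ours).

```python
import torch
import torch.nn as nn

class Representor(nn.Module):
    """Attention-based representor R: m queries over node embeddings."""
    def __init__(self, node_dim, key_dim, num_groups, group_dim):
        super().__init__()
        self.w_key = nn.Linear(node_dim, key_dim, bias=False)          # W_K
        self.w_value = nn.Linear(node_dim, group_dim, bias=False)      # W_V
        self.queries = nn.Parameter(torch.randn(num_groups, key_dim))  # Q, trainable

    def forward(self, h):
        # h: (n, node_dim) node embeddings of one graph
        keys = self.w_key(h)                                    # (n, key_dim)
        values = self.w_value(h)                                # (n, group_dim)
        attn = torch.softmax(self.queries @ keys.t(), dim=-1)   # (m, n), per-query over nodes
        return attn @ values  # (m, group_dim): one graph-level embedding per group
```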

Given the representor function $R$, the representations of the $\alpha$ and $\beta$ views are computed as

$$\big\{z_\alpha^{(i)}\big\}_{i=1}^{m} = R(H_\alpha), \qquad \big\{z_\beta^{(i)}\big\}_{i=1}^{m} = R(H_\beta). \quad (32)$$

We then compute the group contrastive loss on the representations $\big\{z_\alpha^{(i)}\big\}$ and $\big\{z_\beta^{(i)}\big\}$ in groups by Equation (26) and back-propagate it to optimize the model.

To sum up, $R$ generates graph-level representations associated with different combinations of nodes, where each combination leads to an individual substructural representation. Subject to our optimization objectives, the multiple representations are prompted to focus on different and informative combinations of nodes and thereupon encode different subgraph patterns.
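Putting the pieces together, one GroupCL training step can be sketched as below; `augment` and `encode` (the GNN followed by the representor, returning a (batch, m, d'/m) tensor of group embeddings) are hypothetical helpers standing in for the components described above.

```python
def groupcl_step(graphs, augment, encode, optimizer, lam=0.5):
    """One GroupCL step: augment twice, encode, group, contrast, update."""
    z_a = encode(augment(graphs))  # main view (Equation (28) followed by R)
    z_b = encode(augment(graphs))  # auxiliary view
    m = z_a.size(1)
    za_groups = [z_a[:, i] for i in range(m)]  # one (batch, d'/m) tensor per group
    zb_groups = [z_b[:, i] for i in range(m)]
    loss = group_contrastive_loss(za_groups, zb_groups, lam)  # Equation (26)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```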

Dataset NCI1 PROTEINS DD MUTAG COLLAB RDT-B RDT-M5K IMDB-B
GL - - - 81.66 ± 2.11 - 77.34 ± 0.18 41.01 ± 0.17 65.87 ± 0.98
WL 80.01 ± 0.50 72.92 ± 0.56 - 80.72 ± 3.00 - 68.82 ± 0.41 46.06 ± 0.21 72.30 ± 3.44
DGK 80.31 ± 0.46 73.30 ± 0.82 - 87.44 ± 2.72 - 78.04 ± 0.39 41.27 ± 0.18 66.96 ± 0.56
node2vec 54.89 ± 1.61 57.49 ± 3.57 - 72.63 ± 10.20 - - - -
sub2vec 52.84 ± 1.61 53.03 ± 5.55 - 61.05 ± 15.80 - 71.48 ± 0.41 36.68 ± 0.42 55.26 ± 1.54
graph2vec 73.22 ± 1.81 73.30 ± 2.05 - 83.15 ± 9.25 - 75.78 ± 1.03 47.86 ± 0.26 71.10 ± 0.54
InfoGraph 76.20 ± 1.06 74.44 ± 0.31 72.85 ± 1.78 89.01 ± 1.13 70.65 ± 1.13 82.50 ± 1.42 53.46 ± 1.03 73.03 ± 0.87
GroupIG 81.13 ± 0.73 74.74 ± 1.16 73.67 ± 1.26 89.82 ± 1.85 76.18 ± 0.74 90.55 ± 0.67 54.72 ± 0.60 72.62 ± 0.78
GraphCL 77.87 ± 0.41 74.39 ± 0.45 78.62 ± 0.40 86.60 ± 1.34 71.36 ± 1.15 89.53 ± 0.84 55.99 ± 0.28 71.14 ± 0.44
GroupCL 81.69 ± 0.30 75.04 ± 0.54 76.76 ± 1.20 91.67 ± 1.37 76.38 ± 0.33 90.89 ± 0.85 55.42 ± 0.22 73.52 ± 0.95
TABLE I: The results of the unsupervised learning experiment. We run 10-fold cross-validation and report the mean and standard deviation of the classification accuracy. The best performance is highlighted in bold. Underlined numbers indicate that the grouping method outperforms its corresponding non-grouping method.

3.6 GroupIG: InfoGraph with Group Contrast

In addition to the symmetric contrastive framework represented by GraphCL, our framework can be widely employed in any existing contrastive method for graph representation learning. To show its wide usability, we additionally apply it to InfoGraph [31] and propose Group InfoGraph (GroupIG). In InfoGraph, the representations of views $\alpha$ and $\beta$ are the graph-level embedding and the node-level embeddings, respectively; they are hence processed in an asymmetric manner. For view $\alpha$, we use the same GNN encoder and representor function as GroupCL. In particular, the multiple representations are obtained by

$$\big\{z_\alpha^{(i)}\big\}_{i=1}^{m} = R\big(f_n(G)\big). \quad (33)$$

For view $\beta$, we generate the node embeddings through $f_n$ and duplicate them $m$ times to obtain $m$ representations, mathematically given by

$$H_\beta^{(1)} = H_\beta^{(2)} = \cdots = H_\beta^{(m)} = f_n(G). \quad (34)$$

We compute the loss on $\big\{z_\alpha^{(i)}\big\}$ and $\big\{H_\beta^{(i)}\big\}$ by Equation (26) and back-propagate it to optimize the model.

4 Experimental Studies

In this section, we assess the effectiveness of our group contrastive learning framework on both graph unsupervised classification and graph transfer learning tasks. To illustrate the property of groups, we also investigate the correlation between different groups and visualize the content captured by different groups in Section 4.3. Furthermore, we study two different types of MI upper bound estimators in Section 4.4 and conduct comprehensive ablation studies on critical hyperparameters, including the number of groups $m$ and the weight of the diversity loss $\lambda$, in Section 4.5. Finally, we analyze the training complexity of the grouping and non-grouping methods in Section 4.6.

Our implementation is based on the PyTorch [25] and PyTorch Geometric [7] libraries. The Adam optimizer is adopted for model optimization. We set the number of groups to $m$, which means that we generate $m$ feature vectors for each graph example. The total embedding dimension $d'$ is shared by these two tasks, and the dimension of each representation is the total embedding dimension divided by the number of groups. We adopt Equation (26) as the objective function for our main results in Sections 4.1 and 4.2, since the correspondence exists between the same dimensions across the multiple representations generated by $R$. The objective function introduces one additional hyperparameter $\lambda$, whose value is chosen from $\{0.1, 0.3, 0.5, 0.7, 0.9\}$.

Dataset BBBP Tox21 ToxCast SIDER ClinTox MUV HIV BACE
No Pre-Train 65.8 ± 4.5 74.0 ± 0.8 63.4 ± 0.6 57.3 ± 1.6 58.0 ± 4.4 71.8 ± 2.5 75.3 ± 1.9 70.1 ± 5.4
Infomax 68.8 ± 0.8 75.3 ± 0.5 62.7 ± 0.4 58.4 ± 0.8 69.9 ± 3.0 75.3 ± 2.5 76.0 ± 0.7 75.9 ± 1.6
EdgePred 67.3 ± 2.4 76.0 ± 0.6 64.1 ± 0.6 60.4 ± 0.7 64.1 ± 3.7 74.1 ± 2.1 76.3 ± 1.0 79.9 ± 0.9
AttrMasking 64.3 ± 2.8 76.7 ± 0.4 64.2 ± 0.5 61.0 ± 0.7 71.8 ± 4.1 74.7 ± 1.4 77.2 ± 1.1 79.3 ± 1.6
ContextPred 68.0 ± 2.0 75.7 ± 0.7 63.9 ± 0.6 60.9 ± 0.6 65.9 ± 3.8 75.8 ± 1.7 77.3 ± 1.0 79.6 ± 1.2
GraphCL 69.7 ± 0.7 73.9 ± 0.7 62.4 ± 0.6 60.5 ± 0.9 76.0 ± 2.7 69.8 ± 2.7 78.5 ± 1.2 75.4 ± 1.4
GroupCL 71.04 ± 1.25 75.47 ± 0.40 62.66 ± 0.95 61.48 ± 0.90 80.90 ± 2.86 73.22 ± 2.25 76.68 ± 1.17 80.95 ± 1.88
TABLE II: The results of the transfer learning experiment. We report the mean and standard deviation of the ROC-AUC scores over repeated runs. The best performance is highlighted in bold. Underlined numbers indicate that the grouping method outperforms its corresponding non-grouping method.

4.1 Unsupervised Learning

We evaluate our proposed framework on unsupervised graph classification tasks. Following the learning procedure in [46], we first train the GNN model in a self-supervised fashion. The trained model is then fixed and used to generate graph embeddings during the evaluation phase. Finally, an SVM is adopted to classify the embeddings into different categories.

Datasets and baselines. In this experiment, two types of datasets are considered, namely biochemical molecules and social networks; their statistics are summarized in Table VII. We compare our GroupCL and GroupIG with the original approaches as baselines, i.e., GraphCL and InfoGraph, respectively. Additionally, we compare our grouping methods with six other SOTA graph unsupervised learning methods, including the graphlet kernel (GL), the Weisfeiler-Lehman sub-tree kernel (WL), the deep graph kernel (DGK), node2vec [10], sub2vec [1], and graph2vec [22].

Experimental configurations. For the node encoder $f_n$, we build a three-layer graph isomorphism network (GIN) [43]. For a fair comparison, the total embedding dimensions are set to the same number for both the grouping methods and their baselines. The Adam algorithm with a learning rate of 0.001 is used to optimize the GNN model. We follow GraphCL [46] in choosing the number of epochs, the batch size, and the C parameter of the SVM. We closely follow the evaluation settings adopted by the previous state-of-the-art approaches [46, 12]. In particular, we split each dataset into training, test, and validation sets and report the mean classification accuracy with standard deviation after 5 runs of a linear SVM classifier. The SVM is trained using cross-validation on the training folds of the data, and the model for testing is selected by the best validation performance.

Results. Table I shows the results of the unsupervised learning experiments. The numbers of the baseline methods are taken from GraphCL [46]. Numbers are highlighted with underlines when the grouping methods obtain better performance than the original ones. GroupIG outperforms InfoGraph on seven out of eight datasets, and GroupCL achieves better performance than GraphCL on six out of eight datasets. On some datasets, our grouping methods outperform the original versions by a large margin. For example, GroupIG obtains about 4.9%, 5.5%, and 8.1% higher accuracy than InfoGraph on NCI1, COLLAB, and RDT-B, respectively. GroupCL achieves about 3.8%, 5.1%, and 5.0% higher accuracy than GraphCL on NCI1, MUTAG, and COLLAB, respectively. These results substantially demonstrate the effectiveness of the grouping technique. We then compare our methods with the SOTA performance achieved by previous methods; the best results among all methods are in bold. GroupCL achieves new SOTA accuracy on six out of eight datasets and the second-best accuracy on the remaining two. Moreover, our method outperforms the state-of-the-art approaches by significant margins on some datasets, e.g., about 5.0% on COLLAB and 2.7% on MUTAG. To sum up, our grouping methods outperform the previous single-embedding ones consistently and significantly. We further verify this with an ablation study on the number of groups in Section 4.5.

4.2 Transfer Learning

Furthermore, we conduct an evaluation with transfer learning on molecular chemical property prediction tasks. Specifically, we pre-train the model on a large-scale unlabeled dataset, fine-tune and evaluate the model on multiple different downstream datasets. The goal is to evaluate the transferability of models trained under different pre-training schemes.

Datasets and baselines. We adopt ZINC-2M for pre-training, which contains 2 million unlabeled molecules sampled from the ZINC15 database [30]. For downstream tasks, we include eight binary classification datasets provided by MoleculeNet [40]. Detailed information is given in Table VIII. For baseline methods, we include the results of various pre-training strategies, including Infomax, EdgePred, AttrMasking, and ContextPred, provided in [14], as well as results without pre-training.

Experimental configurations. We adopt the same GNN encoder in this experiment as the one used in the unsupervised learning experiments. Following [14], the batch size and the number of training epochs are set separately for pre-training and fine-tuning, and the learning rate is fixed for both stages. The same data splitting method used in unsupervised learning is applied to the transfer learning evaluation. We repeat the cross-validation experiments and report the mean and standard deviation of the ROC-AUC scores in Table II.

Results. Table II shows the comparison with the baselines. Our method achieves state-of-the-art performance on BBBP, SIDER, ClinTox, and BACE. It is worth noting that our method is about 4.9% higher than the previous best performance on ClinTox. In this task, the best performance per dataset is distributed over the baseline methods in a decentralized manner, as the downstream tasks differ significantly. However, our method, in general, achieves the best performance on more datasets than previous methods do: the state-of-the-art performance on four datasets is achieved by our method, that on two datasets by AttrMasking, and that on the other two datasets by ContextPred and GraphCL, respectively. Regarding the comparison with our direct baseline GraphCL, our method obtains better results on seven out of eight datasets. This comparison indicates the effectiveness of our proposed framework.

Fig. 3: The cosine distances between queries on the PROTEINS dataset (left) and the MUTAG dataset (right) in the unsupervised learning task.

4.3 Study of Groups

Our group contrastive learning framework performs contrastive learning under multiple measures, prompting each representation to capture one specific characteristic. To support this claim, we conduct a quantitative representation-correlation study and a qualitative representation-attention study.

Fig. 4: Visualization of attention weights on the BACE dataset in the transfer learning task. The first three rows exhibit the independent results of three representations, in which the nodes receiving large attention are highlighted by red, green, and cyan circles, respectively. The last row shows the attentions of all representations on one graph.
CLUB Estimators NCI1 MUTAG COLLAB IMDB-B
No CLUB 77.87 ± 0.41 86.60 ± 1.34 71.36 ± 1.15 71.14 ± 0.44
Parameterized 79.87 ± 0.59 90.63 ± 1.07 75.18 ± 0.88 73.38 ± 0.73
Non-parameterized 81.69 ± 0.30 91.67 ± 1.37 76.38 ± 0.33 73.52 ± 0.95
TABLE III: Ablation study on the MI upper bound estimator. We run 10-fold cross-validation and report the mean and standard deviation of the classification accuracy. The better performance is highlighted in bold.

Group correlation. We investigate the correlation of representations across different groups to demonstrate the diversity among them. We conduct this study on the PROTEINS and MUTAG datasets in the unsupervised learning experiment, using the same number of groups as in the main experiments. As introduced in Section 3.5, the groups of representations are generated by applying attention with different query vectors. Hence, we compute the cosine distance between two queries to measure the correlation strength between each pair of groups. Figure 3 shows the results as a matrix, in which the diagonal elements are the cosine distances between identical queries and the off-diagonal elements are those between different queries. We observe that the similarity between any two different queries is quite small, i.e., their cosine distance is large, implying that the correlation between any two different groups is weak.

Attention maps of different groups. We study the attention weights to reveal what kinds of characteristics the representations across different groups encode. Specifically, we find the nodes that receive the largest attention weights for each representation and visualize the graph that those nodes belong to. We conduct this study on the BACE dataset in the transfer learning experiment, in which the number of groups is set to 3. Figure 4 demonstrates the attention of the three representations independently, as well as that of all three representations on one graph. Nodes with larger attention weights are highlighted by red, green, and cyan circles, respectively, for the three groups. The three rows on top show that one representation consistently focuses on the same nodes or motifs among different molecule instances and that different representations capture information of different substructures. These results support our claim that representations learned with our framework encode diverse characteristics of given graphs. Putting the visualization results of all representations together and studying more examples, the bottom row in Figure 4 indicates that one representation is not trivially limited to a single type of node or substructure. In particular, when the corresponding type of node does not appear in a graph, a representation is also able to capture nodes in similar substructures.

4.4 MI Upper Bound Estimators

We study the performance of the two MI upper bound estimators introduced in Section 3.3: the non-parameterized CLUB and the parameterized CLUB. We conduct this experiment on two small datasets, MUTAG and IMDB-B, and two large datasets, NCI1 and COLLAB, in the unsupervised learning task. Table III shows the results, in which we observe that the non-parameterized CLUB outperforms the parameterized one on all four datasets with our developed representor function. Two potential causes lead to these results. First, the correspondence exists among the same dimensions across the different representations generated by our representor function; the non-parameterized CLUB is suited to this case and can well estimate the upper bound of MI. Second, compared to the non-parameterized estimator, the parameterized CLUB introduces additional complexity to the model, which makes the learning process more difficult.

However, models optimized with the parameterized CLUB still consistently outperform the baseline methods without group contrast. In cases where the correspondence does not exist among the same dimensions across different groups and the non-parameterized CLUB becomes inapplicable, one can still adopt the parameterized estimator, which significantly improves over the baseline performance.

m PROTEINS MUTAG DD IMDB-B
1 73.64 ± 0.81 90.85 ± 1.84 73.92 ± 0.42 71.28 ± 0.74
2 73.98 ± 0.50 90.20 ± 1.69 74.98 ± 0.99 71.90 ± 1.64
3 74.25 ± 0.39 89.68 ± 0.99 74.84 ± 0.87 71.98 ± 0.88
4 74.67 ± 0.74 91.67 ± 1.37 76.76 ± 1.20 73.52 ± 0.95
5 75.04 ± 0.54 90.08 ± 2.52 75.87 ± 0.45 70.64 ± 2.11
TABLE IV: Ablation study on the number of groups $m$. We run 10-fold cross-validation and report the mean and standard deviation of the classification accuracy. The best performance is highlighted in bold.
λ PROTEINS MUTAG DD IMDB-B
0.0 73.03 ± 1.05 90.52 ± 1.88 74.82 ± 0.65 71.96 ± 1.23
0.1 74.64 ± 1.21 91.37 ± 1.44 75.19 ± 0.37 72.32 ± 1.35
0.3 74.16 ± 0.97 91.37 ± 1.03 75.43 ± 0.40 72.53 ± 0.99
0.5 74.59 ± 0.47 91.67 ± 1.37 76.76 ± 1.20 73.52 ± 0.95
0.7 75.04 ± 0.54 91.47 ± 1.12 75.63 ± 0.44 72.42 ± 1.23
0.9 74.32 ± 0.70 91.37 ± 1.02 75.82 ± 0.27 71.95 ± 1.20
TABLE V: Ablation study on the diversity loss weight $\lambda$. We run 10-fold cross-validation and report the mean and standard deviation of the classification accuracy. The best performance is highlighted in bold.

4.5 Ablation Studies

We conduct ablation studies to explore the effects of critical hyperparameters, including the number of groups $m$ and the weight $\lambda$ that balances the influence between the intra-space objective and the inter-space objective. For efficiency, we consider four of the smaller datasets among the eight. The ablation studies are conducted in the unsupervised learning setting.

Number of groups $m$. Compared to prior work, the main departure made by our approach is to learn multiple graph-level embeddings. Therefore, the number of groups $m$ is a vital parameter for demonstrating the effectiveness of our approach. Through the study of $m$, we want to find the difference between the results of the single-group and multi-group settings, as well as the proper number of groups in the multi-group case. To this end, we set $m \in \{1, 2, 3, 4, 5\}$. Table IV gives the results. We achieve the best performance on PROTEINS when $m = 5$ and on the other three datasets when $m = 4$. Note that in most cases the performance of $m = 1$ is worse than that of $m > 1$, which demonstrates the effectiveness of the grouping scheme.

Weight of diversity loss $\lambda$. There are two components in the objective employed to optimize our model: the intra-space terms and the inter-space terms. To balance their effects, we employ a hyperparameter $\lambda$ to weight the inter-space objective. We compare values of $\lambda$ varying from 0.0 to 0.9. Note that $\lambda = 0$ indicates that no diversity is enforced among groups; we hence let $\lambda = 0$ serve as the control experiment, which justifies the necessity of group diversity. The results are shown in Table V. We observe that the results are consistently worse when $\lambda = 0$, which demonstrates that the diversity of groups is crucial for boosting performance. For the four datasets, the best performance is obtained when $\lambda$ equals 0.7, 0.5, 0.5, and 0.5, respectively.

Methods GroupCL GraphCL
No. of Params 22,800 51,200
TABLE VI: Comparison of the number of parameters. The count excludes the parameters in the GNN, which is shared by GraphCL and GroupCL.

4.6 Complexity Study

Taking GroupCL as an example, we analyze the training complexity of our grouping framework and compare it with the non-grouping method GraphCL. According to Figure 1, GroupCL and GraphCL share the same GNN, so we only need to examine the complexity of the operations after the GNN, i.e., the representor function and the projection head. GraphCL sets the dimension of the node embedding to 160 and that of the graph embedding to 160 as well, and we follow this setting. For the hyperparameters involved in GroupCL, we set the number of groups to 4 and the key dimension to 100. Thus the Query matrix is of size $4 \times 100$, $W_K$ is of size $160 \times 100$, and $W_V$ is of size $160 \times 40$, giving $400 + 16{,}000 + 6{,}400 = 22{,}800$ parameters. The projection head of GraphCL is a neural network of two fully connected layers, each of which has 160 units, so the number of parameters is $2 \times 160 \times 160 = 51{,}200$. Table VI summarizes the results, and we can observe that GroupCL has fewer parameters than GraphCL.
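As a quick check, the totals in Table VI can be reproduced from these dimensions; note that the embedding dimension of 160, the key dimension of 100, and $m = 4$ are reconstructed from the reported totals and should be read as our inference.

```python
# Verifying the parameter counts in Table VI from the reconstructed dimensions.
m, d, d_key, d_val = 4, 160, 100, 160 // 4   # groups, embed dim, key dim, value dim
groupcl = m * d_key + d * d_key + d * d_val  # Query (4x100) + W_K (160x100) + W_V (160x40)
graphcl = 2 * d * d                          # two 160x160 fully connected layers
print(groupcl, graphcl)                      # -> 22800 51200
```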

Datasets Type #Graphs #Graph Classes #Avg. Nodes #Avg. Edges #Node Classes
NCI1 Chemistry 4110 2 29.9 32.3 37
Proteins Chemistry 1113 2 39.1 72.8 3
Mutag Chemistry 188 2 17.9 19.8 7
DD Chemistry 1178 2 284.3 725.7 89
COLLAB Social 5000 3 74.5 2457.8 -
IMDB-B Social 1000 2 19.8 96.5 -
Reddit-B Social 2000 2 429.6 497.8 -
Reddit-5K Social 4999 5 508.8 594.9 -
TABLE VII: The statistics of the datasets used in the unsupervised learning experiment.
Datasets Type #Graphs #Graph Classes #Avg. Nodes #Avg. Edges #Binary prediction tasks
ZINC-2M Molecule, Pre-training 2,000,000 - 25.5 27.5 1
BBBP Molecule, Finetuning 2039 2 24.1 51.9 1
Tox21 Molecule, Finetuning 7831 2 18.6 38.6 12
ToxCast Molecule, Finetuning 8575 2 18.8 18.8 617
SIDER Molecule, Finetuning 1427 2 33.6 70.7 27
Clintox Molecule, Finetuning 1478 2 26.1 55.7 2
MUV Molecule, Finetuning 93087 2 24.2 52.6 17
HIV Molecule, Finetuning 41127 2 25.5 54.9 1
BACE Molecule, Finetuning 1513 2 34.1 73.7 1
TABLE VIII: The statistics of the datasets used in the transfer learning experiment.

5 Conclusions and Outlook

In this work, we have proposed a group contrastive learning framework for unsupervised graph representation learning. Our framework is more powerful than most previous contrastive learning methods, since it contrasts multiple representations in various subspaces, thereby enabling these representations to encode abundant characteristics of graphs. To learn informative and diverse representations, we have developed two principled objectives regarding the intra-space and the inter-space relations. We have further proposed an attention-based representor function to incorporate into our framework; the representations it generates are capable of encoding informative and diverse graph substructures. Built on our framework, we have extended two prior non-grouping methods into GroupCL and GroupIG. To verify their effectiveness, we have conducted thorough experiments on graph unsupervised learning and transfer learning tasks. The quantitative results demonstrate that our methods achieve new state-of-the-art performance on a majority of datasets and outperform the non-grouping methods clearly and consistently. In addition, we have visualized the attention weights of each node, which are learned by our representor function; the qualitative results illustrate that these representations focus on important and diverse graph substructures.

Currently, we consider one type of representor function, inserted after the GNN, to generate multiple representations. In the future, we plan to explore more ways of generating multiple representations and to consider different insertion positions.

References

  • [1] B. Adhikari, Y. Zhang, N. Ramakrishnan, and B. A. Prakash (2018) Sub2vec: feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 170–182. Cited by: §4.1.
  • [2] B. Chen and W. Deng (2019) Hybrid-attention based decoupled metric learning for zero-shot image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2750–2759. Cited by: §1.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. Cited by: §1.
  • [4] P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin (2020) Club: a contrastive log-ratio upper bound of mutual information. In International Conference on Machine Learning, pp. 1779–1788. Cited by: §3.3, §3.3.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, Cited by: §1.
  • [6] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §1.
  • [7] M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In International Conference on Learning Representations Workshop, Cited by: §4.
  • [8] H. Gao and S. Ji (2019) Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 741–749. Cited by: §1.
  • [9] H. Gao, Z. Wang, and S. Ji (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1416–1424. Cited by: §1.
  • [10] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Cited by: §4.1.
  • [11] W. L. Hamilton (2020) Graph representation learning. Synthesis Lectures on Artifical Intelligence and Machine Learning 14 (3), pp. 1–159. Cited by: §1.
  • [12] K. Hassani and A. H. Khasahmadi (2020) Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, pp. 4116–4126. Cited by: §1, §4.1.
  • [13] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, Cited by: §1.
  • [14] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2020) Strategies for pre-training graph neural networks. In International Conference on Learning Representations, Cited by: §4.2, §4.2.
  • [15] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu (2020) Sub-graph contrast for scalable self-supervised graph representation learning. In IEEE International Conference on Data Mining, pp. 222–231. Cited by: §1.
  • [16] S. Kim, J. Yoon, J. Yang, and S. Park (2010) Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics 11 (1), pp. 1–21. Cited by: §1.
  • [17] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision, pp. 736–751. Cited by: §1.
  • [18] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. In Advances in Neural Information Processing Systems Workshops, Cited by: §1.
  • [19] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §1, §2.2.
  • [20] C. H. Lampert, H. Nickisch, and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. Cited by: §1.
  • [21] Y. Liu, H. Yuan, L. Cai, and S. Ji (2020) Deep learning of high-order interactions for protein interface prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 679–687. Cited by: §1.
  • [22] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §4.1.
  • [23] S. Nowozin, B. Cseke, and R. Tomioka (2016) F-gan: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, Cited by: §3.2.
  • [24] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 276–290. Cited by: §1.
  • [25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.
  • [26] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §1.
  • [27] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang (2020) Graph representation learning via graphical mutual information maximization. In Proceedings of the Web Conference, pp. 259–270. Cited by: §1.
  • [28] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang (2020) Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [29] E. Shechtman and M. Irani (2007) Matching local self-similarities across images and videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1.
  • [30] T. Sterling and J. J. Irwin (2015) ZINC 15–ligand discovery for everyone. Journal of Chemical Information and Modeling 55 (11), pp. 2324–2337. Cited by: §4.2.
  • [31] F. Sun, J. Hoffmann, V. Verma, and J. Tang (2020) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations, Cited by: §1, §1, §3.6.
  • [32] S. Thakoor, C. Tallec, M. G. Azar, R. Munos, P. Veličković, and M. Valko (2021) Bootstrapped representation learning on graphs. arXiv preprint arXiv:2102.06514. Cited by: §1, §1.
  • [33] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2020) On mutual information maximization for representation learning. In International Conference on Learning Representations, Cited by: §1.
  • [34] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, Cited by: §1, §2.2.
  • [35] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. In International Conference on Learning Representations, Cited by: §1.
  • [36] C. Wang, S. Pan, G. Long, X. Zhu, and J. Jiang (2017) MGAE: marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 889–898. Cited by: §1.
  • [37] H. Wang, X. Wang, W. Xiong, M. Yu, X. Guo, S. Chang, and W. Y. Wang (2019) Self-supervised learning for contextualized extractive summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §1.
  • [38] X. Wang and G. Qi (2021) Contrastive learning with stronger augmentations. arXiv preprint arXiv:2104.07713. Cited by: §1.
  • [39] Z. Wang and S. Ji (2020) Second-order pooling for graph neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [40] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical Science 9 (2), pp. 513–530. Cited by: §4.2.
  • [41] Y. Xie, Z. Wang, and S. Ji (2020) Noise2same: optimizing a self-supervised bound for image denoising. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [42] Y. Xie, Z. Xu, J. Zhang, Z. Wang, and S. Ji (2021) Self-supervised learning of graph neural networks: a unified review. arXiv preprint arXiv:2102.10757. Cited by: §3.2.
  • [43] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, Cited by: §2.2, §4.1.
  • [44] X. Xu, Z. Wang, C. Deng, H. Yuan, and S. Ji (2020) Towards improved and interpretable deep metric learning via attentive grouping. arXiv preprint arXiv:2011.08877. Cited by: §1.
  • [45] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1.
  • [46] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2020) Graph contrastive learning with augmentations. In Advances in Neural Information Processing Systems, Vol. 33. Cited by: §1, §1, §4.1, §4.1, §4.1.
  • [47] L. Yu, Y. Zhang, I. Gutman, Y. Shi, and M. Dehmer (2017) Protein sequence comparison based on physicochemical properties and the position-feature energy matrix. Scientific Reports 7 (1), pp. 1–9. Cited by: §1.
  • [48] H. Yuan and S. Ji (2020) Structpool: structured graph pooling via conditional random fields. In International Conference on Learning Representations, Cited by: §1.
  • [49] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang (2020) Deep graph contrastive representation learning. In International Conference on Machine Learning Workshops, Cited by: §1, §1.