
Iterative Graph Self-Distillation

How to discriminatively vectorize graphs is a fundamental challenge that has attracted increasing attention in recent years. Inspired by the recent success of unsupervised contrastive learning, we aim to learn graph-level representations in an unsupervised manner. Specifically, we propose a novel unsupervised graph learning paradigm called Iterative Graph Self-Distillation (IGSD), which iteratively performs teacher-student distillation with graph augmentations. Different from conventional knowledge distillation, IGSD constructs the teacher as an exponential moving average of the student model and distills knowledge from itself. The intuition behind IGSD is to predict the teacher network's representation of graph pairs under different augmented views. As a natural extension, we also apply IGSD to semi-supervised scenarios by jointly regularizing the network with both supervised and unsupervised contrastive losses. Finally, we show that fine-tuning the IGSD-trained models with self-training can further improve the graph representation power. Empirically, we achieve significant and consistent performance gains on various graph datasets in both unsupervised and semi-supervised settings, which validates the effectiveness of IGSD.





1 Introduction

Graphs are ubiquitous representations encoding relational structures across various domains. Learning low-dimensional vector representations of graphs is critical in fields ranging from social science (Newman and Girvan, 2004) to bioinformatics (Duvenaud et al., 2015; Zhou et al., 2020). Many graph neural networks (GNNs) (Gilmer et al., 2017; Kipf and Welling, 2016; Xu et al., 2018) have been proposed to learn node and graph representations by aggregating information from every node's neighbors via non-linear transformation and aggregation functions. However, the key limitation of existing GNN architectures is that they often require a huge amount of labeled data to be competitive, while annotating graphs such as drug-target interaction networks is challenging since it requires domain-specific expertise. Therefore, unsupervised learning on graphs has long been studied, for example through graph kernels (Shervashidze et al., 2011) and matrix-factorization approaches (Belkin and Niyogi, 2002).

Inspired by the recent success of unsupervised representation learning in various domains like images (Chen et al., 2020a; He et al., 2020) and texts (Radford et al., 2018), most related works in the graph domain either follow the pipeline of unsupervised pretraining (followed by fine-tuning) or the InfoMax principle (Hjelm et al., 2018). The former often needs meticulous designs of pretext tasks (Hu et al., 2019; You et al., 2020), while the latter is dominant in unsupervised graph representation learning and trains encoders to maximize the mutual information (MI) between the representations of the global graph and local patches (such as subgraphs) (Veličković et al., 2018; Sun et al., 2019; Hassani and Khasahmadi, 2020). However, MI-based approaches usually need to sample subgraphs as local views to contrast with global graphs. They also usually require an additional discriminator for scoring local-global pairs and negative samples, which is computationally prohibitive (Tschannen et al., 2019). Besides, their performance is very sensitive to the choice of encoders and MI estimators (Tschannen et al., 2019). Moreover, MI-based approaches cannot be easily extended to the semi-supervised setting, since local subgraphs lack labels that can be utilized for training. Therefore, we seek an approach that learns the entire graph representation by contrasting whole graphs directly, without the need for MI estimation, a discriminator, or subgraph sampling.

Motivated by recent progress on contrastive learning, we propose Iterative Graph Self-Distillation (IGSD), a teacher-student framework to learn graph representations by contrasting graph instances directly. The high-level idea of IGSD is based on graph contrastive learning, where we pull similar graphs together and push dissimilar graphs apart. However, the performance of conventional contrastive learning largely depends on how negative samples are selected. To learn discriminative representations and avoid collapsing to trivial solutions, a large set of negative samples (He et al., 2020; Chen et al., 2020a) or a special mining strategy (Schroff et al., 2015) is necessary. To alleviate the dependency on negative sample mining while still learning discriminative graph representations, we propose to use self-distillation as a strong regularization to guide graph representation learning.

In the IGSD framework, graph instances are augmented as several views to be encoded and projected into a latent space where we define a similarity metric for consistency-based training. The parameters of the teacher network are iteratively updated as an exponential moving average of the student network parameters, allowing knowledge transfer between them. As only a small amount of labeled data is often available in many real-world applications, we further extend our model to the semi-supervised setting such that we can effectively utilize graph-level labels while considering arbitrary numbers of positive pairs belonging to the same class. Moreover, in order to leverage the information from pseudo-labels with high confidence, we develop a self-training algorithm based on the supervised contrastive loss to fine-tune the encoder.

We experiment with real-world datasets in various scales and compare the performance of IGSD with state-of-the-art graph representation learning methods. Experimental results show that IGSD achieves competitive performance in both unsupervised and semi-supervised settings with different encoders and data augmentation choices. With the help of self-training, our performance can exceed state-of-the-art baselines by a large margin.

To summarize, we make the following contributions in this paper:


  • We propose a self-distillation framework called IGSD for unsupervised graph-level representation learning where the teacher-student distillation is performed for contrasting graph pairs under different augmented views.

  • We further extend IGSD to the semi-supervised scenario, where the labeled data are utilized effectively with the supervised contrastive loss and self-training.

  • We empirically show that IGSD surpasses state-of-the-art methods in semi-supervised graph classification and molecular property prediction tasks and achieves performance competitive with state-of-the-art approaches in unsupervised graph classification tasks.

2 Preliminaries

2.1 Formulation

Unsupervised Graph Representation Learning

Given a set of unlabeled graphs, we aim to learn a low-dimensional representation of every graph that is favorable for downstream tasks such as graph classification.

Semi-supervised Graph Representation Learning

Consider a whole dataset composed of labeled data and unlabeled data (usually with far more unlabeled than labeled graphs). Our goal is to learn a model that can make predictions on graph labels for unseen graphs. With augmentations, we obtain augmented versions of both the labeled and the unlabeled sets as our training data.

2.2 Graph Representation Learning

We represent a graph instance by its node set and its edge set. The dominant approaches to graph representation learning are graph neural networks with neural message passing mechanisms (Hamilton et al., 2017): for every node, the node representation is iteratively computed from the features of its neighbor nodes using a differentiable aggregation function. Specifically, at each iteration we get the node embedding as:


Then the graph-level representations can be attained by aggregating all node representations using a readout function like summation or set2set pooling (Vinyals et al., 2015).
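As a rough illustration of the two steps described above (neighborhood aggregation followed by a readout), the following sketch runs one message passing round on a toy graph. It is not the encoder used in the paper: the function names `message_passing_step` and `sum_readout`, the parameter-free sum aggregation, and the ReLU non-linearity are all assumptions chosen for brevity, whereas real GNN layers apply learned transformations.

```python
def message_passing_step(features, neighbors):
    """One round: each node sums its own vector with its neighbors' vectors, then applies ReLU.

    Assumption: no learned weights, purely to show the aggregation pattern.
    """
    updated = {}
    for node, vec in features.items():
        agg = list(vec)
        for nb in neighbors.get(node, []):
            agg = [a + b for a, b in zip(agg, features[nb])]
        updated[node] = [max(0.0, a) for a in agg]
    return updated


def sum_readout(features):
    """Graph-level representation: element-wise sum over all node vectors."""
    dim = len(next(iter(features.values())))
    return [sum(vec[d] for vec in features.values()) for d in range(dim)]


# Toy path graph 0 - 1 - 2 with 2-dimensional node features (hypothetical data).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

node_embeddings = message_passing_step(features, neighbors)
graph_vector = sum_readout(node_embeddings)
print(graph_vector)
```

Stacking several such rounds lets information travel further than one hop before the readout collapses the node embeddings into a single graph vector.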

2.3 Graph Data Augmentation

It has been shown that the learning performance of GNNs can be improved via graph diffusion, which serves as a homophily-based denoising filter on both features and edges in real graphs (Klicpera et al., 2019). The transformed graphs can also serve as effective augmented views in contrastive learning (Hassani and Khasahmadi, 2020). Inspired by that, we transform a graph, described by its transition matrix, via graph diffusion and sparsification into a new graph with adjacency matrix S, which serves as an augmented view in our framework. While there are many design choices for the diffusion coefficients, such as the heat kernel, we employ Personalized PageRank (PPR) due to its superior empirical performance (Hassani and Khasahmadi, 2020). Besides, we randomly remove edges of graphs to obtain corrupted graphs as augmented views, in order to validate the robustness of models to different augmentation choices.
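To make the diffusion-plus-sparsification step more concrete, here is a minimal sketch of PPR diffusion written as the truncated series S = sum_k alpha (1 - alpha)^k T^k, followed by threshold sparsification. Several things are assumptions rather than details from the text: the teleport probability `alpha=0.15`, the truncation length `num_terms`, the column-normalised random-walk transition matrix, the threshold `eps`, and all function names.

```python
def mat_mul(a, b):
    """Dense product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_matrix(adj):
    """Column-normalised random-walk transition matrix T = A D^{-1} (assumed normalisation)."""
    n = len(adj)
    deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / deg[j] if deg[j] else 0.0 for j in range(n)] for i in range(n)]


def ppr_diffusion(adj, alpha=0.15, num_terms=50):
    """Truncated PPR series: S = sum_{k < num_terms} alpha * (1 - alpha)^k * T^k."""
    n = len(adj)
    t = transition_matrix(adj)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # T^0
    s = [[0.0] * n for _ in range(n)]
    coeff = alpha
    for _ in range(num_terms):
        for i in range(n):
            for j in range(n):
                s[i][j] += coeff * power[i][j]
        power = mat_mul(power, t)
        coeff *= 1.0 - alpha
    return s


def sparsify(s, eps=1e-2):
    """Keep only entries of at least eps; smaller ones are set to zero."""
    return [[v if v >= eps else 0.0 for v in row] for row in s]


# Toy path graph 0 - 1 - 2 (hypothetical data).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
augmented_view = sparsify(ppr_diffusion(adjacency))
```

The diffused matrix is dense even for a sparse input graph, which is why a sparsification step follows before the result is used as the adjacency matrix of the augmented view.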

3 Iterative Graph Self-distillation

Intuitively, the goal of contrastive learning on graphs is to learn graph representations that are close in the metric space for positive pairs (graphs with the same labels) and far apart for negative pairs (graphs with different labels). To achieve this goal, IGSD employs teacher-student distillation to iteratively refine representations by contrasting latent representations embedded by two networks, and uses an additional predictor and an EMA update to avoid collapsing to trivial solutions. Overall, IGSD encourages the closeness of augmented views from the same graph instances while pushing apart the representations from different ones.

Figure 1: Overview of IGSD. Illustration of our framework in the case where we augment input graphs once, requiring only one forward pass. Blue and red arrows denote contrast on positive and negative pairs, respectively.

3.1 Iterative Graph Self-Distillation Framework

In IGSD, we introduce a teacher-student architecture comprising two networks of similar structure, built from an encoder, a projector and a predictor. The teacher network consists of an encoder and a projector, while the student network consists of an encoder, a projector and a predictor.

The overview of IGSD is illustrated in Figure 1. In IGSD, we first augment the original input graphs to get augmented view(s) and then feed them into the two encoders, respectively, to extract graph representations. The subsequent projectors transform the graph representations into higher-dimensional projections. To prevent collapsing into a trivial solution (Grill et al., 2020), a specialized predictor is used in the student network to obtain a prediction of the projection.

To contrast the two latents, we use the L2 norm in the latent space to approximate the semantic distance in the input space, and the consistency loss can be defined as the mean squared error between the normalized prediction and the normalized projection. By passing two instances symmetrically, we obtain the overall consistency loss:
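One way to read this description in code is sketched below: both vectors are L2-normalised, their squared distance is taken, and the loss is symmetrised by swapping the roles of the two inputs. The function names and argument order are assumptions, and the sketch sums over dimensions instead of averaging, which differs from a mean squared error only by a constant factor.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def consistency(prediction, projection):
    """Squared L2 distance between unit-normalised vectors (equals 2 - 2 * cosine similarity)."""
    p = l2_normalize(prediction)
    z = l2_normalize(projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def symmetric_consistency(pred_1, proj_2, pred_2, proj_1):
    """Symmetrised loss: student prediction of input 1 vs teacher projection of input 2, and vice versa."""
    return consistency(pred_1, proj_2) + consistency(pred_2, proj_1)


# Parallel vectors give 0, orthogonal vectors give 2, opposite vectors give 4.
print(consistency([1.0, 0.0], [2.0, 0.0]))
print(consistency([1.0, 0.0], [0.0, 3.0]))
```

Because of the normalisation, the loss depends only on the angle between prediction and projection, not on their magnitudes.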


With the consistency loss, the teacher network provides a regression target to train the student network, and its parameters are updated as an exponential moving average (EMA) of the student parameters after weights of the student model have been updated using gradient descent:


With the above iterative self-distillation procedure, we aggregate information by averaging model weights over the training steps instead of using the final weights directly.
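The EMA update described here can be sketched in a few lines. Parameters are represented as flat lists of floats for simplicity, and the decay value `0.99` and the name `ema_update` are assumptions, since the text does not state them at this point.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter a small step toward the matching student parameter.

    teacher <- decay * teacher + (1 - decay) * student
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_params, student_params)]


# Hypothetical parameters: the teacher slowly tracks a fixed student.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
for _ in range(500):
    # In training, a gradient step on the student would happen here first.
    teacher = ema_update(teacher, student)
print(teacher)
```

A decay close to 1 makes the teacher a slowly moving average of past student weights, which is what gives the student a stable regression target.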

3.2 Unsupervised Learning with IGSD

In IGSD, to contrast the anchor with other graph instances (i.e. negative samples), we employ the following unsupervised InfoNCE objective (Oord et al., 2018):
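For readers unfamiliar with InfoNCE, a generic form of the objective for a single anchor is sketched below. This is the standard formulation and not necessarily the exact variant used in IGSD: the cosine similarity, the temperature value `0.5`, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Negative log-probability of picking the positive among positive + negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
aligned_loss = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0]])     # positive matches the anchor
misaligned_loss = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0]])  # a negative matches instead
print(aligned_loss < misaligned_loss)
```

In a batch-wise setting, the negatives for an anchor are typically the other graphs in the same batch, so the number of negatives grows with the batch size.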


At inference time, as semantic interpolations on samples, labels and latents are effective in obtaining better representations and can improve learning performance greatly (Zhang et al., 2017; Verma et al., 2019; Berthelot et al., 2019), we obtain the graph representation by interpolating the two latent representations with the Mixup function:
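A Mixup-style interpolation of two latent vectors is a convex combination, sketched below. The weighting coefficient 0.5 is the value reported in the experimental setup; which two latents are mixed (here named `z_student` and `z_teacher`) and the function name `mixup` are assumptions.

```python
def mixup(z_a, z_b, lam=0.5):
    """Convex combination lam * z_a + (1 - lam) * z_b of two equally sized vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(z_a, z_b)]


# Hypothetical latents produced by the two networks for the same graph.
z_student = [1.0, 3.0]
z_teacher = [3.0, 1.0]
graph_representation = mixup(z_student, z_teacher)
print(graph_representation)
```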


3.3 Semi-supervised Learning with IGSD

To bridge the gap between unsupervised pretraining and downstream tasks, we extend our model to the semi-supervised setting. In this scenario, it is straightforward to plug in the unsupervised loss as a regularizer for representation learning. However, the instance-wise supervision limited to standard supervised learning may lead to biased negative sampling problems (Chuang et al., 2020). To tackle this challenge, we can further use a small amount of labeled data to generalize the similarity loss to handle arbitrary numbers of positive samples belonging to the same class:


where denotes the total number of samples in the training set that have the same label as anchor . Thanks to the graph-level contrastive nature of IGSD, we are able to alleviate the biased negative sampling problems (Khosla et al., 2020) with supervised contrastive loss, which is crucial (Chuang et al., 2020) but unachievable in most MI-based contrastive learning models since subgraphs are generally hard to assign labels to. Besides, with this loss we are able to fine-tune our model effectively using self-training where pseudo-labels are assigned iteratively to unlabeled data.
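The idea of averaging the contrastive term over all positives that share the anchor's label can be sketched as follows, in the spirit of the supervised contrastive loss of Khosla et al. (2020). This is a generic batch-level version under stated assumptions: cosine similarity, temperature `0.5`, positives counted within the batch instead of the whole training set, and hypothetical function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def supcon_loss(embeddings, labels, temperature=0.5):
    """Average over anchors of the mean negative log-probability of their same-label samples."""
    n = len(embeddings)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = sum(
            math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(n) if k != i
        )
        loss_i = 0.0
        for j in positives:
            numer = math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            loss_i -= math.log(numer / denom)
        total += loss_i / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Two tight clusters (hypothetical embeddings).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(embeddings, [0, 0, 1, 1]))  # labels agree with the clusters
print(supcon_loss(embeddings, [0, 1, 0, 1]))  # labels cut across the clusters
```

The loss is lower when samples with the same label are already close, which is the signal that lets labels and, later, pseudo-labels shape the embedding space.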

With a standard supervised loss such as cross entropy or mean squared error, the overall objective can be summarized as:


Common semi-supervised learning methods use consistency regularization to measure the discrepancy between predictions made on perturbed unlabeled data points, in order to get better prediction stability and generalization performance (Oliver et al., 2018). By contrast, our method enforces consistency constraints between latents from different views, which acts as a regularizer for learning directly from labels.

Labeled data provides additional supervision about graph classes and alleviates biased negative sampling. However, it is costly to obtain in many areas. Therefore, we develop a contrastive self-training algorithm to leverage label information more effectively than cross entropy in the semi-supervised scenario. In the algorithm, we train the model using a small amount of labeled data and then fine-tune it by iterating between assigning pseudo-labels to unlabeled examples and training the model on the augmented dataset. In this way, we can harvest a large number of pseudo-labels for unlabeled examples.

With the increasing size of the augmented labeled dataset, the discriminative power of IGSD can be improved iteratively by contrasting more positive pairs belonging to the same class. In this way, we accumulate high-quality pseudo-labels after each iteration to compute the supervised contrastive loss in Eq. 6, which distinguishes our approach from conventional self-training algorithms (Rosenberg et al., 2005). By contrast, traditional self-training can use pseudo-labels only for computing cross entropy.
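The pseudo-labeling loop described above can be sketched generically as follows. To keep the example self-contained, a one-dimensional nearest-centroid classifier stands in for the IGSD encoder trained with the supervised contrastive loss, so the model, the data and all function names are assumptions. Only the confidence threshold 0.9 comes from the tuning grid reported in the experimental setup.

```python
import math


def fit_centroids(points, labels):
    """Stand-in 'training': compute one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to the class centroids."""
    scores = {y: math.exp(-(x - c) ** 2) for y, c in centroids.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.9, iterations=5):
    """Alternate between pseudo-labeling confident examples and refitting the model."""
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(iterations):
        model = fit_centroids(xs, ys)
        remaining = []
        for x in pool:
            probs = predict_proba(model, x)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                xs.append(x)   # accept the pseudo-label
                ys.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break  # nothing new was labeled, so further rounds would not change anything
        pool = remaining
    return fit_centroids(xs, ys), xs, ys, pool


model, xs, ys, still_unlabeled = self_train([0.0, 4.0], [0, 1], [0.5, 3.5, 2.0])
print(sorted(zip(xs, ys)), still_unlabeled)
```

The ambiguous point halfway between the two classes never reaches the threshold and stays unlabeled, which illustrates how the threshold trades pseudo-label quantity for quality.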

4 Experiments

4.1 Experimental Setup

Evaluation Tasks. We conduct experiments by comparing with state-of-the-art models on three tasks. In graph classification tasks, we experiment in both the unsupervised setting where we only have access to all unlabeled samples in the dataset and the semi-supervised setting where we use a small fraction of labeled examples and treat the rest as unlabeled ones by ignoring their labels. In molecular property prediction tasks where labels are expensive to obtain, we only consider the semi-supervised setting.

Datasets. For graph classification tasks, we employ several widely used benchmark graph kernel datasets (Kersting et al., 2016) for learning and evaluation: 3 bioinformatics datasets (MUTAG, PTC_MR, NCI1) and 3 social network datasets (COLLAB, IMDB-BINARY, IMDB-MULTI), with statistics summarized in Table 1.


In the semi-supervised graph regression tasks, we use the QM9 dataset containing 134,000 drug-like organic molecules (Ramakrishnan et al., 2014) and select the first ten physicochemical properties as regression targets for training and evaluation.


In the unsupervised graph classification, we compare with the following representative baselines: CMC-Graph (Hassani and Khasahmadi, 2020), InfoGraph (Sun et al., 2019), Graph2Vec (Narayanan et al., 2017) and Graph Kernels including Random Walk Kernel (Gärtner et al., 2003), Shortest Path Kernel (Kashima et al., 2003), Graphlet Kernel (Shervashidze et al., 2009), Weisfeiler-Lehman Sub-tree Kernel (WL SubTree) (Shervashidze et al., 2011), Deep Graph Kernels (Yanardag and Vishwanathan, 2015) and Multi-Scale Laplacian Kernel (MLG) (Kondor and Pan, 2016).

For semi-supervised graph classification, we compare our method with competitive baselines like InfoGraph, InfoGraph* and Mean Teachers; the GIN baseline does not have access to the unlabeled data. In the semi-supervised molecular property prediction tasks, baselines include InfoGraph, InfoGraph* and Mean Teachers (Tarvainen and Valpola, 2017).

Model Configuration. In our framework, we use GCNs (Kipf and Welling, 2016) and GINs (Xu et al., 2018) as encoders to obtain node representations for unsupervised and semi-supervised graph classification, respectively. For semi-supervised molecular property prediction tasks, we employ message passing neural networks (MPNNs) (Gilmer et al., 2017) as our backbone encoders to encode molecular graphs with rich edge attributes. All projectors and predictors are implemented as two-layer MLPs.

In semi-supervised molecular property prediction tasks, we generate multiple views based on edge attributes (bond types) of richly annotated molecular graphs to improve performance. Specifically, we perform label-preserving augmentation to obtain multiple diffusion matrices for every graph, each computed on one edge attribute while ignoring the others. Each diffusion matrix gives a denser graph based on one type of edge, which leverages edge features better. We train our models using different amounts of augmented training data and select the amount using cross validation.

For parameter tuning, we select the number of GCN layers over {2, 8, 12}, batch size over {16, 32, 64, 128, 256, 512}, number of epochs over {20, 40, 100} and learning rate over {1e-4, 1e-3} in unsupervised graph classification. The hyper-parameters we tune for semi-supervised graph classification and molecular property prediction are the same as in (Xu et al., 2018) and (Sun et al., 2019), respectively. In all experiments, we set the weighting coefficient of the Mixup function to 0.5 and tune our projection hidden size over {1024, 2048} and projection size over {256, 512}, since bigger projection heads are able to improve representation learning (Chen et al., 2020b). We start self-training after 30 epochs and tune the number of iterations over {20, 50} and the pseudo-labeling threshold over {0.9, 0.95}.

For unsupervised graph classification, we adopt LIB-SVM (Chang and Lin, 2011) with the C parameter selected from {1e-3, 1e-2, …, 1e2, 1e3} as our downstream classifier. We then use 10-fold cross-validation accuracy as the classification performance and repeat the experiments 5 times to report the mean and standard deviation. For semi-supervised graph classification, we randomly select a small fraction of the training data as labeled data, treat the rest as unlabeled, and report the best test set accuracy in 300 epochs. Following the experimental setup in (Sun et al., 2019), we randomly choose 5000, 10000, 10000 samples for training, validation and testing respectively, and the rest are treated as unlabeled training data for the molecular property prediction tasks.

| | | MUTAG | IMDB-B | IMDB-M | NCI1 | COLLAB | PTC_MR |
|---|---|---|---|---|---|---|---|
| Datasets | # graphs | 188 | 1000 | 1500 | 4110 | 5000 | 344 |
| | # classes | 2 | 2 | 3 | 2 | 3 | 2 |
| | Avg # nodes | 17.9 | 19.8 | 13.0 | 29.8 | 74.5 | 25.5 |
| Graph Kernels | Random Walk | 83.7 ± 1.5 | 50.7 ± 0.3 | 34.7 ± 0.2 | OMR | OMR | 57.9 ± 1.3 |
| | Shortest Path | 85.2 ± 2.4 | 55.6 ± 0.2 | 38.0 ± 0.3 | 51.3 ± 0.6 | 49.8 ± 1.2 | 58.2 ± 2.4 |
| | Graphlet Kernel | 81.7 ± 2.1 | 65.9 ± 1.0 | 43.9 ± 0.4 | 53.9 ± 0.4 | 56.3 ± 0.6 | 57.3 ± 1.4 |
| | WL subtree | 80.7 ± 3.0 | 72.3 ± 3.4 | 47.0 ± 0.5 | 55.1 ± 1.6 | 50.2 ± 0.9 | 58.0 ± 0.5 |
| | Deep Graph | 87.4 ± 2.7 | 67.0 ± 0.6 | 44.6 ± 0.5 | 54.5 ± 1.2 | 52.1 ± 1.0 | 60.1 ± 2.6 |
| | MLG | 87.9 ± 1.6 | 66.6 ± 0.3 | 41.2 ± 0.0 | >1 Day | >1 Day | 63.3 ± 1.5 |
| Unsupervised | Graph2Vec | 83.2 ± 9.6 | 71.1 ± 0.5 | 50.4 ± 0.9 | 73.2 ± 1.8 | 47.9 ± 0.3 | 60.2 ± 6.9 |
| | InfoGraph | 89.0 ± 1.1 | 74.2 ± 0.7 | 49.7 ± 0.5 | 73.8 ± 0.7 | 67.6 ± 1.2 | 61.7 ± 1.7 |
| | CMC-Graph | 89.7 ± 1.1 | 74.2 ± 0.7 | 51.2 ± 0.5 | 75.0 ± 0.7 | 68.9 ± 1.9 | 62.5 ± 1.7 |
| | Ours (Random) | 85.7 ± 2.1 | 71.6 ± 1.2 | 49.2 ± 0.6 | 75.1 ± 0.4 | 65.8 ± 1.0 | 57.6 ± 1.5 |
| | Ours | 90.2 ± 0.7 | 74.7 ± 0.6 | 51.5 ± 0.3 | 75.4 ± 0.3 | 70.4 ± 1.1 | 61.4 ± 1.7 |

Table 1: Graph classification accuracies (%) for kernels and unsupervised methods on 6 datasets. We report the mean and standard deviation of final results over five runs. '>1 Day' indicates that the computation exceeds 24 hours. 'OMR' means out of memory error.

| Method | | | | |
|---|---|---|---|---|
| Mean Teachers | 69.0 | 49.3 | 72.5 | 71.1 |
| InfoGraph* | 71.0 | 49.3 | 67.6 | 71.1 |
| GIN (Supervised Only) | 67.0 | 50.0 | 71.4 | 67.9 |
| Ours (Unsup) | 72.0 | 50.0 | 72.6 | 70.6 |
| Ours (SupCon) | 75.0 | 52.0 | 73.4 | 67.9 |
| GIN (Supervised Only)+self-training | 72.0 | 51.3 | 70.4 | 74.0 |
| Ours (Unsup)+self-training | 73.0 | 54.0 | 71 | 72.5 |
| Ours (SupCon)+self-training | 77.0 | 55.3 | 73.6 | 77.1 |

Table 2: Graph classification accuracies (%) of semi-supervised experiments on 4 datasets. We report the best results on the test set in 300 epochs.

4.2 Numerical Results

Results on unsupervised graph classification. We first present the results of the unsupervised setting in Table 1. All graph kernels give inferior performance except on the PTC dataset. The Random Walk kernel runs out of memory and the Multi-Scale Laplacian Kernel suffers from a long running time (exceeding 24 hours) on the two larger datasets. IGSD outperforms state-of-the-art baselines like InfoGraph and CMC-Graph on most datasets, showing that IGSD can learn expressive graph-level representations for downstream classifiers. Besides, our model still achieves competitive results on datasets like IMDB-M and NCI1 with the random edge dropping augmentation, which demonstrates the robustness of IGSD to different choices of data augmentation strategies.

Results on semi-supervised graph classification. We further apply our model to semi-supervised graph classification tasks, with results shown in Table 2, where we set the weights of the unsupervised loss and the supervised contrastive loss in Eq. 7 to 1 and 0 for Ours (Unsup), and to 0 and 1 for Ours (SupCon). In this setting, our model performs better than Mean Teachers and InfoGraph*. Both the unsupervised loss and the supervised contrastive loss provide extra performance gain compared with GIN using supervised data only. Besides, the performance of both can be improved significantly when combined with self-training, especially with the supervised contrastive loss. This makes empirical sense, since self-training iteratively assigns pseudo-labels with high confidence to unlabeled data, which provides extra supervision on their categories under the contrastive learning framework.

Results on semi-supervised molecular property prediction. We present the regression performance of our model on the QM9 dataset in Figure 2. We display the performance of our model and the baselines as the mean absolute error (MAE) ratio with respect to the supervised results. Our model outperforms the strong baselines InfoGraph, InfoGraph* and Mean Teachers in 9 out of 10 tasks. In some tasks like R2 (5), U0 (7) and U (8), IGSD achieves significant performance gains against its counterparts, which demonstrates its ability to transfer knowledge learned from unlabeled data to supervised tasks.

Figure 2: Semi-supervised molecular property prediction results in terms of mean absolute error (MAE) ratio. The histogram shows the error ratio of every semi-supervised model with respect to the supervised results (1.0). Lower scores are better, and a model outperforms the supervised baseline when the score is less than 1.0.

4.3 Ablation Studies and Analysis

Figure 3: Ablation studies. (a) Unsupervised learning performance with different batch sizes (different numbers of negative pairs); (b) semi-supervised graph classification accuracy with different proportions of labeled data.

Performance with self-training. We first investigate the effects of self-training on our model performance in Table 2. Results show that self-training can improve the GIN baseline and our models with the unsupervised loss (Unsup) or the supervised contrastive loss (SupCon). The improvement is even more significant when combined with the supervised contrastive loss, since high-quality pseudo-labels provide additional information about graph categories. Moreover, our self-training algorithm consistently outperforms the traditional self-training baseline, which further supports the effectiveness of our model.

Performance with different numbers of negative pairs. We then conduct ablation experiments on the number of negative pairs by varying the batch size over {16, 32, 64, 128}, with results on the IMDB-BINARY dataset shown in Figure 3(a). Both methods contrast negative pairs batch-wise, and increasing the batch size improves the performance of IGSD while degrading that of CMC-Graph. When the batch size is greater than 32, IGSD outperforms CMC-Graph and the performance gap becomes larger as the batch size increases.

Performance with different proportions of labeled data. We also investigate the performance of different models with different proportions of labeled data on the IMDB-BINARY dataset. As illustrated in Figure 3(b), IGSD consistently outperforms the strong InfoGraph* baseline across different amounts of labeled data. The performance gain is most significant when the fraction of labeled data is small, since our models can leverage labels more effectively by regularizing the original unsupervised learning objective when labels are scarce.

5 Related Work

Contrastive Learning Modern unsupervised learning in the form of contrastive learning can be categorized into two types: context-instance contrast and context-context contrast (Liu et al., 2020). The context-instance contrast, also called global-local contrast, focuses on modeling the belonging relationship between the local feature of a sample and its global context representation. Most unsupervised learning models on graphs like DGI (Veličković et al., 2018), InfoGraph (Sun et al., 2019) and CMC-Graph (Hassani and Khasahmadi, 2020) fall into this category, following the InfoMax principle to maximize the mutual information (MI) between the input and its representation. However, estimating MI is notoriously hard in MI-based contrastive learning, and in practice a tractable lower bound on this quantity is maximized instead. Moreover, maximizing tighter bounds on MI can result in worse representations without stronger inductive biases in sampling strategies, encoder architecture and parametrization of MI estimators (Tschannen et al., 2019). Besides, the intricacies of negative sampling in MI-based approaches pose key research challenges, such as an improper number of negative samples or biased negative sampling (Tschannen et al., 2019; Chuang et al., 2020). Another line of contrastive learning approaches, called context-context contrast, directly studies the relationships between the global representations of different samples, as metric learning does. For instance, the recently proposed model BYOL (Grill et al., 2020) bootstraps the representations of whole images directly. Focusing on global representations between samples and their corresponding augmented views also allows instance-level supervision to be incorporated naturally, for example by introducing the supervised contrastive loss (Khosla et al., 2020) into the framework for learning powerful representations.
Graph Contrastive Coding (GCC) (Qiu et al., 2020) pioneered the use of instance discrimination as the pretext task for structural information pre-training. However, our work is fundamentally different from theirs. GCC focuses on structural similarity, with InfoNCE as the learning objective, to find common and transferable structural patterns across different graph datasets, and its contrastive scheme is carried out through subgraph instance discrimination. By contrast, our model aims at learning graph-level representations by directly contrasting graph instances, such that data augmentation strategies and graph labels can be utilized naturally and effectively.

Knowledge Distillation Knowledge distillation (Hinton et al., 2015) is a method for transferring knowledge from one architecture to another, allowing model compression and the transfer of inductive biases. Self-distillation (Furlanello et al., 2018) is the special case in which the two architectures are identical; it can iteratively modify regularization and reduce over-fitting if performed for a suitable number of rounds (Mobahi et al., 2020). However, these methods often focus on closing the gap between the predictive results of the student and the teacher rather than defining a similarity loss in the latent space for contrastive learning.

Semi-supervised Learning Modern semi-supervised learning can be categorized into two kinds: multi-task learning and consistency training between two separate networks. Most widely used semi-supervised learning methods take the form of multi-task learning, combining a supervised objective on labeled data with an unsupervised objective on unlabeled data. By regularizing the learning process with unlabeled data, the decision boundary becomes more plausible. Another mainstream approach to semi-supervised learning introduces a student network and a teacher network and enforces consistency between them (Tarvainen and Valpola, 2017; Miyato et al., 2019; Lee, 2013). It has been shown that semi-supervised learning performance can be greatly improved via unsupervised pre-training of a (big) model, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge (Chen et al., 2020b). However, whether task-agnostic self-distillation would benefit semi-supervised learning is still underexplored.

6 Conclusions

In this paper, we propose IGSD, a novel unsupervised graph-level representation learning framework via self-distillation. Our framework iteratively performs teacher-student distillation by contrasting augmented views of graph instances. Experimental results in both unsupervised and semi-supervised settings show that IGSD not only learns effective graph representations that are competitive with state-of-the-art models but is also robust to the choice of encoders and augmentation strategies. In the future, we plan to apply our framework to other graph learning tasks and investigate the design of view generators to generate effective views automatically.


  • M. Belkin and P. Niyogi (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pp. 585–591. Cited by: §1.
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059. Cited by: §3.2.
  • C. Chang and C. Lin (2011)

    LIBSVM: a library for support vector machines

    ACM transactions on intelligent systems and technology (TIST) 2 (3), pp. 1–27. Cited by: §4.1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. External Links: 2002.05709 Cited by: §1, §1.
  • T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020b) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §4.1, §5.
  • C. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and S. Jegelka (2020) Debiased contrastive learning. External Links: 2007.00224 Cited by: §3.3, §5.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. External Links: 1805.04770 Cited by: §5.
  • T. Gärtner, P. Flach, and S. Wrobel (2003) On graph kernels: hardness results and efficient alternatives. In Learning theory and kernel machines, pp. 129–143. Cited by: §4.1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. Cited by: §1, §4.1.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent: a new approach to self-supervised learning. External Links: 2006.07733 Cited by: §3.1, §5.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §2.2.
  • K. Hassani and A. H. Khasahmadi (2020) Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582. Cited by: §A.1, §1, §2.3, §4.1, §5.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781728171685, Link, Document Cited by: §1, §1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §5.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §1.
  • W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2019) Strategies for pre-training graph neural networks. External Links: 1905.12265 Cited by: §1.
  • H. Kashima, K. Tsuda, and A. Inokuchi (2003) Marginalized kernels between labeled graphs. In Proceedings of the 20th international conference on machine learning (ICML-03), pp. 321–328. Cited by: §4.1.
  • K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann (2016) Benchmark data sets for graph kernels. External Links: Link Cited by: §4.1.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. External Links: 2004.11362 Cited by: §3.3, §5.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §A.1, §1, §4.1.
  • J. Klicpera, S. Weißenberger, and S. Günnemann (2019) Diffusion improves graph learning. External Links: 1911.05485 Cited by: §A.1, §2.3.
  • R. Kondor and H. Pan (2016) The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pp. 2990–2998. Cited by: §4.1.
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Cited by: §5.
  • X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang (2020) Self-supervised learning: generative or contrastive. External Links: 2006.08218 Cited by: §5.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2019) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993. External Links: ISSN 1939-3539, Link, Document Cited by: §5.
  • H. Mobahi, M. Farajtabar, and P. L. Bartlett (2020) Self-distillation amplifies regularization in hilbert space. arXiv preprint arXiv:2002.05715. Cited by: §5.
  • A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §4.1.
  • M. E. Newman and M. Girvan (2004) Finding and evaluating community structure in networks. Physical review E 69 (2), pp. 026113. Cited by: §1.
  • A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. External Links: 1804.09170 Cited by: §3.3.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.2.
  • J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang (2020) Gcc: graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1150–1160. Cited by: §5.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld (2014) Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 1 (1), pp. 1–7. Cited by: §A.1, §4.1.
  • C. Rosenberg, M. Hebert, and H. Schneiderman (2005) Semi-supervised self-training of object detection models. Cited by: §3.3.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1.
  • N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (9). Cited by: §A.1, §1, §4.1.
  • N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §4.1.
  • F. Sun, J. Hoffmann, V. Verma, and J. Tang (2019) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000. Cited by: §1, §4.1, §4.1, §4.1, §5.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. External Links: 1703.01780 Cited by: §4.1, §5.
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning?. arXiv preprint arXiv:2005.10243. Cited by: §A.1.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2019) On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625. Cited by: §1, §5.
  • P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §A.1, §1, §5.
  • V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio (2019) Manifold mixup: better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438–6447. Cited by: §3.2.
  • O. Vinyals, S. Bengio, and M. Kudlur (2015) Order matters: sequence to sequence for sets. External Links: 1511.06391 Cited by: §2.2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §A.1, §1, §4.1, §4.1.
  • P. Yanardag and S. Vishwanathan (2015) Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. Cited by: §4.1.
  • Y. You, T. Chen, Z. Wang, and Y. Shen (2020) When does self-supervision help graph convolutional networks?. arXiv preprint arXiv:2006.09136. Cited by: §1.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. External Links: 1710.09412 Cited by: §3.2.
  • T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah (2020) Data augmentation for graph neural networks. External Links: 2006.06830 Cited by: §A.1.
  • Y. Zhou, F. Wang, J. Tang, R. Nussinov, and F. Cheng (2020) Artificial intelligence in covid-19 drug repurposing. The Lancet Digital Health. Cited by: §1.

Appendix A Appendix

A.1 Related Work

Graph Representation Learning

Traditionally, graph kernels have been widely used for learning node and graph representations. The usual process involves meticulous design choices, such as decomposing graphs into substructures and using kernel functions like the Weisfeiler-Lehman graph kernel (Shervashidze et al., 2011) to measure the similarity between them. However, these methods usually require non-trivial hand-crafted substructures and domain-specific kernel functions to measure similarity, while yielding inferior performance on downstream tasks such as node classification and graph classification. Recently, there has been increasing interest in Graph Neural Network (GNN) approaches for graph representation learning, and many GNN variants have been proposed (Ramakrishnan et al., 2014; Kipf and Welling, 2016; Xu et al., 2018). However, they mainly focus on supervised settings.

Data augmentation

Data augmentation strategies on graphs are limited since defining views of graphs is a non-trivial task. There are two common choices of augmentation on graphs: (1) feature-space augmentation and (2) structure-space augmentation. A straightforward structure-space approach is to corrupt the adjacency matrix, which preserves the features but adds or removes edges with some probability (Veličković et al., 2018). Zhao et al. (2020) improve performance in GNN-based semi-supervised node classification via edge prediction. Empirical results show that a diffusion matrix can serve as a denoising filter that augments graph data and significantly improves graph representation learning in both supervised (Klicpera et al., 2019) and unsupervised settings (Hassani and Khasahmadi, 2020). Hassani and Khasahmadi (2020) show the benefits of treating the diffusion matrix as an augmented view in mutual information-based contrastive graph representation learning. Attaining effective views is non-trivial since we need to consider factors like mutual information to preserve label information w.r.t. the downstream task (Tian et al., 2020).
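The two structure-space augmentations discussed above, adjacency corruption and graph diffusion, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of any cited work: graphs are assumed to be undirected and given as dense 0/1 adjacency lists of lists, the diffusion uses personalized PageRank weights with the random-walk transition matrix T = A D^{-1}, and the infinite series is truncated after `num_terms` terms. The names `perturb_edges`, `ppr_diffusion`, `p_add`, `p_drop`, `alpha`, and `num_terms`, as well as their default values, are hypothetical.

```python
import random


def perturb_edges(adj, p_add, p_drop, seed=0):
    """Return a corrupted copy of a symmetric 0/1 adjacency matrix.

    Each existing edge is removed with probability p_drop and each missing
    edge is added with probability p_add. Node features are not touched.
    """
    rng = random.Random(seed)
    n = len(adj)
    out = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):  # visit each undirected pair once
            flip_prob = p_drop if adj[i][j] else p_add
            if rng.random() < flip_prob:
                out[i][j] = out[j][i] = 1 - adj[i][j]
    return out


def ppr_diffusion(adj, alpha=0.15, num_terms=50):
    """Truncated personalized PageRank diffusion matrix.

    Computes S = sum_{k=0}^{num_terms-1} alpha * (1 - alpha)^k * T^k,
    where T = A D^{-1} is the column-normalized transition matrix.
    """
    n = len(adj)
    col_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    T = [[adj[i][j] / col_deg[j] if col_deg[j] else 0.0 for j in range(n)]
         for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # T^0
    S = [[0.0] * n for _ in range(n)]
    coeff = alpha
    for _ in range(num_terms):
        for i in range(n):
            for j in range(n):
                S[i][j] += coeff * power[i][j]
        # Advance to the next power of T and the next series coefficient.
        power = [[sum(T[i][k] * power[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
        coeff *= 1.0 - alpha
    return S


# Toy usage on a three-node path graph 0 - 1 - 2.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
corrupted_view = perturb_edges(path, p_add=0.1, p_drop=0.2, seed=1)
diffusion_view = ppr_diffusion(path, alpha=0.15)
```

The corrupted view changes the graph locally and stochastically, whereas the diffusion view is deterministic and replaces the sparse adjacency matrix with a dense, weighted one that reflects multi-hop connectivity. For large graphs a sparse or closed-form computation would be preferable to the dense loops used here.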