1 Introduction
Graphs are ubiquitous representations encoding relational structures. Learning low-dimensional vector representations of graphs is critical in domains ranging from social science (Newman and Girvan, 2004) to bioinformatics (Duvenaud et al., 2015; Zhou et al., 2020). Many graph neural networks (GNNs) (Gilmer et al., 2017; Kipf and Welling, 2016; Xu et al., 2018) have been proposed to learn node and graph representations by aggregating information from each node's neighbors via nonlinear transformation and aggregation functions. However, a key limitation of existing GNN architectures is that they often require a large amount of labeled data to be competitive, and annotating graphs such as drug-target interaction networks is challenging since it needs domain-specific expertise. Therefore, unsupervised learning on graphs has long been studied, for example via graph kernels (Shervashidze et al., 2011) and matrix-factorization approaches (Belkin and Niyogi, 2002).
Inspired by the recent success of unsupervised representation learning in domains such as images (Chen et al., 2020a; He et al., 2020) and text (Radford et al., 2018), most related work in the graph domain follows either the pipeline of unsupervised pre-training (followed by fine-tuning) or the InfoMax principle (Hjelm et al., 2018). The former often needs meticulous design of pretext tasks (Hu et al., 2019; You et al., 2020), while the latter dominates unsupervised graph representation learning: encoders are trained to maximize the mutual information (MI) between the representations of the global graph and local patches (such as subgraphs) (Veličković et al., 2018; Sun et al., 2019; Hassani and Khasahmadi, 2020). However, MI-based approaches usually need to sample subgraphs as local views to contrast with global graphs, and they usually require an additional discriminator for scoring local-global pairs and negative samples, which is computationally prohibitive (Tschannen et al., 2019). Besides, their performance is very sensitive to the choice of encoders and MI estimators (Tschannen et al., 2019). Moreover, MI-based approaches cannot be easily extended to the semi-supervised setting since local subgraphs lack labels that can be utilized for training. Therefore, we seek an approach that learns representations of entire graphs by contrasting whole graphs directly, without the need for MI estimation, discriminators, or subgraph sampling.
Motivated by recent progress in contrastive learning, we propose Iterative Graph Self-Distillation (IGSD), a teacher-student framework that learns graph representations by contrasting graph instances directly. The high-level idea of IGSD is graph contrastive learning: we pull similar graphs together and push dissimilar graphs apart. However, the performance of conventional contrastive learning largely depends on how negative samples are selected. To learn discriminative representations and avoid collapsing to trivial solutions, a large set of negative samples (He et al., 2020; Chen et al., 2020a) or a special mining strategy (Schroff et al., 2015) is necessary. To alleviate the dependency on negative-sample mining while still learning discriminative graph representations, we propose to use self-distillation as a strong regularizer to guide graph representation learning.
In the IGSD framework, graph instances are augmented into several views, which are encoded and projected into a latent space where we define a similarity metric for consistency-based training. The parameters of the teacher network are iteratively updated as an exponential moving average of the student network parameters, allowing knowledge transfer between them. As only a small amount of labeled data is available in many real-world applications, we further extend our model to the semi-supervised setting so that we can effectively utilize graph-level labels while considering arbitrary numbers of positive pairs belonging to the same class. Moreover, to leverage the information from high-confidence pseudo-labels, we develop a self-training algorithm based on the supervised contrastive loss to fine-tune the encoder.
We experiment with real-world datasets of various scales and compare the performance of IGSD with state-of-the-art graph representation learning methods. Experimental results show that IGSD achieves competitive performance in both unsupervised and semi-supervised settings with different encoders and data augmentation choices. With the help of self-training, our performance can exceed state-of-the-art baselines by a large margin.
To summarize, we make the following contributions in this paper:

- We propose a self-distillation framework called IGSD for unsupervised graph-level representation learning, where teacher-student distillation is performed by contrasting graph pairs under different augmented views.

- We further extend IGSD to the semi-supervised scenario, where labeled data are utilized effectively with a supervised contrastive loss and self-training.

- We empirically show that IGSD surpasses state-of-the-art methods in semi-supervised graph classification and molecular property prediction tasks, and achieves performance competitive with state-of-the-art approaches in unsupervised graph classification tasks.
2 Preliminaries
2.1 Formulation
Unsupervised Graph Representation Learning
Given a set of unlabeled graphs, we aim to learn a low-dimensional representation of every graph that is favorable for downstream tasks like graph classification.
Semi-supervised Graph Representation Learning
Consider a dataset composed of labeled data and unlabeled data (usually with far more unlabeled than labeled examples); our goal is to learn a model that can predict graph labels for unseen graphs. With augmentations, we obtain the corresponding augmented labeled and unlabeled sets as our training data.
2.2 Graph Representation Learning
We represent a graph instance as $G = (V, E)$ with node set $V$ and edge set $E$. The dominant approach to graph representation learning is graph neural networks with neural message passing mechanisms (Hamilton et al., 2017): for every node $v \in V$, its representation $h_v$ is iteratively computed from the features of its neighbor nodes using a differentiable aggregation function. Specifically, at iteration $k$ we get the node embedding $h_v^{(k)}$ as:
(1)  $h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)}, \mathrm{AGGREGATE}^{(k)}\left(\left\{ h_u^{(k-1)} : u \in \mathcal{N}(v) \right\}\right)\right)$
Then the graph-level representation can be obtained by aggregating all node representations using a readout function such as summation or set2set pooling (Vinyals et al., 2015).
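As a concrete illustration, the aggregate-then-readout pipeline above can be sketched with a minimal sum-aggregation scheme in numpy. All names and the specific update rule (self plus neighbor summation, linear map, ReLU) are illustrative, not the paper's exact implementation:

```python
import numpy as np

def message_passing(A, X, weights):
    # A: (n, n) adjacency matrix, X: (n, d) node features,
    # weights: list of (d, d) weight matrices, one per layer.
    H = X
    n = A.shape[0]
    for W in weights:
        # Sum-aggregate over self + neighbors, apply a linear map, then ReLU.
        H = np.maximum((np.eye(n) + A) @ H @ W, 0.0)
    return H

def readout(H):
    # Sum readout: collapse node embeddings into one graph-level embedding.
    return H.sum(axis=0)
```

Running the two functions on a small graph yields a single fixed-size vector per graph, which is what the downstream classifier consumes.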
2.3 Graph Data Augmentation
It has been shown that the learning performance of GNNs can be improved via graph diffusion, which serves as a homophily-based denoising filter on both features and edges in real graphs (Klicpera et al., 2019). The transformed graphs can also serve as effective augmented views in contrastive learning (Hassani and Khasahmadi, 2020). Inspired by this, we transform a graph with transition matrix $T$ via graph diffusion and sparsification into a new graph with adjacency matrix $S$, which serves as an augmented view in our framework. While there are many choices of diffusion coefficients, such as the heat kernel, we employ Personalized PageRank (PPR) due to its superior empirical performance (Hassani and Khasahmadi, 2020). Besides, we randomly remove edges from graphs to obtain corrupted graphs as augmented views, to validate the robustness of the model to different augmentation choices.
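The closed-form PPR diffusion with sparsification can be sketched as follows. The teleport probability `alpha` and the sparsification threshold `eps` are illustrative defaults, not values prescribed by the paper, and the symmetric normalization follows the common graph-diffusion formulation:

```python
import numpy as np

def ppr_diffusion(A, alpha=0.15, eps=1e-4):
    # Closed-form PPR diffusion: S = alpha * (I - (1 - alpha) * T)^{-1},
    # with T the symmetrically normalized transition matrix, followed by
    # sparsification that zeroes entries below eps.
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    T = d_inv_sqrt @ A @ d_inv_sqrt              # normalized transition matrix
    S = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * T)
    S[S < eps] = 0.0                             # sparsify small entries
    return S
```

The resulting dense matrix S is sparsified and then used as the adjacency matrix of the augmented view.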
3 Iterative Graph Self-Distillation
Intuitively, the goal of contrastive learning on graphs is to learn graph representations that are close in the metric space for positive pairs (graphs with the same labels) and far apart for negative pairs (graphs with different labels). To achieve this goal, IGSD employs teacher-student distillation to iteratively refine representations by contrasting latent representations embedded by the two networks, using an additional predictor and an EMA update to avoid collapsing to trivial solutions. Overall, IGSD encourages the closeness of augmented views from the same graph instance while pushing apart the representations of different ones.
3.1 Iterative Graph Self-Distillation Framework
In IGSD, we introduce a teacher-student architecture comprising two similarly structured networks, each composed of an encoder and a projector, with an additional predictor in the student network.
The overview of IGSD is illustrated in Figure 1. In IGSD we first augment the original input graphs into augmented views and then feed them into the two encoders to extract graph representations. The subsequent projectors transform the graph representations into higher-dimensional projections. To prevent collapse into a trivial solution (Grill et al., 2020), a specialized predictor is used in the student network to obtain a prediction of the teacher's projection.
To contrast the latents, we use the $\ell_2$ distance in the latent space to approximate the semantic distance in the input space, and define the consistency loss as the mean squared error between the normalized prediction and projection. By passing the two augmented views symmetrically through both networks, we obtain the overall consistency loss:
(2)  $\mathcal{L}_{\mathrm{con}} = \left\lVert \bar{p}_1 - \bar{z}'_2 \right\rVert_2^2 + \left\lVert \bar{p}_2 - \bar{z}'_1 \right\rVert_2^2$, where $p_i$ is the student's prediction for view $i$, $z'_i$ is the teacher's projection of view $i$, and the bar denotes $\ell_2$-normalization.
With the consistency loss, the teacher network provides a regression target to train the student network, and its parameters are updated as an exponential moving average (EMA) of the student parameters after weights of the student model have been updated using gradient descent:
(3)  $\theta_t \leftarrow \tau \theta_t + (1 - \tau)\,\theta_s$, where $\theta_t$ and $\theta_s$ denote the teacher and student parameters and $\tau \in [0, 1)$ is the momentum coefficient.
With the above iterative self-distillation procedure, we aggregate information by averaging model weights over each training step instead of using the final weights directly.
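The two ingredients just described, the normalized-MSE consistency loss and the EMA teacher update, can be sketched in numpy as follows (a simplified sketch; tensor shapes and the momentum value `tau` are illustrative):

```python
import numpy as np

def consistency_loss(pred, proj):
    # Mean squared error between the l2-normalized student prediction and
    # teacher projection (one direction of the symmetric consistency loss).
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    z = proj / np.linalg.norm(proj, axis=-1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=-1))

def ema_update(teacher, student, tau=0.99):
    # theta_teacher <- tau * theta_teacher + (1 - tau) * theta_student,
    # applied parameter-wise; gradients flow only through the student.
    return [tau * t + (1 - tau) * s for t, s in zip(teacher, student)]
```

Note that two latents pointing in the same direction incur zero loss regardless of magnitude, which is exactly what the normalization is for.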
3.2 Unsupervised Learning with IGSD
In IGSD, to contrast an anchor graph with other graph instances (i.e., negative samples), we employ the following unsupervised InfoNCE objective (Oord et al., 2018):
(4)  $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \dfrac{\exp\!\left(s(z_i, z'_i)\right)}{\sum_{j=1}^{N} \exp\!\left(s(z_i, z'_j)\right)}$, where $s(\cdot, \cdot)$ is a similarity score between latents, $z'_i$ is the positive (augmented) view of anchor $z_i$, and the remaining views in the batch act as negatives.
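A minimal sketch of an InfoNCE-style objective for one anchor, using cosine similarity as the score; the temperature value and the similarity choice are illustrative assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    # -log softmax probability of the positive pair among all candidates.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits = logits - logits.max()   # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])
```

The loss is near zero when the anchor matches the positive and is far from every negative, and grows as negatives become more similar to the anchor than the positive.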
At inference time, since semantic interpolation on samples, labels, and latents is effective in obtaining better representations and can greatly improve learning performance (Zhang et al., 2017; Verma et al., 2019; Berthelot et al., 2019), we obtain the graph representation by interpolating the two latent representations with the Mixup function:

(5)  $z = \mathrm{Mix}_\lambda(z_1, z_2) = \lambda z_1 + (1 - \lambda)\, z_2$
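The latent interpolation amounts to one line of code; `lam = 0.5` matches the Mixup coefficient used in the experiments (Section 4.1), and the function name is illustrative:

```python
def mix(z1, z2, lam=0.5):
    # Linear interpolation of two latent representations (Mixup in latent space).
    return lam * z1 + (1 - lam) * z2
```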
3.3 Semi-supervised Learning with IGSD
To bridge the gap between unsupervised pre-training and downstream tasks, we extend our model to the semi-supervised setting. In this scenario, it is straightforward to plug in the unsupervised loss as a regularizer for representation learning. However, the instance-wise supervision of standard supervised learning may lead to biased negative sampling problems (Chuang et al., 2020). To tackle this challenge, we use a small amount of labeled data to generalize the similarity loss so that it handles arbitrary numbers of positive samples belonging to the same class:

(6)  $\mathcal{L}_{\mathrm{sup}} = -\sum_{i=1}^{N} \frac{1}{N_{y_i}} \sum_{j=1}^{N} \mathbb{1}_{[j \neq i]}\, \mathbb{1}_{[y_j = y_i]} \log \dfrac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k \neq i} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$
where $N_{y_i}$ denotes the total number of samples in the training set that share the label of anchor $i$. Thanks to the graph-level contrastive nature of IGSD, we are able to alleviate the biased negative sampling problem (Khosla et al., 2020) with the supervised contrastive loss, which is crucial (Chuang et al., 2020) but unachievable in most MI-based contrastive learning models, since subgraphs are generally hard to assign labels to. Besides, with this loss we can fine-tune our model effectively using self-training, where pseudo-labels are assigned iteratively to unlabeled data.
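A simplified sketch of a supervised contrastive loss in the spirit of Eq. 6, where every other same-label sample serves as a positive for each anchor; the temperature value and log-sum-exp details are assumptions, not the paper's exact implementation:

```python
import numpy as np

def sup_con_loss(Z, labels, temperature=0.5):
    # Z: (n, d) l2-normalized embeddings; labels: (n,) integer class labels.
    n = Z.shape[0]
    sim = Z @ Z.T / temperature
    total, counted = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # anchors without positives contribute nothing
        others = [j for j in range(n) if j != i]
        m = max(sim[i, j] for j in others)
        # Stable log of the denominator over all non-anchor samples.
        log_denom = m + np.log(sum(np.exp(sim[i, j] - m) for j in others))
        # Average -log softmax over all positives of this anchor.
        total += -np.mean([sim[i, j] for j in pos]) + log_denom
        counted += 1
    return total / max(counted, 1)
```

Embeddings that cluster by class yield a lower loss than embeddings whose labels disagree with the geometry, which is the signal the self-training loop exploits.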
With a standard supervised loss such as cross-entropy or mean squared error, the overall objective can be summarized as:

(7)  $\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \alpha \mathcal{L}_{\mathrm{unsup}} + \beta \mathcal{L}_{\mathrm{sup}}$, where $\mathcal{L}_{\mathrm{cls}}$ is the supervised loss and $\alpha$ and $\beta$ weight the unsupervised contrastive loss and the supervised contrastive loss, respectively.
Common semi-supervised learning methods use consistency regularization, measuring the discrepancy between predictions made on perturbed unlabeled data points, to obtain better prediction stability and generalization performance (Oliver et al., 2018). By contrast, our method enforces consistency between latents from different views, which acts as a regularizer for learning directly from labels. Labeled data provide additional supervision about graph classes and alleviate biased negative sampling; however, labels are costly to obtain in many areas. Therefore, we develop a contrastive self-training algorithm to leverage label information more effectively than cross-entropy in the semi-supervised scenario. In the algorithm, we train the model using a small amount of labeled data and then fine-tune it by iterating between assigning pseudo-labels to unlabeled examples and training the model on the augmented dataset. In this way, we can harvest massive numbers of pseudo-labels for unlabeled examples.
As the size of the augmented labeled dataset grows, the discriminative power of IGSD improves iteratively by contrasting more positive pairs belonging to the same class. In this way, we accumulate high-quality pseudo-labels after each iteration to compute the supervised contrastive loss in Eq. 6, which distinguishes our approach from conventional self-training algorithms (Rosenberg et al., 2005), which use pseudo-labels only for computing cross-entropy.
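One round of the confidence-based pseudo-labelling step can be sketched as follows; the default threshold matches the tuning range mentioned in Section 4.1, and all names are illustrative:

```python
import numpy as np

def confident_pseudo_labels(probs, threshold=0.9):
    # Keep only unlabeled examples whose top predicted class probability
    # exceeds the threshold; return their indices and hard pseudo-labels
    # for the next contrastive training round.
    conf = probs.max(axis=1)
    keep = np.where(conf > threshold)[0]
    return keep, probs[keep].argmax(axis=1)
```

The confident subset is then merged with the labeled set so that Eq. 6 sees more positives per class in the next iteration.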
4 Experiments
4.1 Experimental Setup
Evaluation Tasks. We conduct experiments comparing with state-of-the-art models on three tasks. In graph classification tasks, we experiment in both the unsupervised setting, where we only have access to unlabeled samples, and the semi-supervised setting, where we use a small fraction of labeled examples and treat the rest as unlabeled by ignoring their labels. In molecular property prediction tasks, where labels are expensive to obtain, we only consider the semi-supervised setting.
Datasets. For graph classification tasks, we employ several widely used benchmark graph kernel datasets (Kersting et al., 2016) for learning and evaluation: 3 bioinformatics datasets (MUTAG, PTC_MR, NCI1) and 3 social network datasets (COLLAB, IMDB-BINARY, IMDB-MULTI), with statistics summarized in Table 1. For the semi-supervised graph regression tasks, we use the QM9 dataset containing 134,000 drug-like organic molecules (Ramakrishnan et al., 2014) and select the first ten physicochemical properties as regression targets for training and evaluation.
Baselines.
In the unsupervised graph classification, we compare with the following representative baselines: CMC-Graph (Hassani and Khasahmadi, 2020), InfoGraph (Sun et al., 2019), Graph2Vec (Narayanan et al., 2017), and graph kernels including the Random Walk Kernel (Gärtner et al., 2003), Shortest Path Kernel (Kashima et al., 2003), Graphlet Kernel (Shervashidze et al., 2009), Weisfeiler-Lehman Subtree Kernel (WL SubTree) (Shervashidze et al., 2011), Deep Graph Kernels (Yanardag and Vishwanathan, 2015), and the Multi-Scale Laplacian Kernel (MLG) (Kondor and Pan, 2016).
For the semi-supervised graph classification, we compare our method with competitive baselines: InfoGraph, InfoGraph*, and Mean Teachers (Tarvainen and Valpola, 2017); the GIN baseline does not have access to the unlabeled data. In the semi-supervised molecular property prediction tasks, the baselines are InfoGraph, InfoGraph*, and Mean Teachers.
Model Configuration. In our framework, we use GCNs (Kipf and Welling, 2016) and GINs (Xu et al., 2018) as encoders to obtain node representations for the unsupervised and semi-supervised graph classification, respectively. For semi-supervised molecular property prediction tasks, we employ message passing neural networks (MPNNs) (Gilmer et al., 2017) as our backbone encoders to encode molecular graphs with rich edge attributes. All projectors and predictors are implemented as two-layer MLPs.
In semi-supervised molecular property prediction tasks, we generate multiple views based on the edge attributes (bond types) of richly annotated molecular graphs to improve performance. Specifically, we perform label-preserving augmentation to obtain multiple diffusion matrices of every graph, each computed on one edge attribute while ignoring the others. Each diffusion matrix gives a denser graph based on one type of edge, leveraging edge features better. We train our models using different numbers of augmented training examples and select the amount using cross-validation.
For parameter tuning, we select the number of GCN layers over {2, 8, 12}, batch size over {16, 32, 64, 128, 256, 512}, number of epochs over {20, 40, 100}, and learning rate over {1e-4, 1e-3} in unsupervised graph classification. The hyperparameters we tune for semi-supervised graph classification and molecular property prediction are the same as in (Xu et al., 2018) and (Sun et al., 2019), respectively. In all experiments, we set the weighting coefficient of the Mixup function to 0.5 and tune the projection hidden size over {1024, 2048} and the projection size over {256, 512}, since bigger projection heads are able to improve representation learning (Chen et al., 2020b). We start self-training after 30 epochs and tune the number of iterations over {20, 50} and the pseudo-labeling threshold over {0.9, 0.95}.
For unsupervised graph classification, we adopt LIBSVM (Chang and Lin, 2011) with the C parameter selected from {1e-3, 1e-2, …, 1e2, 1e3} as our downstream classifier. We use 10-fold cross-validation accuracy as the classification performance and repeat each experiment 5 times to report the mean and standard deviation. For semi-supervised graph classification, we randomly select a fraction of the training data as labeled data, treat the rest as unlabeled, and report the best test-set accuracy within 300 epochs. Following the experimental setup in (Sun et al., 2019), we randomly choose 5000, 10000, and 10000 samples for training, validation, and testing respectively, and treat the rest as unlabeled training data for the molecular property prediction tasks.

Table 1: Dataset statistics and unsupervised graph classification accuracies (%).

| Datasets | MUTAG | IMDB-B | IMDB-M | NCI1 | COLLAB | PTC |
| --- | --- | --- | --- | --- | --- | --- |
| # graphs | 188 | 1000 | 1500 | 4110 | 5000 | 344 |
| # classes | 2 | 2 | 3 | 2 | 3 | 2 |
| Avg # nodes | 17.9 | 19.8 | 13.0 | 29.8 | 74.5 | 25.5 |
| Random Walk (kernel) | 83.7 ± 1.5 | 50.7 ± 0.3 | 34.7 ± 0.2 | OMR | OMR | 57.9 ± 1.3 |
| Shortest Path (kernel) | 85.2 ± 2.4 | 55.6 ± 0.2 | 38.0 ± 0.3 | 51.3 ± 0.6 | 49.8 ± 1.2 | 58.2 ± 2.4 |
| Graphlet Kernel | 81.7 ± 2.1 | 65.9 ± 1.0 | 43.9 ± 0.4 | 53.9 ± 0.4 | 56.3 ± 0.6 | 57.3 ± 1.4 |
| WL subtree (kernel) | 80.7 ± 3.0 | 72.3 ± 3.4 | 47.0 ± 0.5 | 55.1 ± 1.6 | 50.2 ± 0.9 | 58.0 ± 0.5 |
| Deep Graph (kernel) | 87.4 ± 2.7 | 67.0 ± 0.6 | 44.6 ± 0.5 | 54.5 ± 1.2 | 52.1 ± 1.0 | 60.1 ± 2.6 |
| MLG (kernel) | 87.9 ± 1.6 | 66.6 ± 0.3 | 41.2 ± 0.0 | >1 day | >1 day | 63.3 ± 1.5 |
| Graph2Vec | 83.2 ± 9.6 | 71.1 ± 0.5 | 50.4 ± 0.9 | 73.2 ± 1.8 | 47.9 ± 0.3 | 60.2 ± 6.9 |
| InfoGraph | 89.0 ± 1.1 | 74.2 ± 0.7 | 49.7 ± 0.5 | 73.8 ± 0.7 | 67.6 ± 1.2 | 61.7 ± 1.7 |
| CMC-Graph | 89.7 ± 1.1 | 74.2 ± 0.7 | 51.2 ± 0.5 | 75.0 ± 0.7 | 68.9 ± 1.9 | 62.5 ± 1.7 |
| Ours (Random) | 85.7 ± 2.1 | 71.6 ± 1.2 | 49.2 ± 0.6 | 75.1 ± 0.4 | 65.8 ± 1.0 | 57.6 ± 1.5 |
| Ours | 90.2 ± 0.7 | 74.7 ± 0.6 | 51.5 ± 0.3 | 75.4 ± 0.3 | 70.4 ± 1.1 | 61.4 ± 1.7 |
Table 2: Semi-supervised graph classification accuracies (%).

| Datasets | IMDB-B | IMDB-M | COLLAB | NCI1 |
| --- | --- | --- | --- | --- |
| Mean Teachers | 69.0 | 49.3 | 72.5 | 71.1 |
| InfoGraph* | 71.0 | 49.3 | 67.6 | 71.1 |
| GIN (Supervised Only) | 67.0 | 50.0 | 71.4 | 67.9 |
| Ours (Unsup) | 72.0 | 50.0 | 72.6 | 70.6 |
| Ours (SupCon) | 75.0 | 52.0 | 73.4 | 67.9 |
| GIN (Supervised Only) + self-training | 72.0 | 51.3 | 70.4 | 74.0 |
| Ours (Unsup) + self-training | 73.0 | 54.0 | 71.0 | 72.5 |
| Ours (SupCon) + self-training | 77.0 | 55.3 | 73.6 | 77.1 |
4.2 Numerical Results
Results on unsupervised graph classification. We first present the results of the unsupervised setting in Table 1. All graph kernels give inferior performance except on the PTC dataset. The Random Walk Kernel runs out of memory and the Multi-Scale Laplacian Kernel suffers from a long running time (exceeding 24 hours) on the two larger datasets. IGSD outperforms state-of-the-art baselines like InfoGraph and CMC-Graph, showing that IGSD can learn expressive graph-level representations for downstream classifiers. Besides, our model still achieves competitive results on datasets like IMDB-M and NCI1 with random edge-dropping augmentation, which demonstrates the robustness of IGSD to different choices of data augmentation strategies.
Results on semi-supervised graph classification. We further apply our model to semi-supervised graph classification tasks, with results in Table 2, where we set the two weighting coefficients in Eq. 7 to 1 and 0 for Ours (Unsup), and to 0 and 1 for Ours (SupCon). In this setting, our model performs better than Mean Teachers and InfoGraph*. Both the unsupervised loss and the supervised contrastive loss provide extra performance gains compared with GIN using supervised data only. Besides, performance improves significantly when combined with self-training, especially with the supervised contrastive loss. This makes empirical sense since self-training iteratively assigns high-confidence pseudo-labels to unlabeled data, which provides extra supervision on their categories under the contrastive learning framework.
Results on semi-supervised molecular property prediction. We present the regression performance of our model on the QM9 dataset in Figure 2. We report the performance of our model and baselines as the mean squared error ratio with respect to supervised results; our model outperforms the strong baselines InfoGraph, InfoGraph*, and Mean Teachers in 9 out of 10 tasks. In some tasks, such as R2 (5), U0 (7), and U (8), IGSD achieves significant performance gains against its counterparts, which demonstrates its ability to transfer knowledge learned from unsupervised data to supervised tasks.
4.3 Ablation Studies and Analysis
Performance with self-training. We first investigate the effects of self-training on our model's performance in Table 2. Results show that self-training improves the GIN baseline and our models with the unsupervised loss (Unsup) or the supervised contrastive loss (SupCon). The improvement is even more significant when combined with the supervised contrastive loss, since high-quality pseudo-labels provide additional information about graph categories. Moreover, our self-training algorithm consistently outperforms the traditional self-training baseline, which further validates the superiority of our model.
Performance with different amounts of negative pairs. We then conduct ablation experiments on the number of negative pairs by varying the batch size over {16, 32, 64, 128}, with results on the IMDB-BINARY dataset shown in Figure 2(a). Both methods contrast negative pairs batch-wise; increasing the batch size improves the performance of IGSD while degrading that of CMC-Graph. When the batch size is greater than 32, IGSD outperforms CMC-Graph, and the performance gap widens as the batch size increases.
Performance with different proportions of labeled data. We also investigate the performance of different models with different proportions of labeled data on the IMDB-BINARY dataset. As illustrated in Figure 2(b), IGSD consistently outperforms the strong InfoGraph* baseline given different amounts of labeled data. The performance gain is most significant when the fraction of labeled data is small, since our model can leverage labels more effectively by regularizing the original unsupervised learning objective when labels are scarce.
5 Related Work
Contrastive Learning Modern unsupervised learning in the form of contrastive learning can be categorized into two types: context-instance contrast and context-context contrast (Liu et al., 2020). Context-instance contrast, or so-called global-local contrast, focuses on modeling the belonging relationship between the local feature of a sample and its global context representation. Most unsupervised learning models on graphs, like DGI (Veličković et al., 2018), InfoGraph (Sun et al., 2019), and CMC-Graph (Hassani and Khasahmadi, 2020), fall into this category, following the InfoMax principle to maximize the mutual information (MI) between the input and its representation. However, estimating MI is notoriously hard in MI-based contrastive learning, and in practice a tractable lower bound on this quantity is maximized instead. Maximizing tighter bounds on MI can result in worse representations without stronger inductive biases in sampling strategies, encoder architecture, and parametrization of MI estimators (Tschannen et al., 2019). Besides, the intricacies of negative sampling in MI-based approaches impose key research challenges such as an improper number of negative samples or biased negative sampling (Tschannen et al., 2019; Chuang et al., 2020). Another line of contrastive learning approaches, called context-context contrast, directly studies the relationships between the global representations of different samples, as metric learning does. For instance, the recently proposed BYOL (Grill et al., 2020) bootstraps the representations of whole images directly. Focusing on global representations of samples and their augmented views also allows instance-level supervision to be incorporated naturally, such as introducing a supervised contrastive loss (Khosla et al., 2020) into the framework for learning powerful representations.
Graph Contrastive Coding (GCC) (Qiu et al., 2020) pioneered leveraging instance discrimination as the pretext task for structural-information pre-training. However, our work is fundamentally different from theirs. GCC focuses on structural similarity, using InfoNCE as the learning objective to find common and transferable structural patterns across different graph datasets, with the contrastive scheme realized through subgraph instance discrimination. In contrast, our model aims at learning graph-level representations by directly contrasting graph instances, so that data augmentation strategies and graph labels can be utilized naturally and effectively.
Knowledge Distillation Knowledge distillation (Hinton et al., 2015) is a method for transferring knowledge from one architecture to another, allowing model compression and inductive-bias transfer. Self-distillation (Furlanello et al., 2018) is the special case in which the two architectures are identical; performed for a suitable number of rounds, it can iteratively modify regularization and reduce overfitting (Mobahi et al., 2020). However, these methods focus on closing the gap between the predictive results of student and teacher, rather than defining a similarity loss in latent space for contrastive learning.
Semi-supervised Learning Modern semi-supervised learning can be categorized into two kinds: multi-task learning and consistency training between two separate networks. Most widely used semi-supervised learning methods take the form of multi-task learning, combining a supervised loss on labeled data with an unsupervised loss on unlabeled data. By regularizing the learning process with unlabeled data, the decision boundary becomes more plausible. Another mainstream of semi-supervised learning introduces a student network and a teacher network and enforces consistency between them (Tarvainen and Valpola, 2017; Miyato et al., 2019; Lee, 2013). It has been shown that semi-supervised learning performance can be greatly improved via unsupervised pre-training of a (big) model, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge (Chen et al., 2020b). However, whether task-agnostic self-distillation benefits semi-supervised learning is still underexplored.
6 Conclusions
In this paper, we propose IGSD, a novel unsupervised graph-level representation learning framework based on self-distillation. Our framework iteratively performs teacher-student distillation by contrasting augmented views of graph instances. Experimental results in both unsupervised and semi-supervised settings show that IGSD not only learns effective graph representations competitive with state-of-the-art models, but is also robust to the choices of encoders and augmentation strategies. In the future, we plan to apply our framework to other graph learning tasks and investigate the design of view generators that produce effective views automatically.
References
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pp. 585–591.
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. (2019). MixMatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059.
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), pp. 1–27.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020a). A simple framework for contrastive learning of visual representations. arXiv:2002.05709.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. (2020b). Big self-supervised models are strong semi-supervised learners. arXiv:2006.10029.
Chuang, C.-Y., Robinson, J., Lin, Y.-C., Torralba, A., and Jegelka, S. (2020). Debiased contrastive learning. arXiv:2007.00224.
Duvenaud, D., et al. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232.
Furlanello, T., Lipton, Z., Tschannen, M., Itti, L., and Anandkumar, A. (2018). Born again neural networks. arXiv:1805.04770.
Gärtner, T., Flach, P., and Wrobel, S. (2003). On graph kernels: hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pp. 129–143.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. arXiv:1704.01212.
Grill, J.-B., et al. (2020). Bootstrap your own latent: a new approach to self-supervised learning. arXiv:2006.07733.
Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
Hassani, K. and Khasahmadi, A. H. (2020). Contrastive multi-view representation learning on graphs. arXiv:2006.05582.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531.
Hjelm, R. D., et al. (2018). Learning deep representations by mutual information estimation and maximization. arXiv:1808.06670.
Hu, W., et al. (2019). Strategies for pre-training graph neural networks. arXiv:1905.12265.
Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 321–328.
Kersting, K., et al. (2016). Benchmark data sets for graph kernels.
Khosla, P., et al. (2020). Supervised contrastive learning. arXiv:2004.11362.
Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
Klicpera, J., Weißenberger, S., and Günnemann, S. (2019). Diffusion improves graph learning. arXiv:1911.05485.
Kondor, R. and Pan, H. (2016). The multiscale Laplacian graph kernel. In Advances in Neural Information Processing Systems, pp. 2990–2998.
Lee, D.-H. (2013). Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
Liu, X., et al. (2020). Self-supervised learning: generative or contrastive. arXiv:2006.08218.
Miyato, T., Maeda, S., Koyama, M., and Ishii, S. (2019). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(8), pp. 1979–1993.
Mobahi, H., Farajtabar, M., and Bartlett, P. (2020). Self-distillation amplifies regularization in Hilbert space. arXiv:2002.05715.
Narayanan, A., et al. (2017). graph2vec: learning distributed representations of graphs. arXiv:1707.05005.
Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E 69(2), 026113.
Oliver, A., Odena, A., Raffel, C., Cubuk, E. D., and Goodfellow, I. (2018). Realistic evaluation of deep semi-supervised learning algorithms. arXiv:1804.09170.
Oord, A. van den, Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv:1807.03748.
Qiu, J., et al. (2020). GCC: graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1150–1160.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.
Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, O. A. (2014). Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1(1), pp. 1–7.
Rosenberg, C., Hebert, M., and Schneiderman, H. (2005). Semi-supervised self-training of object detection models.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12(9).
Shervashidze, N., et al. (2009). Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495.
Sun, F.-Y., Hoffmann, J., Verma, V., and Tang, J. (2019). InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv:1908.01000.
Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv:1703.01780.
Tian, Y., et al. (2020). What makes for good views for contrastive learning? arXiv:2005.10243.
Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. (2019). On mutual information maximization for representation learning. arXiv:1907.13625.
Veličković, P., et al. (2018). Deep graph infomax. arXiv:1809.10341.
Verma, V., et al. (2019). Manifold mixup: better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438–6447.
Vinyals, O., Bengio, S., and Kudlur, M. (2015). Order matters: sequence to sequence for sets. arXiv:1511.06391.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2018). How powerful are graph neural networks? arXiv:1810.00826.
Yanardag, P. and Vishwanathan, S. V. N. (2015). Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374.
You, Y., Chen, T., Wang, Z., and Shen, Y. (2020). When does self-supervision help graph convolutional networks? arXiv:2006.09136.
Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2017). mixup: beyond empirical risk minimization. arXiv:1710.09412.
Zhao, T., et al. (2020). Data augmentation for graph neural networks. arXiv:2006.06830.
Zhou, Y., et al. (2020). Artificial intelligence in COVID-19 drug repurposing. The Lancet Digital Health.
Appendix A Appendix
A.1 Related Work
Graph Representation Learning
Traditionally, graph kernels have been widely used for learning node and graph representations. The common process includes meticulous designs such as decomposing graphs into substructures and using kernel functions like the Weisfeiler-Lehman graph kernel (Shervashidze et al., 2011) to measure graph similarity between them. However, these methods usually require non-trivial hand-crafted substructures and domain-specific kernel functions, while yielding inferior performance on downstream tasks like node classification and graph classification. Recently, there has been increasing interest in graph neural network (GNN) approaches for graph representation learning, and many GNN variants have been proposed (Ramakrishnan et al., 2014; Kipf and Welling, 2016; Xu et al., 2018). However, they mainly focus on supervised settings.
Data augmentation
Data augmentation strategies on graphs are limited since defining views of graphs is a non-trivial task. There are two common choices of augmentations on graphs: (1) feature-space augmentation and (2) structure-space augmentation. A straightforward way is to corrupt the adjacency matrix: preserve the features but add or remove edges with some probability distribution (Veličković et al., 2018). Zhao et al. (2020) improve performance in GNN-based semi-supervised node classification via edge prediction. Empirical results show that the diffusion matrix can serve as a denoising filter to augment graph data and significantly improve graph representation learning, in both supervised (Klicpera et al., 2019) and unsupervised settings (Hassani and Khasahmadi, 2020). Hassani and Khasahmadi (2020) show the benefits of treating the diffusion matrix as an augmented view in mutual-information-based contrastive graph representation learning. Attaining effective views is non-trivial, since factors like mutual information must be considered to preserve label information w.r.t. the downstream task (Tian et al., 2020).