When Does Self-Supervision Help Graph Convolutional Networks?

06/16/2020 ∙ by Yuning You, et al.

Self-supervision as an emerging technique has been employed to train convolutional neural networks (CNNs) for more transferable, generalizable, and robust representation learning of images. Its introduction to graph convolutional networks (GCNs) operating on graph data, however, remains rarely explored. In this study, we report the first systematic exploration and assessment of incorporating self-supervision into GCNs. We first elaborate three mechanisms to incorporate self-supervision into GCNs, analyze the limitations of pretraining & finetuning and self-training, and proceed to focus on multi-task learning. Moreover, we propose to investigate three novel self-supervised learning tasks for GCNs, with theoretical rationales and numerical comparisons. Lastly, we further integrate multi-task self-supervision into graph adversarial training. Our results show that, with properly designed task forms and incorporation mechanisms, self-supervision benefits GCNs in gaining more generalizability and robustness. Our code is available at <https://github.com/Shen-Lab/SS-GCNs>.






[ICML 2020] "When Does Self-Supervision Help Graph Convolutional Networks?" by Yuning You, Tianlong Chen, Zhangyang Wang, Yang Shen


1 Introduction

Graph convolutional networks (GCNs) (Kipf and Welling, 2016) generalize convolutional neural networks (CNNs) (LeCun et al., 1995) to graph-structured data and exploit the properties of graphs. They have outperformed traditional approaches in numerous graph-based tasks such as node or link classification (Kipf and Welling, 2016; Veličković et al., 2017; Qu et al., 2019; Verma et al., 2019; Karimi et al., 2019; You et al., 2020), link prediction (Zhang and Chen, 2018), and graph classification (Ying et al., 2018; Xu et al., 2018), many of which are semi-supervised learning tasks. In this paper, we mainly focus our discussion on transductive semi-supervised node classification, as a representative testbed for GCNs, where a graph contains abundant unlabeled nodes and a small number of labeled nodes, with the goal of predicting the labels of the remaining unlabeled nodes.

In a parallel note, self-supervision has raised a surge of interest in the computer vision domain (Goyal et al., 2019; Kolesnikov et al., 2019; Mohseni et al., 2020) to make use of rich unlabeled data. It aims to help the model learn more transferable and generalized representation from unlabeled data via pretext tasks, through pretraining (followed by finetuning) or multi-task learning. The pretext tasks shall be carefully designed in order to facilitate the network to learn downstream-related semantic features (Su et al., 2019). A number of pretext tasks have been proposed for CNNs, including rotation (Gidaris et al., 2018), exemplar (Dosovitskiy et al., 2014), jigsaw (Noroozi and Favaro, 2016) and relative patch location prediction (Doersch et al., 2015). Lately, Hendrycks et al. (2019) demonstrated the promise of self-supervised learning as auxiliary regularization for improving robustness and uncertainty estimation.

Chen et al. (2020) introduced adversarial training into self-supervision, to provide the first general-purpose robust pretraining.

In short, GCN tasks usually admit transductive semi-supervised settings with a tremendous number of unlabeled nodes; meanwhile, self-supervision plays an increasing role in utilizing unlabeled data in CNNs. In view of these two facts, we are naturally motivated to ask the following interesting, yet rarely explored question:

  • Can self-supervised learning play a similar role in GCNs to improve their generalizability and robustness?

Contributions. This paper presents the first systematic study on how to incorporate self-supervision in GCNs, unfolded by addressing three concrete questions:

  • Could GCNs benefit from self-supervised learning in their classification performance? If yes, how to incorporate it in GCNs to maximize the gain?

  • Does the design of pretext tasks matter? What are the useful self-supervised pretext tasks for GCNs?

  • Would self-supervision also affect the adversarial robustness of GCNs? If yes, how to design pretext tasks?

Directly addressing the above questions, our contributions are summarized as follows:

  • We demonstrate the effectiveness of incorporating self-supervised learning in GCNs through multi-task learning, i.e. as a regularization term in GCN training. It compares favorably against self-supervision as pretraining, or via self-training (Sun et al., 2019).

  • We investigate three self-supervised tasks based on graph properties. Besides the node clustering task previously mentioned in (Sun et al., 2019), we propose two new types of tasks: graph partitioning and completion. We further illustrate that different models and datasets seem to prefer different self-supervised tasks.

  • We further generalize the above findings into the adversarial training setting. We provide extensive results to show that self-supervision also improves the robustness of GCNs under various attacks, without requiring larger models or additional data.

2 Related Work

Graph-based semi-supervised learning. Graph-based semi-supervised learning rests on the crucial assumption that nodes connected by edges of larger weights are more likely to have the same label (Zhu and Goldberg, 2009). There is an abundance of work on graph-based methods, e.g. (randomized) mincuts (Blum and Chawla, 2001; Blum et al., 2004), Boltzmann machines (Getz et al., 2006; Zhu and Ghahramani, 2002), and graph random walks (Azran, 2007; Szummer and Jaakkola, 2002). Lately, the graph convolutional network (GCN) (Kipf and Welling, 2016) and its variants (Veličković et al., 2017; Qu et al., 2019; Verma et al., 2019) have gained popularity by extending the assumption from a hand-crafted form to a data-driven fashion. A detailed review can be found in (Wu et al., 2019b).

Self-supervised learning. Self-supervision is a promising direction for neural networks to learn more transferable, generalized and robust features in the computer vision domain (Goyal et al., 2019; Kolesnikov et al., 2019; Hendrycks et al., 2019). So far, the usage of self-supervision in CNNs mainly falls under two categories: pretraining & finetuning, and multi-task learning. In pretraining & finetuning, the CNN is first pretrained with self-supervised pretext tasks and then finetuned on the target task supervised by labels (Trinh et al., 2019; Noroozi and Favaro, 2016; Gidaris et al., 2018), while in multi-task learning the network is trained simultaneously with a joint objective of the target supervised task and the self-supervised task(s) (Doersch and Zisserman, 2017; Ren and Jae Lee, 2018).

To the best of our knowledge, there has been only one recent work pursuing self-supervision in GCNs (Sun et al., 2019), where a node clustering task is adopted through self-training. However, self-training suffers from limitations including performance "saturation" and degradation (to be detailed in Sections 3.2 and 4.1 for theoretical rationales and empirical results). It also restricts the types of self-supervision tasks that can be incorporated.

Adversarial attack and defense on graphs. Similarly to CNNs, the wide applicability and vulnerability of GCNs raise an urgent demand for improving their robustness. Several algorithms have been proposed to attack and defend GCNs on graphs (Dai et al., 2018; Zügner et al., 2018; Wang et al., 2019a; Wu et al., 2019a; Wang et al., 2019b). Dai et al. (2018) developed attack methods that drop edges, based on gradient descent, genetic algorithms and reinforcement learning. Zügner et al. (2018) proposed an FGSM-based approach to attack the edges and features. Lately, more diverse defense approaches have emerged. Dai et al. (2018) defended against adversarial attacks by directly training on perturbed graphs. Wu et al. (2019a) gained robustness by learning graphs from continuous functions. Wang et al. (2019a) used graph refining and adversarial contrastive learning to boost model robustness. Wang et al. (2019b) proposed to involve unlabeled data with pseudo labels, which enhances scalability to large graphs.

3 Method

In this section, we first elaborate three candidate schemes to incorporate self-supervision into GCNs. We then design novel self-supervised tasks, each with its own rationale explained. Lastly, we generalize self-supervision to GCN adversarial defense.

3.1 Graph Convolutional Networks

Given an undirected graph G = {V, E}, where V = {v_1, ..., v_N} represents the node set with N nodes, E = {e_1, ..., e_E} stands for the edge set with E edges, and e = (v_i, v_j) indicates an edge between nodes v_i and v_j. Denote X ∈ R^{N×D} as the feature matrix, where x_n = X[n, :]^T is the D-dimensional attribute vector of node v_n, and A ∈ R^{N×N} as the adjacency matrix, where A[i, j] = A[j, i] = 1 if (v_i, v_j) ∈ E and 0 otherwise. The GCN model for semi-supervised classification with two layers (Kipf and Welling, 2016) is formulated as:

    Z = Â ReLU(Â X W_0) W_1,    (1)

where Â = D̃^{-1/2} Ã D̃^{-1/2}, Ã = A + I_N, and D̃ is the degree matrix of Ã. Here we do not apply the softmax function to the output Z but treat it as a part of the loss described below.
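For concreteness, the forward pass in (1) can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation; the function names are ours:

```python
import numpy as np

def normalize_adj(A):
    """Renormalized adjacency: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(X, A_hat, W0, W1):
    """Two-layer GCN: Z = A_hat ReLU(A_hat X W0) W1 (no softmax here;
    it is folded into the loss, as in the text)."""
    H = np.maximum(A_hat @ X @ W0, 0.0)  # first graph convolution + ReLU
    return A_hat @ H @ W1                # second graph convolution (logits)

# Tiny 4-node path graph with random features and weights
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 5))
W0 = np.random.default_rng(1).normal(size=(5, 8))
W1 = np.random.default_rng(2).normal(size=(8, 3))
Z = gcn_forward(X, normalize_adj(A), W0, W1)  # shape (4, 3): one logit row per node
```

Note that the symmetric renormalization keeps Â symmetric, so each convolution averages over a node's (self-augmented) neighborhood.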

We can treat Â ReLU(Â X W_0) in (1) as the feature extractor f_θ(X, Â) of GCNs in general. The parameter set is θ = {W_0} in (1) but could include additional parameters for corresponding network architectures in GCN variants (Veličković et al., 2017; Qu et al., 2019; Verma et al., 2019). Thus GCN is decomposed into feature extraction and linear transformation as

    Z = f_θ(X, Â) Θ,

where the parameters θ and Θ = W_1 are learned from data. Considering the transductive semi-supervised task, we are provided the labeled node set V_label ⊂ V with |V_label| ≪ N and the label matrix Y ∈ R^{N×K} with label dimension K (for a K-class classification task, each y_n is a one-hot vector). Therefore, the model parameters in GCNs are learned by minimizing the supervised loss calculated between the output and the true labels for labeled nodes, which can be formulated as:

    θ*, Θ* = argmin_{θ,Θ} L_sup(θ, Θ) = argmin_{θ,Θ} (1/|V_label|) Σ_{v_n ∈ V_label} L(z_n, y_n),    (2)

where L(·, ·) is the loss function for each example, z_n = Z[n, :]^T is the output vector, and y_n = Y[n, :]^T is the true label vector for node v_n.
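The supervised loss in (2), with cross-entropy as L(·, ·), restricts the average to labeled nodes only. A minimal sketch (our own helper name, softmax applied inside the loss as described above):

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx):
    """Average cross-entropy over the labeled nodes only. Z holds raw logits
    (N, K); Y holds integer class labels (N,); labeled_idx indexes V_label."""
    logits = Z[labeled_idx]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labeled_idx)), Y[labeled_idx]])
```

With all-zero logits over K = 3 classes, the loss reduces to log 3, a handy sanity check.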

Figure 1: The overall framework for self-supervision on GCN through multi-task learning. The target task and auxiliary self-supervised tasks share the same feature extractor f_θ with their individual linear transformation parameters Θ and Θ_ss.

3.2 Three Schemes: Self-Supervision Meets GCNs

Inspired by relevant discussions in CNNs (Goyal et al., 2019; Kolesnikov et al., 2019; Hendrycks et al., 2019), we next investigate three possible schemes to equip a GCN with a self-supervised task ("ss"), given the input (X, Â), the label matrix Y, and the labeled node set V_label.

Pretraining & finetuning. In the pretraining process, the network is trained with the self-supervised task as follows:

    θ*, Θ*_ss = argmin_{θ,Θ_ss} L_ss(θ, Θ_ss),    (3)

where Θ_ss is the linear transformation parameter of the self-supervised head and L_ss is the loss function of the self-supervised task. Then, in the finetuning process, the feature extractor f_θ(·, ·) is trained in formulation (2) using θ* to initialize the parameters θ.
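The two-stage scheme can be sketched with a toy linear stand-in for the GCN feature extractor and quadratic losses (all names and the MSE objective are our illustrative assumptions, not the paper's setup): stage 1 minimizes L_ss over (θ, Θ_ss); stage 2 minimizes L_sup over (θ, Θ), reusing the pretrained θ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Y_ss = rng.normal(size=(20, 2))   # hypothetical pretext-task targets
Y = rng.normal(size=(20, 2))      # hypothetical target-task labels

theta = rng.normal(size=(4, 3)) * 0.1     # shared "feature extractor" parameters
head_ss = rng.normal(size=(3, 2)) * 0.1   # self-supervised head (Theta_ss)
head_sup = rng.normal(size=(3, 2)) * 0.1  # supervised head (Theta)

def mse(X, W, H, Y):
    return 0.5 * np.sum((X @ W @ H - Y) ** 2)

def grads(X, W, H, Y):
    """Gradients of the MSE above w.r.t. the shared parameters and the head."""
    R = X @ W @ H - Y
    return X.T @ R @ H.T, (X @ W).T @ R

lr = 1e-3
# Stage 1: pretrain theta (and the self-supervised head) on L_ss.
for _ in range(500):
    gW, gH = grads(X, theta, head_ss, Y_ss)
    theta -= lr * gW
    head_ss -= lr * gH

# Stage 2: finetune on L_sup, reusing the pretrained theta as initialization.
loss_start = mse(X, theta, head_sup, Y)
for _ in range(500):
    gW, gH = grads(X, theta, head_sup, Y)
    theta -= lr * gW
    head_sup -= lr * gH
loss_end = mse(X, theta, head_sup, Y)
```

The "switching" issue discussed below corresponds to stage 2 freely overwriting theta: nothing in the finetuning objective preserves what was learned in stage 1.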

Pipeline       GCN           P&F           MTL
Accuracy (%)   79.10 ± 0.21  79.19 ± 0.21  80.00 ± 0.74
Table 1: Comparing performances of GCN through pretraining & finetuning (P&F) and multi-task learning (MTL) with graph partitioning (see Section 3.3) on the PubMed dataset. Reported numbers correspond to classification accuracy in percent.

Pretraining & finetuning is arguably the most straightforward option for self-supervision to benefit GCNs. However, our preliminary experiment found little performance gain from it on the large dataset PubMed (Table 1). We conjecture that this is due to (1) "switching" to a different objective function in finetuning (L_sup) from that in pretraining (L_ss); and (2) training a shallow GCN in the transductive semi-supervised setting, which was shown to outperform deeper GCNs that suffer from over-smoothing or "information loss" (Li et al., 2018; Oono and Suzuki, 2020). We will systematically assess and analyze this scheme over multiple datasets and in combination with other self-supervision tasks in Section 4.1.

Self-training. (Sun et al., 2019) is the only prior work that pursues self-supervision in GCNs, and it does so through self-training. With both labeled and unlabeled data, a typical self-training pipeline starts by pretraining a model over the labeled data, then assigns "pseudo-labels" to highly confident unlabeled samples and includes them into the labeled data for the next round of training. The process can be repeated for several rounds, each formulated similarly to (2) with V_label updated. The authors of (Sun et al., 2019) proposed a multi-stage self-supervised (M3S) training algorithm, where self-supervision was injected to align and refine the pseudo labels for the unlabeled nodes.
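The pseudo-label selection step at the heart of self-training can be sketched as follows (a generic confidence-based selector of our own naming, not M3S's alignment mechanism):

```python
import numpy as np

def select_confident(probs, labeled_mask, k):
    """Pick the k most confident unlabeled nodes (by softmax max-probability)
    to receive pseudo-labels in the next self-training round."""
    conf = probs.max(axis=1)
    conf[labeled_mask] = -np.inf  # never re-select already-labeled nodes
    chosen = np.argsort(-conf)[:k]
    return chosen, probs[chosen].argmax(axis=1)

# Toy predicted class probabilities for 4 nodes; node 0 is already labeled.
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5]])
labeled = np.array([True, False, False, False])
idx, pseudo = select_confident(probs, labeled.copy(), k=2)
# idx: the two most confident unlabeled nodes; pseudo: their argmax labels
```

The selected nodes are then added to V_label with their pseudo-labels and formulation (2) is re-solved.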

Label Rate  0.03%  0.1%  0.3% (conventional dataset split)
GCN         51.1   67.5  79.10 ± 0.21
M3S         59.2   70.6  79.28 ± 0.30
Table 2: Experiments for GCN through M3S. Gray numbers are from (Sun et al., 2019).

Despite improving performance in previous few-shot experiments, M3S shows performance-gain "saturation" in Table 2 as the label rate grows, echoing the literature (Zhu and Goldberg, 2009; Li et al., 2018). We further show and rationalize this limited performance boost in Section 4.1.

Multi-task learning. Considering a target task as in (2) and a self-supervised task as in (3) for a GCN, the output and the training process can be formulated as:

    Z = f_θ(X, Â) Θ,    Z_ss = f_θ(X, Â) Θ_ss,
    θ*, Θ*, Θ*_ss = argmin_{θ,Θ,Θ_ss} α1 L_sup(θ, Θ) + α2 L_ss(θ, Θ_ss),    (4)

where α1 and α2 are the weights for the overall supervised loss L_sup as defined in (2) and the self-supervised loss L_ss as defined in (3), respectively. To optimize the weighted sum of their losses, the target supervised and self-supervised tasks share the same feature extractor f_θ but have their individual linear transformation parameters Θ and Θ_ss, as in Figure 1.
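The joint objective of formulation (4) can be sketched as a weighted sum of two cross-entropies: the supervised term over labeled nodes only, and the self-supervised term over all nodes (function name and default weights are our illustrative choices):

```python
import numpy as np

def joint_loss(Z_sup, Y, labeled_idx, Z_ss, Y_ss, alpha1=1.0, alpha2=0.5):
    """alpha1 * L_sup + alpha2 * L_ss. Both heads consume the same node
    embeddings; alpha2 controls how strongly the self-supervised task
    regularizes training (the alpha values here are illustrative)."""
    def xent(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(len(labels)), labels])
    L_sup = xent(Z_sup[labeled_idx], Y[labeled_idx])  # labeled nodes only
    L_ss = xent(Z_ss, Y_ss)  # self-supervised labels cover all nodes
    return alpha1 * L_sup + alpha2 * L_ss
```

Because both terms backpropagate through the shared extractor f_θ, the self-supervised task shapes the embedding even for nodes the target task never sees labels for.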

In problem (4), we regard the self-supervised task as a regularization term throughout network training. Regularization terms are traditionally and widely used in graph signal processing; a famous one is the graph Laplacian regularizer (GLR) (Shuman et al., 2013; Bertrand and Moonen, 2013; Milanfar, 2012; Sandryhaila and Moura, 2014; Wu et al., 2016), which penalizes incoherent (i.e. non-smooth) signals across adjacent nodes (Chen and Liu, 2017). Although the effectiveness of GLR has been shown in graph signal processing, the regularizer is manually set simply following the smoothness prior without any involvement of data, whereas the self-supervised task acts as a regularizer learned from unlabeled data under the minor guidance of a human prior. Therefore, a properly designed task introduces data-driven prior knowledge that improves model generalizability, as shown in Table 1.

Overall, multi-task learning is the most general framework among the three. Acting as a data-driven regularizer during training, it makes no assumption on the self-supervised task type. It is also experimentally verified to be the most effective of the three (Section 4).

3.3 GCN-Specific Self-Supervised Tasks

While Section 3.2 discusses the "mechanisms" by which GCNs could be trained with self-supervision, here we expand a "toolkit" of self-supervised tasks for GCNs. We show that, by utilizing the rich node and edge information in a graph, a variety of GCN-specific self-supervised tasks (as summarized in Table 3) can be defined and will be further shown to benefit various types of supervised/downstream tasks. Each assigns different pseudo-labels to unlabeled nodes, and the resulting problems are solved via formulation (4).

Task          Relied Feature   Primary Assumption            Type
Clustering    Nodes            Feature similarity            Classification
Partitioning  Edges            Connection density            Classification
Completion    Nodes & edges    Context-based representation  Regression
Table 3: Overview of three self-supervised tasks.

Node clustering. Following M3S (Sun et al., 2019), one intuitive way to construct a self-supervised task is via a node clustering algorithm. Given the node set V with the feature matrix X as input, and a preset number of clusters K (treated as a hyperparameter in our experiments), the clustering algorithm outputs a set of node sets {V_1, ..., V_K} such that:

    V_1 ∪ ... ∪ V_K = V,    V_i ∩ V_j = ∅ (i ≠ j),    V_k ≠ ∅ (k = 1, ..., K).

With the clusters of node sets, we assign cluster indices as self-supervised labels to all the nodes: y_ss,n = k if v_n ∈ V_k.
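The pseudo-label generation can be sketched with a toy k-means on node features (a stand-in for whatever clustering algorithm is actually used; the farthest-first initialization and function name are our choices):

```python
import numpy as np

def kmeans_pseudo_labels(X, K, iters=20):
    """Toy k-means with farthest-first initialization; each node's cluster
    index becomes its self-supervised label y_ss."""
    centers = [X[0]]
    for _ in range(1, K):
        # next center: the point farthest from all current centers
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)       # assign each node to nearest center
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Two well-separated feature "blobs" should land in different clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
labels = kmeans_pseudo_labels(X, K=2)
```

The labels then feed the self-supervised classification head in formulation (4).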

Graph partitioning. Clustering-related algorithms are node-feature-based, with the rationale of grouping nodes with similar attributes. Another rationale for grouping nodes can be based on topology in graph data. In particular, two nodes connected by a "strong" edge (with a large weight) are highly likely to be of the same label class (Zhu and Goldberg, 2009). Therefore, we propose a topology-based self-supervision using graph partitioning.

Graph partitioning partitions the nodes of a graph into roughly equal subsets such that the number of edges connecting nodes across subsets is minimized (Karypis and Kumar, 1995). Given the node set V, the edge set E and the adjacency matrix A as input, with a preset number of partitions K (a hyperparameter in our experiments), a graph partitioning algorithm outputs a set of node sets {V_1, ..., V_K} satisfying the same union, disjointness, and non-emptiness constraints as in node clustering. In addition, balance constraints are enforced for graph partitioning (K · max_k |V_k| / |V| ≤ 1 + ε), and the objective of graph partitioning is to minimize the edgecut, i.e. the number of edges whose two end nodes fall into different subsets.

With the node set partitioned along with the rest of the graph, we assign partition indices as self-supervised labels: y_ss,n = k if v_n ∈ V_k.

Different from node clustering based on node features, graph partitioning provides a prior regularization based on graph topology, similar to the graph Laplacian regularizer (GLR) (Shuman et al., 2013; Bertrand and Moonen, 2013; Milanfar, 2012; Sandryhaila and Moura, 2014; Wu et al., 2016), which also adopts the idea of "connection-prompting similarity". However, GLR, which is already injected into the GCN architecture, locally smooths each node with its neighbors. In contrast, graph partitioning considers global smoothness by utilizing all connections to group nodes with heavier connection densities.
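A simple illustration of topology-based partitioning is a spectral bipartition by the sign of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). This is our own stand-in for a proper multi-way partitioner such as METIS, not the paper's implementation:

```python
import numpy as np

def spectral_bipartition(A):
    """Split a graph into two parts by thresholding the Fiedler vector at its
    median (the median threshold enforces a rough balance constraint)."""
    d = A.sum(axis=1)
    L = np.diag(d) - A                     # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    fiedler = vecs[:, 1]                   # second-smallest eigenvector
    return (fiedler > np.median(fiedler)).astype(int)  # partition index per node

# Two triangles joined by a single bridge edge (2, 3): the minimum edgecut
# separates the triangles, cutting exactly one edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
part = spectral_bipartition(A)
```

The resulting partition indices serve as self-supervised classification labels, exactly as the cluster indices do for node clustering.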

Graph completion. Motivated by image inpainting, a.k.a. image completion (Yu et al., 2018), in computer vision (which aims to fill missing pixels of an image), we propose graph completion, a novel regression task, as a self-supervised task. As an analogy to image completion and as illustrated in Figure 2, our graph completion first masks target nodes by removing their features. It then aims at recovering/predicting the masked node features by feeding GCNs the unmasked node features (currently restricted to the second-order neighbors of each target node, for 2-layer GCNs).

Figure 2: Graph completion for a target node. With the target node's feature masked and its neighbors' features and connections provided, GCNs recover the masked feature based on the neighborhood information.

We design such a self-supervised task for the following reasons: 1) the completion labels are free to obtain, being the node features themselves; and 2) we consider that graph completion can aid the network toward better feature representation, teaching it to extract features from the context.
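The masking-and-regression setup can be sketched as follows (our own helper names; mean-squared error is the natural loss for continuous features, as stated for the regression task):

```python
import numpy as np

def completion_task(X, mask_idx):
    """Build the graph-completion input/target pair: masked nodes have their
    features zeroed in the GCN input, and the original features become the
    regression targets for those nodes."""
    X_in = X.copy()
    X_in[mask_idx] = 0.0
    return X_in, X[mask_idx]

def completion_loss(X_pred, X_target, mask_idx):
    """Mean-squared error on the masked nodes only."""
    return np.mean((X_pred[mask_idx] - X_target) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
mask = np.array([1, 3])
X_in, targets = completion_task(X, mask)  # X_in feeds the GCN; targets score it
```

The GCN receives X_in (plus the intact adjacency) and is trained so that its feature reconstruction on the masked rows matches the targets.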

3.4 Self-Supervision in Graph Adversarial Defense

With the three self-supervised tasks introduced for GCNs to gain generalizability toward better-performing supervised learning (for instance, node classification), we proceed to examine their possible roles in gaining robustness against various graph adversarial attacks.

Adversarial attacks. We focus on single-node direct evasion attacks: a node-specific attack on the attributes/links of the target node v_n under certain constraints, following (Zügner et al., 2018), whereas the trained model (i.e. the model parameters θ*, Θ*) remains unchanged during/after the attack. The attacker g generates perturbed feature and adjacency matrices, X' and A', as:

    X', A' = g(x_n, a_n, y_n; θ*, Θ*),    (5)

with (the attributes a_n, links, and label of) the target node v_n and the model parameters as inputs. The attack can be on links, (node) features, or links & features.

Adversarial defense. An effective approach for adversarial defense, especially in the image domain, is adversarial training, which augments training sets with adversarial examples (Goodfellow et al., 2014). However, it is difficult to generate adversarial examples in the graph domain because of low labeling rates in the transductive semi-supervised setting. Wang et al. (2019b) thus proposed to utilize unlabeled nodes in generating adversarial examples. Specifically, they first trained a GCN as formulated in (2) to assign pseudo labels Ŷ to unlabeled nodes. Then they randomly chose two disjoint subsets V_clean and V_attack from the unlabeled node set and attacked each target node v_n ∈ V_attack to generate perturbed feature and adjacency matrices X' and A'.

Adversarial training for graph data can then be formulated as both supervised learning for labeled nodes and recovering pseudo labels for unlabeled nodes (attacked and clean):

    θ*, Θ* = argmin_{θ,Θ} α1 L_sup(θ, Θ) + α3 L_adv(θ, Θ),    (6)

where α3 is the weight for the adversarial loss L_adv, computed between the model outputs on the perturbed graph (X', A') and the pseudo labels Ŷ over V_attack ∪ V_clean.

Adversarial defense with self-supervision. With self-supervision working in GCNs as formulated in (4) and adversarial training as in (6), we formulate adversarial training with self-supervision as:

    θ*, Θ*, Θ*_ss = argmin_{θ,Θ,Θ_ss} α1 L_sup(θ, Θ) + α2 L_ss(θ, Θ_ss) + α3 L_adv(θ, Θ),    (7)

where the self-supervised loss L_ss is introduced into training with the perturbed graph data (X', A') as input (the self-supervised label matrix Y_ss is also generated from the perturbed inputs). It has been observed in CNNs that self-supervision improves robustness and uncertainty estimation without requiring larger models or additional data (Hendrycks et al., 2019). We thus experimentally explore whether this also extends to GCNs.
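The three-term objective in (7) can be sketched as a single weighted sum (function name and default weights are our illustrative assumptions; the real losses would be evaluated on GCN outputs over the clean and perturbed graphs):

```python
import numpy as np

def xent(logits, labels):
    """Plain cross-entropy over raw logits."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])

def adv_ss_objective(Z_clean, Y, labeled_idx,
                     Z_adv, Y_pseudo, attacked_idx,
                     Z_ss, Y_ss,
                     alpha1=1.0, alpha2=0.5, alpha3=1.0):
    """Formulation (7): supervised loss on labeled nodes, adversarial loss on
    pseudo-labeled attacked/clean nodes, and self-supervised loss on the
    perturbed graph's embeddings (alpha weights are illustrative)."""
    L_sup = xent(Z_clean[labeled_idx], Y[labeled_idx])
    L_adv = xent(Z_adv[attacked_idx], Y_pseudo[attacked_idx])
    L_ss = xent(Z_ss, Y_ss)
    return alpha1 * L_sup + alpha3 * L_adv + alpha2 * L_ss
```

All three terms share the feature extractor's parameters, so the self-supervised prior shapes the same embedding the defense relies on.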

4 Experiments

In this section, we extensively assess, analyze, and rationalize the impact of self-supervision on transductive semi-supervised node classification following (Kipf and Welling, 2016) on the aspects of: 1) the standard performances of GCN (Kipf and Welling, 2016) with different self-supervision schemes; 2) the standard performances of multi-task self-supervision on three popular GNN architectures — GCN, graph attention network (GAT) (Veličković et al., 2017), and graph isomorphism network (GIN) (Xu et al., 2018); as well as those on two SOTA models for semi-supervised node classification — graph Markov neural network (GMNN) (Qu et al., 2019) that introduces statistical relational learning (Koller and Pfeffer, 1998; Friedman et al., 1999) into its architecture to facilitate training and GraphMix (Verma et al., 2019) that uses the Mixup trick; and 3) the performance of GCN with multi-task self-supervision in adversarial defense. Implementation details can be found in Appendix A.

Dataset   Nodes   Labeled Nodes  Edges    Features  Classes
Cora      2,780   140            13,264   1,433     7
Citeseer  3,327   120            4,732    3,703     6
PubMed    19,717  60             108,365  500       3
Table 4: Dataset statistics: numbers of nodes, labeled nodes, and edges, and the feature dimension per node.

4.1 Self-Supervision Helps Generalizability

Self-supervision incorporated into GCNs through various schemes. We first examine three schemes (Section 3.2) to incorporate self-supervision into GCN training: pretraining & finetuning, self-training (i.e. M3S (Sun et al., 2019)) and multi-task learning. The hyper-parameters of M3S are set at the default values reported in (Sun et al., 2019). The differential effects of the three schemes combined with various self-supervised tasks are summarized for three datasets in Table 5, using the target-task performances (accuracy in node classification). Each combination of self-supervised scheme and task is run 50 times for each dataset with different random seeds, so that the mean and standard deviation of its performance can be reported.

                 Cora          Citeseer      PubMed
GCN              81.00 ± 0.67  70.85 ± 0.70  79.10 ± 0.21
GCN (published)  81.5          70.3          79.0
P&F-Clu          81.83 ± 0.53  71.06 ± 0.59  79.20 ± 0.22
P&F-Par          81.42 ± 0.51  70.68 ± 0.81  79.19 ± 0.21
P&F-Comp         81.25 ± 0.65  71.06 ± 0.55  79.19 ± 0.39
M3S              81.60 ± 0.51  71.94 ± 0.83  79.28 ± 0.30
MTL-Clu          81.57 ± 0.59  70.73 ± 0.84  78.79 ± 0.36
MTL-Par          81.83 ± 0.65  71.34 ± 0.69  80.00 ± 0.74
MTL-Comp         81.03 ± 0.68  71.66 ± 0.48  79.14 ± 0.28
Table 5: Node classification performances (accuracy; unit: %) when incorporating three self-supervision tasks (Node Clustering, Graph Partitioning, and Graph Completion) into GCNs through various schemes: pretraining & finetuning (abbr. P&F), self-training (M3S (Sun et al., 2019)), and multi-task learning (abbr. MTL). Red numbers indicate the best two performances with a mean improvement of at least 0.8 (where 0.8 is comparable to or less than observed standard deviations). In the case of GCN without self-supervision, gray numbers indicate the published results.

Results in Table 5 first show that, among the three schemes to incorporate self-supervision into GCNs, pretraining & finetuning provides some performance improvement on the small dataset Cora but not on the larger datasets Citeseer and PubMed. This conclusion remains valid regardless of the choice of the specific self-supervised task. The moderate performance boost echoes our previous conjecture: although information about graph structure and features is first learned through self-supervision (L_ss as in (3)) in the pretraining stage, such information may be largely lost during finetuning, which targets the supervised loss alone (L_sup as in (2)). The reason such information loss is particularly observed in GCNs could be that the shallow GCNs used in the transductive semi-supervised setting can be more easily "overwritten" when switching from one objective function to another in finetuning.

Through the remaining two schemes, GCNs with self-supervision incorporated could see more significant improvements in the target task (node classification) compared to GCN without self-supervision. In contrast to pretraining and finetuning that switches the objective function after self-supervision in (3) and solves a new optimization problem in (2), both self-training and multi-task learning incorporate self-supervision into GCNs through one optimization problem and both essentially introduce an additional self-supervision loss to the original formulation in (2).

Their difference lies in what pseudo-labels are used and how they are generated for unlabeled nodes. In the case of self-training, the pseudo-labels are the same as the target-task labels and such “virtual” labels are assigned to unlabeled nodes based on their proximity to labeled nodes in graph embedding. In the case of multi-task learning, the pseudo-labels are no longer restricted to the target-task labels and can be assigned to all unlabeled nodes by exploiting graph structure and node features without labeled data. And the target supervision and the self-supervision in multi-task learning are still coupled through common graph embedding. So compared to self-training, multi-task learning can be more general (in pseudo-labels) and can exploit more in graph data (through regularization).

Multi-task self-supervision on SOTAs. Does multi-task self-supervision help SOTA GCNs? Now that we have established multi-task learning as an effective mechanism to incorporate self-supervision into GCNs, we set out to explore the added benefits of various self-supervision tasks to SOTAs through multi-task learning. Table 6 shows that different self-supervised tasks could benefit different network architectures on different datasets to different extents.

Datasets       Cora          Citeseer      PubMed
GCN            81.00 ± 0.67  70.85 ± 0.70  79.10 ± 0.21
GCN+Clu        81.57 ± 0.59  70.73 ± 0.84  78.79 ± 0.36
GCN+Par        81.83 ± 0.65  71.34 ± 0.69  80.00 ± 0.74
GCN+Comp       81.03 ± 0.68  71.66 ± 0.48  79.14 ± 0.28
GAT            77.66 ± 1.08  68.90 ± 1.07  78.05 ± 0.46
GAT+Clu        79.40 ± 0.73  69.88 ± 1.13  77.80 ± 0.28
GAT+Par        80.11 ± 0.84  69.76 ± 0.81  80.11 ± 0.34
GAT+Comp       80.47 ± 1.22  70.62 ± 1.26  77.10 ± 0.67
GIN            77.27 ± 0.52  68.83 ± 0.40  77.38 ± 0.59
GIN+Clu        78.43 ± 0.80  68.86 ± 0.91  76.71 ± 0.36
GIN+Par        81.83 ± 0.58  71.50 ± 0.44  80.28 ± 1.34
GIN+Comp       76.62 ± 1.17  68.71 ± 1.01  78.70 ± 0.69
GMNN           83.28 ± 0.81  72.83 ± 0.72  81.34 ± 0.59
GMNN+Clu       83.49 ± 0.65  73.13 ± 0.72  79.45 ± 0.76
GMNN+Par       83.51 ± 0.50  73.62 ± 0.65  80.92 ± 0.77
GMNN+Comp      83.31 ± 0.81  72.93 ± 0.79  81.33 ± 0.59
GraphMix       83.91 ± 0.63  74.33 ± 0.65  80.68 ± 0.57
GraphMix+Clu   83.87 ± 0.56  75.16 ± 0.52  79.99 ± 0.82
GraphMix+Par   84.04 ± 0.57  74.93 ± 0.43  81.36 ± 0.33
GraphMix+Comp  83.76 ± 0.64  74.43 ± 0.72  80.82 ± 0.54
Table 6: Experiments on SOTAs (GCN, GAT, GIN, GMNN, and GraphMix) with multi-task self-supervision. Red numbers indicate the best two performances for each SOTA.

When does multi-task self-supervision help SOTAs and why? We note that graph partitioning is generally beneficial to all three SOTAs (network architectures) on all three datasets, whereas node clustering does not benefit SOTAs on PubMed. As discussed in Section 3.2 and above, multi-task learning introduces self-supervision tasks into the optimization problem in (4) as data-driven regularization, and these tasks represent various priors (see Section 3.3).

(1) Feature-based node clustering assumes that feature similarity implies target-label similarity and can group distant nodes with similar features together. When the dataset is large and the feature dimension is relatively low (such as PubMed), feature-based clustering could be challenged in providing informative pseudo-labels.

(2) Topology-based graph partitioning assumes that connections in topology imply similarity in labels, which is safe for the three datasets, all of which are citation networks. In addition, graph partitioning as a classification task does not impose this assumption too strongly. Therefore, the prior represented by graph partitioning is general and effective in benefiting GCNs (at least for the types of target tasks and datasets considered).

(3) Topology- and feature-based graph completion assumes feature similarity or smoothness in small neighborhoods of graphs. Such context-based feature representation can greatly improve target performance, especially when the neighborhoods are small (such as in Citeseer, which has the smallest average degree among the three datasets). However, the regression task can be challenged by denser graphs with larger neighborhoods and more difficult completion targets (such as the larger and denser PubMed, with continuous features to complete). That being said, the potentially informative prior from graph completion can greatly benefit other tasks, as validated later (Section 4.2).

Does the GNN architecture affect multi-task self-supervision? For every GNN architecture/model, all three self-supervised tasks improve its performance on some datasets (except for GMNN on PubMed). The improvements are more significant for GCN, GAT, and GIN. We conjecture that data regularization through various priors could benefit these three architectures (especially GCN), which have weak priors to begin with. In contrast, GMNN sees little improvement with graph completion. GMNN introduces statistical relational learning (SRL) into the architecture to model the dependency between vertices and their neighbors. Considering that graph completion aids context-based representation and plays a somewhat similar role to SRL, the self-supervised and architectural priors are similar, and their combination may not help. Similarly, GraphMix introduces the data augmentation method Mixup into the architecture to refine feature embedding, which again mitigates the benefit of graph completion given their overlapping aims.

We also report in Appendix B the results in inductive fully-supervised node classification. Self-supervision leads to modest performance improvements in this case, appearing to be more beneficial in semi-supervised or few-shot learning.

4.2 Self-Supervision Boosts Adversarial Robustness

What additional benefits could multi-task self-supervision bring to GCNs, besides improving the generalizability of graph embedding (Section 4.1)? We additionally perform adversarial experiments on GCN with multi-task self-supervision against Nettack (Zügner et al., 2018), to examine its potential benefit on robustness.

We first generate attacks with the same perturbation intensity as in adversarial training (see details in Appendix A) to examine robust generalization. For each self-supervised task, the hyper-parameters are set at the same values as in Table 6. Each experiment is repeated 5 times, as the attack process on test nodes is very time-consuming.

Which self-supervision task helps defend against which types of graph attacks, and why? Tables 7 and 8 show that introducing self-supervision into adversarial training improves GCN’s adversarial defense. (1) Node clustering and graph partitioning are more effective against feature attacks and link attacks, respectively. During adversarial training, node clustering provides a perturbed-feature prior while graph partitioning provides a perturbed-link prior, contributing to GCN’s resistance against feature attacks and link attacks, respectively. (2) Strikingly, graph completion boosts adversarial accuracy by around 4.5% against link attacks and by over 8.0% against link & feature attacks on Cora. It is also among the best self-supervision tasks for link attacks and link & feature attacks on Citeseer, albeit with a smaller margin (around 1%). In agreement with our earlier conjecture in Section 4.1, topology- and feature-based graph completion constructs a (joint) perturbation prior on links and features, which benefits GCN’s resistance against link or link & feature attacks.

Furthermore, we generate attacks with varying perturbation intensities to check the generalizability of our conclusions. The results in Appendix C show that, with self-supervision introduced into adversarial training, GCN remains more robust facing various attacks at various intensities.
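The combined objective used in these experiments can be sketched as the supervised loss plus an adversarial term and a self-supervised regularizer; a minimal sketch, with hypothetical weight names (the paper's exact weighting is in Appendix A):

```python
def adv_ss_loss(target_loss, adv_target_loss, ss_loss, beta=1.0, lam=0.1):
    """AdvT + self-supervision objective sketch (beta, lam hypothetical):
    supervised loss on clean inputs, supervised loss on adversarially
    perturbed inputs, and a self-supervised term acting as a prior."""
    return target_loss + beta * adv_target_loss + lam * ss_loss

total = adv_ss_loss(target_loss=0.5, adv_target_loss=1.2, ss_loss=0.4)
```

Which self-supervised term `ss_loss` is plugged in determines the prior: a clustering loss yields a feature prior, a partitioning loss a link prior, and a completion loss a joint prior on both.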

4.3 Result Summary

We briefly summarize the results as follows.

First, among the three schemes for incorporating self-supervision into GCNs, multi-task learning acts as a regularizer and, with properly chosen self-supervised tasks, consistently benefits GCNs in generalizable standard performance. Pretraining & finetuning switches the objective from the self-supervision loss to the target supervision loss, which easily “overwrites” shallow GCNs and yields limited gains. Self-training is restricted by which pseudo-labels are assigned and which data are used to assign them; its gains are more visible in few-shot learning and diminish as the labeling rate grows.
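The multi-task scheme summarized above amounts to adding weighted self-supervised terms to the target loss; a minimal sketch (weight names hypothetical, cf. the hyper-parameters in Table 6):

```python
def multitask_loss(target_loss, ss_losses, ss_weights):
    """Multi-task objective: supervised target loss plus a weighted sum
    of self-supervised losses, which act as data-driven regularizers."""
    assert len(ss_losses) == len(ss_weights)
    return target_loss + sum(w * l for w, l in zip(ss_weights, ss_losses))

# e.g. node clustering and graph partitioning as two auxiliary tasks
total = multitask_loss(0.70, ss_losses=[0.20, 0.50], ss_weights=[0.1, 0.3])
```

Because the target loss is never swapped out, the self-supervised priors regularize training without "overwriting" the shallow GCN, unlike pretraining & finetuning.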

Second, through multi-task learning, self-supervised tasks provide informative priors that can benefit GCNs in generalizable target performance. Node clustering and graph partitioning provide priors on node features and graph structure, respectively, whereas graph completion, with a (joint) prior on both, helps GCNs with context-based feature representation. Whether a self-supervision task helps a SOTA GCN in standard target performance depends on whether the dataset allows for quality pseudo-labels for the task, and whether the self-supervised prior complements existing architecture-posed priors.

Attacks      None          Links         Feats         Links & Feats
GCN          80.61 ± 0.21  28.72 ± 0.63  44.06 ± 1.23   8.18 ± 0.27
AdvT         80.24 ± 0.74  54.58 ± 2.57  75.25 ± 1.26  39.08 ± 3.05
AdvT+Clu     80.26 ± 0.99  55.54 ± 3.19  76.24 ± 0.99  41.84 ± 3.48
AdvT+Par     80.42 ± 0.76  56.36 ± 2.57  75.88 ± 0.72  41.57 ± 3.47
AdvT+Comp    79.64 ± 0.99  59.05 ± 3.29  76.04 ± 0.68  47.14 ± 3.01
Table 7: Adversarial defense performance on Cora using adversarial training (abbr. AdvT) without or with graph self-supervision. Attacks include those on links, features (abbr. Feats), and both. Red numbers indicate the two best performances in each attack scenario (node classification accuracy; unit: %).
Attacks      None          Links         Feats         Links & Feats
GCN          71.05 ± 0.56  13.68 ± 1.09  22.08 ± 0.73   3.08 ± 0.17
AdvT         69.98 ± 1.03  39.32 ± 2.39  63.12 ± 0.62  26.20 ± 2.09
AdvT+Clu     70.13 ± 0.81  40.32 ± 1.73  63.67 ± 0.45  27.02 ± 1.29
AdvT+Par     69.96 ± 0.77  41.05 ± 1.91  64.06 ± 0.24  28.70 ± 1.60
AdvT+Comp    69.98 ± 0.82  40.42 ± 2.09  63.50 ± 0.31  27.16 ± 1.69
Table 8: Adversarial defense performance on Citeseer using adversarial training without or with graph self-supervision.

Last, multi-task self-supervision in adversarial training improves GCNs’ robustness against various graph attacks. Node clustering and graph partitioning provide priors on features and links, and thus defend better against feature attacks and link attacks, respectively. Graph completion, with (joint) perturbation priors on both features and links, boosts robustness consistently, and sometimes drastically, against the most damaging link & feature attacks.

5 Conclusion

In this paper, we present a systematic study of the standard and adversarial performance of incorporating self-supervision into graph convolutional networks (GCNs). We first elaborate three mechanisms by which self-supervision is incorporated into GCNs and rationalize their impacts on standard performance from the perspective of optimization. We then focus on multi-task learning and design three novel self-supervised learning tasks, rationalizing their benefits to generalizable standard performance on various datasets from the perspective of data-driven regularization. Lastly, we integrate multi-task self-supervision into graph adversarial training and show that it improves the robustness of GCNs against adversarial attacks. Our results show that, with properly designed task forms and incorporation mechanisms, self-supervision benefits GCNs in gaining both generalizability and robustness. They also offer rational perspectives toward designing such task forms and incorporation mechanisms given data characteristics, target tasks, and network architectures.


Acknowledgements

We thank anonymous reviewers for useful comments that helped improve the paper during revision. This study was in part supported by the National Institute of General Medical Sciences of the National Institutes of Health [R35GM124952 to Y.S.], and a US Army Research Office Young Investigator Award [W911NF2010240 to Z.W.].

References

  • A. Azran (2007) The rendezvous algorithm: multiclass semi-supervised learning with markov random walks. In Proceedings of the 24th international conference on Machine learning, pp. 49–56. Cited by: §2.
  • A. Bertrand and M. Moonen (2013) Seeing the bigger picture: how nodes can learn their place within a complex ad hoc network topology. IEEE Signal Processing Magazine 30 (3), pp. 71–82. Cited by: §3.2, §3.3.
  • A. Blum and S. Chawla (2001) Learning from labeled and unlabeled data using graph mincuts. Carnegie Mellon University. Cited by: §2.
  • A. Blum, J. Lafferty, M. R. Rwebangira, and R. Reddy (2004) Semi-supervised learning using randomized mincuts. In Proceedings of the twenty-first international conference on Machine learning, pp. 13. Cited by: §2.
  • P. Chen and S. Liu (2017) Bias-variance tradeoff of graph laplacian regularizer. IEEE Signal Processing Letters 24 (8), pp. 1118–1122. Cited by: §3.2.
  • T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang (2020) Adversarial robustness: from self-supervised pre-training to fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 699–708. Cited by: §1.
  • H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song (2018) Adversarial attack on graph structured data. arXiv preprint arXiv:1806.02371. Cited by: §2, §2.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430. Cited by: §1.
  • C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060. Cited by: §2.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pp. 766–774. Cited by: §1.
  • N. Friedman, L. Getoor, D. Koller, and A. Pfeffer (1999) Learning probabilistic relational models. In IJCAI, Vol. 99, pp. 1300–1309. Cited by: §4.
  • G. Getz, N. Shental, and E. Domany (2006) Semi-supervised learning–a statistical physics approach. arXiv preprint cs/0604011. Cited by: §2.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §1, §2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §3.4.
  • P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235. Cited by: §1, §2, §3.2.
  • D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pp. 15637–15648. Cited by: §1, §2, §3.2, §3.4.
  • M. Karimi, D. Wu, Z. Wang, and Y. Shen (2019) Explainable deep relational networks for predicting compound-protein affinities and contacts. arXiv preprint arXiv:1912.12553. Cited by: §1.
  • G. Karypis and V. Kumar (1995) Multilevel graph partitioning schemes. In ICPP (3), pp. 113–122. Cited by: §3.3.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2, §3.1, §4.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005. Cited by: §1, §2, §3.2.
  • D. Koller and A. Pfeffer (1998) Probabilistic frame-based systems. In AAAI/IAAI, pp. 580–587. Cited by: §4.
  • Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §1.
  • Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3.2.
  • P. Milanfar (2012) A tour of modern image filtering: new insights and methods, both practical and theoretical. IEEE Signal Processing Magazine 30 (1), pp. 106–128. Cited by: §3.2, §3.3.
  • S. Mohseni, M. Pitale, J. Yadawa, and Z. Wang (2020) Self-supervised learning for generalizable out-of-distribution detection. AAAI. Cited by: §1.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §1, §2.
  • K. Oono and T. Suzuki Graph neural networks exponentially lose expressive power for node classification. Cited by: §3.2.
  • M. Qu, Y. Bengio, and J. Tang (2019) GMNN: graph markov neural networks. arXiv preprint arXiv:1905.06214. Cited by: §1, §2, §3.1, §4.
  • Z. Ren and Y. Jae Lee (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 762–771. Cited by: §2.
  • A. Sandryhaila and J. M. Moura (2014) Big data analysis with signal processing on graphs: representation and processing of massive data sets with irregular structure. IEEE Signal Processing Magazine 31 (5), pp. 80–90. Cited by: §3.2, §3.3.
  • D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30 (3), pp. 83–98. Cited by: §3.2, §3.3.
  • J. Su, S. Maji, and B. Hariharan (2019) When does self-supervision improve few-shot learning?. arXiv preprint arXiv:1910.03560. Cited by: §1.
  • K. Sun, Z. Zhu, and Z. Lin (2019) Multi-stage self-supervised learning for graph convolutional networks. arXiv preprint arXiv:1902.11038. Cited by: item A1:, item A2:, §2, §3.2, §3.3, Table 2, §4.1, Table 5.
  • M. Szummer and T. Jaakkola (2002) Partially labeled classification with markov random walks. In Advances in neural information processing systems, pp. 945–952. Cited by: §2.
  • T. H. Trinh, M. Luong, and Q. V. Le (2019) Selfie: self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940. Cited by: §2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §2, §3.1, §4.
  • V. Verma, M. Qu, A. Lamb, Y. Bengio, J. Kannala, and J. Tang (2019) GraphMix: regularized training of graph neural networks for semi-supervised learning. arXiv preprint arXiv:1909.11715. Cited by: §1, §2, §3.1, §4.
  • S. Wang, Z. Chen, J. Ni, X. Yu, Z. Li, H. Chen, and P. S. Yu (2019a) Adversarial defense framework for graph neural network. arXiv preprint arXiv:1905.03679. Cited by: §2, §2.
  • X. Wang, X. Liu, and C. Hsieh (2019b) GraphDefense: towards robust graph convolutional networks. External Links: 1911.04429 Cited by: §2, §2, §3.4.
  • H. Wu, C. Wang, Y. Tyshetskiy, A. Docherty, K. Lu, and L. Zhu (2019a) Adversarial examples on graph data: deep insights into attack and defense. External Links: 1903.01610 Cited by: §2, §2.
  • L. Wu, J. Laeuchli, V. Kalantzis, A. Stathopoulos, and E. Gallopoulos (2016) Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse. Journal of Computational Physics 326, pp. 828–844. Cited by: §3.2, §3.3.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019b) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §1, §4.
  • Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: §1.
  • Y. You, T. Chen, Z. Wang, and Y. Shen (2020) L2-gcn: layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2127–2135. Cited by: §1.
  • J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514. Cited by: §3.3.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: §1.
  • X. Zhu and Z. Ghahramani (2002) Towards semi-supervised classification with markov random fields. Citeseer. Cited by: §2.
  • X. Zhu and A. B. Goldberg (2009) Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3 (1), pp. 1–130. Cited by: §2, §3.2, §3.3.
  • D. Zügner, A. Akbarnejad, and S. Günnemann (2018) Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2847–2856. Cited by: §2, §2, §3.4, §4.2.