SelfTaskGNN
Implementation of paper "Self-supervised Learning on Graphs: Deep Insights and New Directions"
The success of deep learning notoriously requires large amounts of costly annotated data. This has led to the development of self-supervised learning (SSL), which aims to alleviate this limitation by creating domain-specific pretext tasks on unlabeled data. Simultaneously, there is increasing interest in generalizing deep learning to the graph domain in the form of graph neural networks (GNNs). GNNs can naturally utilize unlabeled nodes through simple neighborhood aggregation, but this aggregation alone is unable to make thorough use of them. Thus, we seek to harness SSL for GNNs to fully exploit the unlabeled data. Different from data instances in the image and text domains, nodes in graphs present unique structure information and are inherently linked, so they are not independent and identically distributed (i.i.d.). Such complexity is a double-edged sword for SSL on graphs. On the one hand, it makes it challenging to adopt solutions from the image and text domains to graphs, so dedicated efforts are needed. On the other hand, it provides rich information that enables us to build SSL from a variety of perspectives. Thus, in this paper, we first deepen our understanding of when, why, and which strategies of SSL work with GNNs by empirically studying numerous basic SSL pretext tasks on graphs. Inspired by deep insights from the empirical studies, we propose a new direction, SelfTask, to build advanced pretext tasks that are able to achieve state-of-the-art performance on various real-world datasets. The specific experimental settings to reproduce our results can be found in <https://github.com/ChandlerBang/SelfTaskGNN>.
In recent years, deep learning has achieved superior performance across numerous domains, but it requires costly annotations of huge amounts of data Kolesnikov et al. (2019). Hence, self-supervised learning (SSL) has been introduced in both the image Kolesnikov et al. (2019); Doersch et al. (2015); Caron et al. (2018) and text Le and Mikolov (2014); Devlin et al. (2018) domains to alleviate the need for large labeled datasets by deriving labels for the far more abundant unlabeled data. More specifically, SSL often first designs a domain-specific pretext task to assign labels to data instances and then trains the deep model on the pretext task, which learns better representations because unlabeled samples are included in the training process.
As the generalization of deep learning to the graph domain, graph neural networks (GNNs) have proven to be powerful in graph representation learning. As a result, GNNs have facilitated various computational tasks on graphs such as node classification and graph classification Wu et al. (2019b); Kipf and Welling (2016); Hamilton et al. (2017); Veličković et al. (2018). In this work, we focus on advancing GNNs for node classification, where GNNs leverage both labeled and unlabeled nodes on a graph to jointly learn node representations and a classifier that can predict the labels of unlabeled nodes on the graph. On the one hand, GNNs are inherently semi-supervised, since unlabeled data is already integrated into training. On the other hand, GNNs mainly utilize unlabeled nodes by simply aggregating their features, which cannot thoroughly take advantage of the abundant information Sun et al. (2019). Thus, to fully exploit the unlabeled nodes for GNNs, SSL can be naturally harnessed to provide additional supervision.
Graph-structured data is often more complex than data in other domains (e.g., image and text). In addition to node attributes, graphs present complicated structure information. For example, the topology of an image is a fixed grid and text is a simple sequence, while graphs are not restricted to these rigid structures. Furthermore, unlike images and text, where the entire structure is a single data sample, each node in a graph is an individual instance and has its own associated attributes and topological structures. The complexity of graph-structured data does not stop here. In the text and image domains, data samples are often assumed to be i.i.d. (independent and identically distributed). However, in the graph domain, instances (or nodes) are inherently linked and dependent on each other. Therefore, the complex nature of graph-structured data makes it very challenging to directly adopt self-supervised learning developed in other domains to graphs.
While introducing tremendous challenges, the complexity of graphs is a double-edged sword that also presents unprecedented opportunities. In particular, the complexity provides rich information that enables us to design pretext tasks from various perspectives. Similar to the image and text domains, we can focus on individual nodes, such as node features and node topological properties. Moreover, unlike the image and text domains, nodes are dependent in a graph, and thus we are able to investigate new aspects such as dependence between node pairs or even among a set of nodes. In addition, multiple information resources, including node attributes, structure information, and label information of labeled nodes, are available in a graph, and their interactions and combinations provide unprecedented opportunities for us to design advanced self-supervised pretext tasks.
Only very recently have there been a few attempts to adapt SSL from the image domain to the training of graph neural networks Sun et al. (2019); Peng et al. (2020). Therefore, research on self-supervised learning on graphs is still at an initial stage, and more systematic and dedicated efforts are pressingly needed.
In this paper, we embrace the challenges and opportunities to study self-supervised learning in graph neural networks for node classification with two major goals. First, we want to deepen our understanding of self-supervised learning on graphs. Specifically, there are a variety of potential pretext tasks for graphs; hence it is important to gain insights on when and why SSL works for GNNs and which strategy can better integrate SSL for GNNs. Second, we aim to inspire new directions of SSL on graphs according to our understanding. Particularly, we want to investigate how these insights can motivate more sophisticated approaches to design pretext tasks. To achieve the first goal, we design basic types of pretext tasks directly based on attribute and structural information. We make several crucial findings about SSL on graphs via deep analysis of their impact on GNN performance. These findings allow us to propose a new direction, SelfTask, to design more advanced pretext tasks that are empirically demonstrated to achieve state-of-the-art performance on various graph datasets.
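We use $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{X})$ to denote a graph, where $\mathcal{V} = \{v_1, v_2, \dots, v_N\}$ is the set of $N$ nodes, $\mathcal{E}$ is the set of edges describing the relations between nodes, and $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N]$ is the node feature matrix, where $\mathbf{x}_i$ denotes the node features of $v_i$. The graph structure information can also be represented by an adjacency matrix $\mathbf{A} \in \{0, 1\}^{N \times N}$, where $\mathbf{A}_{ij} = 1$ indicates that there exists a link between nodes $v_i$ and $v_j$, and $\mathbf{A}_{ij} = 0$ otherwise. Hence, a graph can also be denoted as $\mathcal{G} = (\mathbf{A}, \mathbf{X})$. In this paper, we focus on the semi-supervised node classification setting, where only a subset of nodes are associated with corresponding labels $\mathcal{Y}_L$. We denote the labeled data as $\mathcal{D}_L = (\mathcal{V}_L, \mathcal{Y}_L)$ and the unlabeled data as $\mathcal{D}_U = \mathcal{V}_U$. Let $f_\theta$ be a graph neural network that maps the nodes to the set of labels, such that the graph neural network can infer the labels of unlabeled data. Thus, the objective function for the semi-supervised node classification task can be formulated as minimizing the loss $\mathcal{L}_{task}$, or more specifically as
$$\min_{\theta} \; \mathcal{L}_{task}(\theta, \mathbf{A}, \mathbf{X}, \mathcal{D}_L) = \sum_{(v_i, y_i) \in \mathcal{D}_L} \ell\big(f_\theta(\mathcal{G})_{v_i}, y_i\big) \tag{1}$$
where $\theta$ is used to denote the parameters of $f_\theta$, $f_\theta(\mathcal{G})_{v_i}$ is the prediction of node $v_i$, and $\ell(\cdot, \cdot)$ denotes the loss function used to measure the difference between the predicted and true labels (e.g., cross entropy).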
With the aforementioned notations and definitions, there could be multiple settings of SSL for GNNs, but in this work we formally define the problem of self-supervised learning for graph neural networks under the task of node classification as:
Problem 1. Given a dataset in the graph domain represented as a graph $\mathcal{G} = (\mathbf{A}, \mathbf{X})$ with paired labeled data $\mathcal{D}_L = (\mathcal{V}_L, \mathcal{Y}_L)$, we aim to construct a self-supervised pretext task with a corresponding loss $\mathcal{L}_{self}$ that can be integrated with the task-specific loss $\mathcal{L}_{task}$ to learn a graph neural network $f_\theta$ that can better generalize on the unlabeled data.
In this section, we present various types of self-supervised pretext tasks on graphs. More specifically, we investigate defining pretext tasks based upon self-supervised information from: (A) the underlying graph structure information (i.e., $\mathbf{A}$); or (B) node feature/attribute information (i.e., $\mathbf{X}$). These two directions are the most natural sources of information for developing self-supervised information for the unlabeled nodes. As there can be a variety of potential pretext tasks for graphs, we first provide detailed justifications for each of the self-supervised pretext tasks and present the details of representative methods in both (A) and (B). Thereafter, in Section 5 we present more advanced pretext tasks built upon deep insights from an empirical study in Section 4.
The first natural choice for extracting self-supervised information in the graph domain is the inherent structure behind the data. This is because, unlike in the image and text domains, our data instances in graphs are related (i.e., the nodes are linked together). Thus, one main direction is to construct self-supervision information for the unlabeled nodes based on their local structure information, or on how they relate to the rest of the graph. In other words, the structure information for establishing self-supervised pretext tasks can be categorized into either local or global structure information.
From the local perspective, self-supervised information can come either from the node itself or from the structural relationship that the node has with its local surrounding neighborhood. In addition, the pretext task can be defined on a single node, or can be developed in a pairwise/contrastive way that involves combining/comparing the information from more than one node. Next, we present two representative examples of local structure-based SSL pretext tasks considering these different aspects.
NodeProperty. In this task, we aim to predict a property of each node in the graph, such as its degree, local node importance, or local clustering coefficient. The goal of this pretext task is to (further) encourage the GNN to learn local structure information in addition to the specific task that is being optimized. In this work, we use node degree as a representative local node property for self-supervision, while leaving other node properties (or their combination) as future work. More formally, we let $d_i = \sum_{j=1}^{N} \mathbf{A}_{ij}$ denote the degree of $v_i$ and construct the associated loss of the self-supervised pretext task as
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{D}_U|} \sum_{v_i \in \mathcal{D}_U} \big(f_{\theta'}(\mathcal{G})_{v_i} - d_i\big)^2 \tag{2}$$
where $\theta'$ is used to denote the parameters of a graph neural network model $f_{\theta'}$, $\mathcal{D}_U$ represents the set of unlabeled nodes and associated pretext task labels in the graph, and $f_{\theta'}(\mathcal{G})_{v_i}$ is used to denote the predicted local node property for node $v_i$ (which in this case is the predicted node degree). The intuition of constructing self-supervised pretext tasks related to the local node property is to ultimately guide the features (i.e., node representations) from the GNN to preserve this information. This relies on the assumption that such node property information is related to the specific task of interest.
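As a concrete illustration of the NodeProperty targets, the following is a minimal sketch (not the authors' implementation) that derives node degrees from a dense 0/1 adjacency matrix and scores a set of predictions, assuming a mean squared error over the unlabeled nodes. The function names and the toy predictions are hypothetical; in practice the predictions would come from the GNN.

```python
def node_degrees(adj):
    """Degree of every node, computed as the row sums of a 0/1 adjacency matrix."""
    return [sum(row) for row in adj]


def node_property_loss(predictions, degrees, unlabeled):
    """Assumed loss: mean squared error between predicted and true degrees
    over the unlabeled node indices."""
    return sum((predictions[i] - degrees[i]) ** 2 for i in unlabeled) / len(unlabeled)


# Toy path graph 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
degrees = node_degrees(adj)
predictions = [1.0, 1.5, 2.0, 2.0]  # stand-in for GNN outputs (hypothetical values)
loss = node_property_loss(predictions, degrees, unlabeled=[1, 2, 3])
```

The pretext labels are free to compute here because they depend only on the graph structure, not on any annotation.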
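EdgeMask. For the edge mask task, we seek to develop self-supervision based not on an individual node but on pairs of nodes, i.e., on the connections between two nodes in the graph. In particular, we first randomly mask some edges and then ask the model to reconstruct the masked edges. More specifically, we first mask $m_e$ edges, denoted as the set $\mathcal{M}_e \subset \mathcal{E}$, and also sample the set $\overline{\mathcal{M}}_e = \{(v_i, v_j) \mid v_i, v_j \in \mathcal{V} \text{ and } (v_i, v_j) \notin \mathcal{E}\}$ of node pairs of equal size (i.e., $|\mathcal{M}_e| = |\overline{\mathcal{M}}_e| = m_e$). Then, the SSL pretext task here is to predict whether or not there exists a link between a given node pair. More formally, we construct the associated loss as
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{M}_e|} \sum_{(v_i, v_j) \in \mathcal{M}_e} \ell\big(f_w(|f_{\theta'}(\mathcal{G})_{v_i} - f_{\theta'}(\mathcal{G})_{v_j}|), 1\big) + \frac{1}{|\overline{\mathcal{M}}_e|} \sum_{(v_i, v_j) \in \overline{\mathcal{M}}_e} \ell\big(f_w(|f_{\theta'}(\mathcal{G})_{v_i} - f_{\theta'}(\mathcal{G})_{v_j}|), 0\big) \tag{3}$$
where $f_{\theta'}(\mathcal{G})_{v_i}$ denotes the embedding of node $v_i$, $\ell(\cdot, \cdot)$ is the cross entropy loss, $f_w$ linearly maps to 1-dimension, and the class of having a link between $v_i$ and $v_j$ is indicated by 1 and 0 otherwise. In summary, this method is expected to help the GNN learn information about local connectivity.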
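The sampling step of EdgeMask can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the function name `sample_edge_mask`, the undirected edge-list input, and the rejection sampling of unconnected pairs are all choices made here for clarity.

```python
import random


def sample_edge_mask(edges, num_nodes, m_e, seed=0):
    """Pick m_e existing edges to mask (label 1) and m_e unconnected node
    pairs (label 0). Also returns the edges left in the graph after masking."""
    rng = random.Random(seed)
    edge_set = {(min(i, j), max(i, j)) for i, j in edges}
    max_non_edges = num_nodes * (num_nodes - 1) // 2 - len(edge_set)
    if m_e > len(edge_set) or m_e > max_non_edges:
        raise ValueError("m_e is too large for this graph")

    masked = rng.sample(sorted(edge_set), m_e)

    # Rejection sampling: draw random pairs until m_e unconnected ones are found.
    non_edges = set()
    while len(non_edges) < m_e:
        i, j = rng.sample(range(num_nodes), 2)
        pair = (min(i, j), max(i, j))
        if pair not in edge_set:
            non_edges.add(pair)

    remaining = sorted(edge_set - set(masked))
    return masked, sorted(non_edges), remaining


edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
masked, non_edges, remaining = sample_edge_mask(edges, num_nodes=5, m_e=2)
# Pretext labels: 1 for masked (true) edges, 0 for sampled non-edges.
pretext_pairs = [(p, 1) for p in masked] + [(p, 0) for p in non_edges]
```

The model would then be trained on the graph built from `remaining`, so that the masked edges are genuinely hidden from neighborhood aggregation.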
Global self-supervision information for a given node is not based only on the node itself or limited to its immediate local neighborhood; it also takes a bird's-eye view of the position of the node in the graph. Similar to the local perspective, we also propose two representative SSL pretext tasks, where one is based upon a global pairwise comparison between two nodes and the other is based on how a single node is globally positioned in the graph.
PairwiseDistance. The EdgeMask pretext task takes a local structure perspective based on masking and trying to recover/predict local edges in the graph. We further develop PairwiseDistance, where we aim to guide the graph neural network to maintain global topology information through a pairwise comparison. In other words, the pretext task is designed to distinguish/predict the distance between different node pairs. We note that distance can be measured in a variety of ways, such as being in the same connected component/cluster or not, personalized PageRank, or other global link prediction methods that calculate node similarity Liben-Nowell and Kleinberg (2007), etc. In this work, similar to global context prediction in Peng et al. (2020), we elect to use the shortest path length as a measure of the distance between nodes. More specifically, we first calculate the pairwise node shortest path length $p_{ij}$ for all node pairs $\{(v_i, v_j) \mid v_i, v_j \in \mathcal{V}\}$ and further group the lengths into four categories: $p_{ij} = 1$, $p_{ij} = 2$, $p_{ij} = 3$, and $p_{ij} \geq 4$. The reason for selecting four bins for the path length between two nodes is that the GNN should be able to correctly judge the distance between two nodes to some extent, but if we were to include more classes it would: 1) require more calculations to discover all the actual pairwise distances (if greater than 4); and 2) potentially overfit to some of the longer pairwise distances in the graph, which become quite noisy as compared to the shorter path lengths. In addition, since using all node pairs in the objective would be computationally expensive during the training process, in practice we randomly sample a certain number of node pairs $\mathcal{S}$ used for self-supervision during each epoch. The SSL loss can then be formulated as a multi-class classification problem as follows,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{S}|} \sum_{(v_i, v_j) \in \mathcal{S}} \ell\big(f_w(|f_{\theta'}(\mathcal{G})_{v_i} - f_{\theta'}(\mathcal{G})_{v_j}|), C_{p_{ij}}\big) \tag{4}$$
where $C_{p_{ij}}$ is the corresponding distance category of $p_{ij}$, $\ell(\cdot, \cdot)$ denotes the cross entropy loss, and $f_w$ linearly maps to 1-dimension. Note that we leave other pairwise distance measures and other settings for the shortest path distance as future work.
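The label construction for PairwiseDistance can be sketched with a breadth-first search on an unweighted graph. This is a minimal sketch under assumptions: the four bins are taken to be path lengths 1, 2, 3, and 4 or more, unreachable pairs are simply skipped, and the function names and adjacency-list input are hypothetical.

```python
import random
from collections import deque


def bfs_distances(neighbors, source):
    """Shortest path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_category(path_length):
    """Assumed binning: lengths 1, 2, 3 map to classes 0, 1, 2;
    every length of 4 or more shares class 3."""
    return min(path_length, 4) - 1


def sample_distance_labels(neighbors, num_pairs, seed=0):
    """Sample node pairs and attach their distance class as the pretext label."""
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    all_dist = {u: bfs_distances(neighbors, u) for u in nodes}
    reachable = [(u, v) for u in nodes for v in nodes if u < v and v in all_dist[u]]
    chosen = rng.sample(reachable, min(num_pairs, len(reachable)))
    return {(u, v): distance_category(all_dist[u][v]) for u, v in chosen}


# Toy path graph 0 - 1 - 2 - 3 - 4 - 5.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = sample_distance_labels(neighbors, num_pairs=5)
```

Running one search per node is what makes this task expensive on large graphs, which is the cost that per-epoch sampling of pairs does not remove.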
Distance2Clusters. Although PairwiseDistance applies a sampling strategy to reduce time complexity, it is still very time-consuming since we need to calculate the pairwise distance for all node pairs. Instead, we derive a new SSL pretext task that explores global structure information by predicting the distance (again in terms of shortest path length) from the unlabeled nodes to predefined graph clusters. This forces the representations to learn a global positioning vector for each of the nodes. In other words, rather than a node predicting the pairwise distance to an arbitrary other node in the graph, we establish a fixed set of anchor/center nodes associated with graph clusters, and each node then predicts its distance to this set of anchor nodes. Concretely, we first partition the graph to get $k$ clusters $\{C_1, C_2, \dots, C_k\}$ by applying the METIS graph partitioning algorithm Karypis and Kumar (1998), since it is commonly used in the literature. Inside each cluster $C_j$, we assign the node with the highest degree to be the center of the corresponding cluster, denoted as $c_j$. Then, we can efficiently create a cluster distance vector $\mathbf{d}_i \in \mathbb{R}^k$ for node $v_i$, where the $j$-th element of $\mathbf{d}_i$ is the distance from $v_i$ to the center of $C_j$. The SSL goal of Distance2Clusters is thus to predict this distance vector, and the optimization problem can be formulated as a multiple regression problem as,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{D}_U|} \sum_{v_i \in \mathcal{D}_U} \big\| f_{\theta'}(\mathcal{G})_{v_i} - \mathbf{d}_i \big\|^2 \tag{5}$$
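The construction of the cluster distance vectors can be sketched as below. This sketch assumes the partition is already available as lists of node ids (for example from a METIS run, which is not invoked here), breaks degree ties by taking the first node in the cluster, and uses `None` for unreachable centers. All names are hypothetical.

```python
from collections import deque


def bfs_distances(neighbors, source):
    """Shortest path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def cluster_distance_vectors(neighbors, clusters):
    """Pick the highest-degree node of each cluster as its center, then build,
    for every node, the vector of hop distances to all centers."""
    centers = [max(cluster, key=lambda v: len(neighbors[v])) for cluster in clusters]
    from_center = [bfs_distances(neighbors, c) for c in centers]  # one BFS per cluster
    vectors = {v: [d.get(v) for d in from_center] for v in neighbors}
    return centers, vectors


# Toy path graph 0 - 1 - 2 - 3 - 4 - 5 with a precomputed two-way partition.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clusters = [[0, 1, 2], [3, 4, 5]]  # assumed output of a graph partitioner
centers, vectors = cluster_distance_vectors(neighbors, clusters)
```

Only one search per cluster is needed rather than one per node, which is where the saving over all-pairs distances comes from.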
In this subsection, we focus on attribute information as the second natural choice for establishing a self-supervised pretext task. Here the key idea behind using attribute information is to guide the GNN so that certain aspects of node/neighborhood attribute information are encoded in the node embeddings after training on a self-supervised attribute-based pretext task. Next, we design two attribute-based pretext tasks.
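AttributeMask. This task is similar to EdgeMask, but here we hope the GNN can learn more attribute information via SSL. Thus, we randomly mask (i.e., set equal to zero) the features of $m_a$ nodes $\mathcal{M}_a \subset \mathcal{V}$ where $|\mathcal{M}_a| = m_a$, and then ask the self-supervised component to reconstruct these features. More formally,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{M}_a|} \sum_{v_i \in \mathcal{M}_a} \big\| f_{\theta'}(\mathcal{G})_{v_i} - \mathbf{x}_i \big\|_2^2$$
However, the features in most real-world datasets are often high-dimensional and sparse. Hence, in practice we first employ Principal Component Analysis (PCA) to obtain reduced dense features before applying AttributeMask.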
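The masking step of AttributeMask can be sketched as follows. This is an illustrative sketch, assuming dense feature rows (for example after PCA, which is omitted here) and a squared-error reconstruction loss averaged over the masked nodes; the function names are hypothetical.

```python
import random


def mask_attributes(features, m_a, seed=0):
    """Zero out the feature rows of m_a randomly chosen nodes.
    Returns the masked copy and the sorted list of masked node indices."""
    rng = random.Random(seed)
    masked_nodes = sorted(rng.sample(range(len(features)), m_a))
    chosen = set(masked_nodes)
    masked = [
        [0.0] * len(row) if i in chosen else list(row)
        for i, row in enumerate(features)
    ]
    return masked, masked_nodes


def attribute_mask_loss(reconstructed, features, masked_nodes):
    """Assumed loss: squared error between reconstructed and original
    features, averaged over the masked nodes only."""
    total = 0.0
    for i in masked_nodes:
        total += sum((r - x) ** 2 for r, x in zip(reconstructed[i], features[i]))
    return total / len(masked_nodes)


features = [[1.0, 2.0], [0.5, 0.5], [3.0, 0.0], [0.0, 1.0]]
masked_features, masked_nodes = mask_attributes(features, m_a=2)
# A model that simply echoes its (masked) input pays the full squared norm.
baseline_loss = attribute_mask_loss(masked_features, features, masked_nodes)
```

The original `features` list is left untouched, since it serves as the reconstruction target.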
PairwiseAttrSim. In contrast to data samples in other domains such as images, in graph-structured data the aggregation process merges the features from multiple instances to produce the learned representation. Thus, given two nodes that have similar attributes, their learned representations are not necessarily similar (whereas, e.g., two identical images will obtain the same representation in typical deep learning models). More generally, the similarity two nodes have in the input feature space is not guaranteed in the learned representations, because the GNN aggregates features from the two nodes' local neighborhoods. This creates a double-edged sword: although we wish to utilize the local neighborhood in a GNN to enhance the node feature transformation, we still wish to somewhat maintain the notion of data instance similarity and not allow a node's neighborhood to drastically change its attribute signature. Thus, we establish the attribute-based SSL pretext task of node attribute similarity. Because the majority of pairwise similarities are near zero, we develop the following pair sampling strategy. First, we let $\mathcal{T}_s$ and $\mathcal{T}_d$ denote the sets of node pairs having the highest similarity and dissimilarity, respectively, which we more formally define as,
$$\mathcal{T}_s = \{(v_i, v_j) \mid s_{ij} \text{ in top-}K \text{ of } \{s_{ik}\}_{k=1}^{N} \setminus \{s_{ii}\}, \; \forall v_i \in \mathcal{V}\}$$
$$\mathcal{T}_d = \{(v_i, v_j) \mid s_{ij} \text{ in bottom-}K \text{ of } \{s_{ik}\}_{k=1}^{N} \setminus \{s_{ii}\}, \; \forall v_i \in \mathcal{V}\}$$
where $s_{ij}$ measures the node feature similarity between $v_i$ and $v_j$ (according to cosine similarity) and $K$ is the number of top/bottom pairs selected for each node. Now, we can formalize the regression problem as follows,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{T}|} \sum_{(v_i, v_j) \in \mathcal{T}} \big\| f_w(|f_{\theta'}(\mathcal{G})_{v_i} - f_{\theta'}(\mathcal{G})_{v_j}|) - s_{ij} \big\|^2 \tag{6}$$
where $\mathcal{T} = \mathcal{T}_s \cup \mathcal{T}_d$ and $f_w$ linearly maps to 1-dimension.
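The pair sampling strategy can be sketched as follows: for each node, rank all other nodes by cosine similarity of their input features and keep the top and bottom few. This is a minimal sketch with hypothetical names; it uses a quadratic loop over all pairs, which is only reasonable for small graphs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_bottom_pairs(features, k):
    """For every node i, return (i, j, s_ij) triples for its k most similar
    and k least similar other nodes."""
    n = len(features)
    similar, dissimilar = [], []
    for i in range(n):
        scored = sorted(
            ((cosine(features[i], features[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        similar += [(i, j, s) for s, j in scored[:k]]
        dissimilar += [(i, j, s) for s, j in scored[-k:]]
    return similar, dissimilar


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
similar, dissimilar = top_bottom_pairs(features, k=1)
# Regression targets for the pretext task are the similarity values themselves.
targets = {(i, j): s for i, j, s in similar + dissimilar}
```

Keeping only the extremes avoids training mostly on near-zero similarities, which would otherwise dominate the regression.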
In the last section, we discussed basic self-supervised pretext tasks based on both structure and attribute information. In this section, we present two strategies to merge these pretext tasks into GNNs, i.e., joint training and two-stage training, and then empirically analyze the impacts of the pretext tasks on GNNs.
A natural idea for employing self-supervised learning in graph neural networks is to jointly train the corresponding losses. In other words, we aim to optimize the self-supervised loss (i.e., $\mathcal{L}_{self}$) and the supervised loss (i.e., $\mathcal{L}_{task}$) simultaneously.
An overview of joint training is shown in Figure 1. Essentially, it can be separated into two stages: 1) a feature extraction process; and 2) adaptation processes for both the downstream and self-supervised pretext tasks. The first step is a feature extraction process that is applied on the input graph, which can consist of various graph convolutional layers. Based on the extracted features, two adaptation processes are applied to the downstream task and the self-supervised pretext task, respectively. Note that the adaptation layers can be graph convolutional layers or linear layers (which we later discuss in the experiment setup). Afterwards, we jointly optimize the losses from both the self-supervised and downstream task components.
As introduced in Section 2, we denote the prediction of a node $v_i$ as $f_\theta(\mathcal{G})_{v_i}$, where $f_\theta$ represents our graph neural network model. As demonstrated in Figure 1, we separate the GNN into a feature extractor and an adapter for the downstream classification task. Correspondingly, we split the parameter $\theta$ as $\theta = \{\theta_z, \theta_y\}$. Then, we use $f_{\theta_z}$ to denote the feature extractor component of $f_\theta$, where $f_{\theta_z}(\mathcal{G}) = \mathbf{Z}$ and $\mathbf{z}_i$ represents the embedding of node $v_i$. Furthermore, we utilize $f_{\theta_y}$ to denote the adapter/classifier component of $f_\theta$ that maps the embedding of a node to the predicted class $\hat{y}_i$. In addition, the self-supervised pretext task can be formulated to utilize the same feature extractor $f_{\theta_z}$ and an additional adapter $f_{\theta_s}$. Thus, the overall objective can be defined as follows,
$$\min_{\theta, \theta'} \; \mathcal{L}_{task}(\theta, \mathbf{A}, \mathbf{X}, \mathcal{D}_L) + \lambda \, \mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) \tag{7}$$
where $\theta' = \{\theta_z, \theta_s\}$ and $\lambda$ is the hyperparameter to control the contribution of self-supervision.
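To make the role of the shared feature extractor concrete, here is a deliberately tiny scalar sketch of the joint objective. It is a toy under stated assumptions, not the paper's model: the extractor and both adapters are single scalar weights (`w_z`, `w_y`, `w_s`, all hypothetical), and both losses are squared errors rather than the cross entropy used for classification.

```python
def joint_loss_and_grads(x, y, s, w_z, w_y, w_s, lam):
    """Toy joint objective: total = task_loss + lam * self_loss.

    z = w_z * x is the shared embedding; w_y and w_s are the task and
    pretext adapters. Returns the total loss and analytic gradients.
    """
    z = w_z * x
    y_hat = w_y * z          # downstream prediction
    s_hat = w_s * z          # pretext prediction
    task_loss = (y_hat - y) ** 2
    self_loss = (s_hat - s) ** 2
    total = task_loss + lam * self_loss
    grads = {
        # The shared extractor receives gradient from BOTH losses.
        "w_z": 2 * (y_hat - y) * w_y * x + lam * 2 * (s_hat - s) * w_s * x,
        # Each adapter only sees its own loss.
        "w_y": 2 * (y_hat - y) * z,
        "w_s": lam * 2 * (s_hat - s) * z,
    }
    return total, grads


total, grads = joint_loss_and_grads(
    x=1.0, y=2.0, s=3.0, w_z=0.5, w_y=1.0, w_s=1.0, lam=0.5
)
```

Setting `lam` to zero recovers purely supervised training, while larger values let the pretext task pull the shared embedding harder.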
Two common strategies to utilize features learned via self-supervision in computer vision include applying the self-supervised model as an initialization for fine-tuning Zhai et al. (2019); Noroozi et al. (2018) and training a linear classifier over the learned features Kolesnikov et al. (2019); Doersch et al. (2015). These strategies motivate a two-stage training method to integrate SSL into GNNs. This method consists of the following two stages: 1) pre-training on the self-supervised pretext task; and 2) fine-tuning on the downstream task.
An overview of the two-stage training method is given in Figure 2. Similar to joint training, the self-supervised model consists of a feature extraction module $f_{\theta_z}$ and an adaptation module $f_{\theta_s}$, which are optimized on their own, independent of the downstream task. Then, after the self-supervised model is fully trained, we begin to train the downstream task model. More specifically, the downstream task model also has an adaptation module $f_{\theta_y}$, but its feature extraction module shares parameters with that of the self-supervised model, $f_{\theta_z}$. As seen in the figure, we first pre-train the self-supervised model on the pretext task, and then use the self-supervised model's feature extraction module as the initialization of that of the downstream task. After initializing $f_{\theta_z}$, we can either fix it or fine-tune it for the downstream task.
In this subsection, we conduct extensive experiments based on the basic pretext tasks to understand what SSL information works for GNNs, which strategies can better integrate SSL for GNNs, and further analyze why SSL is able to improve GNNs. Following the setting in GCN Kipf and Welling (2016), we conduct experiments on the public data splits of three widely used benchmark datasets: Cora, Citeseer, and Pubmed. The dataset statistics can be found in Table 1. We used the Adam optimizer with learning rate 0.01, $L_2$ regularization 5e-4, dropout rate 0.5, 128 hidden units across all self-supervised information and GCN, and top-K = bottom-K = 3. Then parameters tuned on validation accuracy are: in range of , and in {10%, 20%} the size of .
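Table 1: Dataset statistics.
| Dataset | Nodes | Edges | Classes | Features | Training/Validation/Test |
|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 7 | 1,433 | 140/500/1000 |
| Citeseer | 3,327 | 4,732 | 6 | 3,703 | 120/500/1000 |
| Pubmed | 19,717 | 44,338 | 3 | 500 | 60/500/1000 |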
Strategies for Two-stage Training. For two-stage training, after initializing the feature extractor, we can either fix it or fine-tune it for node classification. We studied various architectures for both the pretrained and node classification models, with the results shown in Table 2. Note that in the table, (1) "2GC+1Linear" denotes that we use two graph convolutional layers for feature extraction and one linear layer for the adaptation; (2) "2GC" means one graph convolutional layer for feature extraction and another graph convolutional layer for the adaptation; (3) in the "Finetune Strategy" column, "Fix" and "Tune all" correspond to the aforementioned two strategies, and we also report the performance of node classification without pre-training from SSL as the third strategy (rows with "-" entries); and (4) all the experiments are conducted with the PairwiseDistance task. In most cases, the "Tune all" strategy achieves the best performance. Thus, we choose this strategy when using two-stage training. We also note that the configuration of one graph convolutional layer for feature extraction, one graph convolutional layer for the adaptation of node classification, and one linear layer for the adaptation of the pretext task works very well for all three strategies. Therefore, we select this configuration for the remaining experiments unless stated otherwise.
Table 2: Two-stage training with different architectures and fine-tuning strategies (PairwiseDistance task).
| Pretrained Model | Node Classification Model | Finetune Strategy | Accuracy (%) |
|---|---|---|---|
| 2GC+1Linear | 2GC+1Linear | Fix | 73.53 |
| 2GC+1Linear | 2GC+1Linear | Tune all | 80.55 |
| - | 2GC+1Linear | - | 78.63 |
| 2GC+1Linear | 3GC | Fix | 74.69 |
| 2GC+1Linear | 3GC | Tune all | 82.49 |
| - | 3GC | - | 80.88 |
| 1GC+1Linear | 1GC+1Linear | Fix | 80.75 |
| 1GC+1Linear | 1GC+1Linear | Tune all | 79.79 |
| - | 1GC+1Linear | - | 78.75 |
| 1GC+1Linear | 2GC | Fix | 81.04 |
| 1GC+1Linear | 2GC | Tune all | 82.39 |
| - | 2GC | - | 81.32 |
SSL for GNNs. Following the aforementioned experimental settings, we evaluate the six basic pretext tasks from Section 3 with the joint and two-stage training strategies; the results are shown in Table 3.
Joint Training vs. Two-stage Training. We observe that although two-stage training is able to improve the vanilla GCN model, joint training outperforms two-stage training in most settings. This observation from the graph domain is consistent with that from self-supervised semi-supervised learning on images Zhai et al. (2019). In addition, the joint training strategy is less complicated than the fine-tuning strategy. More specifically, joint training only requires tuning a single hyperparameter, whereas two-stage training requires significant tuning effort due to its high sensitivity, as shown in Table 2. Thus, our empirical analysis suggests that joint training is a better strategy for integrating SSL with GNNs than two-stage training.
What SSL Works for GNNs. From Table 3, we can first observe that the best performance is always achieved by a model that includes an SSL pretext task. In other words, our empirical analysis clearly shows that utilizing self-supervised information in graph neural networks is a promising direction for further improving the performance of deep learning on graph-structured data. Furthermore, we observe a wide range of utility among the various self-supervised pretext tasks for improving node classification. First, we notice that across all datasets, the best performing method is a pretext task developed from global structure information. Another thing to note is that we judge the quality of AttributeMask in comparison to GCN-PCA, since they both first utilize PCA as a preprocessing step to reduce the dimension of the node features/attributes. Then, on further analysis of the results in Table 3, we find that the pretext tasks of NodeProperty, EdgeMask and AttributeMask cannot boost the original GCN since the performance difference is always smaller than . By contrast, global self-supervision including PairwiseDistance, Distance2Cluster, and PairwiseAttrSim successfully improves the performance (e.g., over improvement on the Cora dataset). Thus, self-supervised information from both the structure and the attributes has potential, while for structure information, the global pretext tasks are likely to provide much more significant improvements than the local ones.
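Table 3: Node classification performance (%) of the basic pretext tasks under joint training and two-stage training.
| Model | Joint: Cora | Joint: Citeseer | Joint: Pubmed | Two-stage: Cora | Two-stage: Citeseer | Two-stage: Pubmed |
|---|---|---|---|---|---|---|
| GCN | 81.32 | 71.53 | 79.28 | 81.32 | 71.53 | 79.28 |
| GCN-DroppedGraph | 81.03 | 71.29 | 79.28 | 81.03 | 71.29 | 79.26 |
| GCN-PCA | 81.74 | 70.38 | 78.83 | 81.74 | 70.38 | 78.83 |
| NodeProperty | 81.94 | 71.60 | 79.44 | 81.59 | 71.69 | 79.24 |
| EdgeMask | 81.69 | 71.51 | 78.90 | 81.44 | 71.57 | 79.33 |
| PairwiseNodeDistance | 83.11 | 71.90 | 80.05 | 82.39 | 72.02 | 79.57 |
| Distance2Cluster | 83.55 | 71.44 | 79.88 | 81.80 | 71.55 | 79.51 |
| AttributeMask | 81.47 | 70.57 | 78.88 | 81.31 | 70.40 | 78.72 |
| PairwiseAttrSim | 83.05 | 71.67 | 79.45 | 81.57 | 71.74 | 79.42 |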
Why SSL Works for GNNs. We have found that some self-supervision, such as global structure and pairwise attribute information, performs well while other types do not. Thus, the natural question is why some work and why others are unable to bring improvement. As we mentioned before, GCN for node classification is naturally semi-supervised and already explores the unlabeled nodes. Therefore, one possible reason why some self-supervision cannot help is that GCN can already learn such information itself. If this is the case, then training on an additional self-supervised pretext task may not further boost the performance. To verify this assumption, we train logistic regression classifiers on the original node features and on node representations from GCN (without self-supervised learning) to predict the pretext task labels. The intuition is that if GCN can learn one type of self-supervision, the node representations from GCN should have preserved such information; thus they should perform better on the corresponding pretext task than the original node features. The performance difference is shown in Figure 3. We choose three representative pretext tasks, EdgeMask, NodeProperty and PairwiseDistance, for illustration, since similar patterns can be observed in the other cases. From the figure we can observe that GCN node representations consistently outperform the original features on the NodeProperty task by a large margin, indicating that GCN can learn such local structure information. Hence, NodeProperty cannot bring further improvement to its corresponding self-supervised GCN. Similar observations can be made for the EdgeMask task on Cora and Citeseer. The reason why the performance difference for EdgeMask on Pubmed is small could be that the original features of Pubmed are already very informative about local connectivity, since the original features can achieve over 80% accuracy on the EdgeMask task. On the contrary, the performance difference on the PairwiseDistance task is rather small across the three datasets. This observation suggests that GCN is unable to naturally learn the global structure information, and that employing pairwise node distance prediction as the SSL task can help boost its performance on the downstream task.
Insights on SSL for GNNs. One of the most fundamental topics in the study of graphs is node similarity, which describes how similar two nodes are in a graph. The two most popular approaches to measuring this similarity are structural equivalence and regular equivalence Newman (2018). More specifically, one way of defining similarity with regard to structural equivalence is that two nodes are structurally equivalent if their local neighborhoods significantly overlap. In comparison, regularly equivalent nodes are those that, while not necessarily having the same neighbors, have neighbors who themselves are similar Borgatti and Everett (1992).
Now, given these definitions, we discuss and characterize to what extent GCN is able to naturally maintain these types of similarity when mapping from the input space to the embedding space, and then how these are correlated to the observations we have made with different selfsupervised pretext tasks.
First, given that GCN works by aggregating features from a node’s local neighborhood, if two nodes have a significant overlap in their neighbors (as defined in structural equivalence), then it would be expected that their embeddings should somehow maintain this notion of similarity. Second, we let neighbor similarity in regular equivalence be defined in terms of their attribute similarity, then even if two nodes do not share the same neighbors, if their neighbors are similar, this will result in the learned embeddings of the two nodes maintaining this notion of regular attribute equivalence.
Furthermore, we can observe our proposed selfsupervised pretext tasks built on structure information (e.g., PairwiseNodeDistance) can be described in helping to maintain this notion of structural equivalence in the embeddings. For example, if the embeddings are encouraged to encode how similar two nodes are in terms of their distance (as done in PairwiseNodeDistance), then this is related to maintaining structural similarity as many node similarity measures are defined based on the idea of path length between two nodes. For the selfsupervised pretext tasks based on attribute information, if we define the concept of attribute equivalence being two vertices that share many of the same attributes/features (as structural share many neighbors). Then, given this definition of attribute equivalence, we can observe that indeed selfsupervised pretext tasks based on attribute information such as PairwiseAttrSim are actually looking to explicitly maintain this notion of similarity from the input to the embedding space.
With these insights and our empirical observations, we are able to design more pretext tasks. In this work, we explore a new direction beyond structure and attribute information, in which we take the specific downstream task into consideration. In particular, we extend the concept of regular equivalence from similarity at the level of structure or attributes to similarity at the level of the task, introducing regular task equivalence, where node similarity is defined specifically for the task. Since our downstream task is node classification, beyond the structure and attribute information (i.e., $\mathbf{A}$ and $\mathbf{X}$, respectively), we additionally have label information for some of the nodes (i.e., the labeled nodes). Thus, in the next section, we discuss advanced self-supervised methods built on the intuition of adapting regular equivalence from having neighbors with similar attributes to having neighbors with similar node labels (regular task equivalence). More specifically, the general idea is that every node constructs a pretext vector from the label information in its neighborhood, and two nodes with similar (or dissimilar) vectors are encouraged to be similar (or dissimilar) in the embedding space. The significance of this attempt is that, if the concept of regular task equivalence works, it opens new doors to designing more advanced pretext tasks based on the concept of equivalence from each individual information source or their combinations.
Based on our analysis in Section 4, a self-supervised pretext task constructed from only structure or attribute information does not always bring additional improvement to the downstream task, since such information may already be partially maintained by GCN. Thus, given that there can be different downstream tasks and associated information (e.g., a small set of labeled nodes), we can directly exploit the task-specific self-supervised information, referred to as SelfTask in this work. In particular, we develop various SelfTask pretext tasks that extend this idea to maintain regular task equivalence defined over the set of class labels.
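We first investigate modifying one of the best-performing global structure pretext tasks, Distance2Cluster, to take into consideration information from the labeled nodes. To incorporate label information with graph structure information, we propose to predict the distance vector from each node to the labeled nodes (i.e., $\mathcal{V}_L$) as the SSL task. For class $c_j \in \{1, \dots, K\}$ and unlabeled node $v_i \in \mathcal{V}_U$, we calculate the average, minimum and maximum shortest path length from $v_i$ to all labeled nodes in class $c_j$. Thus, the distance vector for $v_i$ can be denoted as $\mathbf{d}_i \in \mathbb{R}^{3K}$. Formally, the SSL objective can be formulated as a multiple regression problem as follows,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{D}_U|} \sum_{v_i \in \mathcal{D}_U} \left\| f_{\theta'}(\mathcal{G})_{v_i} - \mathbf{d}_i \right\|^2 \tag{8}$$
where $f_{\theta'}(\mathcal{G})_{v_i}$ denotes the prediction of the SSL model for node $v_i$. This formulation can be seen as a way of strengthening the global structure pretext task while focusing on the task-specific information of the labeled nodes.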
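The construction of the distance vector described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the adjacency-list representation, the function names `bfs_distances` and `distance_to_labeled`, and the `fill` value used when no labeled node of a class is reachable are all assumptions made for the example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_to_labeled(adj, labels, node, num_classes, fill=-1.0):
    """Concatenate [average, minimum, maximum] shortest-path length per class.

    `labels` maps labeled nodes to class ids. `fill` is a hypothetical
    placeholder for classes with no reachable labeled node.
    """
    dist = bfs_distances(adj, node)
    vector = []
    for c in range(num_classes):
        lengths = [dist[u] for u, y in labels.items() if y == c and u in dist]
        if lengths:
            vector += [sum(lengths) / len(lengths), min(lengths), max(lengths)]
        else:
            vector += [fill, fill, fill]
    return vector


# Path graph 0 - 1 - 2 - 3 - 4 with three labeled nodes.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: 0, 3: 1, 4: 1}
vector = distance_to_labeled(adj, labels, node=2, num_classes=2)
print(vector)  # [2.0, 2, 2, 1.5, 1, 2]
```

The resulting vector has three entries per class, which is the regression target for the unlabeled node. On large graphs one breadth-first search per node becomes expensive, which is in line with the scalability remark made for Reddit in the experiments.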
In Section 4, we analyzed the types of similarity that GNNs are naturally positioned to maintain, namely those based on structure and attributes, and how our basic pretext tasks are able to further improve them. This led to our definition of regular task equivalence, which, if maintained, would imply that nodes whose neighbors have similar labels should themselves be similar in the embedding space. In the node classification task, however, this would require labels for the neighbors, and labels are often sparse. Therefore, we propose to use a similarity-based function $f_s(\cdot)$ that utilizes structure, attributes, and the current labeled nodes to construct a neighbor label distribution context vector $\bar{\mathbf{y}}_i$ for each node as follows,
$$f_s(\mathbf{A}, \mathbf{X}, \mathcal{D}_L, \mathcal{V}_U) \rightarrow \{\bar{\mathbf{y}}_i \mid v_i \in \mathcal{V}_U\} \tag{9}$$
More specifically, we define the context of node $v_i$ as all nodes within its $k$-hop neighborhood, where the $c$-th element of the label distribution vector can be defined as,
$$\bar{y}_{ic} = \frac{|\mathcal{N}_{\mathcal{V}_L}(v_i, c)| + |\mathcal{N}_{\mathcal{V}_U}(v_i, c)|}{|\mathcal{N}_{\mathcal{V}_L}(v_i)| + |\mathcal{N}_{\mathcal{V}_U}(v_i)|}, \quad c = 1, \dots, K \tag{10}$$
where $\mathcal{N}_{\mathcal{V}_U}(v_i)$ denotes the neighborhood set from $\mathcal{V}_U$ of node $v_i$, $\mathcal{N}_{\mathcal{V}_U}(v_i, c)$ denotes only those in the neighborhood set having been assigned class $c$ (with similar definitions for the $\mathcal{V}_L$ neighborhood sets), and $\mathcal{D}_U = (\mathcal{V}_U, \{\bar{\mathbf{y}}_i \mid v_i \in \mathcal{V}_U\})$. Furthermore, the objective function for this pretext task based on the concept of regular task equivalence can then be formulated as,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U) = \frac{1}{|\mathcal{D}_U|} \sum_{v_i \in \mathcal{D}_U} \left\| f_{\theta'}(\mathcal{G})_{v_i} - \bar{\mathbf{y}}_i \right\|^2 \tag{11}$$
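As a rough illustration of the neighbor label distribution described above, the following Python sketch counts the (true or weak) class assignments among the nodes within a given number of hops and normalizes the counts. The names `k_hop_neighbors` and `context_label_vector`, the single `assigned` dictionary that merges true labels with weak labels, and the choice to exclude the node itself from its own context are assumptions made for this example.

```python
from collections import deque


def k_hop_neighbors(adj, node, k):
    """Return all nodes within k hops of `node`, excluding `node` itself."""
    depth = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if depth[u] == k:
            continue  # do not expand beyond k hops
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    del depth[node]
    return set(depth)


def context_label_vector(adj, assigned, node, k, num_classes):
    """Fraction of k-hop neighbors assigned to each class.

    `assigned` maps nodes to class ids, covering both labeled nodes and
    unlabeled nodes that received a weak label.
    """
    counts = [0] * num_classes
    for u in k_hop_neighbors(adj, node, k):
        if u in assigned:
            counts[assigned[u]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_classes  # no assigned neighbor in range
    return [count / total for count in counts]


# Path graph 0 - 1 - 2 - 3 - 4; node 2 is the node of interest.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
assigned = {0: 0, 1: 0, 3: 0, 4: 1}
two_hop = context_label_vector(adj, assigned, node=2, k=2, num_classes=2)
one_hop = context_label_vector(adj, assigned, node=2, k=1, num_classes=2)
print(two_hop, one_hop)  # [0.75, 0.25] [1.0, 0.0]
```

The example shows how the neighborhood range changes the target: with one hop both neighbors carry class 0, while two hops bring in a class 1 node and soften the distribution.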
Several methods can be selected for the similarity-based function that extends the label information to all unlabeled nodes. One way is to use a method based on structural equivalence, for which we elect to use Label Propagation (LP) Zhu et al. (2003) since it only uses $\mathbf{A}$ (although others, such as the shortest path used in Distance2Labeled, could be adopted here). Another way is to use both structural equivalence and attribute equivalence, for which we use the Iterative Classification Algorithm (ICA) Sen et al. (2008), which utilizes both $\mathbf{A}$ and $\mathbf{X}$. The neighbor label distribution context vectors could be noisy due to the inclusion of weak labels produced by the similarity-based function (e.g., when using LP or ICA). Next we introduce two methods to improve ContextLabel.
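To make the role of a structure-only similarity function more concrete, here is a small Python sketch of an iterative label propagation scheme. It is a simplified variant with uniform edge weights and clamped labeled nodes, intended only as an illustration and not as the exact algorithm of Zhu et al. (2003) or the authors' code; the function name `label_propagation` and the fixed iteration count are assumptions.

```python
def label_propagation(adj, labels, num_classes, num_iters=50):
    """Propagate class probabilities from labeled nodes along edges.

    Labeled nodes are clamped to their one-hot label; every other node
    repeatedly takes the mean of its neighbors' class probabilities.
    """
    probs = {u: [1.0 / num_classes] * num_classes for u in adj}
    for u, y in labels.items():
        probs[u] = [1.0 if c == y else 0.0 for c in range(num_classes)]
    for _ in range(num_iters):
        updated = {}
        for u in adj:
            if u in labels or not adj[u]:
                updated[u] = probs[u]  # clamp labeled or isolated nodes
                continue
            updated[u] = [
                sum(probs[v][c] for v in adj[u]) / len(adj[u])
                for c in range(num_classes)
            ]
        probs = updated
    return probs


# Path graph 0 - 1 - 2 - 3 - 4 with one labeled node at each end.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: 0, 4: 1}
probs = label_propagation(adj, labels, num_classes=2)
weak_labels = {
    u: max(range(2), key=probs[u].__getitem__) for u in adj if u not in labels
}
print(probs[1], probs[3])
```

Nodes closer to a labeled node end up with a probability vector that leans toward that node's class, and taking the most probable class per node yields the weak labels that a context vector could be built from. Ties, as for the middle node here, are broken arbitrarily in this sketch.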
There are various ways to define the similarity-based function $f_s(\cdot)$, such as LP and ICA. Thus, one possible way to improve ContextLabel is to ensemble various functions $f_s(\cdot)$. If we let the class probabilities for a node $v_i$ be $\sigma_{LP}(v_i, c)$ and $\sigma_{ICA}(v_i, c)$, respectively, when using LP and ICA inside $f_s(\cdot)$, then we can combine them to select $\bar{y}(v_i)$ as,
$$\bar{y}(v_i) = \arg\max_{c} \; \sigma_{LP}(v_i, c) + \sigma_{ICA}(v_i, c), \quad c = 1, \dots, K \tag{12}$$
We can use the ensembled $\bar{y}$ to construct the context label distribution as in Eq. (10) and then follow the pretext objective defined in Eq. (11).
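A minimal sketch of such an ensemble is given below. It assumes that the two class probability vectors are combined by elementwise summation before taking the most probable class; the function name `ensemble_label` and the example probabilities are hypothetical.

```python
def ensemble_label(prob_lp, prob_ica):
    """Pick the class with the largest summed probability (assumed rule)."""
    combined = [p + q for p, q in zip(prob_lp, prob_ica)]
    return max(range(len(combined)), key=combined.__getitem__)


# LP slightly prefers class 0, ICA strongly prefers class 1.
label = ensemble_label([0.6, 0.4], [0.2, 0.8])
print(label)  # 1
```

Summation lets a confident prediction from one function override a weak preference from the other, which is the intended benefit of ensembling noisy weak labels.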
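We design CorrectedLabel as an alternative pretext task that enhances ContextLabel by iteratively improving the context vectors. More specifically, we iteratively train the GNN and correct the labels, similar to the iterative training in Han et al. (2019), using training and correction phases. In the training phase, we use the corrected label $\hat{y}$ to build the corrected context label distribution vector $\hat{\mathbf{y}}_i$ for unlabeled nodes, similar to Eq. (10). We use $\hat{\mathcal{D}}_U = (\mathcal{V}_U, \{\hat{\mathbf{y}}_i \mid v_i \in \mathcal{V}_U\})$ to denote the unlabeled data samples with their corrected context label distributions, in addition to $\mathcal{D}_U$, for the SSL task. The GNN is then trained on both the original (e.g., $\bar{\mathbf{y}}_i$) and corrected (e.g., $\hat{\mathbf{y}}_i$) context distributions, where the loss can be formulated as,
$$\mathcal{L}_{self}(\theta', \mathbf{A}, \mathbf{X}, \mathcal{D}_U, \hat{\mathcal{D}}_U) = \frac{1}{|\mathcal{D}_U|} \sum_{v_i \in \mathcal{D}_U} \left\| f_{\theta'}(\mathcal{G})_{v_i} - \bar{\mathbf{y}}_i \right\|^2 + \alpha \, \frac{1}{|\hat{\mathcal{D}}_U|} \sum_{v_i \in \hat{\mathcal{D}}_U} \left\| f_{\theta'}(\mathcal{G})_{v_i} - \hat{\mathbf{y}}_i \right\|^2 \tag{13}$$
where the first and second terms fit the original and corrected context distributions, respectively, and $\alpha$ controls the contribution from the corrected context distributions.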
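The two-term loss can be illustrated with a short Python sketch. It assumes a mean squared error for each term and a scalar weight `alpha` on the corrected term; the function names and the toy inputs are hypothetical and do not come from the authors' code.

```python
def squared_error(pred, target):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def corrected_label_loss(preds, original, corrected, alpha):
    """Fit the original context vectors plus alpha times the corrected ones.

    All three arguments map an unlabeled node to a vector over classes.
    """
    n = len(preds)
    fit_original = sum(squared_error(preds[v], original[v]) for v in preds) / n
    fit_corrected = sum(squared_error(preds[v], corrected[v]) for v in preds) / n
    return fit_original + alpha * fit_corrected


preds = {0: [0.5, 0.5]}
original = {0: [1.0, 0.0]}   # context vector built from weak labels
corrected = {0: [0.0, 1.0]}  # context vector built from corrected labels
loss = corrected_label_loss(preds, original, corrected, alpha=0.5)
print(loss)  # 0.75
```

Setting `alpha` to zero recovers the plain ContextLabel objective, so the weight directly controls how much the corrected targets are trusted.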
In the label correction phase, we employ the trained GCN to select $p$ class prototypes $\mathcal{Z}_c = \{\mathbf{z}_{c1}, \dots, \mathbf{z}_{cp}\}$ (represented as deep features) for each category $c$, which are used to generate the corrected label. More specifically, we first randomly sample $m$ nodes in the same class to calculate their pairwise similarity matrix $\mathbf{S} \in \mathbb{R}^{m \times m}$, where $S_{ij}$ is the cosine similarity between two nodes based on their embeddings. Then we define the density $\rho_i$ for each node $v_i$ as,
$$\rho_i = \sum_{j=1}^{m} \mathrm{sign}(S_{ij} - S_c) \tag{14}$$
where $S_c$ is a constant value (which we selected as the value ranked in the top 40% in $\mathbf{S}$, as suggested in Han et al. (2019)) and $\mathrm{sign}(x) = 1$, $0$ or $-1$ if $x > 0$, $x = 0$ or $x < 0$, respectively. According to the formulation, a smaller $\rho_i$ indicates that the node is less similar to other nodes in the same class. Nodes with inconsistent labels are usually isolated from the others, while nodes with correct labels should be close to each other. Hence, we select the nodes with the top-$p$ largest $\rho$ values as the class prototypes. Then we calculate the corrected label for node $v_i$ as,
$$\hat{y}(v_i) = \arg\max_{c} \; \frac{1}{p} \sum_{l=1}^{p} \cos(\mathbf{h}_i, \mathbf{z}_{cl}), \quad c = 1, \dots, K \tag{15}$$
where $\mathbf{h}_i$ is the embedding of $v_i$ produced by the trained GCN and $\cos(\cdot, \cdot)$ denotes the cosine similarity between two samples. In other words, we use the average similarity between $v_i$ and the $p$ prototypes of a class to represent the similarity between $v_i$ and that class, and then assign the class with the largest similarity to $v_i$. By iterating the two phases, the GNN can gradually learn corrected labels (e.g., $\hat{y}$).
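The label correction phase can be sketched as follows, assuming node embeddings are available as plain lists of floats. The function names, the way the threshold is read off the sorted similarity values, the inclusion of self-similarities in the density, and the toy embeddings are assumptions for illustration, not details confirmed by the text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sign(x):
    """Return 1, 0 or -1 for positive, zero or negative x."""
    return (x > 0) - (x < 0)


def select_prototypes(embeddings, num_prototypes, top_fraction=0.4):
    """Select the densest sampled nodes of one class as its prototypes."""
    m = len(embeddings)
    sim = [[cosine(a, b) for b in embeddings] for a in embeddings]
    ranked = sorted((s for row in sim for s in row), reverse=True)
    # Threshold: the similarity value ranked at the top 40% (assumed reading).
    threshold = ranked[min(int(top_fraction * len(ranked)), len(ranked) - 1)]
    density = [sum(sign(sim[i][j] - threshold) for j in range(m)) for i in range(m)]
    order = sorted(range(m), key=lambda i: density[i], reverse=True)
    return [embeddings[i] for i in order[:num_prototypes]]


def corrected_label(embedding, prototypes_by_class):
    """Assign the class whose prototypes are most similar on average."""
    def average_similarity(c):
        prototypes = prototypes_by_class[c]
        return sum(cosine(embedding, z) for z in prototypes) / len(prototypes)
    return max(prototypes_by_class, key=average_similarity)


# Three sampled nodes of class 0; the last one looks mislabeled.
class0 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prototypes0 = select_prototypes(class0, num_prototypes=2)
new_label = corrected_label([0.8, 0.2], {0: prototypes0, 1: [[0.0, 1.0]]})
print(prototypes0, new_label)
```

In this toy example the outlying embedding receives the lowest density and is therefore not chosen as a prototype, which is the behavior the density criterion is meant to produce.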
In this section, we evaluate the effectiveness of the proposed SelfTask pretext tasks presented in Section 5. Before presenting our experimental results and observations, we first introduce the experimental settings.
To validate the proposed approaches, we conduct experiments on four benchmark datasets, including Cora, Citeseer and Pubmed Kipf and Welling (2016) shown in Table 1, and Reddit Hamilton et al. (2017). More specifically, Reddit has 232,965 nodes, 57,307,946 edges, 210 classes, 5,414 node features and training/validation/test node split as 152,410/23,699/55,334, respectively. We note that all experiments are performed in the transductive setting.
We adopt a 2-layer GCN with 128 hidden units as the backbone of the node classification model, together with regularization and dropout. For the SSL loss, we take the hidden representations from the first layer of the GCN and feed them through a linear layer to solve the SSL pretext task. We use the joint training strategy to integrate SSL with GCNs. The weighting parameter for joint training and the label correction parameter of CorrectedLabel are each selected by searching over a set of candidate values. For ContextLabel, EnsembleLabel, and CorrectedLabel, the neighborhood range is set to 2 for Cora, Citeseer and Pubmed, and 1 for Reddit. All experiments are repeated 10 times and we report the average accuracy with standard deviation. The hyperparameters of all models are tuned based on the loss and accuracy on the validation set. In addition to the vanilla 2-layer GCN Kipf and Welling (2016), we also include two recent SSL methods on graph neural networks as baselines: (1) SelfTraining Li et al. (2018), which first trains a graph neural network and then adds the most confident predictions on unlabeled data to the label set as pseudo-labels for later training; and (2) M3S Sun et al. (2019), which repeatedly assigns pseudo-labels and trains on the augmented labeled set over multiple rounds, employing DeepCluster Caron et al. (2018) and SelfTraining to perform self-checking based on the generated pseudo-labels.

Model  Cora  Citeseer  Pubmed
GCN
SelfTraining
M3S
SelfTaskDistance2Labeled¹
SelfTaskContextLabelLP
SelfTaskContextLabelICA
SelfTaskEnsembleLabel
SelfTaskCorrectedLabelLP
SelfTaskCorrectedLabelICA

¹ SelfTaskDistance2Labeled is not scalable to the Reddit dataset, where the amount of labeled and unlabeled data is huge, since by definition it requires calculating the shortest path length from the labeled data to the unlabeled data.
The node classification performance is reported in Table 4. We first note that most of the SelfTask pretext tasks outperform the existing SSL methods, i.e., SelfTraining and M3S. This observation not only demonstrates the effectiveness of SelfTask but also indicates that the insights from the preliminary analysis have considerable potential to inspire new pretext tasks on graphs. We observe that the pretext tasks based on ContextLabel consistently improve GCN across all datasets by a large margin. For instance, SelfTaskContextLabelICA improves GCN on the Cora, Citeseer and Pubmed datasets, achieving state-of-the-art performance. By contrast, most of the basic SSL tasks improve on only one dataset or achieve only a small improvement, which demonstrates the importance of task-specific information in constructing stronger pretext tasks. Moreover, label correction consistently boosts the performance of SelfTask on all datasets, while label ensemble improves only over SelfTaskContextLabelLP, and only in most cases. This observation indicates that label correction extends label information to unlabeled nodes better than ensembling does. However, label correction is much less efficient, since it iteratively trains the model and corrects the labels of the unlabeled nodes. Hence, in practice, we need to balance computational efficiency and predictive accuracy when choosing the best strategy.
The proposed SelfTask pretext tasks depend on the label information. In this subsection, we examine whether SelfTask still works with a very limited number of labeled samples. We randomly sample 5 or 10 nodes per class for training and the same number of nodes for validation. All remaining labeled nodes are used for testing. We repeat this process 10 times and compare our best model, denoted as SelfTask, with GCN and M3S. Since the performance of SelfTraining is always worse than that of M3S, we do not include it. The results are shown in Figure 4. As we can see from the figure, the performance of GCN drops rapidly as the number of labeled samples decreases. However, SelfTask achieves even greater improvement when fewer labeled samples are available and consistently outperforms the state-of-the-art baselines. In particular, with 5 samples per class on the Citeseer dataset, our proposed model improves GCN by a large margin. These observations suggest that SelfTask can be applied in scenarios where labels are very sparse.
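The sampling protocol described above can be sketched as follows. The function name `sample_split`, the seeding scheme and the toy label dictionary are assumptions made for the example; only the protocol itself (a fixed number of training and validation nodes per class, with the remaining labeled nodes used for testing) comes from the text.

```python
import random


def sample_split(labels, per_class, seed=0):
    """Sample `per_class` training and validation nodes for every class.

    `labels` maps each labeled node to its class id. All labeled nodes
    not chosen for training or validation form the test set.
    """
    rng = random.Random(seed)
    by_class = {}
    for node, y in labels.items():
        by_class.setdefault(y, []).append(node)
    train, val = [], []
    for y in sorted(by_class):
        nodes = sorted(by_class[y])
        rng.shuffle(nodes)
        train += nodes[:per_class]
        val += nodes[per_class:2 * per_class]
    chosen = set(train) | set(val)
    test = [node for node in labels if node not in chosen]
    return train, val, test


# Toy data: 40 labeled nodes spread evenly over two classes.
labels = {node: node % 2 for node in range(40)}
train, val, test = sample_split(labels, per_class=5, seed=0)
print(len(train), len(val), len(test))  # 10 10 20
```

Repeating the call with different seeds gives the repeated random splits over which average accuracy would be reported.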
In this subsection, we explore the sensitivity of the best model, SelfTaskCorrectedLabelICA, to its hyperparameters. Here we alter the values of the SSL loss weight $\lambda$ and the label correction weight $\alpha$ to see how the test accuracy of the model changes, varying each over a range of values. We only report the results on the Cora dataset since similar patterns are observed on the other datasets. The change in accuracy with respect to $\lambda$ is illustrated in subfigure (a). We can see that the performance of our model first increases as $\lambda$ increases. This result supports that incorporating SSL can boost the performance of GNNs. However, when $\lambda$ is large, the performance drops due to overfitting on the SSL task. Subfigure (b) shows the impact of $\alpha$. Employing label correction ($\alpha > 0$) outperforms not using it ($\alpha = 0$), which suggests the effectiveness of label correction.
In this section, we introduce the related work, including self-supervised learning and graph neural networks.
SSL is a learning framework that generates additional supervision signals to train deep learning models through carefully designed pretext tasks. SSL has been shown to effectively alleviate the lack of labeled training data Kolesnikov et al. (2019). In the image domain, various self-supervised learning techniques have been developed for learning high-level image representations. Doersch et al. (2015) first proposed to predict the relative locations of image patches. Following this line of research, Noroozi and Favaro (2016) designed a pretext task called Jigsaw Puzzle. More types of pretext tasks have also been investigated, such as image rotation Gidaris et al. (2018), image clustering Caron et al. (2018), image inpainting Pathak et al. (2016), image colorization Zhang et al. (2016) and motion segmentation prediction Pathak et al. (2017). In the domain of graphs, there are a few works incorporating SSL. Sun et al. (2019) utilized the clustering assignments of node embeddings as guidance to update the graph neural networks. Peng et al. (2020) proposed to use the global context of nodes as the supervisory signals to learn node embeddings.

GNNs can be roughly categorized into spectral methods and spatial methods. Spectral methods were initially developed based on spectral theory Bruna et al. (2013); Defferrard et al. (2016); Kipf and Welling (2016). Bruna et al. (2013) first extended the notion of convolution to non-grid structures. Afterward, a simplified version of spectral GNNs called ChebNet Defferrard et al. (2016) was developed. Next, GCN was proposed by Kipf and Welling (2016), where ChebNet is further simplified based on its first-order approximation. Later, Wu et al. (2019a) proposed Simple Graph Convolution (SGC) to simplify GCN by removing nonlinearities and collapsing weight matrices. Spatial methods consider the topological structure of the graph and aggregate the information of nodes according to local information Hamilton et al. (2017); Veličković et al. (2018). Hamilton et al. (2017) proposed an inductive learning method called GraphSAGE for large-scale networks. Veličković et al. (2018) proposed the graph attention network (GAT), which adds an attention mechanism to graph convolutions. Further, Rong et al. (2019) developed deep graph convolutional networks by applying the DropEdge mechanism to randomly drop edges during training. For a thorough review, please refer to recent surveys Wu et al. (2020); Zhou et al. (2018).
Applying self-supervised learning to GNNs is a cutting-edge research topic with great potential. To facilitate this line of research, we have carefully studied SSL in GNNs for the task of node classification. We first introduce various basic SSL pretext tasks for graphs and present a detailed empirical study to understand when and why SSL works for GNNs and which strategies work better with GNNs. Next, based on our insights, we propose a new direction, SelfTask, to build advanced pretext tasks that further exploit task-specific self-supervised information. Extensive experiments on real-world datasets demonstrate that our advanced method achieves state-of-the-art performance. Future work includes exploring new pretext tasks and applying the proposed SSL strategies to pre-training graph neural networks.
Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1920–1929.
International Conference on Machine Learning.
Thirty-Second AAAI Conference on Artificial Intelligence.