Automated Self-Supervised Learning for Graphs

06/10/2021 ∙ by Wei Jin, et al. ∙ Michigan State University ∙ Snap Inc.

Graph self-supervised learning has gained increasing attention due to its capacity to learn expressive node representations. Many pretext tasks, or loss functions, have been designed from distinct perspectives. However, we observe that different pretext tasks affect downstream tasks differently across datasets, which suggests that searching over pretext tasks is crucial for graph self-supervised learning. Different from existing works that focus on designing single pretext tasks, this work aims to investigate how to automatically leverage multiple pretext tasks effectively. Nevertheless, evaluating representations derived from multiple pretext tasks without direct access to ground-truth labels makes this problem challenging. To address this obstacle, we make use of a key principle of many real-world graphs, i.e., homophily, or the principle that “like attracts like,” as guidance to effectively search over various self-supervised pretext tasks. We provide theoretical understanding and empirical evidence to justify the flexibility of homophily in this search task. We then propose the AutoSSL framework, which can automatically search over combinations of various self-supervised tasks. By evaluating the framework on 7 real-world datasets, we show that AutoSSL can significantly boost performance on downstream tasks, including node clustering and node classification, compared with training under individual tasks. Code will be released at https://github.com/ChandlerBang/AutoSSL.


1 Introduction

Graphs are pivotal data structures describing the relationships between entities in various domains such as social media, biology, transportation and financial systems wu2019comprehensive-survey; battaglia2018relational. Due to their prevalence and rich descriptive capacity, pattern mining and discovery on graph data is a prominent research area with powerful implications. As the generalization of deep neural networks to graph data, graph neural networks (GNNs) have proved to be powerful in learning representations for graphs and associated entities (nodes, edges, subgraphs), and they have been employed in various applications such as node classification kipf2016semi; gat, node clustering arvga, recommender systems pinsage and drug discovery duvenaud2015convolutional.

In recent years, the explosive interest in self-supervised learning (SSL) has suggested its great potential for empowering stronger neural networks in an unsupervised manner chen2020simple; kolesnikov2019revisiting; doersch2015unsupervised. Many self-supervised methods have also been developed to facilitate graph representation learning xie2021self-survey; jin2020graph, such as Dgi velickovic2018deep, Par/Clu you2020does and MvGRL hassani2020contrastive. Given graph and node attribute data, they construct pretext tasks, also called SSL tasks, based on structural and attribute information to provide self-supervision for training graph neural networks without accessing any labeled data. For example, the pretext task of Par is to predict the graph partitions of nodes. We examine how a variety of SSL tasks, including Dgi, Par, Clu, PairDis peng2020self and PairSim jin2020node, perform over 3 datasets. Their node clustering and node classification performance ranks are illustrated in Figures 1(a) and 1(b), respectively. From these figures, we observe that different SSL tasks have distinct downstream performance across datasets. This observation suggests that the success of SSL tasks strongly depends on the dataset and downstream task. Learning representations with a single task naturally ignores useful information from other tasks. As a result, searching over SSL tasks is crucial, which motivates us to study how to automatically compose a variety of graph self-supervised tasks to learn better node representations.

(a) Node Clustering
(b) Node Classification
(c) Combining Two Tasks
(d) AutoSSL
Figure 1: (a)(b): Performance of 5 SSL tasks ranked best (1) to worst (5) by color on node clustering and classification, showing disparate performance across datasets and tasks. (c): Clustering performance heatmap on Citeseer when combining 2 SSL tasks, PairSim and PairDis, with different weights. (d) AutoSSL’s search trajectory for task weights, achieving near-ideal performance.

However, combining multiple different SSL tasks for unlabeled representation learning is immensely challenging. Although promising results have been achieved in multi-task self-supervised learning for computer vision, most existing works assign equal weights to SSL tasks doersch2017multi; ren2018cross; zamir2018taskonomy. Such a combination might not always yield better performance than a single task, as different tasks have distinct importance depending on the specific dataset and downstream task. To illustrate this intuition, we combine two SSL tasks, PairDis and PairSim, with varied weights and illustrate the corresponding node clustering performance in Figure 1(c). It clearly indicates that different choices of weights yield different performance. To circumvent this problem, we could plausibly search over different weights for SSL tasks to optimize downstream tasks. However, to achieve this goal, we face two obstacles. First, the search space is huge, and thus search can be highly expensive; hence, it is desirable to automatically learn these weights. Second, searching for optimal task weights typically requires guidance from downstream performance, which is naturally missing under the unsupervised setting; thus, it is necessary to design an unsupervised surrogate evaluation measure that can guide the search process.

It is evident that many real-world graphs such as friendship networks, citation networks, co-authorship networks and co-purchase networks mcpherson2001birds; shchur2018pitfalls satisfy the homophily assumption, i.e., “like attracts like”, or that connected nodes tend to share the same label. This is useful prior knowledge, as it directly relates the label information of downstream tasks to the graph structure. In this work, we explicitly take advantage of this prior knowledge and assume that the predicted labels from good node embeddings should also adhere to homophily. Given the lack of ground-truth labels during SSL, we propose a pseudo-homophily measure to evaluate the quality of the node embeddings trained from a specific combination of SSL tasks. With pseudo-homophily, we are able to design an automated framework for SSL task search, namely AutoSSL. Our work makes three significant contributions:

  1. To bridge the gap between unsupervised representations and downstream labels, we propose pseudo-homophily to measure the quality of the representations. Moreover, for graphs with high homophily, we theoretically show that pseudo-homophily maximization can help maximize the upper bound of mutual information between pseudo-labels and downstream labels.

  2. Based on pseudo-homophily, we propose two strategies to efficiently search SSL tasks, one employing an evolutionary algorithm and the other performing differentiable search via meta-gradient descent. AutoSSL is able to adjust the task weights during search as shown in Figure 1(d).

  3. We evaluate the proposed AutoSSL by composing various individual tasks on 7 real-world datasets. Extensive experiments have demonstrated that AutoSSL can significantly improve the performance of individual tasks on node clustering and node classification (e.g., up to 10.0% relative improvement on node clustering).

2 Background and Related Work

Graph Neural Networks. Graph neural networks (GNNs) are powerful tools for extracting useful information from graph data liu2021elastic; wu2019comprehensive-survey; kipf2016semi; gat; hamilton2017inductive; vgae; arvga. They aim to learn a mapping function f_θ, parameterized by θ, that maps the input graph into a low-dimensional space. Most graph neural networks follow a message passing scheme gilmer2017neural, where the representation of a node is obtained by aggregating the representations of its neighbors together with its own representation.
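
To make the message-passing encoder concrete, below is a minimal sketch of a one-layer GCN encoder of the kind used later in our experiments, assuming PyTorch Geometric's GCNConv; the class name and the PReLU activation are illustrative choices, not the released implementation.

```python
import torch.nn as nn
from torch_geometric.nn import GCNConv  # assumes PyTorch Geometric is installed

class GCNEncoder(nn.Module):
    """One-layer GCN encoder f_theta: aggregate neighbor features, then transform."""
    def __init__(self, num_features: int, hidden_dim: int = 512):
        super().__init__()
        self.conv = GCNConv(num_features, hidden_dim)
        self.act = nn.PReLU(hidden_dim)

    def forward(self, x, edge_index):
        # x: [num_nodes, num_features]; edge_index: [2, num_edges]
        return self.act(self.conv(x, edge_index))
```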

Self-Supervised Learning in GNNs. Graph neural networks have achieved superior performance in various applications, but they also require costly task-dependent labels to learn rich representations. To alleviate the need for a huge amount of labeled data, recent studies have employed self-supervised learning in graph neural networks to provide additional supervision jin2020graph; velickovic2018deep; you2020does; hassani2020contrastive; hu2019strategies; qiu2020gcc; zhu2020graph-gca. Specifically, these SSL methods construct a pre-defined pretext task to assign pseudo-labels to unlabeled nodes/graphs and then train the model on the designed pretext task to learn representations. A recent work on graph contrastive learning, JOAO you2021graph, automatically selects data augmentations. We note that JOAO is designed for the contrastive learning framework on the graph classification task, while our proposed AutoSSL focuses on node-level applications and is more flexible in that it can be applied to various SSL tasks.

Multi-Task Self-Supervised Learning. Our work is also related to multi-task self-supervised learning doersch2017multi; ren2018cross; zamir2018taskonomy. However, most of these works assign equal weights to the tasks and perform training under the supervised setting. Our work differs from them by (1) learning different weights for different tasks and (2) not requiring access to labeled data.

Automated Loss Function Search. Tremendous efforts have been devoted to automating every aspect of machine learning applications yao2018taking; liu2018darts, such as feature engineering, model architecture search and loss function search. Among them, our work is most related to loss function search zhao2021autoloss; xu2018autoloss; wang2020loss; li2019lfs. However, these methods are developed under the supervised setting and are not applicable to self-supervised learning. Another related work is ELo piergiovanni2020evolving, which evolves multiple self-supervised losses based on Zipf distribution matching for action recognition. However, it is designed exclusively for image data and is not applicable to non-grid graph-structured data. The problem of self-supervised loss search for graphs remains rarely explored. To bridge this gap, we propose an automated framework for searching SSL losses for graph data in an unsupervised manner.

3 Automated Self-Supervised Task Search with AutoSSL

In this section, we present the proposed framework of automated self-supervised task search, namely AutoSSL. Given a graph G, a GNN encoder f_θ and a set of self-supervised losses (tasks) {ℓ_1, ℓ_2, ..., ℓ_K}, we aim at learning a set of loss weights {λ_1, λ_2, ..., λ_K} such that f_θ trained with the weighted loss combination can extract meaningful features from the given graph data. The key challenge is how to mathematically define “meaningful features”. If we had access to the labels of the downstream task, we could define “meaningful features” as the features (node embeddings) that achieve high performance on the given downstream task. Then we could simply adopt the downstream performance as the optimization goal and formulate the problem of automated self-supervised task search as follows:

\min_{\{\lambda_k\}} \; \mathcal{H}\big(f_{\theta^*}(G)\big) \quad \text{s.t.} \quad \theta^* = \arg\min_{\theta} \sum_{k=1}^{K} \lambda_k \, \ell_k(\theta)    (1)

where H(·) denotes the quality measure for the obtained node embeddings, and it can be any metric that evaluates the downstream performance, such as the cross-entropy loss for the node classification task. However, under the self-supervised setting, we do not have access to labeled data and thus cannot employ the downstream performance to measure the embedding quality. Instead, we need an unsupervised quality measure to evaluate the quality of the obtained embeddings. In a nutshell, one challenge of automated self-supervised learning is: how to construct the goal of automated task search without access to the label information of the downstream tasks.
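
To make the inner problem of Eq. (1) concrete, the following minimal sketch trains a shared encoder under a weighted sum of SSL losses; `ssl_losses` (a list of callables) and `lambdas` are assumed interfaces for illustration rather than the released code.

```python
import torch

def train_with_weighted_ssl(encoder, data, ssl_losses, lambdas, epochs=1000, lr=1e-3):
    """Inner problem of Eq. (1): fit theta under sum_k lambda_k * loss_k(theta)."""
    optimizer = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        total = sum(lam * loss_fn(encoder, data)          # weighted SSL losses
                    for lam, loss_fn in zip(lambdas, ssl_losses))
        total.backward()
        optimizer.step()
    return encoder
```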

3.1 Pseudo-Homophily

Most common graphs adhere to the principle of homophily, i.e., “birds of a feather flock together” mcpherson2001birds, which suggests that connected nodes often belong to the same class; e.g., connected publications in a citation network often have the same topic, and friends in social networks often share interests newman2018networks. Homophily is often calculated as the fraction of intra-class edges in a graph zhu2020beyond. Formally, it can be defined as follows.

Definition 1 (Homophily).

The homophily of a graph G = (V, E) with node label vector y is given by

h(G, y) = \frac{1}{|E|} \sum_{(v_i, v_j) \in E} \mathbb{1}(y_i = y_j)    (2)

where y_i indicates node v_i’s label and \mathbb{1}(·) is the indicator function.

We calculate the homophily for seven widely used datasets as shown in Appendix A and find that they all have high homophily, e.g., 0.93 for the Physics dataset. Considering the high homophily of those datasets, intuitively the predicted labels from the extracted node features should also have high homophily. Hence, the prior information of graph homophily in ground-truth labels can serve as strong guidance for searching combinations of self-supervised tasks. As mentioned before, in self-supervised tasks the ground-truth labels are not available. Motivated by DeepCluster caron2018deep, which uses the cluster assignments of learned features as pseudo-labels to train the neural network, we propose to calculate the homophily based on cluster assignments, which we term pseudo-homophily. Specifically, we first perform k-means clustering on the obtained node embeddings to get k clusters. Then the cluster results are used as the pseudo-labels to calculate homophily based on Eq. (2). Note that though many graphs in the real world have high homophily, there also exist heterophilous graphs zhu2020beyond; pei2020geom which have low homophily. We leave such graphs for future work.
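
The sketch below shows how pseudo-homophily can be computed in practice, assuming numpy/scikit-learn, an edge list of shape [2, num_edges], and learned node embeddings; the helper names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def homophily(edge_index: np.ndarray, labels: np.ndarray) -> float:
    """Eq. (2): fraction of edges whose two endpoints share the same label."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

def pseudo_homophily(edge_index: np.ndarray, embeddings: np.ndarray, k: int = 5) -> float:
    """Cluster the learned embeddings with k-means and treat cluster ids as pseudo-labels."""
    pseudo_labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    return homophily(edge_index, pseudo_labels)
```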

Theoretical analysis. In this work, we propose to achieve self-supervised task search via maximizing pseudo-homophily. To understand its rationale, we develop the following theorem to show that pseudo-homophily maximization is related to the upper bound of mutual information between pseudo-labels and ground-truth labels.

Theorem 1.

Suppose that we are given a graph G, a pseudo-label vector ŷ and a ground-truth label vector y defined on the node set. We denote the homophily of ŷ and y over G as h(ŷ) and h(y), respectively. If the classes in ŷ and y are balanced and h(ŷ) < h(y), the following results hold: (1) the mutual information between ŷ and y, i.e., MI(ŷ, y), has an upper bound that depends on the homophily gap h(y) − h(ŷ) and the largest node degree in the graph; (2) this upper bound decreases as the gap h(y) − h(ŷ) increases.

Proof. The detailed proof of this theorem can be found in Appendix B.

The above theorem suggests that a larger difference between pseudo-homophily and real homophily results in a lower upper bound on the mutual information between the pseudo-labels and ground-truth labels. Therefore, since we assume that the given graph has high homophily, maximizing pseudo-homophily maximizes the upper bound of mutual information between pseudo-labels and ground-truth labels.

3.2 Search Algorithms

In the last subsection, we demonstrated the importance of maximizing pseudo-homophily. Thus, in the optimization problem of Eq. (1), we can simply set H to be the negative pseudo-homophily. However, the evaluation of a specific task combination involves fitting a model and evaluating its pseudo-homophily, which can be highly expensive. Therefore, another challenge for automated self-supervised task search is how to design an efficient algorithm. In the following, we introduce the details of the search strategies designed in this work, i.e., AutoSSL-es and AutoSSL-ds.

3.2.1 AutoSSL-es: Evolutionary Strategy

Evolution algorithms are often used in automated machine learning, such as hyperparameter tuning, due to their inherently parallel nature loshchilov2016cma. In this work, we employ the covariance matrix adaptation evolution strategy (CMA-ES) hansen2003reducing, a state-of-the-art optimizer for continuous black-box functions, to evolve the combined self-supervised loss. We name this self-supervised task search approach AutoSSL-es. In each iteration, CMA-ES samples a set of candidate solutions (i.e., task weights {λ_k}) from a multivariate normal distribution and trains the GNN encoder under each combined loss function. The embeddings from the trained encoder are then evaluated by H. Based on H, CMA-ES adjusts the normal distribution to give higher probabilities to good samples that can potentially produce a lower value of H. Note that we constrain each λ_k to [0, 1] and sample a population of candidate combinations in each iteration, which is trivially parallelizable as every candidate combination can be evaluated independently.
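
A hedged sketch of this search loop is shown below, assuming the `cma` package; `fit_encoder` and `embed` are hypothetical helpers that train the encoder under a given weight vector and produce node embeddings, and `pseudo_homophily` is the sketch from Section 3.1.

```python
import numpy as np
import cma  # pip install cma

def autossl_es(data, ssl_losses, rounds=40, popsize=8):
    """Evolve task weights with CMA-ES to maximize pseudo-homophily (minimize its negative)."""
    es = cma.CMAEvolutionStrategy(0.5 * np.ones(len(ssl_losses)), 0.2,
                                  {'bounds': [0, 1], 'popsize': popsize})
    for _ in range(rounds):
        candidates = es.ask()                                  # sample task-weight vectors
        fitnesses = []
        for lambdas in candidates:
            encoder = fit_encoder(data, ssl_losses, lambdas)   # inner training (hypothetical helper)
            z = embed(encoder, data)                           # node embeddings (hypothetical helper)
            fitnesses.append(-pseudo_homophily(data.edge_index, z))
        es.tell(candidates, fitnesses)                         # update the sampling distribution
    return es.result.xbest                                     # best task weights found
```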

3.2.2 AutoSSL-ds: Differentiable Search via Meta-Gradient Descent

While the aforementioned AutoSSL-es

is parallelizable, the search cost is still expensive because it requires to evaluate a large population of candidate combinations where every evaluation involves fitting the model in large training epochs. Thus, it is desired to develop gradient-based search methods to accelerate the search process. In this subsection, we introduce the other variant of our proposed framework,

AutoSSL-ds, which performs differentiable search via meta-gradient descent. However, pseudo-homophily is not differentiable as it is based on hard cluster assignments from -means clustering. Next, we will first present how to make the clustering process differentiable and then introduce how to perform differentiable search.

Soft Clustering. Although k-means clustering makes hard assignments of data samples to clusters, it can be viewed as a special case of the Gaussian mixture model, which makes soft assignments based on posterior probabilities bishop2006pattern. Given a Gaussian mixture model with centroids {μ_1, ..., μ_k} and a fixed variance σ², we can calculate the probability of observing a feature vector given a centroid as follows:

p(x_i \mid \mu_j) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right)    (3)

where x_i is the feature vector of the i-th data sample. By Bayes rule and considering an equal prior, i.e., p(μ_j) = 1/k, we can compute the probability of a feature vector belonging to a cluster as:

p(\mu_j \mid x_i) = \frac{\exp\!\left(-\|x_i - \mu_j\|^2 / 2\sigma^2\right)}{\sum_{j'=1}^{k} \exp\!\left(-\|x_i - \mu_{j'}\|^2 / 2\sigma^2\right)}    (4)

If σ → 0, we obtain the hard assignments of the k-means algorithm. As we can see, the probability of each feature vector belonging to a cluster reduces to computing the distance between them. Then we can construct our homophily loss function as follows:

\mathcal{H}(\theta) = \frac{1}{|E|} \sum_{(v_i, v_j) \in E} \ell\big(p(\cdot \mid z_i),\, p(\cdot \mid z_j)\big)    (5)

where ℓ is a loss function measuring the difference between its inputs and p(· | z_i) is the soft assignment vector computed from node v_i’s embedding z_i produced by f_θ. With soft assignments, the gradient of H w.r.t. θ becomes tractable.
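
A minimal differentiable sketch of Eqs. (3)-(5) in PyTorch is given below, assuming fixed centroids `mu` (e.g., obtained from a periodic k-means run on the current embeddings) and the L1 distance as ℓ; names are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_assignments(z: torch.Tensor, mu: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Eq. (4): soft cluster assignments under equal priors. z: [N, d], mu: [k, d]."""
    sq_dist = torch.cdist(z, mu) ** 2                      # squared distances, [N, k]
    return F.softmax(-sq_dist / (2 * sigma ** 2), dim=1)

def homophily_loss(z: torch.Tensor, mu: torch.Tensor, edge_index: torch.Tensor,
                   sigma: float = 1.0) -> torch.Tensor:
    """Eq. (5) with an L1 difference: connected nodes should receive similar assignments."""
    p = soft_assignments(z, mu, sigma)
    src, dst = edge_index                                  # edge_index: [2, |E|]
    return (p[src] - p[dst]).abs().sum(dim=1).mean()
```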

Search via Meta Gradient Descent. We now detail the differentiable search process of AutoSSL-ds. A naive method to tackle bilevel problems is to alternately optimize the inner and outer problems through gradient descent. However, we cannot directly perform gradient descent for the outer problem in Eq. (1), where H depends on the task weights λ only through the trained parameters θ*. To address this issue, we can utilize meta-gradients, i.e., gradients w.r.t. hyperparameters, which have been widely used in solving bi-level problems in meta learning finn2017model; zugner2019adversarial. To obtain meta-gradients, we need to backpropagate through the learning phase of the neural network. Concretely, the meta-gradient of H with respect to λ is expressed as

\nabla_{\lambda}^{\text{meta}} \mathcal{H} := \nabla_{\lambda}\, \mathcal{H}\big(f_{\theta^*}(G)\big) \quad \text{s.t.} \quad \theta^* = \text{opt}_{\theta}\Big(\textstyle\sum_{k} \lambda_k \ell_k(\theta)\Big)    (6)

where opt_θ stands for the inner optimization that obtains θ*, typically multiple steps of gradient descent. As an illustration, we consider opt_θ as T steps of vanilla gradient descent with learning rate α,

\theta_{t+1} = \theta_t - \alpha \nabla_{\theta_t} \sum_{k} \lambda_k \ell_k(\theta_t), \quad t = 0, 1, \ldots, T-1    (7)

By unrolling the training procedure, we can express the meta-gradient as

\nabla_{\lambda}^{\text{meta}} \mathcal{H} = \nabla_{\lambda}\, \mathcal{H}\big(f_{\theta_T}(G)\big) = \nabla_{f} \mathcal{H}\big(f_{\theta_T}(G)\big) \cdot \nabla_{\lambda} f_{\theta_T}(G)    (8)

with θ_T obtained from Eq. (7). Since f_{θ_T} depends on λ only through θ_T, we have

\nabla_{\lambda} f_{\theta_T}(G) = \nabla_{\theta_T} f_{\theta_T}(G) \cdot \nabla_{\lambda} \theta_T, \qquad \nabla_{\lambda} \theta_T = \nabla_{\lambda} \theta_{T-1} - \alpha \nabla_{\lambda} \nabla_{\theta_{T-1}} \sum_{k} \lambda_k \ell_k(\theta_{T-1})    (9)

Note that θ_{T−1} also depends on the task weights (see Eq. (7)). Thus, its derivative w.r.t. the task weights chains back until θ_0. By unrolling all the inner optimization steps, we can obtain the meta-gradient and use it to perform gradient descent on λ to reduce H:

\lambda \leftarrow \lambda - \eta\, \nabla_{\lambda}^{\text{meta}} \mathcal{H}    (10)

where η is the learning rate for meta-gradient descent (outer optimization).

However, if we used the whole training trajectory to calculate the precise meta-gradient, it would have an extremely high memory footprint since we would need to store all intermediate parameters θ_0, ..., θ_T in memory. Thus, inspired by DARTS liu2018darts, we use an online updating rule where we only perform one step of gradient descent on θ and then update λ in each iteration. Note that under this training paradigm our aim is not to automatically find the best task weights, as the weights change at each iteration. Instead, we target generating appropriate gradients to update the GNN parameters such that H is minimized. During the process, we constrain each λ_k to [0, 1] and dynamically adjust the task weights in a differentiable manner. The detailed algorithm for AutoSSL-ds is summarized in Appendix C.
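
The sketch below illustrates this online updating rule with a one-step unrolled meta-gradient computed via plain autograd; `forward_with(params, data)` is a hypothetical functional forward pass, `homophily_loss` is the sketch above, and the SSL losses are assumed to be callables of the parameters.

```python
import torch

def online_update(params, lambdas, ssl_losses, data, mu, alpha=1e-3, eta=0.05):
    """One differentiable step on theta (Eq. (7), T=1), then a meta-gradient step on lambda (Eq. (10))."""
    inner = sum(lam * loss_fn(params, data)                # weighted inner loss
                for lam, loss_fn in zip(lambdas, ssl_losses))
    grads = torch.autograd.grad(inner, params, create_graph=True)
    new_params = [p - alpha * g for p, g in zip(params, grads)]   # differentiable update of theta
    z = forward_with(new_params, data)                     # embeddings under the updated theta
    outer = homophily_loss(z, mu, data.edge_index)         # outer objective H
    meta_grad = torch.autograd.grad(outer, lambdas)[0]     # Eqs. (8)-(9), unrolled one step
    with torch.no_grad():
        lambdas -= eta * meta_grad                         # Eq. (10)
        lambdas.clamp_(0.0, 1.0)                           # keep weights in [0, 1]
    # detach theta so the next iteration starts a fresh one-step graph
    return [p.detach().requires_grad_() for p in new_params], lambdas
```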

4 Experimental Evaluation

In this section, we empirically evaluate the effectiveness of the proposed AutoSSL on self-supervised task search on real-world datasets. We aim to answer four questions as follows. Q1: Can AutoSSL achieve better performance compared to training on individual SSL tasks? Q2: How does AutoSSL compare to other unsupervised and supervised node representation learning baselines? Q3: Can we observe relations between AutoSSL’s pseudo-homophily objective and downstream classification performance? and Q4: How do the SSL task weights, pseudo-homophily objective, and downstream performance evolve during AutoSSL’s training?

4.1 Experimental Setting

Since our goal is to enable automated combination search and discovery of SSL tasks, we use 5 such tasks, including a contrastive learning method and predictive methods: (1) Dgi velickovic2018deep: a contrastive learning method maximizing the mutual information between the graph representation and node representations; (2) Par you2020does: it predicts partition pseudo-labels from the Metis graph partition karypis1998fast-metis; (3) Clu you2020does: it predicts cluster pseudo-labels from k-means clustering on node features; (4) PairSim jin2020node: it predicts pairwise feature similarity between node pairs; and (5) PairDis peng2020self: it predicts the shortest path length between node pairs. The proposed AutoSSL framework automatically learns to jointly leverage the 5 above tasks and carefully mediate their influence. We also note that (1) the recent contrastive learning method MvGRL hassani2020contrastive needs to deal with a dense diffusion matrix and is prohibitively memory/time-consuming for larger graphs; thus, we only include it as a baseline for comparison as shown in Table 2; and (2) the proposed framework is general, and it is straightforward to combine other SSL tasks.

We perform experiments on 7 real-world datasets widely used in the literature yang2016revisiting; shchur2018pitfalls; mernyei2020wiki, i.e., Physics, CS, Photo, Computers, WikiCS, Citeseer and CoraFull. To demonstrate the effectiveness of the proposed framework, we follow hassani2020contrastive and evaluate all methods on two different downstream tasks: node clustering and node classification. For the node clustering task, we perform k-means clustering on the obtained embeddings, set the number of clusters to the number of ground-truth classes, and report the normalized mutual information (NMI) between the cluster results and the ground-truth labels. For the node classification task, we train a logistic regression model on the obtained node embeddings and report the classification accuracy on test nodes.

Note that labels are never used for self-supervised task search. All experiments are performed under 5 different random seeds and results are averaged. Following Dgi and MvGRL, we use a simple one-layer Gcn kipf2016semi as our encoder and set the size of hidden dimensions to 512. We set the number of clusters k to 5 and use the L1 loss in the homophily loss function throughout the experiments. Further details of dataset statistics, data splits, and hyperparameter settings can be found in Appendix A.
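
For reference, the evaluation protocol just described can be sketched with scikit-learn as follows; function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import normalized_mutual_info_score, accuracy_score

def evaluate_embeddings(z: np.ndarray, y: np.ndarray, train_idx, test_idx):
    """Node clustering (NMI, #clusters = #classes) and node classification (logistic regression)."""
    pred = KMeans(n_clusters=len(np.unique(y)), n_init=10).fit_predict(z)
    nmi = normalized_mutual_info_score(y, pred)
    clf = LogisticRegression(max_iter=1000).fit(z[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], clf.predict(z[test_idx]))
    return nmi, acc
```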

Dataset Metric Self-Supervised Task AutoSSL
Clu Par PairSim PairDis Dgi ES DS
WikiCS NMI
ACC
P-H
Citeseer NMI
ACC
P-H
Computers NMI
ACC
P-H
CoraFull NMI
ACC
P-H
CS NMI
ACC
P-H
Physics NMI
ACC
P-H
Photo NMI
ACC
P-H
Table 1: Performance comparison of self-supervised tasks (losses) on node clustering and node classification. The NMI rows indicate node clustering performance; ACC rows indicate node classification accuracy (%); P-H stands for pseudo-homophily. AutoSSL regularly outperforms individual pretext tasks. (Bold: best in all methods; Underline: best in individual tasks).

4.2 Performance Comparison with Individual Tasks

To answer Q1, Table 1 summarizes the results for individual self-supervised tasks and AutoSSL under the two downstream tasks, i.e., node clustering and node classification. From the table, we make several observations. Obs. 1: Individual self-supervised tasks have different node clustering and node classification performance on different datasets. For example, on Photo, Dgi achieves the highest classification accuracy while Par achieves the highest clustering performance; Clu performs better than PairDis in both NMI and ACC on Physics, while it cannot outperform PairDis on WikiCS, Citeseer, Computers and CoraFull. This observation suggests the importance of searching suitable SSL tasks to benefit downstream tasks on different datasets. Obs. 2: In most cases, combinations of SSL tasks searched by AutoSSL improve the node clustering and classification performance over the best individual task across all datasets. For example, the relative improvement over the best individual task on NMI from AutoSSL-es is 7.3% for WikiCS and 10.0% for Photo, and its relative improvement on ACC is 1.3% for WikiCS. These results indicate that composing multiple SSL tasks can help the model encode different types of information and avoid overfitting to one single task. Obs. 3: We further note that individual tasks result in different pseudo-homophily, as shown in the P-H rows of Table 1. Among them, Clu tends to result in low pseudo-homophily and often performs much worse than other tasks on the node clustering task, which supports our theoretical analysis in Section 3.1. It also demonstrates the necessity of increasing pseudo-homophily, as the two variants of AutoSSL effectively search tasks that lead to higher pseudo-homophily. Obs. 4: The performance of AutoSSL-es and AutoSSL-ds is close when their searched tasks lead to similar pseudo-homophily: the differences in pseudo-homophily, NMI and ACC are relatively small on datasets other than Photo and Computers. It is worth noting that sometimes AutoSSL-ds can even achieve higher pseudo-homophily than AutoSSL-es. This indicates that the online updating rule for λ in AutoSSL-ds can not only greatly reduce the search time but also generate good task combinations. In addition to efficiency, we highlight another major difference between them: AutoSSL-es directly finds the best task weights, while AutoSSL-ds adjusts the task weights to generate appropriate gradients to update the model parameters. Hence, if we hope to find the best task weights and retrain the model, we should turn to AutoSSL-es. More details on their differences can be found in Appendix D.2.

4.3 Performance Comparison with Supervised and Unsupervised Baselines

To answer Q2, we compare AutoSSL with representative unsupervised and supervised node representation learning baselines. Specifically, for node classification we include 4 unsupervised baselines, i.e., Gae vgae, Vgae vgae, Arvga arvga and MvGRL, and 2 supervised baselines, i.e., Gcn and Gat gat. We also provide the logistic regression performance on raw features and on embeddings generated by a randomly initialized encoder, named Raw-Feat and Random-Init, respectively. Note that the two supervised baselines, Gcn and Gat, use label information for node representation learning in an end-to-end manner, while the other baselines and AutoSSL do not leverage label information to learn representations. The average performance and variances are reported in Table 2. From the table, we find that AutoSSL outperforms the unsupervised baselines on all datasets except Citeseer, where its performance is still comparable to the state-of-the-art contrastive learning method MvGRL. When compared to supervised baselines, AutoSSL-ds outperforms Gcn and Gat on 4 out of 7 datasets, e.g., a 1.7% relative improvement over Gat on Computers. AutoSSL-es also outperforms Gcn and Gat on 3/4 out of 7 datasets. In other words, our unsupervised representation learning framework AutoSSL can achieve comparable performance with supervised representation learning baselines. In addition, we use the same unsupervised baselines for node clustering and report the results in Table 3. Both AutoSSL-es and AutoSSL-ds show highly competitive clustering performance. For instance, AutoSSL-es achieves 22.2% and 27.5% relative improvement over the second best baseline on Physics and WikiCS; AutoSSL-ds also achieves 22.2% and 19.8% relative improvement on these two datasets. These results further validate that composing SSL tasks appropriately can produce expressive and generalizable representations.

Model WikiCS Citeseer Computers CoraFull CS Physics Photo
Random-Init
Raw-Feat
Gae
Vgae
Arvga
MvGRL
AutoSSL-es
AutoSSL-ds
Gcn
Gat
Table 2: Node classification accuracy (%). The last two rows are supervised baselines. AutoSSL consistently outperforms alternative self-supervised approaches, and frequently outperforms supervised ones. (Bold: best; Underline: runner-up).
Model WikiCS Citeseer Computers CoraFull CS Physics Photo
Random-Init
Raw-Feat
Gae
Vgae
Arvga
MvGRL
AutoSSL-es
AutoSSL-ds
Table 3: Clustering performance (NMI). AutoSSL embeddings routinely exhibit superior NMI to alternatives. (Bold: best; Underline: runner-up).

4.4 Relation between Downstream Performance and Pseudo-Homophily

In this subsection, we investigate the relation between downstream performance and pseudo-homophily and correspondingly answer Q3. Specifically, we use the candidate task weights sampled in the AutoSSL-es search trajectory, and illustrate their node clustering (NMI) and node classification (ACC) performance with respect to their pseudo-homophily. The results on Computers and WikiCS are shown in Figure 2 and results for other datasets are shown in Appendix D.1. We observe that the downstream performance tends to be better when the learned embeddings have higher pseudo-homophily. We can also observe that clustering performance has a clear relation with pseudo-homophily for all datasets. Hence, the results empirically support our theoretical analysis in Section 3.1 that lower pseudo-homophily leads to a lower upper bound on the mutual information with ground-truth labels. While classification accuracy shows a less evident pattern, we can still observe that higher accuracy tends to concentrate in the high pseudo-homophily regions for 5 out of 7 datasets.

(a) Computers: NMI
(b) WikiCS: NMI
(c) Computers: ACC
(d) WikiCS: ACC
Figure 2: Relationship between downstream performance and pseudo-homophily.

4.5 Evolution of SSL Task Weights, Pseudo-Homophily and Downstream Performance

Visualization of Learned Task Weights. To answer Q4, we visualize the final task weights searched by AutoSSL-es on all datasets through the heatmap in Figure 3(a). From the figure, we make three observations. Obs. 1: The searched task weights vary significantly from dataset to dataset. For example, the weights for Par and Dgi are [0.198, 0.980] on Physics while they are [0.955, 0.066] on WikiCS. Obs. 2: In general, Par benefits co-purchase networks, i.e., Photo and Computers; Dgi is crucial for citation/co-authorship networks, i.e., Physics, CS, Citeseer, and CoraFull. We conjecture that local structure information (the information that Par captures) is essential for co-purchase networks, while both local and global information (the information that Dgi captures) is necessary for citation/co-authorship networks. Obs. 3: AutoSSL-es always gives very low weights to Clu. A possible reason is that the pseudo-labels clustered from raw features do not provide good supervision on the selected datasets.

We also show the evolution of task weights in AutoSSL-ds on the CoraFull dataset in Figure 3(b). The weights of the 5 tasks eventually become stable: Par and PairDis are assigned small values while PairSim, Dgi and Clu are assigned large values. Thus, both AutoSSL-es and AutoSSL-ds agree that PairDis and Par are less important for CoraFull.

Figure 3: Visualization of task weights: (a) AutoSSL-es, (b) AutoSSL-ds.
Figure 4: Pseudo-homophily change of AutoSSL-es on (a) CoraFull and (b) Citeseer.

Pseudo-Homophily Over Iterations. We further investigate how pseudo-homophily changes over iterations. For AutoSSL-es, we illustrate the mean value of the resulting pseudo-homophily in each iteration (round) in Figure 4. We only show the results on CoraFull and Citeseer, as similar patterns are exhibited on other datasets. It is clear that AutoSSL-es can effectively increase the pseudo-homophily and thus search for better self-supervised task weights. For AutoSSL-ds, we plot the changes of pseudo-homophily, NMI, ACC and homophily loss (Eq. (5)) in Figure 5. From Figures 5(a) and 5(b), we can observe that pseudo-homophily first increases and then becomes stable over iterations. The situation is a bit different for clustering and classification performance: NMI and ACC first increase with the increase of pseudo-homophily and then drop when pseudo-homophily is relatively stable. This indicates that overtraining can hurt downstream performance, as the model risks overfitting to the combined SSL tasks. However, as shown in the figure, if we stop at the iteration when pseudo-homophily reaches its maximum value, we can still obtain high NMI and ACC. On a separate note, Figure 5(c) shows how the homophily loss used in AutoSSL-ds changes over iterations. We note that in the first iterations the homophily loss is low but the pseudo-homophily is also low. This is because the embeddings in the first few epochs are less separable and lead to very similar soft cluster assignments. As shown in the figure, however, this problem is resolved as the embeddings become more distinguishable over iterations. Thus, we argue that the homophily loss in Eq. (5) is still a good proxy for optimizing pseudo-homophily.

(a) Clustering
(b) Classification
(c) Homophily loss
Figure 5: Pseudo-homophily versus NMI/ACC/Loss on Citeseer for AutoSSL-ds. The vertical dashed line indicates the iteration when pseudo-homophily reaches the maximum value.

5 Conclusion

Graph self-supervised learning has achieved great success in learning expressive node/graph representations. In this work, however, we find that SSL tasks designed for graphs perform differently on different datasets and downstream tasks. Thus, it is worthwhile to compose multiple SSL tasks to jointly encode multiple sources of information and produce more generalizable representations. However, without access to labeled data, measuring the quality of combinations of SSL tasks poses a great challenge. To address this issue, we take advantage of graph homophily and propose pseudo-homophily to measure the quality of combinations of SSL tasks. We then theoretically show that maximizing pseudo-homophily can help maximize the upper bound of mutual information between the pseudo-labels and ground-truth labels. Based on the pseudo-homophily measure, we develop two automated frameworks, AutoSSL-es and AutoSSL-ds, to search the task weights efficiently. Extensive experiments have demonstrated that AutoSSL is able to produce more generalizable representations by combining various SSL tasks.

References

Appendix A Experimental Setup

Dataset Statistics. We evaluate the proposed framework on seven real-world datasets. The dataset statistics are shown in Table 4. All datasets can be loaded from PyTorch Geometric [fey2019fast-pyg]. When we evaluate the node classification performance, we need to use the training and test data. For WikiCS [mernyei2020wiki] and Citeseer [yang2016revisiting], we use the public data splits provided by the authors. For other datasets, we split the nodes into 10%/10%/80% for training/validation/test.

Dataset Network Type #Nodes #Edges #Classes #Features Homophily
WikiCS Reference network 11,701 216,123 10 300 0.70
CS Co-authorship network 18,333 81,894 15 6,805 0.81
Physics Co-authorship network 34,493 247,962 5 8,415 0.93
Computers Co-purchase network 13,381 245,778 10 767 0.78
Photo Co-purchase network 7,487 119,043 8 745 0.83
CoraFull Citation network 19,793 65,311 70 8,710 0.57
Citeseer Citation network 3,327 4,732 6 3,703 0.74
Table 4: Dataset statistics.

Hyper-parameter Settings. When calculating pseudo-homophily, we set the number of clusters to 5 for all datasets. Following Dgi [velickovic2018deep] and MvGRL [hassani2020contrastive], we use a simple one-layer Gcn [kipf2016semi] as our encoder. We set the size of hidden dimensions to 512, weight decay to 0, and dropout rate to 0. For individual SSL methods and AutoSSL-es, we set the learning rate to 0.001, use the Adam optimizer [kingma2014adam], train the models for 1000 epochs, and adopt an early stopping strategy with a patience of 20 epochs. For AutoSSL-ds, we train the models for 1000 epochs and choose the model checkpoint that achieves the highest pseudo-homophily. We use the Adam optimizer for both inner and outer optimization. The learning rate for outer optimization is set to 0.05. For AutoSSL-es, we use a population size of 8 for each round. Due to limited computational resources, we perform 80 rounds for Citeseer and 40 rounds for CS, Computers, CoraFull, Photo, Physics and WikiCS. We repeat the experiments with 5 different random seeds and report the mean values and variances of downstream performance. To fit Dgi into GPU memory on larger datasets and accelerate its training, instead of using all the nodes we sample 2000 positive samples and 2000 negative samples for Dgi on all datasets except Citeseer.
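
For convenience, the stated hyper-parameters can be collected into a single configuration dictionary; the key names below are illustrative, and the values come from the settings above.

```python
CONFIG = {
    'encoder': 'GCN (one layer)',
    'hidden_dim': 512,
    'weight_decay': 0.0,
    'dropout': 0.0,
    'lr': 1e-3,               # individual SSL methods and AutoSSL-es
    'epochs': 1000,
    'patience': 20,           # early stopping for individual tasks / AutoSSL-es
    'outer_lr': 0.05,         # AutoSSL-ds meta-gradient (outer) learning rate
    'num_clusters': 5,        # k used for pseudo-homophily
    'es_popsize': 8,          # CMA-ES population size per round
    'dgi_num_samples': 2000,  # positive/negative samples for Dgi on larger datasets
}
```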

Hardware and Software Configurations. We perform experiments on one NVIDIA Tesla K80 GPU and one NVIDIA Tesla V100 GPU. Additionally, we use eight CPUs, with the model name as Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz. The operating system we use is CentOS Linux 7 (Core).

Appendix B Proof

Theorem 1. Suppose that we are given a graph G, a pseudo-label vector ŷ and a ground-truth label vector y defined on the node set. We denote the homophily of ŷ and y over G as h(ŷ) and h(y), respectively. If the classes in ŷ and y are balanced and h(ŷ) < h(y), the following results hold: (1) the mutual information between ŷ and y, i.e., MI(ŷ, y), has an upper bound that depends on the homophily gap h(y) − h(ŷ) and the largest node degree in the graph; (2) this upper bound decreases as the gap h(y) − h(ŷ) increases.

Proof.

(1) We start with the proof of the first result. The mutual information between two random variables Y_1 and Y_2 is expressed as

MI(Y_1, Y_2) = \sum_{y_1} \sum_{y_2} P(y_1, y_2) \log \frac{P(y_1, y_2)}{P(y_1)\,P(y_2)}    (11)

Let and denote the set of nodes in the -th class in and , respectively. Following the definition in Eq. (11), the mutual information between and can be formulated as,

(12)

where denote the number of classes in and . Since here we only consider 2 classes in and , we have

(13)

Let , and . We then have

(14)

With the equations above, we rewrite as follows,

(15)

Then we rewrite result (1) in the theorem as an optimization problem,

(16)

with constraints,

(17)

Note that the equality of holds when and are the same. However, and have different homophily, which indicates cannot reach (the same for ). Let denote the inter-class edges for and , respectively. Thus, and . Since , we have . This indicates that there are at least edges in connecting nodes that belong to the same ground truth class, as shown in Figure 6.

Figure 6: Illustration for . The two dashed rectangles divide the nodes into and ; red and blue nodes denote nodes in and , respectively.

Let denote the maximum degree in the graph and we know that at least nodes are “misplaced” in , e.g., in Figure 6 the red node in should be placed in to achieve . Let , and we have and .

With the new constraints, we rewrite the optimization problem as

(18)

Further, the derivative of is expressed as follows,

(19)

Let , we have ; let , we have . Thus, is monotonically decreasing at and monotonically increasing .

Note that in the theorem we assume the pseudo-labels and ground truth classes are balanced, i.e., . Then becomes,

(20)

Hence, is monotonically decreasing at and monotonically increasing . So the maximum value of is at either or . Further it is easy to know that . Then we have , and we can get the maximum value of as follows,

(21)

with . In other words, reaches its upper bound when or .

(2) From the constraints and in Eq. (18), we have . Based on the discussion in (1), we know that is monotonically decreasing at , which means an increase of leads to a decrease in , i.e., a smaller value of . Since , a decrease in will lead to an increase in . Then we have if .

Remark on a more generalized case. We now discuss the case where we do not have assumptions on and . As we demonstrated in the above discussion, is monotonically decreasing at and monotonically increasing . Thus, the maximum value of should be one of the values of and . As our goal is to show that would be small with low , to simplify the analysis, we consider a large value of (or a small value of ) which satisfies and . This indicates is bounded by . Then the maximum value of , i.e., , is expressed as

(22)

When , it is easy to see that larger (or smaller ) will lead to smaller because both