
Self-Supervised Graph Representation Learning via Global Context Prediction

To take full advantage of fast-growing unlabeled networked data, this paper introduces a novel self-supervised strategy for graph representation learning by exploiting natural supervision provided by the data itself. Inspired by human social behavior, we assume that the global context of each node is composed of all nodes in the graph since two arbitrary entities in a connected network could interact with each other via paths of varying length. Based on this, we investigate whether the global context can be a source of free and effective supervisory signals for learning useful node representations. Specifically, we randomly select pairs of nodes in a graph and train a well-designed neural net to predict the contextual position of one node relative to the other. Our underlying hypothesis is that the representations learned from such within-graph context would capture the global topology of the graph and finely characterize the similarity and differentiation between nodes, which is conducive to various downstream learning tasks. Extensive benchmark experiments including node classification, clustering, and link prediction demonstrate that our approach outperforms many state-of-the-art unsupervised methods and sometimes even exceeds the performance of supervised counterparts.





1 Introduction

Graph representation learning has attracted a great deal of interest from researchers in recent years. Learning effective node representations can benefit a variety of practical downstream tasks, e.g., classification [1], community detection [30], and graph alignment [9]. Compared to many well-performing supervised algorithms [10, 29], unsupervised methods [7, 27] have a definite advantage: they are free from expensive manual labeling effort and can therefore fully utilize a vast amount of unlabeled data. However, despite this empirical success, what should be learned has been a central issue for unsupervised learning. In the absence of handcrafted annotation, designing an appropriate objective function that yields desirable node representations is a challenging problem.

Figure 1: A toy example of our self-supervised task: predicting the contextual position of one node relative to another.

Fortunately, self-supervised learning [13], as a branch of unsupervised learning, empowers us to train on unlabeled data with free supervised signals obtained from the data itself. It has been successfully applied to image and video data [2, 4]. By training on various pretext tasks, such as predicting the rotation angle of an image or inferring the correct temporal order of a sequence of video frames, useful latent vectors can be learned from unlabeled data in a supervised manner, which helps achieve desirable performance on relevant tasks such as object detection and classification.

A natural question is: can we also get free supervision from graph-structured data? After much deliberation, our answer is positive. Recall a common phenomenon in a social network, as illustrated in Figure 1: you are more likely to interact with your direct friends than with your friends' friends, but it is also possible that your friends may be influenced by their other friends and then affect you, due to the link structure of the network. In this sense, for a node $v_i$ in a graph $\mathcal{G}$, all other nodes in $\mathcal{G}$ constitute its context, as any of them may interact with $v_i$ through a path of varying length (typically, the length of a path refers to the number of edges it traverses, also known as the hop count; if there is no path between two nodes, their distance is infinite). Since such context covers the global topology of the graph, we call it the global context. Nevertheless, effectively capturing the global structure of a graph remains a challenging issue. Although existing graph neural network methods such as graph convolutional networks (GCNs) [15] can stack multiple layers to capture high-order relations between nodes, they suffer from over-smoothing as the number of layers increases [17], as illustrated in Figure 2. Furthermore, it is difficult to choose an appropriate number of stacked layers. In this work, we propose to use the length of a path, i.e., the hop count, to characterize the global context. The path length naturally and faithfully reflects the extent of similarity between two nodes: the shorter the path, the stronger the interaction between them. More importantly, such supervisory information can be obtained for free from the graph data, making it possible to learn node representations in a self-supervised fashion.

This paper provides a self-supervised graph representation learning framework, SGRL, which involves predicting the relative contextual position for a pair of nodes in a graph. In particular, given two arbitrary nodes, the task is to train a neural net to infer the contextual position of one node relative to the other. For instance, in Figure 1, the neural net should be able to answer: is one node one hop away from the other, or two or more hops away? To perform well on this task, the learned node representations must encode global topological information while being capable of discriminating the similarity and dissimilarity between pairs of nodes. The main contributions of our work are summarized as follows:

  • We make the first attempt to investigate a natural supervisory signal hidden in graph-structured data, i.e., hop count, and exploit this signal to learn node representations on unlabeled datasets in a self-supervised manner.

  • We propose an effective self-supervised learning framework SGRL that trains a neural net to predict the relative contextual position between pairs of nodes, which learns global-context-aware node representations.

  • We conduct extensive experiments to evaluate SGRL on three common learning tasks. The results show that it exhibits competitive performance compared with state-of-the-art unsupervised methods and sometimes even outperforms some strong supervised baselines.

2 Related Work

Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision to train a pretext task. The key is to automatically generate supervisory signals based on the data, which guides the learning algorithm to capture the underlying patterns of the data. As a general technique, self-supervised learning finds many applications, ranging from language modeling [28] to robotics [11]. Notably, it has been widely used on image and video data to learn useful visual features; various pretext tasks have been proposed, such as cross-channel prediction, spatial context prediction, colorization, and watching objects move [13]. Although self-supervision has been successfully applied in many areas, it remains unclear whether it works in the graph domain. The goal of this paper is to investigate its effectiveness for learning on graph-structured data.

Graph representation learning is an important task and has become a research hot-spot in recent years. In general, existing approaches are broadly categorized as (1) factorization-based [23], (2) random walk-based [22], and (3) neural network-based [18].

Recently, the graph convolutional network (GCN) [15] and its many variants have become the dominant approach in graph modeling, thanks to graph convolution, which effectively fuses graph topology and node features. However, the majority of these methods [26, 31, 29] require external guidance, i.e., annotated labels, which limits their applicability. In contrast, unsupervised algorithms [8, 7, 27] do not require any external labels, but their performance is often not comparable to that of supervised counterparts. Some unsupervised methods require advanced knowledge and sophisticated design to ensure their models can learn useful node representations without explicit supervision. Fortunately, self-supervised learning opens up an opportunity for better utilizing the abundant unlabeled data. A recently proposed multi-stage self-supervised framework, M3S [25], has been shown empirically successful. However, in essence, M3S does not get rid of external guidance, as it still requires a few initial labels as the basis for subsequently enlarging the label set. To make better use of unlabeled data, in this work we propose a novel self-supervised formulation to learn node representations on graphs without any external labels.

Figure 2: Results of an unsupervised baseline formulated by stacking a varying number of graph convolutional layers, on node classification (left) and clustering (right) tasks. Initially the performance improves as the number of layers increases, but more layers lead to over-smoothing and performance decay.

3 Methodology

Figure 3: The proposed self-supervised framework SGRL for learning node representations over graph-structured data.

3.1 Problem Formulation

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph with $N$ nodes and $M$ edges, where each node $v_i \in \mathcal{V}$ is affiliated with a set of $F$-dimensional attributes (features). $X \in \mathbb{R}^{N \times F}$ records the attribute information of nodes, where $x_i$ represents the feature vector of node $v_i$ for $i = 1, \dots, N$. Moreover, nodes are interconnected to form edges, represented by an adjacency matrix $A \in \mathbb{R}^{N \times N}$. We aim to learn an encoder $f_\theta$ that projects each node to a $d$-dimensional space under the guidance of natural supervision obtained automatically from the input graph itself instead of external annotated labels, such that the nodes are represented in the global context as $Z = f_\theta(X, A) \in \mathbb{R}^{N \times d}$. Formally, these free supervisory signals function as pseudo-labels $\hat{Y}$ to train the encoder $f_\theta$ by solving

$$\min_{\theta, \phi} \; \mathcal{L}\big(g_\phi(f_\theta(X, A)), \hat{Y}\big), \tag{1}$$

where $g_\phi$ is a classifier to predict the pseudo-labels. Note that the labels are important in the learning procedure since they determine what should be captured and represented in the latent vectors. In this sense, it is possible to construct specific pseudo-labels such that desired information can be encoded in the node representations.

3.2 Global Context of a Node

Many studies [22, 6, 23] have found that the interaction between nodes is not limited to their direct connection, i.e., the observed first-order proximity; complementary high-order relations can capture more underlying information and deepen our understanding of graphs. Thus, we assume that all nodes in $\mathcal{G}$ constitute the global context of node $v_i$, as any other node could interact with it through a path; this is far more comprehensive than a context specified by the limited window size in random-walk-based algorithms. Formally, such a global context of $v_i$ is denoted as $\mathcal{C}_i = \{v_j \mid v_j \in \mathcal{V}, j \neq i\}$. To encode the global information, we estimate the likelihood of predicting its context given one arbitrary node $v_i$ in $\mathcal{G}$, i.e.,

$$p(\mathcal{C}_i \mid v_i). \tag{2}$$
To learn representations, we introduce the graph encoder $f_\theta$ into Eq. (2), which then defines a probability distribution of node co-occurrence, yielding the optimization problem of maximizing the log-probability

$$\max_\theta \; \sum_{v_i \in \mathcal{V}} \log p\big(\mathcal{C}_i \mid f_\theta(v_i)\big). \tag{3}$$

We then factorize the objective function of optimization problem (3) based on an independence assumption [6] as follows:

$$\log p\big(\mathcal{C}_i \mid f_\theta(v_i)\big) = \sum_{v_j \in \mathcal{C}_i} \log p\big(v_j \mid f_\theta(v_i)\big). \tag{4}$$
For the conditional likelihood of each node pair $v_i$ and $v_j$, a typical solution is to define it as a softmax function,

$$p\big(v_j \mid f_\theta(v_i)\big) = \frac{\exp\big(f_\theta(v_j)^\top f_\theta(v_i)\big)}{\sum_{v_k \in \mathcal{V}} \exp\big(f_\theta(v_k)^\top f_\theta(v_i)\big)}, \tag{5}$$

and then adopt a specific classifier to learn such a posterior distribution, e.g., employing logistic regression to predict the context [20]. But such models result in a number of categories equal to $N$, consuming vast computational resources. Moreover, under our global context assumption, the classifier cannot be trained to work properly, as all target classes will be positive. Hence, we propose another strategy, introduced as follows, to predict $\mathcal{C}_i$ by utilizing hop count as supervision to guide the global context prediction in a fine-grained way.

3.3 A Natural Supervisory Signal: Hop Count

An interesting discovery in the famous small-world experiment [21] presents a heuristic: any pair of entities in the network owns a shortest path between them, which is usually the best path for message propagation. This makes it possible to divide the global context based on the hop count of the shortest path. Accordingly, we define a hop-based global context for each node as follows.

A hop-based global context includes the set of nodes that can reach $v_i$ through shortest paths with different hop counts, i.e., $\mathcal{C}_i$ is composed of multiple specific $k$-hop contexts, each of which only contains the nodes exactly $k$ hops away:

$$\mathcal{C}_i = \bigcup_{k=1}^{K_i} C_i^k, \qquad C_i^k = \{v_j \mid |p_{ij}| = k\},$$

where $K_i$ is the upper bound of the hop count from other nodes to $v_i$ in graph $\mathcal{G}$, and $|p_{ij}|$ is the length of the shortest path $p_{ij}$ between $v_i$ and $v_j$.

For each target node $v_i$, a node $v_j$ belongs to only one specific $k$-hop context, i.e., for each $v_j \in \mathcal{C}_i$ there exists exactly one $k$ such that $v_j \in C_i^k$. It is not hard to see that such a hop-based context well reflects the extent of interaction between nodes $v_i$ and $v_j$. Specifically, if the path between $v_i$ and $v_j$ is relatively long, their communication has to pass through many relay points, and naturally they interact with each other to a low extent. The closer a $k$-hop context is to $v_i$, the stronger the relation between them. As a matter of course, hop counts can be utilized as supervisory signals to distinguish the degree of interaction between two nodes.
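The hop-based context above can be built with a single breadth-first search per target node. The sketch below is a simplified stand-in for the NetworkX-based preprocessing mentioned in §4.3; the function name and the adjacency-dict format are our own illustrative choices:

```python
from collections import deque, defaultdict

def hop_contexts(adj, source):
    """Group every node reachable from `source` by its hop count
    (shortest-path length) k, yielding the k-hop contexts C_i^k.

    `adj` maps each node to an iterable of its neighbours.
    """
    dist = {source: 0}
    queue = deque([source])
    contexts = defaultdict(set)          # k -> nodes exactly k hops away
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first visit = shortest path (BFS)
                dist[v] = dist[u] + 1
                contexts[dist[v]].add(v)
                queue.append(v)
    return dict(contexts)
```

On a small chain graph 0-1-2-3, `hop_contexts(adj, 0)` returns `{1: {1}, 2: {2}, 3: {3}}`; these hop counts are exactly the free pseudo-labels used by SGRL.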

In this way, for each target node $v_i$, with $K_i$ categories and pseudo-labels $\hat{y}_{ij} = k$ for $v_j \in C_i^k$, the learning objective, illustrated in Figure 3, is to predict the hop count (relative contextual position) between $v_i$ and $v_j$ by solving the following optimization problem:

$$\min_{\theta, \phi} \; \sum_{v_i \in \mathcal{V}} \sum_{v_j \in \mathcal{C}_i} \mathcal{L}\Big(g_\phi\big(\psi(z_i, z_j)\big), \hat{y}_{ij}\Big), \tag{6}$$

where $z_i = f_\theta(v_i)$ and $\psi(\cdot, \cdot)$ is an operation used to measure the interaction between two vectors; one of the available schemes is explained in §4.3. Eq. (6) empowers the model to learn on unlabeled data in a supervised fashion. With the guidance of the pseudo-labels, the model not only encodes the global topology but also distinguishes the fine-grained interaction between nodes, so that the learned representations can finely characterize the similarity and dissimilarity between nodes, which facilitates downstream tasks such as classification and clustering.

Note that Eq. (6) is difficult to solve in practice, as the upper bound $K_i$ of the hop count varies across target nodes, and precisely determining it is not easy for a big graph. Therefore, modifying it to suit realistic situations is necessary. Inspired by the small-world phenomenon, which suggests that two entities in a network are about six or fewer connections away from each other, we suppose that the number of hops between nodes lies within a certain controllable range (note that the average shortest path length of the largest component in the tested benchmarks Cora, Citeseer, and Pubmed is 6.3, 9.3, and 6.3, respectively). So, for the $K_i$ classes attached to each node $v_i$, we merge multiple classes into $R$ "major" categories and update the pseudo-labels to $\tilde{y}_{ij}$ accordingly. We then obtain the objective function of the proposed SGRL:

$$\min_{\theta, \phi} \; \sum_{v_i \in \mathcal{V}} \sum_{v_j \in \mathcal{C}_i} \mathcal{L}\Big(g_\phi\big(\psi(z_i, z_j)\big), \tilde{y}_{ij}\Big), \qquad \tilde{y}_{ij} \in \{1, \dots, R\}. \tag{7}$$

The design of "major" categories follows a principle of both clearly discriminating dissimilarity and partly tolerating similarity. For instance, the degree of interaction between a node and its 1-hop context differs significantly from that with its 2-hop context, since you may not know your friends' friends at all; treating them as two "major" classes is therefore appropriate. In contrast, the distinction between higher-hop contexts is relatively vague, so merging them into one "major" class seems more reasonable. $R$ reflects the number of such predefined classes. A trivial solution is to treat the 1-hop context as one class and the rest as another, which is similar to the idea of reconstructing the adjacency matrix $A$. A detailed exploration of the "major" categories used in this work follows in a later section. Now, the parameters of encoder $f_\theta$ and classifier $g_\phi$ can be jointly learned by optimizing Eq. (7), and the output of the optimized $f_\theta$ is our desired representation.
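The merging of raw hop counts into a small number of "major" categories can be sketched as a simple lookup. The default grouping below ({1}, {2}, {3, 4}, everything farther) is only one plausible policy of the kind explored in the experiments, not necessarily the paper's exact choice:

```python
def merge_hops(hop, policy=((1,), (2,), (3, 4))):
    """Map a raw hop count to a 'major' class index.

    `policy` lists the hop counts covered by each explicit class, in
    order; any hop count not listed falls into a final catch-all class
    for distant contexts. The default grouping is an illustrative
    assumption, not the paper's fixed setting.
    """
    for label, hops in enumerate(policy):
        if hop in hops:
            return label
    return len(policy)                   # catch-all 'major' class
```

With the default policy, a 1-hop neighbour gets class 0, a 4-hop node shares class 2 with 3-hop nodes, and anything five or more hops away lands in the catch-all class 3.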

4 Experiments

4.1 Datasets

We utilize a variety of standard real-world datasets collected from different domains to comprehensively evaluate the effectiveness of our SGRL on three common learning tasks. The detailed statistics are summarized in Table 1.

Dataset # Nodes # Edges # Features # Classes
Cora 2,708 5,429 1,433 7
Citeseer 3,327 4,732 3,703 6
Pubmed 19,717 44,338 500 3
PPI 56,944 818,716 50 121
Reddit 231,443 11,606,919 602 41
BlogCatalog 5,196 171,743 8,189 6
Flickr 7,575 239,738 12,047 9
Table 1: Dataset Statistics.
  • Cora, Citeseer, and Pubmed [15]: three standard citation networks in which nodes are documents and edges indicate citation relations. In the experiments, they are employed for node classification (transductive) and clustering tasks.

  • PPI [33]: a protein-protein interaction dataset that consists of networks corresponding to different human tissues. It is used for node classification (inductive, multi-label) task.

  • Reddit [8]: a social network constructed with Reddit posts in different topical communities. It is used for node classification (inductive) task.

  • BlogCatalog and Flickr [16]: two social networks in which users are treated as nodes and friend relations represent edges. Following [6], we randomly delete 20%, 50%, and 70% of the edges in these datasets, and use the damaged graphs to conduct link prediction.

Algorithm Cora Citeseer Pubmed

Raw features 56.6±0.4 57.8±0.2 69.1±0.2
node2vec 67.4±0.4 47.5±0.3 72.6±0.5
EP-B 78.1±1.5 71.0±1.4 79.6±2.1
DGI 82.3±0.6 71.8±0.7 76.8±0.6
graphite 82.1±0.06 71.0±0.07 79.3±0.03
GMNN-unsup 82.8 71.5 81.6
SGRL (ours) 83.7±0.2 72.1±0.5 82.4±0.2

Supervised (label ✓):

GCN 81.5 70.3 79.0
GAT 83.0±0.7 72.5±0.7 79.0±0.3
GWNN 82.8 71.7 79.1
GMNN-sup 83.7 73.1 81.8
Table 2: Accuracy (%) on the transductive classification task.

4.2 Baseline Methods

As our setup belongs to unsupervised learning, we mainly compare against two classes of the state-of-the-art unsupervised methods: random-walk based algorithms and GNNs. For the first category, we choose DeepWalk [22] and node2vec [6]. For the latter, we select EP-B [3], DGI [27], graphite [7], GMNN [24], and unsupervised GraphSAGE [8]. Particularly, since SGRL considers global topology, we also compare it with AGC [32] which exploits adaptive graph convolution to capture high-order relations between nodes. To further demonstrate the potential of unsupervised learning, we give some additional results of supervised approaches, including GCN [15], GAT [26], FastGCN [1], GWNN [29], and Adapt [10].

For fair comparison, the dimensionality of the learned representations is set to 512 on all datasets, unless noted otherwise. For node2vec, we set the number of random walks to 10, the walk length to 80, the window size to 10, and the return and in-out parameters $p$ and $q$ both to 0.25. Parameters of DGI are the same as in [27]. The results of other baselines are taken from their original papers.

4.3 Experimental Setup

Figure 4: (a) DGI and (b) SGRL (ours): t-SNE plots of node pairs w.r.t. topological distance on Cora, where color corresponds to the length of the shortest path between pairs of nodes; vectors learned by SGRL present better structural properties. (c) node2vec, (d) DGI, and (e) SGRL (ours): visualization of the learned representations on Cora.

Detailed architecture of SGRL.

For the graph encoder $f_\theta$, we resort to the standard graph convolutional (GC) layer with the ReLU activation function:

$$H^{(l+1)} = \mathrm{ReLU}\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big),$$

where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is its degree matrix, and $W^{(l)}$ is a trainable weight matrix. Specifically, in the inductive classification task (PPI and Reddit), we construct the encoder with two 512-neuron GC layers; in the other tasks, our encoder is a single 512-neuron GC layer. Considering that the operation $\psi$ used to measure the interaction between pairs of nodes should be symmetric, i.e., $\psi(z_i, z_j) = \psi(z_j, z_i)$, we calculate the element-wise distance between two vectors, i.e., $\psi(z_i, z_j) = |z_i - z_j|$, where $|\cdot|$ takes the absolute value of each element. In addition, four "major" categories are used in our experiments (more discussion later).
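The symmetric interaction described above reduces to an element-wise absolute difference. A minimal sketch, with plain Python lists standing in for the learned embedding vectors:

```python
def interaction(z_i, z_j):
    """psi(z_i, z_j) = |z_i - z_j|, taken element-wise.

    Symmetric by construction, so the classifier's input does not
    depend on the order of the node pair.
    """
    assert len(z_i) == len(z_j), "embeddings must share dimensionality"
    return [abs(a - b) for a, b in zip(z_i, z_j)]
```

Because `abs(a - b) == abs(b - a)` holds per element, `interaction(z_i, z_j) == interaction(z_j, z_i)` for any pair of vectors, which is exactly the symmetry requirement stated above.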

Sampling strategy.

Note that Eq. (7) involves computation over each pair of nodes in $\mathcal{G}$, which is computationally expensive and memory-consuming for large graphs. Besides, the number of node pairs in each "major" class varies greatly, which would incur the class imbalance problem [12]. To circumvent these issues, we perform batch-sampling of node pairs based on a uniform distribution over the classes. In detail, we first randomly select a fixed-size batch of target nodes in $\mathcal{G}$, and then, for each target node, sample node pairs from each "major" class at an adaptive ratio to ensure inter-class balance.
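The class-balanced sampling for one target node can be sketched as follows; `contexts_by_class` maps each "major" class label to its member nodes, and all names are illustrative rather than the authors' code:

```python
import random

def sample_balanced_pairs(contexts_by_class, per_class, seed=None):
    """For one target node, draw up to `per_class` partner nodes from
    every 'major' class, so each class contributes (near-)equally to
    the training batch and class imbalance is avoided.
    """
    rng = random.Random(seed)
    pairs = []
    for label, nodes in sorted(contexts_by_class.items()):
        pool = list(nodes)
        for v in rng.sample(pool, min(per_class, len(pool))):
            pairs.append((v, label))     # (partner node, pseudo-label)
    return pairs
```

Classes with fewer than `per_class` members simply contribute all of their nodes, which is one simple way to realize the "adaptive ratio" mentioned above.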

Algorithm PPI Reddit

Raw features 42.2 58.5
DeepWalk – 32.4
GraphSAGE-GCN 46.5 90.8
GraphSAGE-mean 48.6 89.7
GraphSAGE-LSTM 48.2 90.7
GraphSAGE-pool 50.2 89.2
DGI 63.8±0.20 94.0±0.10
SGRL (ours) 66.0±0.04 95.0±0.02

Supervised (label ✓):

GAT 97.3±0.20 –
FastGCN – 93.7
Adapt – 96.3±0.32
Table 3: Micro-averaged F1 (%) on the inductive classification task.

Implementation details.

As a preprocessing step, we employ NetworkX to build the hop-based global context for each node in parallel. During training, we adopt Glorot initialization [5] and the Adam optimizer [14] with a tuned initial learning rate. The number of epochs is likewise tuned, except on the inductive classification task, where it is fixed (20 on Reddit, 50 on PPI). Besides, we use the subsampling technique introduced in [8] to make Reddit and PPI fit into GPU memory: a minibatch of 256 nodes is first selected, and then for each selected node we uniformly sample 8 neighbors from its first- and second-order neighborhoods, respectively.

Evaluation metrics.

Following the experimental setup described in [27], we feed the learned representations into a simple logistic regression classifier to evaluate node-level classification performance. Mean accuracy over 50 runs is used to assess the transductive task, and the micro-averaged F1 score averaged over 50 runs is used for the inductive one. For the clustering task, we apply the K-means algorithm to group the learned embeddings and report the NMI score. For the link prediction task, we use the Area Under the ROC Curve (AUC) as the criterion, similarly reporting the averaged result over 10 runs.

4.4 Results

Node classification.

The results of the transductive and inductive tasks are reported in Tables 2 and 3, where numbers in bold indicate the best results among unsupervised methods. As can be observed, SGRL outperforms all other unsupervised algorithms, especially on Pubmed and PPI, which affirms the effectiveness of hop counts as free supervisory signals. This confirms the benefit of our proposed self-supervised task, i.e., global context prediction. Good performance on reasoning about the relative contextual position can only be achieved if the learned representations encode global topological information and finely discriminate the similarity and differentiation between nodes (t-SNE [19] plots w.r.t. topological distance are given in Figure 4 (a-b)), which indirectly contributes to classification. Besides, SGRL exhibits results comparable to some supervised models such as GCN and GWNN, and even achieves the best result on Pubmed. We believe that self-supervised learning has more potential for learning high-quality representations than supervised training, as supervision built from the data itself can capture the inherent characteristics of the data better than manual labels. Moreover, SGRL exhibits a training speed comparable to GNN-based baselines.


Node clustering.

Table 4 summarizes the results. Although DGI achieves the best performance on Cora and Citeseer, our simple framework SGRL also exhibits competitive performance (an illustration is shown in Figure 4 (c-e)) and obtains the highest NMI on Pubmed. Note that SGRL outperforms AGC, a clustering-oriented model that adaptively captures high-order relations among nodes, which demonstrates that high-order relations alone are somewhat limited in capturing the underlying structures of the graph, while our consideration of global topology and fine-grained similarity is beneficial for learning cluster structures.

Algorithm Cora Citeseer Pubmed
Raw features 0.135 0.237 0.314
node2vec 0.449 0.232 0.288
DGI 0.557 0.438 0.292
AGC 0.537 0.411 0.316
SGRL (ours) 0.540 0.432 0.332
Table 4: Clustering quality in terms of NMI.
Algorithm BlogCatalog Flickr
20% 50% 70% 20% 50% 70%
node2vec 79.9 76.5 72.4 73.9 70.0 63.1
DGI 77.7 76.0 75.4 90.6 88.8 69.2
SGRL (ours) 80.4 78.7 78.2 91.4 90.9 89.8
Table 5: AUC scores (%) for link prediction.

Link prediction.

As can be seen from Table 5, SGRL consistently outperforms DGI and node2vec under different edge removal rates, showing that the representations learned by global context prediction can delicately characterize the similarity and differentiation between nodes from a global topological viewpoint to predict missing links. By contrast, DGI's reliance on a task-oriented negative-sample generating function weakens its performance in this task. This shows that SGRL has better generalization ability.

4.5 Further Discussion on Label Categories

In Table 6 we investigate how the quality of the self-supervised learned embeddings depends on the construction of the "major" classes. We find that clearly distinguishing the 1-hop, 2-hop, and 3-hop contexts into 3 distinct "major" classes helps improve the quality of the representations, while further differentiating the 4-hop and higher-hop contexts degrades performance. We believe the reason is that making only the 1-hop context discernible offers too few categories for recognition, i.e., provides less supervisory information, while too many "major" categories are not distinguishable enough, as the distinction between higher-hop contexts is vague. Hence, in the experiments we adopt a scheme that combines the 3-hop and 4-hop contexts into a single class, which indeed yields better empirical performance.

# Classes Merge Policy Accuracy
2 82.4
3 83.0
4 83.2
5 82.7
6 82.7
Table 6: Accuracy (%) w.r.t. the formation of label classes on Cora.

5 Conclusion

In this work, we have presented a novel self-supervised framework SGRL for learning node representations, which to our knowledge is the first attempt on exploring free supervisory signals in graph-structured data for representation learning. Extensive experiments demonstrate the effectiveness of our framework. We hope our work will inspire more research in self-supervised graph representation learning.


  • [1] J. Chen, T. Ma, and C. Xiao (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. In ICLR, Cited by: §1, §4.2.
  • [2] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §1.
  • [3] A. G. Duran and M. Niepert (2017) Learning graph representations with embedding propagation. In NeurIPS, Cited by: §4.2.
  • [4] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. ICLR. Cited by: §1.
  • [5] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Cited by: §4.3.
  • [6] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD, Cited by: §3.2, §3.2, 4th item, §4.2.
  • [7] A. Grover, A. Zweig, and S. Ermon (2019) Graphite: iterative generative modeling of graphs. In ICML, Cited by: §1, §2, §4.2.
  • [8] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS, Cited by: §2, 3rd item, §4.2, §4.3.
  • [9] M. Heimann, H. Shen, T. Safavi, and D. Koutra (2018) Regal: representation learning-based graph alignment. In CIKM, Cited by: §1.
  • [10] W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. In NeurIPS, Cited by: §1, §4.2.
  • [11] E. Jang, C. Devin, V. Vanhoucke, and S. Levine (2018) Grasp2vec: learning object representations from self-supervised grasping. In CoRL, Cited by: §2.
  • [12] N. Japkowicz and S. Stephen (2002) The class imbalance problem: a systematic study. Intelligent data analysis 6 (5), pp. 429–449. Cited by: §4.3.
  • [13] L. Jing and Y. Tian (2019) Self-supervised visual feature learning with deep neural networks: a survey. arXiv preprint arXiv:1902.06162. Cited by: §1, §2.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • [15] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2, 1st item, §4.2, §4.3.
  • [16] J. Li, X. Hu, J. Tang, and H. Liu (2015) Unsupervised streaming feature selection in social media. In CIKM, Cited by: 4th item.
  • [17] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Cited by: §1.
  • [18] R. Li, S. Wang, F. Zhu, and J. Huang (2018) Adaptive graph convolutional neural networks. In AAAI, Cited by: §2.
  • [19] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.4.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, Cited by: §3.2.
  • [21] M. Newman (2018) Networks. Oxford university press. Cited by: §3.3.
  • [22] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In KDD, Cited by: §2, §3.2, §4.2.
  • [23] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM, Cited by: §2, §3.2.
  • [24] M. Qu, Y. Bengio, and J. Tang (2019) GMNN: graph markov neural networks. In ICML, Cited by: §4.2.
  • [25] K. Sun, Z. Zhu, and Z. Lin (2019) Multi-stage self-supervised learning for graph convolutional networks. arXiv preprint arXiv:1902.11038. Cited by: §2.
  • [26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2, §4.2.
  • [27] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. In ICLR, Cited by: §1, §2, §4.2, §4.2, §4.3.
  • [28] J. Wu, X. Wang, and W. Y. Wang (2019) Self-supervised dialogue learning. arXiv preprint arXiv:1907.00448. Cited by: §2.
  • [29] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng (2019) Graph wavelet neural network. In ICLR, Cited by: §1, §2, §4.2.
  • [30] F. Ye, C. Chen, and Z. Zheng (2018) Deep autoencoder-like nonnegative matrix factorization for community detection. In CIKM, Cited by: §1.
  • [31] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung (2018) Gaan: gated attention networks for learning on large and spatiotemporal graphs. In UAI, Cited by: §2.
  • [32] X. Zhang, H. Liu, Q. Li, and X. Wu (2019) Attributed graph clustering via adaptive graph convolution. In IJCAI, Cited by: §4.2.
  • [33] M. Zitnik and J. Leskovec (2017) Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198. Cited by: 2nd item.