Graph representation learning has been used in many real-world domains that involve graph-structured data, including bioinformatics, chemoinformatics [17, 27], social networks and cyber-security. Two important tasks in graph analysis are label prediction on nodes and label prediction on graphs. For instance, in the study of chemical molecules, graph classification [22, 21, 39] is used to predict the label of a molecule and thereby help discover the chemical properties of new molecules; here a molecule is represented as a graph with atoms as nodes and chemical bonds as edges.
Graph neural networks (GNNs) are applied to graph-based data to improve prediction performance because they can learn high-level features by propagating, transforming, and aggregating neighborhood information across edges [11, 13]. Various neighborhood aggregation methods capture the structures and attributes of graphs, such as average, generalized and attention-based aggregation. However, it is still difficult for GNNs to learn useful graph representations, since real-world graphs are usually large, sparse and noisy, and the valid information is often contained in a few small subgraphs that conventional aggregation methods cannot capture. To solve this problem, Lee et al. presented the Graph Attention Model (GAM), which focuses on small parts of a graph in order to predict the label of the entire graph. To improve performance, GAM also integrates global information from various parts of the graph via different random sets of nodes. This suggests that local and global information are both important in graph representation learning. In the analysis of real-world graphs, it is necessary to gather information from individual nodes and edges as well as from the subgraphs that represent discriminative patterns. Recently, Ying et al. proposed a graph pooling module, DiffPool, to generate hierarchical representations of graphs for graph classification. This mechanism allows GNNs to encode local and global structural information to obtain the final graph representation.
Although the above methods perform well on the graph classification task, they are task-specific and focus on supervised learning. These methods depend heavily on vast quantities of labeled graph data, which are often costly and error-prone to obtain in the real world. To address this problem, Veličković et al. apply mutual information maximization to learn node representations of graph-structured inputs without using labeled data, and demonstrate performance competitive with supervised learning on several node classification benchmarks. Inspired by this work, we propose a novel unsupervised learning method, Unsupervised Hierarchical Graph Representation (UHGR), which learns hierarchical graph representations, including node embeddings and graph embeddings, based on mutual information maximization. We summarize the main contributions as follows:
- We propose an unsupervised hierarchical graph representation learning method that captures the local and global structural information of arbitrarily sized graphs and does not depend on any task-specific information (e.g., class labels). This method is generic enough to be used in various scenarios such as node embedding and graph embedding.
- We demonstrate that the graph representations from the proposed model can achieve node and graph classification performance comparable to supervised baseline methods on real-world data sets.
- Our visualization of the hierarchical cluster assignment demonstrates that the proposed method can automatically learn meaningful and interpretable clusters across different levels of coarseness based on the structural information of graphs.
2 Related Work
Our work builds upon recent research on graph neural networks and graph representation learning. Here we focus only on the most closely related work.
Graph Neural Network. A wide variety of graph neural networks have been applied to node classification [34, 10, 19] and graph classification [5, 7, 40, 39, 21] in recent years. In node classification, GAT stacks masked self-attentional layers to classify a node by attending over its neighbors with different weights. LGCN builds a trainable graph convolutional layer that selects a fixed number of neighboring nodes in order to transform graph data into grid-like data, which is suitable for regular convolutional operations. PPNP combines graph convolutional networks (GCNs) and PageRank to overcome the difficulty of extending the size of the neighborhood observed by a node. In graph classification, the main challenge is to build a useful low-dimensional graph representation from the node embeddings of the entire graph. One straightforward solution, presented by Duvenaud et al. and Veličković et al., is to sum or average a graph’s node embeddings. However, this solution ignores the structural information of graphs and assumes that all nodes contribute equally to the graph representation. DiffPool was therefore proposed for graph classification; it learns hierarchical graph representations with a graph pooling module. Although this method addresses the problem that existing GNN methods are flat and ignore the hierarchical structure of graphs, it must be trained under the supervision of graph-level labels. In addition, because real-world graphs are usually large and noisy, GAM was proposed for attention-based graph classification; it uses an attention mechanism to focus on small but informative parts of graphs. However, all of these approaches depend on task-specific information to learn node embeddings or graph embeddings. Besides, most of them ignore the hierarchical representation of graphs, and thus have limited capability to capture the natural structures of real-world graphs.
Graph Representation Learning. Learning a good representation not only enables us to capture the latent variables of the data, but also helps improve the performance of downstream tasks. For graph-structured data, the learned low-dimensional representations (embeddings) can encode information about a graph’s nodes, or about the entire graph in the case of GNN models. Many existing graph representation methods focus on node embeddings learned with random-walk-based objectives [26, 12, 13]. In addition, LINE and Sybrandt et al. model first-order and second-order relationships between node neighborhoods to learn node embeddings and graph embeddings. Gilmer et al. propose a common framework to learn message passing algorithms and aggregate the node embeddings. Janossy Pooling is a permutation-invariant aggregator function for learning node embeddings. Veličković et al. propose an alternative unsupervised node embedding method based on mutual information. HARP proposes a hierarchical paradigm to learn low-dimensional representations of a graph’s nodes. This paradigm uses a smaller graph that approximates the original global structure to obtain good initializations for learning representations of the original graph.
There are also some studies on learning representations of entire graphs in an unsupervised manner, which is quite different from the task of node embedding. In node embedding, the goal is to learn a low-dimensional vector that represents a node, with or without the supervision of labels (e.g., node labels and graph labels). Graph2vec is an unsupervised graph embedding method inspired by document embedding models, but it may not capture global structure, since it only uses subtrees for graph embeddings. Taheri et al. generate sequences from graphs and train a long short-term memory (LSTM) autoencoder to embed these graph sequences into continuous vectors. However, the LSTM network cannot be operated in parallel and is not well suited to modeling large graphs. Some recent approaches apply attention mechanisms to graphs [4, 21] to determine which parts of the graph should receive more attention. Yet the attention mechanism focuses only on local information, which is not enough to achieve satisfactory node or graph representations. Recently, BayesPool was proposed, which uses variational Bayes with an encoder-decoder architecture to learn hierarchical graph representations in an unsupervised manner. The encoder-decoder architecture leads this method to focus excessively on node-level details rather than on higher-level node and graph embeddings. Different from previous representation learning methods, in this work we use an unsupervised learning framework based on mutual information with a contrastive loss to learn hierarchical graph representations.
3 Proposed Method
Inspired by the recent success of unsupervised learning based upon mutual information maximization [15, 35], we propose a novel unsupervised embedding framework, UHGR, to capture structural information and learn a hierarchical graph representation. This method is based on the maximization of mutual information between “local” representations and high-level “global” representations, which enables us to learn both node and graph representations. The proposed method uses unsupervised learning to aggregate structural information and generate hierarchical representations. Because no labels are involved, the learned graph representations can be used for various downstream tasks, such as node classification and graph classification. Meanwhile, this method overcomes the shortcomings of previous studies in integrating different kinds of structural information of graphs. To evaluate our method, we apply the learned representations to node and graph classification tasks, and compare the classification results with several baseline methods.
The undirected graph $G = (X, A)$ is comprised of $N$ nodes, each with $F$ features. Here, $X \in \mathbb{R}^{N \times F}$, where the original feature vector $x_i$ of node $i$ is represented by row $i$ of $X$. Furthermore, the adjacency matrix $A \in \mathbb{R}^{N \times N}$ contains a nonzero entry $A_{ij}$ to indicate an edge between node $i$ and node $j$. The goal of this work is to create different levels of low-rank encodings of $G$, which we accomplish by training an encoder to cluster local parts of the graph and create successively coarsened graphs, eventually outputting the final representation of the original graph. Each coarsened graph $G^{(i)}$ has its own node features $X^{(i)}$ and adjacency matrix $A^{(i)}$, both of which are trainable. In order to train the Encoder module, we apply a hierarchical approach in which $G^{(i+1)}$ is repeatedly coarsened from $G^{(i)}$. $X^{(0)} = X$ and $A^{(0)} = A$ come from the original graph, and $X^{(L)}$ is the final graph representation of the original graph. Following this scheme, the number of nodes in the successively coarsened graphs is non-increasing: because $X^{(i)} \in \mathbb{R}^{N_i \times F_i}$ represents the node embeddings of level $i$, if $i < j$, then $N_i \geq N_j$. The feature vectors of the coarse nodes are determined by a separate hierarchical level, which learns the node embeddings of level $i+1$ from the previous level of coarseness.
This paper uses graph neural networks (GNNs) to create representations of graphs at different levels, which makes it possible to capture hierarchical structures and generate flexible graph embeddings. A key component of the proposed method is how to cluster parts of the graph and generate coarsened graphs from the output of the GNNs without any labels. In the following parts, we outline the different modules of UHGR and illustrate how to learn hierarchical graph representations based on mutual information maximization.
3.2 Encoder module
The hierarchical Encoder has two main functions: a message-passing function $\mathcal{M}$ and a readout function $\mathcal{R}$. $\mathcal{M}$ is used to iteratively compute node representations from the features of their neighborhoods:

$$H^{(k)} = \mathcal{M}\left(A, H^{(k-1)}\right),$$

where $H^{(k)}$ are the node embeddings of the $k$-th step from the message-passing function $\mathcal{M}$, which depends on the adjacency matrix $A$ and the previous node embeddings $H^{(k-1)}$. At the initial step ($k = 1$), the input embeddings $H^{(0)}$ are initialized by the original node features $X$. After $K$ iterations, the module outputs the final node embeddings $Z = H^{(K)}$. $\mathcal{M}$ can be implemented by different types of GNNs. In this work, we consider two general GNNs: Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs).
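Graph Convolutional Networks. GCNs implement the message-passing function using the following rule:

$$H^{(k)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(k-1)} W^{(k-1)}\right),$$

where $H^{(k)}$ are the node embeddings of the $k$-th step, $\hat{A} = A + I$ is the adjacency matrix with self-loops and $\hat{D}$ is the corresponding degree matrix. For the nonlinearity $\sigma$, we apply the parametric ReLU function, and $W^{(k-1)}$ is a trainable weight matrix.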
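As an illustration of this propagation rule, the following is a minimal pure-Python sketch of a single GCN step with self-loops, symmetric degree normalization and a PReLU nonlinearity. The function names (`matmul`, `gcn_layer`) and the fixed slope `prelu_slope=0.25` are assumptions made for the example; in a trained model the slope and the weight matrix would be learned, and this is not the authors' implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, feats, weight, prelu_slope=0.25):
    """One GCN step: PReLU(D^-1/2 (A + I) D^-1/2 H W).

    prelu_slope is fixed here for illustration (assumption); PReLU normally
    learns this slope during training.
    """
    n = len(adj)
    # Add self-loops: A_hat = A + I.
    a_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # Degree of each node in A_hat (row sums).
    deg = [sum(row) for row in a_self]
    # Symmetric normalization: entry (i, j) becomes A_hat[i][j] / sqrt(d_i * d_j).
    a_norm = [[a_self[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    out = matmul(a_norm, matmul(feats, weight))
    # Parametric ReLU applied element-wise.
    return [[x if x > 0 else prelu_slope * x for x in row] for row in out]

# Toy path graph 0 - 1 - 2 with two features per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(adj, feats, identity)
```

Stacking several such calls, each with its own weight matrix, corresponds to running the message-passing function for several steps.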
Graph Attention Networks. GATs leverage self-attentional layers with learnable weights that measure the importance of each neighbor when aggregating feature information from a node’s neighborhood. When computing the new feature representation of a central node, each neighbor receives a different weight that measures the relation between its feature vector and the central node’s vector. Node $i$ and its neighbor $j$ are related as follows:

$$e_{ij} = a\left(W h_i, W h_j\right), \qquad \alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)},$$

where $e_{ij}$ are the attention coefficients, $a$ represents a single-layer feed-forward neural network that performs self-attention on the nodes, $h_i$ is the feature vector of node $i$, $\mathcal{N}_i$ is the neighborhood of node $i$, and $W$ is a shared weight matrix that applies a linear transformation to every node. After normalization, $\alpha_{ij}$ indicates the importance of node $j$’s features to node $i$ during the feature aggregation process.
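The attention computation can be sketched in a few lines of pure Python. The sketch below assumes the scoring network is a single weight vector applied to the concatenated transformed features followed by a LeakyReLU with slope 0.2, and that each node attends over its neighbors and itself, as in the original GAT formulation; the names `attention_coefficients` and `attn_vec` are hypothetical and chosen only for this example.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_coefficients(adj, feats, weight, attn_vec):
    """Return alpha[i][j], the normalized attention of node i over node j.

    Assumption: node i attends over its neighbors and itself; all other
    entries of alpha stay at zero.
    """
    n = len(adj)
    # Shared linear transformation W h_i for every node.
    wh = [[sum(h[k] * weight[k][c] for k in range(len(h)))
           for c in range(len(weight[0]))] for h in feats]
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j] != 0 or j == i]
        # Unnormalized scores e_ij = LeakyReLU(a^T [W h_i || W h_j]).
        scores = [leaky_relu(sum(a * v for a, v in zip(attn_vec, wh[i] + wh[j])))
                  for j in neigh]
        # Numerically stable softmax over the neighborhood.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for j, e in zip(neigh, exps):
            alpha[i][j] = e / z
    return alpha

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha = attention_coefficients(adj, feats, identity, [0.5, -0.3, 0.8, 0.1])
```

Each row of `alpha` sums to one over the attended nodes, so the new representation of node $i$ is a convex combination of the transformed features in its neighborhood.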
3.3 Graph pooling module
To assign nodes to clusters at each hierarchical layer, we apply DIFFPOOL to create the node embeddings $X^{(i+1)}$ and the adjacency matrix $A^{(i+1)}$ of the next coarsened layer $i+1$ from layer $i$.
The graph pooling module takes the adjacency matrix $A^{(i)}$ and the features $X^{(i)}$ of the nodes or cluster nodes at layer $i$ as the input of the GNN module to obtain new embedding matrices of the nodes or cluster nodes. The DIFFPOOL module then takes these node embedding matrices and the adjacency matrix $A^{(i)}$ to generate a coarsened adjacency matrix $A^{(i+1)}$ and new embeddings $X^{(i+1)}$ for each of the nodes or cluster nodes in the coarsened graph. The coarsened graph is then fed to the GNN module again to generate an even coarser version of the input graph. This process is repeated several times until the final graph representation is generated, which contains only a single node or cluster node. Compared to other hierarchical representation learning methods, our model learns a hierarchical representation strategy automatically, does not depend on a specific task, and can be trained end-to-end. In general, this unsupervised procedure embeds the original graph into a coarser one by grouping similar subgraphs together.
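A single coarsening step of this kind can be sketched as follows, using the update rule of the original DiffPool formulation: a soft assignment matrix $S$ (row-wise softmax of the pooling GNN output) maps nodes to clusters, the new features are $S^\top Z$ and the new adjacency matrix is $S^\top A S$. The sketch is pure Python; `assign_logits` stands in for the output of the pooling GNN and is an assumption of the example, as are the helper names.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, so every node gets a distribution over clusters."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def diffpool_coarsen(adj, embeddings, assign_logits):
    """Coarsen a graph with a soft cluster assignment.

    assign_logits (nodes x clusters) is a placeholder for the pooling GNN
    output (assumption). Returns (X_next, A_next, S).
    """
    s = softmax_rows(assign_logits)
    s_t = transpose(s)
    x_next = matmul(s_t, embeddings)        # S^T Z: cluster features
    a_next = matmul(matmul(s_t, adj), s)    # S^T A S: cluster connectivity
    return x_next, a_next, s

# Path graph 0 - 1 - 2 - 3, pooled into two clusters {0, 1} and {2, 3}.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
logits = [[50.0, -50.0], [50.0, -50.0], [-50.0, 50.0], [-50.0, 50.0]]
x_next, a_next, s = diffpool_coarsen(adj, z, logits)
```

With near-hard assignments as in this toy input, the coarsened adjacency matrix counts the edges inside and between the two clusters, which is the sense in which similar subgraphs are grouped together.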
3.4 Discriminator module
Similar to Deep InfoMax [15, 35], we introduce a discriminator module to help train the Encoder module and the Graph pooling module, which enables our model to output satisfactory representations. The discriminator module trains the encoder to maximize the mutual information between a high-level graph representation and the local features of the graph, and is able to capture a unique representation for each individual graph. The local features are also contained in the learned node embeddings, which represent the hierarchy of the original graph. In this context, the final output of hierarchical learning is the graph-level summary representation $s$, and the local graph features are the node embeddings $h_i$ of the original graph $G$. Therefore, our hierarchical model can be written as:

$$s = \mathcal{R}\left(\mathrm{GNN}(X, A)\right),$$

where $X$ is the node feature matrix, $A$ represents the adjacency matrix of the original graph, and $\mathcal{R}$ is used to obtain a hierarchical graph-level representation. The GNN module can be any node embedding module, such as GCN or GAT. The readout function $\mathcal{R}$ uses the unsupervised hierarchical process to summarize the graph into the graph-level vector $s$.
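For the objective function, we follow the same loss function as DGI, which computes the standard binary cross-entropy between graph samples from the joint and the product of marginals:

$$\mathcal{L} = -\frac{1}{N + M}\left(\sum_{i=1}^{N} \mathbb{E}_{(X, A)}\left[\log \mathcal{D}\left(h_i, s\right)\right] + \sum_{j=1}^{M} \mathbb{E}_{(\tilde{X}, \tilde{A})}\left[\log\left(1 - \mathcal{D}\left(\tilde{h}_j, s\right)\right)\right]\right),$$

where a discriminator $\mathcal{D}$ is employed to represent the probability scores of the local-global pair, $h_i$ are the $N$ local features of the input graph, $s$ is its summary vector, and $\tilde{h}_j$ are the $M$ local features of the negative samples. The negative samples are drawn by combining the summary vector $s$ with the local features from other graphs. By minimizing this objective, our model can effectively extract useful local and global information from the input graph based on mutual information maximization.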
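To make the objective concrete, the following pure-Python sketch scores local-global pairs and averages the binary cross-entropy over positive pairs (local features from the same graph as the summary) and negative pairs (local features from other graphs). It assumes the bilinear discriminator used in DGI, $\mathcal{D}(h, s) = \mathrm{sigmoid}(h^\top W s)$; the text above only states that the DGI loss is followed, so the discriminator form, the function names and the small `eps` guard are assumptions of the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bilinear_score(h, w, s):
    """D(h, s) = sigmoid(h^T W s), the bilinear discriminator of DGI (assumed here)."""
    ws = [sum(w[r][c] * s[c] for c in range(len(s))) for r in range(len(w))]
    return sigmoid(sum(h[r] * ws[r] for r in range(len(h))))

def infomax_bce_loss(pos_feats, neg_feats, summary, w, eps=1e-12):
    """Binary cross-entropy over positive and negative local-global pairs.

    pos_feats: local features from the same graph as `summary`.
    neg_feats: local features from other graphs (negative samples).
    eps guards against log(0) and is an implementation detail of this sketch.
    """
    pos = [math.log(bilinear_score(h, w, summary) + eps) for h in pos_feats]
    neg = [math.log(1.0 - bilinear_score(h, w, summary) + eps) for h in neg_feats]
    return -(sum(pos) + sum(neg)) / (len(pos) + len(neg))

summary = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# An uninformative discriminator scores every pair 0.5, giving a loss of log 2.
loss_uninformed = infomax_bce_loss([[3.0, 0.0]], [[-3.0, 0.0]], summary, zeros)
# A discriminator that separates positives from negatives gives a lower loss.
loss_separated = infomax_bce_loss([[3.0, 0.0]], [[-3.0, 0.0]], summary, identity)
```

Driving this loss below $\log 2$ requires the summary vector to carry information that distinguishes a graph's own local features from those of other graphs, which is the intuition behind the mutual information argument.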
4 Experiments
We evaluate the graph representations learned by UHGR on both graph classification and node classification tasks. In each case, UHGR is used to learn graph and node representations in a fully unsupervised manner. The graph and node classification tasks are performed by directly feeding the learned representations into simple linear classifiers. We also conduct visualization experiments on the learned representations to verify whether the clusters assigned in an unsupervised manner are reasonable.
4.1 Data sets
To evaluate the ability of UHGR to learn hierarchical representations from arbitrarily complex graphs, we evaluate it on a variety of real-world graphs chosen from commonly used benchmarks. For the node classification task, we consider the transductive learning setting and choose three standard data sets, Cora, Citeseer, and Pubmed, as summarized in Table 1. We employ the same training, validation and testing settings as those in DGI, and report the node classification accuracy on the testing data, averaged over 50 runs of training. For the graph classification task, we use the protein data sets D&D [6, 29] and PROTEINS [2, 6], the chemical molecule data set NCI1 [37, 29], and the scientific collaboration data set COLLAB. More information on these data sets is shown in Table 2. For this graph classification task, we perform 10-fold cross-validation to evaluate the performance, and report the average over the 10 folds as the final accuracy. The visualization experiments are conducted on the data sets used for the graph classification task: we feed in the original graph and output a coarser one based on the learned hierarchical cluster assignments.
Table 2: Statistics of the data sets used for graph classification (columns: Data set, Graphs, Classes, Avg. # Nodes, Avg. # Edges).
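The 10-fold protocol used for graph classification can be sketched as below. The sketch only handles the bookkeeping: `evaluate_fold` is a hypothetical callback that would fit a linear classifier on the frozen embeddings of the training graphs and return the accuracy on the held-out fold. The shuffling seed and the unstratified split are assumptions of the example.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, evaluate_fold, k=10, seed=0):
    """Average the accuracy returned by evaluate_fold over k folds.

    evaluate_fold(train_idx, test_idx) is a hypothetical callback that trains
    a linear classifier on the training embeddings and scores the test fold.
    """
    folds = k_fold_indices(n, k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        accuracies.append(evaluate_fold(train_idx, test_idx))
    return sum(accuracies) / k

# Dummy evaluator: reports the fraction of the data held out in each fold.
mean_fraction = cross_validate(25, lambda train, test: len(test) / 25.0)
```

Because the encoder is trained without labels, only the linear classifier is refit for each fold in this protocol.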
4.2 Experimental setup
As discussed in Section 3, UHGR includes an Encoder module, a Graph pooling module and a Discriminator module. The Encoder module encodes node representations using one GAT layer or one GCN layer. In the Graph pooling module, we apply two DIFFPOOL layers for all of the data sets, with three GCN layers between these two DIFFPOOL layers. In the hierarchical cluster setting, the number of clusters after a DIFFPOOL layer is set to 10-30% of the number of nodes or clusters before pooling. The Readout function in the Discriminator module is built on top of the DIFFPOOL architecture, which enables us to learn hierarchical graph representations. Finally, the Discriminator module relies on mutual information maximization to achieve unsupervised graph learning. We also apply batch normalization [16] and use PyTorch [25] to build the graph neural network model, and run it on an NVIDIA Tesla V100 GPU. In order to demonstrate the effectiveness of our proposed model, we evaluate it on the following three tasks: node classification, graph classification, and analysis of hierarchical cluster assignment.
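As a small worked example of the cluster setting, the sketch below computes how many nodes or clusters remain after each pooling layer for a given ratio. The rounding rule (ceiling, with at least one cluster) and the function name `pooled_sizes` are assumptions; the text only specifies the 10-30% range and the use of two DIFFPOOL layers.

```python
import math

def pooled_sizes(num_nodes, ratio, num_pool_layers=2):
    """Number of nodes or clusters before and after each pooling layer.

    ratio is the fraction of nodes kept as clusters (0.1 to 0.3 in the text).
    Rounding up and keeping at least one cluster are assumptions of this sketch.
    """
    sizes = [num_nodes]
    for _ in range(num_pool_layers):
        sizes.append(max(1, math.ceil(ratio * sizes[-1])))
    return sizes

sizes = pooled_sizes(100, 0.25)
```

For a graph with 100 nodes and a ratio of 25%, this gives 25 clusters after the first pooling layer and 7 after the second.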
4.3 Results for Node Classification
| Available data | Method | Cora | Citeseer | Pubmed |
|---|---|---|---|---|
| X | Raw features | 47.9 ± 0.4% | 49.3 ± 0.2% | 69.1 ± 0.3% |
| X, A | DeepWalk + features | 70.7 ± 0.6% | 51.4 ± 0.5% | 74.3 ± 0.9% |
| X, A | DGI | 82.3 ± 0.6% | 71.8 ± 0.7% | 76.8 ± 0.6% |
| X, A, Y | GCN | 81.5% | 70.3% | 79.0% |
| X, A, Y | GAT | 83.0 ± 0.7% | 72.5 ± 0.7% | 79.0 ± 0.3% |
| X, A | GAT-UHGR (ours) | 78.5 ± 0.1% | 62.6 ± 0.3% | 77.4 ± 0.6% |
| X, A | GCN-UHGR (ours) | 76.7 ± 0.1% | 62.5 ± 0.1% | 75.1 ± 0.3% |
Table 3 lists the node classification results on the Cora, Citeseer and Pubmed data sets for our method and other existing methods. For the node embedding operation, we test two GNN module variants: GATs and GCNs. The GAT module outperforms the GCN module on most of the benchmarks, indicating that the self-attention mechanism is more suitable for capturing local structural information. For the Cora and Citeseer data sets, we set both the hidden dimension and the output dimension to 320 and 400, respectively. For the Pubmed data set, we test a 128-dimensional hidden and output size for the GCN model and a 100-dimensional hidden and output size for the GAT model. Due to limited GPU memory, we did not test larger hidden and output dimensions, even though they might yield more powerful node representations. According to the results, our model achieves better classification performance than DeepWalk and obtains performance comparable to supervised learning methods.
4.4 Results for Graph Classification
| Available data | Method | COLLAB | D&D | PROTEINS | NCI1 |
|---|---|---|---|---|---|
| X, A, Y | GRAPHSAGE | 68.3% | 75.4% | 70.5% | - |
| X, A, Y | SET2SET | 71.8% | 78.1% | 74.3% | - |
| X, A, Y | DIFFPOOL | 75.5% | 80.6% | 76.3% | 79.3% |
| X, A | graph2vec | - | - | 75.4% | 75.0% |
| X, A | GAT-UHGR (ours) | 67.4% | 75.6% | 75.9% | 66.6% |
| X, A | GCN-UHGR (ours) | 66.9% | 77.4% | 74.7% | 66.6% |
Table 4 compares the graph classification performance of our unsupervised learning method with supervised learning baselines on the COLLAB, D&D, PROTEINS and NCI1 data sets. The results show that our unsupervised method obtains performance similar to DIFFPOOL on the PROTEINS benchmark and achieves results comparable to supervised methods such as GRAPHSAGE, indicating that our method can learn useful graph representations even without graph labels. We also find that the GAT-UHGR model performs better than the GCN-UHGR model on the COLLAB and PROTEINS data sets, and performs worse than the GCN-UHGR model only on the D&D data set. This suggests that different graph data sets need different Encoder layers to capture useful representations and achieve better classification performance. Compared with another unsupervised model, graph2vec, the GAT-UHGR model obtains comparable classification results on the PROTEINS data set. However, graph2vec uses an SVM classifier on 1024-dimensional graph embeddings, whereas our method directly uses the graph representations to train and test a simple linear classifier. We simply set the embedding dimension to values between 20 and 360 to demonstrate the validity of the learned hierarchical representations, and do not further optimize this hyperparameter to achieve better classification performance due to hardware limitations.
4.5 Visualization of hierarchical representation
In addition to generating useful representations for classification tasks, our model can also create meaningful and interpretable representations in a hierarchical way. To assess whether the learned hierarchical graphs are meaningful, we visualize the cluster assignments after the DIFFPOOL layer. Figure 2 shows the visualization of node assignments on graphs from different data sets. Different node colors represent different cluster labels derived from the cluster assignment probabilities. Figure 2 (a) shows the node assignment on the COLLAB data set, and it is clear that our model can capture the hierarchical structure in these graphs. From Figure 2 (b) and (c), we also observe that many meaningful structures, including clique-like, tree-like and cycle-like structures, are captured by the model. This is because the DIFFPOOL layer computes the node assignment from the node feature matrix and the adjacency matrix, so input nodes with similar features and local structure obtain similar assignments. Even if subgraphs with similar patterns are far apart, our model can still assign them to the same cluster. In general, our unsupervised learning method based on mutual information can capture different hierarchical structures.
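One simple way to turn soft assignment probabilities into the discrete cluster labels used for coloring is to take the most probable cluster for each node, as in the sketch below. Using the argmax is an assumption of this example, since the exact labeling rule is not spelled out, and the function names are hypothetical.

```python
def hard_cluster_labels(assign_probs):
    """Pick the most probable cluster for each node (argmax over each row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in assign_probs]

def group_by_cluster(labels):
    """Map each cluster label to the list of node indices assigned to it."""
    groups = {}
    for node, label in enumerate(labels):
        groups.setdefault(label, []).append(node)
    return groups

probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3], [0.6, 0.3, 0.1]]
labels = hard_cluster_labels(probs)
groups = group_by_cluster(labels)
```

Nodes that share a label would then be drawn in the same color, regardless of how far apart they are in the graph.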
5 Conclusion
In this paper, we propose UHGR, an unsupervised hierarchical representation learning model based on mutual information, to learn node embeddings and graph embeddings. Maximizing the mutual information between the global representation and local parts of the graph encourages the model to learn related structural information in all locations. This unsupervised learning model is able to learn task-independent graph representations. In addition, it can learn hierarchical graph representations that are meaningful and easy to interpret. To demonstrate the effectiveness of the model, we perform node classification and graph classification tasks based on the learned representations. The results show that our model can achieve results comparable to supervised methods on several tested data sets. Finally, through visualization of the hierarchical cluster assignment, we show that our model is able to generate hierarchical representations by clustering different meaningful structures.
References
[1] Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828, 2013.
[2] Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56, 2005.
[3] HARP: hierarchical representation learning for networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
[4] GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 787–795, 2017.
[5] Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711.
[6] Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330 (4), pp. 771–783, 2003.
[7] Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
[8] Community-based question answering via heterogeneous social network learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[9] Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539, 2017.
[10] Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424, 2018.
[11] Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272, 2017.
[12] Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, 2016.
[13] Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
[14] Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[15] Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[16] Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[17] Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems, pp. 2607–2616, 2017.
[18] Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[19] Predict then propagate: graph neural networks meet personalized PageRank. In Seventh International Conference on Learning Representations, 2019.
[20] Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196, 2014.
[21] Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1666–1674, 2018.
[22] Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling 53 (7), pp. 1563–1575.
[23] Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900, 2018.
[24] Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005, 2017.
[25] Automatic differentiation in PyTorch. 2017.
[26] DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, 2014.
[27] SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pp. 991–1001.
[28] Collective classification in network data. AI Magazine 29 (3), pp. 93–93, 2008.
[29] Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561, 2011.
[30] FOBE and HOBE: first- and high-order bipartite embeddings. arXiv preprint arXiv:1905.10953, 2019.
[31] Learning graph representations with recurrent neural network autoencoders. In Proc. KDD Deep Learn. Day, pp. 1–8.
[32] LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077, 2015.
[33] Unsupervised hierarchical graph representation learning with variational Bayes. 2020.
[34] Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[35] Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[36] CyberSAGE: a tool for automatic security assessment of cyber-physical systems. In International Conference on Quantitative Evaluation of Systems, pp. 384–387, 2014.
[37] Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems 14 (3), pp. 347–375, 2008.
[38] Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374, 2015.
[39] Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4805–4815, 2018.
[40] An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.