1 Learning Graph Embeddings from Graph Proximity Metrics
We introduce UGraphEmb (Unsupervised Graph-level Embedding) for learning graph-level representations in an unsupervised and inductive way. Recent years have witnessed the great popularity of graph representation learning, with success not only in node-level tasks such as node classification (Kipf & Welling, 2016a) and link prediction (Zhang & Chen, 2018), but also in graph-level tasks such as graph classification (Ying et al., 2018) and graph similarity/distance computation (Bai et al., 2019).
There has been a rich body of work (Belkin & Niyogi, 2003; Qiu et al., 2018) on node-level embeddings that turn each node in a graph into a vector preserving node-node proximity (similarity/distance). It is thus natural to raise the question: Can we embed an entire graph into a vector in an unsupervised way, and how? However, most existing methods for graph-level, i.e. whole-graph, embeddings assume a supervised model (Zhang & Chen, 2019), with only a few exceptions, such as Graph Kernels (Yanardag & Vishwanathan, 2015), which typically count subgraphs for a given graph and can be slow, and Graph2Vec (Narayanan et al., 2017), which is transductive.
A key challenge in designing an unsupervised graph-level embedding model is the lack of graph-level signals in the training stage. Unlike node-level embedding, which has a long history of utilizing the link structure of a graph to embed nodes, there is no such natural proximity (similarity/distance) information between graphs. Supervised methods therefore typically resort to graph labels as guidance and use aggregation-based methods, e.g. averaging node embeddings, to generate graph-level embeddings, with the implicit assumption that good node-level embeddings would automatically lead to good graph-level embeddings using only “intra-graph information” such as node attributes, link structure, etc.
However, this assumption is problematic, as simple aggregation of node embeddings preserves only limited graph-level properties, which is often insufficient for measuring graph-graph proximity (“inter-graph” information). Inspired by recent progress on graph proximity modeling (Ktena et al., 2017; Bai et al., 2019), we propose a novel framework, UGraphEmb, that employs multi-scale aggregations of node-level embeddings, guided by the graph-graph proximity defined by well-accepted and domain-agnostic graph proximity metrics such as Graph Edit Distance (GED) (Bunke, 1983), Maximum Common Subgraph (MCS) (Bunke & Shearer, 1998), etc.
The goal of UGraphEmb is to learn high-quality graph-level representations in a completely unsupervised and inductive fashion: During training, it learns a function that maps a graph into a universal embedding space best preserving graph-graph proximity, so that after training, any new graph can be mapped to this embedding space by applying the learned function. Inspired by the recent success of pre-training methods in the text domain, such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), and GPT (Radford et al., 2018), we further fine-tune the model by incorporating a supervised loss, to obtain better performance in downstream tasks, including graph classification, graph similarity ranking, graph visualization, etc.
2 The Proposed Approach: UGraphEmb
We present the overall architecture of our unsupervised inductive graph-level embedding framework UGraphEmb in Figure 2. The key novelty of UGraphEmb is the use of graph-graph proximity. To predict the proximity between two graphs, UGraphEmb generates one embedding per graph from node embeddings using a novel mechanism called Multi-Scale Node Attention (MSNA), and computes the proximity using the two graph-level embeddings.
2.1 Inductive Whole-Graph Embedding
2.1.1 Node Embedding Generation
For each graph, UGraphEmb first generates a set of node embeddings using neighbor aggregation methods based on Graph Convolutional Networks (GCN) (Kipf & Welling, 2016a), which are permutation-invariant and inductive. We adopt the recent state-of-the-art method, Graph Isomorphism Network (GIN) (Xu et al., 2019), in our framework. GIN has been proven to be theoretically the most powerful GNN under the neighbor aggregation framework (Xu et al., 2019).
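As an illustration, one GIN aggregation step can be sketched in plain NumPy as follows. This is a minimal sketch of the injective sum aggregator followed by an MLP; the weight matrices, sizes, and toy graph are our own illustrative choices, not the paper's configuration:

```python
import numpy as np

def gin_layer(A, H, W1, W2, eps=0.0):
    """One GIN aggregation step (sketch): each node sums its neighbors'
    embeddings, adds (1 + eps) times its own, then applies a 2-layer MLP.
    A: (N, N) adjacency matrix; H: (N, D) node embeddings.
    W1, W2: illustrative MLP weights (our own, not from the paper)."""
    agg = (1.0 + eps) * H + A @ H          # injective sum aggregation
    hidden = np.maximum(agg @ W1, 0.0)     # ReLU
    return np.maximum(hidden @ W2, 0.0)

# Toy graph: a 3-node path 0-1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3)                              # one-hot initial features
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
H1 = gin_layer(A, H, W1, W2)
print(H1.shape)  # (3, 4)
```

Stacking several such layers gives each node a view of progressively larger neighborhoods, which the readout below exploits.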
2.1.2 Graph Embedding Generation
After node embeddings are generated, UGraphEmb generates one embedding per graph from these node embeddings. Existing methods typically aggregate the node embeddings, either by a simple sum or average or in some more sophisticated way. Appendix B provides a more detailed survey and explains some of the intuitions described next.
However, since our goal is to embed each graph as a single point in the embedding space that preserves graph-graph proximity, the graph embedding generation model should capture structural difference at multiple scales (Xu et al., 2018) and be adaptive to different graph proximity metrics. Thus, we propose the following Multi-Scale Node Attention (MSNA) mechanism. Denote the input node embeddings of graph $G$ as $U_G \in \mathbb{R}^{N \times D}$, where the $n$-th row, $u_n \in \mathbb{R}^{D}$, is the embedding of node $n$. The graph-level embedding is obtained as follows:

$$h_G = \mathrm{MLP}\left( \Big\Vert_{k=1}^{K} \mathrm{ATT}\big(U_G^{(k)}\big) \right), \quad (1)$$

where $\Vert$ denotes concatenation, $K$ denotes the number of neighbor aggregation layers, $\mathrm{ATT}$ denotes the following multi-head attention mechanism that transforms node embeddings into a graph-level embedding, and $\mathrm{MLP}$ denotes multilayer perceptrons with learnable weights applied on the concatenated attention results. The intuition behind Equation 1 is that, instead of only using the node embeddings generated by the last neighbor aggregation layer, we use the node embeddings generated by each of the neighbor aggregation layers. $\mathrm{ATT}$ is defined as follows:
$$\mathrm{ATT}\big(U_G^{(k)}\big) = \sum_{n=1}^{N} \sigma\left( u_n^{\top} \, \mathrm{ReLU}\left( W^{(k)} \Big( \frac{1}{N} \sum_{m=1}^{N} u_m \Big) \right) \right) u_n, \quad (2)$$

where $N$ is the number of nodes, $\sigma$ is the sigmoid function, and $W^{(k)} \in \mathbb{R}^{D \times D}$ is the weight parameters for the $k$-th node embedding layer. During the generation of whole-graph embeddings, the attention weight assigned to each node should be adaptive to the graph proximity metric. To achieve that, the weight is determined by both the node embedding $u_n$ and a learnable graph representation. The learnable graph representation is adaptive to a particular graph proximity via the learnable weight matrix $W^{(k)}$.

2.2 Unsupervised Loss via Inter-Graph Proximity Preservation
2.2.1 Definition of Graph Proximity
The key novelty of UGraphEmb is the use of graph-graph proximity. It is important to select an appropriate graph proximity (similarity/distance) metric.
In this paper, we use GED as an example metric to demonstrate UGraphEmb. GED measures the minimum number of edit operations to transform one graph to the other, where an edit operation on a graph is an insertion or deletion of a node/edge or relabelling of a node. Thus, the GED metric takes both the graph structure and the node labels/attributes into account. Appendix C contains more details on GED.
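To make the metric concrete, the following self-contained brute-force sketch (our own, exponential-time, restricted to unlabeled graphs on the same node set, so only edge insertions/deletions are counted) recovers the GED of two toy graphs by trying every node mapping:

```python
from itertools import permutations

def ged_same_nodes(edges1, edges2, n):
    """Brute-force GED sketch for two unlabeled graphs on n nodes:
    try every node mapping and count the edge insertions/deletions
    it implies. Exponential in n -- illustration only."""
    e1 = {frozenset(e) for e in edges1}
    best = None
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[a], perm[b])) for a, b in edges2}
        cost = len(e1 ^ mapped)  # symmetric difference = edge edits
        best = cost if best is None else min(best, cost)
    return best

# Path 0-1-2 vs. triangle: one edge insertion, so GED = 1.
print(ged_same_nodes([(0, 1), (1, 2)], [(0, 1), (1, 2), (0, 2)], 3))  # 1
```

Exact GED computation is NP-hard in general, which is why the experiments later rely on approximate solvers (Beam, Hungarian, VJ) for ground truth.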
2.2.2 Prediction of Graph Proximity
Once the proximity metric is defined, and the whole-graph embeddings for $G_i$ and $G_j$ are obtained, denoted as $h_{G_i}$ and $h_{G_j}$, we can compute the similarity/distance between the two graphs.
Since GED is a welldefined graph distance metric, we can minimize the difference between the predicted distance and the groundtruth distance:
$$\mathcal{L} = \mathbb{E}_{(G_i, G_j) \sim \mathcal{D}} \left( \hat{d}(G_i, G_j) - d(G_i, G_j) \right)^2, \quad (3)$$

where $(G_i, G_j)$ is a graph pair sampled from the training set $\mathcal{D}$, $d(G_i, G_j)$ is the GED between them, and $\hat{d}(G_i, G_j)$ is the distance predicted from $h_{G_i}$ and $h_{G_j}$.
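A sketch of this objective, using squared Euclidean distance between the two graph-level embeddings as the predicted distance (one plausible parameterization for illustration; the exact predictor in the model may differ):

```python
import numpy as np

def proximity_loss(emb_i, emb_j, true_ged):
    """MSE between predicted and ground-truth distances over a batch of
    graph pairs (sketch). Squared Euclidean distance between the two
    graph embeddings serves as the predicted distance here.
    emb_i, emb_j: (B, D) arrays of graph-level embeddings.
    true_ged: (B,) array of (possibly normalized) GED values."""
    pred = np.sum((emb_i - emb_j) ** 2, axis=1)
    return np.mean((pred - true_ged) ** 2)

emb_i = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_j = np.array([[1.0, 0.0], [0.0, 0.0]])
print(proximity_loss(emb_i, emb_j, np.array([0.0, 1.0])))  # 0.0
```

Minimizing this loss pulls graphs with small GED close together in the embedding space and pushes dissimilar graphs apart.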
After training, the learned neural network can be applied to any graph, and the graph-level embeddings can facilitate a series of downstream tasks and can be fine-tuned for specific tasks. For example, for graph classification, a supervised loss function can be used to further enhance the performance.
3 Experiments
We evaluate our model, UGraphEmb, against a number of state-of-the-art approaches designed for unsupervised node and graph embeddings, on three tasks. The appendices give more details on datasets, data preprocessing, model configurations, evaluation strategies, result analysis, Task 3 (in Appendix G), etc.
3.1 Task 1: Graph Classification
Intuitively, the higher the quality of the embeddings, the better the classification accuracy. Thus, we feed the graph-level embeddings generated by UGraphEmb and the baselines into a logistic regression classifier to evaluate their quality: (1) Graph Kernels (GK, DGK, SP, DSP, WL, and DWL); (2) Graph2Vec (Narayanan et al., 2017); (3) NetMF (Qiu et al., 2018); (4) GraphSAGE (Hamilton et al., 2017). For Graph Kernels, we also try using the kernel matrix and an SVM classifier, as is the standard procedure outlined in (Yanardag & Vishwanathan, 2015), and report the better accuracy of the two. For (3) and (4), we try different averaging schemes on node embeddings to obtain the graph-level embeddings and report their best accuracy. Appendix F.1 gives more details on the experimental settings.
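This evaluation protocol can be sketched with scikit-learn as follows. The embeddings below are synthetic stand-ins (two well-separated Gaussian clusters), since no trained model is at hand; only the frozen-embeddings-plus-logistic-regression recipe itself is from the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Freeze the graph-level embeddings and fit only a logistic regression
# classifier on top of them. Synthetic embeddings stand in for model output.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.3, size=(20, 8)),    # class 0 "graphs"
                 rng.normal(3.0, 0.3, size=(20, 8))])   # class 1 "graphs"
labels = np.array([0] * 20 + [1] * 20)

clf = LogisticRegression(max_iter=1000).fit(emb, labels)
acc = clf.score(emb, labels)  # well-separated clusters -> perfect accuracy
print(acc)
```

Because the classifier is deliberately weak, accuracy differences mostly reflect the quality of the embeddings rather than classifier capacity.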
As shown in Table 1, UGraphEmb without fine-tuning, i.e. using only the unsupervised “inter-graph” information, can already achieve top-2 accuracy on 3 out of 5 datasets and demonstrates competitive accuracy on the other datasets. With fine-tuning (UGraphEmb-F), our model achieves the best result on 3 out of 5 datasets. Methods specifically designed for graph-level embeddings (Graph Kernels, Graph2Vec, and UGraphEmb) consistently outperform methods designed for node-level embeddings (NetMF and GraphSAGE), suggesting that good node-level embeddings do not naturally imply good graph-level representations.
[Table 1: Graph classification accuracy (%) on Ptc, ImdbMulti, Web, Nci109, and Reddit12k for GK, DGK, SP, DSP, WL, DWL, Graph2Vec, NetMF, GraphSAGE, UGraphEmb, and UGraphEmb-F; most cell values were lost in extraction.]
[Table 2: Similarity ranking performance (Kendall's τ and p@10) on Ptc, ImdbMulti, Web, Nci109, and Reddit12k for the baselines, the approximate GED algorithms Beam, Hungarian, VJ, and HED, and UGraphEmb; most cell values were lost in extraction.]
3.2 Task 2: Similarity Ranking
For each dataset, we split it into training, validation, and testing sets by 6:2:2, and report the averaged Mean Squared Error (mse), Kendall’s Rank Correlation Coefficient (τ) (Kendall, 1938), and Precision at 10 (p@10) to test the ranking performance. Appendix F.2 shows more details on the setup.
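The two ranking metrics can be computed as follows. This is a sketch with synthetic distances; `precision_at_k` is our own helper (not a library function), and the p@10 definition assumed here is the overlap between the predicted and true top-10 closest graphs:

```python
import numpy as np
from scipy.stats import kendalltau

def precision_at_k(true_dist, pred_dist, k=10):
    """p@k sketch: of the k graphs ranked closest by the predicted
    distances, what fraction are also in the true top-k?"""
    true_top = set(np.argsort(true_dist)[:k])
    pred_top = set(np.argsort(pred_dist)[:k])
    return len(true_top & pred_top) / k

true_dist = np.arange(20.0)  # ground-truth distances to 20 other graphs
pred_dist = true_dist + np.random.default_rng(1).normal(0, 0.1, 20)
tau, _ = kendalltau(true_dist, pred_dist)
print(tau, precision_at_k(true_dist, pred_dist))
```

With mild noise the predicted ranking matches the true one, so both metrics are near their maximum of 1.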
Table 2 shows that UGraphEmb achieves state-of-the-art ranking performance under all settings except one. This should not be a surprise, because only UGraphEmb utilizes the ground-truth GED results collectively determined by Beam (Neuhaus et al., 2006), Hungarian (Riesen & Bunke, 2009), and VJ (Fankhauser et al., 2011). UGraphEmb even outperforms HED (Fischer et al., 2015), a state-of-the-art approximate GED computation algorithm, under most settings, further confirming its strong ability to generate proximity-preserving graph embeddings by learning from a specific graph proximity metric, which is GED in this case.
References
 Bai et al. (2019) Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. Simgnn: A neural network approach to fast graph similarity computation. WSDM, 2019.
 Belkin & Niyogi (2003) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
 Blumenthal & Gamper (2018) David B Blumenthal and Johann Gamper. On the exact computation of the graph edit distance. Pattern Recognition Letters, 2018.
 Bunke (1983) H Bunke. What is the distance between graphs. Bulletin of the EATCS, 20:35–39, 1983.
 Bunke & Shearer (1998) Horst Bunke and Kim Shearer. A graph distance metric based on the maximal common subgraph. Pattern recognition letters, 19(3-4):255–259, 1998.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852, 2016.
 Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 Du et al. (2018) Lun Du, Yun Wang, Guojie Song, Zhicong Lu, and Junshan Wang. Dynamic network embedding: An extended approach for skip-gram based network embedding. In IJCAI, pp. 2086–2092, 2018.
 Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán AspuruGuzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pp. 2224–2232, 2015.
 Fankhauser et al. (2011) Stefan Fankhauser, Kaspar Riesen, and Horst Bunke. Speeding up graph edit distance computation through fast bipartite matching. In International Workshop on Graph-Based Representations in Pattern Recognition, pp. 102–111. Springer, 2011.
 Fischer et al. (2015) Andreas Fischer, Ching Y Suen, Volkmar Frinken, Kaspar Riesen, and Horst Bunke. Approximation of graph edit distance based on hausdorff matching. Pattern Recognition, 48(2):331–343, 2015.
 Frasconi et al. (1998) Paolo Frasconi, Marco Gori, and Alessandro Sperduti. A general framework for adaptive processing of data structures. IEEE transactions on Neural Networks, 9(5):768–786, 1998.
 Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pp. 1263–1272. JMLR.org, 2017.
 Girija (2016) Sanjay Surendranath Girija. Tensorflow: Largescale machine learning on heterogeneous distributed systems. 2016.
 Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, pp. 855–864. ACM, 2016.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pp. 1024–1034, 2017.

 Hart et al. (1968) Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
 Kearnes et al. (2016) Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
 Kendall (1938) Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 Kipf et al. (2018) Thomas Kipf, Ethan Fetaya, KuanChieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. ICML, 2018.
 Kipf & Welling (2016a) Thomas N Kipf and Max Welling. Semisupervised classification with graph convolutional networks. ICLR, 2016a.

 Kipf & Welling (2016b) Thomas N Kipf and Max Welling. Variational graph autoencoders. NIPS Workshop on Bayesian Deep Learning, 2016b.
 Ktena et al. (2017) Sofia Ira Ktena, Sarah Parisot, Enzo Ferrante, Martin Rajchl, Matthew Lee, Ben Glocker, and Daniel Rueckert. Distance metric learning using graph convolutional networks: Application to functional brain networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 469–477. Springer, 2017.
 Liang & Zhao (2017) Yongjiang Liang and Peixiang Zhao. Similarity search in graph databases: A multilayered indexing approach. In ICDE, pp. 783–794. IEEE, 2017.
 Ma et al. (2018) Tengfei Ma, Cao Xiao, Jiayu Zhou, and Fei Wang. Drug similarity integration through attentive multiview graph autoencoders. IJCAI, 2018.
 Maaten & Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
 Narayanan et al. (2017) Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. graph2vec: Learning distributed representations of graphs. KDD MLG Workshop, 2017.
 Neuhaus et al. (2006) Michel Neuhaus, Kaspar Riesen, and Horst Bunke. Fast suboptimal algorithms for the computation of graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 163–172. Springer, 2006.
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023, 2016.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In SIGKDD, pp. 701–710. ACM, 2014.
 Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. NAACL, 2018.
 Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. WSDM, 2018.
 Qureshi et al. (2007) Rashid Jalal Qureshi, Jean-Yves Ramel, and Hubert Cardot. Graph based shapes representation and recognition. In International Workshop on Graph-Based Representations in Pattern Recognition, pp. 49–60. Springer, 2007.
 Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pretraining. 2018.
 Riesen & Bunke (2008) Kaspar Riesen and Horst Bunke. Iam graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 287–297. Springer, 2008.
 Riesen & Bunke (2009) Kaspar Riesen and Horst Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision computing, 27(7):950–959, 2009.
 Riesen et al. (2013) Kaspar Riesen, Sandro Emmenegger, and Horst Bunke. A novel software toolkit for graph edit distance computation. In International Workshop on Graph-Based Representations in Pattern Recognition, pp. 142–151. Springer, 2013.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Shervashidze & Borgwardt (2009) Nino Shervashidze and Karsten Borgwardt. Fast subtree kernels on graphs. In NIPS, pp. 1660–1668, 2009.
 Shervashidze et al. (2009) Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495, 2009.
 Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Shrivastava & Li (2014) Anshumali Shrivastava and Ping Li. A new space for comparing graphs. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 62–71. IEEE Press, 2014.
 Simonovsky & Komodakis (2017) Martin Simonovsky and Nikos Komodakis. Dynamic edgeconditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
 Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Largescale information network embedding. In WWW, pp. 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. ICLR, 2018.
 Wale et al. (2008) Nikil Wale, Ian A Watson, and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
 Wang et al. (2016) Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In SIGKDD, pp. 1225–1234. ACM, 2016.
 Wang et al. (2017) Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. In AAAI, pp. 203–209, 2017.
 Williams (2001) Christopher KI Williams. On a connection between kernel pca and metric multidimensional scaling. In Advances in neural information processing systems, pp. 675–681, 2001.
 Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Kenichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. ICML, 2018.
 Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? ICLR, 2019.
 Yan et al. (2005) Xifeng Yan, Philip S Yu, and Jiawei Han. Substructure similarity search in graph databases. In SIGMOD, pp. 766–777. ACM, 2005.
 Yanardag & Vishwanathan (2015) Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In SIGKDD, pp. 1365–1374. ACM, 2015.
 Ying et al. (2018) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. NeurIPS, 2018.
 Zhang & Chen (2018) Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NeurIPS, pp. 5171–5181, 2018.
 Zhang et al. (2018) Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An endtoend deep learning architecture for graph classification. In AAAI, 2018.
 Zhang & Chen (2019) Xinyi Zhang and Lihui Chen. Capsule graph neural network. ICLR, 2019.
 Zhao et al. (2018) Xiaohan Zhao, Bo Zong, Ziyu Guan, Kai Zhang, and Wei Zhao. Substructure assembling network for graph classification. AAAI, 2018.
Appendices
A Comparison with Existing Frameworks
To better see the novelty of our proposed framework, UGraphEmb, we present a detailed study on two related existing frameworks for node and graph embeddings. As shown in Figure 2, we summarize graph neural network architectures for learning graph representations into three frameworks:

Framework 1: Supervised/Unsupervised framework for node-level tasks, e.g. node classification, link prediction, etc.

Framework 2: Supervised end-to-end neural networks for graph-level tasks, typically graph classification.

Framework 3: UGraphEmb, an unsupervised framework for multiple graph-level tasks with the key novelty of using graph-graph proximity.
The rest of this section describes the first two frameworks in detail and compares UGraphEmb with various other related methods where appropriate. This section also serves as a more thorough survey of graph embedding methods, providing more background knowledge in the area of node and graph representation learning.
A.1 Framework 1: Node Embedding (Supervised and Unsupervised)
Since the goal is to perform node-level tasks, the key is the “Node Embedding Model”, which produces one embedding per node for the input graph. As described in the main paper, there are many methods to obtain such node embeddings, such as:

Matrix Factorization:
This category includes a vast amount of both early and recent works on network (node) embedding, such as Laplacian Eigenmaps (LLE) (Belkin & Niyogi, 2003), MNMF (Wang et al., 2017), NetMF (Qiu et al., 2018), etc.
Many interesting insights and theoretical analyses have been discovered and presented, but since this work focuses on neural network based methods, the reader is referred to (Qiu et al., 2018) for a complete discussion.

Direct Encoding (Free Encoder):
This simple way of directly initializing one embedding per node randomly can be traced back to the Natural Language Processing domain – the classic
Word2Vec (Mikolov et al., 2013) model randomly initializes one embedding per word, and gradients flow back to the embeddings during optimization. Node embedding methods such as LINE (Tang et al., 2015) and DeepWalk (Perozzi et al., 2014) adopt this scheme. DeepWalk is also known as a “skip-gram based method” (Du et al., 2018) due to its use of Word2Vec.
However, such methods are intrinsically transductive: they cannot handle new nodes unseen in the training set in a straightforward way (Hamilton et al., 2017). For Word2Vec, though, this is typically not a concern, since out-of-vocabulary words tend to be rare in a large text corpus.
This also reveals a fundamental difference between the text domain and the graph domain: words have specific semantic meanings that make them identifiable across different documents, yet nodes in graphs usually lack such identity. This calls for inductive representation learning, which is addressed more recently by the next type of node embedding model.

Graph Convolution (Neighbor Aggregation):
As discussed in the main paper, the Graph Convolutional Network (GCN) (Defferrard et al., 2016) boils down to an aggregation operation applied to every node in a graph. This essentially allows the neural network model to learn a function that maps an input graph to output node embeddings:

$$H = f(A, X), \quad (4)$$

where $A$ and $X$ denote the adjacency matrix (link structure) and the node and/or edge features/attributes, respectively, and $H$ is the node embedding matrix.
The importance of such a function is evident: for any new node outside the training set of nodes, the neighbor aggregation models can simply apply the learned $f$ to obtain its embedding; for any new graph outside the training set of graphs, the same procedure also works, achieving inductivity. Permutation-invariance can also be achieved, as discussed in the main paper.
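A minimal sketch of one such aggregation layer in the style of Kipf & Welling (2016a), showing the learned mapping from (A, X) to node embeddings; the toy graph, weights, and sizes are illustrative:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step (sketch of the learned function
    H = f(A, X)): symmetric normalization of the adjacency with
    self-loops, then a linear map and ReLU."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

A = np.array([[0, 1], [1, 0]], dtype=float)         # a single edge
X = np.eye(2)                                       # one-hot features
W = np.eye(2)                                       # identity weights
H = gcn_layer(A, X, W)
print(H)  # each entry 0.5: both nodes average themselves and their neighbor
```

Because the same weights are applied at every node, the layer transfers unchanged to graphs of any size, which is exactly the inductivity argument above.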
So far we have discussed the node embedding generation step. To make the node embeddings high-quality, additional components are usually necessary as auxiliaries/guidance in the architectures of methods belonging to Framework 1, including:

Predict Node Context:
The goal is to use the node embeddings to reconstruct certain “node local context” – in other words, to force the node embeddings to preserve certain local structure. We highlight three popular types of definitions of such context:

1st order neighbor:
The model encourages directly connected nodes to have similar embeddings. In LINE-1st, the loss function is similar to the Skip-gram objective proposed in Word2Vec. In SDNE (Wang et al., 2016), an autoencoder framework, the loss function is phrased as the reconstruction loss, the typical name in autoencoders.

Higher order context:
An example is LINE-2nd, which assumes that nodes sharing many connections to other nodes are similar to each other. In practice, such incorporation of higher-order neighbors typically gives better performance in node-level tasks.

Random walk context:
“Context nodes” are defined as the nodes that co-occur on random walks over a graph. By this definition, for a given node, its context can include both its close-by neighbors and distant nodes. Equipped with various techniques for tuning and improving random walks, this type of method seems promising.
Example methods include DeepWalk, Node2Vec (Grover & Leskovec, 2016), GraphSAGE, etc. Notice that the former two use direct encoding as their node embedding model, as described previously, while GraphSAGE uses neighbor aggregation. From this, we can also see that Framework 1 indeed includes a vast number of models and architectures.


Predict Node Label:
So far, all the methods we have discussed are unsupervised node embedding methods. As noted in the main paper, to evaluate these unsupervised node embeddings, a second stage is needed, which can be viewed as a series of downstream tasks as listed in Figure 2.
However, a large number of existing works incorporate a supervised loss function into the model, making the entire model trainable end-to-end.
Before finishing presenting Framework 1, we highlight important distinctions between the proposed framework and the following baseline methods:
A.1.1 UGraphEmb vs NetMF
NetMF is among the state-of-the-art matrix factorization based methods for node embeddings. It performs eigen-decomposition, one-side bounding, rank approximation by Singular Value Decomposition, etc. on a graph, and is transductive. UGraphEmb is graph-level and inductive. Section F.1.2 gives more details on how we obtain graph-level embeddings out of node-level embeddings for NetMF.

A.1.2 UGraphEmb vs GraphSAGE
GraphSAGE belongs to the neighbor aggregation based methods. Although unsupervised and inductive, GraphSAGE by design performs node-level embedding via an unsupervised loss based on context nodes on random walks (denoted as “Random walk context” in Figure 2), while UGraphEmb performs graph-level embedding via the MSNA mechanism, capturing structural difference at multiple scales and adapting to a given graph similarity/distance metric.
A.2 Framework 2: Supervised Graph Embedding
The second framework we identify is supervised graph embedding. So far, graph classification is the dominant and perhaps the only important task for Framework 2.
Here we highlight some existing works to demonstrate its popularity, including Patchy-San (Niepert et al., 2016), ECC (Simonovsky & Komodakis, 2017), Set2Set (Gilmer et al., 2017), GraphSAGE, DGCNN/SortPool (Zhang et al., 2018), SAN (Zhao et al., 2018), DiffPool (Ying et al., 2018), CapsGNN (Zhang & Chen, 2019), etc.
Notice that most of these models adopt the neighbor aggregation based node embedding methods described previously, which are inductive, so that graph-level embeddings can be generated for new graphs outside the training set, and graph classification can be performed.
A.2.1 UGraphEmb vs Graph Kernels
Although Graph Kernels (Yanardag & Vishwanathan, 2015) are not supervised methods, we still make a comparison here, because Graph Kernels are a family of methods designed for graph classification, the same task as Framework 2.
Different graph kernels extract different types of substructures in a graph, e.g. graphlets (Shervashidze et al., 2009), subtree patterns (Shervashidze & Borgwardt, 2009), etc., and the resulting vector representation for each graph is typically called a “feature vector” (Yanardag & Vishwanathan, 2015), encoding the count/frequency of substructures. These feature vectors are analogous to graph-level embeddings, but the end goal is to create a kernel matrix encoding the similarity between all graph pairs in the dataset, which is fed into a kernel SVM classifier for graph classification.
Compared with graph kernels, UGraphEmb learns a function that preserves a general graph similarity/distance metric such as Graph Edit Distance (GED) and, as a result, yields a graph-level embedding for each graph that can facilitate a series of downstream tasks. It is inductive, i.e. it handles unseen graphs via the learned function. In contrast, although Graph Kernels can be considered inductive (Shervashidze et al., 2011), they have to perform the substructure extraction for every graph, which can be slow and cannot adapt to different graph proximity metrics.
A.2.2 UGraphEmb vs Graph2Vec
Similar to DeepWalk, Graph2Vec is also inspired by the classic Word2Vec paper, but instead of generating node embeddings, it generates graph-level embeddings by treating each graph as a bag of rooted subgraphs and adopting Doc2Vec (Mikolov et al., 2013) instead of Word2Vec. The difference between Graph2Vec and UGraphEmb is that Graph2Vec is transductive (similar to Graph Kernels), as explained in Section A.1, while UGraphEmb is inductive.
A.3 Framework 3: UGraphEmb
This is our proposed framework, and it is the key novelty of the paper. Now that we have introduced Framework 1 and Framework 2, it can be clearly seen that the use of graph-graph proximity is a very different and new perspective on performing graph-level embedding. UGraphEmb satisfies all of the following properties: graph-level, unsupervised, and inductive.
A.4 Summary
[Table 3: Whether each method is graph-level (G), unsupervised (U), and inductive (I). Methods compared: LLE (Belkin & Niyogi, 2003), GCN (Kipf & Welling, 2016a), GIN (Xu et al., 2019), DiffPool (Ying et al., 2018), Graph Kernels, Graph2Vec (Narayanan et al., 2017), NetMF (Qiu et al., 2018), GraphSAGE (Hamilton et al., 2017), and UGraphEmb (this paper); the property checkmarks did not survive extraction.]
In Table 3, we can see that only UGraphEmb and Graph Kernels satisfy all three properties. However, Graph Kernels cannot adapt to or learn from general graph proximity metrics like GED.
B Graph-Level Embedding Generation
B.1 A Brief Survey
Over the years, many methods have been proposed for generating graph-level embeddings from node-level embeddings; the most popular ones include:

Sum/Average:
The graph embedding is simply the summation or average of the node embeddings (Duvenaud et al., 2015).
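As a concrete illustration, the sum and average readouts amount to a single NumPy reduction over the node-embedding matrix (the embedding values below are made up):

```python
import numpy as np

# Hypothetical node embeddings for a 4-node graph, dimension 3.
node_emb = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.0, 0.1],
    [0.2, 0.5, 0.2],
    [0.3, 0.1, 0.4],
])

graph_emb_sum = node_emb.sum(axis=0)   # sum readout
graph_emb_avg = node_emb.mean(axis=0)  # average readout
```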

Supersource:
Dating back to early works (Frasconi et al., 1998; Scarselli et al., 2009), the idea is to introduce a "dummy/super node" connected to every node in the graph, so that during aggregation this supersource node absorbs information from all nodes. Its embedding is treated as the graph-level embedding.
The edge between the supersource node and every other node is directed, so that information flows from the other nodes to the supersource node. Otherwise, two faraway nodes would affect each other through the supersource node.
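A minimal sketch of this construction with networkx (the node id `supersource` is a hypothetical choice; the GNN aggregation itself is omitted):

```python
import networkx as nx

def add_supersource(g: nx.Graph) -> nx.DiGraph:
    """Return a directed copy of g with a supersource node.

    Each original undirected edge becomes two directed edges, and every
    original node gets a directed edge pointing *to* the supersource, so
    information flows into it but not back out.
    """
    dg = g.to_directed()          # both directions for original edges
    s = "supersource"             # hypothetical node id
    dg.add_node(s)
    for v in g.nodes:
        dg.add_edge(v, s)         # one-way edge: node -> supersource
    return dg

g = nx.path_graph(3)              # 0-1-2
dg = add_supersource(g)
```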

Coarsening/Pooling:

Others:
B.2 Design of Whole-Graph Embedding Generation Model
However, since our goal is to embed each graph as a single point in the embedding space that preserves graph-graph proximity, the graph embedding generation model should:

Capture structural difference at multiple scales.
Applying a neighbor aggregation layer such as GIN on nodes causes information to flow from a node to its direct neighbors, so sequentially stacking $K$ layers causes the final representation of a node to include information from its $K$-th order neighbors.
However, after many neighbor aggregation layers, the learned embeddings could be too coarse to capture the structural difference in small local regions between two similar graphs. Capturing structural difference at multiple scales is therefore important for UGraphEmb to generate high-quality graph-level embeddings.

Be adaptive to different graph proximity metrics.
UGraphEmb is a general framework that should be able to preserve graph-graph proximity under any graph proximity metric, such as GED and MCS. A simple aggregation of node embeddings without any learnable parameters limits the expressive power of existing graph-level embedding models.
These existing methods are based on aggregation of node embeddings, without the flexibility/learnability to adapt to different graph similarity/distance metrics, and lack the capability to capture structural difference at multiple scales. We therefore propose a Multi-Scale Node Attention (MSNA) mechanism to address both issues for UGraphEmb.
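The following is an illustrative NumPy sketch of a multi-scale attention readout in the spirit of MSNA, not the paper's exact equations; `msna_readout` and the stand-in weight matrices are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def msna_readout(layer_embs, weight_mats):
    """Sketch of a multi-scale node attention readout.

    layer_embs:  list of (N, d_k) node-embedding matrices, one per GNN layer.
    weight_mats: list of (d_k, d_k) matrices standing in for learned weights.
    Each scale is summarized by an attention-weighted sum of its node
    embeddings; the per-scale summaries are concatenated.
    """
    parts = []
    for h, w in zip(layer_embs, weight_mats):
        context = np.tanh(h.mean(axis=0) @ w)   # global context vector
        att = sigmoid(h @ context)              # one attention weight per node
        parts.append((att[:, None] * h).sum(axis=0))
    return np.concatenate(parts)
```

Each GNN layer contributes one summary, so structural information at different scales is kept separate rather than averaged away.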
C Graph Proximity Metrics
C.1 Graph Proximity Metric Selection
Since the key novelty of UGraphEmb is the use of graph-graph proximity, it is important to select an appropriate graph proximity (similarity/distance) metric. We identify three categories of candidates:

Proximity defined by graph labels.
For graphs that come with labels, we may treat graphs with the same label as similar to each other. However, such a proximity metric may be too coarse, as it cannot distinguish between graphs of the same label.

Proximity given by domain knowledge or human experts.
For example, in drug-drug interaction detection (Ma et al., 2018), a domain-specific metric encoding compound chemical structure can be used to compute similarities between chemical graphs. However, such metrics do not generalize to graphs in other domains, and this information can be very expensive to obtain. For example, to measure brain network similarities, a domain-specific preprocessing pipeline involving skull stripping, band-pass filtering, etc. is needed; the final dataset contains networks from only 871 humans (Ktena et al., 2017).

Proximity defined by domainagnostic and wellaccepted metrics.
In this paper, we use GED as an example metric to demonstrate UGraphEmb.
C.2 Graph Edit Distance (GED)
The edit distance between two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ (Bunke, 1983) is the number of edit operations in the optimal alignment that transforms $\mathcal{G}_1$ into $\mathcal{G}_2$, where an edit operation on a graph is an insertion or deletion of a node/edge or the relabeling of a node. Note that other variants of the GED definition exist (Riesen et al., 2013); we adopt the most basic version in this work. Fig. 3 shows an example of the GED between two simple graphs.
Notice that although UGraphEmb currently does not handle edge types, it is a general framework and can be extended to do so, e.g. by adapting the graph neural network described in (Kipf et al., 2018).
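For small graphs, this basic GED can be computed directly, e.g. with networkx, whose solver is exact but exponential-time and thus only practical for small inputs:

```python
import networkx as nx

# A triangle becomes a 3-node path by deleting one edge, so under the
# basic GED definition (unlabeled nodes) their edit distance is 1.
g1 = nx.cycle_graph(3)  # triangle
g2 = nx.path_graph(3)   # path 0-1-2

ged = nx.graph_edit_distance(g1, g2)
```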
C.3 Graph Proximity Metric and Loss Function
Multidimensional scaling (MDS) is a classic form of dimensionality reduction (Williams, 2001). The idea is to embed data points in a low-dimensional space so that their pairwise distances are preserved, e.g. via minimizing the loss function

$L = \sum_{(i,j)} \left( \lVert y_i - y_j \rVert_2 - d_{ij} \right)^2$,  (5)

where $y_i$ and $y_j$ are the embeddings of points $i$ and $j$, and $d_{ij}$ is their distance.
Denote the whole-graph embeddings of $\mathcal{G}_i$ and $\mathcal{G}_j$ as $h_i$ and $h_j$. Since GED is a well-defined graph distance metric, we can minimize the difference between the predicted distance and the ground-truth distance:

$L = \mathbb{E}_{(i,j)} \left( \lVert h_i - h_j \rVert_2^2 - d_{ij} \right)^2$,  (6)

where $(i,j)$ is a graph pair sampled from the training set and $d_{ij}$ is the GED between them.
Alternatively, if the graph proximity metric is a similarity, as in the case of MCS, we can use the following loss function:

$L = \mathbb{E}_{(i,j)} \left( h_i^{\top} h_j - s_{ij} \right)^2$,  (7)

where $s_{ij}$ is the ground-truth similarity between the pair.
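A minimal NumPy sketch of these two training losses, assuming mini-batches of embedding pairs (`distance_loss` and `similarity_loss` are hypothetical helper names):

```python
import numpy as np

def distance_loss(h_i, h_j, true_dist):
    """MSE between predicted and ground-truth distances over a batch.

    h_i, h_j:  (B, d) graph-level embeddings of the paired graphs.
    true_dist: (B,) ground-truth (e.g. normalized GED) distances.
    Predicted distance = squared L2 distance between the embeddings.
    """
    pred = ((h_i - h_j) ** 2).sum(axis=1)
    return float(((pred - true_dist) ** 2).mean())

def similarity_loss(h_i, h_j, true_sim):
    """Analogous loss when the metric is a similarity (e.g. MCS):
    predicted similarity = inner product of the two embeddings."""
    pred = (h_i * h_j).sum(axis=1)
    return float(((pred - true_sim) ** 2).mean())
```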
D Datasets
D.1 Detailed Description of Datasets
Five real graph datasets are used for the experiments. A concise summary can be found in Table 4.
Dataset  Meaning  #Node Labels  #Graphs  #Graph Labels  Min  Max  Mean  Std 

Ptc  Chemical Compounds  19  344  2  2  109  25.5  16.2 
ImdbMulti  Social Networks  1  1500  3  7  89  13.0  8.5 
Web  Text Documents  15507  2340  20  3  404  35.5  37.0 
Nci109  Chemical Compounds  38  4127  2  4  106  29.6  13.5 
Reddit12k  Social Networks  1  11929  11  2  3782  391.4  428.7 
Statistics of datasets. “Min”, “Max”, “Mean”, and “Std” refer to the minimum, maximum, mean, and standard deviation of the graph sizes (number of nodes), respectively.

Ptc (Shrivastava & Li, 2014) is a collection of 344 chemical compounds reported for their carcinogenicity in rats. There are 19 distinct node labels.

ImdbMulti (Yanardag & Vishwanathan, 2015) consists of 1500 ego-networks of movie actors/actresses, with unlabeled nodes representing people and edges representing collaboration relationships. Each graph carries one of 3 graph labels.

Web (Riesen & Bunke, 2008) is a collection of 2340 documents from 20 categories. Each node is a word, and there is an edge between two words if one word precedes the other. Since one word can appear in multiple sentences, the entire document is represented as a graph. Only the most frequent words are used to construct the graph, and there are 15507 words in total, thus 15507 node types associated with the dataset.

Nci109 (Wale et al., 2008) is another bioinformatics dataset. It contains 4127 chemical compounds tested for their ability to suppress or inhibit human tumor cell growth.

Reddit12k (Yanardag & Vishwanathan, 2015) contains 11929 graphs, each corresponding to an online discussion thread where nodes represent users and an edge indicates that one of the two users responded to the other's comment. Each of these 11929 discussion graphs is associated with one of 11 graph labels, representing the category of the community.
D.2 Additional Notes on Web
Since each graph node in Web represents a word, it is natural to consider incorporating the semantic similarity between two words, e.g. via Word2Vec, into the GED definition, connecting to the broader topic of text matching and retrieval.
In fact, the definition of GED does not require node labels to be discrete. There exist variants of the GED definition that define node label differences in more sophisticated ways (Riesen et al., 2013), which is a promising direction to explore in the future. It is also promising to explore document embedding based on graph representations of text.
E Data Preprocessing
For each dataset, we randomly split 60%, 20%, and 20% of all the graphs as training set, validation set, and testing set, respectively. For each graph in the testing set, we treat it as a query graph, and let the model compute the distance between the query graph and every graph in the training and validation sets.
E.1 Ground-Truth GED Computation
To compute ground-truth GEDs for training pair generation as well as for similarity ranking evaluation, we consider the following candidate GED computation algorithms:

A* (Hart et al., 1968):
It is an exact GED solver, but due to the NP-hardness of GED, it runs in exponential time. Worse, a recent study shows that no currently available algorithm can reliably compute GED within reasonable time between graphs with more than 16 nodes (Blumenthal & Gamper, 2018).

Beam (Neuhaus et al., 2006), Hungarian (Riesen & Bunke, 2009), VJ (Fankhauser et al., 2011):
They are approximate GED computation algorithms with subexponential, quadratic, and quadratic time complexity, respectively. All three are guaranteed to return upper bounds of the exact GEDs, i.e. their computed GEDs are always greater than or equal to the exact GEDs.

HED (Fischer et al., 2015):
It is another approximate GED solver running in quadratic time, but instead yields lower bounds of exact GEDs.
We take the minimum distance computed by Beam (Neuhaus et al., 2006), Hungarian (Riesen & Bunke, 2009), and VJ (Fankhauser et al., 2011). The minimum is taken because their returned GEDs are guaranteed to be upper bounds of the true GEDs. In fact, the ICPR 2016 Graph Distance Contest (https://gdc2016.greyc.fr/) also adopts this approach to handle large graphs.
We normalize the GEDs according to the following formula: $\mathrm{nGED}(\mathcal{G}_1, \mathcal{G}_2) = \frac{\mathrm{GED}(\mathcal{G}_1, \mathcal{G}_2)}{(|\mathcal{G}_1| + |\mathcal{G}_2|)/2}$, where $|\mathcal{G}_i|$ denotes the number of nodes of $\mathcal{G}_i$ (Qureshi et al., 2007).
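Assuming the normalization divides the raw GED by the mean number of nodes of the two graphs (Qureshi et al., 2007), the computation is a one-liner (`normalized_ged` is a hypothetical helper name):

```python
def normalized_ged(ged, n1, n2):
    """Normalize a raw GED by the mean number of nodes of the two graphs.

    ged: raw (un-normalized) GED; n1, n2: node counts of the two graphs.
    """
    return ged / ((n1 + n2) / 2.0)
```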
For the smaller datasets Ptc, ImdbMulti, and Web, we compute the ground-truth GEDs for all pairs in the training set. For the larger datasets Nci109 and Reddit12k, we do not compute all pairs, instead capping the computation at around 10 hours.
We run the ground-truth GED solvers on a CPU server with 32 cores, utilizing at most 20 cores at the same time, using code from (Riesen et al., 2013). The details are shown in Table 5.
Dataset  #Total Pairs  #Comp. Pairs  Time 

Ptc  118336  42436  9.23 Mins 
ImdbMulti  2250000  810000  4.72 Hours 
Web  5475600  1971216  8.23 Hours 
Nci109  17032129  2084272  10.05 Hours 
Reddit12k  142301041  2124992  10.42 Hours 
E.2 "Hyper-Level" Graph
At this point, it is worth mentioning that the training procedure of UGraphEmb is stochastic, i.e. UGraphEmb is trained on a subset of graph pairs in each iteration. Moreover, UGraphEmb does not require the computation of all graph pairs in the training set, so the notion of a "hyper-level" graph mentioned in the main paper does not imply that UGraphEmb constructs a fully connected graph where each node is a graph in the dataset.
In the future, it would be promising to explore other techniques for constructing such a "hyper-level" graph beyond the current random selection of graph pairs in the training set.
E.3 Node Label Encoding
For Ptc, Web, and Nci109, the original node representations are one-hot encoded according to the node labels. For graphs with unlabeled nodes, i.e. ImdbMulti and Reddit12k, we treat every node as having the same label, resulting in the same constant vector as the initial representation.

In the future, it would be interesting to consider more sophisticated ways to encode node labels, because node labels help identify corresponding nodes across different graph datasets, which is an important component of a successful pretraining method for graphs. Consider combining multiple graph datasets from different domains for large-scale pretraining of graph neural networks: how to handle the different node labels across datasets and domains then becomes an important issue.
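The two initialization schemes can be sketched as follows (`one_hot_init` and `constant_init` are hypothetical helper names):

```python
import numpy as np

def one_hot_init(node_labels, num_labels):
    """One-hot initial node representations from integer node labels."""
    x = np.zeros((len(node_labels), num_labels))
    x[np.arange(len(node_labels)), node_labels] = 1.0
    return x

def constant_init(num_nodes):
    """For unlabeled graphs: every node gets the same constant feature."""
    return np.ones((num_nodes, 1))
```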
F Parameter Settings and Experimental Details
We evaluate our model, UGraphEmb, against a number of state-of-the-art approaches designed for unsupervised node and graph embeddings, to answer the following questions:

How effective are the graph-level embeddings generated by UGraphEmb when evaluated on downstream tasks, including graph classification and similarity ranking?

Do the graph-level embeddings generated by UGraphEmb provide meaningful visualizations for the graphs in a graph database?

Is the quality of the embeddings generated by UGraphEmb sensitive to choices of hyperparameters?
For the proposed model, to make a fair comparison with baselines, we use a single network architecture on all the datasets, and run the model using exactly the same test graphs as used in the baselines.
We set the number of GIN layers to 3 and use ReLU as the activation function. The output dimensions for the 1st, 2nd, and 3rd GIN layers are 256, 128, and 64, respectively. Following the original GIN paper (Xu et al., 2019), we fix $\epsilon$ to 0. We then transform the concatenated embeddings into graph-level embeddings of dimension 256 using two fully connected (dense) layers, as described in the main paper.
The model is written in TensorFlow (Girija, 2016). We conduct all the experiments on a single machine with an Intel i76800K CPU and one Nvidia Titan GPU. As for training, we set the batch size to 256, i.e. 256 graph pairs (512 graphs) per minibatch, use the Adam algorithm for optimization (Kingma & Ba, 2015), and fix the initial learning rate to 0.001. We set the number of iterations to 20000, and select the best model based on the lowest validation loss.
F.1 Task 1: Graph Classification
F.1.1 Evaluation Procedure
Since UGraphEmb is unsupervised, we evaluate all methods following the standard strategy for evaluating unsupervised node embeddings (Tang et al., 2015; Wang et al., 2016). It has three stages: (1) train a model on the training set, using the validation set for parameter tuning; (2) train a standard logistic regression classifier using the embeddings as features along with their ground-truth graph labels; (3) run the model on the graphs in the testing set and feed their embeddings into the classifier for label prediction.
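Stages (2) and (3) can be sketched as follows; the embeddings here are synthetic stand-ins for the output of stage (1), and scikit-learn provides the logistic regression classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical graph-level embeddings with two well-separated classes,
# standing in for the output of a trained unsupervised embedding model.
train_emb = np.vstack([rng.normal(0.0, 0.1, (20, 8)),
                       rng.normal(3.0, 0.1, (20, 8))])
train_labels = np.array([0] * 20 + [1] * 20)
test_emb = np.vstack([rng.normal(0.0, 0.1, (5, 8)),
                      rng.normal(3.0, 0.1, (5, 8))])
test_labels = np.array([0] * 5 + [1] * 5)

clf = LogisticRegression().fit(train_emb, train_labels)  # stage (2)
accuracy = clf.score(test_emb, test_labels)              # stage (3)
```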
F.1.2 Baseline Setup
By default, we use the results reported in the original work for baseline comparison. However, in cases where the results are not available, we use the code released by the original authors, performing a hyperparameter search based on the original author’s guidelines. Notice that our baselines include a variety of methods of different flavors:

Graph Kernels:
For the Graph Kernels baselines, there are two evaluation schemes: (1) treat the features extracted by each kernel method as the graph-level embeddings, and perform the second and third stages described previously; (2) feed the kernel matrix generated by each method into a kernel SVM classifier as in (Yanardag & Vishwanathan, 2015). The second scheme typically yields better accuracy and is the more common practice. We perform both schemes and report the better of the two accuracy scores for each baseline kernel. All six versions of the Graph Kernels are described in detail in (Yanardag & Vishwanathan, 2015).
Graph2Vec:
Similar to Graph Kernels, Graph2Vec is also transductive, meaning it has to see all the graphs in both the training set and the testing set, and generates a graph-level embedding for each.

NetMF and GraphSAGE:
Since they generate node-level embeddings, we take the average of the node embeddings as the graph-level embedding. We also try various averaging schemes, including weighting by node degree, weighting by the inverse of node degree, and summation. We report the best accuracy achieved by these schemes.
No training is needed for NetMF, since it is based on matrix factorization. For GraphSAGE, we combine all graphs in the training set into one single graph to train GraphSAGE, consistent with its original design for inductive node-level embeddings. After training, we use the trained GraphSAGE model to generate a graph-level embedding for each individual graph in the test set, consistent with how UGraphEmb handles test graphs.
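The averaging variants can be sketched as follows (`pooled_embedding` is a hypothetical helper; the degree weights are assumed to be normalized to sum to one):

```python
import numpy as np

def pooled_embedding(node_emb, degrees, scheme="mean"):
    """Aggregate node embeddings into a single graph-level embedding.

    node_emb: (N, d) node embeddings; degrees: (N,) node degrees.
    Schemes mirror the variants tried for NetMF and GraphSAGE: plain mean,
    summation, degree-weighted mean, and inverse-degree-weighted mean.
    """
    if scheme == "mean":
        return node_emb.mean(axis=0)
    if scheme == "sum":
        return node_emb.sum(axis=0)
    if scheme == "degree":
        w = degrees / degrees.sum()
    elif scheme == "inv_degree":
        inv = 1.0 / degrees
        w = inv / inv.sum()
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return (w[:, None] * node_emb).sum(axis=0)
```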
F.1.3 Embedding Dimension
For all baseline methods, we ensure that the dimension of the graph-level embeddings matches that of UGraphEmb by setting hyperparameters appropriately. Graph Kernels, however, extract and count subgraphs, and the number of unique subgraphs extracted depends on the dataset, which determines the dimension of the feature vector for each graph. Thus, we do not limit the number of subgraphs they extract, giving them an advantage, and follow the guidelines in their original papers for hyperparameter tuning.
F.1.4 Fine-Tuning
To incorporate the supervised loss function (cross-entropy loss) into our model, we use multiple fully connected layers to reduce the dimension of the graph-level embeddings to the number of graph labels. When the fine-tuning procedure starts, we switch to the supervised loss function to train the model with the same learning rate and batch size as before.
After fine-tuning, the graph label information is integrated into the graph-level embeddings. We still feed the embeddings into the logistic regression classifier for evaluation, to keep the protocol consistent across all configurations of all models. The accuracy based on the model's own prediction is typically much higher because it utilizes supervised information for graph label prediction.
F.2 Task 2: Similarity Ranking
F.2.1 Evaluation Procedure
For all methods, we adopt the procedure outlined in Section F.1.2 to obtain graph-level embeddings. We then treat each graph in the test set as a query, compute the similarity/distance score between the query graph and every graph in the training set, and compare the resulting ranking with the ground-truth ranking produced by the ground-truth GED solvers.
We compute both a similarity score (inner product of two graph-level embeddings) and a distance score (squared L2 distance between two graph-level embeddings) for every graph pair when querying, and report the better of the two in the paper. To verify that the actual ranking of the graphs makes sense, we perform several case studies. As shown in Figure 4, UGraphEmb computes the distance score between the query and every graph in the training set. Although the predicted distance scores do not exactly match the ground-truth normalized GEDs, the relative positions and ranking are quite reasonable.
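A sketch of the ranking step and a precision-at-k check against a ground-truth ranking (`rank_by_distance` and `precision_at_k` are hypothetical helper names; the distance is the squared L2 distance mentioned above):

```python
import numpy as np

def rank_by_distance(query_emb, db_embs):
    """Indices of database graphs sorted by squared L2 distance (ascending)."""
    dist = ((db_embs - query_emb) ** 2).sum(axis=1)
    return np.argsort(dist)

def precision_at_k(pred_order, true_order, k):
    """Fraction of the ground-truth top-k found in the predicted top-k."""
    return len(set(pred_order[:k]) & set(true_order[:k])) / k
```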
Notice that for Graph Kernels, the three deep versions, i.e. DGK, DSP, and DWL, generate the same graph-level embeddings as the non-deep versions, i.e. GK, SP, and WL, but use the idea of Word2Vec to model the relations between subgraphs (Yanardag & Vishwanathan, 2015). Consequently, the non-deep versions simply compute dot products between embeddings to generate the kernel matrices, while the deep versions further modify the kernel matrices, resulting in different graph-graph similarity scores. We thus evaluate the deep versions using their modified kernel matrices.
G More Experiments
G.1 Task 3: Embedding Visualization
Visualizing embeddings in a two-dimensional space is a popular way to evaluate node embedding methods (Tang et al., 2015). However, we are among the first to investigate the question: Do the graph-level embeddings generated by a model like UGraphEmb provide meaningful visualizations?
We feed the graph embeddings learned by all methods into the visualization tool t-SNE (Maaten & Hinton, 2008). The three deep graph kernels, i.e. DGK, DSP, and DWL, generate the same embeddings as the non-deep versions, but use additional techniques (Yanardag & Vishwanathan, 2015) to modify the similarity kernel matrices, resulting in different classification and ranking performance.
From Figure 5, we can see that UGraphEmb can separate the graphs in ImdbMulti into multiple clusters, where graphs in each cluster share some common substructures.
Such a clustering effect is likely due to our use of graph-graph proximity scores, and is thus not observed for NetMF or GraphSAGE. Graph Kernels and Graph2Vec do show clustering effects, but examining the actual graphs shows that graph-graph proximity is not well preserved by their clusters (e.g. for WL, graphs 1, 2, and 9 should be close to each other; for Graph2Vec, graphs 1, 2, and 12 should be close to each other), explaining their worse similarity ranking performance compared to UGraphEmb.
G.1.1 Detailed Evaluation Procedure
We feed the graph-level embeddings into the t-SNE (Maaten & Hinton, 2008) tool to project them into a 2D space. We then perform the following linear interpolation: (1) select two points in the 2D space; (2) form a line segment between the two selected points; (3) split the line into 11 equal-length segments, resulting in 12 points on the line segment in total; (4) for each of these 12 points, find the embedding point in the 2D space closest to it; (5) label the 12 embedding points on the embedding plot and draw the actual graph to the right of the plot. This yields the visualization of the ImdbMulti dataset.

G.2 Parameter Sensitivity of UGraphEmb
We evaluate how the dimension of the graph-level embeddings and the percentage of graph pairs with ground-truth GEDs used to train the model affect the results. We report graph classification accuracy on ImdbMulti.
As can be seen in Figure 6, the performance becomes marginally better as larger dimensions are used. For the percentage of training pairs with ground-truth GEDs, the performance drops as fewer pairs are used. Note that the x-axis is in log scale. Even when we use only 0.001% of all training graph pairs (just 8 pairs with ground-truth GEDs), the performance is still better than many baseline methods, exhibiting impressive insensitivity to data sparsity.
H Analysis and Discussion of Experimental Results
On graph classification, UGraphEmb does not achieve top-2 performance on Web and Nci109, which can be attributed to the many node labels associated with these two datasets, as shown in Table 4. Combined with the fact that we use one-hot encoding for the initial node representations, as described in Section E.3, UGraphEmb has limited capacity to capture the wide variety of node labels. In the future, it is promising to explore other node encoding techniques, such as encodings based on node degrees and clustering coefficients (Ying et al., 2018).
Another possible reason is that the current definition of GED cannot capture the subtle differences between node labels. For example, a Carbon atom may be chemically more similar to a Nitrogen atom than to a Hydrogen atom, which should be reflected in the graph proximity metric. As mentioned in Section D.2, there are other definitions of GED that can handle such cases.
I Analysis of the Multi-Scale Node Attention (MSNA) Mechanism
Table 6 shows how much performance gain our proposed Multi-Scale Node Attention (MSNA) mechanism brings to UGraphEmb. As can be seen in the table, a simple averaging scheme for generating graph-level embeddings yields much worse performance than the other three mechanisms. The supersource approach performs reasonably but is still worse than the attention mechanisms, which bring learnable components into the model. The MSNA mechanism combines the node embeddings from different GIN layers, capturing structural difference at different scales and yielding the best performance.
Method  acc  p@10  

UGraphEmb-avg  34.73  0.243  0.647 
UGraphEmb-ssrc  46.02  0.810  0.796 
UGraphEmb-na  49.51  0.851  0.810 
UGraphEmb-msna  50.06  0.853  0.816 
Please note that under all four settings, the node embedding layers are exactly the same, i.e. three sequential GIN layers. From UGraphEmb-avg, we can see that the learnable components in the node embedding model alone are not enough to produce good graph-level embeddings.
It is also worth mentioning that the supersource idea works reasonably well, which can be attributed to the fact that the supersource node is connected to every other node in the graph, so every node passes information to the supersource node, contributing to a relatively informative graph-level embedding. Compared with the averaging scheme, there is an additional MLP transformation on the supersource node's embedding after the aggregation of the other node embeddings, as indicated by the GIN equation in the main paper.