1 Introduction
Message-passing neural networks (MPNNs), such as GNN (Scarselli et al., 2008), ChebNet (Defferrard et al., 2016), GGNN (Li et al., 2016), and GCN (Kipf and Welling, 2017), are powerful for learning on graphs, with applications ranging from brain networks to online social networks (Gilmer et al., 2017; Wang et al., 2019). In a layer of an MPNN, each node sends its feature representation, a “message”, to the nodes in its neighborhood, and then updates its feature representation by aggregating all “messages” received from the neighborhood. The neighborhood is often defined as the set of adjacent nodes in the graph. By adopting permutation-invariant aggregation functions (e.g., summation, maximum, and mean), MPNNs are able to learn representations that are invariant across isomorphic graphs, i.e., graphs that are topologically identical.
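As a minimal sketch of the layer described above, assuming a mean aggregator and no learnable weights, the following Python snippet performs one message-passing step. The function name `mean_message_passing` and the toy graph are illustrative assumptions, not part of any cited method.

```python
import numpy as np

def mean_message_passing(features, neighbors):
    """One message-passing step: each node averages its neighbors' features.

    features:  (num_nodes, dim) array of node features.
    neighbors: dict mapping a node index to a list of adjacent node indices.
    """
    updated = np.empty_like(features)
    for v in range(features.shape[0]):
        nbrs = list(neighbors.get(v, []))
        if nbrs:
            # The mean treats the incoming messages as an unordered set.
            updated[v] = features[nbrs].mean(axis=0)
        else:
            # Assumption: an isolated node keeps its own features.
            updated[v] = features[v]
    return updated

features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0]}
out = mean_message_passing(features, neighbors)

# Listing the neighbors of node 0 in a different order gives the same result.
shuffled = mean_message_passing(features, {0: [2, 1], 1: [0], 2: [0]})
```

Because the mean ignores the order of its inputs, `out` and `shuffled` coincide, which is the permutation invariance that MPNN aggregators rely on.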
Although existing MPNNs have been successfully applied in a wide variety of scenarios, two fundamental weaknesses of MPNNs’ aggregators limit their ability to represent graph-structured data. Firstly, the aggregators lose the structural information of nodes in neighborhoods. Permutation invariance is an essential requirement for any graph learning method. To meet it, existing MPNNs adopt permutation-invariant aggregation functions, which treat all “messages” from a neighborhood as a set. For instance, GCN simply sums the normalized “messages” from all one-hop neighbors (Kipf and Welling, 2017). Such aggregation loses the structural information of nodes in the neighborhood because it does not distinguish the “messages” from different nodes. Therefore, after such aggregation, we cannot know which node contributes what to the final aggregated output.
Without modeling such structural information, as shown in (Kondor et al., 2018) and (Xu et al., 2019), the existing MPNNs cannot discriminate between certain non-isomorphic graphs. In those cases, MPNNs may map non-isomorphic graphs to the same feature representations, which is undesirable for graph representation learning. Unlike MPNNs, classical convolutional neural networks (CNNs) avoid this problem by using aggregators (i.e., convolutional filters) with a structural receptive field defined on grids, i.e., a Euclidean space, and are hence able to distinguish each input unit. As shown by our experiments, such structural information often contains clues regarding topology patterns in the graph (e.g., hierarchy), and should be extracted and used to learn more discriminating representations for graph-structured data.
Secondly, the aggregators lack the ability to capture long-range dependencies in disassortative graphs. In MPNNs, the neighborhood is defined as the set of all neighbors one hop away (e.g., GCN), or all neighbors up to a fixed number of hops away (e.g., ChebNet). In other words, only messages from nearby nodes are aggregated. MPNNs with such aggregation are inclined to learn similar representations for proximal nodes in a graph. This implies that they are likely well suited to assortative graphs (e.g., citation networks (Kipf and Welling, 2017) and community networks (Chen et al., 2019)), where node homophily holds (i.e., similar nodes are more likely to be proximal, and vice versa), but may be inappropriate for disassortative graphs (Newman, 2002), where node homophily does not hold. For example, Ribeiro et al. (2017) show disassortative graphs in which nodes of the same class exhibit high structural similarity but are far apart from each other. In such cases, the representation ability of MPNNs may be limited significantly, since they cannot capture the important features from distant but informative nodes.
A straightforward strategy to address this limitation is to use a multi-layered architecture so as to receive “messages” from distant nodes. For instance, due to the localized nature of convolutional filters in classical CNNs, a single convolutional layer is similarly limited in its representational ability. CNNs typically use multiple layers connected in a hierarchical manner to learn complex and global representations. However, unlike CNNs, it is difficult for multi-layer MPNNs to learn good representations for disassortative graphs, for two reasons. On the one hand, relevant messages from distant nodes are mixed indistinguishably with a large number of irrelevant messages from proximal nodes in multi-layer MPNNs, which implies that the relevant information will be “washed out” and cannot be extracted effectively. On the other hand, the representations of different nodes become very similar in multi-layer MPNNs, and every node’s representation actually carries information about the entire graph (Xu et al., 2018).
In this paper, we overcome the aforementioned weaknesses of graph neural networks starting from two basic observations: i) classical neural networks effectively address similar limitations thanks to the stationarity, locality, and compositionality in a continuous space (Bronstein et al., 2017); ii) the notion of network geometry bridges the gap between continuous space and graphs (Hoff et al., 2002; Muscoloni et al., 2017). Network geometry aims to understand networks by revealing the latent continuous space underlying them. It assumes that nodes are sampled discretely from a latent continuous space and that edges are established according to their distance. In the latent space, complicated topology patterns in graphs can be preserved and presented as intuitive geometry, such as subgraph (Narayanan et al., 2016), community (Ni et al., 2019), and hierarchy (Nickel and Kiela, 2017, 2018). Inspired by those two observations, we raise the following question about the aggregation scheme in graph neural networks.

Can the aggregation on a graph benefit from a continuous latent space, such as using geometry in the space to build structural neighborhoods and capture long-range dependencies in the graph?
To answer the above question, we propose a novel aggregation scheme for graph neural networks, termed the geometric aggregation scheme. In the scheme, we map a graph to a continuous latent space via node embedding, and then use the geometric relationships defined in the latent space to build structural neighborhoods for aggregation. We also design a bi-level aggregator that operates on the structural neighborhoods to update the feature representations of nodes in graph neural networks, and that guarantees permutation invariance for graph-structured data. Compared with existing MPNNs, the scheme extracts more structural information of the graph and can aggregate feature representations from distant nodes by mapping them to neighborhoods defined in the latent space.
We then present an implementation of the geometric aggregation scheme in graph convolutional networks, which we call Geom-GCN, to perform transductive learning (node classification) on graphs. We design particular geometric relationships to build the structural neighborhood in Euclidean and hyperbolic embedding spaces, respectively. We choose different embedding methods to map the graph to a suitable latent space for different applications, where suitable topology patterns of the graph are preserved. Finally, we empirically validate and analyze Geom-GCN on a wide range of open graph datasets, on which Geom-GCN achieves state-of-the-art results.
In summary, the contribution of this paper is three-fold: i) we propose a novel geometric aggregation scheme for graph neural networks, which operates in both graph and latent space, to overcome the aforementioned two weaknesses; ii) we present an implementation of the scheme, Geom-GCN, for transductive learning on graphs; iii) we validate and analyze Geom-GCN via extensive comparisons with state-of-the-art methods on several challenging benchmarks.
2 Geometric aggregation scheme
In this section, we start by presenting the geometric aggregation scheme, and then outline its advantages and limitations compared to existing works. As shown in Fig. 1, the aggregation scheme consists of three modules: node embedding (panels A1 and A2), structural neighborhood (panels B1 and B2), and bi-level aggregation (panel C). We elaborate on them in the following.
A. Node embedding. This is a fundamental module which maps the nodes in a graph to a latent continuous space. Let $\mathcal{G} = (V, E)$ be a graph, where each node $v \in V$ has a feature vector $x_v$ and each edge $e \in E$ connects two nodes. Let $f: v \rightarrow z_v$ be a mapping function from a node in the graph to a representation vector. Here, $z_v \in \mathbb{R}^d$ can also be considered as the position of node $v$ in a latent continuous space, and $d$ is the number of dimensions of the space. During the mapping, the structure and properties of the graph are preserved and presented as the geometry in the latent space. For instance, a hierarchical pattern in the graph is presented as the distance to the origin in a hyperbolic embedding space (Nickel and Kiela, 2017). One can employ various embedding methods to infer the latent space (Cai et al., 2018; Wang et al., 2018).
B. Structural neighborhood. Based on the graph and the latent space, we then build a structural neighborhood, $\mathcal{N}(v) = (\{N_g(v), N_s(v)\}, \tau)$, for the next aggregation. The structural neighborhood consists of a set of neighborhoods $\{N_g(v), N_s(v)\}$ and a relational operator on neighborhoods $\tau$.
The neighborhood in the graph, $N_g(v) = \{u \mid u \in V, (u, v) \in E\}$, is the set of adjacent nodes of $v$. The neighborhood in the latent space, $N_s(v) = \{u \mid u \in V, d(z_u, z_v) < \rho\}$, is the set of nodes whose distance to $v$ is less than a pre-given parameter $\rho$. The distance function $d(\cdot, \cdot)$ depends on the particular metric in the space. Compared with $N_g(v)$, $N_s(v)$ may contain nodes that are far from $v$ in the graph but have a certain similarity with $v$, and hence are mapped close to $v$ in the latent space through preserving the similarity. By aggregating on such a neighborhood $N_s(v)$, the long-range dependencies in disassortative graphs can be captured.
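The two neighborhoods can be sketched as follows. This is a minimal illustration that assumes an undirected edge list, a Euclidean metric in the latent space, and the exclusion of a node from its own latent neighborhood; the function names `graph_neighborhood` and `latent_neighborhood` are hypothetical.

```python
import numpy as np

def graph_neighborhood(num_nodes, edges):
    """Neighborhood in the graph: adjacent nodes (edges treated as undirected)."""
    nbrs = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def latent_neighborhood(z, rho):
    """Neighborhood in the latent space: nodes closer than rho (Euclidean)."""
    z = np.asarray(z, dtype=float)
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    nbrs = {}
    for v in range(len(z)):
        # Assumption: a node is not counted as its own latent neighbor.
        nbrs[v] = {int(u) for u in np.flatnonzero(dist[v] < rho) if u != v}
    return nbrs

# Path graph 0-1-2-3; node 3 is three hops from node 0 in the graph,
# but the (made-up) embedding places it next to node 0.
edges = [(0, 1), (1, 2), (2, 3)]
z = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.1, 0.1]])
n_g = graph_neighborhood(4, edges)
n_s = latent_neighborhood(z, rho=0.5)
```

In this toy case the latent neighborhood of node 0 contains node 3 although the two are not adjacent, which is the mechanism by which distant but similar nodes can take part in the aggregation.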
The relational operator $\tau$ is a function defined in the latent space. It takes as input an ordered position pair $(z_v, z_u)$ of nodes $v$ and $u$, and outputs a discrete variable $r$ which indicates the geometric relationship from $v$ to $u$ in the latent space. Formally,
$$
\tau : (z_v, z_u) \rightarrow r \in R,
$$
where $R$ is the set of the geometric relationships. According to the particular latent space and application, $r$ can be specified as an arbitrary geometric relationship of interest. A requirement on $\tau$ is that it should guarantee that each ordered position pair has only one geometric relationship. For example, $\tau$ is illustrated in Fig. 1B by a colorful $3 \times 3$ grid in a 2-dimensional Euclidean space, in which each unit corresponds to a geometric relationship to node $v$.
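C. Bi-level aggregation. With the structural neighborhood $\mathcal{N}(v)$, we propose a novel bi-level aggregation scheme for graph neural networks to update the hidden features of nodes. The bi-level aggregation consists of two aggregation functions and operates in a neural network layer. It can effectively extract the structural information of nodes in neighborhoods while guaranteeing permutation invariance for graphs. Let $h_v^{l}$ be the hidden features of node $v$ at the $l$-th layer, and $h_v^{0} = x_v$ be the node features. The $l$-th layer updates $h_v^{l}$ for every $v \in V$ as follows.
$$
\begin{aligned}
e^{v,l+1}_{(i,r)} &= p\big(\{h_u^{l} \mid u \in N_i(v),\ \tau(z_v, z_u) = r\}\big), \quad \forall i \in \{g, s\},\ \forall r \in R && \text{(low-level aggregation)} \\
m_v^{l+1} &= \mathop{q}_{i \in \{g, s\},\, r \in R}\big(\big(e^{v,l+1}_{(i,r)}, (i, r)\big)\big) && \text{(high-level aggregation)} \\
h_v^{l+1} &= \sigma\big(W_l \cdot m_v^{l+1}\big) && \text{(non-linear transform)}
\end{aligned}
\tag{1}
$$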
In the low-level aggregation, the hidden features of nodes that are in the same neighborhood $i$ and have the same geometric relationship $r$ are aggregated to a virtual node via the aggregation function $p$. The features of the virtual node are $e^{v,l+1}_{(i,r)}$, and the virtual node is indexed by $(i, r)$, which corresponds to the combination of a neighborhood $i$ and a relationship $r$. It is required to adopt a permutation-invariant function for $p$, such as an $L_p$-norm (the choice of $p = 1$, $2$, or $\infty$ results in average, energy, or max pooling). The low-level aggregation is illustrated by dashed arrows in Fig. 1C.
In the high-level aggregation, the features of virtual nodes are further aggregated by function $q$. The inputs of function $q$ contain both the features of virtual nodes $e^{v,l+1}_{(i,r)}$ and the identity of virtual nodes $(i, r)$. That is, $q$ can be a function that takes an ordered object as input, e.g., concatenation, to distinguish the features of different virtual nodes, thereby extracting the structural information in the neighborhoods explicitly. The output of the high-level aggregation is a vector $m_v^{l+1}$. The new hidden features of $v$, $h_v^{l+1}$, are then given by a non-linear transform, wherein $W_l$ is a learnable weight matrix on the $l$-th layer shared by all nodes, and $\sigma(\cdot)$ is a non-linear activation function, e.g., a ReLU.
Permutation invariance is an essential requirement for aggregators in graph neural networks. We therefore prove that the proposed bi-level aggregation, Eq. 1, guarantees invariance under any permutation of nodes. We first give a definition of a permutation-invariant mapping of a graph.
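Definition 1. Let a bijective function $\psi: V \rightarrow V$ be a permutation for nodes, which renames $v \in V$ as $\psi(v) \in V$. Let $V'$ and $E'$ be the node and edge set after a permutation $\psi$, respectively. A mapping of a graph, $\phi(\mathcal{G})$, is permutation-invariant if, given any permutation $\psi$, we have $\phi(\mathcal{G}) = \phi(\mathcal{G}')$, where $\mathcal{G}' = (V', E')$.
Lemma 1. For a composite function $\phi_1 \circ \phi_2(\mathcal{G})$, if $\phi_2(\mathcal{G})$ is permutation-invariant, the entire composite function $\phi_1 \circ \phi_2(\mathcal{G})$ is permutation-invariant.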
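Proof.
Let $\mathcal{G}'$ be an isomorphic graph of $\mathcal{G}$ after a permutation $\psi$, as defined in Definition 1. If $\phi_2(\mathcal{G})$ is permutation-invariant, we have $\phi_2(\mathcal{G}) = \phi_2(\mathcal{G}')$. Therefore, the entire composite function $\phi_1 \circ \phi_2(\mathcal{G})$ is permutation-invariant because $\phi_1 \circ \phi_2(\mathcal{G}) = \phi_1 \circ \phi_2(\mathcal{G}')$. ∎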
Theorem 1. Given a graph $\mathcal{G} = (V, E)$ and its structural neighborhood $\mathcal{N}(v), \forall v \in V$, the bi-level aggregation, Eq. 1, is a permutation-invariant mapping of the graph.
Proof.
The bi-level aggregation, Eq. 1, is a composite function, where the low-level aggregation is the input of the high-level aggregation. Thus, according to Lemma 1, Eq. 1 is permutation-invariant if the low-level aggregation is permutation-invariant.
We then prove that the low-level aggregation is permutation-invariant. The low-level aggregation consists of $2 \times |R|$ sub-aggregations, each of which corresponds to the nodes in a neighborhood $i$ and with a relationship $r$ to $v$. Firstly, the input of each sub-aggregation is permutation-invariant because both $i \in \{g, s\}$ and $r \in R$ are determined by the given structural neighborhood $\mathcal{N}(v), \forall v \in V$, which is constant under any permutation. Secondly, Eq. 1 adopts a permutation-invariant aggregation function $p$ for the sub-aggregations. Thus the low-level aggregation is permutation-invariant. ∎
2.1 Comparisons to related work
We now discuss how the proposed geometric aggregation scheme overcomes the two aforementioned weaknesses, i.e., how it effectively models the structural information and captures the long-range dependencies, in comparison to some closely related works.
To overcome the first weakness of MPNNs, i.e., losing the structural information of nodes in neighborhoods, the proposed scheme explicitly models the structural information by exploiting the geometric relationship between nodes in the latent space and then extracting the information effectively by using the bi-level aggregation. In contrast, several existing works attempt to learn some implicit structure-like information to distinguish different neighbors when aggregating features. For example, GAT (Velickovic et al., 2017), LGCL (Gao et al., 2018), and GGNN (Li et al., 2016) learn weights on “messages” from different neighbors by using attention mechanisms and node and/or edge attributes. CCN (Kondor et al., 2018) utilizes a covariant architecture to learn structure-aware representations. The major difference between these works and ours is that we offer an explicit and interpretable way to model the structural information of nodes in a neighborhood, with the assistance of the geometry in a latent space. We note that our work is orthogonal to existing methods and thus can be readily incorporated to further improve their performance. In particular, we exploit geometric relationships from the aspect of graph topology, while other methods focus on that of feature representation; the two aspects are complementary.
For the second weakness of MPNNs, i.e., lacking the ability to capture long-range dependencies, the proposed scheme models the long-range dependencies in disassortative graphs in two different ways. First, the distant (but similar) nodes in the graph can be mapped into a latent-space-based neighborhood of the target node, and then their useful feature representations can be used for aggregation. This way depends on an appropriate embedding method, which is able to preserve the similarities between the distant nodes and the target node. Second, the structural information enables the method to distinguish different nodes in a graph-based neighborhood (as mentioned above). The informative nodes may have some special geometric relationships to the target node (e.g., a particular angle or distance), so their relevant features will be passed to the target node with much higher weights than those of the uninformative nodes. As a result, the long-range dependencies are captured indirectly through the whole message propagation process in all graph-based neighborhoods. In the literature, a recent method, JK-Nets (Xu et al., 2018), captures the long-range dependencies via skip connections during feature aggregation.
2.1.1 Case study on distinguishing non-isomorphic graphs
In the literature, Kondor et al. (2018) and Xu et al. (2019) construct several non-isomorphic example graphs that cannot be distinguished by the aggregators (e.g., mean and maximum) in existing MPNNs. We present a case study to illustrate how to distinguish the non-isomorphic example graphs once the structural neighborhood is applied. We take two non-isomorphic graphs in (Xu et al., 2019) as an example, where each node has the same feature $a$ and, after any mapping, $f(a)$ remains the same across all nodes, as shown in Fig. 2 (left). Then the aggregator, e.g., mean or maximum, over $f(a)$ remains $f(a)$, and hence the final representations of the nodes are the same. That is, mean and maximum aggregators fail to distinguish the two different graphs.
In contrast, the two graphs become distinguishable once we apply a structural neighborhood in aggregation. With the structural neighborhood, the nodes have different geometric relationships to the center node in the structural neighborhood, as shown in Fig. 2 (right). Taking the aggregation for the center node as an example, we can adopt a different mapping function $f_r$, $r \in R$, for the neighbors with a different geometric relationship $r$ to the center node. Then, the aggregators in the two graphs have different inputs: two differently mapped features in the left graph and three in the right graph. Finally, the aggregator (mean or maximum) outputs different representations for the center node in the two graphs, thereby distinguishing the topological difference between the two graphs.
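The case study can be mimicked numerically. The sketch below is only an illustration under stated assumptions: the center node has two neighbors in one graph and three in the other, every node carries the same feature `a`, and the relationship-specific mappings `f(r, x)` and the relationship indices are made up for the example rather than taken from Fig. 2.

```python
import numpy as np

a = np.array([1.0, 2.0])  # the feature shared by every node

def mean_agg(messages):
    """Permutation-invariant mean over a collection of messages."""
    return np.mean(messages, axis=0)

# Plain aggregation: identical messages, so the neighbor count is invisible.
plain_left = mean_agg([a, a])
plain_right = mean_agg([a, a, a])

def f(r, x):
    """Hypothetical relationship-specific mapping: scale by (r + 1)."""
    return (r + 1) * x

# Hypothetical relationship indices of the neighbors of the center node.
left_relationships = [0, 1]
right_relationships = [0, 2, 3]

struct_left = mean_agg([f(r, a) for r in left_relationships])
struct_right = mean_agg([f(r, a) for r in right_relationships])
```

Here `plain_left` and `plain_right` are identical, whereas `struct_left` and `struct_right` differ, because messages are mapped according to their geometric relationship before being averaged.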
3 Geom-GCN: An implementation of the scheme
In this section, we present Geom-GCN, a specific implementation of the geometric aggregation scheme in graph convolutional networks, to perform transductive learning on graphs. To implement the general aggregation scheme, one needs to specify its three modules: node embedding, structural neighborhood, and bi-level aggregation function.
Node embedding is the fundamental module. As shown in our experiments, a common embedding method which only preserves the connection and distance pattern in a graph can already benefit the aggregation. For particular applications, one can specify embedding methods to create suitable latent spaces where particular topology patterns (e.g., hierarchy) are preserved. We employ three embedding methods, Isomap (Tenenbaum et al., 2000), Poincare embedding (Nickel and Kiela, 2017), and struc2vec (Ribeiro et al., 2017), which result in three Geom-GCN variants: Geom-GCN-I, Geom-GCN-P, and Geom-GCN-S. Isomap is a widely used isometry embedding method, by which distance patterns (lengths of shortest paths) are preserved explicitly in the latent space. Poincare embedding and struc2vec can create particular latent spaces that preserve hierarchies and local structures in a graph, respectively. We use an embedding space of dimension 2 for ease of explanation.
The structural neighborhood $\mathcal{N}(v) = (\{N_g(v), N_s(v)\}, \tau)$ of node $v$ includes its neighborhoods in both the graph and the latent space. The neighborhood-in-graph $N_g(v)$ consists of the set of $v$’s adjacent nodes in the graph, and the neighborhood-in-latent-space $N_s(v)$ consists of those nodes whose distances to $v$ are less than a parameter $\rho$ in the latent space. We determine $\rho$ by increasing $\rho$ from zero until the average cardinality of $N_s(v)$ equals that of $N_g(v)$, $\forall v \in V$, i.e., when the average neighborhood sizes in the graph and latent spaces are the same. We use Euclidean distance in the Euclidean space. In the hyperbolic space, we approximate the geodesic distance between two nodes via their Euclidean distance in the local tangent plane.
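One way to read the procedure for choosing the radius is as a simple sweep. The sketch below is an assumption-laden illustration: it uses Euclidean distance, a fixed step size, and stops at the first radius whose average latent neighborhood size reaches the target (an exact match is not guaranteed on a discrete grid). The name `choose_rho` and the `step` parameter are hypothetical, and at least two nodes are assumed.

```python
import numpy as np

def choose_rho(z, target_avg_size, step=0.01):
    """Increase rho from zero until the average latent neighborhood size
    reaches target_avg_size (the average neighborhood size in the graph)."""
    z = np.asarray(z, dtype=float)
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # assumption: a node is not its own neighbor
    max_dist = dist[np.isfinite(dist)].max()
    n_steps = int(np.ceil(max_dist / step)) + 1
    for k in range(n_steps + 1):
        rho = k * step
        avg_size = (dist < rho).sum(axis=1).mean()
        if avg_size >= target_avg_size:
            return rho
    # Target not reachable: return the largest radius that was tried.
    return n_steps * step

# Four nodes on a line, one unit apart; a path graph on them has average degree 1.5.
z = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
rho = choose_rho(z, target_avg_size=1.5)
```

For this toy embedding the sweep stops just above a radius of one unit, at which point each node's latent neighborhood coincides with its neighbors on the path.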
Here we simply implement the geometric operator $\tau$ as four relationships of the relative positions between two nodes in a 2-D Euclidean or hyperbolic space. Specifically, the relationship set is $R = \{$upper left, upper right, lower left, lower right$\}$, and $\tau(z_v, z_u)$ is given by Table 1. Note that we adopt the rectangular coordinate system in the Euclidean space and the angular coordinate system in the hyperbolic space. In this way, the relationship “upper” indicates a node nearer to the origin, which thus lies at a higher level in a hierarchical graph. One can design a more sophisticated operator $\tau$, such as borrowing the structure of descriptors in manifold geometry (Kokkinos et al., 2012; Monti et al., 2017), thereby preserving more and richer structural information in the neighborhood.
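Table 1: The relationship operator $\tau$.
$\tau(z_v, z_u)$  $z_v[0] > z_u[0]$  $z_v[0] \leq z_u[0]$
$z_v[1] \leq z_u[1]$  upper left  upper right
$z_v[1] > z_u[1]$  lower left  lower right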
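A four-way relational operator of this kind can be sketched in a few lines. The boundary conventions below (strict versus non-strict comparisons on each coordinate) are an assumption chosen so that every ordered pair of positions receives exactly one relationship; the helper `bucket_neighbors` is hypothetical and only illustrates how neighbors would be grouped before aggregation.

```python
RELATIONS = ("upper left", "upper right", "lower left", "lower right")

def tau(z_v, z_u):
    """Geometric relationship from node v to node u in a 2-D latent space."""
    # Assumed convention: u is "left" if its first coordinate is smaller than v's,
    # and "upper" if its second coordinate is at least v's.
    horizontal = "left" if z_v[0] > z_u[0] else "right"
    vertical = "upper" if z_v[1] <= z_u[1] else "lower"
    return f"{vertical} {horizontal}"

def bucket_neighbors(v, neighbors, z):
    """Group the neighbors of v by their relationship to v."""
    buckets = {r: [] for r in RELATIONS}
    for u in neighbors:
        buckets[tau(z[v], z[u])].append(u)
    return buckets

z = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (-1.0, -1.0), 3: (-1.0, 1.0)}
buckets = bucket_neighbors(0, [1, 2, 3], z)
```

Each neighbor lands in exactly one bucket, which is the requirement placed on the operator: one geometric relationship per ordered position pair.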
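Finally, to implement the bi-level aggregation, we adopt the same summation of normalized hidden features as GCN (Kipf and Welling, 2017) as the aggregation function $p$ in the low-level aggregation,
$$
e^{v,l+1}_{(i,r)} = \sum_{u \in N_i(v)} \delta\big(\tau(z_v, z_u), r\big)\,\big(\deg(v)\deg(u)\big)^{-\frac{1}{2}}\, h_u^{l}, \quad \forall i \in \{g, s\},\ \forall r \in R,
$$
where $\deg(v)$ is the degree of node $v$ in the graph, and $\delta(\cdot, \cdot)$ is a Kronecker delta function that only allows the nodes with relationship $r$ to $v$ to be included. The features of all virtual nodes $e^{v,l+1}_{(i,r)}$ are further aggregated in the high-level aggregation. The aggregation function $q$ is a concatenation $\Vert$ for all layers except the final layer, which uses mean for its aggregation function. Then, the overall bi-level aggregation of Geom-GCN is given by
$$
h_v^{l+1} = \sigma\Big(W_l \cdot \big\Vert_{i \in \{g, s\}} \big\Vert_{r \in R}\; e^{v,l+1}_{(i,r)}\Big),
$$
where we use ReLU as the non-linear activation function $\sigma(\cdot)$ and $W_l$ is the weight matrix to be estimated by backpropagation.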
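The layer described above can be sketched in NumPy as follows. This is an illustrative forward pass under several assumptions, not the authors' implementation: the relational operator uses an assumed tie-breaking convention, degrees are clamped to at least one to avoid division by zero, features are row vectors (so the transform is written `m @ weight`), and the names `geom_gcn_layer`, `tau`, and the toy inputs are hypothetical.

```python
import numpy as np

RELATIONS = ("upper left", "upper right", "lower left", "lower right")

def tau(z_v, z_u):
    """Assumed four-way relational operator in a 2-D latent space."""
    vertical = "upper" if z_v[1] <= z_u[1] else "lower"
    horizontal = "left" if z_v[0] > z_u[0] else "right"
    return f"{vertical} {horizontal}"

def geom_gcn_layer(h, z, neighborhoods, degree, weight, final=False):
    """Bi-level aggregation followed by a ReLU transform.

    h:             (n, d_in) hidden features.
    z:             (n, 2) latent positions.
    neighborhoods: {"g": {v: [u, ...]}, "s": {v: [u, ...]}}.
    degree:        (n,) node degrees in the graph.
    weight:        (8 * d_in, d_out), or (d_in, d_out) when final=True.
    """
    n, d_in = h.shape
    deg = np.maximum(np.asarray(degree, dtype=float), 1.0)  # assumption: clamp
    virtual_ids = [(i, r) for i in ("g", "s") for r in RELATIONS]
    e = np.zeros((n, len(virtual_ids), d_in))
    for v in range(n):
        for k, (i, r) in enumerate(virtual_ids):
            # Low level: normalized sum over neighbors with relationship r.
            for u in neighborhoods[i].get(v, ()):
                if tau(z[v], z[u]) == r:
                    e[v, k] += h[u] / np.sqrt(deg[v] * deg[u])
    # High level: concatenate virtual nodes, or average them in the final layer.
    m = e.mean(axis=1) if final else e.reshape(n, -1)
    return np.maximum(m @ weight, 0.0)  # ReLU

h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
neighborhoods = {
    "g": {0: [1, 2], 1: [0], 2: [0]},
    "s": {0: [2], 1: [], 2: [0]},
}
degree = np.array([2, 1, 1])
rng = np.random.default_rng(0)
hidden = geom_gcn_layer(h, z, neighborhoods, degree, rng.normal(size=(16, 4)))
output = geom_gcn_layer(hidden, z, neighborhoods, degree,
                        rng.normal(size=(4, 3)), final=True)
```

The concatenation keeps the eight virtual nodes in a fixed order, so the weight matrix can treat each (neighborhood, relationship) slot differently, while the order in which neighbors are listed within a slot has no effect on the result.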
4 Experiments
We validate Geom-GCN by comparing its performance with that of Graph Convolutional Networks (GCN) (Kipf and Welling, 2017) and Graph Attention Networks (GAT) (Velickovic et al., 2017), two state-of-the-art graph neural networks, on transductive node-label classification tasks on a wide variety of open graph datasets.
4.1 Datasets
We utilize nine open graph datasets to validate the proposed Geom-GCN. A summary of the characteristics of the datasets is given in Table 2.
Dataset  Cora  Cite.  Pubm.  Cham.  Squi.  Actor  Corn.  Texa.  Wisc. 

# Nodes  2708  3327  19717  2277  5201  7600  183  183  251 
# Edges  5429  4732  44338  36101  217073  33544  295  309  499 
# Features  1433  3703  500  2325  2089  931  1703  1703  1703 
# Classes  7  6  3  5  5  5  5  5  5 
Citation networks. Cora, Citeseer, and Pubmed are standard citation network benchmark datasets (Sen et al., 2008; Namata et al., 2012). In these networks, nodes represent papers, and edges denote citations of one paper by another. Node features are the bag-of-words representation of papers, and the node label is the academic topic of a paper.
WebKB. WebKB (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb) is a webpage dataset collected from computer science departments of various universities by Carnegie Mellon University. We use three of its sub-datasets, Cornell, Texas, and Wisconsin, where nodes represent web pages, and edges are hyperlinks between them. Node features are the bag-of-words representation of web pages. The web pages are manually classified into five categories: student, project, course, staff, and faculty.
Actor co-occurrence network. This dataset is the actor-only induced subgraph of the film-director-actor-writer network (Tang et al., 2009). Each node corresponds to an actor, and an edge between two nodes denotes co-occurrence on the same Wikipedia page. Node features correspond to some keywords in the Wikipedia pages. We classify the nodes into five categories in terms of the words on the actor’s Wikipedia page.
Wikipedia network. Chameleon and Squirrel are two page-page networks on specific topics in Wikipedia (Rozemberczki et al., 2019). In those datasets, nodes represent web pages and edges are mutual links between pages. Node features correspond to several informative nouns in the Wikipedia pages. We classify the nodes into five categories in terms of the average monthly traffic of the web page.
4.2 Experimental setup
As mentioned in Section 3, we construct three Geom-GCN variants by using three embedding methods: Isomap (Geom-GCN-I), Poincare (Geom-GCN-P), and struc2vec (Geom-GCN-S). We set the dimension of the embedding space to two, use the relationship operator defined in Table 1, and apply mean and concatenation as the low-level and high-level aggregation functions, respectively.
With the structural neighborhood, we perform a hyperparameter search for all models on the validation set. For fairness, the size of the search space is the same for each method. The searched hyperparameters include the number of hidden units, initial learning rate, weight decay, and dropout. We fix the number of layers to 2 and use the Adam optimizer (Kingma and Ba, 2014) for all models. We use ReLU as the activation function for Geom-GCN and GCN, and ELU for GAT.
The final hyperparameter setting is dropout of , initial learning rate of , patience of 100 epochs, weight decay of  (WebKB datasets) or  (all other datasets). In GCN, the number of hidden units is (Cora), (Citeseer), (Pubmed), (WebKB), (Wikipedia), and (Actor). In Geom-GCN, the number of hidden units is 8 times the number in GCN, since Geom-GCN has 8 virtual nodes (two neighborhoods times four relationships). For each attention head in GAT, the number of hidden units is (Citation networks), (WebKB), (Wikipedia), and (Actor). GAT has attention heads in layer one and (Pubmed) or (all other datasets) attention heads in layer two.
For all graph datasets, we randomly split nodes of each class into , , and for training, validation and testing. With the hyperparameter setting, we report the average performance of all models on the test sets over random splits.
4.3 Results and analysis
Results are summarized in Table 3. The reported numbers denote the mean classification accuracy in percent. In general, Geom-GCN achieves state-of-the-art performance. The best performing method is highlighted. The results show that the Isomap embedding (Geom-GCN-I), which only preserves the connection and distance pattern in the graph, can already benefit the aggregation. We can also specify an embedding method to create a suitable latent space for a particular application (e.g., a disassortative graph or hierarchical graph), by which a significant performance improvement is achieved (e.g., Geom-GCN-P).
Table 3: Mean classification accuracy (%).
Dataset     Cora   Cite.  Pubm.  Cham.  Squi.  Actor  Corn.  Texa.  Wisc.
GCN         85.77  73.68  88.13  28.18  23.96  26.86  52.70  52.16  45.88
GAT         86.37  74.32  87.62  42.93  30.03  28.45  54.32  58.38  49.41
Geom-GCN-I  85.19  77.99  90.05  60.31  33.32  29.09  56.76  57.58  58.24
Geom-GCN-P  84.93  75.14  88.09  60.90  38.14  31.63  60.81  67.57  64.12
Geom-GCN-S  85.27  74.71  84.75  59.96  36.24  30.30  55.68  59.73  56.67
4.3.1 Ablation study on contributions from two neighborhoods
The proposed Geom-GCN aggregates “messages” from two neighborhoods, which are defined in the graph and the latent space, respectively. In this section, we present an ablation study to evaluate the contribution of each neighborhood by constructing new Geom-GCN variants with only one neighborhood. For the variants with only the neighborhood in the graph, we use “g” as a suffix of their name (e.g., Geom-GCN-I-g), and we use the suffix “s” to denote the variants with only the neighborhood in the latent space (e.g., Geom-GCN-I-s). Here we set GCN as a baseline so that the contribution can be measured via the performance improvement over GCN. The results are summarized in Table 4, where an improvement is denoted by an up arrow ↑ and a decrease by a down arrow ↓. The best performing method is highlighted.
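We also design an index, denoted by $\beta$, to measure the homophily in a graph,
$$
\beta = \frac{1}{|V|} \sum_{v \in V} \frac{\text{number of } v\text{'s neighbors that have the same label as } v}{\text{number of } v\text{'s neighbors}}.
$$
A large $\beta$ value implies that the homophily, in terms of node label, is strong in a graph, i.e., similar nodes tend to connect together. From Table 4, one can see that assortative graphs (e.g., citation networks) have a much larger $\beta$ than disassortative graphs (e.g., WebKB networks).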
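A label-based homophily index of this kind can be computed as in the sketch below, which assumes the index is the average, over nodes, of the fraction of a node's neighbors that share its label. Skipping isolated nodes (whose fraction is undefined) is an additional assumption, and the function name `homophily_index` is hypothetical.

```python
def homophily_index(neighbors, labels):
    """Average fraction of same-label neighbors per node.

    neighbors: dict mapping a node to an iterable of adjacent nodes.
    labels:    sequence or dict giving the label of each node.
    """
    ratios = []
    for v, nbrs in neighbors.items():
        nbrs = list(nbrs)
        if not nbrs:
            continue  # assumption: isolated nodes are left out of the average
        same = sum(labels[u] == labels[v] for u in nbrs)
        ratios.append(same / len(nbrs))
    return sum(ratios) / len(ratios) if ratios else 0.0

# Path graph 0-1-2-3 with two classes split in the middle.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = [0, 0, 1, 1]
beta = homophily_index(neighbors, labels)
```

In the toy graph the end nodes only see their own class and the middle nodes see one neighbor of each class, so the index lies between the fully assortative and fully disassortative extremes.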
Table 4 exhibits three interesting patterns: i) neighborhoods in the graph and the latent space both benefit the aggregation in most cases; ii) neighborhoods in the latent space make larger contributions in disassortative graphs (with a small $\beta$) than in assortative ones, which implies that relevant information from disconnected nodes is captured effectively by the neighborhoods in the latent space; iii) to our surprise, several variants with only one neighborhood (in Table 4) achieve better performance than the variants with two neighborhoods (in Table 3). We think the reason is that Geom-GCN with two neighborhoods aggregates more irrelevant “messages” than Geom-GCN with only one neighborhood, and the irrelevant “messages” adversely affect the performance. Thus, we believe an attention mechanism can alleviate this issue, which we will study as future work.
Table 4: Accuracy (%) of Geom-GCN variants with only one neighborhood; the row below each variant gives the change relative to GCN (↑ improvement, ↓ decrease).
Dataset       Cora    Cite.   Pubm.   Cham.   Squi.   Actor   Corn.   Texa.   Wisc.
β             0.83    0.71    0.79    0.25    0.22    0.24    0.11    0.06    0.16
Geom-GCN-I-g  86.26   80.64   90.72   68.00   46.01   31.96   65.40   72.51   68.23
              ↑0.48   ↑6.96   ↑2.59   ↑39.82  ↑22.05  ↑4.04   ↑12.70  ↑21.35  ↑22.35
Geom-GCN-I-s  77.34   72.22   85.02   61.64   37.98   30.59   62.16   60.54   64.90
              ↓8.34   ↓1.46   ↓3.11   ↑33.46  ↑14.02  ↑2.67   ↑9.46   ↑8.38   ↑19.01
Geom-GCN-P-g  86.30   75.45   88.40   63.07   38.41   31.55   64.05   73.05   69.41
              ↑0.52   ↑1.76   ↑0.27   ↑34.89  ↑14.45  ↑3.63   ↑11.35  ↑21.89  ↑23.53
Geom-GCN-P-s  73.14   71.65   86.95   43.20   30.47   34.59   75.40   73.51   80.39
              ↓12.63  ↓2.04   ↓1.18   ↑15.02  ↑6.51   ↑6.67   ↑22.70  ↑21.35  ↑34.51
Geom-GCN-S-g  87.00   75.73   88.44   67.04   44.92   31.27   67.02   71.62   69.41
              ↑1.23   ↑2.04   ↑0.31   ↑38.86  ↑20.96  ↑3.35   ↑14.32  ↑19.46  ↑23.52
Geom-GCN-S-s  66.92   66.03   79.41   49.21   31.27   30.32   62.43   63.24   64.51
              ↓18.85  ↓7.65   ↓8.72   ↑21.03  ↑7.31   ↑2.40   ↑9.73   ↑11.08  ↑18.63
4.3.2 Analysis of embedding space combination
The structural neighborhood in Geom-GCN is very flexible, in that one can combine arbitrary embedding spaces. To study which combinations of embedding spaces are desirable, we construct new Geom-GCN variants by adopting neighborhoods built from different embedding spaces. For the variant that adopts the Isomap and Poincare embedding spaces to build the neighborhood in the graph and in the latent space, respectively, we use the name Geom-GCN-IP. The naming rule is the same for other combinations. The performance of all variants is summarized in Table 5. One can observe that several combinations achieve better performance than Geom-GCN with neighborhoods built from only one embedding space (in Table 3), while many other combinations perform poorly. Thus, we think it is significant future work to design an end-to-end framework that can automatically determine the right embedding spaces for Geom-GCN.
Table 5: Mean classification accuracy (%) of Geom-GCN variants that combine different embedding spaces.
Dataset      Cora   Cite.  Pubm.  Cham.  Squi.  Actor  Corn.  Texa.  Wisc.
Geom-GCN-IP  85.13  79.41  90.49  65.77  45.49  31.94  60.00  66.49  62.75
Geom-GCN-PI  85.09  75.08  85.64  59.19  32.65  29.16  58.11  58.11  58.63
Geom-GCN-IS  84.51  77.83  88.66  58.40  35.29  29.41  54.32  57.57  57.65
Geom-GCN-SI  85.31  75.50  85.52  62.13  32.57  28.97  57.30  60.00  55.10
Geom-GCN-PS  85.65  74.84  84.96  56.34  28.27  29.53  58.11  62.43  60.59
Geom-GCN-SP  85.43  75.71  88.00  65.81  44.53  31.16  58.38  67.84  65.10
4.3.3 Analysis of time complexity
Time complexity is very important for graph neural networks because real-world graphs are often very large. In this subsection, we first present the theoretical time complexity of Geom-GCN and then compare the real running time of GCN, GAT, and Geom-GCN.
To update the representation of one node, the time complexity of Geom-GCN is $O(n \times m \times 2|R|)$, where $n$ is the size of the input representations, $m$ is the number of hidden units in the non-linear transform for each virtual node (i.e., $(i, r)$), and $2|R|$ is the number of virtual nodes. Geom-GCN thus has $2|R|$ times the complexity of GCN, whose time complexity is $O(n \times m)$.
We also compare the actual running time (500 epochs) of GCN, GAT, and GeomGCN on all datasets with the hyperparameters described in Section 4.2. Results are shown in Fig. 3 (a). GCN is the fastest, while GAT and GeomGCN are on the same level. Important future work is to develop acceleration techniques that address the scalability of GeomGCN.
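Figure 3: (a) Running time of GCN, GAT, and GeomGCN; (b) t-SNE visualization of the feature representations learned by GeomGCNP on Cora.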
4.3.4 Visualization
To study what patterns GeomGCN learns in the node feature representations, we visualize the feature representations extracted by the last layer of GeomGCNP on the Cora dataset by mapping them into a 2D space through t-SNE (Maaten and Hinton, 2008), as shown in Fig. 3 (b). In the figure, nodes with the same label exhibit spatial clustering, which suggests the discriminative power of GeomGCN. The radial distribution of the nodes in the figure indicates that the proposed model learns the graph’s hierarchy through the Poincare embedding.
4.4 Conclusion and future work
We tackle the two major weaknesses of existing message-passing neural networks over graphs: the loss of discriminative structures and the loss of long-range dependencies. As our key insight, we bridge a discrete graph to a continuous geometric space via graph embedding. That is, we exploit the principle of convolution, namely spatial aggregation over a meaningful space, and our approach thus extracts or “recovers” the lost information (discriminative structures and long-range dependencies) in an embedding space from a graph. We proposed a general geometric aggregation scheme and instantiated it with several specific GeomGCN implementations, and our experiments validated clear advantages over the state of the art. As future work, we will explore techniques for choosing the right embedding method, depending not only on the input graphs but also on the target applications, such as epidemic dynamic prediction on social contact networks (Yang et al., 2017; Pei et al., 2018).
Acknowledgments
We thank the reviewers for their valuable feedback. This work was supported in part by National Natural Science Foundation of China under grants 61876069, 61572226, and 61902145, National Science Foundation IIS 1619302 and IIS 1633755, Jilin Province Key Scientific and Technological Research and Development project under grants 20180201067GX and 20180201044GX, University science and technology research plan project of Jilin Province under grant JJKH20190156KJ, Zhejiang University ZJU Research 083650, Futurewei Technologies HF2017060011 and 094013, UIUC OVCR CCIL Planning Grant 434S34, UIUC CSBS Small Grant 434C8U, Advanced Digital Sciences Center Faculty Grant, and China Scholarship Council under scholarship 201806170202. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.
References

 Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
 A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637. Cited by: §2.
 Supervised community detection with line graph neural networks. In International Conference on Learning Representations (ICLR), Cited by: §1.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems (NeurIPS), pp. 3844–3852. Cited by: §1.
 Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1416–1424. Cited by: §2.1.
 Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), pp. 1263–1272. Cited by: §1.
 Latent space approaches to social network analysis. Journal of the American Statistical Association 97 (460), pp. 1090–1098. Cited by: §1.
 Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §4.2.
 Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §1, §3, §4.
 Intrinsic shape context descriptors for deformable shapes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 159–166. Cited by: §3.
 Covariant compositional networks for learning graphs. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.1, §2.1.
 Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.
 Visualizing data using t-SNE. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §4.3.4.
 Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §3.
 Machine learning meets complex networks via coalescent embedding in the hyperbolic space. Nature communications 8 (1), pp. 1615. Cited by: §1.
 Query-driven active surveying for collective classification. In International Workshop on Mining and Learning with Graphs, Cited by: §4.1.
 Subgraph2vec: learning distributed representations of rooted subgraphs from large graphs. CoRR abs/1606.08928. Cited by: §1.
 Assortative mixing in networks. Physical review letters 89 (20), pp. 208701. Cited by: §1.
 Community detection on networks with ricci flow. Scientific reports 9 (1), pp. 9984. Cited by: §1.
 Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6338–6347. Cited by: §1, §2, §3.
 Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning (ICML), pp. 3776–3785. Cited by: §1.
 Group sparse bayesian learning for active surveillance on epidemic dynamics. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 800–807. Cited by: §4.4.
 struc2vec: learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394. Cited by: §1, §3.
 Multiscale attributed node embedding. arXiv:1909.13021. Cited by: §4.1.
 The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1.
 Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
 Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 807–816. Cited by: §4.1.
 A global geometric framework for nonlinear dimensionality reduction. science 290 (5500), pp. 2319–2323. Cited by: §3.
 Graph attention networks. CoRR abs/1710.10903. Cited by: §2.1, §4.
 A united approach to learning sparse attributed network embedding. In IEEE International Conference on Data Mining, pp. 557–566. Cited by: §2.
 MCNE: an end-to-end framework for learning multiple conditional network representations of social network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1064–1072. Cited by: §1.
 How powerful are graph neural networks?. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.1.
 Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning (ICML), pp. 5449–5458. Cited by: §1, §2.1.
 Characterizing and discovering spatiotemporal social contact patterns for healthcare. IEEE Trans. Pattern Anal. Mach. Intell. 39 (8), pp. 1532–1546. Cited by: §4.4.