
Hyperbolic Graph Attention Network

Graph neural networks (GNNs) have shown superior performance on graph data and have attracted considerable research attention recently. However, most existing GNN models are designed primarily for graphs in Euclidean spaces. Recent research has shown that graph data exhibit a non-Euclidean latent anatomy. Unfortunately, GNNs have rarely been studied in non-Euclidean settings so far. To bridge this gap, in this paper we make a first attempt to study GNNs with an attention mechanism in hyperbolic spaces. Research on hyperbolic GNNs faces some unique challenges: since hyperbolic spaces are not vector spaces, vector operations (e.g., vector addition, subtraction, and scalar multiplication) cannot be carried out. To tackle this problem, we employ gyrovector spaces, which provide an elegant algebraic formalism for hyperbolic geometry, to transform the features in a graph; we then propose a hyperbolic-proximity-based attention mechanism to aggregate the features. Moreover, as mathematical operations in hyperbolic spaces can be more complicated than those in Euclidean spaces, we further devise a novel acceleration strategy using logarithmic and exponential mappings to improve the efficiency of our proposed model. Comprehensive experimental results on four real-world datasets demonstrate the performance of our proposed hyperbolic graph attention network model in comparison with other state-of-the-art baseline methods.




Introduction

Real-world data often come with a graph structure, as in social networks, citation networks, and biological networks. The graph neural network (GNN) [Gori, Monfardini, and Scarselli2005, Scarselli et al.2009], a powerful deep representation learning method for such graph data, has shown superior performance on network analysis and aroused considerable research interest. Many studies have used neural networks to handle graph data. For example, [Gori, Monfardini, and Scarselli2005, Scarselli et al.2009] leveraged deep neural networks to learn node representations based on node features and the graph structure; [Defferrard, Bresson, and Vandergheynst2016, Kipf and Welling2017, Hamilton, Ying, and Leskovec2017] proposed graph convolutional networks by generalizing the convolution operation to graphs; [Veličković et al.2018] designed a novel convolution-style graph neural network by employing the attention mechanism in GNNs. These GNNs have been widely used to solve many real-world problems, such as recommendation [Ying et al.2018, Song et al.2019] and disease prediction [Parisot et al.2017].

Essentially, most existing GNN models are designed primarily for graphs in Euclidean spaces. The main reason is that Euclidean space is the natural generalization of our intuition-friendly, visible three-dimensional space. However, some researchers have discovered that graph data exhibit a non-Euclidean latent anatomy [Wilson et al.2014, Bronstein et al.2017]. In such cases, Euclidean spaces may not provide the most powerful or meaningful geometry for graph representation learning. Moreover, some recent works [Krioukov et al.2010, Nickel and Kiela2017] have demonstrated that hyperbolic spaces could be the latent spaces of graph data, as hyperbolic space naturally reflects some properties of graphs, e.g., hierarchical and power-law structure [Krioukov et al.2010, Muscoloni et al.2017]. Inspired by this insight, the study of graph data in hyperbolic spaces has received increasing attention, for example in hyperbolic graph embedding [Nickel and Kiela2017, Nickel and Kiela2018, Sala et al.2018, Wang, Zhang, and Shi2019].

One key property of hyperbolic spaces is that they expand faster than Euclidean spaces: Euclidean spaces expand polynomially, while hyperbolic spaces expand exponentially. For instance, each tile in Fig. 1(a) has equal area in hyperbolic space, but the tiles shrink towards zero in Euclidean space as they approach the boundary. Because the available room grows exponentially, there is sufficient space to embed all of these tiles; they appear shrunken only because the diagram is drawn in Euclidean space for visualization. With these properties, hyperbolic spaces can be thought of as “continuous trees”. As shown in Fig. 1(b), considering a tree with branching factor $b$, the numbers of nodes at level $\ell$ and at no more than $\ell$ hops from the root are $(b+1)b^{\ell-1}$ and $((b+1)b^{\ell}-2)/(b-1)$, respectively. The number of nodes grows exponentially with the distance to the root of the tree, which is similar to hyperbolic spaces, as they also expand exponentially. Therefore, there is a strong correlation between tree-like graphs and hyperbolic spaces [Krioukov et al.2010, Nickel and Kiela2017]. Because of this property, hyperbolic spaces have recently been used to model complex networks [Krioukov et al.2010, Muscoloni et al.2017]. These studies find that graphs with hierarchical structure and power-law degree distribution are well suited to being modeled in hyperbolic spaces. Meanwhile, graph data with these properties are widespread, including social networks, network community structures, citation networks, and biological networks [Clauset, Shalizi, and Newman2009, Krioukov et al.2010], which motivates us to study GNNs in hyperbolic spaces.

(a) “Circle Limit 1”, by M.C Escher
(b) A tree with branching factor 3
Figure 1: Two examples of hyperbolic spaces (Poincaré disk model).
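
The exponential growth of a tree can be checked numerically. The following is a minimal sketch, assuming a tree in which every node has $b + 1$ neighbours (the root has $b + 1$ children and every other node has $b$ children); the function names are illustrative and do not come from the paper.

```python
def nodes_at_level(b, level):
    """Number of nodes exactly `level` hops from the root.

    Assumption: every node has b + 1 neighbours, so the root has b + 1
    children and each other node has b children.
    """
    if level == 0:
        return 1
    return (b + 1) * b ** (level - 1)


def nodes_within(b, level):
    """Number of nodes at most `level` hops from the root (root included)."""
    return sum(nodes_at_level(b, k) for k in range(level + 1))


if __name__ == "__main__":
    b = 3
    for level in range(1, 7):
        # Both counts are multiplied by roughly b at every extra level.
        print(level, nodes_at_level(b, level), nodes_within(b, level))
```

Under this assumption the cumulative count agrees with the closed form $((b+1)b^{\ell}-2)/(b-1)$, so both quantities grow like $b^{\ell}$.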

Despite the powerful ability of hyperbolic spaces to model graph data, there are two key challenges in designing a GNN in hyperbolic spaces: (1) A GNN involves many different procedures, e.g., the projection step, the attention mechanism, and the propagation step. However, unlike Euclidean spaces, hyperbolic spaces are not vector spaces, so vector operations (such as vector addition and subtraction) cannot be carried out in hyperbolic spaces. How can we effectively and elegantly implement those GNN procedures in hyperbolic spaces? (2) As hyperbolic spaces have constant negative curvature, mathematical operations in hyperbolic spaces can be more complex than those in Euclidean spaces. Some basic properties of mathematical operations, such as the commutativity or associativity of “vector addition”, no longer hold in hyperbolic spaces. How can we ensure the learning efficiency of the proposed model?

To address the above challenges, in this paper we propose a novel Hyperbolic graph ATtention network (denoted as HAT). Specifically, we use the framework of gyrovector spaces to build the graph attentional layer in hyperbolic spaces. Gyrovector spaces are a mathematical concept proposed by Ungar [Ungar2001, Ungar2008] for studying hyperbolic geometry in analogy with vector spaces. In other words, just as vector spaces form the algebraic formalism for Euclidean geometry, the framework of gyrovector spaces provides an elegant algebraic formalism for hyperbolic geometry. Therefore, we use gyrovector operations in hyperbolic spaces to transform the features of the graph and exploit proximity in hyperbolic spaces to model the attention mechanism. To improve the learning efficiency, we further propose a method based on the logarithmic and exponential maps to accelerate our model while preserving the character of hyperbolic spaces. The major contributions of this work can be summarized as follows:

  • To the best of our knowledge, we are the first to study the graph attention network in hyperbolic spaces, which has the potential to learn better graph representations.

  • We propose a novel graph attention network in hyperbolic spaces, named HAT. We employ the framework of gyrovector spaces to implement graph processing in hyperbolic spaces and design an attention mechanism based on hyperbolic proximity.

  • We design a method that accelerates our model while preserving the properties of hyperbolic spaces by using the logarithmic map and the exponential map, which ensures the efficiency of our proposed HAT model.

  • We conduct extensive experiments to evaluate the performance of HAT on four datasets. The results show the superiority of HAT in node classification and node clustering tasks compared with state-of-the-art methods.

Related Work

Graph Neural Network

GNNs aim to extend deep neural networks to arbitrary graph-structured data [Gori, Monfardini, and Scarselli2005, Scarselli et al.2009]. Recently, there has been a surge of work on generalizing convolutions to the graph domain. [Defferrard, Bresson, and Vandergheynst2016] utilized K-order Chebyshev polynomials to approximate smooth filters in the spectral domain. [Kipf and Welling2017] leveraged a localized first-order approximation of spectral graph convolutions to learn node representations. [Veličković et al.2018] incorporated the attention mechanism into the propagation step of GNNs. In sum, all of these GNNs model graphs in Euclidean spaces.

Representation Learning in Hyperbolic Spaces

Recently, representation learning in hyperbolic spaces has received increasing attention. Specifically, [Nickel and Kiela2017, Nickel and Kiela2018] focused on learning the hierarchical representation of a graph. [Ganea, Bécigneul, and Hofmann2018a] embedded directed acyclic graphs into hyperbolic spaces to learn their feature representations. Besides, some researchers have begun to study deep learning in hyperbolic spaces. [Ganea, Bécigneul, and Hofmann2018b] generalized deep neural models, such as recurrent neural networks and gated recurrent units, to hyperbolic spaces. [Gulcehre et al.2019] imposed hyperbolic geometry on the activations of the neural network, while the other structures of the network remain in Euclidean spaces.


Hyperbolic Spaces and Graph Data

We provide some detailed reasons for modeling graphs with hyperbolic geometry. As mentioned in the Introduction, one key property of hyperbolic spaces is that they expand faster than Euclidean spaces. Specifically, considering a disk in a 2-dimensional hyperbolic space with constant curvature $-1$, the perimeter and area of the disk of hyperbolic radius $r$ are given as $2\pi\sinh r$ and $2\pi(\cosh r - 1)$, respectively, and both of them grow as $e^{r}$ with $r$ (because $\sinh r = (e^{r} - e^{-r})/2$ and $\cosh r = (e^{r} + e^{-r})/2$). In a 2-dimensional Euclidean space, the length of a circle and the area of a disk of Euclidean radius $r$ are given as $2\pi r$ and $\pi r^{2}$, growing only linearly and quadratically in $r$. Because of this property, some studies find that hyperbolic spaces may be the inherent spaces for graphs with hierarchical structure and power-law degree distribution [Krioukov et al.2010, Muscoloni et al.2017]. Hence, many real graphs with hierarchical structure and power-law distribution are well suited to being modeled in hyperbolic spaces [Papadopoulos et al.2012, Faqeeh, Osat, and Radicchi2018]. Moreover, some physics researchers have discovered that this kind of structure is a universal phenomenon in real-world graphs [Clauset, Moore, and Newman2008], including citation networks, social networks, and biological networks [Clauset, Shalizi, and Newman2009, Krioukov et al.2010].
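
The difference between exponential and polynomial growth can be made concrete with a few radii. The sketch below assumes curvature $-1$ for the hyperbolic disk and uses the standard formulas $2\pi\sinh r$ and $2\pi(\cosh r - 1)$; the function names are illustrative.

```python
import math


def hyperbolic_disk(r):
    """Perimeter and area of a disk of hyperbolic radius r (curvature -1)."""
    return 2 * math.pi * math.sinh(r), 2 * math.pi * (math.cosh(r) - 1)


def euclidean_disk(r):
    """Perimeter and area of a disk of Euclidean radius r."""
    return 2 * math.pi * r, math.pi * r ** 2


if __name__ == "__main__":
    for r in (1.0, 2.0, 5.0, 10.0):
        hp, ha = hyperbolic_disk(r)
        ep, ea = euclidean_disk(r)
        print(f"r={r:5.1f}  hyperbolic: {hp:12.2f} {ha:12.2f}"
              f"  euclidean: {ep:8.2f} {ea:8.2f}")
```

For large $r$, increasing the hyperbolic radius by one multiplies the perimeter by a factor close to $e$, whereas the Euclidean perimeter only gains $2\pi$.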

Gyrovector Spaces

Vector spaces form the algebraic formalism of Euclidean spaces, so we can use vector operations such as vector addition, subtraction, and scalar multiplication there. These familiar operations are the building blocks for designing algorithms in Euclidean space. However, they cannot be carried out in hyperbolic spaces. Fortunately, just as vector spaces form the algebraic formalism for Euclidean geometry, the framework of gyrovector spaces provides an algebraic formalism for hyperbolic geometry [Ungar2001, Ungar2008]. Gyrovector spaces enable operations analogous to vector addition and scalar multiplication to be carried out in hyperbolic spaces, so we can use gyrovector operations to design algorithms in hyperbolic spaces. We therefore briefly introduce the framework of gyrovector spaces here.

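In particular, the operations in gyrovector spaces are defined in an open $n$-dimensional ball:

$$\mathbb{D}^{n}_{c} = \{x \in \mathbb{R}^{n} : c\|x\|^{2} < 1\},$$

where $c \geq 0$ corresponds to the radius of the ball. If $c = 0$, i.e., $\mathbb{D}^{n}_{c} = \mathbb{R}^{n}$, the ball equals the Euclidean space; if $c > 0$, $\mathbb{D}^{n}_{c}$ is the open ball of radius $1/\sqrt{c}$; if $c = 1$, we recover the usual ball $\mathbb{D}^{n}$. The gyrovector operations are performed in this $n$-dimensional ball.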

The Proposed Model

In this section, we present our hyperbolic graph attention network model, named HAT, whose framework is shown in Fig. 2. In general, we project and transform the input node features in a hyperbolic space and design a hyperbolic attention mechanism on the node features. Hence, our model can be summarized as two procedures: (1) The hyperbolic feature projection. Given the original input node feature, this procedure projects it into a hyperbolic space through the exponential map and the hyperbolic linear transformation, so as to obtain the latent representation of the node in hyperbolic space. (2) The hyperbolic attention mechanism. This procedure designs an attention mechanism based on hyperbolic proximity to aggregate the latent representations. Finally, we feed the aggregated representations to a loss function for the downstream task. Here we mainly describe a single graph attentional layer, as a single layer is used throughout all of the HAT architectures in our experiments. Furthermore, we devise an acceleration strategy to speed up the proposed model by using logarithmic and exponential mappings.

Figure 2: The framework of HAT model.

The HAT Model

The hyperbolic feature projection

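The input of a GNN is the node feature, whose norm could lie outside the open ball defined in gyrovector spaces. To make the node feature available in hyperbolic spaces, we use the exponential map to project the feature into the hyperbolic space. Specifically, let $h_i \in \mathbb{R}^{d}$ be the feature of node $i$. For $x \in \mathbb{D}^{d}_{c}$, where $x$ is a point in hyperbolic space and $T_x\mathbb{D}^{d}_{c}$ is the tangent space at point $x$, the exponential map $\exp^{c}_{x}: T_x\mathbb{D}^{d}_{c} \to \mathbb{D}^{d}_{c}$ is given for $v \neq 0$ by:

$$\exp^{c}_{x}(v) = x \oplus_c \left(\tanh\left(\frac{\sqrt{c}\,\lambda^{c}_{x}\|v\|}{2}\right)\frac{v}{\sqrt{c}\,\|v\|}\right);$$

when $x = 0$, the exponential map is defined as:

$$\exp^{c}_{0}(v) = \tanh\left(\sqrt{c}\,\|v\|\right)\frac{v}{\sqrt{c}\,\|v\|},$$

where $\lambda^{c}_{x} = 2/(1 - c\|x\|^{2})$ is a conformal factor. The operation $\oplus_c$ is the Möbius addition, which is defined in Eq. (6). Here we assume that the feature $h_i$ lies in the tangent space at the point $x = 0$, so we can get the new feature $h'_i$ in hyperbolic space via $h'_i = \exp^{c}_{0}(h_i)$.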
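
As an illustration of the projection step, the following is a minimal sketch of the exponential map at the origin, $\exp^{c}_{0}(v) = \tanh(\sqrt{c}\|v\|)\,v/(\sqrt{c}\|v\|)$, written with plain Python lists. The function names `norm` and `expmap0` are our own and not taken from the paper's implementation.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def expmap0(v, c=1.0):
    """Exponential map at the origin of the ball {x : c * ||x||^2 < 1}."""
    n = norm(v)
    if n == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


if __name__ == "__main__":
    feature = [3.0, 4.0]           # Euclidean norm 5, outside the unit ball
    point = expmap0(feature, c=1.0)
    print(point, norm(point))      # the norm is tanh(5), just below 1
```

The map keeps the direction of the feature and squashes its length to $\tanh(\sqrt{c}\|v\|)/\sqrt{c}$, which is always smaller than the ball radius $1/\sqrt{c}$ in exact arithmetic. For very large norms, floating-point `tanh` rounds to 1, so a practical implementation would likely clip the result slightly inside the ball.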

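We then transform $h'_i$ into a higher-level latent representation $h''_i$ to obtain sufficient representation power. To achieve this, we use a shared linear transformation parametrized by a weight matrix $M \in \mathbb{R}^{d' \times d}$ (where $d'$ is the dimension of the final representation) and employ the Möbius matrix-vector multiplication [Ganea, Bécigneul, and Hofmann2018b]. If $Mh'_i \neq 0$, we have

$$h''_i = M \otimes_c h'_i = \frac{1}{\sqrt{c}}\tanh\left(\frac{\|Mh'_i\|}{\|h'_i\|}\tanh^{-1}\left(\sqrt{c}\,\|h'_i\|\right)\right)\frac{Mh'_i}{\|Mh'_i\|},$$

and if $Mh'_i = 0$, $M \otimes_c h'_i = 0$. Here $h''_i$ can be considered as a latent representation in the hidden layer of HAT.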
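
The Möbius matrix-vector multiplication can be sketched directly from its definition. The code below is an illustrative pure-Python version; the names `norm` and `mobius_matvec` are assumptions of this sketch, and the input point is assumed to lie strictly inside the ball so that `atanh` is defined.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def mobius_matvec(M, x, c=1.0):
    """Möbius matrix-vector product M (x)_c x for a point x inside the ball.

    M is a list of rows; x must satisfy c * ||x||^2 < 1.
    """
    Mx = [sum(m * a for m, a in zip(row, x)) for row in M]
    n_x, n_Mx = norm(x), norm(Mx)
    if n_Mx == 0.0:
        # Covers both Mx = 0 and x = 0.
        return [0.0 for _ in Mx]
    sc = math.sqrt(c)
    scale = math.tanh(n_Mx / n_x * math.atanh(sc * n_x)) / (sc * n_Mx)
    return [scale * a for a in Mx]


if __name__ == "__main__":
    x = [0.3, 0.4]
    M = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # maps 2 dimensions to 3
    y = mobius_matvec(M, x)
    print(y, norm(y))                          # the result stays inside the ball
```

With the identity matrix the operation returns the input point unchanged, which is a quick sanity check for any implementation.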

The hyperbolic attention mechanism

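We then perform a self-attention mechanism on the nodes. The attention coefficient $e_{ij}$, which indicates the importance of node $j$ to node $i$, can be computed as:

$$e_{ij} = f(h''_i, h''_j),$$

where $h''_i$ and $h''_j$ are the latent representations of nodes $i$ and $j$ in hyperbolic space, and $f$ represents the function that computes the attention coefficient. Here we only compute $e_{ij}$ for nodes $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is the set of neighbors of node $i$ in the graph. Since high similarity between nodes $i$ and $j$ should yield a large attention coefficient, we define $f$ based on the distance in hyperbolic spaces, which measures the similarity between nodes. Specifically, the generalized hyperbolic metric tensor $g^{c}_{x}$ is conformal to the Euclidean one, with conformal factor $\lambda^{c}_{x} = 2/(1 - c\|x\|^{2})$; given two node latent representations $h''_i, h''_j \in \mathbb{D}^{d'}_{c}$, the distance is given by:

$$d_c(h''_i, h''_j) = \frac{2}{\sqrt{c}}\tanh^{-1}\left(\sqrt{c}\,\left\|-h''_i \oplus_c h''_j\right\|\right),$$

where the operator $\oplus_c$ is the Möbius addition in $\mathbb{D}^{d'}_{c}$, defined for two points $x$ and $y$ as:

$$x \oplus_c y = \frac{\left(1 + 2c\langle x, y\rangle + c\|y\|^{2}\right)x + \left(1 - c\|x\|^{2}\right)y}{1 + 2c\langle x, y\rangle + c^{2}\|x\|^{2}\|y\|^{2}}. \tag{6}$$

Then, we compute the self-attention coefficient as:

$$e_{ij} = -d_c(h''_i, h''_j). \tag{7}$$

Because hyperbolic spaces are metric spaces, there are two advantages of using the hyperbolic distance to calculate the self-attention coefficient. (1) Unlike the inner product in Euclidean spaces, the hyperbolic distance satisfies the triangle inequality, so the self-attention can preserve the transitivity among nodes. (2) The attention coefficient of a given node with itself is $e_{ii} = -d_c(h''_i, h''_i) = 0$, which is always the largest among its neighbors. As the representation of a node should mainly maintain its own characteristics, this attention coefficient meets that requirement mathematically, while some other graph attention networks, e.g., GAT [Veličković et al.2018], cannot guarantee this.

For all the neighbors $j$ of node $i$ (including itself), we should make their attention coefficients easily comparable, so we normalize them using the softmax function:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{l \in \mathcal{N}_i}\exp(e_{il})}. \tag{8}$$

The normalized attention coefficient $\alpha_{ij}$ is used to compute a linear combination of the latent representations of all the nodes $j \in \mathcal{N}_i$. So the final aggregated representation for node $i$ is as follows:

$$h^{+}_i = \sigma\left(\bigoplus\nolimits^{c}_{j \in \mathcal{N}_i} \alpha_{ij} \otimes_c h''_j\right), \tag{9}$$

where $\bigoplus^{c}$ is the accumulation of Möbius addition and $\sigma$ is a nonlinearity, here ELU. The operation $\otimes_c$ can be realized by the Möbius scalar multiplication. For $x \in \mathbb{D}^{n}_{c} \setminus \{0\}$, the Möbius scalar multiplication of $x$ by $r \in \mathbb{R}$ is defined as:

$$r \otimes_c x = \frac{1}{\sqrt{c}}\tanh\left(r\tanh^{-1}\left(\sqrt{c}\,\|x\|\right)\right)\frac{x}{\|x\|}, \tag{10}$$

and $r \otimes_c 0 = 0$.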
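
The distance-based attention can be sketched in a few lines. The code below is a minimal illustration that assumes the attention score is the negative hyperbolic distance followed by a softmax over the neighborhood (which includes the node itself); the helper names and the toy points are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mobius_add(x, y, c=1.0):
    """Möbius addition of x and y in the ball {x : c * ||x||^2 < 1}."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def hyp_distance(x, y, c=1.0):
    """Hyperbolic distance (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    diff = mobius_add([-a for a in x], y, c)
    sc = math.sqrt(c)
    return 2 / sc * math.atanh(sc * math.sqrt(dot(diff, diff)))


def attention_weights(h, i, neighbours, c=1.0):
    """Softmax over the scores e_ij = -d_c(h_i, h_j) for j in `neighbours`."""
    scores = [-hyp_distance(h[i], h[j], c) for j in neighbours]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy latent representations inside the unit ball (illustrative values).
    h = [[0.1, 0.2], [0.15, 0.25], [-0.5, 0.3]]
    print(attention_weights(h, 0, [0, 1, 2]))
```

Because the distance from a node to itself is zero and all other distances are positive, the node's own weight is the largest in its neighborhood, and closer neighbors receive larger weights than distant ones.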

We can apply the final representations to specific tasks and optimize them with different loss functions. In this paper, we consider the semi-supervised node classification task and use the cross-entropy loss to train our model.

Some properties of Möbius operations

To help make sense of the Möbius operations, we describe some of their properties in this section. Some Möbius operations recover the Euclidean operations when $c$ goes to zero. Specifically, for Möbius addition and Möbius scalar multiplication, we have $\lim_{c \to 0} x \oplus_c y = x + y$ and $\lim_{c \to 0} r \otimes_c x = rx$, respectively. Also, the Möbius matrix-vector multiplication and the Möbius scalar multiplication satisfy associativity: $(MM') \otimes_c x = M \otimes_c (M' \otimes_c x)$ and $(rr') \otimes_c x = r \otimes_c (r' \otimes_c x)$, respectively. The Möbius scalar multiplication also satisfies scalar distributivity, $(r + r') \otimes_c x = r \otimes_c x \oplus_c r' \otimes_c x$. However, in general, the Möbius addition is neither commutative nor associative, which makes Eq. (9) inefficient to compute. This problem is discussed in the following section.
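
Two of these properties are easy to check numerically. The sketch below uses the standard formula for Möbius addition in the Poincaré ball; the sample points are arbitrary and the function name is our own.

```python
def mobius_add(x, y, c=1.0):
    """Möbius addition of x and y in the ball {x : c * ||x||^2 < 1}."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / denom
            for a, b in zip(x, y)]


if __name__ == "__main__":
    x, y, z = [0.3, 0.1], [-0.2, 0.4], [0.1, -0.5]

    # Swapping the arguments changes the result (same norm, different point).
    print(mobius_add(x, y), mobius_add(y, x))

    # Compare the two groupings of a three-term sum.
    print(mobius_add(mobius_add(x, y), z), mobius_add(x, mobius_add(y, z)))

    # With a very small c the result approaches the Euclidean sum x + y.
    print(mobius_add(x, y, c=1e-8))
```

For these sample points, $x \oplus_c y$ and $y \oplus_c x$ have the same norm but point in different directions, which is why the order of accumulation matters.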

Acceleration of HAT

In our proposed model HAT, the calculation of Eq. (9) is very time-consuming, which seriously affects the efficiency of HAT. As mentioned before, the Möbius addition in Eq. (9) is neither commutative nor associative, meaning that we have to calculate the result in order. Specifically, we denote $\alpha_{ij} \otimes_c h''_j$ as $t_{ij}$ and index the neighbors of node $i$ as $1, \dots, |\mathcal{N}_i|$, so the accumulation term in Eq. (9) can be rewritten as follows:

$$\bigoplus\nolimits^{c}_{j \in \mathcal{N}_i} t_{ij} = \left(\left(\left(t_{i1} \oplus_c t_{i2}\right) \oplus_c t_{i3}\right) \oplus_c \cdots\right) \oplus_c t_{i|\mathcal{N}_i|}. \tag{11}$$

As we can see, Eq. (11) has to be calculated in a serial manner. Large graphs typically contain hubs with many edges, so this calculation becomes impractical.

In fact, some operations in gyrovector spaces can be derived from the logarithmic map and the exponential map. Taking the Möbius scalar multiplication as an example, it first uses the logarithmic map to project the representation into a tangent space, then multiplies the projected representation by a scalar in the tangent space, and finally projects it back onto the manifold with the exponential map [Ganea, Bécigneul, and Hofmann2018b]. The logarithmic map and the exponential map move the representation between the two manifolds (the hyperbolic space and its tangent space) in a consistent manner. Specifically, for two points $x, y \in \mathbb{D}^{n}_{c}$, the logarithmic map $\log^{c}_{x}: \mathbb{D}^{n}_{c} \to T_x\mathbb{D}^{n}_{c}$ is given for $y \neq x$ by:

$$\log^{c}_{x}(y) = \frac{2}{\sqrt{c}\,\lambda^{c}_{x}}\tanh^{-1}\left(\sqrt{c}\,\left\|-x \oplus_c y\right\|\right)\frac{-x \oplus_c y}{\left\|-x \oplus_c y\right\|}; \tag{12}$$

when $x = 0$, we have:

$$\log^{c}_{0}(y) = \tanh^{-1}\left(\sqrt{c}\,\|y\|\right)\frac{y}{\sqrt{c}\,\|y\|}. \tag{13}$$

The logarithmic map enables us to get the representations in a tangent space. As tangent spaces are vector spaces, we can combine the representations just as we do in Euclidean spaces, i.e., $\sum_{j \in \mathcal{N}_i}\alpha_{ij}\log^{c}_{0}(h''_j)$. After the linear combination, we use the exponential map to project the representations back to the hyperbolic space, giving rise to the final representations:

$$h^{+}_i = \sigma\left(\exp^{c}_{0}\left(\sum_{j \in \mathcal{N}_i}\alpha_{ij}\log^{c}_{0}(h''_j)\right)\right). \tag{14}$$

Unlike in Eq. (9), the accumulation operation in Eq. (14) is commutative and associative, so it can be computed in parallel. Thus, our model becomes more efficient.
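
The contrast between the serial Möbius accumulation and the tangent-space aggregation can be sketched as follows. This is an illustrative pure-Python version, not the authors' implementation: the function names are hypothetical, the ELU is assumed to be applied last in both variants, and the two variants are different operations that are not expected to return identical values.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def mobius_add(x, y, c=1.0):
    xy = sum(a * b for a, b in zip(x, y))
    x2, y2 = sum(a * a for a in x), sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def mobius_scalar_mul(r, x, c=1.0):
    n = norm(x)
    if n == 0.0:
        return [0.0 for _ in x]
    sc = math.sqrt(c)
    scale = math.tanh(r * math.atanh(sc * n)) / (sc * n)
    return [scale * a for a in x]


def logmap0(y, c=1.0):
    """Logarithmic map at the origin: ball -> tangent space."""
    n = norm(y)
    if n == 0.0:
        return [0.0 for _ in y]
    sc = math.sqrt(c)
    return [math.atanh(sc * n) / (sc * n) * a for a in y]


def expmap0(v, c=1.0):
    """Exponential map at the origin: tangent space -> ball."""
    n = norm(v)
    if n == 0.0:
        return [0.0 for _ in v]
    sc = math.sqrt(c)
    return [math.tanh(sc * n) / (sc * n) * a for a in v]


def elu(v):
    return [a if a > 0 else math.expm1(a) for a in v]


def aggregate_serial(weights, points, c=1.0):
    """Left-to-right Möbius accumulation of alpha_j (x)_c h_j, then ELU.

    Each step depends on the previous one, so the loop cannot be reordered.
    """
    acc = mobius_scalar_mul(weights[0], points[0], c)
    for w, p in zip(weights[1:], points[1:]):
        acc = mobius_add(acc, mobius_scalar_mul(w, p, c), c)
    return elu(acc)


def aggregate_tangent(weights, points, c=1.0):
    """Weighted sum in the tangent space at the origin, mapped back, then ELU.

    The sum is an ordinary vector sum, so its terms can be computed in any
    order or in parallel.
    """
    total = [0.0] * len(points[0])
    for w, p in zip(weights, points):
        total = [t + w * a for t, a in zip(total, logmap0(p, c))]
    return elu(expmap0(total, c))


if __name__ == "__main__":
    pts = [[0.3, 0.1], [-0.2, 0.4], [0.1, -0.5]]
    alpha = [0.5, 0.3, 0.2]
    print(aggregate_serial(alpha, pts))
    print(aggregate_tangent(alpha, pts))
```

In a tensor library the tangent-space variant reduces to a sparse matrix product between the attention matrix and the log-mapped features, which is what makes it amenable to GPU parallelism.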

Complexity Analysis

The time complexity of HAT is $O(Ndd' + Ed')$, where $d$ and $d'$ are the dimensions of the input and output features, respectively, and $N$ and $E$ are the numbers of nodes and edges in the graph, respectively. The complexity is on par with other GNN methods, such as GAT [Veličković et al.2018] and GCN [Kipf and Welling2017].

More importantly, our model can also be parallelized. For example, with the proposed acceleration strategy, the computation of the aggregated representation (i.e., Eq. (14)) can be parallelized across all nodes, and the self-attention operations (i.e., Eq. (7)) can be parallelized across all edges. Taking the Cora graph [Sen et al.2008] as an example, on a GPU (NVIDIA GTX 1080 Ti), HAT takes only about 84 seconds to converge with the acceleration strategy, while it cannot converge within 12 hours without the acceleration strategy.


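| Dataset    | Cora | Citeseer | Pubmed | Amazon Photo |
|------------|------|----------|--------|--------------|
| # Nodes    | 2708 | 3327     | 19717  | 7650         |
| # Edges    | 5429 | 4732     | 44338  | 143663       |
| # Features | 1433 | 3703     | 500    | 745          |
| # Classes  | 7    | 6        | 3      | 8            |

Table 1: Summary of the datasets.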

Experiments Setup


Datasets

We employ four widely used real-world graphs for evaluation: Cora, Citeseer, Pubmed [Sen et al.2008], and Amazon Photo [Shchur et al.2018]. Their statistics are summarized in Table 1. In Cora, Citeseer, and Pubmed, each node represents a document and each edge represents a citation relation. In Amazon Photo, each node represents a product and each edge indicates that two products are frequently bought together. Every node in these datasets has a label and a bag-of-words representation. For a fair comparison, we follow the setting of previous work [Yang, Cohen, and Salakhutdinov2016, Kipf and Welling2017, Veličković et al.2018]: for each dataset, we use only 20 nodes per class for training, 500 nodes for validation, and 1000 nodes for testing, and the training algorithm has access to the features of all nodes.


Baselines

We compare our method with the following state-of-the-art methods: (1) graph embedding methods, including Euclidean graph embedding methods, i.e., DeepWalk [Perozzi, Al-Rfou, and Skiena2014], Node2vec [Grover and Leskovec2016], and LINE [Tang et al.2015], and a hyperbolic graph embedding method, i.e., PoincaréEmb [Nickel and Kiela2017]; (2) semi-supervised graph neural networks, i.e., GCN [Kipf and Welling2017] and GAT [Veličković et al.2018].

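| Dataset      | Dimension | DeepWalk | Node2vec | LINE(1st) | LINE(2nd) | PoincaréEmb | GCN       | GAT   | HAT       |
|--------------|-----------|----------|----------|-----------|-----------|-------------|-----------|-------|-----------|
| Cora         | 2         | 0.359    | 0.386    | 0.255     | 0.180     | 0.491       | 0.452     | 0.550 | **0.608** |
|              | 4         | 0.566    | 0.593    | 0.314     | 0.324     | 0.536       | 0.714     | 0.751 | **0.787** |
|              | 8         | 0.605    | 0.635    | 0.473     | 0.335     | 0.574       | 0.806     | 0.798 | **0.828** |
|              | 16        | 0.617    | 0.645    | 0.485     | 0.381     | 0.642       | 0.815     | 0.819 | **0.831** |
| Citeseer     | 2         | 0.257    | 0.316    | 0.193     | 0.180     | 0.287       | 0.357     | 0.512 | **0.546** |
|              | 4         | 0.401    | 0.427    | 0.226     | 0.243     | 0.310       | 0.556     | 0.656 | **0.681** |
|              | 8         | 0.427    | 0.451    | 0.261     | 0.245     | 0.365       | 0.679     | 0.697 | **0.712** |
|              | 16        | 0.459    | 0.471    | 0.307     | 0.269     | 0.399       | 0.704     | 0.704 | **0.719** |
| Pubmed       | 2         | 0.535    | 0.565    | 0.342     | 0.379     | 0.614       | 0.632     | 0.743 | **0.761** |
|              | 4         | 0.645    | 0.669    | 0.504     | 0.380     | 0.629       | 0.708     | 0.761 | **0.767** |
|              | 8         | 0.672    | 0.692    | 0.522     | 0.423     | 0.659       | **0.786** | 0.766 | 0.781     |
|              | 16        | 0.681    | 0.697    | 0.529     | 0.479     | 0.678       | **0.791** | 0.770 | 0.782     |
| Amazon Photo | 2         | 0.580    | 0.612    | 0.240     | 0.239     | 0.615       | 0.319     | 0.309 | **0.629** |
|              | 4         | 0.756    | 0.768    | 0.321     | 0.613     | 0.769       | 0.559     | 0.686 | **0.782** |
|              | 8         | 0.790    | 0.803    | 0.529     | 0.617     | 0.777       | 0.786     | 0.784 | **0.843** |
|              | 16        | 0.798    | 0.808    | 0.624     | 0.630     | 0.788       | 0.819     | 0.835 | **0.858** |

Table 2: Quantitative results on the node classification task. The best results are marked by bold numbers.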

Parameter Settings

For all the methods, we run the experiments with embedding dimensions of 2, 4, 8, and 16 (i.e., the number of hidden units in the GNNs). For DeepWalk and Node2vec, we set the window size to 5, the walk length to 80, and the number of walks per node to 40. For PoincaréEmb, LINE(1st), and LINE(2nd), we set the number of negative samples to {5, 10}. For GAT, because of the limited dimension, we run the experiments with single-head attention. For HAT, we set . We tune the parameters for all methods on the validation data. Moreover, HAT without the acceleration strategy is very time-consuming, so we did not run experiments for that case.

Experimental Results

Node Classification

Node classification is a basic task widely used to evaluate the effectiveness of representations. GCN, GAT, and HAT are semi-supervised models that can be used directly to classify the nodes. For DeepWalk and Node2vec, we employ a KNN classifier with

to perform the node classification. Because the KNN classifier cannot be directly applied in hyperbolic spaces, for PoincaréEmb we project the representations into the tangent space at $0$ via $\log^{c}_{0}(\cdot)$, and then feed the representations into the classifier. We report the average accuracy of 10 runs with random weight initialization.
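
The projection followed by a nearest-neighbor vote can be sketched as below. This is only an illustration: the value of `k`, the toy embeddings, and the function names are hypothetical, and Euclidean distance in the tangent space at the origin is assumed as the neighbor metric.

```python
import math
from collections import Counter


def logmap0(y, c=1.0):
    """Logarithmic map at the origin: ball -> tangent space."""
    n = math.sqrt(sum(a * a for a in y))
    if n == 0.0:
        return [0.0 for _ in y]
    sc = math.sqrt(c)
    return [math.atanh(sc * n) / (sc * n) * a for a in y]


def knn_predict(train_points, train_labels, query, k=3, c=1.0):
    """Majority vote among the k nearest training points, measured with
    Euclidean distance after mapping every point to the tangent space."""
    q = logmap0(query, c)
    dists = []
    for p, label in zip(train_points, train_labels):
        v = logmap0(p, c)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, v)))
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings inside the unit ball.
    points = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.2],
              [-0.5, -0.5], [-0.4, -0.5], [-0.5, -0.4]]
    labels = [0, 0, 0, 1, 1, 1]
    print(knn_predict(points, labels, [0.12, 0.08], k=3))
```

The log map keeps the direction of each point and stretches its radius, so points near the boundary of the ball are spread out in the tangent space before distances are compared.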

The results are shown in Table 2. HAT achieves the best performance in most cases, and its advantage is more pronounced in the low-dimension setting. Moreover, the GNN-based methods (i.e., GCN, GAT, and HAT) perform better than the other baselines (i.e., DeepWalk, Node2vec, LINE(1st, 2nd), and PoincaréEmb) in most cases, because they combine the graph structure and the node features in their models. Furthermore, compared to the GNN methods in Euclidean spaces (i.e., GCN and GAT), HAT performs better in most cases, especially in low dimensions, suggesting the superiority of modeling graphs in hyperbolic spaces. The superiority of hyperbolic spaces is further validated by the comparison of LINE(1st) and PoincaréEmb. Although both of them preserve the first-order proximity in graphs, the hyperbolic graph embedding method PoincaréEmb always performs better than LINE(1st).

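| Dataset      | Dimension | DeepWalk | Node2vec | LINE(1st) | LINE(2nd) | PoincaréEmb | GCN   | GAT       | HAT       |
|--------------|-----------|----------|----------|-----------|-----------|-------------|-------|-----------|-----------|
| Cora         | 2         | 0.264    | 0.281    | 0.075     | 0.074     | 0.245       | 0.341 | **0.404** | 0.382     |
|              | 4         | 0.274    | 0.292    | 0.239     | 0.111     | 0.329       | 0.428 | 0.504     | **0.519** |
|              | 8         | 0.358    | 0.365    | 0.277     | 0.105     | 0.395       | 0.501 | 0.572     | **0.582** |
|              | 16        | 0.404    | 0.415    | 0.292     | 0.119     | 0.441       | 0.524 | **0.584** | 0.581     |
| Citeseer     | 2         | 0.090    | 0.164    | 0.048     | 0.012     | 0.121       | 0.248 | 0.315     | **0.321** |
|              | 4         | 0.121    | 0.169    | 0.104     | 0.036     | 0.160       | 0.344 | 0.391     | **0.399** |
|              | 8         | 0.156    | 0.177    | 0.083     | 0.043     | 0.194       | 0.401 | 0.417     | **0.427** |
|              | 16        | 0.179    | 0.209    | 0.092     | 0.057     | 0.264       | 0.426 | 0.430     | **0.439** |
| Pubmed       | 2         | 0.153    | 0.195    | 0.076     | 0.043     | 0.206       | 0.230 | 0.334     | **0.345** |
|              | 4         | 0.162    | 0.214    | 0.083     | 0.036     | 0.221       | 0.254 | 0.340     | **0.358** |
|              | 8         | 0.196    | 0.224    | 0.102     | 0.055     | 0.257       | 0.242 | 0.343     | **0.386** |
|              | 16        | 0.231    | 0.286    | 0.115     | 0.077     | 0.284       | 0.262 | 0.352     | **0.393** |
| Amazon Photo | 2         | 0.478    | 0.489    | 0.145     | 0.399     | 0.499       | 0.187 | 0.464     | **0.505** |
|              | 4         | 0.578    | 0.584    | 0.242     | 0.370     | 0.527       | 0.207 | 0.595     | **0.647** |
|              | 8         | 0.643    | 0.663    | 0.363     | 0.413     | 0.591       | 0.240 | 0.636     | **0.672** |
|              | 16        | 0.681    | 0.710    | 0.388     | 0.416     | 0.626       | 0.254 | 0.659     | **0.719** |

Table 3: Quantitative results on the node clustering task. The best results are marked by bold numbers.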
(a) Neighbors of P1728
(b) Attention values of P1728’s neighbors
Figure 3: Neighbors of node P1728 and corresponding attention values. Different colors and patterns indicate different classes.

Node Clustering

Here we conduct the clustering task to evaluate the representations learned by the different methods. For the GNN-based methods (i.e., GCN, GAT, and HAT), we take the feature representations of the test nodes from the hidden layer. We use K-means to perform node clustering, and the number of clusters is set to the number of labels. For PoincaréEmb and HAT, we project the representations via $\log^{c}_{0}(\cdot)$, and then feed them into K-means. We report the average normalized mutual information (NMI) of 10 runs with random weight initialization.
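
For readers unfamiliar with the metric, the following is a minimal sketch of NMI between a clustering and the ground-truth labels. The text does not specify which normalization is used; this sketch assumes the arithmetic mean of the two entropies, and the K-means step itself is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((nij / n) * math.log(nij * n / (count_a[u] * count_b[v]))
               for (u, v), nij in joint.items())


def nmi(a, b):
    """NMI with arithmetic-mean normalisation (an assumption of this sketch)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 and hb == 0.0:
        return 1.0
    return mutual_information(a, b) / ((ha + hb) / 2)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(nmi(truth, [1, 1, 0, 0]))   # same partition with renamed clusters
    print(nmi(truth, [0, 1, 0, 1]))   # partition independent of the truth
```

NMI is invariant to how the clusters are numbered, which is why it suits K-means output, where cluster indices carry no meaning.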

The results are displayed in Table 3. HAT performs better than the other baselines in most cases. Moreover, on Amazon Photo, some graph embedding methods achieve better results than GCN and GAT, while HAT still outperforms all baselines, demonstrating the benefit of designing graph neural networks in hyperbolic spaces. The superiority of hyperbolic spaces is again validated by the comparison of LINE(1st) and PoincaréEmb.

Analysis of Attention Mechanism

We examine the attention values learned by HAT in this section. Intuitively, more important nodes tend to have larger attention values. We take the paper “P1728” in the Cora dataset as an illustrative example. As shown in Fig. 3(a), the paper P1728 has 5 neighbors, and the labels of the nodes are indicated by colors and patterns. From Fig. 3(b), we can see that the paper P1728 itself gets the highest attention value, which means the node itself plays the most essential role in its representation. P2599, P961, and P2555 get the second, third, and fourth highest attention values, respectively. This is because these three papers belong to the same class as P1728, and they can make a significant contribution to identifying the class of P1728. The neighbors from irrelevant classes, i.e., P1358 and P2257, get the smallest attention values. Based on this analysis, we can see that our proposed attention mechanism can automatically distinguish among neighbors.

(a) DeepWalk
(b) Node2vec
(c) LINE(1st)
(d) LINE(2nd)
(e) PoincaréEmb
(f) GCN
(g) GAT
(h) HAT
Figure 4: Visualization of 2-dimension representations on Pubmed.

Graph Visualization

Graph visualization, which aims to lay out a graph in a two-dimensional space, is another important graph application. Here, we take Pubmed as a case study to visualize the learned representations. Following [Nickel and Kiela2017, Nickel and Kiela2018, Ganea, Bécigneul, and Hofmann2018a], we directly visualize the learned two-dimensional representations of the nodes. As shown in Fig. 4, each point indicates one paper and its color indicates the label. The three GNN methods (i.e., HAT, GCN, and GAT) distinguish the three classes of nodes relatively clearly. Compared to GCN and GAT, HAT separates all three categories with clearer boundaries and larger margins.


Conclusion

In this paper, we make the first effort toward investigating graph neural networks in hyperbolic spaces and propose a novel hyperbolic graph attention network, HAT. With the framework of gyrovector spaces, we redesign the graph operations in hyperbolic spaces and propose an attention mechanism based on hyperbolic proximity. We further devise an acceleration strategy to improve the efficiency of HAT. Extensive experiments on four datasets demonstrate the superiority of HAT compared with state-of-the-art methods.


Acknowledgments

Hyperbolic representation learning has attracted considerable research attention recently. We note that some hyperbolic GNNs were developed independently during the same period [Chami et al.2019, Liu, Nickel, and Kiela2019, Bachmann, Bécigneul, and Ganea2019], and we are honored to have completed such closely related work at the same time. Thank you.


References

  • [Bachmann, Bécigneul, and Ganea2019] Bachmann, G.; Bécigneul, G.; and Ganea, O.-E. 2019. Constant curvature graph convolutional networks. arXiv preprint arXiv:1911.05076.
  • [Bronstein et al.2017] Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34(4):18–42.
  • [Chami et al.2019] Chami, I.; Ying, Z.; Ré, C.; and Leskovec, J. 2019. Hyperbolic graph convolutional neural networks. In NeurIPS, 4869–4880.
  • [Clauset, Moore, and Newman2008] Clauset, A.; Moore, C.; and Newman, M. E. 2008. Hierarchical structure and the prediction of missing links in networks. Nature 453(7191):98.
  • [Clauset, Shalizi, and Newman2009] Clauset, A.; Shalizi, C. R.; and Newman, M. E. 2009. Power-law distributions in empirical data. SIAM review 51(4):661–703.
  • [Defferrard, Bresson, and Vandergheynst2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 3844–3852.
  • [Faqeeh, Osat, and Radicchi2018] Faqeeh, A.; Osat, S.; and Radicchi, F. 2018. Characterizing the analogy between hyperbolic embedding and community structure of complex networks. Physical review letters 121(9):098301.
  • [Ganea, Bécigneul, and Hofmann2018a] Ganea, O.; Bécigneul, G.; and Hofmann, T. 2018a. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 1646–1655.
  • [Ganea, Bécigneul, and Hofmann2018b] Ganea, O.; Bécigneul, G.; and Hofmann, T. 2018b. Hyperbolic neural networks. In NeurIPS, 5350–5360.
  • [Gori, Monfardini, and Scarselli2005] Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learning in graph domains. In IJCNN, 729–734.
  • [Grover and Leskovec2016] Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In SIGKDD, 855–864.
  • [Gulcehre et al.2019] Gulcehre, C.; Denil, M.; Malinowski, M.; Razavi, A.; Pascanu, R.; Hermann, K. M.; Battaglia, P.; Bapst, V.; Raposo, D.; Santoro, A.; et al. 2019. Hyperbolic attention networks. ICLR.
  • [Hamilton, Ying, and Leskovec2017] Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NeurIPS, 1024–1034.
  • [Kipf and Welling2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. ICLR.
  • [Krioukov et al.2010] Krioukov, D.; Papadopoulos, F.; Kitsak, M.; Vahdat, A.; and Boguná, M. 2010. Hyperbolic geometry of complex networks. Physical Review E 82(3):036106.
  • [Liu, Nickel, and Kiela2019] Liu, Q.; Nickel, M.; and Kiela, D. 2019. Hyperbolic graph neural networks. In NeurIPS, 8228–8239.
  • [Muscoloni et al.2017] Muscoloni, A.; Thomas, J. M.; Ciucci, S.; Bianconi, G.; and Cannistraci, C. V. 2017. Machine learning meets complex networks via coalescent embedding in the hyperbolic space. Nature communications 8(1):1615.
  • [Nickel and Kiela2017] Nickel, M., and Kiela, D. 2017. Poincaré embeddings for learning hierarchical representations. In NeurIPS, 6338–6347.
  • [Nickel and Kiela2018] Nickel, M., and Kiela, D. 2018. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. ICML 80.
  • [Papadopoulos et al.2012] Papadopoulos, F.; Kitsak, M.; Serrano, M. Á.; Boguná, M.; and Krioukov, D. 2012. Popularity versus similarity in growing networks. Nature 489(7417):537.
  • [Parisot et al.2017] Parisot, S.; Ktena, S. I.; Ferrante, E.; Lee, M.; Moreno, R. G.; Glocker, B.; and Rueckert, D. 2017. Spectral graph convolutions for population-based disease prediction. In MICCAI, 177–185.
  • [Perozzi, Al-Rfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In SIGKDD, 701–710.
  • [Sala et al.2018] Sala, F.; De Sa, C.; Gu, A.; and Ré, C. 2018. Representation tradeoffs for hyperbolic embeddings. In ICML, volume 80, 4460–4469.
  • [Scarselli et al.2009] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
  • [Sen et al.2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI magazine 29(3):93–93.
  • [Shchur et al.2018] Shchur, O.; Mumme, M.; Bojchevski, A.; and Günnemann, S. 2018. Pitfalls of graph neural network evaluation. NeurIPS 2018.
  • [Song et al.2019] Song, W.; Xiao, Z.; Wang, Y.; Charlin, L.; Zhang, M.; and Tang, J. 2019. Session-based social recommendation via dynamic graph attention networks. In WSDM, 555–563. ACM.
  • [Tang et al.2015] Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. Line: Large-scale information network embedding. In WWW, 1067–1077.
  • [Ungar2001] Ungar, A. A. 2001. Hyperbolic trigonometry and its application in the poincaré ball model of hyperbolic geometry. Computers & Mathematics with Applications 41(1-2):135–147.
  • [Ungar2008] Ungar, A. A. 2008. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics 1(1):1–194.
  • [Veličković et al.2018] Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph attention networks. ICLR.
  • [Wang, Zhang, and Shi2019] Wang, X.; Zhang, Y.; and Shi, C. 2019. Hyperbolic heterogeneous information network embedding. In AAAI, 5337–5344.
  • [Wilson et al.2014] Wilson, R. C.; Hancock, E. R.; Pekalska, E.; and Duin, R. P. 2014. Spherical and hyperbolic embeddings of data. IEEE transactions on pattern analysis and machine intelligence 36(11):2255–2269.
  • [Yang, Cohen, and Salakhutdinov2016] Yang, Z.; Cohen, W. W.; and Salakhutdinov, R. 2016. Revisiting semi-supervised learning with graph embeddings.
  • [Ying et al.2018] Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W. L.; and Leskovec, J. 2018. Graph convolutional neural networks for web-scale recommender systems. In SIGKDD, 974–983.