1. Introduction
Data describing the relationships between nodes of a graph are abundant in real-world applications, ranging from social network analysis (Liu et al., 2019b), traffic prediction (Zhao et al., 2019), e-commerce recommendation (Cui et al., 2019), and protein structure prediction (Fout et al., 2017) to disease propagation analysis (Chami et al., 2019). In many situations, networks are intrinsically changing or time-evolving, with vertices (including their attributes) and edges appearing or disappearing over time. Learning representations of dynamic structures is challenging but highly important, since such representations describe how the network interacts and evolves, which helps to understand and predict the behavior of the system.
A number of temporal graph embedding methods have been proposed, which can be divided into two main categories: discrete-time network embeddings and continuous-time network embeddings (Yang et al., 2020). Discrete-time network embeddings are represented in discrete time intervals denoted as multiple snapshots (Sankar et al., 2020; Pareja et al., 2020). Continuous-time network embeddings can be described as time-dependent event-based models where the events, denoted by edges, occur over a time span (Nguyen et al., 2018; Trivedi et al., 2019; Chang et al., 2020). Essentially, both schemes focus on capturing the underlying characteristics of a temporal graph, namely temporal dependency and topology evolution, in Euclidean space. Euclidean space is the natural generalization of the intuition-friendly and visualizable three-dimensional space, with an appealing vectorial structure and closed-form formulas for distance and inner product (Ganea et al., 2018). However, the quality of the representations is determined by how well the geometry of the embedding space matches the structure of the data (Gu et al., 2019), which raises one basic question: is the widely used Euclidean space the best option for network embedding of an arbitrary temporal graph? Several works (Bronstein et al., 2017; Ying et al., 2018) show that most graph data, e.g., social networks, communication networks, and disease-spreading networks, exhibit non-Euclidean latent anatomies with hierarchical structures and scale-free distributions, as illustrated in Figure 1. This motivates us to rethink (1) whether the Euclidean manifold is the most suitable geometry for graph embedding of this kind of data, and (2) whether there is a more powerful or proper alternative manifold that intrinsically preserves the graph properties and, if so, what benefits it can bring to temporal graph embedding.
Recently, hyperbolic geometry has received increasing attention and has achieved state-of-the-art performance in several static graph embedding tasks (Nickel and Kiela, 2017, 2018; Chami et al., 2019; Liu et al., 2019a; Zhang et al., 2019). One fundamental property of hyperbolic space is that it expands exponentially and can be regarded as a smooth version of trees, abstracting the hierarchical organization (Krioukov et al., 2010). Therefore, (approximately) hierarchical and tree-like data can naturally be represented by hyperbolic geometry, whereas embedding such data directly into Euclidean space leads to severe structural inductive biases and high distortion.
Despite the recent achievements in hyperbolic graph embedding, attempts on temporal networks are still scant. To fill this gap, in this work, we propose a novel hyperbolic temporal graph network (HTGN), which fully leverages the implicit hierarchical information to capture the spatial dependency and temporal regularities of evolving networks via a recurrent learning paradigm.
Following the concise and effective discrete-time temporal network embedding paradigm, a temporal network is first converted into a series of snapshots over time. In HTGN, we project the nodes into hyperbolic space, leverage a hyperbolic graph neural network (HGNN) to learn the topological dependencies of the nodes at each snapshot, and then employ a hyperbolic gated recurrent unit (HGRU) to further capture the temporal dependencies. A temporal network is complex and may have cyclical patterns, and a distant snapshot may be more significant than the closest one (Sankar et al., 2020; Liu et al., 2017). Recurrent neural networks (RNNs) (Cho et al., 2014; Hochreiter and Schmidhuber, 1997) usually restrict the model to emphasizing the nearest timesteps due to their time-monotonic assumption. We therefore design a wrapped hyperbolic temporal contextual attention (HTA) module that incorporates context from the latest historical states in hyperbolic space and assigns different weights to both distant and nearby snapshots. On the other hand, temporal coherence serves as a critical signal for sequential learning, since a regular temporal graph is usually continuous and smoothly varying. Inspired by cycle-consistency in video tracking (Wang et al., 2019; Dwibedi et al., 2019), we propose a novel hyperbolic temporal consistency (HTC) component that imposes a constraint on the latent representations of consecutive snapshots, ensuring stability and generalization when tracking the evolving behaviors. In summary, the main contributions are as follows:
We propose a novel hyperbolic temporal graph embedding model, named HTGN, to learn temporal regularities, topological dependencies, and implicitly hierarchical organization. To the best of our knowledge, this is the first study on temporal graph embedding built upon a hyperbolic geometry powered by the recurrent learning paradigm.

HTGN applies a flexible wrapped hyperbolic temporal contextual attention (HTA) module to effectively extract the diverse scope of historical information. A hyperbolic temporal consistency (HTC) constraint is further put forward to ensure the stability and generalization of the embeddings.

Extensive experimental results on diverse real-world temporal graphs demonstrate the superiority of HTGN, as it consistently outperforms the state-of-the-art methods on all the datasets by large margins. The ablation study further gives insights into how each proposed component contributes to the success of the model.
2. Related works
Our work mainly relates to representation learning on temporal graph embedding and hyperbolic graph embedding.
Temporal graph embedding. Temporal graphs are mainly defined in two ways: (1) discrete temporal graphs, which are a collection of evolving graph snapshots at multiple discrete time steps; and (2) continuous temporal graphs, which update too frequently to be represented well by snapshots and are instead denoted as graph streams (Skarding et al., 2020). Snapshot-based methods can be applied to a timestamped graph by creating suitable snapshots, but the converse is infeasible in most situations due to a lack of fine-grained timestamps. Hence, we here mainly focus on representation learning over discrete temporal graphs. For systematic and comprehensive reviews, readers may refer to (Skarding et al., 2020) and (Aggarwal and Subbian, 2014).
The set of approaches most relevant to our work is the recurrent learning scheme that integrates graph neural networks with a recurrent architecture, whereby the former captures graph information and the latter handles temporal dynamism by maintaining a hidden state that summarizes historical snapshots. For instance, GCRN (Seo et al., 2018) offers two different architectures to capture the temporal and spatial correlations of a dynamic network. The first is more straightforward and uses a GCN to obtain node embeddings, which are then fed into an LSTM to learn the temporal dynamism. The second is a modified LSTM that takes node features as input but replaces the fully-connected layers therein with graph convolutions. A related idea is explored in DySAT (Sankar et al., 2020), which instead computes node representations through joint self-attention along the two dimensions of structural neighborhood and temporal dynamics. VGRNN (Hajiramezanali et al., 2019) integrates GCRN with VGAE (Kipf and Welling, 2016), and each node at each timestep is represented with a distribution; hence, the uncertainty of the latent node representations is also modeled. On the other hand, EvolveGCN (Pareja et al., 2020) captures the dynamism of the graph sequence by using an RNN to evolve the GCN parameters rather than the temporal dynamics of the node embeddings. Most of the prevalent methods are built in Euclidean space, which, however, may underemphasize the intrinsic power-law distribution and hierarchical structure.
Hyperbolic graph embedding.
Hyperbolic geometry has received increasing attention in the machine learning and network science communities due to its attractive properties for modeling data with latent hierarchies. It has been applied to neural networks for problems in computer vision, natural language processing (Nickel and Kiela, 2017; Gulcehre et al., 2019; Nickel and Kiela, 2018; Sala et al., 2018), and graph embedding (Gulcehre et al., 2019; Zhang et al., 2019; Chami et al., 2019; Liu et al., 2019a). In the graph embedding field, recent works including HGNN (Liu et al., 2019a), HGCN (Chami et al., 2019), and HGAT (Zhang et al., 2019) generalize graph convolution to hyperbolic space (the names of these methods are taken from the corresponding literature) by moving the aggregation operation to the tangent space, where vector operations can be performed. HGNN (Liu et al., 2019a) focuses more on graph classification tasks and provides an extension to dynamic graph embeddings. HGAT (Zhang et al., 2019) introduces a hyperbolic attention-based graph convolution using the algebraic formalism of gyrovector spaces and focuses on node classification and clustering tasks. HGCN (Chami et al., 2019) introduces a local aggregation scheme in the local tangent space and develops a learnable curvature method for hyperbolic graph embedding. Besides, the works in (Gu et al., 2019; Zhu et al., 2020) propose to learn representations over multiple geometries. The superior performance brought by hyperbolic geometry on static graphs motivates us to explore it on temporal graphs.
3. Preliminary and Background
In this section, we first present the problem formulation of temporal graph embedding and introduce the widely used graph recurrent neural networks framework. Then, we introduce some fundamentals of hyperbolic geometry.
3.1. Problem Formulation
In this work, we focus on discrete-time temporal graph embedding. A discrete-time temporal graph (Aggarwal and Subbian, 2014; Skarding et al., 2020) is composed of a series of snapshots $\{\mathcal{G}_1, \ldots, \mathcal{G}_t, \ldots, \mathcal{G}_T\}$ sampled from a temporal graph $\mathcal{G}$, where $T$ is the number of snapshots. Each snapshot $\mathcal{G}_t = (V_t, A_t)$ contains the current node set $V_t$ and the corresponding adjacency matrix $A_t$. As time evolves, nodes may appear or disappear, and edges can be added or deleted. Graph embedding aims to learn a mapping function that obtains a low-dimensional representation given the snapshots up to timestamp $t$. A general learning framework can be written as:
(1) $H_t = f\big(g(A_t, X_t),\, H_{t-1}\big)$
where $X_t$ is the initial node features or attributes and $H_{t-1}$ is the latest historical state. This learning paradigm is widely used in discrete-time temporal graph embedding (Seo et al., 2018; Hajiramezanali et al., 2019; Zhao et al., 2019), where $g(\cdot)$ is a graph neural network, e.g., GCN (Kipf and Welling, 2017), aimed at modeling structural dependencies, and $f(\cdot)$ is a recurrent network, e.g., GRU (Cho et al., 2014), used to capture the evolving regularities.
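As an illustration of this recurrent paradigm, the following minimal sketch runs a graph layer followed by a recurrent update over a list of snapshots, i.e., $H_t = f(g(A_t, X_t), H_{t-1})$ with $g$ the graph layer and $f$ the recurrent update. The function names (`gnn`, `rnn`, `embed_snapshots`), the mean-aggregation layer standing in for GCN, the plain tanh recurrence standing in for GRU, and the fixed node set across snapshots are all simplifying assumptions made for the example.

```python
import numpy as np

def gnn(A, X, W):
    """g: one mean-aggregation graph layer (a simple stand-in for GCN)."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalise
    return np.tanh(A_norm @ X @ W)

def rnn(Z, H_prev, U):
    """f: a plain recurrent update (a simple stand-in for GRU)."""
    return np.tanh(Z + H_prev @ U)

def embed_snapshots(snapshots, W, U):
    """Apply H_t = f(g(A_t, X_t), H_{t-1}) to a list of (A_t, X_t) pairs."""
    n, d = snapshots[0][1].shape[0], W.shape[1]
    H = np.zeros((n, d))            # initial historical state
    states = []
    for A, X in snapshots:
        H = rnn(gnn(A, X, W), H, U)
        states.append(H)
    return states
```

The returned list holds one embedding matrix per snapshot; the last entry summarises the whole observed history.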
3.2. Hyperbolic Geometry
A Riemannian manifold $(\mathcal{M}, g)$, the central object of Riemannian geometry (a branch of differential geometry), is a smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$. For each point $x$ in $\mathcal{M}$, there is a tangent space $T_x\mathcal{M}$ that serves as the first-order approximation of $\mathcal{M}$ around $x$; it is a $d$-dimensional vector space (see Figure 2).
There are multiple equivalent models for hyperbolic space. We here adopt the Poincaré ball model, a compact representation that is convenient for visualizing and interpreting hyperbolic embeddings. The Poincaré ball model with negative curvature $-c$ ($c > 0$) corresponds to the Riemannian manifold $(\mathbb{H}^{d,c}, g^{\mathbb{H}})$, where $\mathbb{H}^{d,c} = \{x \in \mathbb{R}^d : c\|x\|^2 < 1\}$ is an open $d$-dimensional ball. If $c = 0$, it degrades to Euclidean space, i.e., $\mathbb{H}^{d,c} = \mathbb{R}^d$. In addition, Ganea et al. (2018) show how Euclidean and hyperbolic spaces can be continuously deformed into each other and provide a principled manner for basic operations (e.g., addition and multiplication) as well as essential functions (e.g., linear maps and softmax layer) in the context of neural networks and deep learning.
4. Methodology
The overall framework of the proposed Hyperbolic Temporal Graph Network (HTGN) is illustrated in Figure 3. As sketched in Figure 3(a), HTGN is a recurrent learning paradigm and falls into the prevalent discrete-time temporal graph architecture formulated by equation (1). An HTGN unit, shown in Figure 3(b), mainly consists of three components: (1) HGNN, the graph neural network to extract topological dependencies in hyperbolic space; (2) HTA module, an attention mechanism based on the hyperbolic proximity to obtain the attentive hidden state; (3) HGRU, the hyperbolic temporal recurrent module to capture the sequential patterns. Furthermore, we propose a hyperbolic temporal consistency constraint, denoted as HTC, to ensure stability and smoothness. We elaborate on the details of each module in the following paragraphs. For the sake of brevity, the timestamp $t$ is omitted in Section 4.1 and Section 4.2.
4.1. Feature Map
Before going into the details of each module, we first introduce two bijection operations, the exponential map and the logarithmic map, for mapping between hyperbolic space and tangent space with a local reference point (Liu et al., 2019a; Chami et al., 2019), as presented below.
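Proposition 1.
For points $x, y$ of the Poincaré ball $\mathbb{H}^{d,c}$ and a tangent vector $v \in T_x\mathbb{H}^{d,c}$, with $v \neq 0$ and $y \neq x$, the exponential map is formulated as:
(2) $\exp_x^c(v) = x \oplus_c \left( \tanh\left( \frac{\sqrt{c}\,\lambda_x^c \|v\|}{2} \right) \frac{v}{\sqrt{c}\,\|v\|} \right)$
where $\lambda_x^c = \frac{2}{1 - c\|x\|^2}$ is the conformal factor, and $\oplus_c$ is the Möbius addition, for any $x, y \in \mathbb{H}^{d,c}$:
(3) $x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}$
The logarithmic map is given by:
(4) $\log_x^c(y) = \frac{2}{\sqrt{c}\,\lambda_x^c} \operatorname{artanh}\left( \sqrt{c}\,\| -x \oplus_c y \| \right) \frac{-x \oplus_c y}{\| -x \oplus_c y \|}$
Note that $x$ is a local reference point; we use the origin $o$ in our work.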
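The exponential map, the Möbius addition, and the logarithmic map on the Poincaré ball can be sketched numerically as follows. This is a minimal NumPy illustration for single vectors that uses the standard closed forms from Ganea et al. (2018); the function names are chosen for this example, and no numerical safeguards (e.g., clipping near the boundary of the ball) are included.

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition x (+)_c y on the Poincare ball of curvature -c."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def conformal_factor(x, c):
    """lambda_x^c = 2 / (1 - c ||x||^2)."""
    return 2.0 / (1.0 - c * np.dot(x, x))

def expmap(x, v, c):
    """Map a nonzero tangent vector v at x onto the ball."""
    sc, nv = np.sqrt(c), np.linalg.norm(v)
    step = np.tanh(sc * conformal_factor(x, c) * nv / 2) * v / (sc * nv)
    return mobius_add(x, step, c)

def logmap(x, y, c):
    """Map a ball point y (y != x) to the tangent space at x."""
    sc = np.sqrt(c)
    u = mobius_add(-x, y, c)
    nu = np.linalg.norm(u)
    return 2 / (sc * conformal_factor(x, c)) * np.arctanh(sc * nu) * u / nu
```

Because the two maps are inverse to each other, `logmap(x, expmap(x, v, c), c)` recovers `v` up to floating-point error, which is a convenient sanity check for any implementation.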
4.2. Hyperbolic Graph Neural Network (HGNN)
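HGNN is employed to learn topological dependencies in the temporal graph, leveraging the promising properties of hyperbolic geometry. Analogous to a GNN, an HGNN layer includes three key operations: hyperbolic linear transformation, hyperbolic aggregation, and hyperbolic activation. Given a Euclidean vector $x_i^E \in \mathbb{R}^d$, we regard it as a point in the tangent space $T_o\mathbb{H}^{d,c}$ at the reference point $o \in \mathbb{H}^{d,c}$ and use the exponential map to project it into hyperbolic space, obtaining $x_i^{\mathcal{H}} \in \mathbb{H}^{d,c}$:
(5) $x_i^{\mathcal{H}} = \exp_o^c(x_i^E)$
Then the update rule for one HGNN layer is expressed as:
(6a) $m_i^{\mathcal{H}} = W \otimes^c x_i^{\mathcal{H}} \oplus^c b$
(6b) $\tilde{m}_i^{\mathcal{H}} = \exp_o^c\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \log_o^c(m_j^{\mathcal{H}}) \Big)$
(6c) $\tilde{x}_i^{\mathcal{H}} = \exp_o^c\big( \sigma( \log_o^c(\tilde{m}_i^{\mathcal{H}}) ) \big)$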
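Hyperbolic Linear Transformation (6a).
The hyperbolic linear transformation contains both vector multiplication and bias addition, which cannot be applied directly since the operations in hyperbolic space fail to meet the permutation-invariance requirements. Therefore, for vector multiplication, we first project the hyperbolic vector to the tangent space and apply the operation there, which is given by:
(7) $W \otimes^c x^{\mathcal{H}} := \exp_o^c\big( W \log_o^c(x^{\mathcal{H}}) \big)$
For bias addition, we parallel-transport a bias $b$ located in $T_o\mathbb{H}^{d,c}$ to the tangent space $T_{x^{\mathcal{H}}}\mathbb{H}^{d,c}$ at the position of interest. Then, we use $\exp_{x^{\mathcal{H}}}^c$ to map it back to hyperbolic space:
(8) $x^{\mathcal{H}} \oplus^c b := \exp_{x^{\mathcal{H}}}^c\big( P^c_{o \to x^{\mathcal{H}}}(b) \big)$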
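A minimal sketch of the hyperbolic linear transformation for a single vector is given below. It relies on one identity that is assumed here rather than derived in the text: on the Poincaré ball, parallel transport from the origin rescales a tangent vector by $\lambda_o^c / \lambda_x^c$ (Ganea et al., 2018), so the exponential map at $x$ of the transported bias reduces to $x \oplus_c \exp_o^c(b)$. The function names are illustrative.

```python
import numpy as np

def expmap0(v, c):
    """Exponential map at the origin."""
    sc, nv = np.sqrt(c), np.linalg.norm(v)
    return v if nv == 0 else np.tanh(sc * nv) * v / (sc * nv)

def logmap0(y, c):
    """Logarithmic map at the origin."""
    sc, ny = np.sqrt(c), np.linalg.norm(y)
    return y if ny == 0 else np.arctanh(sc * ny) * y / (sc * ny)

def mobius_add(x, y, c):
    """Mobius addition on the Poincare ball of curvature -c."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def hyp_linear(x, W, b, c):
    """Hyperbolic linear layer: matrix product, then bias addition."""
    # Matrix product: go to the tangent space at the origin, multiply, map back.
    mx = expmap0(W @ logmap0(x, c), c)
    # Bias: exp at mx of the transported bias equals mx (+)_c expmap0(b)
    # (assumed identity, see the lead-in).
    return mobius_add(mx, expmap0(b, c), c)
```

With an identity weight matrix and a zero bias the layer leaves its input unchanged, which makes for a quick correctness check.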
Hyperbolic Aggregation (6b). The (weighted) mean operation is necessary to perform aggregation in a Euclidean graph neural network. An analog of mean aggregation in hyperbolic space is the Fréchet mean (Fréchet, 1948); however, it is difficult to apply because it lacks a closed form from which the derivative can be computed easily (Bacák, 2014). Similar to (Chami et al., 2019; Liu et al., 2019a; Zhang et al., 2019), we perform the aggregation in the tangent space. We adopt attention-based aggregation, which is formulated as:
(9) $\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j' \in \mathcal{N}(i)} \exp(s_{ij'})}, \quad s_{ij} = \operatorname{LeakyReLU}\big( a^{\top} [\log_o^c(m_i^{\mathcal{H}}) \,\|\, \log_o^c(m_j^{\mathcal{H}})] \big)$
where $a$ is a trainable vector, $\|$ denotes the concatenation operation, and $\alpha_{ij}$ indicates the correlation between the neighbor $j$ and the center node $i$. Besides, we also consider the Laplacian-based method (Chami et al., 2019), that is, $\alpha_{ij} = 1/\sqrt{\hat{d}_i \hat{d}_j}$, where $\hat{d}_i$ and $\hat{d}_j$ are the degrees of node $i$ and node $j$.
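The tangent-space attention aggregation can be sketched as follows. This is a minimal NumPy illustration: the LeakyReLU slope of 0.2, the requirement that the adjacency matrix contains self-loops (so that every row has at least one neighbor), and the function names are assumptions made for the example.

```python
import numpy as np

def expmap0(V, c):
    """Row-wise exponential map at the origin."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(V, axis=-1, keepdims=True), 1e-15)
    return np.tanh(sc * n) * V / (sc * n)

def logmap0(Y, c):
    """Row-wise logarithmic map at the origin."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(Y, axis=-1, keepdims=True), 1e-15)
    return np.arctanh(np.minimum(sc * n, 1 - 1e-7)) * Y / (sc * n)

def hyp_attention_aggregate(M, A, a, c):
    """M: (N, d) ball points; A: (N, N) adjacency with self-loops; a: (2d,)."""
    T = logmap0(M, c)                       # tangent-space features
    d = T.shape[1]
    # a^T [t_i || t_j] splits into a[:d].t_i + a[d:].t_j
    s = (T @ a[:d])[:, None] + (T @ a[d:])[None, :]
    s = np.where(s > 0, s, 0.2 * s)         # LeakyReLU (slope assumed)
    s = np.where(A > 0, s, -np.inf)         # keep neighbours only
    s = s - s.max(axis=1, keepdims=True)    # stabilise the softmax
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True)    # attention weights alpha_ij
    return expmap0(w @ T, c)                # aggregate, then map back
```

Setting `a` to zero yields uniform attention, in which case the layer reduces to mean aggregation in the tangent space.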
4.3. Hyperbolic Temporal Attention (HTA)
Historical information plays an indispensable role in temporal graph modeling, since it helps the model learn evolving patterns and regularities. Although the latest hidden state $H_{t-1}$ obtained by the recurrent neural network already carries historical information before time $t$, some discriminative content may still be underemphasized because of the monotonic mechanism of RNNs, whereby temporal dependencies decay along the time span (Liu et al., 2017).
Inspired by (Cui et al., 2019), our proposed HTA generalizes $H_{t-1}$ to the latest $w$ snapshots $\{H_{t-w}, \ldots, H_{t-1}\}$, attending to multiple historical latent states to obtain a more informative hidden state. The procedure is illustrated in Algorithm 1. Specifically, we first project the historical states in the state memory bank into the tangent space and concatenate them. Then, a learnable weight matrix $Q$ and vector $r$ are utilized to extract contextual information, where $Q$ weights the node importance in each historical state and $r$ determines the weights across time windows.
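Since Algorithm 1 is not reproduced here, the following is only a hedged sketch of one way such a temporal attention over a state memory bank could look; it should not be read as the exact HTA procedure. The shapes of the matrix `Q` (applied inside each historical state) and the vector `r` (one logit per time window), the tanh nonlinearity, and the softmax over `r` are all assumptions made for the example.

```python
import numpy as np

def expmap0(V, c):
    """Row-wise exponential map at the origin."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(V, axis=-1, keepdims=True), 1e-15)
    return np.tanh(sc * n) * V / (sc * n)

def logmap0(Y, c):
    """Row-wise logarithmic map at the origin."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(Y, axis=-1, keepdims=True), 1e-15)
    return np.arctanh(np.minimum(sc * n, 1 - 1e-7)) * Y / (sc * n)

def temporal_attention(memory, Q, r, c):
    """memory: list of w arrays (N, d) on the ball, oldest first.
    Q: (d, d) weights inside each state; r: (w,) logits over the window.
    Both shapes are assumptions of this sketch."""
    T = np.stack([logmap0(H, c) for H in memory], axis=1)  # (N, w, d)
    S = np.tanh(T @ Q)                                     # per-state context
    weights = np.exp(r - r.max())
    weights = weights / weights.sum()                      # softmax over window
    H_att = np.einsum("k,nkd->nd", weights, S)             # weighted sum over time
    return expmap0(H_att, c)                               # back to the ball
```

When `r` puts almost all of its mass on the most recent state, the output depends on that state alone, which corresponds to the usual recurrent setting without a memory bank.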
4.4. Hyperbolic Gated Recurrent Unit (HGRU)
GRU (Cho et al., 2014), a variant of LSTM (Hochreiter and Schmidhuber, 1997), is used in this work to incorporate the current and historical node states. Similar to LSTM, the GRU adopts gating units to modulate the flow of information but dispenses with the separate memory cell. Note that our HGRU is performed in the tangent space due to its computational efficiency. (A HyperGRU defined by Ganea et al. (2018) is also applicable in our framework; however, we found experimentally that the proposed method built in the tangent space obtains similar performance but is more efficient for large-scale data.)
HGRU receives the sequential input $\tilde{X}_t^{\mathcal{H}}$ from HGNN and the attentive hidden state $\tilde{H}_{t-1}^{\mathcal{H}}$ obtained from HTA as the input, and we denote $H_t^{\mathcal{H}}$ as the output. The dataflow in the HGRU unit is characterized by the following equations:
(10a) $X_t^E = \log_o^c(\tilde{X}_t^{\mathcal{H}})$
(10b) $H_{t-1}^E = \log_o^c(\tilde{H}_{t-1}^{\mathcal{H}})$
(10c) $P_t^E = \sigma(W_z X_t^E + U_z H_{t-1}^E)$
(10d) $R_t^E = \sigma(W_r X_t^E + U_r H_{t-1}^E)$
(10e) $\tilde{H}_t^E = \tanh\big(W_h X_t^E + U_h (R_t^E \odot H_{t-1}^E)\big)$
(10f) $H_t^E = (1 - P_t^E) \odot \tilde{H}_t^E + P_t^E \odot H_{t-1}^E$
(10g) $H_t^{\mathcal{H}} = \exp_o^c(H_t^E)$
where $W_z, W_r, W_h, U_z, U_r, U_h$ are the trainable weight matrices, $P_t^E$ is the update gate that controls the output, and $R_t^E$ is the reset gate that balances the input and memory. As the GRU is built in the tangent space, logarithmic maps are needed (equations (10a), (10b)). Then, we feed the states into the GRU (equations (10c) to (10f)) and map the hidden state back to hyperbolic space (equation (10g)). As we can see, the final $H_t^{\mathcal{H}}$ fuses structural, content, and temporal information.
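The dataflow described above (logarithmic maps, a standard GRU update in the tangent space, and an exponential map back to the ball) can be sketched as follows. This is a minimal NumPy illustration without bias terms; the parameter dictionary keys and the row-vector convention (`X @ W`) are assumptions made for the example.

```python
import numpy as np

def expmap0(V, c):
    """Row-wise exponential map at the origin."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(V, axis=-1, keepdims=True), 1e-15)
    return np.tanh(sc * n) * V / (sc * n)

def logmap0(Y, c):
    """Row-wise logarithmic map at the origin."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(Y, axis=-1, keepdims=True), 1e-15)
    return np.arctanh(np.minimum(sc * n, 1 - 1e-7)) * Y / (sc * n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hgru_cell(X_hyp, H_prev_hyp, p, c):
    """One tangent-space GRU step; p holds Wz, Uz, Wr, Ur, Wh, Uh (d x d)."""
    X = logmap0(X_hyp, c)                        # input to tangent space
    H = logmap0(H_prev_hyp, c)                   # hidden state to tangent space
    P = sigmoid(X @ p["Wz"] + H @ p["Uz"])       # update gate
    R = sigmoid(X @ p["Wr"] + H @ p["Ur"])       # reset gate
    H_cand = np.tanh(X @ p["Wh"] + (R * H) @ p["Uh"])  # candidate state
    H_new = (1 - P) * H_cand + P * H             # gated interpolation
    return expmap0(H_new, c)                     # back to the ball
```

With all weights set to zero, the update gate equals 0.5 and the candidate state vanishes, so the cell simply halves the previous hidden state in the tangent space; this gives a simple check of the gating logic.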
4.5. Proposed Learning Algorithm
Uniting the above modules, we have the overall learning procedure summarized in Algorithm 2. In line 5, we also consider including the historical state as an input, as in (Hajiramezanali et al., 2019); for brevity, we omit it here.
Note that we design the objective function from two aspects: temporal evolution and topological learning, corresponding to the following hyperbolic temporal consistency loss and hyperbolic homophily loss.
4.5.1. Hyperbolic Temporal Consistency Loss
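From the time perspective, intuitively, the embedding position in the latent space changes gradually over time, which ensures stability and generalization. We thus pose a hyperbolic temporal consistency constraint on two consecutive snapshots $H_{t-1}^{\mathcal{H}}$, $H_t^{\mathcal{H}}$ to give the representation a certain temporal smoothness and long-term prediction ability, which is defined as:
(11) $\mathcal{L}_{c,t} = \frac{1}{N} \sum_{i=1}^{N} d_{\mathbb{H}}^c\big( h_{i,t-1}^{\mathcal{H}},\, h_{i,t}^{\mathcal{H}} \big)$
where $h_{i,t}^{\mathcal{H}}$ is the embedding of node $i$ in $H_t^{\mathcal{H}}$, the subscript $t$ denotes that the loss is with respect to time snapshot $t$, and $d_{\mathbb{H}}^c(\cdot, \cdot)$ is the distance between two points $x, y \in \mathbb{H}^{d,c}$:
(12) $d_{\mathbb{H}}^c(x, y) = \frac{2}{\sqrt{c}} \operatorname{artanh}\big( \sqrt{c}\,\| -x \oplus_c y \| \big)$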
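The Poincaré distance and a temporal consistency penalty built on it can be sketched as follows. The distance uses the standard closed form $d(x, y) = \frac{2}{\sqrt{c}} \operatorname{artanh}(\sqrt{c}\,\| -x \oplus_c y \|)$; averaging over nodes (rather than summing) and the function names are assumptions made for this example.

```python
import numpy as np

def mobius_add(X, Y, c):
    """Row-wise Mobius addition on the Poincare ball of curvature -c."""
    xy = np.sum(X * Y, axis=-1, keepdims=True)
    x2 = np.sum(X * X, axis=-1, keepdims=True)
    y2 = np.sum(Y * Y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * X + (1 - c * x2) * Y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def poincare_dist(X, Y, c):
    """Row-wise hyperbolic distance between matching rows of X and Y."""
    sc = np.sqrt(c)
    n = np.linalg.norm(mobius_add(-X, Y, c), axis=-1)
    return 2 / sc * np.arctanh(np.minimum(sc * n, 1 - 1e-7))

def htc_loss(H_prev, H_curr, c):
    """Mean distance moved by each node between two consecutive snapshots."""
    return poincare_dist(H_prev, H_curr, c).mean()
```

For unit curvature, the distance from the origin to a point of Euclidean norm 0.5 is $2\operatorname{artanh}(0.5) = \ln 3$, which is a handy reference value when testing.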
4.5.2. Hyperbolic Homophily Loss
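Graph homophily, i.e., the tendency of linked nodes to belong to the same class or have similar attributes, is a property shared by many real-world networks. The hyperbolic homophily loss $\mathcal{L}_{h,t}$ aims to maximize the probability of linked nodes through the hyperbolic feature and to minimize the probability of nodes that are not connected. $\mathcal{L}_{h,t}$ is based on cross-entropy, where the probability is inferred by the Fermi-Dirac function (Krioukov et al., 2010; Chami et al., 2019), which is formulated as:
(13) $p_F(x_i^{\mathcal{H}}, x_j^{\mathcal{H}}) = \frac{1}{\exp\big( (d_{\mathbb{H}}^c(x_i^{\mathcal{H}}, x_j^{\mathcal{H}}) - r)/s \big) + 1}$
where $r$ is the radius of $x_i^{\mathcal{H}}$, so points within that radius may have an edge with $x_i^{\mathcal{H}}$, and $s$ specifies the steepness of the logistic function. Then, $\mathcal{L}_{h,t}$ is given by:
(14) $\mathcal{L}_{h,t} = -\frac{1}{|\mathcal{E}_t|} \sum_{(i,j) \in \mathcal{E}_t} \log p_F(x_{i,t}^{\mathcal{H}}, x_{j,t}^{\mathcal{H}}) - \frac{1}{|\tilde{\mathcal{E}}_t|} \sum_{(i',j') \in \tilde{\mathcal{E}}_t} \log\big(1 - p_F(x_{i',t}^{\mathcal{H}}, x_{j',t}^{\mathcal{H}})\big)$
Here $\mathcal{E}_t$ is the set of edges at timestamp $t$ and $\tilde{\mathcal{E}}_t$ is the set of sampled negative edges. To efficiently compute the loss and gradient, we sample the same number of negative edges as there are positive edges for each timestamp, i.e., $|\tilde{\mathcal{E}}_t| = |\mathcal{E}_t|$.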
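The Fermi-Dirac decoder and the resulting cross-entropy loss can be sketched directly on precomputed hyperbolic distances. In this minimal example the default values `r=2.0` and `s=1.0`, the small constant `eps` added inside the logarithms, and the function names are assumptions made for illustration.

```python
import numpy as np

def fermi_dirac(d, r=2.0, s=1.0):
    """Edge probability from a hyperbolic distance d: 1 / (exp((d - r)/s) + 1)."""
    return 1.0 / (np.exp((d - r) / s) + 1.0)

def homophily_loss(d_pos, d_neg, r=2.0, s=1.0, eps=1e-12):
    """Cross-entropy over positive edges and sampled negative edges.

    d_pos: distances between endpoints of observed edges.
    d_neg: distances between endpoints of sampled non-edges.
    """
    p_pos = fermi_dirac(d_pos, r, s)
    p_neg = fermi_dirac(d_neg, r, s)
    return -(np.log(p_pos + eps).mean() + np.log(1.0 - p_neg + eps).mean())
```

The probability equals 0.5 exactly at distance `r` and decays for larger distances, so the loss is small when linked nodes are close and non-linked nodes are far apart.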
4.5.3. The Unified Model
As temporal consistency and homophily regularity mutually drive the evolution of the temporal graphs, we set the final loss function as:
(15) $\mathcal{L} = \sum_{t=1}^{T} \big( \mathcal{L}_{h,t} + \lambda\, \mathcal{L}_{c,t} \big)$
where $\lambda$ is the hyperparameter that balances the temporal smoothness and homophily regularity.
Proposition 2 ().
Let be the number of nodes, be the number of training timestamps, be the neighbors of node , and be the number of links in timestamp . Then, minimizing the loss
is equivalent to (1) minimizing the hyperbolic distance of a node with its current and historically connected nodes, and maximizing with the sampled negative neighbors, which are weighted by ; (2) minimizing the distance between the same node over two consecutive timestamps, that is:
(16)  
Proof. See Appendix A.2.
Proposition 2 shows that our loss builds a message-passing connection within its neighbors from different times and local structures. In other words, if two nodes are not connected directly, they may still have implicit interactions through their common neighbors even at different times. This enables the representation to encode more patterns directly and indirectly, which is essential for high-quality representation and further prediction.
As we can see, the loss function depends only on distances in the Poincaré ball, and thus scales well to large-scale datasets. Given two points $x, y$ in the Poincaré ball, the gradient (Nickel and Kiela, 2017) of their distance is formulated, for unit curvature ($c = 1$), as:
(17) $\nabla_x d_{\mathbb{H}}(x, y) = \frac{4}{\beta \sqrt{\gamma^2 - 1}} \left( \frac{\|y\|^2 - 2\langle x, y\rangle + 1}{\alpha^2}\, x - \frac{y}{\alpha} \right)$
where $\alpha = 1 - \|x\|^2$, $\beta = 1 - \|y\|^2$, and $\gamma = 1 + \frac{2}{\alpha\beta}\|x - y\|^2$. The computational and memory complexity of one backpropagation step depends linearly on the embedding dimension.
Components  Time Complexity 

HTA  
HTC  
HGNN  
4.5.4. Complexity Analysis
We analyze the time complexity of the main components of the proposed HTGN model in each timestamp and present a summary in Table 1, where $N$ and $|\mathcal{E}|$ are the numbers of nodes and edges, $d$ and $d'$ are respectively the dimensions of the input and output features, and $w$ denotes the state memory length. Note that the above modules can be parallelized across all nodes and are computationally efficient. Furthermore, as we use a state memory bank of constant size, the extra storage cost is negligible. Numerical analysis of the scalability is presented in Section 6.2.
5. Experiments and analysis
In this section, we conduct extensive experiments with the aim of answering the following research questions (RQs):

RQ1. How does HTGN perform?

RQ2. What does each component of HTGN bring?
5.1. Experimental Setup
5.1.1. Datasets
To verify the generality of our proposed method, we choose a diverse set of networks for evaluation, including a disease-spreading network, DISEASE; academic co-author networks, HepPh and COLAB; a social network, FB; an email communication network, Enron; and an Internet router network, AS733, as recorded in Table 2. Notably, DISEASE is a synthetic dataset based on the SIR spreading model (Bjørnstad et al., 2002), which is also applicable to COVID-19 path prediction. We also list Gromov’s hyperbolicity $\delta$ (Narayan and Saniee, 2011; Jonckheere et al., 2008) and the average density $\rho$. Gromov’s hyperbolicity is a notion from group theory that measures the “tree-likeness” of metric spaces. The lower the hyperbolicity, the more tree-like the graph, with $\delta = 0$ denoting a pure tree. The average density is defined as the ratio of the number of edges to the number of all possible edges, describing how dense a graph is. Among these datasets, HepPh and COLAB are relatively dense and FB is highly sparse. More details about the datasets are presented in Appendix A.3.1.
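To make the two statistics concrete, the following sketch computes the density and the Gromov hyperbolicity of a small, connected, undirected graph given as an adjacency dictionary. It uses the exhaustive four-point condition, which is only practical for toy graphs; for graphs of realistic size the quantity is typically estimated on sampled quadruples. The function names are illustrative, and this is not necessarily the exact procedure used to produce the reported values.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest-path hop distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def gromov_delta(adj):
    """Four-point hyperbolicity of a connected graph (exhaustive, O(n^4))."""
    D = {u: bfs_distances(adj, u) for u in adj}
    delta = 0.0
    for x, y, z, w in combinations(list(adj), 4):
        sums = sorted([D[x][y] + D[z][w], D[x][z] + D[y][w], D[x][w] + D[y][z]])
        # Half the gap between the two largest pairwise-distance sums.
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta

def density(adj):
    """Number of edges divided by the number of possible node pairs."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2)
```

A star graph (a tree) yields a hyperbolicity of 0, whereas a 4-cycle yields 1, illustrating that lower values correspond to more tree-like structure.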
| Datasets | DISEASE | HepPh | FB | AS733 | Enron | COLAB |
|---|---|---|---|---|---|---|
| #Snapshots | 7 | 36 | 36 | 30 | 11 | 10 |
| #Test k | 3 | 6 | 3 | 10 | 3 | 3 |
| #Nodes | 2,665 | 15,330 | 45,435 | 6,628 | 184 | 315 |
| #Total Edges | 2,664 | 976,097 | 180,011 | 13,512 | 790 | 943 |
| Density | 0.41 | 1.37 | 0.04 | 0.2 | 3.37 | 0.94 |
| Hyperbolicity | 0.0 | 1.0 | 2.0 | 1.5 | 1.5 | 2.0 |

Density values are in units of 0.01.

A smaller hyperbolicity indicates that the dataset has a more evident hierarchical structure.
AUC  AP  
Dataset  DISEASE  HepPh  FB  AS733  Enron  COLAB  DISEASE  HepPh  FB  AS733  Enron  COLAB 
GAE  
VGAE  
EvolveGCN  
GRUGCN  
DySAT  
VGRNN  
HTGN (Ours)  
Gain (%)  +3.71  +9.98  +5.44  +3.12  +1.15  +2.30  +3.21  +4.25  +1.24  +1.75  +0.89  +1.67 
AUC  AP  
Dataset  DISEASE  HepPh  FB  AS733  Enron  COLAB  DISEASE  HepPh  FB  AS733  Enron  COLAB 
EvolveGCN  
GRUGCN  
DySAT  
VGRNN  
HTGN (Ours)  
Gain (%)  +3.71  +9.93  +5.82  +11.40  +3.20  +2.51  +3.21  +4.01  +0.78  +7.24  +2.5  +0.71 

In the DISEASE dataset, all edges in the test set are new, so the results of new link prediction are equivalent to the results of link prediction.
5.1.2. Baselines
We compare the performance of our proposed model against a diverse set of competing graph embedding methods. The first two are advanced static network embedding models: GAE and VGAE (https://github.com/tkipf/gae). GAE is simply composed of two-layer graph convolutions; VGAE additionally introduces variational variables. Compared with the graph models tailored for temporal graphs, the static models ignore the temporal regularity. We use all the edges in the training snapshots for training and the remaining ones as the test set. We moreover compare with GRUGCN, conceptually the same version as in (Seo et al., 2018) and also a basic architecture of temporal graph embedding models, to show the effectiveness of hyperbolic geometry and our proposed HTA module. More importantly, we also conduct experiments on several state-of-the-art temporal graph embedding models: EvolveGCN (Pareja et al., 2020), DySAT (Sankar et al., 2020), and VGRNN (Hajiramezanali et al., 2019), to further demonstrate the superiority of the proposed HTGN. There are two versions of EvolveGCN, EvolveGCN-O and EvolveGCN-H; we test both and report the best result.
5.1.3. Evaluation Tasks and Metrics
We obtain node representations from HTGN, which can be applied to various downstream tasks. In temporal graph embedding, link prediction is widely used for evaluation, as the addition or removal of edges over time drives the network evolution. Here, we use the Fermi-Dirac function defined in equation (13) to predict links between two nodes. Similar to VGRNN (Hajiramezanali et al., 2019), we evaluate our proposed models on two different tasks: dynamic link prediction and dynamic new link prediction. More specifically, given partially observed snapshots of a temporal graph $\{\mathcal{G}_1, \ldots, \mathcal{G}_t\}$, the dynamic link prediction task is to predict the links in the next snapshot $\mathcal{G}_{t+1}$ or in the next multi-step snapshots, and the dynamic new link prediction task is to predict new links in $\mathcal{G}_{t+1}$ that are not in $\mathcal{G}_t$.
Following the same setting as in VGRNN (Hajiramezanali et al., 2019), we choose the last $k$ snapshots as the test set and the rest of the snapshots as the training set. To thoroughly verify the effectiveness of the model, we select different lengths $k$ for testing, and the corresponding values are listed in Table 2. We test the models on their ability to correctly classify true and false edges by computing average precision (AP) and area under the ROC curve (AUC) scores. We treat all known edges in the test snapshots as true and sample the same number of non-links as false. Note that we uniformly train both the baselines and HTGN using early stopping based on the performance on the training set.
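For reference, the two metrics can be computed from the predicted scores of true edges and sampled non-links as in the following minimal NumPy sketch. The function names are illustrative; the AUC uses the pairwise (Mann-Whitney) formulation with ties counted as one half, and the AP breaks ties by input order, which may differ slightly from library implementations.

```python
import numpy as np

def auc_score(pos_scores, neg_scores):
    """Probability that a true edge is scored above a sampled non-link."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def average_precision(pos_scores, neg_scores):
    """Mean of the precision values at the rank of each true edge."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    order = np.argsort(-scores, kind="stable")     # highest score first
    labels = labels[order]
    precision_at_k = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float(precision_at_k[labels == 1].mean())
```

As a worked case, with true-edge scores (0.9, 0.8, 0.3) and non-link scores (0.7, 0.2, 0.1), eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9, and the precisions at the three true edges are 1, 1, and 3/4, so the AP is 11/12.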
5.2. Experimental Results (RQ1)
The code of HTGN is publicly available at https://github.com/marlincodes/HTGN.
We repeat each experiment five times and report the average value with the standard deviation on the test sets in Table 3 and Table 4, where the best results are in bold and the second-best results are in italics for each dataset. It is observed that HTGN consistently and significantly outperforms the competing methods on both tasks across all six datasets, demonstrating the effectiveness of the proposed method. The runners-up are the other temporal graph embedding methods, which confirms the importance of temporal regularity in temporal graph modeling. In the following, we discuss the results on link prediction and new link prediction, respectively.
5.2.1. Link Prediction
Table 3 shows the experiments on the link prediction task. In summary, HTGN outperforms the competing methods significantly in both AUC and AP scores, which shows that our proposed model has a better generalization ability to capture temporal trends. For instance, HTGN achieves an average gain of 4.28% in AUC compared to the best baseline. Predicting the evolution of very sparse graphs (e.g., FB) or long-term sequences (e.g., HepPh) is indeed a hard task. Notably, our proposed HTGN obtains remarkable gains on these datasets and successfully pushes the performance to a new level.
It is worth mentioning that all edges in the test set are new in the DISEASE dataset, which demands a stronger inductive learning ability from the model. Despite the difficulty, the proposed HTGN still outperforms the baselines by large margins and achieves notable results in both AUC and AP metrics.
5.2.2. New Link Prediction
New link prediction aims to predict the appearance of new links, which is more challenging. Note that static methods are not applicable to this task, as the sequential order is omitted in the learning procedure of a static method; GAE and VGAE are thus not evaluated. From Table 4, we make observations similar to those for the link prediction task, demonstrating the superiority of the proposed HTGN. Specifically, we notice that the performance of each method drops to a different degree compared to the corresponding link prediction task, while our HTGN model produces more consistent results. For instance, the performance of the baselines degrades dramatically on AS733 (e.g., the second-best, GRUGCN, drops from 94.64 to 83.14), whereas our HTGN declines by only about 2, which shows the strong inductive ability of the proposed HTGN.
5.3. Ablation Study (RQ2)
We further conduct an ablation study to validate the effectiveness of the main components of our proposed model. We name the HTGN variants as follows:

w/o HTA: HTGN without the temporal attention module, i.e., the HGRU unit directly takes the hidden state of the last timestamp as the input.

w/o HTC: HTGN without the hyperbolic temporal consistency, i.e., the model is trained by minimizing the hyperbolic homophily loss only.

w/o $\mathbb{H}$: HTGN without hyperbolic geometry, where all modules and learning processes are built in Euclidean space. Correspondingly, the HTA and HTC modules are converted to Euclidean versions.
We repeat each experiment five times and report the average AUC on the test set for the link prediction task, as shown in Table 5. Our overall observation is that removing any of the components causes performance degradation, which highlights the importance of each component. In the following, we first take a closer look at w/o HTC and w/o HTA. The discussion of w/o $\mathbb{H}$ is presented in Section 6.1.
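| Dataset | DISEASE | HepPh | FB | AS733 | Enron | DBLP |
|---|---|---|---|---|---|---|
| Hyperbolicity | 0.0 | 1.0 | 2.0 | 1.5 | 1.5 | 2.0 |
| Density | 0.41 | 1.37 | 0.04 | 0.2 | 3.37 | 0.94 |
| w/o HTC | | | | | | |
| Gain (%) | 14.6 | 19.44 | 4.97 | 3.87 | 1.39 | 4.81 |
| w/o HTA | | | | | | |
| Gain (%) | 7.35 | 0.29 | 0.88 | 0.77 | 0.71 | 0.74 |
| w/o $\mathbb{H}$ | | | | | | |
| Gain (%) | 21.35 | 9.36 | 3.94 | 3.82 | 2.05 | 6.01 |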
Benefit from the HTC module. The effect of the temporal consistency constraint is significant, as the performance drops sharply if the HTC module is removed. The model degradation is significant even for long-term prediction tasks (i.e., HepPh and AS733), which confirms that the HTC module helps the proposed model capture the high-level temporal smoothness of an evolving network and ensures more stable and generalized prediction performance.
Benefit from the HTA module. As observed from Table 5, the performance decays to different extents if the HTA module is removed. In particular, the degradation is the most severe on the DISEASE dataset, which includes node attributes. This confirms our assumption that HTA is able to collect the contextual information carried in the previous snapshots to further drive the learning of HTGN.
6. Discussion
In this section, we further analyze HTGN with the aim of answering the following research questions:

RQ3. What does hyperbolic geometry bring?

RQ4. How is the learning efficiency in large networks?
6.1. Merits of Hyperbolic Geometry (RQ3)
Hierarchical awareness. We remove the hyperbolic geometry and build the learning process in Euclidean space. The HTA and HTC modules are converted to the corresponding Euclidean versions. As shown in the w/o ℍ row of Table 5, the introduction of hyperbolic geometry significantly improves the performance. In particular, for the purely tree-like DISEASE dataset, removing the hyperbolic projection degrades the AUC by about 21.35%, and for another low-hyperbolicity dataset, HepPh, the deterioration is also significant, with an AUC drop of about 9%. This verifies that hyperbolic geometry naturally preserves the hierarchical layout in the graph data and helps produce high-quality node representations with smaller distortion.
Low-dimensional embedding. Benefiting from the exponential capacity of hyperbolic space, we can use lower-dimensional embeddings and still achieve notable performance. Taking the large dataset FB as an example, as shown in Figure 4, HTGN equipped with an 8-dimensional embedding space still outperforms the runner-up, VGRNN. With a 4-dimensional embedding, HTGN is comparable with the 16-dimensional VGRNN. This is another benefit that hyperbolic space brings to the temporal graph network, i.e., reducing the embedding dimension and the corresponding learnable parameters, which is valuable for embedding large-scale temporal graphs or deploying on low-memory/storage devices, e.g., mobile phones and UAV embedded units.
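As a rough back-of-the-envelope illustration of the storage argument, the snippet below estimates the size of a single node-embedding matrix for the FB graph (45,435 nodes) at the three dimensions mentioned above. The 4-byte (float32) storage per value and the helper name are assumptions; the estimate covers only one embedding matrix, not the full model.

```python
def embedding_megabytes(num_nodes, dim, bytes_per_value=4):
    """Size of a dense num_nodes x dim embedding matrix in megabytes (10^6 bytes)."""
    return num_nodes * dim * bytes_per_value / 1e6


fb_nodes = 45_435  # node count of FB as reported in the text
sizes = {dim: embedding_megabytes(fb_nodes, dim) for dim in (4, 8, 16)}

for dim, megabytes in sizes.items():
    print(f"dim={dim:2d}: {megabytes:.3f} MB")
```

The size scales linearly with the embedding dimension, so halving the dimension halves the storage for the embeddings.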
6.2. Running Time Comparison (RQ4)
In terms of efficiency, a theoretical analysis has been presented in Section 4.5.4. Here, we further numerically verify the scalability of HTGN by comparing its running time with the two second-best methods: VGRNN and DySAT. VGRNN is a GRNN-based method similar to HTGN, while DySAT uses attention to capture both the spatial and temporal dynamics. Figure 5 depicts the runtime per epoch of the three models on FB and HepPh, using a machine with an NVIDIA GeForce GTX TITAN X GPU and 8 CPU cores. FB is a large-scale network with 45,435 nodes, and HepPh has fewer nodes (15,330) but more links.
As observed, HTGN achieves substantially lower training times, e.g., the running time per epoch of HTGN is 1.5 seconds compared to 9.9 seconds for VGRNN and 34.2 seconds for DySAT. The main reason is that HTGN deploys a shared HGNN before feeding into the HGRU, while VGRNN utilizes different GNNs (footnote 5: https://github.com/VGraphRNN/VGRNN/blob/master/VGRNN_prediction.py). On the other hand, DySAT requires computing both temporal attention and structural attention, which is computationally heavy. For the denser HepPh network, the computing cost of DySAT, which uses a pure self-attentional architecture, increases dramatically, which demonstrates the efficiency of the recurrent learning paradigm in dynamic graph embedding.
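A per-epoch runtime comparison of this kind can be measured with a small harness like the one below. This is a generic sketch, not the measurement script used for Figure 5: `time_per_epoch`, its parameters, and the stand-in training function are hypothetical.

```python
import time


def time_per_epoch(train_one_epoch, num_epochs=5, warmup=1):
    """Average wall-clock seconds per call of train_one_epoch.

    Warm-up epochs are executed but not timed, so one-off setup costs
    (caching, lazy initialization) do not distort the average.
    Note: when timing GPU code, the device must be synchronized before each
    clock read, otherwise asynchronous kernels are not fully accounted for.
    """
    for _ in range(warmup):
        train_one_epoch()
    start = time.perf_counter()
    for _ in range(num_epochs):
        train_one_epoch()
    return (time.perf_counter() - start) / num_epochs


def dummy_epoch():
    """Stand-in for one training epoch of a real model."""
    sum(i * i for i in range(100_000))


print(f"{time_per_epoch(dummy_epoch):.4f} s/epoch")
```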
7. Conclusion
In this work, we introduce a novel hyperbolic geometry-based node representation learning framework, the hyperbolic temporal graph network (HTGN), for temporal network modeling. In general, HTGN follows the concise and effective GRNN framework but leverages the power of hyperbolic graph neural networks and facilitates hierarchical arrangement to capture the topological dependency. More specifically, we propose two novel modules, hyperbolic temporal contextual self-attention (HTA) and hyperbolic temporal consistency (HTC), which respectively extract attentive historical states and ensure stability and generalization. When evaluated on multiple real-world temporal graphs, our approach outperforms the state-of-the-art temporal graph embedding baselines by a large margin. For future work, we will generalize our method to more challenging tasks and explore continuous-time learning to incorporate fine-grained temporal variations.
Acknowledgements
This work is partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204) and CUHK 3133238, Research Sustainability of Major RGC Funding Schemes (RSFS). We would like to thank the anonymous reviewers for their constructive comments.
References
Evolutionary network analysis: a survey. ACM Computing Surveys (CSUR) 47 (1), pp. 1–36.
Computing medians and means in Hadamard spaces. SIAM Journal on Optimization 24 (3), pp. 1542–1566.
Dynamics of measles epidemics: estimating scaling of transmission rates using a time series SIR model. Ecological Monographs 72 (2), pp. 169–184.
Metric spaces of non-positive curvature. Vol. 319, Springer Science & Business Media.
Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
Hyperbolic graph convolutional neural networks. In NeurIPS, pp. 4868–4879.
Continuous-time dynamic graph learning via neural interaction processes. In CIKM, pp. 145–154.
Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
A hierarchical contextual attention-based network for sequential recommendation. Neurocomputing 358, pp. 141–149.
Temporal cycle-consistency learning. In CVPR, pp. 1801–1810.
Protein interface prediction using graph convolutional networks. In NeurIPS, pp. 6530–6539.
Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l’institut Henri Poincaré, Vol. 10, pp. 215–310.
Hyperbolic neural networks. In NeurIPS, pp. 5345–5355.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
Learning mixed-curvature representations in product spaces. In ICLR.
Hyperbolic attention networks. In ICLR.
Variational graph recurrent neural networks. In NeurIPS, pp. 10701–10711.
Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
Scaled Gromov hyperbolic graphs. Journal of Graph Theory 57 (2), pp. 157–180.
Variational graph auto-encoders. Bayesian Deep Learning Workshop (NIPS 2016).
Semi-supervised classification with graph convolutional networks. In ICLR.
Hyperbolic geometry of complex networks. Physical Review E 82 (3), pp. 036106.
Global context-aware attention LSTM networks for 3D action recognition. In CVPR, pp. 1647–1656.
Hyperbolic graph neural networks. In NeurIPS, pp. 8230–8241.
Characterizing and forecasting user engagement with in-app action graph: a case study of Snapchat. In KDD, pp. 2023–2031.
NIST digital library of mathematical functions. Annals of Mathematics and Artificial Intelligence 38 (1), pp. 105–119.
Large-scale curvature of networks. Physical Review E 84 (6), pp. 066108.
Continuous-time dynamic network embeddings. In WWW, pp. 969–976.
Poincaré embeddings for learning hierarchical representations. In NeurIPS, pp. 6338–6347.
Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, pp. 3779–3788.
EvolveGCN: evolving graph convolutional networks for dynamic graphs. In AAAI, pp. 5363–5370.
Representation tradeoffs for hyperbolic embeddings. In ICML, pp. 4460–4469.
DySAT: deep neural representation learning on dynamic graphs via self-attention networks. In WSDM, pp. 519–527.
Structured sequence modeling with graph convolutional recurrent networks. In ICONIP, pp. 362–373.
Foundations and modelling of dynamic networks using dynamic graph neural networks: a survey. arXiv preprint arXiv:2005.07496.
DyRep: learning representations over dynamic graphs. In ICLR.
Learning correspondence from the cycle-consistency of time. In CVPR, pp. 2566–2576.
FeatureNorm: L2 feature normalization for dynamic graph embedding. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 731–740.
Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pp. 4800–4810.
Hyperbolic graph attention network. In AAAI.
T-GCN: a temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems.
Graph geometry interaction learning. In NeurIPS, Vol. 33.
Appendix A Appendix
A.1. Geometric Intuitions of Hyperbolic Space
Riemannian manifolds with different curvatures define different geometries, where curvature measures how much a geometric object deviates from a flat plane, or in the case of a curve, deviates from a straight line. Unlike the well-known Euclidean geometry, which has zero curvature, hyperbolic space is a type of manifold with constant negative curvature and thus exhibits distinctive properties.
There exist multiple equivalent models for hyperbolic space. The most commonly used in the machine learning community are the Poincaré (disk) model and the Lorentz (hyperboloid) model. The Lorentz model is well-suited for Riemannian optimization, and the Poincaré disk provides a very intuitive way of visualizing and interpreting hyperbolic embeddings. We here use the Poincaré disk to illustrate some intuitions of hyperbolic space.
Figure 6 visualizes some geometric properties of the Poincaré disk. As shown on the left, the geodesic length between two points differs across Poincaré disks and depends on the curvature. When the radius decreases (i.e., the space bends more, or the absolute value of the curvature increases), the distance between two given points increases, and the geodesic passes closer to the origin. On the right, the two lines have the same hyperbolic length, although the one closer to the border appears shorter from our Euclidean view. Conversely, when two lines have the same “Euclidean” length, the one closer to the border is actually longer in hyperbolic space and can accommodate more. We further give some mathematical expressions to illustrate the exponentially increased capacity of hyperbolic space.
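The border effect described above can be checked numerically. The sketch below uses the standard closed-form distance of the unit Poincaré disk with curvature -1; this fixed curvature, the function name, and the sample points are assumptions chosen for illustration and are not taken from Figure 6.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincaré disk (curvature -1).

    Uses d(x, y) = arcosh(1 + 2 * |x - y|^2 / ((1 - |x|^2) * (1 - |y|^2))).
    Both points must lie strictly inside the unit disk.
    """
    def sq_norm(v):
        return sum(c * c for c in v)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Two segments with the same Euclidean length (0.1) at different positions.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_border = poincare_distance((0.8, 0.0), (0.9, 0.0))
print(f"near origin: {near_origin:.4f}, near border: {near_border:.4f}")
```

The segment near the border has the larger hyperbolic length, which matches the intuition that the region close to the boundary of the disk offers more room.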
According to (Lozier, 2003), the dimensional volume of a Euclidean ball of radius is:
(18) 
While the dimensional volume of a hyperbolic space of radius in dimensional hyperbolic space, referring to (Bridson and Haefliger, 2013), is given as:
(19) 
where is the volume of the tangent space centered on the origin of the Poincaré model of radius . In the 2dimensional case, we then have the explicit expression . For all , we have
(20) 
As concluded from equations (18) and (20), the volume of a ball in hyperbolic space grows exponentially with the radius, while its Euclidean counterpart grows polynomially. Meanwhile, the number of nodes in a tree also grows exponentially with the depth (e.g., the number of nodes in a perfect binary tree roughly doubles with each additional level). A hyperbolic space can thus be regarded as a continuous analogue of trees and can naturally model data with hierarchical structures or tree-like layouts.
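The growth rates contrasted above can be illustrated in the two-dimensional case. The sketch below uses the standard textbook formulas for the area of a Euclidean disk and of a geodesic disk in the hyperbolic plane of curvature -1; the notation may differ from equations (18) to (20), and the root-at-depth-0 convention for the tree is an assumption.

```python
import math


def euclidean_disk_area(r):
    """Area of a Euclidean disk of radius r (polynomial growth in r)."""
    return math.pi * r ** 2


def hyperbolic_disk_area(r):
    """Area of a geodesic disk of radius r in the hyperbolic plane of curvature -1.

    The closed form 2 * pi * (cosh(r) - 1) grows like pi * exp(r) for large r.
    """
    return 2.0 * math.pi * (math.cosh(r) - 1.0)


def perfect_binary_tree_nodes(depth):
    """Number of nodes of a perfect binary tree whose root is at depth 0."""
    return 2 ** (depth + 1) - 1


for r in (1, 2, 4, 8):
    print(
        f"r={r}: euclidean={euclidean_disk_area(r):.1f} "
        f"hyperbolic={hyperbolic_disk_area(r):.1f} "
        f"tree_nodes={perfect_binary_tree_nodes(r)}"
    )
```

For small radii the two areas nearly coincide, while for larger radii the hyperbolic area outpaces the Euclidean one, mirroring the exponential growth of the tree.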
A.2. Proof of Proposition 2
Proof.
First, recall the equation of loss function :
(21)  
The first term can be arranged as:
(22)  
Similarly, the second term can be arranged as:
(23)  
Then, we have:
(24)  
where . Note that we sample the same number of negative edges as there are positive ones, hence, . In our experiments, is set to . Adding up the loss of all timestamps, we have:
(25)  
Next, we center the above loss to each node. For each node, the constraint comes from two aspects: (1) temporal homophily loss, which minimizes the hyperbolic distance between the node and its positive neighbors in all timestamps, and maximizes the distance between the node and the sampled negative neighbors, where the weights are determined by ; (2) consistency constraint between the same node over two consecutive timestamps, that is:
(26)  
∎
A.3. Experiment Details
A.3.1. Data Processing
Most of the data comes with timestamps, and we split it into snapshots according to its real-world meaning. The details are as follows:
Enron (footnote 6: https://www.cs.cornell.edu/~arb/data/emailEnron/) is constructed from emails exchanged by Enron employees. The nodes represent the employees and the edges indicate the email interactions between them. We follow the same processing procedure as (Hajiramezanali et al., 2019) to obtain 10 snapshots. The network does not contain any node or edge features.
COLAB (footnote 7: https://github.com/VGraphRNN/VGRNN/tree/master/data) is an academic cooperation network covering the collaborations of researchers from 2000 to 2009. Each node represents an author, and an edge denotes a co-authorship relation. We split the dataset by year following (Hajiramezanali et al., 2019) and obtain 10 snapshots.
FB (footnote 8: http://networkrepository.com/iafacebookwallwosndir.php) is a social network graph of Facebook Wall posts, where each node is a user and each edge is an interaction related to their wall posts. We take the activities over the last three years in the dataset as snapshots. The FB dataset has a large number of users but very sparse connections.
HepPh (footnote 9: https://snap.stanford.edu/data/citHepPh.html) is a citation network related to high energy physics phenomenology, collected from the e-print arXiv website. Each node represents a paper, and an edge represents one paper citing another. The data covers papers from January 1993 to April 2003 (124 months in total). It is a directed graph, but we learn and predict as if it were undirected. According to its real-world meaning, we use three months of data per snapshot and use the last months as the full dataset in our work.
AS733 (footnote 10: https://snap.stanford.edu/data/as733.html) is an Internet router network collected from the University of Oregon Route Views Project. This dataset contains instances and spans the time from November 8, 1997, to January 2, 2000, with an interval of days. We split the snapshots per day and select the last snapshots to use in this work. It is worth noting that, unlike the citation networks, where nodes and edges only accumulate over time (i.e., no nodes or edges are deleted), the AS733 dataset also contains removals of nodes and edges over time.
DISEASE (footnote 11: https://github.com/HazyResearch/hgcn/tree/master/data/disease_lp) is a synthetic dataset based on the SIR disease spreading model (Bjørnstad et al., 2002), which is also applicable to modeling the spread of COVID-19. Each node represents a person, the node feature describes the symptoms of that person, and the edges indicate the spreading relationship. We split the dataset by the time at which nodes appear, and there are a large number of unobserved nodes in the test set.
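The snapshot construction described for these datasets (fixed-length time windows, optionally keeping only the most recent snapshots, and treating directed edges as undirected) can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code; the function name, the `(u, v, timestamp)` edge format, and the example values are assumptions.

```python
from collections import defaultdict


def split_into_snapshots(edges, window, keep_last=None):
    """Group timestamped edges into consecutive snapshots of equal duration.

    edges: iterable of (u, v, timestamp) with comparable node ids.
    window: snapshot length in the same unit as the timestamps.
    keep_last: if given, return only the most recent keep_last snapshots.
    Each edge is stored once as (min(u, v), max(u, v)), i.e., as undirected.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    edges = list(edges)
    if not edges:
        return []
    start = min(t for _, _, t in edges)
    buckets = defaultdict(set)
    for u, v, t in edges:
        index = int((t - start) // window)
        buckets[index].add((min(u, v), max(u, v)))
    # Keep empty windows as empty snapshots so the time axis stays regular.
    snapshots = [sorted(buckets.get(i, set())) for i in range(max(buckets) + 1)]
    return snapshots[-keep_last:] if keep_last else snapshots


# Hypothetical toy input: timestamps in months, three months per snapshot.
toy_edges = [(1, 2, 0), (2, 1, 1), (2, 3, 4), (3, 4, 7), (1, 4, 8)]
print(split_into_snapshots(toy_edges, window=3))
```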
A.3.2. Parameter Settings
Note that most of the benchmark datasets for dynamic graph embedding are only associated with topology. Since Enron and COLAB have a small number of nodes, we use the identity matrix as the node features, identical to the processing in (Hajiramezanali et al., 2019). For the other datasets, i.e., HepPh, FB, and AS733, the node features are initialized as dense 128-dimensional vectors using Glorot initialization (Glorot and Bengio, 2010) and are set as trainable matrices. The DISEASE dataset comes with node features, which we use directly. We set the final embedding dimension of all models to 16; different embedding sizes can lead to different results, but our method consistently achieves strong results. For a fair comparison, we set the number of recurrent layers to 1 for all models that contain a recurrent unit (e.g., RNN, GRU, LSTM, HGRU). In HTGN, we set the number of historical windows in HTA as for DISEASE and for the other datasets. We did not perform any heavy parameter tuning since our main goal is to verify whether HTA is an effective module in HTGN. The hyperparameters and in the Fermi-Dirac function are set as and , which is a common choice following (Chami et al., 2019).
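Two ingredients mentioned above, Glorot initialization of a trainable feature matrix and a Fermi-Dirac function that turns a distance into an edge probability, can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, the fan-in and fan-out are taken to be the two matrix dimensions, the defaults `r=2.0` and `t=1.0` are illustrative placeholders rather than the values used in the experiments, and whether the decoder receives a distance or a squared distance is left open.

```python
import math
import random


def glorot_uniform(num_rows, num_cols, seed=0):
    """Matrix with entries drawn from U(-limit, limit), limit = sqrt(6 / (rows + cols)).

    Assumption: fan-in and fan-out are taken to be the two matrix dimensions.
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (num_rows + num_cols))
    return [[rng.uniform(-limit, limit) for _ in range(num_cols)] for _ in range(num_rows)]


def fermi_dirac(dist, r=2.0, t=1.0):
    """Edge probability 1 / (exp((dist - r) / t) + 1).

    r shifts the decision boundary and t controls its sharpness.
    The defaults are illustrative placeholders only.
    """
    z = (dist - r) / t
    if z > 700.0:  # exp would overflow; the probability is effectively zero
        return 0.0
    return 1.0 / (math.exp(z) + 1.0)


features = glorot_uniform(5, 128)  # 5 hypothetical nodes, 128-dimensional features
print(len(features), len(features[0]))
print(fermi_dirac(0.5), fermi_dirac(2.0), fermi_dirac(6.0))
```

Smaller distances map to probabilities above 0.5 and larger distances to probabilities below 0.5, with the crossover at `dist == r`.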