PyTorch code for "Learning Temporal Attention in Dynamic Graphs with Bilinear Interactions"
Graphs evolving over time are a natural way to represent data in many domains, such as social networks, bioinformatics, physics and finance. Machine learning methods for graphs, which leverage such data for various prediction tasks, have seen a recent surge of interest and capability. In practice, ground truth edges between nodes in these graphs can be unknown or suboptimal, which hurts the quality of features propagated through the network. Building on recent progress in modeling temporal graphs and learning latent graphs, we extend two methods, Dynamic Representation (DyRep) and Neural Relational Inference (NRI), for the task of dynamic link prediction. We explore the effect of learning temporal attention edges using NRI without requiring the ground truth graph. In experiments on the Social Evolution dataset, we show semantic interpretability of learned attention, often outperforming the baseline DyRep model that uses a ground truth graph to compute attention. In addition, we consider functions acting on pairs of nodes, which are used to predict link or edge representations. We demonstrate that in all cases, our bilinear transformation is superior to feature concatenation, typically employed in prior work. Source code is available at https://github.com/uoguelph-mlrg/LDG.
A graph consists of a set of nodes and the edges between them. For example, a social network graph may consist of a set of people (nodes), and the edges may indicate whether two people are friends. Recently, graph neural networks (GNNs) (e.g., [5, 6, 7, 8, 9]) have emerged as a key modeling technique for learning representations of such data. These models use recursive neighborhood aggregation to learn latent features, $Z^{(l+1)}$, of nodes for some layer, $l+1$, given node features, $Z^{(l)}$, of the previous layer, $l$:
$$Z^{(l+1)} = \sigma\left(A Z^{(l)} W^{(l)}\right), \qquad (1)$$
where $A$ is the adjacency matrix, $W^{(l)}$ are trainable parameters, and $\sigma$ is an element-wise nonlinearity, such as ReLU. Different extensions of this method have demonstrated considerable success at tasks like graph/node classification and link prediction [11, 12, 13]. The local aggregation operator in (1) was derived from spectral graph convolution and motivated by the success of convolutional neural networks (CNNs) in dominating vision and audio tasks, such as image and speech recognition. However, CNNs are limited to Euclidean space, where translation is well-defined, while GNNs are more flexible and can also be applied to non-Euclidean data, such as graphs and sets.
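The neighborhood aggregation step described above can be illustrated with a minimal sketch. It assumes the common form ReLU(A Z W), where the adjacency matrix `adj` includes self-loops; the function names and the toy values are hypothetical and are not taken from the authors' code.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def aggregate(adj, feats, weights):
    """One aggregation layer: ReLU(adj @ feats @ weights)."""
    hidden = matmul(matmul(adj, feats), weights)
    return [[max(x, 0.0) for x in row] for row in hidden]


# Toy chain graph 0 - 1 - 2 with self-loops (assumed values).
adj = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0],
       [0.0, 1.0, 1.0]]
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]
weights = [[1.0, -1.0],
           [0.5, 1.0]]

out = aggregate(adj, feats, weights)
print(out)
```

Each output row mixes the features of a node with those of its neighbors, which is the sense in which features propagate along edges.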
The focus of GNNs thus far has been on static graphs, that is, graphs with a fixed structure. However, a key component of network analysis is often to predict the state of an evolving graph at a future time $t$. For example, knowledge of the evolution of person-to-person interactions during an epidemic facilitates analysis of how a disease spreads, and can be expressed in terms of the links between people in a dynamic graph. Other applications include predicting whether two people will become friends at time $t$, predicting locations of players (nodes) or some interaction between them in team sports, such as basketball or soccer [15, 16], and others.
Previous approaches for representation learning over dynamic graphs, such as DyRep, have assumed the entire dynamic graph structure is known (i.e., no edges are missing and there are no redundant connections). Methods for learning latent graphs, such as Neural Relational Inference (NRI), have focused primarily on the fixed graph setting, which does not support addition or deletion of nodes or the complex multimodal interactions between them.
Our approach simultaneously infers graph structure (as in NRI) while learning the dynamics of the graph via a GNN (as in DyRep). We explore the use of a learned representation of the underlying graph in lieu of the ground truth, inspired by NRI. We use a learned temporal attention matrix, which can also be interpreted as a graph, to drive graph dynamics. In this temporal attention matrix, we use a bilinear relationship instead of DyRep’s concatenation to permit more complex relationships between node representations. We then apply this method to the task of dynamic link prediction on the Social Evolution dataset.
Prior work [20, 21, 22, 23, 24, 16, 18] addressing the problem of learning from dynamic graphs has tended to develop methods that are very specific to the task at hand, with only a few shared ideas. This is primarily due to the difficulty of learning from temporal data in general and temporal graph-structured data in particular, which remains an open problem that we address in this work.
Given an evolving graph $G_t$, where $t$ is a discrete index at a continuous time point $\tau$, most prior work uses some variant of a recurrent neural network (RNN) to update node embeddings over time [25, 22, 24, 18, 16]. The weights of RNNs are typically shared across nodes [25, 22, 24, 18]; however, in the case of smaller graphs, a separate RNN per node can be learned to better tune the model. RNNs are then combined with some function that aggregates features of nodes. While this can be done without explicitly using GNNs (e.g., using the sum or average of all nodes instead), GNNs impose an inductive relational bias specific to the domain, which typically improves performance [22, 18, 24, 16].
Closely related to our work, there are a few applications where the graph is considered to be either unknown or suboptimal, so it is inferred simultaneously with learning the model [22, 26, 18]. Among them, [22, 26] focus on visual data, while NRI proposes a more general framework and, therefore, is adopted in this work.
Among prior methods for dynamic graphs, DyRep is particularly relevant to our setting because it:
supports two time scales of graph evolution (i.e., long-term connections to allow addition and removal of nodes and edges, and short-term connections to inform future changes to the graph);
operates in continuous time;
scales well to large graphs; and
is data-driven due to employing a GNN similar to (1).
These key advantages make DyRep favorable compared to the other methods discussed previously. In this work, we improve on DyRep by adding bilinear interactions, and we study the ability to learn a temporal attention matrix to inform connections using NRI simultaneously with the DyRep model.
Here we describe relevant details of the DyRep model. A complete description can be found in the original DyRep paper. DyRep is a representation framework for dynamic graphs, which assumes that graphs evolve according to two elementary processes:
Long-term association, in which nodes and edges are added or removed from the graph, affecting the evolving adjacency matrix $A$.
Communication, in which nodes communicate over a short time period, whether or not there is an edge between them.
For example, in a social network, association may be represented as one person adding another as a friend. A communication event may be an SMS message between two friends (an association edge exists) or an SMS message between people who are not friends yet (an association edge does not exist). These two processes interact to fully describe information transfer between nodes.
More formally, an event is a tuple $(u, v, \tau, k)$ of type $k \in \{0, 1\}$, with $k = 0$ corresponding to association and $k = 1$ to communication, between nodes $u$ and $v$ at continuous time point $\tau$ with time index $t$. This interaction can be expressed as three recursively executed functions, (2)-(4):
where $z_u^t$ and $z_v^t$ are the current $d$-dimensional embeddings of nodes $u$ and $v$; the embeddings on the right-hand side are node embeddings at the previous time; the functions in (2), (3a) and (3b) are based on neural networks; the function in (4) updates the temporal attention $S$ over edges in the graph and is implemented as a hard-coded algorithm in DyRep; $\mathcal{N}$ denotes the one-hop neighbors of the other node participating in the event at time $t$ (see details below following (5)); $A$ is the adjacency matrix at the previous time; $\lambda$ is the conditional intensity of events between nodes $u$ and $v$; and $[\cdot, \cdot]$ is the concatenation operator.
This formulation is similar to recurrent networks with relational inductive bias, e.g., [28, 24, 16], but here, association and communication events are modelled in continuous time through a two time-scale deep temporal point process. The conditional intensity function of the point process, (2), which models the frequency at which events occur between nodes, acts in conjunction with a deep network that computes node embeddings (3a), (3b). Together with temporal attention (4), these components create a learned representation of a dynamic graph that can be used for tasks like dynamic link prediction. The relationship between learned node representations in turn drives graph dynamics through the conditional intensity function.
To better understand DyRep, it is important to expand (3a) and (3b), which define the evolution of node embeddings. The embeddings of both nodes are updated in the same way, so below we only expand (3a). In particular, for an event between nodes $u$ and $v$, the embeddings of both nodes are updated based on the summation of the following three terms, followed by a nonlinearity (Equation (5)).
The first term of Equation (5) is the “Localized Embedding Propagation”, which consists of learned parameters multiplied by a learned hidden representation. This representation is a function of temporal attention between node $u$ and all its one-hop neighbors. Using features of node $u$’s neighbors to update node $v$’s embeddings is important for creating a temporal edge by which node features propagate between the two nodes. The idea is that the more frequently the nodes communicate, the more similar their embeddings will be, with a weak dependency on whether or not there is a long-term edge between them, induced by attention. The second term is the “Self-Propagation”, which consists of learned parameters multiplied by the previously computed embedding of node $v$, taken at the time index at which node $v$ last participated in an association or communication event. This term performs a recurrent update of the features of node $v$. The third term is the “Exogenous Drive”, which consists of learned parameters multiplied by the waiting time between the current event and the previous event involving node $v$. It captures other forces acting on the graph that may influence the embedding of node $v$, such as global events involving many nodes.
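The three-term update described above can be sketched as follows. This is only an illustration under stated assumptions: the parameter names `w_struct`, `w_rec` and `w_time` are hypothetical, the nonlinearity is assumed to be a sigmoid, and the attention-weighted neighborhood summary `h_struct` is taken as given rather than computed.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def update_embedding(h_struct, z_prev, dt, w_struct, w_rec, w_time):
    """Sum three terms, then apply an element-wise sigmoid.

    h_struct: neighborhood summary of the other node in the event
    z_prev:   embedding of this node at its previous event
    dt:       waiting time since this node's previous event
    """
    local = matvec(w_struct, h_struct)   # localized embedding propagation
    recurrent = matvec(w_rec, z_prev)    # self-propagation
    drive = [w * dt for w in w_time]     # exogenous drive
    return [1.0 / (1.0 + math.exp(-(a + b + c)))
            for a, b, c in zip(local, recurrent, drive)]


identity = [[1.0, 0.0], [0.0, 1.0]]
z_new = update_embedding(h_struct=[0.2, -0.1], z_prev=[0.5, -0.5], dt=3.0,
                         w_struct=identity, w_rec=identity,
                         w_time=[0.1, 0.0])
print(z_new)
```

Because the first term depends on the neighbors of the other node, repeated events between two nodes pull information from each node's neighborhood into the other's embedding.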
Additionally, the attention update function from Equation (4) merits further discussion, as it is a hard-coded attention layer that is computed pairwise. In particular, the temporal attention matrix $S$ is only updated if a communication event occurs ($k = 1$) and an association exists between the two nodes under consideration ($A_{uv} = 1$); otherwise it is left unchanged. This function is computed as a softmax over the attention given by node $u$ to its one-hop neighbors at time $t$. In this method, attention is only given to the node from the neighborhood of $u$ that contributes the most information. No learned parameters directly contribute to the computation of the temporal attention matrix, which limits information propagation.
In this paper, we extend the DyRep algorithm in two ways. First, we examine the benefits of a learned representation of the underlying graph in (4) by using a variational autoencoder, in the style of NRI. This permits learning of a sparse representation of the interactions between nodes instead of using a hard-coded function and known adjacency matrix (§ 4.1). Second, both the original DyRep work and NRI use concatenation to make predictions for a pair of nodes (see (2) above and (6) in the NRI paper), which only captures relatively simple relationships. We are interested in the effect of allowing a more expressive relationship to drive graph dynamics, and specifically to drive temporal attention (§§ 4.1 and 4.2). We demonstrate the utility of our model by applying it to the task of dynamic link prediction (§ 5).
Recently, Neural Relational Inference (NRI) was proposed, showing that in some settings, models that use a learned representation, in which the original graph structure is discarded, can outperform models that use the ground truth graph. A learned sparse graph representation would keep the most salient features, i.e., only those connections that are necessary for the downstream task, whereas the underlying graph may have redundant connections. For example, in a human motion capture dataset, such as that explored in the NRI work, the human body has connections from the hip to the knee, knee to ankle, etc. To predict how a person walks, the connection between, say, the foot and a specific toe may be unnecessary. In other applications, the underlying graph might be unknown, so by learning it, we can reveal structural interactions between nodes, which can improve on a fully-connected graph or other heuristic.
While NRI learns latent graph structure from observing node movement, we learn the graph by observing how nodes communicate. In this spirit, we repurpose the encoder of NRI, combining it with DyRep, which gives rise to our latent dynamic graph (LDG) model, described below in more detail (Fig. 1). We also summarize our notation in Table 1 at the end of this section.
DyRep’s encoder (4) requires a graph structure represented as an adjacency matrix $A$. We propose to replace this with a sequence of learnable functions (6), borrowed from NRI, that only require node embeddings as input.
Given an event between nodes $u$ and $v$, our encoder takes the embedding of each node at the previous time step as an input, and returns an edge embedding between nodes $u$ and $v$ using two passes of node and edge mappings, denoted by superscripts $1$ and $2$ (Fig. 2):
where the node and edge mappings in (7)-(10) are two-layer, fully-connected neural networks, as in NRI, and the weights of the “node to edge” mappings are trainable parameters implementing bilinear layers. In detail:
(7): transforms embeddings of all nodes in a graph;
(8): is a “node to edge” mapping that returns an edge embedding for all pairs of nodes;
(9): is an “edge to node” mapping that updates the embedding of each node based on all edges connected to it;
(10): is similar to the “node to edge” mapping in the first pass, (8), but only the edge embedding between nodes $u$ and $v$ involved in the event is used.
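The four steps (7)-(10) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: an element-wise tanh stands in for the two-layer networks, the “edge to node” step is assumed to be a plain sum over incoming edges, and the weight tensors `w1` and `w2` are hypothetical names.

```python
import math


def bilinear(w, x, y):
    """Bilinear map: output[o] = sum_ij x[i] * w[o][i][j] * y[j]."""
    return [sum(x[i] * w_o[i][j] * y[j]
                for i in range(len(x)) for j in range(len(y)))
            for w_o in w]


def two_pass_edge_embedding(z, u, v, w1, w2):
    """Return the second-pass edge embedding for the event pair (u, v)."""
    n = len(z)
    # (7) node transform; tanh is a stand-in for a small MLP
    h1 = [[math.tanh(x) for x in row] for row in z]
    # (8) node -> edge for every ordered pair of distinct nodes
    e1 = {(i, j): bilinear(w1, h1[i], h1[j])
          for i in range(n) for j in range(n) if i != j}
    # (9) edge -> node: each node sums the embeddings of its incoming edges
    width = len(w1)
    h2 = [[sum(e1[(i, j)][c] for i in range(n) if i != j)
           for c in range(width)]
          for j in range(n)]
    # (10) node -> edge, only for the pair involved in the event
    return bilinear(w2, h2[u], h2[v])


def ones(out_dim, in_dim):
    """All-ones weight tensor of shape [out_dim][in_dim][in_dim]."""
    return [[[1.0] * in_dim for _ in range(in_dim)] for _ in range(out_dim)]


z = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]]
edge = two_pass_edge_embedding(z, u=0, v=1, w1=ones(3, 2), w2=ones(2, 3))
print(edge)  # one logit per edge type
```

Changing the embedding of the third node changes the output for the pair (0, 1), because the second pass sees every node through step (9).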
The softmax function is applied to the edge embedding from (10), which yields the edge type posterior as in NRI:
where $S_{u,v}^t$ are temporal one-hot attention weights sampled from the multirelational conditional multinomial distribution over edge types, hereafter denoted as $q_\phi$ for brevity; $r$ is the number of edge types (note that DyRep uses a single edge type); and $\phi$ are parameters of the neural networks in (7)-(10). $S_{u,v}^t$ is then used to update node embeddings at the next time step, according to (3a) and (3b).
Replacing (4) with (6) means that it is not necessary to maintain an explicit representation of the ground-truth graph in the form of an adjacency matrix. The evolving graph structure is implicitly captured by $S^t$. While $S^t$ represents temporal attention weights between nodes, it can be thought of as a graph evolving over time, which we call a Latent Dynamic Graph (LDG). This graph, as we show in our experiments, can have a particular semantic interpretation.
Bilinear layers have proven to be advantageous in settings like Visual Question Answering, where multi-modal embeddings interact. In our case, they permit a richer interaction between embeddings of different nodes, so in (8) and (10), we replace NRI’s linear layers by computing a bilinear interaction, rather than concatenating features.
Table 1: Notation.
|Point in continuous time|
|Time index of the previous event involving node|
|Time point of the previous event involving node|
|Index of an arbitrary node in the graph|
|Index of a node involved in the event|
|Adjacency matrix at time|
|One-hop neighbourhood of node v|
|Node embeddings at time|
|Embedding of node at time|
|Learned hidden representation of node after the first pass|
|Learned hidden representation of an edge between nodes and after the first pass|
|Learned hidden representation of node after the second pass|
|Learned hidden representation of an edge between nodes and involved in the event after the second pass (Fig. 2)|
|Attention at time with multirelational edges|
|Attention between node and its one-hop neighbors at time|
|Attention between nodes and at time for all edge types|
|Conditional intensity of edges of type at time between nodes and|
|Trainable rate at which edges of type occur|
|Trainable compatibility of nodes and at time|
|Trainable interaction matrix between nodes and at time|
The two passes in (7)-(10) are important to ensure that attention values depend not only on the embeddings of nodes $u$ and $v$, $z_u^{t-1}$ and $z_v^{t-1}$, but also on how they interact with other nodes in the entire graph. With one pass, the values of $S^t$ would be predicted based only on local information, as only the previous node embeddings influence the new edge embeddings in the first pass (8).
Unlike DyRep, our temporal attention module has multiple edges between nodes, i.e., $S_{u,v}^t$ are one-hot vectors of length $r$. We therefore modify the “Localized Embedding Propagation” term in (5), such that features are computed for each edge type and parameters act on concatenated features from all edge types.
The conditional intensity function $\lambda_k^t(u, v)$ represents the instantaneous rate at which an event of type $k$ (i.e., association or communication) occurs between nodes $u$ and $v$ in an infinitesimally small time interval. DyRep formulates the conditional intensity as a softplus function of the concatenated learned node representations $[z_u^{t-1}, z_v^{t-1}]$:
$$\lambda_k^t(u, v) = \psi_k \log\left(1 + \exp\left(\frac{\omega_k^\top [z_u^{t-1}, z_v^{t-1}]}{\psi_k}\right)\right), \qquad (12)$$
where $\psi_k$ is the scalar trainable rate at which events of type $k$ occur, and $\omega_k$ is a trainable vector that represents the compatibility between nodes $u$ and $v$ at time $t$. We replace concatenation in (12) with bilinear interaction:
$$\lambda_k^t(u, v) = \psi_k \log\left(1 + \exp\left(\frac{(z_u^{t-1})^\top \Omega_k \, z_v^{t-1}}{\psi_k}\right)\right), \qquad (13)$$
where $\Omega_k$ are trainable parameters, to allow more complex interactions between evolving node embeddings. We use this interaction to inform the sampling step illustrated in Fig. 2, and in the likelihood during training.
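To make the difference between the two intensity forms concrete, the sketch below computes a scaled softplus of a concatenation-based score and of a bilinear score. It assumes the DyRep-style softplus, psi * log(1 + exp(g / psi)); the function names and toy weights are hypothetical.

```python
import math


def softplus_rate(g, psi):
    """Scaled softplus psi * log(1 + exp(g / psi)); positive for any finite g
    (very large g / psi would overflow math.exp in this naive form)."""
    return psi * math.log1p(math.exp(g / psi))


def score_concat(omega_vec, z_u, z_v):
    """Linear score on the concatenation [z_u, z_v]."""
    return sum(w * x for w, x in zip(omega_vec, z_u + z_v))


def score_bilinear(omega_mat, z_u, z_v):
    """Bilinear score z_u^T Omega z_v."""
    return sum(z_u[i] * omega_mat[i][j] * z_v[j]
               for i in range(len(z_u)) for j in range(len(z_v)))


z_a, z_b = [1.0, 0.0], [0.0, 1.0]
omega_vec = [1.0, 2.0, 3.0, 4.0]
omega_mat = [[1.0, 0.0], [0.0, 1.0]]

# The concatenation score splits into a term for u plus a term for v, so
# swapping partners between two pairs never changes the total.
concat_same = score_concat(omega_vec, z_a, z_a) + score_concat(omega_vec, z_b, z_b)
concat_swap = score_concat(omega_vec, z_a, z_b) + score_concat(omega_vec, z_b, z_a)

# The bilinear score can prefer matching pairs over mismatched ones.
bilin_same = score_bilinear(omega_mat, z_a, z_a) + score_bilinear(omega_mat, z_b, z_b)
bilin_swap = score_bilinear(omega_mat, z_a, z_b) + score_bilinear(omega_mat, z_b, z_a)

rate = softplus_rate(score_bilinear(omega_mat, z_a, z_a), psi=0.5)
print(concat_same, concat_swap, bilin_same, bilin_swap, rate)
```

The additive structure of the concatenation score is one way to read the claim that concatenation captures only relatively simple relationships between a pair of nodes.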
Given a minibatch with a sequence of events, we optimize the model by minimizing the following cost function:
$$\mathcal{L} = \mathcal{L}_{\text{events}} + \mathcal{L}_{\text{nonevents}} + \mathrm{KL}\left[q_\phi \,\|\, p_\theta\right],$$
where $\mathcal{L}_{\text{events}}$ is the total negative log of the intensity rate for all events between nodes $u$ and $v$ (i.e., all nodes that experience events in the minibatch); and $\mathcal{L}_{\text{nonevents}}$ is the total intensity rate of all nonevents between nodes $u$ and $v$ in the minibatch. Since the sum in the second term is combinatorially intractable in many applications, we sample a subset of nonevents according to the Monte Carlo method as in DyRep, and we follow their approach for setting the number of samples.
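The two likelihood terms can be sketched as below. This is a simplified illustration under stated assumptions: nonevents are drawn uniformly from all unobserved ordered pairs, which may differ from the sampling scheme used in DyRep, and all function names are hypothetical.

```python
import math
import random


def events_loss(event_intensities):
    """Negative log of the intensity, summed over observed events."""
    return -sum(math.log(lam) for lam in event_intensities)


def sample_nonevents(nodes, events, num_samples, rng):
    """Monte Carlo subset of ordered node pairs with no observed event."""
    observed = set(events)
    candidates = [(a, b) for a in nodes for b in nodes
                  if a != b and (a, b) not in observed]
    return rng.sample(candidates, min(num_samples, len(candidates)))


def nonevents_loss(intensity_fn, pairs):
    """Intensity summed over the sampled nonevent pairs."""
    return sum(intensity_fn(a, b) for a, b in pairs)


rng = random.Random(0)
nodes = [0, 1, 2]
events = [(0, 1)]
pairs = sample_nonevents(nodes, events, num_samples=10, rng=rng)
total = events_loss([2.0]) + nonevents_loss(lambda a, b: 0.5, pairs)
print(pairs, total)
```

Minimizing the first term raises the intensity of pairs that did interact, while the second term lowers it for pairs that did not.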
The first two terms, $\mathcal{L}_{\text{events}}$ and $\mathcal{L}_{\text{nonevents}}$, were proposed in DyRep and we use them to train our baseline models. The KL divergence term, adopted from NRI to train our LDG models, regularizes the model to align predicted and prior distributions of attention over edges. Here, the prior is a fixed categorical distribution over the edge types.
Following NRI, we consider uniform and sparse priors. In the uniform case, every edge type has the same prior probability, so the KL term becomes the sum of entropies over events and over generated edges excluding self-loops ($u \neq v$):
where the entropy is defined as a sum over the edge types of the predicted distribution.
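For a single edge, the relation between the KL term and the entropy under a uniform prior can be checked numerically. The sketch below uses the standard definitions and hypothetical function names; with $r$ edge types, the KL divergence to the uniform prior equals $\log r$ minus the entropy, so the two differ only by sign and an additive constant.

```python
import math


def entropy(q):
    """Shannon entropy -sum_i q_i log q_i of a categorical distribution."""
    return -sum(p * math.log(p) for p in q if p > 0.0)


def kl_to_uniform(q):
    """KL divergence from q to the uniform distribution over len(q) types."""
    r = len(q)
    return sum(p * math.log(p * r) for p in q if p > 0.0)


q = [0.7, 0.2, 0.1]
print(kl_to_uniform(q), math.log(len(q)) - entropy(q))
```

A uniform prior therefore penalizes confident (low-entropy) edge predictions, which discourages the encoder from collapsing onto one edge type early in training.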
In the case of a sparse prior, we set the prior to assign high probability to a non-edge type. This means that we generate one more edge type than we use: the non-edge type corresponding to high probability is discarded, leaving only sparse edges. In this case, the KL term becomes the KL divergence between the predicted distribution and this sparse prior, summed in the same way.
During training, we update the temporal attention as events arrive and backpropagate through the entire sequence in a minibatch. To backpropagate through the process of sampling discrete edge values, we use the Gumbel reparametrization, as in NRI. Training behaviour is illustrated in Fig. 3.
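The Gumbel reparametrization mentioned above can be sketched in plain Python. This shows only the forward sampling step, with a hypothetical temperature value; in a framework with automatic differentiation, the soft sample is what carries gradients back to the logits.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)                          # subtract max for stability
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def to_one_hot(soft):
    """Discretize a relaxed sample by taking its largest entry."""
    best = soft.index(max(soft))
    return [1 if i == best else 0 for i in range(len(soft))]


rng = random.Random(0)
soft = gumbel_softmax([1.0, 0.0, -1.0], temperature=0.5, rng=rng)
hard = to_one_hot(soft)
print(soft, hard)
```

Lower temperatures push the relaxed sample closer to a one-hot vector, at the cost of noisier gradients.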
A communication event ($k = 1$) is represented by the sending of an SMS message, or a Proximity or Call event from node $u$ to node $v$; an association event ($k = 0$) is a CloseFriend record between the nodes. We also experiment with other associative connections (Fig. 4). As Proximity events are noisy, we filter them by the probability that the event occurred, which is available in the dataset annotations. The filtered dataset on which we report results includes 83 nodes with 43,517 training and 10,462 test communication events. We evaluate models only on communication events, since the number of association events is small, but we use both for training. As in DyRep, we use events from September 2008 to April 2009 for training, and from May to June 2009 for testing. Associative connections corresponding to the beginning and end of training events are illustrated in Fig. 4.
At test time, given a tuple $(u, \tau, k)$, we compute the conditional density of $u$ with all other nodes and rank them. We report Mean Average Ranking (MAR) and HITS@10: the proportion of times that a test tuple appears in the top 10.
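The two metrics can be computed from per-event ranks as in the sketch below. It assumes zero-based ranks (the best candidate has rank 0), which is a guess made so that MAR values below 1 are possible; the function names and toy scores are hypothetical.

```python
def rank_of_target(scores, target):
    """Zero-based rank: number of candidates scored strictly above the target."""
    return sum(1 for node, s in scores.items()
               if node != target and s > scores[target])


def mar_and_hits(ranks, k=10):
    """Mean rank and the fraction of events whose true node is in the top k."""
    mar = sum(ranks) / len(ranks)
    hits = sum(1 for r in ranks if r < k) / len(ranks)
    return mar, hits


# Scores of candidate nodes for one test event (assumed values).
scores = {1: 0.9, 2: 0.5, 3: 0.1}
rank = rank_of_target(scores, target=2)
mar, hits = mar_and_hits([0, 1, 12], k=10)
print(rank, mar, hits)
```

Lower MAR and higher HITS@10 are better.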
We train models with the Adam optimizer, using the same learning rate, minibatch size, and number of hidden units per layer throughout, including in the encoder (7)-(10). We consider two priors to train the encoder: uniform and sparse, with the same number of edge types in each case. For the sparse case, we generate one additional edge type, but do not use the non-edge type corresponding to high probability, and leave only sparse edges.
We run each experiment 10 times and report the average and standard deviation of MAR and HITS@10 in Table 2. We train for 5 epochs with early stopping. To run experiments with random graphs, we generate the attention matrix once in the beginning and keep it fixed during training. For the models with learned temporal attention, we use random attention values for initialization, which are then updated during training (Fig. 4).
We report results of the baseline DyRep with CloseFriend and FacebookAllTaggedPhotos as an underlying graph (Table 2) and compare them to the models with learned temporal attention (latent dynamic graph, LDG). Models with learned attention perform better than DyRep’s hard-coded attention based on FacebookAllTaggedPhotos and some other ground truth graphs (see Fig. 5 for more comparisons), further confirming the finding from NRI that the underlying graph can be suboptimal. However, these models are still worse than DyRep with CloseFriend. We also show that bilinear models consistently improve results and exhibit better training behaviour, including when compared to a larger linear model with an equivalent number of parameters (Fig. 3).
While the models with a uniform prior have better test performance than those with a sparse prior in some cases, sparse attention is typically more interpretable. This is because the model is forced to infer only a few edges, which must be strong since that subset defines how node features propagate (Table 3). In addition, relationships between people in the dataset tend to be sparse. To estimate agreement of our learned temporal attention matrix with the underlying association connections, we take the matrix $S^t$ generated after the last event in the training set and compute the area under the ROC curve (AUC) between $S^t$ and each of the associative connections present in the dataset. These associations evolve over time, so we consider associations corresponding to the beginning (September 2008) and end (April 2009) of training events.
|Model|MAR (concat)|MAR (bilinear)|HITS@10 (concat)|HITS@10 (bilinear)|
|DyRep (CloseFriend)|15.96 ± 2.97|11.00 ± 1.23|0.47 ± 0.05|0.59 ± 0.06|
|DyRep (FacebookAllTaggedPhotos)|20.67 ± 5.75|15.01 ± 1.95|0.29 ± 0.21|0.38 ± 0.14|
|LDG (learned, uniform)|16.90 ± 2.77|14.90 ± 2.45|0.30 ± 0.14|0.42 ± 0.22|
|LDG (learned, sparse)|18.78 ± 3.25|14.33 ± 1.99|0.30 ± 0.13|0.48 ± 0.12|
|LDG (random, uniform)|19.48 ± 4.86|16.02 ± 3.31|0.28 ± 0.19|0.35 ± 0.17|
|LDG (random, sparse)|21.22 ± 4.38|17.12 ± 2.57|0.26 ± 0.10|0.37 ± 0.08|
|Associative Connection||September 2008||April 2009|
AUC is used as opposed to other metrics, such as accuracy, to take into account sparsity of true positive edges, as accuracy would give overoptimistic estimates. We observe that LDG learns a graph that is most similar to CloseFriend, with AUC of 84%. This is an interesting phenomenon, given that we only observe how nodes communicate through many events between non-friends. Thus, the learned temporal attention matrix is capturing information related to the associative connections.
Node embeddings of bilinear models tend to form more distinct clusters, with frequently communicating nodes generally residing closer to each other after training (Fig. 6). The random edges approach also clusters nodes well and embeds frequently communicating nodes close together, because the embeddings are mainly affected by the dynamics of communication events.
The application of GNNs to datasets with dynamic events and evolving graphs, as we have done here, is relatively new and demanding. To understand the nature of the Social Evolution dataset, we report results on the test set that were obtained simply by computing statistics from the training set (Fig. 5). This led to quite strong performance, e.g., MAR=23 by exploiting FacebookAllTaggedPhotos. Interestingly, based on the MAR results (Fig. 5, left), FacebookAllTaggedPhotos connections are more correlated with communication events than CloseFriend in the case of “no learn”. This may mean that friendships are strong, longer-term relationships that do not necessarily involve frequent communication events. Based on the HITS@10 results (Fig. 5, right), our model performs better than or comparably to all associations, except for CloseFriend.
In the experiments based on dataset statistics discussed above, to predict a link for node $u$ at time $\tau$, we randomly sample node $v$ from those associated with $u$. In the Random case, we sample node $v$ from all nodes, except for $u$. Moreover, we measured the frequency of events in the training set between each pair of nodes and used those values to rank nodes in the test set (Fig. 7). Surprisingly, this achieves almost perfect results, where MAR=0.30 and HITS@10=0.99 (or MAR=5.57 and HITS@10=0.83 for the full unfiltered dataset). These results imply that current models are unable to capture relatively simple regularities in dynamic data, although admittedly, we do not directly feed these statistics to the model. It also means we need datasets with more complicated dynamics inherent to many applications, where such simple regularities would have less predictive power.
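The frequency-based baseline described above needs no learning at all. A minimal sketch follows, assuming events are treated as undirected pairs (an assumption; the paper does not state this) and using hypothetical function names.

```python
from collections import Counter


def pair_frequencies(train_events):
    """Count training events per unordered node pair."""
    return Counter(frozenset(pair) for pair in train_events)


def rank_candidates(freq, u, nodes):
    """Order all other nodes by how often they interacted with u in training."""
    others = [v for v in nodes if v != u]
    # sorted() is stable, so equally frequent nodes keep their input order
    return sorted(others, key=lambda v: -freq[frozenset((u, v))])


train_events = [(0, 1), (1, 0), (0, 2)]
freq = pair_frequencies(train_events)
ranking = rank_candidates(freq, u=0, nodes=[0, 1, 2, 3])
print(ranking)
```

The position of the true partner in this ranking can then be scored with the same MAR and HITS@10 metrics used for the learned models.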
We propose an extension of DyRep and NRI for dynamic link prediction. In addition to the advantage, in some cases, of learned temporal attention over use of the ground truth graph, we showed that bilinear layers can capture essential graph dynamics that concatenation alone cannot. Finally, we showed that a simple statistical analysis can outperform more complex models based on node embeddings. More diverse data exhibiting richer dynamics would allow for more meaningful dynamic graph analysis.
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation. The authors also thank Elahe Ghalebi and Brittany Reiche for their helpful comments.
Statistical Analysis and Data Mining: The ASA Data Science Journal, 10(1):40–53, 2017.
Proceedings of the IEEE International Conference on Computer Vision, pages 1801–1810, 2017.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.
The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.