I Introduction
Discrete event sequences are ubiquitous: hospital visits, tweets, financial transactions, earthquakes and human activity are all examples. These sequences are primarily characterized by the event timestamps; however, they can also contain markers and other additional information such as type and location. Often these sparse observations are driven by complex hidden dynamics, and when observed through the lens of point processes, these sequences can be interpreted as realizations of an underlying stochastic process [Cox2018]. This underlying point process can be a simple stochastic process like a Poisson process, or it can be doubly stochastic and depend on both time and history, like the Hawkes process [hawkes].
Traditionally, these processes have been described using simplified historical dependencies. The Hawkes process, for example, assumes additive mutual excitation; however, such assumptions often do not hold for real-world data. Recent works leverage advances in deep learning to relax these restrictions by defining a rich, flexible family of models that incorporate historical information [rmtpp, nhp, thp, sahp, Shchur2020IntensityFree]. Recurrent Neural Network (RNN) based [rmtpp, nhp, Shchur2020IntensityFree] and Transformer-based [thp, sahp] architectures have been explored to capture historical dependencies and to predict the time and type of future events better than a prespecified Hawkes process. However, shallow RNNs only capture simple state transitions and multi-layer RNNs are difficult to train. Transformer-based approaches, on the other hand, are highly expressive and interpretable, but the multi-headed self-attention mechanism consumes exorbitant amounts of memory for long sequences.
Our approach addresses both of these problems. An entire discrete event sequence with marks can be interpreted as a superposition of multiple sequences, each containing only events of a single unique marker. We process the evolution of each of these per-type sequences with an LSTM [lstm]. The LSTM hidden states are then used to generate a dynamic relational graph and to model complex dependencies between event types using a Graph Attention Network [gat], as shown in Fig. 1. A key innovation of our architecture lies in attending over event types instead of all past events. This reduces space (memory) and time (number of computations) complexity: instead of calculating attention over all past inputs, which becomes prohibitively expensive as sequences get longer, we attend over event types to obtain a time-varying attention matrix whose time and space complexity is quadratic in the number of event types and independent of the sequence length. Additionally, this time-varying event attention matrix helps with interpretability, as it encodes the evolution of the underlying relational graph between event types as new events are observed in the data stream. To the best of our knowledge, this is the first time a Recurrent Graph Network using an expressive attention mechanism to encode a dynamic relational graph has been used for learning point processes.
The advantages of the proposed method over previous approaches are summarized in Table I.
The major contributions of this paper are summarized as follows:

We present a Recurrent Graph Network (RGN) approach to learn point processes that uses strong inductive biases to better encode historical information to model the conditional intensity function.

We show that the proposed model performs better in log-likelihood and predictive tasks on multiple datasets.

We show that this formulation allows for interpretability of the underlying temporally evolving Relational Graph between event types.

We show that RGN has lower computational complexity and a smaller activation footprint than state-of-the-art Transformer-based approaches, leading to computation and memory savings.
We perform experiments on standard point process datasets such as Retweets [Zhao_2015], Financial Transactions [rmtpp], StackOverflow [snapnets] and MIMIC-II [Johnson2016]. In addition to these, we introduce a new dataset for benchmarking point process models, extracted from StarCraft II game replays. The underlying complex stochastic process that generates this data is the high-level strategy that the players adopt over the course of the game. The StarCraft II dataset is available at https://figshare.com/s/c028296e953788b25599 and the supplementary materials at https://figshare.com/s/7a10a6aa2c0752f2cca3.
Our experiments show that RGN improves over the state-of-the-art Transformer Hawkes Process (THP) on log-likelihood by up to , event type prediction accuracy by up to and event time error by up to .
Model 




RMTPP [rmtpp]  ✗  ✓  ✗  
NHP [nhp]  ✗  ✓  ✗  
THP [thp]  ✓  ✗  ✓  
This work  ✓  ✓  ✓ 
II Related Work
Du et al. [rmtpp] propose using an RNN (RMTPP) to encode the history into a hidden state that is used to define the conditional intensity. Mei & Eisner [nhp] propose a novel RNN architecture, the Continuous-time LSTM (CLSTM), which modifies the LSTM cell so that its state decays toward a steady state when no event is observed. Since the CLSTM works in continuous time, there is no need to explicitly feed the timestamps of the events as in RMTPP. However, both models contain only shallow RNNs, which capture only simple state transitions [279181], as multi-layer RNNs are difficult to train [pmlrv28pascanu13] due to vanishing and exploding gradients. To alleviate these concerns, Zuo et al. [thp] and Zhang et al. [sahp] use Transformer architectures based on multi-headed self-attention [mhsa] to better distill historical influence. This allows multi-layer architectures to model more complex interactions than RNN-based architectures; however, the cost of computing self-attention leads to a prohibitively large memory footprint for longer sequences. Even if only a single sequence in a minibatch is long, the other sequences have to be zero-padded to match the longest length for batch processing, leading to wasted memory and redundant computation. In contrast, Shchur et al. [Shchur2020IntensityFree] draw upon advances in Normalizing Flows [nf] and propose a mixture distribution that defines the model through the conditional density function instead of the conditional intensity function. This allows the conditional density to be multimodal and hence to represent much more complex distributions; however, the historical influence is again encoded with a simple RNN, which degrades performance due to the lack of rich historical information.
Chang et al. [10.1145/3340531.3411946] propose a dynamic message passing neural network named TDIG-MPNN, which aims to learn continuous-time dynamic embeddings from sequences of interaction data. Although our proposed network also benefits from learning dynamic graph embeddings, the end goal of this work is to learn the underlying conditional intensity function that is responsible for the observed data, not the time-varying interaction graph between different entities. Moreover, the datasets under consideration here do not contain any interaction data. The work closest to the proposed approach is ARNPP-GAT [closest]. ARNPP-GAT divides the representations of users (marks) into two categories: a long-term representation, modelled using a Graph Attention Layer (GAT layer), and a short-term representation, modelled using a Gated Recurrent Unit (GRU). Although we use similar building blocks, the architecture of our model is drastically different. ARNPP-GAT uses the GAT layer to model a time-independent interaction between users (marks), while the historical context is modelled by a GRU. These embeddings are then combined to predict the time and location of the next event. RGN, on the other hand, uses GAT layers at every time step to model a dynamic relational graph between different event types. Moreover, we use an additional log-likelihood loss term, which encourages the model to better learn the underlying stochastic process.
III Background
III-A Point Processes
A temporal point process is a stochastic process composed of a time series of events that occur instantaneously in continuous time [Cox2018, daley]. A realization of this point process is a set of strictly increasing arrival times $\{t_1, \dots, t_n\}$. We are interested in a Marked Point Process, where each event also has an associated marker $k_j$. In real-world data, event probabilities can be affected by past events; the probability of an event occurring in an infinitesimally small window $[t, t + dt)$, given the history, is $\lambda^*(t)\,dt$, where $\lambda^*(t) = \lambda(t \mid \mathcal{H}_t)$ is called the conditional intensity. $\mathcal{H}_t$ denotes the history $\{(t_j, k_j) : t_j < t\}$. The symbol $*$ represents the dependence on history. Given the conditional intensity function, the conditional density function can be obtained as shown in Eq. 1 below [rasmussen2018lecture].
$$f^*(t) = \lambda^*(t)\,\exp\left(-\int_{t_n}^{t} \lambda^*(s)\,ds\right) \qquad (1)$$
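The relation between conditional intensity and conditional density can be checked numerically. The sketch below assumes the standard relation in which the density at time t equals the intensity at t multiplied by the probability that no event has occurred since the last arrival; the function name `conditional_density` and the trapezoidal integration are illustrative choices rather than part of the original method.

```python
import math


def conditional_density(intensity, t_last, t, n_steps=10_000):
    """Density of the next event time, given an intensity function of time.

    Computes intensity(t) * exp(-integral of intensity over [t_last, t]),
    with the integral approximated by the trapezoidal rule.
    """
    if t < t_last:
        raise ValueError("t must not precede the last event time")
    h = (t - t_last) / n_steps
    integral = 0.0
    for i in range(n_steps):
        a = t_last + i * h
        integral += 0.5 * (intensity(a) + intensity(a + h)) * h
    return intensity(t) * math.exp(-integral)


# Sanity check with a constant intensity (homogeneous Poisson process),
# for which the density is the exponential density rate * exp(-rate * dt).
rate = 2.0
density = conditional_density(lambda s: rate, t_last=1.0, t=1.5)
closed_form = rate * math.exp(-rate * 0.5)
```

For a history-dependent model, `intensity` would close over the encoded history up to the last event.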
III-B Recurrent Networks
Recurrent networks are autoregressive models that can incorporate past information along with the present input for prediction. RNNs and their modern variants, LSTM [lstm] and GRU [chung2014empirical], have shown remarkable success in sequence modelling tasks such as language modelling [seq2seq, DBLP:journals/corr/ChoMGBSB14], video processing [videolstm] and dynamical systems [saha2020physicsincorporated]. At each time step, an input is fed into the model together with the past internal state. The internal state is then updated and used to predict the output.
III-C Graph Networks
Graph Networks (GN) [battaglia2018relational] describe a class of functions for relational reasoning over graph-structured data. They take as input graph-structured data denoted by the 3-tuple $G = (u, V, E)$. Here $u$ denotes the global attributes, $V = \{v_i\}_{i=1:N^v}$ denotes the set of node attributes and $E = \{(e_k, r_k, s_k)\}_{k=1:N^e}$ denotes the set of edges of the graph, where $e_k$ is the edge attribute, $r_k$ is the receiver node and $s_k$ is the sender node.
The main unit of computation is the GN block, which performs graph-to-graph operations: it takes a graph as input, performs computations over its structure and outputs a graph. A full GN block is flexible both in how attributes are represented and in the structure of the graph, and can be used to describe a wide variety of architectures [gat, mhsa].
A full GN block contains three update functions and three aggregation functions [battaglia2018relational]:
\begin{align}
e'_k &= \phi^{e}(e_k, v_{r_k}, v_{s_k}, u) \tag{2} \\
v'_i &= \phi^{v}(\bar{e}'_i, v_i, u) \tag{3} \\
u' &= \phi^{u}(\bar{e}', \bar{v}', u) \tag{4} \\
\bar{e}'_i &= \rho^{e \to v}(E'_i) \tag{5} \\
\bar{e}' &= \rho^{e \to u}(E') \tag{6} \\
\bar{v}' &= \rho^{v \to u}(V') \tag{7}
\end{align}
where $E'_i = \{(e'_k, r_k, s_k)\}_{r_k = i}$ is the set of all updated edge attributes that have node $i$ as the receiver, $V' = \{v'_i\}_{i=1:N^v}$ is the set of updated node attributes and $E' = \bigcup_i E'_i$ is the set of updated edge attributes. The update functions $\phi^{e}$, $\phi^{v}$ and $\phi^{u}$ are shared across the nodes and edges, which reduces the parameter count and encourages a form of combinatorial generalization [battaglia2018relational]. They can be arbitrary functions, can be parameterized by a neural network such as a Multi-Layer Perceptron (MLP), or can incorporate past information using a Recurrent Neural Network (RNN). The functions $\rho^{e \to v}$, $\rho^{e \to u}$ and $\rho^{v \to u}$, on the other hand, are permutation-invariant aggregation functions that take a set of inputs and reduce them to a single aggregate element, for example an elementwise sum or mean.
IV Proposed Architecture
IV-A Recurrent Graph Network for Learning Point Processes
An entire discrete event sequence with marks can be interpreted as a superposition of multiple sequences, each containing only events of a single unique marker. Using the input sequence, we build a fully connected graph with one node per event type, each containing a latent embedding for that marker, which allows us to understand how different event types influence future events. As we are building a relational graph from non-graph-structured inputs, the relational information in the graph is not explicit and needs to be inferred. This relational graph has node attributes and edge weights that evolve with time as new events are observed. Thus, over a sequence of events, the relational graph undergoes a transition at each observed event, and these transitions can be processed naturally using a Recurrent Graph Network. We are interested in predicting the conditional intensity, which is a global property; this makes the Recurrent Graph Network a graph-focused Graph Network [battaglia2018relational].
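The superposition view can be illustrated with a few lines of code. This is a minimal sketch with a made-up toy sequence; the helper name `split_by_marker` is hypothetical and only shows how one marked sequence decomposes into per-type sequences, one per graph node.

```python
from collections import defaultdict


def split_by_marker(events):
    """Group a time-ordered list of (timestamp, marker) pairs by marker."""
    per_type = defaultdict(list)
    for t, k in events:
        per_type[k].append(t)
    return dict(per_type)


# Toy marked sequence (illustrative values only).
events = [(0.5, "A"), (1.2, "B"), (1.9, "A"), (3.0, "C"), (3.4, "B")]
per_type = split_by_marker(events)

# Superposing the per-type sequences recovers the original sequence.
merged = sorted((t, k) for k, ts in per_type.items() for t in ts)
```

Each key of `per_type` would correspond to one node of the relational graph.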
At each event in an event sequence, the inputs to the model are the previous node attributes and the input embedding. Once a new event is observed, three types of updates take place:
IV-A1 Input-to-Graph Update
Each uniquely marked sequence is assigned a designated graph node for processing. When an event of a particular type is observed, we first update the node attribute corresponding to this marker. This is performed using an LSTM [lstm]. The LSTM corresponding to the node of the observed event type updates its node attribute using the previous node attribute and the input embedding according to the following equations, while the rest of the node attributes remain unchanged in this update.
(8)  
(9)  
(10)  
(11)  
(12)  
(13) 
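The input-to-graph update can be sketched as a standard LSTM step applied only to the node of the observed marker. This is a minimal sketch under stated assumptions: the node attribute is taken to be the pair (hidden state, cell state), a single set of LSTM weights is shared by all nodes, and the names `lstm_cell` and `input_to_graph_update` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lstm_cell(x, h, c, params):
    """One standard LSTM step; params maps gate name -> (W, U, b)."""
    gates = {}
    for name in ("i", "f", "o", "g"):
        W, U, b = params[name]
        pre = [a + r + bb for a, r, bb in zip(matvec(W, x), matvec(U, h), b)]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(p) for p in pre]
    c_new = [f * cj + i * g
             for f, cj, i, g in zip(gates["f"], c, gates["i"], gates["g"])]
    h_new = [o * math.tanh(cj) for o, cj in zip(gates["o"], c_new)]
    return h_new, c_new


def input_to_graph_update(nodes, marker, x, params):
    """Update only the node of the observed marker; other nodes are untouched."""
    updated = dict(nodes)
    h, c = nodes[marker]
    updated[marker] = lstm_cell(x, h, c, params)
    return updated


random.seed(0)
d_in, d_h = 3, 4


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


params = {g: (rand_matrix(d_h, d_in), rand_matrix(d_h, d_h), [0.0] * d_h)
          for g in "ifog"}
nodes = {k: ([0.0] * d_h, [0.0] * d_h) for k in ("A", "B", "C")}
new_nodes = input_to_graph_update(nodes, "B", [0.1, -0.2, 0.3], params)
```

After the call, only the attribute of node "B" differs from its previous value, mirroring the statement that the remaining node attributes are unchanged by this update.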
IV-A2 Graph-to-Graph Update
Once the node attribute corresponding to the observed event type has been updated, information needs to be propagated to the other nodes to update their attributes. For this purpose, we propose using a Graph Attention Network (GAT) [gat], as its attention mechanism can use the node attributes to assign edge weights that are not explicitly present in the relational graph. In this section, we express the Graph Attention Layer [gat] in terms of Graph Network [battaglia2018relational] operations. For notational simplicity, we drop the temporal dependence of the node attributes at the current time.
Edge Update: The edge update function uses only the node attributes of the sender node and the receiver node and outputs a (scalar, vector) tuple. NN represents a multi-layer perceptron (MLP).
(14)  
(15)  
(16) 
Edge Aggregation: The scalar and vector outputs of the edge update are then used by the aggregation function to aggregate the edges that have a given node as the receiver. The scalar terms are normalized to obtain the attention scores, which are used as the weights for a weighted elementwise summation of the vector terms.
(17) 
Multi-headed Attention: The edge update and aggregation steps can be performed independently and in parallel several times, once per attention head, on the same input. We observed that this improves the performance of the model, similar to the results reported by Veličković et al. [gat], as it allows the network to jointly attend to input projections in various representation subspaces [mhsa]. The aggregated edge updates of the different attention heads are concatenated into a single vector.
Node Update: Node attributes are updated by passing the multi-headed edge aggregation through an MLP.
(18) 
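The edge update and edge aggregation steps can be sketched for a single attention head as follows. This is an illustrative sketch, not the exact parameterization used here: it assumes a GAT-style score (a LeakyReLU applied to a learned vector dotted with the concatenated projected receiver and sender attributes), a softmax normalization over senders, and self-loops in the fully connected graph. The names `graph_attention`, `W` and `att` are hypothetical.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def graph_attention(nodes, W, att):
    """Single-head attention over a fully connected graph of event-type nodes.

    nodes: dict mapping marker -> attribute vector.
    Returns (aggregated message per receiver, attention weights per receiver).
    """
    proj = {k: matvec(W, v) for k, v in nodes.items()}
    aggregated, attention = {}, {}
    for r in nodes:  # r is the receiver node
        # Edge update: a scalar score and a vector message for every sender s.
        scores = {
            s: leaky_relu(sum(a * z for a, z in zip(att, proj[r] + proj[s])))
            for s in nodes
        }
        # Edge aggregation: normalize the scores, then take a weighted sum.
        m = max(scores.values())
        exps = {s: math.exp(a - m) for s, a in scores.items()}
        total = sum(exps.values())
        weights = {s: e / total for s, e in exps.items()}
        dim = len(proj[r])
        aggregated[r] = [sum(weights[s] * proj[s][d] for s in nodes)
                         for d in range(dim)]
        attention[r] = weights
    return aggregated, attention


# Toy example with three event types and two-dimensional node attributes.
nodes = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.5, 0.5]}
W = [[1.0, 0.5], [-0.5, 1.0]]
att = [0.3, -0.1, 0.2, 0.4]
aggregated, attention = graph_attention(nodes, W, att)
```

The `attention` dictionary has one row per receiver and one column per sender, so its size depends only on the number of event types and not on how many events have been observed. A node update would then pass each aggregated vector (concatenated across heads) through an MLP.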
IV-A3 Global Update
Once the node attributes are updated, we concatenate all of them and pass the result through an MLP to obtain a global attribute, which is used to predict the conditional intensity for all event types, the type of the next event and the time of the next event. This global attribute represents the history up to the current time. We add the time as a subscript to convey the temporal dependence of this global attribute.
(19)  
(20)  
(21) 
We define our model in terms of the conditional intensity due to its simplicity. As it is not a probability density, the only restriction is that it has to be non-negative; unlike the conditional density, it need not integrate to 1. The conditional intensity of a Hawkes process is defined at every point in time; however, Eq. 19 only outputs the hidden representations at the timestamps in the observed sequence. To interpolate the conditional intensity to times at which no event occurs, we use the expression proposed by Zuo et al. [thp].
(22)  
(23) 
The softplus function guarantees that the intensities are non-negative. The first term of Eq. 23 represents the contribution of the current event time towards the future, weighted by a hyperparameter. The second term is the history term, which encodes how the history of observed events influences the conditional intensity of a certain event type. The last term incorporates the base intensity of the point process in the absence of events.
Updated node attributes now become the previous node attributes when the next event is processed by the model. The entire information flow is illustrated in Fig. 1.
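The three-term intensity between events can be sketched as below. This is a hedged illustration: it assumes the current term has the form alpha * (t - t_last) / t_last and that the history term is a linear projection of the global attribute, which should be checked against [thp]. The names `softplus`, `intensity`, `alpha`, `w` and `base` are assumptions introduced for the example.

```python
import math


def softplus(x, beta=1.0):
    """beta * log(1 + exp(x / beta)), computed stably; always positive."""
    z = x / beta
    return beta * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def intensity(t, t_last, hidden, w, alpha, base, beta=1.0):
    """Per-type intensity at time t >= t_last > 0: current + history + base."""
    current = alpha * (t - t_last) / t_last               # time since last event
    history = sum(wi * hi for wi, hi in zip(w, hidden))   # encoded history
    return softplus(current + history + base, beta)


# Toy values: the last event is at t = 2.0 and the hidden vector is fixed.
hidden = [0.2, -0.4]
w = [1.0, 0.5]
values = [intensity(t, 2.0, hidden, w, alpha=-0.1, base=0.3)
          for t in (2.0, 2.5, 3.0)]
```

With a negative `alpha`, the intensity decays as time passes without a new event, while the softplus keeps every value strictly positive.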
IV-B Input Embedding
Although an RNN processes its inputs in a natural order, the arrival of events is not uniform in time, so temporal information needs to be explicitly fed to the model. Directly feeding the time causes issues: the input value grows without bound, and the model may not see any sequence of a specific length, which would hurt generalization. To overcome this issue, we follow the positional embedding proposed by Vaswani et al. [mhsa]. The trigonometric functions ensure that the input embedding is bounded and deterministic, which enables generalization to longer sequences of unseen lengths.
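A trigonometric embedding of a timestamp can be sketched as follows. The sketch assumes the sinusoidal form of Vaswani et al. with the integer position replaced by the event time; the function name `temporal_embedding`, the base of 10000 and the sine/cosine ordering are assumptions, not details confirmed here.

```python
import math


def temporal_embedding(t, dim, base=10000.0):
    """Sinusoidal encoding of a timestamp; every component lies in [-1, 1]."""
    emb = []
    for i in range(dim):
        angle = t / (base ** (2 * (i // 2) / dim))
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


# Arbitrarily large timestamps still map to bounded inputs.
small = temporal_embedding(0.0, 4)
large = temporal_embedding(1.0e6, 8)
```

Because each component is a sine or cosine, the embedding stays bounded no matter how large the raw timestamp becomes.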
IV-C Learning Objectives
Log-Likelihood: We use maximum likelihood estimation (MLE) to learn the parameters of the model. For a sequence $\mathcal{S} = \{(t_j, k_j)\}_{j=1}^{L}$, the log-likelihood of observing the sequence is given by:
$$\ell(\mathcal{S}) = \sum_{j=1}^{L} \log \lambda(t_j \mid \mathcal{H}_{t_j}) - \int_{t_1}^{t_L} \lambda(t \mid \mathcal{H}_t)\,dt \qquad (24)$$
The first term in Eq. 24 is the log-likelihood of events occurring at the times $t_1, \dots, t_L$; we would like this term to be as large as possible. The second term corresponds to the log-likelihood of no events occurring at times other than $t_1, \dots, t_L$; we would like this integral to be as small as possible.
As the sequences in the dataset are assumed to be i.i.d., the loss function to be minimized is the negative of the sum of the log-likelihoods over all sequences.
The integral in Eq. 24 does not have a closed-form solution and needs to be numerically approximated. We use the Monte Carlo estimate [mc] given in the supplementary material.
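A Monte Carlo approximation of the log-likelihood can be sketched as follows. This is not the estimator from the supplementary material, which is not reproduced here; it is a simple variant that samples times uniformly over the whole observation window, and `log_likelihood` together with its parameters are hypothetical names.

```python
import math
import random


def log_likelihood(event_times, intensity, n_samples=1000, rng=None):
    """Event term minus a Monte Carlo estimate of the integrated intensity.

    event_times: increasing timestamps t_1 < ... < t_L.
    intensity:   function of time returning the total conditional intensity.
    """
    rng = rng or random.Random(0)
    t_start, t_end = event_times[0], event_times[-1]
    # First term: log-intensity evaluated at the observed event times.
    event_term = sum(math.log(intensity(t)) for t in event_times)
    # Second term: integral over [t_1, t_L] estimated from uniform samples.
    samples = [rng.uniform(t_start, t_end) for _ in range(n_samples)]
    integral = (t_end - t_start) * sum(intensity(s) for s in samples) / n_samples
    return event_term - integral


# With a constant intensity the integral is exact, which gives a sanity check.
ll = log_likelihood([0.0, 1.0, 2.5], lambda t: 2.0)
expected = 3 * math.log(2.0) - 2.0 * 2.5
```

The training loss would then be the negative of this quantity summed over all sequences.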
Event Prediction: As we are also interested in predicting the type of the next event, we additionally impose a cross-entropy loss term that penalizes the model when it mispredicts the type of the next event. For each event, the target is the one-hot encoding of its event type, and the next event prediction loss is the cross-entropy between this encoding and the predicted type distribution.
Time Prediction: Apart from event prediction, we also want the predicted time of the next event to be close to the ground truth. The time of the next event is a continuous value that needs to be estimated, so we use an $\ell_2$ penalty to reduce the Mean Squared Error (MSE). The time prediction loss is the squared error between the predicted and the true time of the next event.
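The two prediction losses can be sketched for a single event as follows. The helper names and the toy logits are hypothetical, and how the terms are weighted against the log-likelihood is not specified in this section, so the sketch keeps them separate.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true next event type."""
    return -math.log(probs[true_index])


def squared_error(t_pred, t_true):
    """Squared difference between predicted and true next event time."""
    return (t_pred - t_true) ** 2


# Toy prediction for one event: three event types, the true type is index 0.
type_loss = cross_entropy(softmax([2.0, 0.5, -1.0]), true_index=0)
time_loss = squared_error(t_pred=3.2, t_true=3.0)
```

Summing these terms over all events of a sequence gives the type and time prediction objectives.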
V Experiments
V-A Datasets
Retweets (RT) [Zhao_2015]: This dataset contains sequences of tweets, where each sample is a sequence of tweets from different users.
StackOverflow (SO) [snapnets]: The StackOverflow dataset contains sequences of awards that users received for answering questions on the StackOverflow website. The markers for the events are the different awards.
MIMIC-II [Johnson2016]: The MIMIC-II dataset contains the visits of patients to a hospital’s ICU. Each sample represents a patient and the event markers are the diagnoses.
Financial Transactions [rmtpp]: This dataset contains the transaction records for multiple stocks on a single day. The different events are buy and sell orders at various timestamps.
StarCraft II (SCII): We introduce a new dataset for benchmarking discrete event models. Each sequence is a “build order”, a temporally ordered list of the buildings built by the players over the course of a game. The strategy of the players is the underlying complex stochastic process that generates this data. Each sequence in the dataset is an entire Protoss vs Protoss game. Additional details are presented in the supplementary material.
V-B Setup
Here we describe the architectural choices we make for the various update functions. In , is simply a linear projection , is a single fully connected (FC) layer with LeakyReLU [gat] nonlinearity. is a linear projection from to . uses a single FC layer with ReLU nonlinearity whereas , and are simple linear projections of size , and respectively.
We are interested in three evaluation metrics: log-likelihood, event class prediction accuracy and event time prediction error. We also evaluate these metrics for the Transformer Hawkes Process (THP) [thp], the Recurrent Marked Temporal Point Process (RMTPP) [rmtpp] and the Neural Hawkes Process (NHP) [nhp] for comparison. For a fair comparison, we optimize these models using the same objective as ours, and for each evaluation metric we pick the model parameters that lead to the best performance on the validation set. Additional training details are included in the supplementary material.
V-C Likelihood
Model  RT  SO  MIMIC-II  Financial  SCII 
RMTPP  8.8  0.73  0.39  1.71  1.09 
NHP  8.24  2.41  1.38  2.60  0.81 
THP  7.80  0.73  0.86  1.53  1.04 
This work  6.76  0.56  1.00  1.26  1.50 
In this section, we use log-likelihood as an evaluation metric, as done in previous works [Shchur2020IntensityFree, sahp, thp, nhp, rmtpp]. A higher log-likelihood implies that the model approximates the conditional intensity function well. Table II shows the log-likelihood results. We can see that RGN beats THP in log-likelihood across all the datasets.
V-D Event Type Prediction
Model  RT  SO  Financial  MIMIC-II  SCII 
RMTPP  49.89  43.49  60.29  59.91  44.25 
NHP  48.82  42.81  61.20  76.21  41.14 
THP  50.43  43.23  61.67  80.14  43.33 
This work  54.42  45.46  61.69  77.34  47.31 
We are also interested in how well the model predicts the type of the next event. To improve the performance of the model on this evaluation metric, we added an extra event type prediction loss to the standard log-likelihood objective. The results are presented in Table III. Unlike the log-likelihood results, we see that although RGN has better predictive performance than THP on most of the datasets, it performs worse on the MIMIC-II dataset.
V-E Event Time Prediction
To improve the performance of the model on this evaluation metric, we added an extra event time prediction loss to the standard log-likelihood objective. Table IV shows the Root Mean Squared Error (RMSE) of the proposed approach compared to the baselines. The lower the RMSE, the better the performance of the model. We observe that RGN outperforms THP on four of the five datasets; however, THP again performs better on the MIMIC-II dataset.
Model  RT  SO  MIMIC-II  Financial  SCII 
RMTPP  16899.72  144.34  3.42  26.95  1.48 
NHP  17672.54  144.72  3.17  27.88  1.51 
THP  16616.13  140.20  0.87  25.75  1.39 
This work  15999.68  121.50  1.026  25.37  1.25 
V-F Goodness of Fit
We would like to verify that the model accurately describes the structure present in the data; however, it is difficult to work with the non-stationary and history-dependent distributions present in real-world data. The Time-Rescaling Theorem [Papangelou1974] states that any point process with a conditional intensity function can be mapped to a Poisson process with unit rate. Using the Time-Rescaling Theorem, we can use the learnt conditional intensity function to obtain transformed variables, the integrals of the conditional intensity between consecutive events, which are independent and should be exponentially distributed with rate 1. Since the cumulative distribution function (CDF) of the target distribution is known, we can use P-P plots to measure the deviation from this ideal behaviour. A P-P plot plots the empirical CDF against the theoretical CDF; for a perfect fit it is a straight line equally inclined to both axes. Fig. 3 shows the P-P plots for our Recurrent Graph Network (RGN) (above) and the Transformer Hawkes Process (THP) (below) for three different datasets. In all three cases, the P-P plot for THP deviates substantially more from the expected straight line, whereas the plot for RGN remains close to it. This shows that RGN learns the structure in the data better than the state-of-the-art THP.
V-G Model Interpretability
The ability to incorporate structure using a Recurrent Graph Network helps make the deep learning model more interpretable. The formulation of our model exposes the event-class dependency, i.e., how one class depends on another whenever an event occurs, allowing us to study the dynamics of inter-class influence as the event stream is observed. Moreover, the multi-headed mechanism allows the model to attend to different information in different subspaces. Fig. 2 visualizes two attention heads of the model at three different event timestamps for the StarCraft II dataset. A lighter intensity corresponds to a lower attention score, while a more pronounced color corresponds to a higher attention score for a node. Other Transformer-based models [thp, sahp] have to average the attention scores over all events to obtain attention over event classes, thus losing crucial temporal information. Although the model lets us study inter-class dependency at event timestamps, it can easily be extended to continuous time using spline-based interpolation [spline] or other interpolation techniques.
V-H Ablation Studies
We conduct ablation experiments to understand the impact of the number of attention heads in the multi-headed attention mechanism of the Graph Attention Network. We keep the sizes of all the parameters of the model fixed and vary only the number of attention heads. The evaluation metrics of interest are log-likelihood, event type prediction accuracy and event time prediction error. In addition, we study the case where the GAT module is removed completely, to measure the improvement that the GAT module brings over a fully LSTM-based model.
The results of this experiment on the StarCraft II dataset are shown in Table V. As shown in Fig. 2, the attention heads in the model attend to different features, and Table V confirms our hypothesis that doing so improves the performance of the model, especially on the log-likelihood metric. We also notice a stark drop in model performance when the graph attention module is removed (0 attention heads).
# Attention Heads  LL  Type Accuracy  Time Error 
0  1.03  43.95  1.269 
1  1.38  47.00  1.261 
2  1.44  47.15  1.259 
4  1.49  47.18  1.260 
8  1.50  47.31  1.254 
V-I Reduction in Computational Complexity
One of the major drawbacks of Transformer-based approaches is their quadratic complexity in space and time. For a sequence of length $L$, the number of activations required to store the self-attention matrix is on the order of $L^2$. This problem is further exacerbated when Transformer layers are stacked or multiple heads are used. Another issue arises in implementation due to sequence padding within a minibatch: all sequences in the minibatch are padded with zeros to the length of the longest sequence, so every attention matrix scales as the square of the longest sequence length in the minibatch, leading to wasted memory and redundant computation. In contrast, the attention mechanism in RGN attends over event types rather than individual events. This makes the attention matrix scale on the order of $K^2$ instead of $L^2$, where $K$ is the number of event types, which leads to dramatic savings in memory as usually $K \ll L$. It also makes the memory consumed by the model independent of the sequence length. Table VI shows the number of activations in the attention mechanism (in millions) and the total number of operations (MFLOP) for both models. We report the total number of operations performed by each model to compute the historical embedding for all events of a sequence of average length from each dataset. The model hyperparameters are described in the supplementary material. RGN has fewer attention activations and computations than THP for all datasets except MIMIC-II, where the trend is reversed. This is because the assumption $K \ll L$ fails to hold for this dataset, which has short sequences on average and a large number of event classes.
Table VII shows the amount of GPU memory used by the models. For the Financial Transactions dataset, only a minibatch of size 1 fits for THP on an Nvidia RTX 2080 Ti GPU with 11 gigabytes of memory. No such issue was faced for RGN, which has a very small footprint due to the small number of event types in the dataset. The trend reverses for the MIMIC-II dataset, where the large number of classes causes RGN to consume more memory than THP.
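The scaling argument can be made concrete with a small counting sketch. It counts only the entries of the attention matrices, under the assumptions that a zero-padded minibatch stores one square matrix per sequence at the longest length and that event-type attention stores one square matrix over event types per sequence. The function names and the example sizes are hypothetical and are not measurements from the tables.

```python
def padded_self_attention_entries(seq_lengths, n_heads=1, n_layers=1):
    """Entries for a zero-padded minibatch: every sequence costs L_max ** 2."""
    l_max = max(seq_lengths)
    return len(seq_lengths) * n_heads * n_layers * l_max ** 2


def event_type_attention_entries(batch_size, n_types, n_heads=1):
    """Entries of one event-type attention matrix per sequence: n_types ** 2."""
    return batch_size * n_heads * n_types ** 2


# Long sequences with few event types: attention over types is far smaller.
long_few = (padded_self_attention_entries([50, 80, 2000]),
            event_type_attention_entries(3, n_types=2))

# Short sequences with many event types: the trend reverses.
short_many = (padded_self_attention_entries([4, 5, 3]),
              event_type_attention_entries(3, n_types=75))
```

In the first case a single long sequence inflates the cost of the whole padded minibatch, while in the second case the number of event types exceeds the sequence length and attention over types becomes the more expensive option.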
Dataset  Model 
RT  THP  89.88  0.41  
RT  This Work  84.10  0.17  
SO  THP  3209.67  3.03  
SO  This Work  409.23  1.83  
Financial  THP  103294.46  174.58  
Financial  This Work  1343.42  32.62  
MIMIC-II  THP  2.00  0.017  
MIMIC-II  This Work  23.21  0.081  
SCII  THP  1621.84  1.49  
SCII  This Work  318.58  1.17 
Dataset  Batch Size  THP  This Work 
Retweets  64  2.08  1.03 
StackOverflow  16  9.75  3.65 
MIMIC-II  64  1.63  5.07 
Financial  1  7.15  1.02 
StarCraft II  16  10.29  3.63 
VI Conclusion
In this paper, we presented a Recurrent Graph Network to learn the underlying Marked Temporal Point Process from event streams. Our key innovation is incorporating structural information using a dynamic attention mechanism over event types rather than over all past events in the sequence. This leads to improved performance on log-likelihood and event prediction metrics. Moreover, it reduces time and space complexity on almost all datasets. The model improves interpretability by allowing us to understand the influence of one event type on the others as more events are observed. We also present a new benchmarking dataset for point process evaluation.