Temporal point processes (TPPs) provide an effective mathematical framework for modeling event sequence data. Event sequences are common in a wide spectrum of areas, such as patient visits to hospitals, online searches, user behavior on social media, and credit card transactions.
A TPP can be defined as a stochastic process whose realizations consist of a list of isolated events with their corresponding arrival times. The arrival times can either be real numbers from an index set (pre-defined from prior knowledge), or samples following an intensity function. The key issue in modeling a TPP is finding an effective probabilistic model that can capture the distribution over arrival times.
However, it is usually not trivial to come up with a simple yet powerful intensity function for this purpose. Various hand-crafted designs of the intensity function have been investigated in this line of literature (Kingman, 1992; Hawkes, 1971; Isham & Westcott, 1979), but the unnecessary parametric intensity assumption made by such frameworks limits their capacity to model more complex processes.
Lately, modeling intensity functions using recurrent neural networks (RNNs) has received much attention (Du et al., 2016; Mei & Eisner, 2017; Jing & Smola, 2017; Mehrasa et al., 2019). However, all of these approaches rely on explicit parametric modeling of a point process distribution using the intensity, and it can be hard to find a good functional form when the underlying distribution is highly complex. Most recently, there have been efforts to model point processes without specifying the intensity. Xiao et al. (2017) introduced an intensity-free model that learns the distribution of point processes using the Wasserstein distance by utilizing generative adversarial networks. Li et al. (2018) proposed to model TPPs using reinforcement learning by treating future event prediction as actions taken by an agent, so that learning the intensity function is equivalent to policy learning in reinforcement learning.
Following this trend, we propose another intensity-free point process model with a new perspective based on continuous normalizing flow. The proposed Point Process Flow (PPF) model utilizes a recurrent variational autoencoder to encode the history of a given event sequence and make probabilistic predictions on the following one, while preserving the non-parametric characteristics of point process distributions with normalizing flow. With such a setup, the predicted non-parametric point process distribution is capable of capturing complex time distributions of arbitrary form, leading to more accurate modeling of event sequences.
2 Proposed Method
2.1 Temporal Point Process
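A temporal point process (TPP; Daley & Vere-Jones, 2007) is a stochastic process whose realization is a sequence of discrete events $\{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^{+}$ represents the (absolute) starting time of the $n$-th event. Let $\mathcal{H}_t = \{x_n \mid x_n < t\}$ be the history of past events up to time $t$. A temporal point process is usually modeled by specifying the conditional intensity function $\lambda^*(t) = \lambda(t \mid \mathcal{H}_t)$, which encodes the expected rate of events happening in a small interval around $t$. Using the intensity, the probability density function of the next event timing can be defined as:
$$f(t \mid \mathcal{H}_t) = \lambda^*(t) \exp\left(-\int_{x_n}^{t} \lambda^*(u)\,\mathrm{d}u\right),$$
where $x_n$ is the time of the most recent event before $t$.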
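The relation between intensity and density, $f(t) = \lambda^*(t)\exp(-\int \lambda^*(u)\,\mathrm{d}u)$ with the integral running from the last event to $t$, can be checked numerically. The sketch below is illustrative only: the function name `density_from_intensity` and the constant-rate example are assumptions and do not come from the paper. For a constant intensity the result should reduce to the exponential density.

```python
import math

def density_from_intensity(intensity, last_event, t, steps=1000):
    """Evaluate f(t) = lambda(t) * exp(-integral of lambda from last_event to t).

    `intensity` is any callable returning the conditional intensity at a time
    point; the integral is approximated with the trapezoidal rule.
    """
    if t < last_event:
        raise ValueError("t must not precede the last event")
    h = (t - last_event) / steps
    # Trapezoidal rule for the integrated intensity (the compensator).
    total = 0.5 * (intensity(last_event) + intensity(t))
    for i in range(1, steps):
        total += intensity(last_event + i * h)
    compensator = total * h
    return intensity(t) * math.exp(-compensator)

# Hypothetical example: homogeneous Poisson process with rate 2.0.
rate = 2.0
value = density_from_intensity(lambda u: rate, last_event=1.0, t=1.5)
expected = rate * math.exp(-rate * 0.5)  # exponential density at elapsed time 0.5
```

Any hand-crafted intensity can be passed as the callable, which makes explicit that the density is fully determined by the chosen parametric form.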
In this work, we propose an intensity-free flow framework to model the timing of events in point process sequences. More specifically, we learn a non-parametric distribution over the timing of asynchronous event sequences by transforming a simple base probability density through continuous normalizing flow, i.e., a series of invertible transformations. With our model, we are able to generate arbitrarily complex point process distributions, making no assumption on the functional form of the distribution.
2.2 Flow for Non-Parametric Temporal Point Processes
Let the input be a sequence of asynchronous events $x_{1:N}$, where $x_n$ represents the starting time of the $n$-th event. We define the inter-arrival time $\tau_n$ as the time difference between the starting times of events $x_{n-1}$ and $x_n$. Our goal is to model the distribution over the inter-arrival time $\tau_n$ given the past history of event inter-arrival times, i.e., learning to model the conditional distribution $p(\tau_n \mid \tau_{1:n-1})$.
We construct the distribution over the inter-arrival time by transforming a base distribution with a simple form through normalizing flow transformations. At time-step $n$ of the sequence, we assume that the inter-arrival time $\tau_n$ is generated as follows:
$$z_n \sim p(z_n), \qquad \tau_n = g_\theta(z_n), \tag{2}$$
where $z_n$ is a random variable sampled from the base distribution $p(z_n)$ with a simple form, e.g., a standard Gaussian, and the transformation $g_\theta$ is a bijection with inverse $g_\theta^{-1}$. Using the change of variable formula, we can write the distribution over the inter-arrival time as:
$$p(\tau_n) = p(z_n) \left| \frac{\partial g_\theta^{-1}(\tau_n)}{\partial \tau_n} \right|,$$
where $z_n = g_\theta^{-1}(\tau_n)$, and the scalar value $\partial g_\theta^{-1}(\tau_n) / \partial \tau_n$ is the Jacobian of $g_\theta^{-1}$ at $\tau_n$, which accounts for the change in density when moving between $\tau_n$ and $z_n$. We dropped the determinant in the change of variable formula because, in our case, the inter-arrival time is a one-dimensional variable.
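To make the one-dimensional change of variable concrete, the following sketch pushes a standard Gaussian through a fixed bijection and checks that the resulting density over inter-arrival times integrates to one. The bijection `g` (an exponential of an affine map, which yields a log-normal density) and its parameters `a`, `b` are hypothetical stand-ins chosen for illustration; the paper's transformation is a learned flow, not this closed form.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def g(z, a=0.5, b=0.0):
    """Hypothetical bijection from the real line to positive times."""
    return math.exp(a * z + b)

def g_inverse(tau, a=0.5, b=0.0):
    return (math.log(tau) - b) / a

def log_density_tau(tau, a=0.5, b=0.0):
    """log p(tau) = log p(z) + log |d g^{-1} / d tau| with z = g^{-1}(tau)."""
    z = g_inverse(tau, a, b)
    # Derivative of (log(tau) - b) / a with respect to tau is 1 / (a * tau).
    log_abs_jacobian = -math.log(abs(a) * tau)
    return standard_normal_logpdf(z) + log_abs_jacobian

def integrate_density(lo=1e-6, hi=40.0, steps=100000):
    """Midpoint-rule check that the transformed density integrates to about 1."""
    h = (hi - lo) / steps
    return h * sum(math.exp(log_density_tau(lo + (i + 0.5) * h)) for i in range(steps))
```

Because the variable is scalar, the Jacobian term is a single derivative rather than a determinant, which is the simplification noted above.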
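We build our model on the recently proposed continuous normalizing flow (CNF) of Chen et al. (2018) and Grathwohl et al. (2019). They proposed the neural ODE, in which the discrete transformation $g_\theta$ is replaced by continuous dynamics parameterized using an ordinary differential equation (ODE) specified by a neural network $f_\theta$:
$$\frac{\partial z(t)}{\partial t} = f_\theta(z(t), t).$$
Following the neural ODE perspective, the change in log-density can be computed using an integral of the continuous-time dynamics:
$$\log p(z(t_1)) = \log p(z(t_0)) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial f_\theta}{\partial z(t)}\right) \mathrm{d}t, \tag{5}$$
where $z(t_0) = z_n$ is the base sample and $p(z(t_1))$ is our target distribution $p(\tau_n)$. Here $t$ denotes the integration time of the flow, not the event time.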
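The continuous change of variables can be sketched for the scalar case by integrating the state and the log-density together. The code below is a minimal illustration under stated assumptions: it uses fixed-step Euler integration and a hand-written linear dynamics function in place of the neural network, and the names `cnf_forward`, `dynamics`, and `d_dynamics_dz` are hypothetical. Practical CNF implementations typically rely on adaptive ODE solvers and automatic differentiation instead.

```python
import math

def cnf_forward(z0, log_p0, dynamics, d_dynamics_dz, t0=0.0, t1=1.0, steps=1000):
    """Integrate dz/dt = f(z, t) and d(log p)/dt = -df/dz with Euler steps.

    In one dimension the trace of the Jacobian is simply df/dz.
    """
    h = (t1 - t0) / steps
    z, log_p, t = z0, log_p0, t0
    for _ in range(steps):
        # Evaluate both derivatives at the current state before updating it.
        dz = dynamics(z, t)
        dlogp = -d_dynamics_dz(z, t)
        z += h * dz
        log_p += h * dlogp
        t += h
    return z, log_p

# Hypothetical linear dynamics f(z, t) = a * z, used instead of a neural network.
a = 0.7
z1, log_p1 = cnf_forward(
    z0=1.5,
    log_p0=-2.0,
    dynamics=lambda z, t: a * z,
    d_dynamics_dz=lambda z, t: a,
)
# Analytic solution for comparison: z1 = 1.5 * exp(a), log_p1 = -2.0 - a.
```

With linear dynamics the flow has a closed form, so the numerical output can be compared directly against the analytic values in the final comment.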
The current formulation models the inter-arrival distribution of each time-step independent of past history. The timing of future events might depend on the previous events in a very complex way, so it is important to make use of the history information to model future events. To capture this dependency, we adapt our flow model by learning a time-dependent base distribution conditioned on the history. In the next section, we show how we employ our flow module in a probabilistic framework that encodes history in the generation process of base distributions of flow.
2.3 Base Distribution with Probabilistic Parameters
It is known that there is a trade-off between the complexity of the bijective transformation and the form of base distribution (Jaini et al., 2019). With the complexity of the bijective transformation fixed, a more flexible base distribution will lead to a more expressive model. Motivated by this, we combine our flow module with a variational auto-encoder (VAE; Kingma & Welling 2014) framework in order to achieve flexible base distributions and make conditional predictions.
To avoid confusion, at time-step $n$, we use the notation $z_n^{F}$ for the random variable of the normalizing flow base distribution and $z_n^{V}$ to refer to the VAE latent space. We start by explaining the generation phase, i.e., how distributions over the inter-arrival time are generated by stacking the normalizing flow module on top of the VAE backbone, and then describe the training process.
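Generation. Figure 1 illustrates an overview of the generation process. Here, we adapt a recurrent VAE framework consisting of a time-variant prior network parametrized by $\psi$, which takes the history of past events $\tau_{1:n-1}$ and provides the latent distribution $p_\psi(z_n^{V} \mid \tau_{1:n-1})$. Then, a sample $z_n^{V}$ of this distribution is passed to the VAE's decoder, which produces a non-parametric distribution over the inter-arrival time $\tau_n$ by first generating the normalizing flow base distribution $p(z_n^{F} \mid z_n^{V})$ and then transforming it through the flow transformation $g_\theta$. By applying the change of variable formula discussed in Equation 5, we can write the distribution over the inter-arrival time as:
$$\log p(\tau_n \mid z_n^{V}) = \log p(z_n^{F} \mid z_n^{V}) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial f_\theta}{\partial z(t)}\right) \mathrm{d}t, \tag{6}$$
where $z(t_0) = z_n^{F}$ and $z(t_1) = \tau_n$ are the endpoints of the flow with dynamics $f_\theta$.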
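Training. At time-step $n$ of training, the VAE module takes the sequence of inter-arrival times $\tau_{1:n}$ to approximate the true distribution over the latent space with the help of the recurrent inference network $q_\phi(z_n^{V} \mid \tau_{1:n})$, which is parametrized by $\phi$. A time-dependent prior network $p_\psi(z_n^{V} \mid \tau_{1:n-1})$ is also adapted to help the model make use of history information in the generation phase. Both prior and posterior distributions are assumed to follow conditional multivariate Gaussian distributions with diagonal covariance:
$$q_\phi(z_n^{V} \mid \tau_{1:n}) = \mathcal{N}\left(\mu_{\phi_n}, \mathrm{diag}(\sigma_{\phi_n}^{2})\right), \qquad p_\psi(z_n^{V} \mid \tau_{1:n-1}) = \mathcal{N}\left(\mu_{\psi_n}, \mathrm{diag}(\sigma_{\psi_n}^{2})\right).$$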
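At each time-step during training, a latent code $z_n^{V}$ is drawn from the posterior and passed to the decoder, which generates a distribution over the inter-arrival time by first generating the base distribution of the flow and then transforming it through the flow transformations. The VAE backbone is jointly trained with the flow module by optimizing the variational lower bound using the re-parameterization trick (Kingma & Welling, 2014):
$$\mathcal{L}_{\theta,\phi,\psi}(\tau_{1:N}) = \sum_{n=1}^{N} \Big( \mathbb{E}_{q_\phi(z_n^{V} \mid \tau_{1:n})}\left[\log p_\theta(\tau_n \mid z_n^{V})\right] - D_{\mathrm{KL}}\left(q_\phi(z_n^{V} \mid \tau_{1:n}) \,\|\, p_\psi(z_n^{V} \mid \tau_{1:n-1})\right) \Big),$$
where the log-likelihood term is computed by Equation 6 using the predicted base distribution $p(z_n^{F} \mid z_n^{V})$.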
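The per-time-step training objective combines a log-likelihood term with a KL divergence between two diagonal Gaussians. The sketch below shows these pieces in plain Python; the function names are hypothetical, the log-likelihood is passed in as a number rather than computed by a flow, and a single posterior sample is assumed.

```python
import math
import random

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for Gaussians with diagonal covariance, summed over dimensions."""
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def reparameterize(mu, var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def elbo_term(log_likelihood, mu_q, var_q, mu_p, var_p):
    """Single-sample ELBO contribution of one time-step: log p(tau | z) - KL(q || p)."""
    return log_likelihood - kl_diag_gaussians(mu_q, var_q, mu_p, var_p)

# Hypothetical posterior and prior parameters for a 2-dimensional latent code.
rng = random.Random(0)
latent = reparameterize([0.0, 1.0], [1.0, 0.5], rng)
term = elbo_term(-1.2, [0.0, 1.0], [1.0, 0.5], [0.0, 0.0], [1.0, 1.0])
```

Writing the sample as a deterministic function of the mean, variance, and noise is what allows gradients to pass through the sampling step during training.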
Implementation. We implement the inference network $q_\phi$ and the prior network $p_\psi$ with LSTM networks that encode sequences into hidden states. A multi-layer perceptron (MLP) maps the LSTM hidden states to the parameters of the latent variable distributions. We adopt the common practice of assuming that the latent variable follows a diagonal Gaussian distribution. The time decoder is an MLP that decodes the latent variable into the parameters of a one-dimensional Gaussian distribution, which is then transformed into a conditional distribution over inter-arrival times by the CNF.
To show the effectiveness of our non-parametric approach, we evaluated the performance of our model on both synthetic and real-world datasets and compared it with state-of-the-art models.
Synthetic Datasets We created three types of synthetic datasets as follows: (I) Inhomogeneous Poisson Process (IP) defines intensity as a function of time but independent of history. We simulate sequences of the IP with where , , and . (II) Self-exciting Process (SE) assumes that occurrence of an event increases the probability of other events happening in the near future. It is characterized with where in our data simulation , , and . (III) IP + SE is created by combining simulated data from Self-exciting Process and Inhomogeneous Process. For each process, we run the simulations for 60 steps and generate 20000 sequences.
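A self-exciting process of the kind described above can be simulated with a thinning procedure. The sketch below assumes an exponential kernel, $\lambda(t) = \mu + \alpha \sum_{t_i < t} \exp(-\beta (t - t_i))$, and uses placeholder parameter values; the kernel and parameter values used for the paper's datasets are not recoverable from the text here, so everything numeric in this example is hypothetical. Reading "60 steps" as 60 events per sequence is also an assumption.

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, num_events, rng):
    """Thinning for lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i))."""
    events = []
    t = 0.0
    excitation = 0.0  # alpha * sum_i exp(-beta * (t - t_i)) at the current time t
    while len(events) < num_events:
        # Between events the intensity only decays, so its current value is an upper bound.
        upper = mu + excitation
        wait = rng.expovariate(upper)
        excitation *= math.exp(-beta * wait)
        t += wait
        # Accept the candidate with probability lambda(t) / upper.
        if rng.random() * upper <= mu + excitation:
            events.append(t)
            excitation += alpha  # each accepted event raises the intensity
    return events

# Placeholder parameters (not the paper's values).
rng = random.Random(0)
arrival_times = simulate_hawkes(mu=1.0, alpha=0.8, beta=1.0, num_events=60, rng=rng)
inter_arrival = [b - a for a, b in zip([0.0] + arrival_times[:-1], arrival_times)]
```

The inter-arrival times computed on the last line are the quantities the model is trained on.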
Real-world Datasets. We also evaluated our models on real datasets that cover the areas of healthcare, social media, and human activity: (I) LinkedIn: The LinkedIn data is collected from over 3000 LinkedIn accounts and records the times when the users changed their jobs. (II) MIMIC: MIMIC-III (Medical Information Mart for Intensive Care III) (Johnson et al., 2016; Pollard, 2016) is a publicly available, large dataset containing the hospital admission times of more than 40000 anonymous patients. (III) Breakfast: The Breakfast dataset (Kuehne et al., 2014) contains 1712 videos with 48 classes of actions in breakfast preparation. For our model to learn a more meaningful latent space on this dataset, we extend our approach to a marked point process, which predicts both the inter-arrival time and the category of the next event. Accordingly, the log-likelihood of action prediction is added to the training objective and the evaluation criterion. For the Breakfast dataset, we use the standard train and test split proposed by Kuehne et al. (2014). All other datasets are split into train, validation, and test sets with ratios of 0.7, 0.1, and 0.2.
Baseline. We compare our model with the recently proposed APP-VAE (Mehrasa et al., 2019), which is a latent variable framework for modeling marked temporal point processes. APP-VAE models the time distribution by learning the conditional intensity in a probabilistic framework and models action-category data with a multinomial distribution. We used the original setup to compare APP-VAE with our model on the Breakfast dataset; for the rest of the datasets, we modified APP-VAE to predict the time distribution only.
Quantitative Comparison. We report the IWAE bound, which is a lower bound on the true log-likelihood. To compute IWAE at time-step $n$, we draw 1500 samples from the posterior distribution $q_\phi(z_n^{V} \mid \tau_{1:n})$. We also report the mean absolute error (MAE) to evaluate the performance of our model in predicting future events. The MAE is computed between samples of the predicted time distribution and the ground truth. To compute MAE at time-step $n$, we draw 100 samples from the prior distribution $p_\psi(z_n^{V} \mid \tau_{1:n-1})$ and 15 samples from each predicted base distribution $p(z_n^{F} \mid z_n^{V})$. The corresponding samples of the predicted inter-arrival time distribution are obtained using Equation 2. For the Breakfast dataset, in addition to MAE, we also report the accuracy of predicting the category of the next action. Similarly, we use 100 samples, and for each predicted distribution we select the action category with maximum probability as the predicted class. For each time-step, the most frequently predicted category is reported as the model's prediction. For IWAE, MAE, and accuracy, the average over all time-steps of all sequences is reported. The results are shown in Tab. 1 and Tab. 2. Our model (PPF) outperforms the APP-VAE model across all datasets on IWAE. The results indicate that PPF is better at modeling point process sequence data, especially real-world data with complicated underlying distributions. The better log-likelihood estimates on real-world data are confirmed by lower MAE, which reflects the better quality of the samples generated by PPF.
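The two reported metrics can be sketched as follows. The IWAE estimate averages importance weights in log space using a numerically stable log-mean-exp, and the MAE averages absolute deviations of sampled inter-arrival times from the ground truth. The function names and the way log-probabilities are passed in are assumptions for illustration; the paper's evaluation code is not shown in the text.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def iwae_bound(log_joint, log_posterior):
    """IWAE estimate from K posterior samples z_k.

    log_joint[k]     = log p(tau | z_k) + log p(z_k | history)
    log_posterior[k] = log q(z_k | history, tau)
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_posterior)]
    return log_mean_exp(log_weights)

def mean_absolute_error(samples, target):
    """Average absolute deviation of sampled inter-arrival times from the ground truth."""
    return sum(abs(s - target) for s in samples) / len(samples)

# Hypothetical log-probabilities for K = 3 samples at one time-step.
bound = iwae_bound([-3.0, -1.0, -2.0], [-0.5, -0.2, -0.1])
mae = mean_absolute_error([1.0, 3.0], 2.0)
```

By Jensen's inequality the log-mean-exp of the weights is never smaller than their plain average, which is why the IWAE bound is at least as tight as the single-sample ELBO.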
In this paper, we proposed PPF, an intensity-free framework that directly models the point process as a non-parametric distribution by utilizing normalizing flows. The proposed model is capable of capturing complex time distributions as well as performing stochastic future prediction.
- Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583, 2018.
- Daley & Vere-Jones (2007) Daryl J Daley and David Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media, 2007.
- Du et al. (2016) Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564. ACM, 2016.
- Grathwohl et al. (2019) Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJxgknCcK7.
- Hawkes (1971) Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 1971.
- Isham & Westcott (1979) Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic Processes and their Applications, 1979.
- Jaini et al. (2019) Priyank Jaini, Ivan Kobyzev, Marcus Brubaker, and Yaoliang Yu. Tails of triangular flows. arXiv preprint arXiv:1907.04481, 2019.
- Jing & Smola (2017) How Jing and Alexander J Smola. Neural survival recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 515–524. ACM, 2017.
- Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.
- Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
- Kingman (1992) J.F.C. Kingman. Poisson Processes. Oxford Studies in Probability. Clarendon Press, 1992. ISBN 9780191591242.
- Kuehne et al. (2014) Hilde Kuehne, Ali Arslan, and Thomas Serre. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In
- Li et al. (2018) Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- Mehrasa et al. (2019) Nazanin Mehrasa, Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, and Greg Mori. A Variational Auto-Encoder Model for Stochastic Point Processes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Mei & Eisner (2017) Hongyuan Mei and Jason Eisner. The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Pollard (2016) Tom J Pollard and Alistair EW Johnson. The MIMIC-III clinical database. http://dx.doi.org/10.13026/C2XW26, 2016.
- Xiao et al. (2017) Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems (NeurIPS), 2017.