1 Introduction
Temporal point processes (TPPs) provide an effective mathematical framework for modeling event sequence data. Event sequences are common in a large spectrum of areas, for example, patient visits to hospitals, online searches, user behavior on social media, credit card transactions, etc.
A TPP can be defined as a stochastic process whose realizations consist of a list of isolated events with their corresponding arrival times. The arrival times can either be real numbers from an index set (predefined from prior knowledge) or samples drawn according to an intensity function. The key issue in modeling a TPP is finding an effective probabilistic model that can capture the distribution over arrival times.
However, it is usually not trivial to come up with a simple yet powerful intensity function for this purpose. Various handcrafted designs of intensity functions have been investigated in this line of literature (Kingman, 1992; Hawkes, 1971; Isham & Westcott, 1979), but the restrictive parametric assumptions made by such frameworks limit their capacity to model more complex processes.
Lately, modeling intensity functions using recurrent neural networks (RNNs) has received much attention
(Du et al., 2016; Mei & Eisner, 2017; Jing & Smola, 2017; Mehrasa et al., 2019). However, all of these approaches rely on explicit parametric modeling of a point process distribution using the intensity, and it can be hard to find a good functional form when the underlying distribution is highly complex. Most recently, there have been efforts to model point processes without specifying the intensity.
Xiao et al. (2017) introduced an intensity-free model that learns the distribution of point processes under the Wasserstein distance by utilizing generative adversarial networks. Li et al. (2018) proposed to model TPPs using reinforcement learning by treating future event prediction as actions taken by an agent, so that learning the intensity function becomes equivalent to policy learning.
Following this trend, we propose another intensity-free point process model with a new perspective based on continuous normalizing flows. The proposed Point Process Flow (PPF) model utilizes a recurrent variational autoencoder to encode the history of a given event sequence and make probabilistic predictions about the next event, while preserving the non-parametric character of point process distributions via normalizing flow. With such a setup, the predicted non-parametric point process distribution is capable of capturing complex time distributions of arbitrary form, leading to more accurate modeling of event sequences.
2 Proposed Method
2.1 Temporal Point Process
A temporal point process (TPP; Daley & Vere-Jones (2007)) is a stochastic process whose realization is a sequence of discrete events $\{t_1, t_2, \dots\}$, where $t_i \in \mathbb{R}^+$ represents the (absolute) starting time of the $i$-th event. Let $\mathcal{H}_t = \{t_i : t_i < t\}$ be the history of past events up to time $t$. A temporal point process is usually modeled by specifying the conditional intensity function $\lambda(t \mid \mathcal{H}_t)$, which encodes the expected rate of events happening in a small area around $t$. Using the intensity, the probability density function of the next event time, given the last event at $t_n$, can be defined as:

$$p(t \mid \mathcal{H}_t) = \lambda(t \mid \mathcal{H}_t)\, \exp\!\left(-\int_{t_n}^{t} \lambda(s \mid \mathcal{H}_s)\, ds\right) \qquad (1)$$
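The intensity-to-density relation in Equation 1 can be evaluated numerically. The following is a minimal sketch (not part of the proposed model): `intensity` is treated as a plain function of absolute time, ignoring history dependence, and the integral is approximated with the trapezoid rule.

```python
import math

def next_event_density(intensity, t_last, t, n_grid=1000):
    """Density of the next event time per Eq. 1:
    p(t | H) = lambda(t) * exp(-integral_{t_last}^{t} lambda(s) ds).
    The integral is approximated with the trapezoid rule."""
    h = (t - t_last) / n_grid
    grid = [t_last + i * h for i in range(n_grid + 1)]
    vals = [intensity(s) for s in grid]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return intensity(t) * math.exp(-integral)
```

For a constant intensity $\lambda$, this recovers the exponential density $\lambda e^{-\lambda \tau}$, as expected for a homogeneous Poisson process.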
In this work, we propose an intensity-free flow framework to model the timing of events in point process sequences. More specifically, we learn a non-parametric distribution over the timing of asynchronous event sequences by transforming a simple base probability density through a continuous normalizing flow, i.e., a series of invertible transformations. With our model, we are able to generate arbitrarily complex point process distributions while making no assumption on the functional form of the distribution.
2.2 Flow for Non-Parametric Temporal Point Processes
Problem Definition.
Let the input be a sequence of asynchronous events $\{t_1, \dots, t_N\}$, where $t_n$ represents the starting time of the $n$-th event. We define the inter-arrival time $\tau_n = t_n - t_{n-1}$ as the time difference between the starting times of events $n-1$ and $n$. Our goal is to model the distribution over the inter-arrival time given the past history of event inter-arrival times, i.e., learning to model the conditional distribution $p(\tau_n \mid \tau_{1:n-1})$.
Our goal is to construct the distribution over the inter-arrival time by transforming a base distribution with a simple form through normalizing flow transformations. At timestep $n$ of the sequence, we assume that the inter-arrival time is generated as follows:

$$\tau_n = f(z_n), \qquad z_n \sim p_z(z) \qquad (2)$$
where $z_n$ is a random variable sampled from the base distribution $p_z(z)$ with a simple form, e.g. a standard Gaussian, and the transformation $f$ is a bijection with inverse $g = f^{-1}$. Using the change-of-variables formula, we can write the distribution over the inter-arrival time as:

$$p(\tau_n) = p_z(z_n)\, \left| \frac{\partial f(z_n)}{\partial z_n} \right|^{-1} \qquad (3)$$

$$\log p(\tau_n) = \log p_z(z_n) - \log \left| \frac{\partial f(z_n)}{\partial z_n} \right| \qquad (4)$$

where $z_n = g(\tau_n)$, and the scalar value $\partial f(z_n) / \partial z_n$ is the Jacobian of $f$ at $z_n$, which captures the change in density when moving from $z_n$ to $\tau_n$. We drop the determinant in the change-of-variables formula because, in our case, the inter-arrival time is a one-dimensional variable.
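The one-dimensional change of variables can be sketched concretely. The example below is illustrative rather than the learned flow of the model: it uses the fixed bijection $f(z) = e^z$, which maps a standard Gaussian base onto positive inter-arrival times and yields the familiar log-normal density.

```python
import math

def flow_log_prob(tau, g, dg_dtau, base_log_prob):
    """1-D change of variables (Eqs. 3-4):
    log p(tau) = log p_z(g(tau)) + log |dg/dtau|, where g = f^{-1}."""
    z = g(tau)
    return base_log_prob(z) + math.log(abs(dg_dtau(tau)))

# Standard normal base density in log space.
std_normal_log_prob = lambda z: -0.5 * z * z - 0.5 * math.log(2 * math.pi)

# f(z) = exp(z), so g = log and dg/dtau = 1/tau; p(tau) is log-normal.
log_p = flow_log_prob(1.0, math.log, lambda t: 1.0 / t, std_normal_log_prob)
```

At $\tau = 1$ the inverse maps to $z = 0$ and the Jacobian term vanishes, so the result equals the standard normal log-density at zero.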
We build our model on the recently proposed continuous normalizing flow (CNF) of Chen et al. (2018) and Grathwohl et al. (2019). They proposed the neural ODE, where the continuous dynamics of a hidden state $z(t)$ are parameterized by an ordinary differential equation $\frac{\partial z(t)}{\partial t} = g(z(t), t)$ specified by a neural network $g$. Following the neural ODE perspective, the change in log-density can be computed as an integral of the continuous-time dynamics:

$$\log p(z(t_1)) = \log p(z(t_0)) - \int_{t_0}^{t_1} \frac{\partial g}{\partial z(t)}\, dt \qquad (5)$$

where $p(z(t_1))$ is our target distribution $p(\tau_n)$.
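The mechanics of Equation 5 can be sketched with a fixed-step Euler integrator. Neural ODE implementations use an adaptive solver and a learned network for $g$; here $g$ is a hand-picked linear function with a known analytic solution, purely to illustrate the joint integration of the state and its log-density.

```python
import math

def cnf_transform(z0, logp0, g, dg_dz, t0=0.0, t1=1.0, steps=1000):
    """Jointly integrate dz/dt = g(z, t) and, per Eq. 5 in one dimension,
    d(log p)/dt = -dg/dz, using fixed-step Euler (a sketch of the mechanics,
    not the adaptive ODE solver used by neural ODEs)."""
    z, logp = z0, logp0
    h = (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * h
        # Tuple assignment evaluates both updates at the old state.
        z, logp = z + h * g(z, t), logp - h * dg_dz(z, t)
    return z, logp

# Linear dynamics g(z, t) = 0.5 * z have the analytic solution
# z(t1) = z0 * exp(0.5) and log p(t1) = log p(t0) - 0.5.
z1, logp1 = cnf_transform(1.0, 0.0, lambda z, t: 0.5 * z, lambda z, t: 0.5)
```

The Euler solution approaches the analytic one as the step count grows, which is a quick sanity check when implementing the integral of Equation 5.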
The current formulation models the inter-arrival distribution at each timestep independently of the past history. However, the timing of future events may depend on previous events in a very complex way, so it is important to make use of the history information when modeling future events. To capture this dependency, we adapt our flow model by learning a time-dependent base distribution conditioned on the history. In the next section, we show how we employ our flow module in a probabilistic framework that encodes history into the generation process of the flow's base distributions.
2.3 Base Distribution with Probabilistic Parameters
It is known that there is a tradeoff between the complexity of the bijective transformation and the form of base distribution (Jaini et al., 2019). With the complexity of the bijective transformation fixed, a more flexible base distribution will lead to a more expressive model. Motivated by this, we combine our flow module with a variational autoencoder (VAE; Kingma & Welling 2014) framework in order to achieve flexible base distributions and make conditional predictions.
To avoid confusion, at timestep $n$, we use the notation $y_n$ for the random variable of the normalizing-flow base distribution and $z_n$ to refer to the VAE latent variable. We start by explaining the generation phase, i.e., how distributions over the inter-arrival time are generated by stacking the normalizing flow module on top of the VAE backbone, and then describe the training process.
Generation. Figure 1 illustrates an overview of the generation process. Here, we adopt a recurrent VAE framework consisting of a time-variant prior network, parameterized by $\psi$, which takes the history of past events and provides the latent distribution $p_\psi(z_n \mid \tau_{1:n-1})$. Then, a sample $z_n$ of this distribution is passed to the VAE's decoder, which produces a non-parametric distribution over the inter-arrival time by first generating the normalizing-flow base distribution $p(y_n \mid z_n)$ and then transforming it through the flow transformation $f$. By applying the change-of-variables formula discussed in Equation 5, we can write the distribution over the inter-arrival time as:

$$\log p(\tau_n \mid z_n) = \log p(y_n \mid z_n) - \int_{t_0}^{t_1} \frac{\partial g}{\partial y(t)}\, dt \qquad (6)$$
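The generation pipeline (prior sample, decoder-produced base Gaussian, flow transform) can be sketched in a few lines. The `decoder` and `flow` arguments are hypothetical stand-ins for the learned networks; the exponential flow in the example is only a placeholder for the trained CNF.

```python
import math
import random

def generate_tau(prior_mu, prior_sigma, decoder, flow, rng=random.Random(0)):
    """One generation step (a sketch): sample a latent z from the prior,
    let the decoder map z to the parameters of the 1-D base Gaussian over y,
    then push a base sample through the flow to get an inter-arrival time."""
    z = rng.gauss(prior_mu, prior_sigma)  # z_n ~ p_psi(z_n | history)
    mu_y, sigma_y = decoder(z)            # base distribution p(y_n | z_n)
    y = rng.gauss(mu_y, sigma_y)          # y_n ~ N(mu_y, sigma_y^2)
    return flow(y)                        # tau_n = f(y_n)
```

A positive-valued flow such as the exponential placeholder guarantees the sampled inter-arrival time is positive, which any chosen flow must also ensure.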
Training. At timestep $n$ of training, the VAE module takes the sequence of inter-arrival times $\tau_{1:n}$ to approximate the true distribution over the latent space with the help of the recurrent inference network, which is parameterized by $\phi$. A time-dependent prior network is also adopted to help the model make use of history information in the generation phase. Both prior and posterior distributions are assumed to follow conditional multivariate Gaussian distributions with diagonal covariance:

$$p_\psi(z_n \mid \tau_{1:n-1}) = \mathcal{N}\big(\mu_{\psi,n}, \operatorname{diag}(\sigma_{\psi,n}^2)\big) \qquad (7)$$

$$q_\phi(z_n \mid \tau_{1:n}) = \mathcal{N}\big(\mu_{\phi,n}, \operatorname{diag}(\sigma_{\phi,n}^2)\big) \qquad (8)$$
At each timestep during training, a latent code $z_n$ is drawn from the posterior and passed to the decoder, which generates a distribution over the inter-arrival time by first generating the base distribution of the flow and then transforming it through the flow transformations. The VAE backbone is jointly trained with the flow module by optimizing the variational lower bound using the reparameterization trick (Kingma & Welling, 2014):

$$\mathcal{L} = \sum_{n} \mathbb{E}_{q_\phi(z_n \mid \tau_{1:n})}\!\big[\log p(\tau_n \mid z_n)\big] - D_{\mathrm{KL}}\big(q_\phi(z_n \mid \tau_{1:n}) \,\|\, p_\psi(z_n \mid \tau_{1:n-1})\big) \qquad (9)$$

where the log-likelihood term is computed by Equation 6 using the predicted base distribution $p(y_n \mid z_n)$.
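One timestep of the objective in Equation 9 can be sketched as follows, with a one-dimensional latent for clarity. The `decoder_log_lik(tau, z)` callable is a hypothetical stand-in for the flow-transformed likelihood of Equation 6; the KL term between the two diagonal Gaussians of Equations 7 and 8 has the standard closed form.

```python
import math
import random

def elbo_term(tau, mu_q, logvar_q, mu_p, logvar_p, decoder_log_lik,
              n_samples=1, rng=random.Random(0)):
    """One timestep of Eq. 9: E_q[log p(tau | z)] - KL(q || p).
    The expectation is estimated with the reparameterization trick."""
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu_q + math.exp(0.5 * logvar_q) * eps  # z = mu + sigma * eps
        recon += decoder_log_lik(tau, z) / n_samples
    # Closed-form KL divergence between two 1-D Gaussians.
    kl = 0.5 * (logvar_p - logvar_q
                + (math.exp(logvar_q) + (mu_q - mu_p) ** 2) / math.exp(logvar_p)
                - 1.0)
    return recon - kl
```

When the posterior matches the prior, the KL term vanishes and the bound reduces to the expected reconstruction log-likelihood, a useful unit test for the implementation.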
Implementation. We implement the inference network and the prior network as LSTM networks that encode sequences into hidden states. A multi-layer perceptron (MLP) maps the LSTM hidden states to the parameters of the latent variable distributions. We adopt the common practice of assuming the latent variable follows a diagonal Gaussian distribution. The time decoder is an MLP which decodes the latent variable into the parameters of a one-dimensional Gaussian distribution, which is later transformed into a conditional distribution over inter-arrival times by the CNF.
3 Evaluation
To show the effectiveness of our non-parametric approach, we evaluated the performance of our model on both synthetic and real-world datasets and compared it with state-of-the-art models.
Synthetic Datasets. We created three types of synthetic datasets as follows: (I) Inhomogeneous Poisson Process (IP) defines the intensity as a function of time that is independent of history; we simulate IP sequences from a predefined time-varying intensity function $\lambda(t)$. (II) Self-exciting Process (SE) assumes that the occurrence of an event increases the probability of further events happening in the near future; it is characterized by a Hawkes-style conditional intensity of the form $\lambda(t) = \mu + \alpha \sum_{t_i < t} \exp(-(t - t_i))$, with fixed parameters in our simulation. (III) IP + SE is created by combining simulated data from the Self-exciting Process and the Inhomogeneous Poisson Process. For each process, we run the simulations for 60 steps and generate 20,000 sequences.
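Sequences with a bounded time-varying intensity can be simulated by thinning. The sketch below is illustrative (constant-intensity example, hypothetical helper names) and is not necessarily the exact simulator used for the datasets above; a self-exciting intensity would additionally need access to the accepted events so far.

```python
import random

def simulate_by_thinning(intensity, intensity_max, t_end, rng=random.Random(0)):
    """Thinning: propose candidates from a homogeneous Poisson process at rate
    intensity_max and keep each with probability intensity(t) / intensity_max.
    Requires intensity(t) <= intensity_max on [0, t_end]."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(intensity_max)      # candidate arrival
        if t > t_end:
            return events
        if rng.random() < intensity(t) / intensity_max:
            events.append(t)                     # accepted event

# Constant intensity of 2 events per unit time over [0, 100].
events = simulate_by_thinning(lambda t: 2.0, 2.0, 100.0)
```

With a constant intensity every candidate is accepted, recovering an ordinary homogeneous Poisson sample.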
Real-world Datasets. We also evaluated our models on real datasets covering healthcare, social media, and human activity: (I) LinkedIn: the LinkedIn data is collected from over 3,000 LinkedIn accounts and records the times when users changed jobs. (II) MIMIC: MIMIC-III (Medical Information Mart for Intensive Care III) (Johnson et al., 2016; Pollard, 2016) is a large, publicly available dataset containing the hospital admission times of more than 40,000 anonymous patients. (III) Breakfast: the Breakfast dataset (Kuehne et al., 2014) contains 1,712 videos with 48 classes of actions in breakfast preparation. For our model to learn a more meaningful latent space on this dataset, we extend our approach to a marked point process model which predicts both the inter-arrival time and the category of the next event; accordingly, the log-likelihood of action prediction is added to the training objective and the evaluation criterion. For the Breakfast dataset, we use the standard train and test split proposed by Kuehne et al. (2014). All other datasets are split into train, validation, and test sets with 0.7, 0.1, and 0.2 ratios.
Baseline. We compare our model with the recently proposed APP-VAE (Mehrasa et al., 2019), a latent variable framework for modeling marked temporal point processes. APP-VAE models the time distribution by learning the conditional intensity in a probabilistic framework, and models action-category data with a multinomial distribution. We used their original setup to compare APP-VAE with our model on the Breakfast dataset, but for the remaining datasets, we modified their model to predict the time distribution only.
Quantitative Comparison. We report the IWAE bound, which is a lower bound on the true log-likelihood. To compute IWAE at timestep $n$, we draw 1500 samples from the posterior distribution $q_\phi(z_n \mid \tau_{1:n})$. We also report the mean absolute error (MAE) to evaluate the performance of our model in predicting future events; the MAE between samples of the predicted time distribution and the ground truth is reported. To compute MAE at timestep $n$, we draw 100 samples from the prior distribution $p_\psi(z_n \mid \tau_{1:n-1})$ and 15 samples from each predicted base distribution $p(y_n \mid z_n)$; the corresponding samples of the predicted inter-arrival time distribution are obtained using Equation 2. For the Breakfast dataset, in addition to MAE, we also report the accuracy of predicting the category of the next action. Similarly, we use 100 samples, and for each predicted distribution, we select the action category with maximum probability as the predicted class; for each timestep, the most frequently predicted type is reported as the model's prediction. For IWAE, MAE, and accuracy, the average over all timesteps of all sequences is reported. The results are shown in Tab. 1 and Tab. 2. Our model (PPF) outperforms the APP-VAE model across all datasets on IWAE. The results indicate the better capability of PPF at modeling point process sequence data, especially real-world data with complicated underlying distributions. The better log-likelihood estimates on real-world data are confirmed by lower MAE, which reflects the better quality of samples generated by PPF.
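The sampling-based MAE described above reduces to a simple Monte Carlo estimate. In this sketch, the hypothetical `sample_tau(rng)` callable stands in for the full prior-decoder-flow sampling pipeline of Equation 2.

```python
import random
import statistics

def mae_from_samples(sample_tau, tau_true, n_samples=100, rng=random.Random(0)):
    """Mean absolute error between samples of the predicted inter-arrival
    distribution and the ground-truth inter-arrival time."""
    draws = [sample_tau(rng) for _ in range(n_samples)]
    return statistics.mean(abs(d - tau_true) for d in draws)
```

For a degenerate predictor that always outputs a constant, the MAE is simply the absolute gap to the ground truth, a convenient sanity check.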
4 Conclusion
In this paper, we proposed PPF, an intensity-free framework that directly models the point process as a non-parametric distribution by utilizing normalizing flows. The proposed model is capable of capturing complex time distributions as well as performing stochastic future prediction.
References
 Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583, 2018.
Daley & Vere-Jones (2007) Daryl J Daley and David Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media, 2007.

Du et al. (2016) Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564. ACM, 2016.

Grathwohl et al. (2019) Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJxgknCcK7.
Hawkes (1971) Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 1971.
Isham & Westcott (1979) Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic Processes and their Applications, 1979.
 Jaini et al. (2019) Priyank Jaini, Ivan Kobyzev, Marcus Brubaker, and Yaoliang Yu. Tails of triangular flows. arXiv preprint arXiv:1907.04481, 2019.
 Jing & Smola (2017) How Jing and Alexander J Smola. Neural survival recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 515–524. ACM, 2017.
Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
 Kingman (1992) J.F.C. Kingman. Poisson Processes. Oxford Studies in Probability. Clarendon Press, 1992. ISBN 9780191591242.

Kuehne et al. (2014) Hilde Kuehne, Ali Arslan, and Thomas Serre. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

Li et al. (2018) Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Mehrasa et al. (2019) Nazanin Mehrasa, Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, and Greg Mori. A Variational Auto-Encoder Model for Stochastic Point Processes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Mei & Eisner (2017) Hongyuan Mei and Jason Eisner. The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
Pollard (2016) Tom J Pollard and Alistair EW Johnson. The MIMIC-III clinical database. http://dx.doi.org/10.13026/C2XW26, 2016.
 Xiao et al. (2017) Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems (NeurIPS), 2017.