Intensity-Free Learning of Temporal Point Processes

09/26/2019 ∙ by Oleksandr Shchur, et al. ∙ Technische Universität München 0

Temporal point processes are the dominant paradigm for modeling sequences of events happening at irregular intervals. The standard way of learning in such models is by estimating the conditional intensity function. However, parameterizing the intensity function usually incurs several trade-offs. We show how to overcome the limitations of intensity-based approaches by directly modeling the conditional distribution of inter-event times. We draw on the literature on normalizing flows to design models that are flexible and efficient. We additionally propose a simple mixture model that matches the flexibility of flow-based models, but also permits sampling and computing moments in closed form. The proposed models achieve state-of-the-art performance in standard prediction tasks and are suitable for novel applications, such as learning sequence embeddings and imputing missing data.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


Implementation of "Intensity-Free Learning of Temporal Point Processes"

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Visits to hospitals, purchases in e-commerce systems, financial transactions, posts in social media — various forms of human activity can be represented as discrete events happening at irregular intervals. The framework of temporal point processes is a natural choice for modeling such data. By combining temporal point process models with deep learning, we can design algorithms able to learn complex behavior from real-world data.

Designing such models, however, usually involves trade-offs along the following dimensions: flexibility (can the model approximate any distribution?), efficiency (can the likelihood function be evaluated in closed form?), and ease of use (is sampling and computing summary statistics easy?). Existing methods (Du et al., 2016; Mei and Eisner, 2017; Omi et al., 2019) that are defined in terms of the conditional intensity function typically fall short in at least one of these categories.

Instead of modeling the intensity function, we suggest treating the problem of learning in temporal point processes as an instance of conditional density estimation. By using tools from neural density estimation (Bishop, 1994; Rezende and Mohamed, 2015), we can develop methods that have all of the above properties. To summarize, our contributions are the following:

  • We connect the fields of temporal point processes and neural density estimation. We show how normalizing flows can be used to define flexible and theoretically sound models for learning in temporal point processes.

  • We propose a simple mixture model that performs on par with the state-of-the-art methods. Thanks to its simplicity, the model permits closed-form sampling and moment computation.

  • We show through a wide range of experiments how the proposed models can be used for prediction, conditional generation, sequence embedding and training with missing data.

2 Background

Fully NN
Closed-form likelihood
Closed-form sampling
Table 1: Comparison of neural temporal point process models that encode history with an RNN.

Definition. A temporal point process (TPP) is a random process whose realizations consist of a sequence of strictly increasing arrival times . A TPP can equivalently be represented as a sequence of strictly positive inter-event times . Representations in terms of and are isomorphic — we will use them interchangeably throughout the paper. The traditional way of specifying the dependency of the next arrival time on the history is using the conditional intensity function . Here, the symbol reminds us of dependence on

. Given the conditional intensity function, we can obtain the conditional probability density function (PDF) of the time

until the next event by integration (Rasmussen, 2011) as .

Learning temporal point processes. Conditional intensity functions provide a convenient way to specify point processes with a simple predefined behavior, such as self-exciting (Hawkes, 1971) and self-correcting (Isham and Westcott, 1979) processes. Intensity parametrization is also commonly used when learning a model from the data: Given a parametric intensity function and a sequence of observations , the parameters can be estimated by maximizing the log-likelihood: .

The main challenge of such intensity-based approaches lies in choosing a good parametric form for . This usually involves the following trade-off: For a ”simple” intensity function (Du et al., 2016; Huang et al., 2019), the integral has a closed form, which makes the log-likelihood easy to compute. However, such models usually have limited expressiveness. A more sophisticated intensity function (Mei and Eisner, 2017) can better capture the dynamics of the system, but computing log-likelihood will require approximating the integral using Monte Carlo.

Recently, Omi et al. (2019)

proposed fully neural network intensity function (FullyNN) — a flexible, yet computationally tractable model for TPPs. The key idea of their approach is to model the cumulative conditional intensity function

using a neural network, which allows to efficiently compute the log-likelihood. Still, in its current state, the model has downsides: it doesn’t define a valid PDF, sampling is expensive, and the expectation cannot be computed in closed form111A more detailed discussion of the FullyNN model follows in Section 4 and Appendix C..

This work. We show that the drawbacks of the existing approaches can be remedied by looking at the problem of learning in TPPs from a different angle. Instead of modeling the conditional intensity , we suggest to directly learn the conditional distribution . Modeling distributions with neural networks is a well-researched topic, that, surprisingly, is not usually discussed in the context of TPPs. By adopting this alternative point of view, we are able to develop new theoretically sound and effective methods (Section 3), as well as better understand the existing approaches (Section 4).

3 Models

We develop several approaches for modeling the distribution of inter-event times. First, we assume for simplicity that each inter-event time is conditionally independent of the history, given the model parameters (that is, ). In Section 3.1, we show how state-of-the-art neural density estimation methods based on normalizing flows can be used to model . Then in Section 3.2, we propose a simple mixture model that can match the performance of the more sophisticated flow-based models, while also addressing some of their shortcomings. Finally, we discuss how to make depend on the history in Section 3.3.

3.1 Modeling with normalizing flows

The core idea of normalizing flows (Tabak and Turner, 2013; Rezende and Mohamed, 2015)

is to define a flexible probability distribution by transforming a simple one. Assume that

has a PDF . Let for some differentiable invertible transformation (where )222All definitions can be extended to for . We consider the one-dimensional case since our goal is to model the distribution of inter-event times .. We can obtain the PDF of using the change of variables formula as . By stacking multiple transformations , we obtain an expressive probability distribution . To draw a sample , we need to draw and compute the forward transformation . To get the density of an arbitrary point , it is necessary to evaluate the inverse transformation and compute . Modern normalizing flows architectures parametrize the transformations using extremely flexible functions , such as polynomials (Jaini et al., 2019) or neural networks (Krueger et al., 2018). The flexibility of these functions comes at a cost — while the inverse exists, it typically doesn’t have a closed form. That is, if we use such a function to define one direction of the transformation in a flow model, the other direction can only be approximated numerically using iterative root-finding methods (Ho et al., 2019).

In the context of TPPs, our goal is to model the distribution of inter-event times. In order to be able to learn the parameters of using maximum likelihood, we need to be able to evaluate the density at any point . For this we need to define the inverse transformation . First, we set to convert a positive into . Then, we stack multiple layers of parametric functions that can approximate any transformation. We consider two choices for : deep sigmoidal flow (DSF) from Krueger et al. (2018) and sum-of-squares (SOS) polynomial flow from Jaini et al. (2019)


where are the transformation parameters, is the number of components, is the polynomial degree, and . We denote the two variants of the model based on and building blocks as DSFlow and SOSFlow respectively. Finally, after stacking multiple , we apply a sigmoid transformation to convert into .

For both models, we can evaluate the inverse transformations , which means the model can be efficiently trained via maximum likelihood. The density defined by either DSFlow or SOSFlow model is extremely flexible and can approximate any distribution (Section 3.4). However, for some use cases, this is not sufficient. For example, we may be interested in the expected time until the next event, . In this case, flow-based models are not optimal, since for them does not in general have a closed form. Moreover, the forward transformation cannot be computed in closed form since the functions and cannot be inverted analytically. Therefore, sampling from is also problematic and requires iterative root finding.

This raises the question: Can we design a model for that is as expressive as the flow-based models, but in which sampling and computing moments is easy and can be done in closed form?

3.2 Modeling with mixture distributions

Model definition. While mixture models are commonly used for clustering, they can also be used for density estimation. Mixtures work especially well in low dimensions (McLachlan and Peel, 2004), which is the case in TPPs, where we model the distribution of one-dimensional inter-event times . Since the inter-event times

are positive, we choose to use a mixture of log-normal distributions to model

. The PDF of a log-normal mixture is defined as


where are the mixture weights, are the mixture means, and

are the standard deviations. Because of its simplicity, the log-normal mixture model has a number of attractive properties.

Moments. Since each component has a finite mean, the mean of the entire distribution can be computed as , i.e., a weighted average of component means. Higher moments can be computed based on the moments of each component (Frühwirth-Schnatter, 2006).

Seq. emb.
Figure 1: Model architecture. Parameters of are generated based on the conditional information .
Figure 2: Normalizing flows define a flexible distribution via transformations.

Sampling. While flow-based models from Section 3.1 require iterative root-finding algorithms to generate samples, sampling from a mixture model can be done in closed form:


is a one-hot vector of size

. In some applications, such as reinforcement learning

(Upadhyay et al., 2018), we might be interested in computing gradients of the samples w.r.t. the model parameters. The samples drawn using the procedure above are differentiable with respect to the means and scales . By using the Gumbel-softmax trick (Jang et al., 2017) when sampling , we can obtain gradients w.r.t. all the model parameters (Appendix D.6

). Such reparametrization gradients have lower variance and are easier to implement than the score function estimators typically used in other works

(Mohamed et al., 2019). Other flexible models (such as multi-layer flow models from Section 3.1) do not permit sampling through reparametrization, and thus are not well-suited for the above-mentioned scenario. In Section 5.4, we show how reparametrization sampling can also be used to train with missing data by performing imputation on the fly.

3.3 Incorporating the conditional information

History. A crucial feature of temporal point processes is that the time until the next event may be influenced by all the events that happened before. A standard way of capturing this dependency is to process the event history

with a recurrent neural network (RNN) and embed it into a fixed-dimensional vector

(Du et al., 2016).

Conditioning on additional features. The distribution of the time until the next event might depend on factors other than the history. For instance, distribution of arrival times of customers in a restaurant depends on the day of the week. As another example, if we are modeling user behavior in an online system, we can obtain a different distribution for each user by conditioning on their metadata. We denote such side information as a vector . Such information is different from marks (Rasmussen, 2011), since (a) the metadata may be shared for the entire sequence and (b) only influences the distribution , not the objective function.

In some scenarios, we might be interested in learning from multiple event sequences. In such case, we can assign each sequence a learnable sequence embedding vector . By optimizing , the model can learn to distinguish between sequences that come from different distributions. The learned embeddings can then be used for visualization, clustering or other downstream tasks.

Obtaining the parameters. We model the conditional dependence of the distribution on all of the above factors in the following way. The history embedding , metadata and sequence embedding are concatenated into a context vector . Then, we obtain the parameters of the distribution as an affine function of . For example, for the mixture model we have


where the and transformations are applied to enforce the constraints on the distribution parameters, and are learnable parameters. Such model resembles the mixture density network architecture (Bishop, 1994). The whole process is illustrated in Figure 2. We obtain the parameters of the flow-based models in a similar way (see Appendix D).

3.4 Discussion

Universal approximation. The SOSFlow and DSFlow models can approximate any probability density on arbitrarily well (Jaini et al., 2019, Theorem 3), (Krueger et al., 2018, Theorem 4). It turns out, a mixture model has the same universal approximation (UA) property.

Theorem 1 (DasGupta, 2008, Theorem 33.2). Let be a continuous density on . If is any density on and is also continuous, then, given and a compact set , there exists a finite mixture of the form such that .

This results shows that, in principle, the mixture distribution is as expressive as the flow-based models. Since we are modeling the conditional density, we additionally need to assume for all of the above models that the RNN can encode all the relevant information into the history embedding . This can be accomplished by invoking the universal approximation theorems for RNNs (Siegelmann and Sontag, 1992; Schäfer and Zimmermann, 2006).

Note that this result, like other UA theorems of this kind (Cybenko, 1989; Daniels and Velikova, 2010), does not provide any practical guarantees on the obtained approximation quality, and doesn’t say how to learn the model parameters. Still, UA intuitively seems like a desirable property of a distribution. This intuition is supported by experimental results. In Section 5.1, we show that models with the UA property consistently outperform the less flexible ones.

Interestingly, Theorem 1 does not make any assumptions about the form of the base density . This means we could as well use a mixture of distribution other than log-normal. However, other popular distributions on

have drawbacks: log-logistic does not always have defined moments and gamma distribution doesn’t permit straightforward sampling with reparametrization.

Intensity function.

For both flow-based and mixture models, the conditional cumulative distribution function (CDF)

and the PDF are readily available. This means we can easily compute the respective intensity functions (see Appendix A). However, we should still ask whether we lose anything by modeling instead of . The main arguments in favor of modeling the intensity function in traditional models (e.g. self-exciting process) are that it’s intuitive, easy to specify and reusable (Upadhyay and Rodriguez, 2019).

“Intensity function is intuitive, while the conditional density is not.” — While it’s true that in simple models (e.g. in self-exciting or self-correcting processes) the dependence of on the history is intuitive and interpretable, modern RNN-based intensity functions (as in Du et al. (2016); Mei and Eisner (2017); Omi et al. (2019)) cannot be easily understood by humans. In this sense, our proposed models are as intuitive and interpretable as other existing intensity-based neural network models.

is easy to specify, since it only has to be positive. On the other hand, must integrate to one.” — As we saw, by using either normalizing flows or a mixture distribution, we automatically enforce that the PDF integrates to one, without sacrificing the flexibility of our model.

Reusability: If we merge two independent point processes with intensitites and , the merged process has intensity .” — An equivalent result exists for the CDFs and of the two independent processes. The CDF of the merged process is obtained as (derivation in Appendix A).

As we just showed, modeling instead of does not impose any limitation on our approach. Moreover, a mixture distribution is flexible, easy to sample from and has well-defined moments, which favorably compares it to other intensity-based deep learning models.

4 Related work

Neural temporal point processes. Fitting simple TPP models (e.g. self-exciting (Hawkes, 1971) or self-correcting (Isham and Westcott, 1979) processes) to real-world data may lead to poor results because of model misspecification. Multiple recent works address this issue by proposing more flexible neural-network-based point process models. These neural models are usually defined in terms of the conditional intensity function. For example, Mei and Eisner (2017) propose a novel RNN architecture that can model sophisticated intensity functions. This flexibility comes at the cost of inability to evaluate the likelihood in closed form, and thus requiring Monte Carlo integration.

Du et al. (2016) suggest using an RNN to encode the event history into a vector . The history embedding is then used to define the conditional intensity, for example, using the constant intensity model (Li et al., 2018; Huang et al., 2019) or the more flexible exponential intensity model (Du et al., 2016; Upadhyay et al., 2018). By considering the conditional distribution

of the two models, we can better understand their properties. Constant intensity corresponds to an exponential distribution, and exponential intensity corresponds to a Gompertz distribution (see Appendix

B). Clearly, these unimodal distributions cannot match the flexibility of a mixture model (as can be seen in Figure 8).

Omi et al. (2019) introduce a flexible fully neural network (FullyNN) intensity model, where they model the cumulative intensity function with a neural net. The function converts

into an exponentially distributed random variable with unit rate

(Rasmussen, 2011), similarly to how normalizing flows model by converting into a random variable with a simple distribution. However, due to a suboptimal choice of the network architecture, the PDF of the FullyNN model does not integrate to 1, and the model assigns non-zero probability to negative inter-event times (see Appendix C). In contrast, SOSFlow and DSFlow always define a valid PDF on . Moreover, similar to other flow-based models, sampling from the FullyNN model requires iterative root finding.

Several works consider alternatives to the maximum likelihood objective for training TPPs. Examples include noise-contrastive estimation

(Guo et al., 2018), Wasserstein distance (Xiao et al., 2017, 2018; Yan et al., 2018), and reinforcement learning (Li et al., 2018; Upadhyay et al., 2018). This line of research is orthogonal to our contribution, and the models proposed in our work can be combined with the above-mentioned training procedures.

Neural density estimation. There exist two popular paradigms for learning flexible probability distributions using neural networks: In mixture density networks (Bishop, 1994), a neural net directly produces the distribution parameters; in normalizing flows (Tabak and Turner, 2013; Rezende and Mohamed, 2015), we obtain a complex distribution by transforming a simple one. Both mixture models (Schuster, 2000; Eirola and Lendasse, 2013; Graves, 2013) and normalizing flows (Oord et al., 2016; Ziegler and Rush, 2019) have been applied for modeling sequential data. However, surprisingly, none of the existing works make the connection and consider these approaches in the context of TPPs.

5 Experiments

We evaluate the proposed models on the established task of event time prediction (with and without marks) in Sections 5.1 and 5.2. In the remaining experiments, we show how the log-normal mixture model can be used for incorporating extra conditional information, training with missing data and learning sequence embeddings. We use 6 real-world datasets containing event data from various domains: Wikipedia (article edits), MOOC (user interaction with online course system), Reddit (posts in social media) (Kumar et al., 2019), Stack Overflow (badges received by users), LastFM (music playback) (Du et al., 2016), and Yelp (check-ins to restaurants). We also generate 5 synthetic datasets (Poisson, Renewal, Self-correcting, Hawkes1, Hawkes2), as described in Omi et al. (2019). Detailed descriptions and summary statistics of all the datasets are provided in Appendix E.

5.1 Event time prediction using history

Setup. We consider two normalizing flow models, SOSFlow and DSFlow (Equation 1), as well a log-normal mixture model (Equation 2), denoted as LogNormMix. As baselines, we consider RMTPP (i.e. Gompertz distribution / exponential intensity from Du et al. (2016)) and FullyNN model by Omi et al. (2019). Additionally, we use a single log-normal distribution (denoted LogNormal) to highlight the benefits of the mixture model. For all models, an RNN encodes the history into a vector . The parameters of are then obtained using (Equation 3). We exclude the NeuralHawkes model from our comparison, since it is known to be inferior to RMTPP in time prediction (Mei and Eisner, 2017), and, unlike other models, doesn’t have a closed-form likelihood.

Each dataset consists of multiple sequences of event times. The task is to predict the time until the next event given the history

. For each dataset, we use 60% of the sequences for training, 20% for validation and 20% for testing. We train all models by minimizing the negative log-likelihood (NLL) of the inter-event times in the training set. To ensure a fair comparison, we try multiple hyperparameter configurations for each model and select the best configuration using the validation set. Finally, we report the NLL loss of each model on the test set. All results are averaged over 10 train/validation/test splits. Details about the implementation, training process and hyperparameter ranges are provided in Appendix

D. For each real-world dataset, we report the difference between the NLL loss of each method and the LogNormMix model (Figure 3). We report the differences, since scores of all models can be shifted arbitrarily by scaling the data. Absolute scores (not differences) in a tabular format, as well as results for synthetic datasets are provided in Appendix F.1.

Results. Simple unimodal distributions (Gompertz/RMTPP, LogNormal) are always dominated by the more flexible models with the universal approximation property (LogNormMix, DSFlow, SOSFlow, FullyNN). Among the simple models, LogNormal provides a much better fit to the data than RMTPP/Gompertz. The distribution of inter-event times in real-world data often has heavy tails, and the Gompertz distributions fails to capture this behavior. We observe that the two proposed models, LogNormMix and DSFlow consistently achieve the best loss values.

Figure 3: NLL loss for event time prediction without marks (left) and with marks (right). NLL of each model is standardized by subtracting the score of LogNormMix. Lower score is better. Despite its simplicity, LogNormMix consistently achieves excellent loss values.

5.2 Learning with marks

Setup. We apply the models for learning in marked temporal point processes. Marks are known to improve performance of simpler models (Du et al., 2016), we want to establish whether our proposed models work well in this setting. We use the same setup as in the previous section, except for two differences. The RNN takes a tuple as input at each time step, where

is the mark. Moreover, the loss function now includes a term for predicting the next mark:

(implementation details in Appendix F.2).

Results. Figure 3 (right) shows the time NLL loss (i.e. ) for Reddit and MOOC datasets. LogNormMix shows dominant performance in the marked case, just like in the previous experiment. Like before, we provide the results in tabular format, as well as report the marks NLL loss in Appendix F.

5.3 Learning with additional conditional information

Setup. We investigate whether the additional conditional information (Section 3.3) can improve performance of the model. In the Yelp dataset, the task is predict the time until the next check-in for a given restaurant. We postulate that the distribution is different, depending on whether it’s a weekday and whether it’s an evening hour, and encode this information as a vector . We consider 4 variants of the LogNormMix model, that either use or don’t use and the history embedding .

Results. Figure 7 shows the test set loss for 4 variants of the model. We see that additional conditional information boosts performance of the LogNormMix model, regardless of whether the history embedding is used.

5.4 Missing data imputation

In practical scenarios, one often has to deal with missing data. For example, we may know that records were not kept for a period of time, or that the data is unusable for some reason. Since TPPs are a generative model, they provide a principled way to handle the missing data through imputation.

Setup. We are given several sequences generated by a Hawkes process, where some parts are known to be missing. We consider 3 strategies for learning from such a partially observed sequence: (a) ignore the gaps, maximize log-likelihood of observed inter-event times (b) fill the gaps with the average estimated from observed data, maximize log-likelihood of observed data, and (c) fill the gaps with samples generated by the model, maximize the expected log-likelihood of the observed points. The setup is demonstrated in Figure 4. Note that in case (c) the expected value depends on the parameters of the distribution, hence we need to perform sampling with reparametrization to optimize such loss. A more detailed description of the setup is given in Appendix F.4.

Model NLL No imputation 1.01 0.20 Mean imputation 0.81 0.27 Sampling with reparametrization 0.36 0.05
Figure 4: By sampling the missing values from during training, LogNormMix learns the true underlying data distribution. Other imputation strategies overfit the partially observed sequence.

Results. The 3 model variants are trained on the partially-observed sequence. Figure 4 shows the NLL of the fully observed sequence (not seen by any model at training time) produced by each strategy. We see that strategies (a) and (b) overfit the partially observed sequence. In contrast, strategy (c) generalizes and learns the true underlying distribution. The ability of the LogNormMix model to draw samples with reparametrization was crucial to enable such training procedure.

5.5 Sequence embedding

Different sequences in the dataset might be generated by different processes, and exhibit different distribution of inter-event times. We can ”help” the model distinguish between them by assigning a trainable embedding vector to each sequence in the dataset. It seems intuitive that embedding vectors learned this way should capture some notion of similarity between sequences.

Learned sequence embeddings. We learn a sequence embedding for each of the sequences in the synthetic datasets (along with other model parameters). We visualize the learned embeddings using t-SNE (Maaten and Hinton, 2008) in Figure 7 colored by the true class. As we see, the model learns to differentiate between sequences from different distributions in a completely unsupervised way.

Generation. We fit the LogNormMix model to two sequences (from self-correcting and renewal processes), and, respectively, learn two embedding vectors and . After training, we generate 3 sequences from the model, using , and as sequence embeddings. Additionally, we plot the learned conditional intensity function of our model for each generated sequence (Figure 7). The model learns to map the sequence embeddings to very different distributions.

Figure 5: Conditional information improves performance.
Figure 6: Sequences generated based on different embeddings.
Figure 7: Sequence embeddings learned by the model.

6 Conclusions

We use tools from neural density estimation to design new models for learning in TPPs. We show that a simple mixture model is competitive with state-of-the-art normalizing flows methods, as well as convincingly outperforms other existing approaches. By looking at learning in TPPs from a different perspective, we were able to address the shortcomings of existing intensity-based approaches, such as insufficient flexibility, lack of closed-form likelihoods and inability to generate samples analytically. We hope this alternative viewpoint will inspire new developments in the field of TPPs.


  • E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman (2018) Pyro: Deep Universal Probabilistic Programming.

    Journal of Machine Learning Research

    Cited by: footnote 6.
  • C. M. Bishop (1994) Mixture density networks. Cited by: §1, §3.3, §4.
  • O. Celma (2010) Music recommendation. In Music recommendation and discovery, pp. 43–85. Cited by: footnote 7.
  • G. Cybenko (1989)

    Approximation by superpositions of a sigmoidal function

    Mathematics of control, signals and systems 2 (4), pp. 303–314. Cited by: §3.4.
  • H. Daniels and M. Velikova (2010) Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks 21 (6), pp. 906–917. Cited by: §3.4.
  • A. DasGupta (2008) Asymptotic theory of statistics and probability. Springer Science & Business Media. Cited by: §3.4.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using Real NVP. International Conference on Learning Representations. Cited by: §D.2, §D.4.
  • N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song (2016) Recurrent marked temporal point processes: embedding event history to vector. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: Appendix B, §D.3, §F.2, §1, §2, §3.3, §3.4, §4, §5.1, §5.2, §5, footnote 9.
  • E. Eirola and A. Lendasse (2013) Gaussian mixture models for time series modelling, forecasting, and interpolation. In International Symposium on Intelligent Data Analysis, pp. 162–173. Cited by: §4.
  • S. Frühwirth-Schnatter (2006) Finite mixture and markov switching models. Springer Science & Business Media. Cited by: §3.2.
  • W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud (2018) Backpropagation through the void: optimizing control variates for black-box gradient estimation. International Conference on Learning Representations. Cited by: §D.6.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §4.
  • R. Guo, J. Li, and H. Liu (2018) INITIATOR: noise-contrastive estimation for marked temporal point process. In

    International Joint Conference on Artificial Intelligence

    Cited by: §4.
  • A. G. Hawkes (1971) Spectra of some self-exciting and mutually exciting point processes. Biometrika 58 (1), pp. 83–90. Cited by: §2, §4.
  • J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel (2019) Flow++: improving flow-based generative models with variational dequantization and architecture design. International Conference on Machine Learning. Cited by: §3.1.
  • H. Huang, H. Wang, and B. Mak (2019) Recurrent poisson process unit for speech recognition. In AAAI Conference on Artificial Intelligence, Cited by: §2, §4.
  • V. Isham and M. Westcott (1979) A self-correcting point process. Stochastic Processes and Their Applications 8 (3), pp. 335–347. Cited by: §2, §4.
  • P. Jaini, K. A. Selby, and Y. Yu (2019) Sum-of-squares polynomial flow. In International Conference on Machine Learning, Cited by: §3.1, §3.1, §3.4.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations. Cited by: §D.6, §3.2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. The International Conference on Learning Representations. Cited by: §F.1.
  • D. Krueger, C. Huang, A. Lacoste, and A. Courville (2018) Neural autoregressive flows. In International Conference on Machine Learning, Cited by: §3.1, §3.1, §3.4.
  • S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD international conference on Knowledge discovery and data mining, Cited by: §5, footnote 8.
  • S. Li, S. Xiao, S. Zhu, N. Du, Y. Xie, and L. Song (2018) Learning temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §4, §4.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §5.5.
  • G. McLachlan and D. Peel (2004) Finite mixture models. John Wiley & Sons. Cited by: §3.2.
  • H. Mei and J. M. Eisner (2017) The neural hawkes process: a neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §3.4, §4, §5.1.
  • S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih (2019) Monte carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652. Cited by: §D.6, item c), §3.2.
  • T. Omi, N. Ueda, and K. Aihara (2019) Fully neural network based model for general temporal point processes. Cited by: Appendix C, §D.3, §E.1, §E.1, §1, §2, §3.4, §4, §5.1, §5.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §4.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017)

    Automatic differentiation in PyTorch

    In NIPS Autodiff Workshop, Cited by: footnote 3.
  • J. G. Rasmussen (2011) Temporal point processes: the conditional intensity function. Lecture Notes, Jan. Cited by: Appendix A, Appendix C, §2, §3.3, §4.
  • D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. International Conference on Machine Learning. Cited by: §1, §3.1, §4.
  • A. M. Schäfer and H. G. Zimmermann (2006) Recurrent neural networks are universal approximators. In International Conference on Artificial Neural Networks, pp. 632–640. Cited by: §3.4.
  • M. Schuster (2000) Better generative models for sequential data problems: bidirectional recurrent mixture density networks. In Advances in Neural Information Processing Systems, Cited by: §4.
  • H. T. Siegelmann and E. D. Sontag (1992) On the computational power of neural nets. In

    Proceedings of the fifth annual workshop on Computational learning theory

    pp. 440–449. Cited by: §3.4.
  • E. G. Tabak and C. V. Turner (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164. Cited by: §3.1, §4.
  • G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein (2017) Rebar: low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, Cited by: §D.6.
  • U. Upadhyay, A. De, and M. G. Rodriguez (2018) Deep reinforcement learning of marked temporal point processes. In Advances in Neural Information Processing Systems, Cited by: Appendix B, §D.3, §F.1, §3.2, §4, §4.
  • U. Upadhyay and M. G. Rodriguez (2019) Temporal point processes. Lecture notes for Human-Centered ML. External Links: Link Cited by: Appendix A, §3.4.
  • A. Wienke (2010) Frailty models in survival analysis. Chapman and Hall/CRC. Cited by: Appendix B.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §D.6.
  • S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha (2017) Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems, Cited by: §4.
  • S. Xiao, hongteng Xu, J. Yan, M. Farajtabar, X. Yang, L. Song, and H. Zha (2018) Learning conditional generative models for temporal point processes. In AAAI Conference on Artificial Intelligence, Cited by: §4.
  • J. Yan, X. Liu, L. Shi, C. Li, and H. Zha (2018) Improving maximum likelihood estimation of temporal point process via discriminative and adversarial learning. In International Joint Conference on Artificial Intelligence, Cited by: §4.
  • Z. M. Ziegler and A. M. Rush (2019) Latent normalizing flows for discrete sequences. International Conference on Machine Learning. Cited by: §4.

Appendix A Intensity function of flow and mixture models

CDF and conditional intensity function of proposed models. The cumulative distribution function (CDF) of a normalizing flow model can be obtained in the following way. If has a CDF and , then the CDF of is obtained as

Since for both SOSFlow and DSFlow we can evaluate in closed form, is easy to compute.

For the log-normal mixture model, CDF is by definition equal to

where is the CDF of a standard normal distribution.

Given the conditional PDF and CDF, we can compute the conditional intensity and the cumulative intensity for each model as

where is the arrival time of most recent event before (Rasmussen, 2011).

Merging two independent processes. We replicate the setup from Upadhyay and Rodriguez (2019) and consider what happens if we merge two independent TPPs with intensity functions and (and respectively, cumulative intensity functions and ). According to Upadhyay and Rodriguez (2019), the intensity function of the new process is . Therefore, the cumulative intensity function of the new process is

Using the previous result, we can obtain the CDF of the merged process as

The PDF of the merged process is obtained by simply differentiating the CDF w.r.t. .

This means that by using either normalizing flows or mixture distributions, and thus directly modeling PDF / CDF, we are not losing any benefits of the intensity parametrization.

Appendix B Discussion of constant & exponential intensity models

Constant intensity model as exponential distribution. The conditional intensity function of the constant intensity model (Upadhyay et al., 2018) is defined as , where is the history embedding produced by an RNN, and is a learnable parameter. By setting , it’s easy to see that the PDF of the constant intensity model corresponds to an exponential distribution.

Exponential intensity model as Gompertz distribution. PDF of a Gompertz distribution (Wienke, 2010) is defined as

for . The two parameters and define its shape and rate, respectively. For any choice of its parameters, Gompertz distribution is unimodal and light-tailed.

The conditional intensity function of the exponential intensity model (Du et al., 2016) is defined as , where is the history embedding produced by an RNN, and are learnable parameters. By defining , we obtain the PDF of the exponential intensity model (Du et al., 2016, Equation 12) as

By setting and we see that the exponential intensity model is equivalent to a Gompertz distribution.

Discussion. Figure 8 shows densities that can be represented by exponential and Gompertz distributions. Even though the history embedding produced by an RNN may capture rich information, the resulting distribution for both models has very limited flexibility, is unimodal and light-tailed. In contrast, a flow-based or a mixture model is significantly more flexible and can approximate any density.

Figure 8: Different choices for modeling : exponential distribution (left), Gompertz distribution (center), log-normal mixture (right). Mixture distribution can approximate any density while being tractable and easy to sample from.

Appendix C Discussion of the FullyNN model


The main idea of the approach by Omi et al. (2019) is to model the integrated conditional intensity function

using a feedforward neural network with non-negative weights


where , is the history embedding, , , are non-negative weight matrices, and , , , are the remaining model parameters.

FullyNN as a normalizing flow

Let , that is

We can view as a transformation that maps to

We can now use the change of variables formula to obtain the conditional CDF and PDF of .

Alternatively, we can obtain the conditional intensity as

and use the fact that .

Both approaches lead to the same conclusion

However, the first approach also provides intuition on how to draw samples from the resulting distribution — an approach known as the inverse method (Rasmussen, 2011)

  1. Sample

  2. Obtain by solving for (using e.g. bisection method)

Similarly to other flow-based models, sampling from the FullyNN model cannot be done exactly and requires a numerical approximation.

Shortcomings of the FullyNN model

  1. The PDF defined by the FullyNN model doesn’t integrate to 1.

    By definition of the CDF, the condition that the PDF integrates to 1 is equivalent to , which in turn is equivalent to . However, because of saturation of activations (i.e. ) in Equation 4

    Therefore, the PDF doesn’t integrate to 1.

  2. The FullyNN model assigns a non-zero amount of probability mass to the interval, which violates the assumption that inter-event times are strictly positive.

    Since the inter-event times are assumed to be strictly positive almost surely, it must hold that , or equivalently . However, we can see that

    which means that the FullyNN model permits negative inter-event times.

Appendix D Implementation details

d.1 Shared architecture

We implement SOSFlow, DSFlow and LogNormMix, together with baselines: RMTPP (Gompertz distribution), exponential distribution and a FullyNN model. All of them share the same pipeline, from the data preprocessing to the parameter tuning and model selection, differing only in the way we calculate . This way we ensure a fair evaluation. Our implementation uses Pytorch.333 (Paszke et al., 2017)

From arival times we calculate the inter-event times . Since they can contain very large values, RNN takes log-transformed and centered inter-event time and produces . In case we have marks, we additionally input — the index of the mark class from which we get mark embedding vector . In some experiments we use extra conditional information, such as metadata and sequence embedding , where is the index of the sequence.

As illustrated in Section 3.3 we generate the parameters of the distribution from

using an affine layer. We apply a transformation of the parameters to enforce the constraints, if necessary.

All decoders are implemented using a common framework relying on normalizing flows. By defining the base distribution and the inverse transformation we can evaluate the PDF at any , which allows us to train with maximum likelihood (Section 3.1).

d.2 Log-normal mixture

The log-normal mixture distribution is defined in Equation 2. We generate the parameters of the distribution (subject to and ), using an affine transformation (Equation 3). The log-normal mixture is equivalent to the following normalizing flow model

By using the affine transformation before the

transformation, we obtain a better initialization, and thus faster convergence. This is similar to the batch normalization flow layer

(Dinh et al., 2017), except that and are estimated using the entire dataset, not using batches.

Forward direction samples a value from a Gaussian mixture, applies an affine transformation and applies . In the bacward direction we apply log-transformation to an observed data, center it with an affine layer and compute the density under the Gaussian mixture.

d.3 Baselines

We implement FullyNN model (Omi et al., 2019) as described in Appendix C, using the official implementation as a reference444

. The model uses feed-forward neural network with non-negative weights (enforced by clipping values at

after every gradient step). Output of the network is a cumulative intensity function from which we can easily get intensity function as a derivative w.r.t.  using automatic differentiation in Pytorch. We get the PDF as .

We implement RMTPP / Gompertz distribution (Du et al., 2016)555 and the exponential distribution (Upadhyay et al., 2018) models as described in Appendix B.

All of the above methods define the distribution . Since the inter-event times may come at very different scales, we apply a linear scaling , where is estimated from the data. This ensures a good initialization for all models and speeds up training.

d.4 Deep sigmoidal flow

A single layer of DSFlow model is defined as

with parameters (subject to and ). We obtain the parameters of each layer using Equation 3.

We define through the inverse transformation , as described in Section 3.1.

We use the the batch normalization flow layer (Dinh et al., 2017) between every pair of consecutive layers, which significantly speeds up convergence.

d.5 Sum-of-squares polynomial flow

A single layer of SOSFlow model is defined as

There are no constraints on the polynomial coefficients . We obtain similarly to Equation 3 as , where is the context vector.

We define by through the inverse transformation , as described in Section 3.1.

Same as for DSFlow, we use the the batch normalization flow layer between every pair of consecutive layers. When implementing SOSFlow, we used Pyro666 (Bingham et al., 2018) for reference.

d.6 Reparametrization sampling

Using a log-normal mixture model allows us to sample with reparametrization which proves to be useful, e.g. when imputing missing data (Section 5.4). In a score function estimator (Williams, 1992) given a random variable , where are parameters, the gradient of the expected value of function is or equivalently

. This is an unbiased estimator of the gradients but it suffers from high variance. In contract, by introducing the reparametrization trick:

, the gradient of the expected value of is now . This typically has lower variance (Mohamed et al., 2019). In both cases we need to estimate the expected value using Monte Carlo.

To sample with reparametrization from the mixture model we use the Straight-Through Gumbel Estimator (Jang et al., 2017). We first obtain a relaxed sample , where each is sampled i.i.d. from a Gumbel distribution with zero mean and unit scale, and is the temperature parameter. Finally, we get a one-hot sample . While a discrete is used in the forward pass, during the backward pass the gradients will flow through the differentiable .

The gradients obtained by the Straight-Through Gumbel Estimator are slightly biased, which in practice doesn’t have a significant effect on the model’s performance. There exist alternatives (Tucker et al., 2017; Grathwohl et al., 2018) that provide unbiased gradients, but are more expensive to compute.

Appendix E Dataset statistics

e.1 Synthetic data

Synthetic data is generated according to Omi et al. (2019) using well known point processes. We sample sequences for each process, each sequence containing events.

Poisson. Conditional intensity function for a homogeneous (or stationary) Poisson point process is given as . Constant intensity corresponds to exponential distribution.

Renewal. A stationary process defined by a log-normal probability density function , where we set the parameters to be and . Sequences appear clustered.

Self-correcting. Unlike the previous two, this point process depends on the history and is defined by a conditional intensity function . After every new event the intensity suddenly drops, inhibiting the future points. The resulting point patterns appear regular.

Hawkes. We use a self-exciting point process with a conditional intensity function given as . As per Omi et al. (2019), we create two different datasets: Hawkes1 with , , and ; and Hawkes2 with , , , , and . For the imputation experiment we use Hawkes1 to generate the data and remove some of the events.

e.2 Real-world data

In addition we use real-world datasets that are described bellow. Table 2 shows their summary. All datasets have a large amount of unique sequences and the number of events per sequence varies a lot. Using marked temporal point processes to predict the type of an event is feasible for some datasets (e.g. when the number of classes is low), and is meaningless for other.

LastFM.777Celma (2010) The dataset contains sequences of songs that selected users listen over time. Artists are used as an event type.

Reddit.888 et al., 2019) On this social network website users submit posts to subreddits. In the dataset, most active subreddits are selected, and posts from the most active users on those subreddits are recodered. Each sequence corresponds to a list of submissions a user makes. The data contains unique subreddits that we use as classes in mark prediction.

Stack Overflow.999 preprocessed according to Du et al. (2016) Users of a question-answering website get rewards (called badges) over time for participation. A sequence contains a list of rewards for each user. Only the most active users are selected and only those badges that users can get more than once.

MOOC.footnote 8 Contains the interaction of students with an online course system. An interaction is an event and can be of various types ( unique types), e.g. watching a video, solving a quiz etc.

Wikipedia.footnote 8 A sequence corresponds to edits of a Wikipedia page. The dataset contains most edited pages and users that have an activity (number of edits) above a certain threshold.

Yelp.101010 We use the data from the review forum and consider the reviews for the 300 most visited restaurants in Toronto. Each restaurant then has a corresponding sequence of reviews over time.

Dataset name Number of sequences Number of events
LastFM 929 1268385
Reddit 10000 672350
Stack Overflow 6633 480414
MOOC 7047 396633
Wikipedia 1000 157471
Yelp 300 215146
Table 2: Dataset statistics.

Appendix F Additional discussion of the experiments

f.1 Event time prediction using history

Detailed setup. Each dataset consists of multiple sequences of inter-event times. We consider 10 train/validation/test splits of the sequences (of sizes ). We train all model parameters by minimizing the negative log-likelihood (NLL) of the training sequences, defined as . After splitting the data into the 3 sets, we break down long training sequences into sequences of length at most 128. Optimization is performed using Adam (Kingma and Ba, 2015) with learning rate

. We perform training using mini-batches of 64 sequences. We train for up to 2000 epochs (1 epoch = 1 full pass through all the training sequences). For all models, we compute the validation loss at every epoch. If there is no improvement for 100 epochs, we stop optimization and revert to the model parameters with the lowest validation loss.

We select hyperparameter configuration for each model that achieves the lowest average loss on the validation set. For each model, we consider different values of regularization strength . Additionally, for SOSFlow we tune the number of transformation layers and for DSFlow .

Additional discussion. In this experiment, we only condition the distribution on the history embedding . We don’t learn sequence embeddings since they can only be learned for the training sequences, and not fore the validation/test sets.

There are two important aspects related to the NLL loss values that we report. First, the absolute loss values can be arbitrarily shifted by rescaling the data. Assume, that we have a distribution that models the distribution of . Now assume that we are interested in the distribution of (for ). Using the change of variables formula, we obtain . This means that by simply scaling the data we can arbitrarily offset the log-likelihood score that we obtain. Therefore, the absolute values of of the (negative) log-likelihood for different models are of little interest — all that matters are the differences between them.

The loss values are dependent on the train/val/test split. Assume that model 1 achieves loss values on two train/val/test splits, and model 2 achieves on the same splits. If we first aggregate the scores and report the average , , it may seem that the difference between the two models is not significant. However, if we first compute the differences and then aggregate we see a different picture. Therefore, we use the latter strategy in Figure 3. For completeness, we also report the numbers obtained using the first strategy in Table 3.

As a baseline, we also considered the constant intensity / exponential distribution model (Upadhyay et al., 2018). However, we excluded the results for it from Figure 3, since it consistently achieved the worst loss values and had high variance. We still include the results for the constant intensity model in Table 3. We also performed all the experiments on the synthetic datasets (Appendix E.1). The results are shown in Table 4. Same as for real-world datasets, LogNormMix and DSFlow achieve the best results.

Reddit LastFM MOOC Stack Overflow Wikipedia Yelp
LogNormMix 10.19 0.078 -2.88 0.147 6.03 0.092 14.44 0.013 8.39 0.079 13.02 0.070
DSFlow 10.20 0.074 -2.88 0.148 6.03 0.090 14.44 0.019 8.40 0.090 13.09 0.065
SOSFlow 10.27 0.106 -2.56 0.133 6.27 0.058 14.47 0.049 8.44 0.120 13.21 0.068
FullyNN 10.23 0.072 -2.84 0.179 6.83 0.152 14.45 0.014 8.40 0.086 13.04 0.073
LogNormal 10.38 0.077 -2.60 0.140 6.53 0.016 14.62 0.013 8.52 0.078 13.44