A toolbox of Hawkes processes
Many real-world applications require robust algorithms to learn point processes from a type of incomplete data --- so-called short doubly-censored (SDC) event sequences. We study this critical problem of quantitative asynchronous event sequence analysis under the framework of Hawkes processes by leveraging the idea of data synthesis. Given SDC event sequences observed in a variety of time intervals, we propose a sampling-stitching data synthesis method --- sampling predecessors and successors for each SDC event sequence from potential candidates and stitching them together to synthesize long training sequences. The rationality and feasibility of our method are discussed in terms of arguments based on likelihood. Experiments on both synthetic and real-world data demonstrate that the proposed data synthesis method indeed improves learning results for both time-invariant and time-varying Hawkes processes.
Real-world interactions among multiple entities are often recorded as asynchronous event sequences, such as user behaviors in social networks, job hunting and hopping among companies, and diseases and their complications. The entities or event types in the sequences often exhibit self-triggering and mutually-triggering patterns. For example, a tweet of a Twitter user may trigger further responses from her friends (Zhao et al., 2015). A disease of a patient may trigger other complications (Choi et al., 2015). Hawkes processes, an important kind of temporal point process model (Hawkes & Oakes, 1974), have the capability to describe these triggering patterns quantitatively and to capture the infectivity network of the entities.
Despite the usefulness of Hawkes processes, robust learning of Hawkes processes often requires many event sequences with events occurring over a long observation window. Unfortunately, in many important practical applications the observation window is likely to be very short and sequence-specific, i.e., within an imagined universal window, each sequence is only observed within a corresponding short sub-interval, and the events outside this sub-interval are not observed — we call these short doubly-censored (SDC) event sequences. Existing learning algorithms of Hawkes processes applied directly to SDC sequences may suffer from over-fitting; what is worse, the triggering patterns between historical events and current ones are lost, so the triggering patterns learned from SDC event sequences are often unreliable. This problem is a thorny issue in several practical applications, especially those having time-varying triggering patterns. For example, the disease networks of patients should evolve with age. However, it is very hard to track and record people's diseases on a lifetime scale. Instead, we can only obtain their several admissions (sometimes only one admission) to a hospital during one or two years, which are just SDC event sequences. Therefore, it is highly desirable to propose a method to learn Hawkes processes with long-time support from a collection of SDC event sequences.
An illustration of our sampling-stitching data synthesis method. For each SDC sequence, i.e., incomplete disease history of a person in his lifetime, we design a mechanism to select other SDC sequences as predecessors/successors and synthesize a long sequence. Then, we can estimate the unobserved triggering patterns among diseases, i.e., the red dashed arrows, and construct a dynamical disease network changing over age.
In this paper, we propose a novel and simple data synthesis method to enhance the robustness of learning algorithms for Hawkes processes. Fig. 1 illustrates the principle of our method. Given a set of SDC event sequences, we sample a predecessor and a successor for each event sequence from potential candidates and stitch them together as new training data. In the sampling step, the distribution of predecessors (and successors) is estimated according to the similarities between the current sequence and its candidates, where the similarity is defined based on the time stamps and (optional) features of the event sequences. We analyze the rationality and feasibility of our data synthesis method and discuss the necessary condition for using it. Experimental results show that our data synthesis method indeed helps to improve the robustness of various learning algorithms for Hawkes processes. Especially in the case of time-varying Hawkes processes, applying our method in the learning phase achieves much better results than learning directly from SDC event sequences, which is meaningful for many practical applications, e.g., constructing dynamical disease networks and learning long-term infectivity among different IT companies.
An event sequence can be represented as $s=\{(t_i, c_i)\}_{i=1}^{I}$, where the time stamps $t_i$ are in an observation window $[0, T]$ and the events $c_i$ are in a set of event types $\mathcal{C}=\{1,\dots,C\}$. A point process is a random process model taking event sequences as instances, $N=\{N_c(t)\}_{c\in\mathcal{C}}$, where $N_c(t)$ is the number of type-$c$ events occurring at or before time $t$. A point process can be characterized via its conditional intensity function $\lambda_c(t)=\mathbb{E}[\mathrm{d}N_c(t)\,|\,\mathcal{H}_t]/\mathrm{d}t$, where $\mathcal{H}_t=\{(t_i,c_i)\,|\,t_i<t\}$ is the set of historical events. It represents the expected instantaneous happening rate of type-$c$ events given the historical record (Daley & Vere-Jones, 2007). The intensity is often modeled with certain parameters to capture the phenomena of interest, e.g., self-triggering (Hawkes & Oakes, 1974) or self-correcting (Xu et al., 2015). Based on $\lambda_c(t)$, the likelihood of an event sequence is
$$\mathcal{L}(s) = \prod_{i=1}^{I}\lambda_{c_i}(t_i)\,\exp\Bigl(-\sum_{c\in\mathcal{C}}\int_0^T \lambda_c(u)\,\mathrm{d}u\Bigr). \quad (1)$$
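The likelihood above can be evaluated numerically. A minimal sketch, assuming a hypothetical `intensity(t, c, history)` callback and approximating the compensator integral with a Riemann sum:

```python
import numpy as np

def log_likelihood(events, intensity, T, n_grid=1000):
    """Log-likelihood of an event sequence under a conditional intensity.

    events    : list of (t_i, c_i) pairs, sorted by time
    intensity : intensity(t, c, history) -> instantaneous rate of type-c events
    T         : end of the observation window [0, T]
    The integral term is approximated by a left Riemann sum on a regular grid.
    """
    # Sum of log-intensities at the observed events.
    ll = 0.0
    for i, (t, c) in enumerate(events):
        ll += np.log(intensity(t, c, events[:i]))
    # Compensator: integral of the total intensity over the window.
    grid = np.linspace(0.0, T, n_grid, endpoint=False)
    dt = T / n_grid
    for c in sorted({c for _, c in events}):
        rates = [intensity(u, c, [e for e in events if e[0] < u]) for u in grid]
        ll -= np.sum(rates) * dt
    return ll

# Example with a homogeneous unit-rate intensity (one event type):
events = [(0.5, 0), (1.2, 0), (2.7, 0)]
ll = log_likelihood(events, lambda t, c, h: 1.0, T=3.0)
# For a unit-rate Poisson process on [0, 3]: ll = 3*log(1) - 3 = -3
```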
Hawkes Processes. Hawkes processes (Hawkes & Oakes, 1974) have a particular form of intensity:
$$\lambda_c(t) = \mu_c + \sum_{(t_i, c_i)\in\mathcal{H}_t}\phi_{cc_i}(t, t_i), \quad (2)$$
where $\mu_c$ is the exogenous base intensity, independent of the history, while $\sum_{(t_i,c_i)\in\mathcal{H}_t}\phi_{cc_i}(t,t_i)$ is the endogenous intensity capturing the influence of historical events on type-$c$ ones at time $t$ (Xu et al., 2016a). Here, $\phi_{cc'}(t,s)$ is called the impact function. It quantifies the influence of a type-$c'$ event at time $s$ on a type-$c$ event at time $t$. Hawkes processes provide us with a physically-meaningful model to capture the infectivity among various events, and have been used in social network analysis (Zhou et al., 2013b; Zhao et al., 2015), behavior analysis (Yang & Zha, 2013; Luo et al., 2015) and financial analysis (Bacry et al., 2013). However, the methods in these references assume that the impact function is shift-invariant (i.e., $\phi_{cc'}(t,s)=\phi_{cc'}(t-s)$), which limits their applications on a long time scale. Recently, the time-dependent Hawkes process (TiDeH) in (Kobayashi & Lambiotte, 2016) and the neural network-based Hawkes process in (Mei & Eisner, 2016) learn very flexible Hawkes processes with complicated intensity functions. Because they depend heavily on the size and quality of the data, they may fail in the case of SDC event sequences.
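The Hawkes intensity in (2) is straightforward to compute. A sketch with the common shift-invariant exponential kernel (a simplifying assumption; the paper later generalizes to time-varying impact functions):

```python
import numpy as np

def hawkes_intensity(t, c, history, mu, A, beta):
    """Conditional intensity of a multivariate Hawkes process with an
    exponential triggering kernel:
        lambda_c(t) = mu[c] + sum_{(t_i, c_i) in history, t_i < t}
                              A[c, c_i] * exp(-beta * (t - t_i))
    mu   : base intensities, shape (C,)
    A    : infectivity matrix, A[c, c'] = influence of type c' on type c
    beta : decay rate of the triggering kernel
    """
    rate = mu[c]
    for t_i, c_i in history:
        if t_i < t:
            rate += A[c, c_i] * np.exp(-beta * (t - t_i))
    return rate

mu = np.array([0.1, 0.2])
A = np.array([[0.5, 0.0],
              [0.3, 0.4]])
history = [(1.0, 0), (2.0, 1)]
lam = hawkes_intensity(3.0, 1, history, mu, A, beta=1.0)
# lam = 0.2 + 0.3*exp(-2) + 0.4*exp(-1)
```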
Learning from Imperfect Observations. In practice, we often need to learn sequential models from imperfect observations (e.g., interleaved (Xu et al., 2016b), aggregated (Luo et al., 2016) and extremely-short sequences (Xu et al., 2016c)). Multiple imputation (MI) (Rubin, 2009) is a general framework to build surrogate observations from the current model. For time series, the bootstrap method (Efron, 1982; Politis & Romano, 1994; Gonçalves & Kilian, 2004) and its variants (Paparoditis & Politis, 2001; Guan & Loh, 2007) have been used to improve learning results when observations are insufficient. In survival analysis, many techniques have been developed to deal with truncated and censored data (Turnbull, 1974; De Gruttola & Lagakos, 1989; Klein & Moeschberger, 2005; Van den Berg & Drepper, 2016). For point processes, global (Streit, 2010) or local (Fan, 2009) maximum likelihood estimators (MLE) are used to learn Poisson processes. Nonparametric approaches for non-homogeneous Poisson processes use the pseudo MLE (Sun & Kalbfleisch, 1995) or the full MLE (Wellner & Zhang, 2000). The bootstrap methods above have also been used to learn point processes (Cowling et al., 1996; Guan & Loh, 2007; Kirk & Stumpf, 2009). To learn Hawkes processes robustly, structural constraints, e.g., low-rank (Luo et al., 2015) and group-sparse regularizers (Xu et al., 2016a), have been introduced. However, none of these methods considers the case of SDC event sequences for Hawkes processes.
Suppose that the original complete event sequences are in a long observation window $[0, T]$. However, in practice the observation window might be segmented into several intervals $\{[T_{m-1}, T_m)\}_{m=1}^{M}$, and we can only observe the SDC sequences $\mathcal{S}_m=\{s_{mn}\}_n$ in the $m$-th interval, $m=1,\dots,M$. Although we can still apply the maximum likelihood estimator to learn Hawkes processes, i.e.,
$$\max_{\theta}\ \sum_{m,n}\log\mathcal{L}(s_{mn};\theta), \quad (3)$$
the SDC event sequences would lead to an over-fitting problem and the loss of triggering patterns. Can we do better in such a situation? In this work, we propose a data synthesis method based on a sampling-stitching mechanism, which extends SDC event sequences to longer ones and enhances the robustness of learning algorithms.
Denote the $n$-th SDC event sequence in the $m$-th interval as $s_{mn}$. Because its predecessor is unavailable, if we learn the parameters of our model via (3) directly, we actually impose a strong assumption on our data: that no event happens before $T_{m-1}$ (or that previous events are too far away from $[T_{m-1}, T_m)$ to have influence on $s_{mn}$). Obviously, this assumption is questionable — it is likely that there are influential events happening before $T_{m-1}$. A more reasonable strategy is enumerating potential predecessors and maximizing the expected log-likelihood over the whole observation window:
$$\max_{\theta}\ \sum_{m,n}\mathbb{E}_{h\sim p(h)}\bigl[\log\mathcal{L}(h\cup s_{mn};\theta)\bigr]. \quad (4)$$
Here $\mathbb{E}_{h\sim p(h)}[\cdot]$ represents the expectation of a function with respect to a random variable $h$ following a distribution $p(h)$, where $h$ ranges over all possible histories before $T_{m-1}$, and $\mathcal{L}(h\cup s_{mn};\theta)$ is the likelihood of the stitched sequence $h\cup s_{mn}$.
The stitched sequence can be generated by sampling an SDC sequence from the previous 1st, ..., $(m-1)$-th intervals and stitching it to $s_{mn}$. The sampling process induces the probabilistic distribution of the stitched sequences. Given $s_{mn}$, we can compute its similarity to a potential predecessor $s'\in\mathcal{S}_1\cup\cdots\cup\mathcal{S}_{m-1}$ as
$$w(s', s_{mn}) = \kappa_{\sigma}(s', s_{mn}). \quad (5)$$
Here, $\kappa_{\sigma}(\cdot,\cdot)$ is a predefined similarity function with parameter $\sigma$, and $f_{mn}$ is the feature of $s_{mn}$, which is available in some applications. Note that the feature is optional — even if the feature of a sequence is unavailable, we can still define the similarity measure purely based on time stamps. The normalized similarity provides us with the probability that $s'$ appears before $s_{mn}$, i.e., $p(s'\,|\,s_{mn}) = w(s', s_{mn})/\sum_{s''}w(s'', s_{mn})$. Then, we can sample a predecessor according to this categorical distribution, i.e., $s'\sim p(\cdot\,|\,s_{mn})$.
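The sampling step can be sketched as follows. The Gaussian similarity kernel, its parameter `sigma`, and the per-sequence `feature` summary are illustrative assumptions; the method only requires some predefined similarity measure over time stamps and optional features:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_predecessor(current, candidates, sigma=1.0):
    """Sample a predecessor for an SDC sequence from earlier intervals.

    current / candidates: dicts with a 'feature' vector summarizing each
    sequence (e.g., built from its time stamps and optional profile features).
    Returns the sampled candidate and its categorical probability.
    """
    feats = np.array([cand["feature"] for cand in candidates])
    cur = np.asarray(current["feature"])
    # Gaussian similarity between the current sequence and each candidate.
    sims = np.exp(-np.sum((feats - cur) ** 2, axis=1) / (2 * sigma ** 2))
    probs = sims / sims.sum()          # normalize -> categorical distribution
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx], probs[idx]

current = {"feature": [0.0, 1.0]}
candidates = [{"feature": [0.1, 1.1]}, {"feature": [5.0, 5.0]}]
pred, p = sample_predecessor(current, candidates)
# The nearby candidate is selected with probability close to 1.
```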
We can apply such a sampling-stitching mechanism $K$ times iteratively to the SDC sequences in both backward and forward directions and obtain long stitched event sequences. Specifically, we represent a stitched event sequence as $\tilde{s} = s^{(1)}\cup\cdots\cup s^{(K)}$, $s^{(k)}\in\mathcal{S}_{m_k}$, $m_1<\cdots<m_K$, whose probability is
$$p(\tilde{s}) = \prod_{k=2}^{K} p\bigl(s^{(k-1)}\,\big|\,s^{(k)}\bigr). \quad (6)$$
Note that our data synthesis method naturally admits two variants. When the starting (or ending) point of a time window is unavailable, we use the time stamp of the first (or last) event of the SDC sequence instead. Additionally, we can relax the constraint in (5) and allow an SDC sequence to overlap with its predecessor/successor. In this case, we randomly preserve the overlapping part either from the sequence itself or from its predecessor/successor before applying our sampling-stitching method. These two variants ensure that our data synthesis method is practical, and they are used in the following experiments on real-world data.
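Putting the pieces together, the backward stitching loop can be sketched as below. The `sample_fn` interface and the toy data structures are assumptions for illustration; the stitched sequence's probability is the product of the sampling probabilities, as in the stitching distribution above:

```python
def stitch(seq, predecessors_by_interval, sample_fn, K=2):
    """Iteratively extend an SDC sequence backward by sampling predecessors.

    seq: list of (t, c) events of the current SDC sequence
    predecessors_by_interval: list of candidate lists, earliest interval first
    sample_fn(current, candidates) -> (chosen_candidate, probability)
    Returns the stitched event list and its probability.
    """
    stitched, prob = list(seq), 1.0
    for candidates in reversed(predecessors_by_interval[-K:]):
        if not candidates:
            continue
        pred, p = sample_fn(stitched, candidates)
        stitched = list(pred) + stitched   # prepend the sampled predecessor
        prob *= p
    return sorted(stitched), prob

seq = [(2.0, 0)]                                   # current SDC sequence
earlier = [[[(0.0, 0)]], [[(1.0, 1)]]]             # candidates per interval
pick_first = lambda cur, cands: (cands[0], 0.5)    # toy deterministic sampler
stitched, prob = stitch(seq, earlier, pick_first, K=2)
# stitched = [(0.0, 0), (1.0, 1), (2.0, 0)], prob = 0.25
```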
After applying our data synthesis method, we obtain many stitched event sequences, which can be used as instances for estimating the model parameters $\theta$. Specifically, taking advantage of the stitched sequences, we can approximately rewrite the learning problem in (4) as
$$\min_{\theta}\ -\sum_{k} p(\tilde{s}_k)\log\mathcal{L}(\tilde{s}_k;\theta), \quad (7)$$
which is actually a minimum cross-entropy estimation. $p(\tilde{s})$ represents the “true” probability that the stitched sequence $\tilde{s}$ happens, which is estimated via the predefined similarity measure and the sampling mechanism. The likelihood $\mathcal{L}(\tilde{s};\theta)$ represents the model's estimate of the probability that the stitched sequence happens, based on the definition in (1). Our data synthesis method takes advantage of the information in the time stamps and (optional) features and makes $p(\tilde{s})$ suitable for practical situations. For example, the likelihood of a sequence generally decreases as the observation window grows. The proposed probability exhibits the same pattern — according to (6), the longer a stitched sequence is, the smaller its probability becomes.
The set of all possible stitched sequences in (7) is very large: its cardinality grows combinatorially with the number of intervals and sequences. In practice, we cannot and do not need to enumerate all possible combinations. An empirical setting is to make the number of stitched sequences comparable to that of the original SDC event sequences. In the following experiments, we apply a small number of sampling trials and generate a few stitched sequences for each original SDC event sequence, which achieves a trade-off between computational complexity and performance.
It should be noted that our data synthesis method is only suitable for those complicated point processes whose historical events have influences on current and future ones. Specifically, we analyze the feasibility of our method for several typical point processes.
Poisson Processes. Our data synthesis method cannot improve learning results if the event sequences are generated via Poisson processes. For Poisson processes, the happening rate of future events is independent of historical events. In other words, the intensity function of each interval can be learned independently based on the SDC event sequences. The stitched sequences do not provide us with any additional information.
Hawkes Processes. For Hawkes processes, whose intensity function is defined as in (2), our data synthesis method can generally enhance the robustness of learning algorithms. In particular, consider a “long” event sequence generated via a Hawkes process in the time window $[0, T]$. If we divide the time window into two intervals, i.e., $[0, T_1)$ and $[T_1, T]$, the intensity function corresponding to the second interval can be written as
$$\lambda_c(t) = \Bigl(\mu_c + \sum_{t_i\in[T_1,\,t)}\phi_{cc_i}(t, t_i)\Bigr) + \sum_{t_i\in[0,\,T_1)}\phi_{cc_i}(t, t_i),\qquad t\in[T_1, T]. \quad (8)$$
If the events in the first interval are unobserved, we just have a SDC event sequence, and the second term in (8) is unavailable. Learning Hawkes processes directly from the SDC event sequence ignores the information of the second term, which has a negative influence on learning results. Our data synthesis method leverages the information from other potential predecessors and generates multiple candidate long sequences. As a result, we obtain multiple intensity functions sharing the second interval and maximize the weighted sum of their log-likelihood functions (i.e., an estimated expectation of the log-likelihood of the real long sequence), as (7) does.
Compared with learning from SDC event sequences directly, applying our data synthesis method can improve learning results in general, unless the second term in (8) is negligible. Specifically, we can model the impact functions of Hawkes processes based on a basis representation:
$$\phi_{cc'}(t, s) = \psi_{cc'}(s)\,\kappa(t-s),\qquad \psi_{cc'}(s) = \sum_{d=1}^{D}a_{cc'}^{d}\,g_d(s),\qquad \kappa(t)=\exp(-\beta t). \quad (9)$$
Here, we decompose the impact functions into two parts: 1) The infectivity $\psi_{cc'}(s)$ represents the infectivity of event type $c'$ on event type $c$ at time $s$. (When $D=1$ and $g_1(s)\equiv 1$, we obtain the simplest time-invariant Hawkes process. Relaxing the shift-invariant assumption, i.e., letting the basis functions $g_d$ be Gaussian, we obtain a flexible time-varying Hawkes process model.) 2) The triggering kernel $\kappa(t)$ measures the time decay of the infectivity: the infectivity of a historical event on the current one reduces exponentially with the increase of the temporal distance between them. When $\beta$ is very large, $\kappa(t)$ decays rapidly with the increase of $t$, and events happening long ago can be ignored. In such a situation, our data synthesis method is unable to improve learning results.
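The decomposition in (9) can be written compactly in code. The exact constants in the Gaussian basis are illustrative assumptions:

```python
import numpy as np

def impact(t, s, a_ccd, centers, bandwidth, beta):
    """Impact function phi_{cc'}(t, s) = psi_{cc'}(s) * kappa(t - s), with a
    time-varying infectivity expressed in a Gaussian basis:
        psi_{cc'}(s) = sum_d a_ccd[d] * exp(-(s - centers[d])**2 / bandwidth**2)
        kappa(t)     = exp(-beta * t)
    a_ccd: basis coefficients for the (c, c') pair of event types.
    """
    psi = np.sum(a_ccd * np.exp(-((s - centers) ** 2) / bandwidth ** 2))
    kappa = np.exp(-beta * (t - s))
    return psi * kappa

# A historical event at s = 2.0 influencing time t = 2.0 with a single basis
# function centered at s: psi = 1, kappa = 1, so the impact equals 1.
val = impact(2.0, 2.0, np.array([1.0]), np.array([2.0]), 1.0, 1.0)
```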
The Hawkes process is a kind of physically-interpretable model for many natural and social phenomena. The proposed model in (9) reflects many common properties of real-world event sequences. First, the infectivity among various event types often changes smoothly in practice: in social networks, the interaction between two users changes smoothly and is not established or blocked suddenly; in disease networks, the infectivity among diseases should change smoothly with a patient's age. Applying the Gaussian basis representation guarantees the smoothness of the infectivity functions. Second, the triggering kernel measures the decay of infectivity over time. According to existing work, the decay of infectivity is approximately exponential, which has been verified on many real-world data sets (Zhou et al., 2013a; Kobayashi & Lambiotte, 2016; Choi et al., 2015). For learning Hawkes processes from SDC event sequences, we combine our data synthesis method with an EM-based learning algorithm for Hawkes processes. Applying our data synthesis method, we obtain a set of stitched event sequences and their appearance probabilities $\{(\tilde{s}_k, p_k)\}_k$, where $p_k = p(\tilde{s}_k)$ is calculated based on (5). According to (7, 9), we can learn the target Hawkes process via
$$\min_{\theta}\ -\sum_{k} p_k\log\mathcal{L}(\tilde{s}_k;\theta) + \gamma R(\theta). \quad (10)$$
$\theta=\{\boldsymbol{\mu}, \mathbf{A}\}$ represents the parameters of our model. The vector $\boldsymbol{\mu}=[\mu_c]$ and the tensor $\mathbf{A}=[a_{cc'}^{d}]$ are nonnegative. Based on (1, 9), the log-likelihood function is
$$\log\mathcal{L}(\tilde{s};\theta) = \sum_{i}\log\lambda_{c_i}(t_i) - \sum_{c\in\mathcal{C}}\int_{0}^{\tilde{T}}\lambda_c(u)\,\mathrm{d}u,$$
where $\tilde{T}$ is the length of the observation window of the stitched sequence $\tilde{s}$. $R(\theta)$ represents the regularizer of the parameters, whose weight is $\gamma$. Following existing work (Luo et al., 2015; Zhou et al., 2013a; Xu et al., 2016a), we assume the infectivity connections among different event types to be sparse and impose an $\ell_1$-norm regularizer on the coefficient tensor $\mathbf{A}$, i.e., $R(\theta)=\|\mathbf{A}\|_1$.
We can solve the problem via an EM algorithm. Specifically, when the sparse regularizer is applied, we take advantage of the ADMM method, introducing an auxiliary variable $\mathbf{Z}$ and a dual variable $\mathbf{U}$ for $\mathbf{A}$ and rewriting the objective function in (10) as
$$\min\ -\sum_{k} p_k\log\mathcal{L}(\tilde{s}_k;\theta) + \gamma\|\mathbf{Z}\|_1 + \frac{\rho}{2}\,\mathrm{tr}\bigl((\mathbf{A}-\mathbf{Z}+\mathbf{U})^{\top}(\mathbf{A}-\mathbf{Z}+\mathbf{U})\bigr).$$
Here $\rho$ controls the weight of the augmented regularization term, and it increases with the number of EM iterations; $\mathrm{tr}(\cdot)$ computes the trace of a matrix. Then, we can update $\{\boldsymbol{\mu}, \mathbf{A}\}$, $\mathbf{Z}$, and $\mathbf{U}$ alternately.
Update $\boldsymbol{\mu}$ and $\mathbf{A}$: Given the parameters in the current iteration, we apply Jensen's inequality to the log-likelihood and obtain a surrogate objective function for $\boldsymbol{\mu}$ and $\mathbf{A}$, where the responsibility of each event (i.e., the probability that it is triggered by the base intensity or by a specific historical event) is calculated based on the current estimates of $\boldsymbol{\mu}$ and $\mathbf{A}$. Then, we can update $\boldsymbol{\mu}$ and $\mathbf{A}$ by setting the derivatives of the surrogate function to zero. Both of these equations have closed-form solutions.
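The E-step of this classic EM scheme can be sketched as below. For brevity the sketch assumes the shift-invariant exponential kernel with an infectivity matrix `A`; the paper's model uses the richer time-varying parameterization in (9):

```python
import numpy as np

def responsibilities(events, mu, A, beta):
    """E-step for Hawkes EM: for each event i, compute the probability that
    it was triggered by the base intensity (p[i, i]) or by an earlier event
    j (p[i, j]), obtained by applying Jensen's inequality to the
    log-likelihood and normalizing over the admissible causes.
    """
    n = len(events)
    p = np.zeros((n, n))
    for i, (t_i, c_i) in enumerate(events):
        p[i, i] = mu[c_i]                          # exogenous contribution
        for j, (t_j, c_j) in enumerate(events[:i]):
            p[i, j] = A[c_i, c_j] * np.exp(-beta * (t_i - t_j))
        p[i, : i + 1] /= p[i, : i + 1].sum()       # normalize over causes
    return p

events = [(1.0, 0), (2.0, 0)]
mu = np.array([0.5])
A = np.array([[0.5]])
p = responsibilities(events, mu, A, beta=1.0)
# Each row sums to one over the admissible causes; the first event can only
# be exogenous, so p[0, 0] = 1.
```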
Update $\mathbf{Z}$: Given $\mathbf{A}$ and $\mathbf{U}$, we can update $\mathbf{Z}$ via solving the following optimization problem:
$$\min_{\mathbf{Z}}\ \gamma\|\mathbf{Z}\|_1 + \frac{\rho}{2}\|\mathbf{A}-\mathbf{Z}+\mathbf{U}\|_F^2.$$
Applying the soft-thresholding method, we have
$$\mathbf{Z} = S_{\gamma/\rho}(\mathbf{A}+\mathbf{U}),$$
where $S_{\alpha}(x)=\mathrm{sign}(x)\max(|x|-\alpha, 0)$ is the soft-thresholding function.
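The soft-thresholding operator, the proximal operator of the $\ell_1$-norm used in this $\mathbf{Z}$-update, is a one-liner:

```python
import numpy as np

def soft_threshold(x, alpha):
    """Element-wise soft-thresholding operator
        S_alpha(x) = sign(x) * max(|x| - alpha, 0).
    """
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

z = soft_threshold(np.array([-2.0, -0.3, 0.0, 0.4, 1.5]), 0.5)
# -> [-1.5, 0.0, 0.0, 0.0, 1.0]
```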
Update $\mathbf{U}$: Given $\mathbf{A}$ and $\mathbf{Z}$, we can further update the dual variable as $\mathbf{U} = \mathbf{U} + \mathbf{A} - \mathbf{Z}$.
In summary, Algorithm 1 shows the scheme of our learning method. Note that the algorithm can be applied to SDC event sequences directly by ignoring the probabilities $p_k$.
To demonstrate the usefulness of our data synthesis method, we combine it with various learning algorithms of Hawkes processes and accordingly learn different models from SDC event sequences. For time-invariant Hawkes processes, we consider two learning algorithms — our EM-based learning algorithm and the least squares (LS) algorithm in (Eichler et al., 2016). For time-varying Hawkes processes, we apply our EM-based learning algorithm. In the following experiments, we use Gaussian basis functions $g_d(t)=\exp(-(t-\tau_d)^2/\sigma^2)$ with centers $\tau_d$ and bandwidth $\sigma$. The number and the bandwidth of the basis functions can be set according to the basis selection method proposed in (Xu et al., 2016a); the remaining hyperparameters of our algorithm are fixed across experiments. Given SDC event sequences, we learn Hawkes processes in three ways: 1) learning directly from the SDC event sequences; 2) applying the stationary bootstrap method in (Politis & Romano, 1994) to generate more synthetic SDC event sequences and learning from these sequences; 3) learning from the stitched sequences generated via our data synthesis method. For real-world data, whose SDC sequences do not have predefined starting and ending time stamps, we apply the variants of our method mentioned at the end of Section 3.1.
The synthetic SDC event sequences are generated via the following method: complete event sequences are simulated in a time window $[0, T]$ based on a multi-dimensional Hawkes process. The base intensities $\mu_c$ are randomly generated within a fixed range, and the decay parameter $\beta$ of the triggering kernel is fixed. For time-invariant Hawkes processes, we set the infectivity coefficients to randomly generated constants. For time-varying Hawkes processes, we set the infectivity functions to smooth curves with randomly generated coefficients. Given these complete event sequences, we select a subset as the testing set and use the remaining sequences as the training set. To generate SDC event sequences, we segment the time window into several intervals and randomly preserve the data in only one interval for each training sequence. We test all methods over multiple trials and compare them on the relative error between the real parameters $\theta^{\ast}$ and their estimates $\hat{\theta}$, i.e., $\|\hat{\theta}-\theta^{\ast}\|/\|\theta^{\ast}\|$, and on the log-likelihood of the testing sequences.
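Complete Hawkes event sequences of this kind are commonly generated by Ogata's thinning algorithm. A sketch, assuming the exponential triggering kernel (so the intensity only decays between events, which makes the current total rate a valid upper bound):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_hawkes(mu, A, beta, T):
    """Simulate a multivariate Hawkes process on [0, T] via Ogata thinning."""
    C = len(mu)
    events, t = [], 0.0

    def rates(t):
        lam = mu.copy()
        for t_i, c_i in events:
            lam += A[:, c_i] * np.exp(-beta * (t - t_i))
        return lam

    while t < T:
        lam_bar = rates(t).sum()            # upper bound until the next event
        t += rng.exponential(1.0 / lam_bar) # candidate inter-arrival time
        if t >= T:
            break
        lam = rates(t)
        if rng.uniform() < lam.sum() / lam_bar:     # accept/reject (thinning)
            c = rng.choice(C, p=lam / lam.sum())    # assign an event type
            events.append((t, c))
    return events

mu = np.array([0.5, 0.5])
A = np.array([[0.2, 0.1],
              [0.1, 0.2]])
seq = simulate_hawkes(mu, A, beta=1.0, T=50.0)
# seq is a time-ordered list of (t, c) events inside (0, 50).
```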
Time-invariant Hawkes Processes. Fig. 2 shows the comparisons on log-likelihood and relative error for the various methods. In Fig. 2(a) we can see that, compared with the learning results based on complete event sequences, the results based on SDC event sequences degrade considerably (lower log-likelihood and higher relative error) because of the loss of information. Our data synthesis method consistently improves the learning results as the number of training sequences increases, and it outperforms its bootstrap-based competitor (Politis & Romano, 1994) as well. To demonstrate the universality of our method, besides our EM-based algorithm, we apply it to the least squares (LS) algorithm (Eichler et al., 2016). Fig. 2(b) shows that our method also improves the learning results of the LS algorithm in the case of SDC event sequences. Both the log-likelihood and the relative error obtained from the stitched sequences approach the results learned from complete sequences.
Time-varying Hawkes Processes. Fig. 3 shows the comparisons on log-likelihood and relative error for the various methods. Similarly, the learning results are improved by applying our method — higher log-likelihood and lower relative error are obtained, and their standard deviations (the error bars associated with the curves) shrink. In this case, applying our method twice achieves better results than applying it once, which verifies the usefulness of the iterative framework in our sampling-stitching algorithm. Besides these objective measurements, in Fig. 4 we visualize the infectivity functions $\psi_{cc'}(t)$. It is easy to see that the infectivity functions learned from stitched sequences (red curves) are comparable to those learned from complete event sequences (yellow curves), both having small estimation errors with respect to the ground truth (black curves).
Note that our iterative framework is useful, especially for time-varying Hawkes processes, when the number of stitches is not very large. In our experiments, we fixed the maximum number of synthetic sequences. As a result, Figs. 2 and 3 show that the likelihoods first increase (i.e., stitching once or twice) and then decrease (i.e., stitching more than three times), while the relative errors show the opposite trend w.r.t. the number of stitches. These phenomena imply that too many stitches introduce too much unreliable interdependency among events. Therefore, we fix the number of stitches in the following experiments.
Besides synthetic data, we also test our method on real-world data, including the LinkedIn data collected by ourselves and the MIMIC III data set (Johnson et al., 2016).
LinkedIn Data. The LinkedIn data we collected online contain the job hopping records of LinkedIn users among IT companies. For each user, her/his check-in time stamps at different companies are recorded as an event sequence, and her/his profile (e.g., education background, skill list, etc.) is treated as the feature associated with the sequence. For each person, the attractiveness of a company is time-varying. For example, a young man may be willing to join startup companies and increase his income by jumping between different companies. With the increase of age, he is more likely to stay in the same company and pursue internal promotions. In other words, the infectivity network among different companies should be dynamic w.r.t. the age of employees. Unfortunately, most of the records on LinkedIn are short and doubly-censored — only the job hopping events of recent years are recorded. How to construct the dynamical infectivity network among different companies from SDC event sequences is still an open problem.
Applying our data synthesis method, we can stitch different users' job hopping sequences based on their ages (time stamps) and their profiles (features) and learn the dynamical network of companies over time. In particular, we select the users with relatively complete job hopping histories (i.e., those whose working experience spans many years) as the testing set; a subset of the remaining users is randomly selected as the training set. The log-likelihood of the testing set over multiple trials is shown in Fig. 5(a). We can find that the log-likelihood obtained from stitched sequences is higher than that obtained from the original SDC sequences or from the sequences generated via the bootstrap method (Politis & Romano, 1994), and its standard deviation is stably bounded. Fig. 6(a) visualizes the adjacency matrix of the infectivity network. The properties of the network verify the rationality of our results: 1) the diagonal elements of the adjacency matrix are larger than the other elements in general, which reflects the fact that most employees would like to stay in the same company and achieve a series of internal promotions; 2) with the increase of age, the infectivity network becomes sparser, which reflects the fact that users are more likely to try different companies in the early stages of their careers.
MIMIC III Data. The MIMIC III data contain the admission records of a large number of patients in the Beth Israel Deaconess Medical Center between 2001 and 2012. For each patient, her/his admission time stamps and diseases (represented via ICD-9 codes (Deyo et al., 1992)) are recorded as an event sequence, and her/his profile (including gender, race and chronic-disease history) is represented as a binary feature vector of the sequence. As mentioned above, some work (Choi et al., 2015) has been done to extract time-invariant disease networks from admission records. However, the real disease network should be time-varying w.r.t. the age of the patient. Similar to the LinkedIn data, here we only obtain SDC event sequences — the ranges of most admission records are just one or two years.
Applying our data synthesis method, we can leverage the information from different patients and stitch their sequences based on their ages and their profiles. Focusing on common disease categories, we randomly select a subset of patients' admission records as the training set and the patients with relatively complete records as the testing set. Fig. 5(b) shows that applying our data synthesis method indeed helps to improve the log-likelihood of the testing data. Compared with the bootstrap-based competitor, our data synthesis method achieves more substantial improvements. Furthermore, we visualize the adjacency matrices of the dynamical network of disease categories in Fig. 6(b). We can find that: 1) with the increase of age, the disease network becomes denser, which reflects the fact that complications become more and more common as people grow old; 2) the networks show that neoplasms and the diseases of the circulatory, respiratory, and digestive systems have strong self-triggering patterns, because the treatments of these diseases often include several phases and require patients to be admitted multiple times; 3) for kids and teenagers, the disease networks (i.e., the “Age 0” and “Age 10” networks) are very sparse, and the common diseases mainly include neoplasms and the diseases of the circulatory, respiratory, and digestive systems; 4) for middle-aged people, the reasons for admission are diverse and complicated, so the disease networks are dense and include many mutually-triggering patterns; 5) for the elderly, the disease networks (i.e., the “Age 80” and “Age 90” networks) are relatively sparser than those of middle-aged people, because their admissions are generally caused by chronic diseases of old age.
Additionally, we visualize the dynamical networks of the diseases of the circulatory system in Fig. 7 and find some interesting triggering patterns. For example, for kids (the “Age 0” network), the typical circulatory diseases are “diseases of mitral and aortic valves” (ICD-9 396) and “cardiac dysrhythmias” (ICD-9 427), which are common for premature babies and kids with congenital heart disease. For the elderly (the “Age 80” network), the network becomes dense. We can find that 1) as a main cause of death, “heart failure” (ICD-9 428) is triggered by multiple other diseases, especially “secondary hypertension” (ICD-9 405); 2) “secondary hypertension” is also likely to cause “other and ill-defined cerebrovascular disease” (ICD-9 437); 3) “hemorrhoids” (ICD-9 455), as a common disease with a strong self-triggering pattern, cause frequent admissions. In summary, the analysis above verifies the rationality of our results — the dynamical disease networks we learned indeed reflect the properties of human health trajectories.
In this paper, we propose a novel data synthesis method to learn Hawkes processes from SDC event sequences. With the help of temporal information and optional features, we measure the similarities among different SDC event sequences and estimate the distribution of potential long event sequences. Applying a sampling-stitching mechanism, we synthesize a large number of long event sequences and learn point processes robustly. We test our method on both time-invariant and time-varying Hawkes processes. Experiments show that our data synthesis method improves the robustness of learning algorithms for various models. In the future, we plan to provide more theoretical and quantitative analysis of our data synthesis method and apply it to more applications.
Acknowledgements. This work is supported in part via NSF DMS-1317424, NIH R01 GM108341, NSFC 61628203, U1609220 and the Key Program of Shanghai Science and Technology Commission 15JC1401700.
Guan, Y. and Loh, J. M. A thinned block bootstrap variance estimation procedure for inhomogeneous spatial point patterns. Journal of the American Statistical Association, 102(480):1377–1386, 2007.
Luo, D. et al. Learning mixtures of Markov chains from aggregate data with structural constraints. IEEE Transactions on Knowledge and Data Engineering, 28(6):1518–1531, 2016.