1 Introduction
Event sequence data naturally arises in analyzing the temporal behavior of real-world subjects (Cleeremans and McClelland, 1991). These sequences often contain rich information that can predict the future evolution of the subjects. For example, the timestamps of a Twitter user's tweets reflect his activeness and state of mind, and can be used to predict when he will tweet next (Kobayashi and Lambiotte, 2016). The job hopping history of a person usually suggests when he will hop next (Xu et al., 2017b). Unlike common sequential data such as text, event sequences are asynchronous and tend to be noisy (Ross et al., 1996). Therefore, specialized algorithms are needed to learn from such data.
In this paper, we are interested in short sequences, a type of sequence data that commonly appears in many real-world applications. Such data is usually short for two possible reasons. One is that the event sequences are short in nature, such as job hopping histories. The other is that the observation window is narrow. For example, we may be interested in the criminal incidents of an area after a specific regulation is published. Moreover, this kind of data usually appears as a collection of sequences, such as the timestamps of many users' tweets. Our goal is to extract information that can predict the occurrence of future events from a large collection of such short sequences.
Much of the existing literature considers medium-length or long sequences. These works first model a sequence as a parametric point process, e.g., a Poisson process, a Hawkes process, or their neural variants, and apply maximum likelihood estimation to find the optimal parameters (Ogata, 1999; Rasmussen, 2013). However, the lengths of short sequences are insufficient for reliable inference. One remedy is to treat the collection of short sequences as independent identically distributed realizations of the same point process, since many subjects, e.g., Twitter users, often share similar behaviors. This makes the inference manageable. However, the learned pattern can be highly biased against certain individuals, especially non-mainstream users, since this method ignores the heterogeneity within the collection.

An alternative is to recast the problem as a multi-task learning problem (Zhang and Yang, 2017) – we aim at multi-sequence analysis for multiple subjects. For each sequence, we consider a point process model that slightly deviates from a common point process model, i.e., h_u = h̄ + δ_u, where h̄ is the common model that captures the main effect, h_u is the model for the u-th sequence, and δ_u is the relatively small deviation. Such an assumption that there exists a universal common model across all subjects, however, is still strong, since the subjects' patterns can differ dramatically. For example, the job hopping histories of a software engineer and a human resource manager should have distinct characteristics. Furthermore, such a method ignores the relationships among the subjects, which can often be revealed by side information. For example, a social network often shows a community pattern (Girvan and Newman, 2002) – across communities the variation of the subjects is large, while within communities the variation is small. The connections in the social network, such as "follow" or retweet relationships in Twitter data, provide valuable information to identify such community patterns, but the aforementioned methods do not take such knowledge into account when analyzing subjects' behavior.
To this end, we propose a HAwkes Relational Meta LEarning method for Short Sequences (HARMLESS), which can adaptively learn from a collection of short sequences. More specifically, in a social network, each user often has multiple identities (Airoldi et al., 2008). For example, a Twitter user can be both a military fan and a tech fan. Both his tweet history and his social connections are based on his identities. Motivated by these facts, we model each sequence as a hierarchical Bayesian mixture of Hawkes processes – the weights of the Hawkes processes are determined jointly by the hidden pattern of the sequences and the relational information, e.g., social graphs.
We then propose a variational meta expectation maximization algorithm to perform inference efficiently. Different from existing fully Bayesian inference methods (Box and Tiao, 2011; Rasmussen, 2013; Xu and Zha, 2017), we make no assumption on the prior distribution of the parameters of the Hawkes processes. Instead, when inferring the Hawkes process parameters of the same identity for all subjects, we perform a model-agnostic adaptation from a common model for this identity (Finn et al. (2017); see Section 3 for more details). This is more flexible since it is not restricted to a specific parametric form. We apply HARMLESS to both synthetic and real short event sequences, and achieve competitive performance.

Notations: Throughout the paper, unbold letters denote vectors or scalars, while bold letters denote the corresponding matrices or sequences. We refer to the i-th entry of a vector v as v_i, and to the i-th subject as subject i.

2 Preliminaries
We briefly introduce Hawkes processes and Model-Agnostic Meta Learning.
A Hawkes process (Hawkes, 1971) is a doubly stochastic temporal point process with conditional intensity function defined as

λ(t) = μ + ∑_{t_i < t} φ_θ(t − t_i),

where φ_θ is the nonnegative impact function with parameter θ, μ is the base intensity, and the t_i's are the timestamps of the events occurring in the time interval [0, t). The impact function φ_θ indicates how past events affect the current intensity. Existing works usually use pre-specified impact functions in parametric form, e.g., the exponential function in Rasmussen (2013); Zhou et al. (2013) and the power-law function in Zhao et al. (2015).
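As a concrete illustration, the intensity with an exponential impact function φ_θ(s) = a·e^{−bs} (the form used later in our experiments; the numerical values below are arbitrary) can be evaluated as:

```python
import math

def hawkes_intensity(t, history, mu=0.5, a=0.8, b=1.0):
    """Conditional intensity lambda(t) = mu + sum_{t_i < t} a * exp(-b * (t - t_i)).

    `mu` is the base intensity and a * exp(-b * s) the exponential impact
    function; the parameter values here are illustrative, not from the paper.
    """
    return mu + sum(a * math.exp(-b * (t - ti)) for ti in history if ti < t)

events = [1.0, 2.5]                      # past event timestamps
print(hawkes_intensity(0.5, events))     # before any event: just the base intensity mu
print(hawkes_intensity(1.1, events))     # shortly after t = 1.0: excited above mu
```

The intensity jumps right after each event and then decays back toward the base intensity, which is exactly the self-exciting behavior discussed next.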
Hawkes processes capture an important property of real-world events – self-excitation, i.e., past events increase the chance of arrivals of new events. For example, selling a significant quantity of a stock can precipitate a trading flurry. As a result, Hawkes processes have been widely used in many areas, e.g., behavior analysis (Yang and Zha, 2013; Luo et al., 2015), financial analysis (Bacry et al., 2012), and social network analysis (Blundell et al., 2012; Zhou et al., 2013).
Model-Agnostic Meta Learning (MAML, Finn et al., 2017) considers a set of tasks {T_1, …, T_N}, where each task only contains a very small amount of data, not enough to train a model on its own. We want to exploit the shared structure of the tasks to obtain models that perform well on each of them. Specifically, MAML seeks to train a common model for all tasks. From an optimization perspective, MAML solves the following problem,
min_θ ∑_{i=1}^{N} L_i(U_i(θ)),    (1)

where U_i is an operator, L_i is the loss function of task T_i, and θ is the parameter of the common model. Here, U_i represents one or a small number of gradient updates of θ. For example, in the case of one gradient step, we take U_i(θ) = θ − η ∇_θ L_i(θ), where η is the step size. This optimization problem aims to find the common model that is expected to produce maximally effective behavior on each task after performing the update U_i.

Solving (1) using gradient descent involves computing Hessian matrices, which is computationally prohibitive. To alleviate the computational burden, First Order MAML (FOMAML) (Finn et al., 2017) and Reptile (Nichol et al., 2018) were proposed. FOMAML drops the second order term in the gradient of (1). Reptile further simplifies the computation by relaxing the original update involving the Hessian to multi-step stochastic gradient descent updates. All three algorithms can be written in the form of (1) with the operator U_i defined differently for each method. Due to the space limit, we defer the definitions of U_i to Appendix B.

3 HAwkes Relational Meta LEarning for Short Sequences (HARMLESS)
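To make the three outer updates concrete, here is a toy scalar example on quadratic task losses L_i(θ) = ½(θ − c_i)², for which the exact MAML gradient is available in closed form; the step sizes and the task optima c_i are illustrative choices, not values from the paper:

```python
# Toy scalar illustration of the MAML, FOMAML, and Reptile outer updates
# on quadratic task losses L_i(theta) = 0.5 * (theta - c_i)^2.
# eta is the inner step size, eps the outer step size.

def inner_grad(theta, c):          # dL/dtheta for one task with optimum c
    return theta - c

def adapt(theta, c, eta):          # one inner gradient step: U(theta)
    return theta - eta * inner_grad(theta, c)

def maml_outer_grad(theta, c, eta):
    # Exact gradient of L(U(theta)); for quadratics the Hessian is 1,
    # so d/dtheta L(U(theta)) = (1 - eta) * (U(theta) - c).
    return (1 - eta) * (adapt(theta, c, eta) - c)

def fomaml_outer_grad(theta, c, eta):
    # First-order MAML: drop the (1 - eta) Jacobian factor.
    return adapt(theta, c, eta) - c

def reptile_outer_grad(theta, c, eta, steps=3):
    # Reptile moves theta toward the multi-step adapted parameter.
    phi = theta
    for _ in range(steps):
        phi = adapt(phi, c, eta)
    return theta - phi

tasks = [-1.0, 1.0, 3.0]           # task optima c_i
theta, eta, eps = 0.0, 0.1, 0.05
for _ in range(200):
    g = sum(maml_outer_grad(theta, c, eta) for c in tasks) / len(tasks)
    theta -= eps * g
print(round(theta, 3))             # converges to the mean of the task optima, 1.0
```

For quadratic losses all three updates drive the common parameter toward the center of the task optima; they differ only in how the inner-loop Jacobian is approximated.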
We next introduce the meta learning method for analyzing short sequences. Suppose we are given a collection of sequences {s_u}_{u=1}^{U}. We also know some extra relational information about the subjects. For example, in social networks, we can have information on who is a friend of whom; in criminal data, we have the locations of the crimes, and crimes that happen near each other often exhibit Granger causality. Such relational information can be described as a graph G = (V, E), where V is the node set and E is the edge set. Denote the adjacency matrix of G as A.
Such social graphs often exhibit community patterns (Girvan and Newman, 2002; Xie et al., 2013). Within the communities the variation of subjects is small, while across the communities the variation is large. Moreover, the communities overlap with each other, i.e., each subject may belong to multiple communities and thus have multiple identities. The behavior of a subject is based on its identities. Motivated by this observation, we first assign each subject u a sum-to-one identity proportion vector π_u, whose k-th entry π_{u,k} represents the probability of subject u having the k-th identity. In this way, we associate each subject with multiple identities rather than a single identity so that its different aspects are captured, which is more natural and flexible.

For the k-th identity of subject u, we adopt a Hawkes process to model the timestamps of the associated events. Denote its conditional intensity function as λ_{u,k}(t). For a Hawkes process with intensity λ(t), the likelihood (Laub et al., 2015) of a sequence s = {t_1, …, t_n} appearing in the time interval [0, T) is

p(s | λ) = ∏_{i=1}^{n} λ(t_i) · exp(−∫_0^T λ(t) dt).    (2)
Here, the parameter θ_{u,k} of the Hawkes process for the k-th identity of subject u is adapted from a common model with parameter θ̃_k using a relatively small model-agnostic adaptation, which we will elaborate in the next section.
The identity of the u-th subject is then a combination of the K identities with identity proportion π_u, and the models for individual sequences are essentially mixtures of Hawkes process models. Denote the likelihood of sequence s_u under the k-th Hawkes process as p(s_u | λ_{u,k}). The likelihood for the u-th sequence is

p(s_u) = ∑_{k=1}^{K} π_{u,k} · p(s_u | λ_{u,k}).    (3)
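The mixture in (3) is a plain convex combination of per-identity likelihoods; a minimal sketch (the proportions and likelihood values below are hypothetical):

```python
def mixture_likelihood(pi_u, per_identity_likelihoods):
    """Eq. (3): p(s_u) = sum_k pi_{u,k} * p(s_u | lambda_{u,k})."""
    assert abs(sum(pi_u) - 1.0) < 1e-9, "identity proportions must sum to one"
    return sum(p * lk for p, lk in zip(pi_u, per_identity_likelihoods))

pi_u = [0.7, 0.3]                 # subject u leans toward identity 1
liks = [0.02, 0.08]               # hypothetical p(s_u | lambda_{u,k}) for k = 1, 2
print(mixture_likelihood(pi_u, liks))   # 0.7*0.02 + 0.3*0.08 = 0.038
```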
Moreover, the connections of the subjects are also based on their identities. More specifically, for each connection to happen, one subject u needs to approach another subject v, where the behaviors of the two subjects are based on their respective identities. Based on this observation, we adopt a Mixed Membership stochastic Blockmodel (MMB) (Airoldi et al., 2008) to model the connections of the subjects. For each pair of subjects (u, v), denote the identity of subject u when it approaches subject v as z_{u→v}, and the identity of subject v when it is approached by u as z_{u←v}. The probability of z_{u→v} representing the k-th identity is π_{u,k}, and the probability of z_{u←v} representing the k-th identity is π_{v,k}. Whether subjects u and v have a connection then depends on these two identities – the random variable A_{u,v} representing the existence of the connection follows a Bernoulli distribution with parameter z_{u→v}ᵀ B z_{u←v}, where B is a learnable parameter.

Generative process: The above model can be summarized as the following generative process.

For each node u:
1. Draw a K-dimensional identity proportion vector π_u ∼ Dirichlet(α).
2. Sample the u-th sequence s_u from the mixture of Hawkes processes described in (3).

For each pair of nodes u and v:
1. Draw the identity indicator for the initiator, z_{u→v} ∼ Categorical(π_u).
2. Draw the identity indicator for the receiver, z_{u←v} ∼ Categorical(π_v).
3. Sample whether there is an edge between u and v: A_{u,v} ∼ Bernoulli(z_{u→v}ᵀ B z_{u←v}).

Here, the observed variables are the sequences s and the adjacency matrix A. The parameters are the Dirichlet parameter α, the blockmodel parameter B, and the common Hawkes parameters {θ̃_k}. The latent variables are π, z_{u→v}, and z_{u←v}. The graphical model is shown in Figure 1.
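The generative process can be sketched as follows. This covers the graph part only, using K = 2 identities and an illustrative blockmodel matrix B; sampling the sequences from the Hawkes mixture is stubbed out, since it requires a thinning routine:

```python
import random

random.seed(0)
K, U = 2, 5                                   # identities, subjects
alpha = [1.0] * K                             # Dirichlet parameter (all-one, illustrative)
B = [[0.9, 0.1],                              # B[k][l]: edge probability between identities
     [0.1, 0.8]]

def dirichlet(alpha):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def categorical(pi):
    r, acc = random.random(), 0.0
    for k, p in enumerate(pi):
        acc += p
        if r < acc:
            return k
    return len(pi) - 1

pi = [dirichlet(alpha) for _ in range(U)]     # identity proportion per subject
A = [[0] * U for _ in range(U)]
for u in range(U):
    for v in range(U):
        if u == v:
            continue
        z_uv = categorical(pi[u])             # initiator identity z_{u->v}
        z_vu = categorical(pi[v])             # receiver identity z_{u<-v}
        A[u][v] = int(random.random() < B[z_uv][z_vu])
# (Each sequence s_u would additionally be drawn from the Hawkes mixture in (3).)
print(A)
```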
4 Variational Meta Expectation Maximization
We now introduce our variational meta expectation maximization algorithm. This algorithm incorporates model-agnostic adaptation into variational expectation maximization. In the rest of the paper, we denote by s = {s_u} the collection of sequences, by π = {π_u} the identity proportions, and by z the collection of identity indicators.
To ease the computation we add one more latent variable y_u. For the u-th sequence, we sample y_u ∼ Categorical(π_u), and regard s_u as a realization of the Hawkes process with parameter θ_{u,y_u}. Note that this is equivalent to the mixture of Hawkes processes described in the previous section, since marginalizing out y_u recovers (3). This eases the computation because the update for y_u now has a closed form.
Variational E step. The goal is to find an approximation of the posterior distribution of the latent variables π, z, and y given the observed sequences and graph.
We aim to find a distribution that minimizes the Kullback-Leibler (KL) divergence to the above posterior distribution. This can be achieved by maximizing the Evidence Lower BOund (ELBO, Blei et al., 2017),
max_{q ∈ Q}  E_q[log p(s, A, π, z, y)] − E_q[log q(π, z, y)],    (4)

where Q is a properly chosen distribution space. We adopt Q as the mean-field variational family, i.e.,

q(π, z, y) = ∏_u q(π_u; γ_u) ∏_{(u,v)} q(z_{u→v}; φ_{u→v}) q(z_{u←v}; φ_{u←v}) ∏_u q(y_u; ρ_u),

where q(π_u; γ_u) is the Probability Density Function (PDF) of a Dirichlet distribution, q(z_{u→v}; φ_{u→v}) and q(z_{u←v}; φ_{u←v}) are the Probability Mass Functions (PMFs) of categorical distributions, q(y_u; ρ_u) is the PMF of a categorical distribution, and γ_u, φ_{u→v}, φ_{u←v}, ρ_u are the variational parameters. By some derivation (see Appendix C for details), the updates for the variational parameters for solving problem (4) are

(5)
(6)
(7)
(8)

where ψ(·) denotes the digamma function.
Meta inference for θ_{u,k} and θ̃_k. Recall that the Hawkes parameter of the k-th identity of subject u is θ_{u,k}. Instead of specifying that θ_{u,k} is sampled from a prior distribution, we adapt the k-th common model with parameter θ̃_k to sequence s_u using MAML-type updates,

θ_{u,k} = U_u(θ̃_k), e.g., θ_{u,k} = θ̃_k + η ∇_{θ̃_k} log p(s_u | θ̃_k) for one gradient step.    (9)

Since MAML-type algorithms only perform one or a few updates from the common model, the adapted individual models with parameters θ_{u,k} within one community are close to each other, which meets our expectation that the within-community variation should be small.
The gradient step on the log-likelihood with respect to the common parameters θ̃_k can then be written as

(10)

where β is the step size. In this algorithm, we only need to estimate the common models with parameters θ̃_k, instead of all individual models. After we obtain θ̃_k, the individual models can be easily obtained from Equation (9).
M step. We perform maximum likelihood estimation for the parameters α and B. The updates are as follows,

(11)
(12)

where a suitable step size is used in the gradient update. The detailed derivation can be found in Appendix C.
5 Experiments
We first briefly introduce our experiment settings.
Impact function. Following Rasmussen (2013); Zhou et al. (2013), we choose the exponential impact function φ(t) = a·e^{−bt}. The conditional intensity function is

λ(t) = μ + ∑_{t_i < t} a·e^{−b(t − t_i)},    (13)

where a and b are parameters. Note that each Hawkes process model only contains three parameters: μ, a, and b. This is because we target short sequences. To avoid overfitting, the individual models cannot have too many parameters.
Regularized likelihood function. Substituting Eq. (13) into Eq. (2), we have

log p(s | μ, a, b) = ∑_{i=1}^{n} log λ(t_i) − μT − (a/b) ∑_{i=1}^{n} (1 − e^{−b(T − t_i)}).

To keep the parameters nonnegative, in practice we replace this log-likelihood with a regularized log-likelihood in update (10),

(14)

where θ_{u,k} = (μ_{u,k}, a_{u,k}, b_{u,k}) is the parameter of the Hawkes process of the k-th identity of the u-th subject, and the regularization coefficient is a tuning parameter.
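Substituting (13) into (2) gives a log-likelihood computable in closed form, and the model-agnostic adaptation in (9) is then just a gradient step on it. A sketch with finite-difference gradients (the sequence, parameter values, and step size are illustrative, and the nonnegativity regularizer of (14) is omitted):

```python
import math

def loglik(params, ts, T):
    """Closed-form log-likelihood of an exponential-kernel Hawkes process:
    sum_i log lambda(t_i) - mu*T - (a/b) * sum_i (1 - exp(-b*(T - t_i)))."""
    mu, a, b = params
    ll = 0.0
    for i, ti in enumerate(ts):
        lam = mu + sum(a * math.exp(-b * (ti - tj)) for tj in ts[:i])
        ll += math.log(lam)
    ll -= mu * T
    ll -= (a / b) * sum(1.0 - math.exp(-b * (T - ti)) for ti in ts)
    return ll

def adapt(common, ts, T, eta=0.01, h=1e-5):
    """One MAML-type adaptation step: common + eta * grad loglik (finite diff)."""
    grad = []
    for j in range(3):
        up = list(common); up[j] += h
        dn = list(common); dn[j] -= h
        grad.append((loglik(up, ts, T) - loglik(dn, ts, T)) / (2 * h))
    return [c + eta * g for c, g in zip(common, grad)]

seq = [0.3, 0.5, 0.9, 1.4]          # a toy short sequence on [0, 2)
common = [0.5, 0.8, 1.0]            # common-model parameters (mu, a, b)
adapted = adapt(common, seq, T=2.0)
print(loglik(adapted, seq, 2.0) >= loglik(common, seq, 2.0))  # ascent step improves fit
```

In HARMLESS this adaptation is performed per identity and per subject, with the regularized objective of (14) in place of the raw log-likelihood.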
Evaluation metric. We hold out the last timestamp of each sequence, and split the held-out timestamps into a validation set and a test set. Another option for validation and testing on event sequence data is to hold out the last two timestamps – we first use the earlier ones for validation, then train a new model together with the validation timestamps, and finally report the test result based on the later ones. However, this is not suitable here, because the sequences we adopt for the experiments are usually very short, sometimes with no more than 5 events in one sequence. As a result, the models trained without and with the validation timestamps, e.g., using 3 or 4 timestamps, can be significantly different, which makes such a validation procedure very unreliable.
We report the Log-Likelihood (LL) on the test set. More specifically, for each sequence s = {t_1, …, t_n} with intensity λ determined by parameter θ, the likelihood of the next arrival t_{n+1} is

p(t_{n+1} | s, θ) = λ(t_{n+1}) · exp(−∫_{t_n}^{t_{n+1}} λ(t) dt).

The reported score is the log-likelihood averaged over subjects. More details can be found in Appendix D.
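For the exponential kernel (13), the integral in the next-arrival density also has a closed form; a sketch with illustrative parameter values:

```python
import math

def next_arrival_loglik(t_next, ts, mu=0.5, a=0.8, b=1.0):
    """log [ lambda(t_next) * exp(-int_{t_n}^{t_next} lambda(s) ds) ]
    for the exponential-kernel intensity of Eq. (13)."""
    t_n = ts[-1]
    lam = mu + sum(a * math.exp(-b * (t_next - ti)) for ti in ts)
    integral = mu * (t_next - t_n) + sum(
        (a / b) * (math.exp(-b * (t_n - ti)) - math.exp(-b * (t_next - ti)))
        for ti in ts)
    return math.log(lam) - integral

history = [0.3, 0.5, 0.9, 1.4]     # observed prefix of a held-out sequence
print(next_arrival_loglik(1.6, history))
```

Exponentiating this quantity gives a proper density over the next arrival time, so it integrates to (nearly) one over any long enough horizon.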
To estimate the variance of the estimated log-likelihood, we adopt a multi-split procedure for evaluation. First, we train several candidate models with different hyperparameters. Then we repeat the following procedure multiple times: 1) randomly split a validation set and a test set; 2) pick the model with the highest log-likelihood on the validation set from the candidate models; 3) compute the log-likelihood on the test set. Accordingly, we obtain multiple estimates of the log-likelihood. We then report the mean and standard error of these estimates.

Baselines. We adopt four baselines as follows.
MLESep: We consider each sequence as a realization of an individual Hawkes process. We perform Maximum Likelihood Estimation (MLE) on each sequence separately, and obtain one model per sequence.
MLECom: We consider all sequences as realizations of the same Hawkes process and learn a common model by MLE.
DMHP (Xu and Zha, 2017): We model sequences as a mixture of Hawkes processes with a Dirichlet distribution as the prior distribution of the mixtures.
MTL: We perform multi-task learning as described in Section 1. More specifically, we adopt Hawkes process models for the common model and the individual models. Denote the parameters of the common model and of the u-th individual model as θ̄ and θ_u, respectively. We solve

max_{θ̄, {θ_u}} ∑_{u=1}^{U} [ ℓ(s_u; θ_u) − ν ‖θ_u − θ̄‖² ],

where the norm regularizer ‖θ_u − θ̄‖² promotes a small difference between θ_u and θ̄, ν is a tuning parameter, and ℓ is the function defined in Eq. (14).
Parameter Tuning. The detailed tuning procedure and detailed settings of each experiment can be found in Appendix E.
Table 1: Visualization of the ground truth communities and the communities learned by HARMLESS. (Graphics omitted.)
5.1 Synthetic Data
Data generation. We generate a dataset of 50 nodes with several communities. For each community k, we generate Hawkes meta parameters using uniform distributions. We set the Dirichlet parameter α to be all-one, i.e., the entries of α are all one. Then for the u-th node, the identity proportion π_u is sampled from Dirichlet(α) and the membership indicator from the corresponding categorical distribution Categorical(π_u). We then generate the Hawkes parameters of each node by adding a small perturbation to the meta parameters of its community.

Each sequence s_u is then sampled based on a Hawkes process with the perturbed parameters in a time interval [0, T). To ease the tuning we normalize the sequences by dividing by the largest timestamp. We sample the graph edges based on the blockmodel parameter B. The generated graphs are visualized in the second column of Table 1.
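A standard way to sample such Hawkes sequences is Ogata's thinning algorithm; a minimal sketch for the exponential kernel (the parameter values are illustrative, not the ones used in the experiments):

```python
import math
import random

def sample_hawkes(mu, a, b, T, seed=0):
    """Ogata's thinning: propose points from a dominating homogeneous Poisson
    process with rate lam_bar >= lambda(t), accept each with prob lambda(t)/lam_bar."""
    rng = random.Random(seed)
    ts, t = [], 0.0
    while t < T:
        lam_bar = mu + a * len(ts)          # valid bound: each past event contributes <= a
        t += rng.expovariate(lam_bar)       # next candidate point
        if t >= T:
            break
        lam = mu + sum(a * math.exp(-b * (t - ti)) for ti in ts)
        if rng.random() < lam / lam_bar:
            ts.append(t)
    return ts

seq = sample_hawkes(mu=0.5, a=0.8, b=1.0, T=10.0)
print(len(seq), all(t1 < t2 for t1, t2 in zip(seq, seq[1:])))  # increasing timestamps
```

The bound lam_bar stays valid between accepted events because the exponential kernel only decays; with a/b < 1 the process is stable, so the sampler terminates.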
Visualization of communities. We visualize the communities learned by HARMLESS (MAML) in Table 1. Denote by K the number of communities specified in HARMLESS. We adopt colors corresponding to the communities in the graph. The color of each node shown in Table 1 is the linear combination of the RGB values of the colors weighted by the identity proportions π_u.

HARMLESS produces reasonable identities even if K is misspecified. If K is smaller than the true number of communities, some of the communities merge. If K is larger, some of the communities split.
Benefit of joint training. To validate the benefit of joint training on graphs and sequences, we compare the HARMLESS result with a two-step procedure: we first train an MMB model to obtain the identities, and then train HARMLESS (MAML) with the identities fixed. In Figure 2 we plot the obtained log-likelihood with respect to K.

HARMLESS (MAML) consistently achieves a larger log-likelihood than the two-step procedure. This suggests that joint training on the graphs and the sequences indeed improves the prediction of future events.

Log-likelihood with respect to K. We also include the results of the baselines and HARMLESS (FOMAML) in Figure 2. The performance of HARMLESS is consistently better than the baselines. Besides, we find that the performance of HARMLESS (Reptile) is very dependent on the dataset. For this synthetic dataset, Reptile does not perform well.
Table 2: Test log-likelihood of MLESep, MLECom, DMHP, MTL, and HARMLESS (MAML, FOMAML, Reptile) on the 911Calls, MathOverflow, and StackOverflow datasets. (Values omitted.)
5.2 Real Data
We adopt four real datasets.
911Calls dataset: The 911Calls dataset (data provided by montcoalert.org) contains emergency phone call records of fire, traffic, and other emergencies for Montgomery County, PA. The county is divided into disjoint areas, each of which has a unique ZIP code. For each area, the timestamps of the emergency phone calls in this area are recorded as an event sequence. We consider each area as a subject, and two subjects are connected if their areas are adjacent.

LinkedIn dataset: The LinkedIn dataset (Xu et al., 2017b) contains job hopping records of users. For each user, her/his check-in timestamps corresponding to different companies are recorded as an event sequence. We consider each user as a subject, and two subjects are connected if the difference between the timestamps at which the two users joined the same company is less than 2 weeks. We remove the singleton subjects.

MathOverflow dataset: The MathOverflow dataset (Paranjape et al., 2017) contains records of users posting and answering math questions. We adopt the records from May 2, 2014 to March 6, 2016. For each user, her/his timestamps of answering questions are recorded as an event sequence. We consider each user as a subject, and two subjects are connected if one user answers another user's question. We remove the singleton subjects.

StackOverflow dataset: StackOverflow is a question and answer site similar to MathOverflow. We adopt the records from November 8, 2015 to December 1, 2015. We construct the sequences and graphs in the same way as for MathOverflow. We remove the singleton subjects.
Result: The log-likelihood is summarized in Table 2. Note that since Markov chain Monte Carlo sampling is needed for DMHP, we cannot get reasonable results on the large dataset, i.e., StackOverflow. HARMLESS performs consistently better than the baselines. Since the standard errors of the results on the 911Calls dataset are large, we also performed a paired t-test. The test shows that the difference in log-likelihood between MLECom, i.e., the best of the baselines, and HARMLESS (FOMAML), i.e., the best of the HARMLESS variants, is statistically significant.

Table 3: Ablation study on the LinkedIn dataset. Rows: HARMLESS (MAML), HARMLESS (FOMAML), HARMLESS (Reptile); remove inner heterogeneity (two settings); remove grouping (MAML, FOMAML, Reptile); remove graph (MAML, FOMAML, Reptile). (Values omitted.)
5.3 Ablation Study
We then perform ablation study using LinkedIn dataset. Three sets of ablation study are considered here:
Remove inner heterogeneity: We model each community of sequences using the same parameters, i.e., we set .
Remove grouping: We set , so that the whole graph is one community. This equivalent to apply the MAMLtype algorithms on the sequences directly.
Remove graph: We do not consider the graph information, i.e., we remove , , and from the panel in Figure 1.
The results in Table 3 suggest that MAMLtype adaptation, graph information, and using multiple identities all contribute to the good performance of HARMLESS.
6 Discussions
The setting of meta learning. The goal of conventional meta learning is to train a model on a set of tasks so that it can quickly adapt to a new task with only a few training samples. Therefore, the tasks are divided into a meta training set and a meta test set, where each task contains a training set and a test set. The meta model is trained on the meta training set, aiming to minimize the test errors, and validated on the meta test set (Vinyals et al., 2016; Santoro et al., 2016). This setting is designed for supervised learning or reinforcement learning tasks that have accuracy or reward as a clear evaluation metric. Extracting information from event sequences, however, is essentially an unsupervised learning task. Therefore, we do not separate a meta training set and a meta test set. Instead, we pool the collection of tasks together, and aim to extract the shared information of the collection to help the training of models on individual tasks. Here, each short sequence is a task. We exploit the shared pattern of the collection of sequences to obtain the models for individual sequences.
Community Pattern. The target of Mixed Membership stochastic Blockmodels (MMB) is to identify the communities in a social graph, e.g., the classes in a school. However, real social graphs cannot always be viewed as Erdős-Rényi (ER) graphs, as assumed by MMB. As argued in Karrer and Newman (2011), for real-world networks MMB tends to assign nodes with similar degrees to the same communities, which differs from the popular interpretation of the community pattern. This property, however, is actually very helpful in our case. As an example, Twitter users that are more active tend to have similar behavior: they tend to make more connections and post tweets more frequently. In contrast, users with very different node degrees often have tweet histories with different characteristics, and thus should be assigned to different identities. This property of MMB allows the identities in HARMLESS to represent such nontraditional community patterns in non-ER graphs, i.e., it assigns subjects with different levels of activeness to different communities.
Mixture of Hawkes processes. Many existing works adopt mixtures of Hawkes processes to model sequences that are generated from complicated mechanisms (Yang and Zha, 2013; Li and Zha, 2013; Xu and Zha, 2017). These works are different from HARMLESS since they consider neither the hierarchical heterogeneity of the sequences nor the relational information.
Variants of Hawkes process. Some attempts have been made to further enhance the flexibility of Hawkes processes. For example, the time-dependent Hawkes process (TiDeH) in Kobayashi and Lambiotte (2016) and the neural network-based Hawkes process (NSMMPP) in Mei and Eisner (2017) learn very flexible Hawkes processes with complicated intensity functions. These models usually have more parameters than vanilla Hawkes processes. For longer sequences, HARMLESS can also be naturally extended to TiDeH or NSMMPP. However, this work focuses on short sequences, where these methods are less suitable, since they have too many degrees of freedom.
References

Achab et al. (2017) Achab, M., Bacry, E., Gaïffas, S., Mastromatteo, I. and Muzy, J.F. (2017). Uncovering causality from multivariate hawkes integrated cumulants. The Journal of Machine Learning Research, 18 6998–7025.
 Airoldi et al. (2008) Airoldi, E. M., Blei, D. M., Fienberg, S. E. and Xing, E. P. (2008). Mixed membership stochastic blockmodels. Journal of machine learning research, 9 1981–2014.
 Bacry et al. (2012) Bacry, E., Dayri, K. and Muzy, J.F. (2012). Nonparametric kernel estimation for symmetric hawkes processes. application to high frequency financial data. The European Physical Journal B, 85 157.
 Bauwens and Hautsch (2009) Bauwens, L. and Hautsch, N. (2009). Modelling financial high frequency data using point processes. In Handbook of financial time series. Springer, 953–979.
 Bengio et al. (1990) Bengio, Y., Bengio, S. and Cloutier, J. (1990). Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle.
 Blei et al. (2017) Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112 859–877.
 Blundell et al. (2012) Blundell, C., Beck, J. and Heller, K. A. (2012). Modelling reciprocating relationships with hawkes processes. In Advances in Neural Information Processing Systems.
 Box and Tiao (2011) Box, G. E. and Tiao, G. C. (2011). Bayesian inference in statistical analysis, vol. 40. John Wiley & Sons.
 Chalmers (1991) Chalmers, D. J. (1991). The evolution of learning: An experiment in genetic connectionism. In Connectionist Models. Elsevier, 81–90.
 Cleeremans and McClelland (1991) Cleeremans, A. and McClelland, J. L. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General, 120 235.
 Eichler et al. (2017) Eichler, M., Dahlhaus, R. and Dueck, J. (2017). Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38 225–242.
 Farajtabar et al. (2017) Farajtabar, M., Yang, J., Ye, X., Xu, H., Trivedi, R., Khalil, E., Li, S., Song, L. and Zha, H. (2017). Fake news mitigation via point process based intervention. In Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org.
 Farajtabar et al. (2016) Farajtabar, M., Ye, X., Harati, S., Song, L. and Zha, H. (2016). Multistage campaigning in social networks. In Advances in Neural Information Processing Systems.
 Finn et al. (2017) Finn, C., Abbeel, P. and Levine, S. (2017). Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org.
 Finn et al. (2018) Finn, C., Xu, K. and Levine, S. (2018). Probabilistic modelagnostic metalearning. In Advances in Neural Information Processing Systems.
 Fox et al. (2016) Fox, E. W., Short, M. B., Schoenberg, F. P., Coronges, K. D. and Bertozzi, A. L. (2016). Modeling email networks and inferring leadership using selfexciting point processes. Journal of the American Statistical Association, 111 564–584.
 Girvan and Newman (2002) Girvan, M. and Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the national academy of sciences, 99 7821–7826.
 Grant et al. (2018) Grant, E., Finn, C., Levine, S., Darrell, T. and Griffiths, T. (2018). Recasting gradientbased metalearning as hierarchical bayes. arXiv preprint arXiv:1801.08930.
 Hansen et al. (2015) Hansen, N. R., ReynaudBouret, P., Rivoirard, V. et al. (2015). Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21 83–143.
 Hawkes (1971) Hawkes, A. G. (1971). Spectra of some selfexciting and mutually exciting point processes. Biometrika, 58 83–90.
 Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C. and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14 1303–1347.
 Karrer and Newman (2011) Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical review E, 83 016107.
 Kobayashi and Lambiotte (2016) Kobayashi, R. and Lambiotte, R. (2016). Tideh: Timedependent hawkes process for predicting retweet dynamics. In Tenth International AAAI Conference on Web and Social Media.

Koch et al. (2015) Koch, G., Zemel, R. and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, vol. 2.
 Laub et al. (2015) Laub, P. J., Taimre, T. and Pollett, P. K. (2015). Hawkes processes. arXiv preprint arXiv:1507.02822.
 Li and Zha (2013) Li, L. and Zha, H. (2013). Dyadic event attribution in social networks with mixtures of hawkes processes. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. ACM.
 Linderman and Adams (2014) Linderman, S. and Adams, R. (2014). Discovering latent network structure in point process data. In International Conference on Machine Learning.

Luo et al. (2015) Luo, D., Xu, H., Zhen, Y., Ning, X., Zha, H., Yang, X. and Zhang, W. (2015). Multi-task multi-dimensional hawkes processes for modeling event sequences. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
 Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D. and Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning.
 Mei and Eisner (2017) Mei, H. and Eisner, J. M. (2017). The neural hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems.
 Munkhdalai and Yu (2017) Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org.
 Nichol et al. (2018) Nichol, A., Achiam, J. and Schulman, J. (2018). On firstorder metalearning algorithms. arXiv preprint arXiv:1803.02999.
 Nichol and Schulman (2018) Nichol, A. and Schulman, J. (2018). Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999.
 Ogata (1999) Ogata, Y. (1999). Seismicity analysis through pointprocess modeling: A review. In Seismicity patterns, their statistical significance and physical meaning. Springer, 471–507.
 Paranjape et al. (2017) Paranjape, A., Benson, A. R. and Leskovec, J. (2017). Motifs in temporal networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM.
 Rasmussen (2013) Rasmussen, J. G. (2013). Bayesian inference for hawkes processes. Methodology and Computing in Applied Probability, 15 623–642.
Ravi and Beatson (2018) Ravi, S. and Beatson, A. (2018). Amortized Bayesian meta-learning.
Ravi and Larochelle (2016) Ravi, S. and Larochelle, H. (2016). Optimization as a model for few-shot learning.
Reynaud-Bouret et al. (2010) Reynaud-Bouret, P., Schbath, S. et al. (2010). Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38 2781–2822.
 Ross et al. (1996) Ross, S. M., Kelly, J. J., Sullivan, R. J., Perry, W. J., Mercer, D., Davis, R. M., Washburn, T. D., Sager, E. V., Boyce, J. B. and Bristow, V. L. (1996). Stochastic processes, vol. 2. Wiley New York.
Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D. and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning.
Snell et al. (2017) Snell, J., Swersky, K. and Zemel, R. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems.

Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. and Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Tran et al. (2015) Tran, L., Farajtabar, M., Song, L. and Zha, H. (2015). NetCodec: Community detection from individual activities. In Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM.
Trivedi et al. (2018) Trivedi, R., Farajtabar, M., Biswal, P. and Zha, H. (2018). DyRep: Learning representations over dynamic graphs.
Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D. et al. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems.
Xie et al. (2013) Xie, J., Kelley, S. and Szymanski, B. K. (2013). Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys (CSUR), 45 43.
Xu et al. (2017a) Xu, H., Luo, D., Chen, X. and Carin, L. (2017a). Benefits from superposed Hawkes processes. arXiv preprint arXiv:1710.05115.
Xu et al. (2017b) Xu, H., Luo, D. and Zha, H. (2017b). Learning Hawkes processes from short doubly-censored event sequences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org.
Xu and Zha (2017) Xu, H. and Zha, H. (2017). A Dirichlet mixture model of Hawkes processes for event sequence clustering. In Advances in Neural Information Processing Systems.
Yang and Zha (2013) Yang, S.-H. and Zha, H. (2013). Mixture of mutually exciting processes for viral diffusion. In International Conference on Machine Learning.
Zarezade et al. (2017) Zarezade, A., Khodadadi, A., Farajtabar, M., Rabiee, H. R. and Zha, H. (2017). Correlated cascades: Compete or cooperate. In Thirty-First AAAI Conference on Artificial Intelligence.
Zhang and Yang (2017) Zhang, Y. and Yang, Q. (2017). A survey on multi-task learning. arXiv preprint arXiv:1707.08114.
Zhao et al. (2015) Zhao, Q., Erdogdu, M. A., He, H. Y., Rajaraman, A. and Leskovec, J. (2015). SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
Zhou et al. (2013) Zhou, K., Zha, H. and Song, L. (2013). Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics.
Appendix A Related Works
Hawkes Process Hawkes processes have long been used to model event sequences (Hawkes, 1971), such as earthquake aftershock sequences (Ogata, 1999), financial transactions (Bauwens and Hautsch, 2009), and events on social networks (Fox et al., 2016; Farajtabar et al., 2017). Their variant, the mixture of Hawkes processes model, has also proved effective in many areas (Yang and Zha, 2013; Li and Zha, 2013; Xu and Zha, 2017). In most cases, the learning methodology is variational inference or maximum likelihood estimation (Rasmussen, 2013; Zhou et al., 2013; Zhao et al., 2015). Other possible methods include least-squares-based methods (Eichler et al., 2017), Wiener-Hopf-based methods (Bacry et al., 2012), and cumulants-based methods (Achab et al., 2017).
Instead of predefining an impact function, some nonparametric methods use discretization or kernel estimation when learning the model (Reynaud-Bouret et al., 2010; Zhou et al., 2013; Hansen et al., 2015). Those methods usually target small datasets and do not require good scalability. Recently, some attempts have been made to further enhance the flexibility of Hawkes processes. The time-dependent Hawkes process (TiDeH) in Kobayashi and Lambiotte (2016) and the neural network-based Hawkes process in Mei and Eisner (2017) learn very flexible Hawkes processes with complicated intensity functions. Those methods usually target very long and multi-dimensional sequences, rather than short sequences.
Existing works targeting short sequences usually consider specific cases (Xu et al., 2017a,b), such as censored data. To our knowledge, however, there is no work targeting general short sequences as we do here.
There are lines of research that involve both point processes and graphs. One uses point processes to infer a latent graph (Blundell et al., 2012; Linderman and Adams, 2014; Tran et al., 2015). Another models the interactions between nodes as a point process and uses it to construct a dynamic graph, rather than considering events that happen on the nodes as we do here (Farajtabar et al., 2016; Zarezade et al., 2017; Trivedi et al., 2018). These works have very different aims from ours.
Meta Learning Meta learning has been studied since the last century (Bengio et al., 1990; Chalmers, 1991). Some works focus on learning hyperparameters, such as learning rates or initial conditions (Maclaurin et al., 2015). Some works aim to learn a metric so that a simple k-nearest-neighbors classifier can perform well under that metric (Koch et al., 2015; Vinyals et al., 2016; Sung et al., 2018; Snell et al., 2017). Some works design specific deep neural networks so that the information from different tasks is memorized and the model can thus easily generalize to new tasks (Santoro et al., 2016; Munkhdalai and Yu, 2017; Ravi and Larochelle, 2016).
The Model-Agnostic Meta-Learning (MAML) method (Finn et al., 2017) opened another line of research: it designs an optimization scheme so that the model can quickly adapt to new tasks. Reptile (Nichol and Schulman, 2018), a variant of MAML, was proposed to simplify the computation of MAML. None of those works, however, considers the relational information between tasks as our method does, which is critical in modeling short sequences.
One interesting line of follow-up work on MAML connects MAML with Bayesian inference (Finn et al., 2018; Ravi and Beatson, 2018; Grant et al., 2018). Since HARMLESS combines a Bayesian model with MAML, it has the potential to be recast as a pure Bayesian model with better quantification of uncertainty. We leave this for future work.
Appendix B Definition of the Operator
As we mentioned earlier, $\mathcal{L}(U(\theta))$ is the loss function for the MAML, FOMAML, and Reptile algorithms, each with a different definition of the operator $U$. For simplicity, here we define the operator for one gradient step. The case of a few gradient steps can be defined analogously.
For MAML, $U(\theta)$ is defined as $U(\theta) = \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$, where $\alpha$ is the inner-loop step size.
For First-Order MAML (FOMAML), $U(\theta)$ is also defined as $U(\theta) = \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$. The difference is that the output of the operator is treated as just a value, not a function of $\theta$; i.e., when we compute the gradient of $\mathcal{L}(U(\theta))$, the gradient does not backpropagate into $U(\theta)$.
For Reptile, the algorithm is as follows (Nichol and Schulman, 2018).
From the algorithm we can see that the operator is likewise defined as $U(\theta) = \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$. Similarly to FOMAML, computing the gradient does not backpropagate into $U(\theta)$.
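To make the three operators concrete, here is a minimal NumPy sketch on a toy quadratic loss; the names (alpha, c_train, c_test) and the loss itself are illustrative assumptions, not the paper's model:

```python
import numpy as np

alpha = 0.1  # inner-loop step size (illustrative value)

# Toy inner loss L_c(theta) = 0.5 * ||theta - c||^2, so grad L_c(theta) = theta - c.
def grad_loss(theta, c):
    return theta - c

def U(theta, c):
    """One inner gradient step: U(theta) = theta - alpha * grad L(theta)."""
    return theta - alpha * grad_loss(theta, c)

def maml_meta_grad(theta, c_train, c_test):
    # MAML: differentiate L_test(U(theta)) *through* U.
    # For this quadratic loss, the Jacobian dU/dtheta = (1 - alpha) * I.
    return (1.0 - alpha) * grad_loss(U(theta, c_train), c_test)

def fomaml_meta_grad(theta, c_train, c_test):
    # FOMAML: same operator U, but U(theta) is treated as a constant,
    # i.e. the Jacobian dU/dtheta is dropped.
    return grad_loss(U(theta, c_train), c_test)

def reptile_meta_grad(theta, c_train, steps=3):
    # Reptile: apply U several times and move theta toward the result,
    # again without backpropagating into U.
    phi = theta.copy()
    for _ in range(steps):
        phi = U(phi, c_train)
    return theta - phi
```

The (1 - alpha) factor in the MAML gradient is exactly the Jacobian term that FOMAML discards by treating U(theta) as a value.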
Appendix C Derivation of Variational EM
Preparation After adding the latent variables, the joint distribution is
where
Note that in this section we represent the latent variables as one-hot vectors, while in the main paper we use scalars representing the identities.
The posterior distribution is defined as
We aim to find a distribution $q$, such that the Kullback-Leibler (KL) divergence between the above posterior distribution and $q$ is minimized. This can be achieved by maximizing the Evidence Lower Bound (ELBO),
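In standard notation, writing $\mathcal{D}$ for the observed data and $Z$ for the latent variables (the symbols here are generic, not the paper's), the ELBO can be written as

```latex
\mathrm{ELBO}(q)
  = \mathbb{E}_{q(Z)}\big[\log p(\mathcal{D}, Z)\big]
    - \mathbb{E}_{q(Z)}\big[\log q(Z)\big]
  = \log p(\mathcal{D})
    - \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid \mathcal{D})\big).
```

Since $\log p(\mathcal{D})$ does not depend on $q$, maximizing the ELBO over $q$ is equivalent to minimizing the KL divergence to the posterior.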
Variational family We adopt the mean-field variational family, i.e.,
We pick as PDF of , as PDF of , as PDF of , as PDF of .
Update for Again, our goal is to maximize
Now we focus on one factor, and treat the others as given. We want to maximize
Take the derivative,
Substituting the expressions of the distributions, after some derivation we obtain the update as
(15) 
Update for Similarly, we have
Take the derivative,
After some derivation, we have
(16)  
(17) 
where is the digamma function.
Update for and The derivation of these updates is very similar to the previous one, so we do not elaborate on it. Interested readers may also refer to Airoldi et al. (2008). The updates are
(18)  
(19) 
Update for We update using gradient ascent. We first pick out the terms that are relevant,
So the gradient ascent update is,
(20) 
Update for and From Airoldi et al. (2008), we have the update for and as follows
(21)  
(22) 
Appendix D Derivation of Evaluation Metric
In this section, we give more details on the evaluation metrics. Specifically, we show how to compute the NLL of the test set. Given a sequence $\{t_1, \dots, t_n\}$, we would like to predict the timestamp of the next event $t_{n+1}$. Here, we use the probability density of an arrival at time $t_{n+1}$ and no arrival in $(t_n, t_{n+1})$ given the history before $t_n$, i.e., $p(t_{n+1} \mid \mathcal{H}_{t_n}) = \lambda(t_{n+1}) \exp\big(-\int_{t_n}^{t_{n+1}} \lambda(s)\, ds\big)$.
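As a concrete sketch, for a univariate Hawkes process with an exponential kernel this density is available in closed form, so the NLL of the next arrival can be computed directly; the parameter values mu, a, b below are illustrative placeholders, not fitted values:

```python
import math

def intensity(t, history, mu=0.5, a=0.8, b=1.0):
    """Hawkes intensity with exponential kernel:
    lambda(t) = mu + sum over t_i < t of a * exp(-b * (t - t_i))."""
    return mu + sum(a * math.exp(-b * (t - ti)) for ti in history if ti < t)

def next_event_nll(t_next, history, mu=0.5, a=0.8, b=1.0):
    """NLL of an arrival at t_next and no arrival in (t_n, t_next):
    -log lambda(t_next) + integral of lambda(s) over (t_n, t_next)."""
    t_n = history[-1]
    # Compensator: baseline part plus the closed-form integral of each
    # excitation term over (t_n, t_next).
    comp = mu * (t_next - t_n)
    for ti in history:
        comp += (a / b) * (math.exp(-b * (t_n - ti)) - math.exp(-b * (t_next - ti)))
    return -math.log(intensity(t_next, history, mu, a, b)) + comp
```

The per-event NLLs can then be summed (or averaged) over the held-out events to obtain a test-set metric.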