1 Introduction
Real-world event sequences are often modeled by temporal point processes. Specifically, a temporal point process with $C$ event types can be represented as a counting process $N = \{N_c(t)\}_{c=1}^{C}$, where each $N_c(t)$ is the number of type-$c$ events happening at or before time $t$. As a special kind of point process, the Hawkes process (Hawkes, 1971) formulates the expected instantaneous happening rate of type-$c$ events, also called the intensity function, as
$$\lambda_c(t) = \mu_c + \sum_{(t_i, c_i) \in \mathcal{H}_t} \phi_{c c_i}(t - t_i),$$
where $\mathcal{H}_t = \{(t_i, c_i) \,|\, t_i < t\}$ contains the historical events before time $t$, $\mu_c$ is the base intensity capturing exogenous fluctuations of type-$c$ events, and $\phi_{cc'}(t)$ is the impact function measuring the infectivity of type-$c'$ events on type-$c$ events over time. Therefore, we denote an event sequence obeying a Hawkes process as $s \sim \mathrm{HP}(\theta)$, where $\theta = \{\mu_c, \phi_{cc'}\}_{c,c'=1}^{C}$ collects the base intensities and impact functions.
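For concreteness, the intensity above can be evaluated numerically. The sketch below assumes exponential impact functions $\phi_{cc'}(t) = a_{cc'}\,\beta e^{-\beta t}$, a common parametrization that the text leaves general; the names `intensity`, `mu`, `A`, and `beta` are illustrative, not from the paper.

```python
import numpy as np

def intensity(t, history, mu, A, beta=1.0):
    """Evaluate lambda_c(t) for every event type c of a Hawkes process.

    Assumes exponential impact functions phi_{cc'}(s) = A[c, c'] * beta * exp(-beta * s).
    t       : current time
    history : list of (t_i, c_i) events with t_i < t
    mu      : (C,) array of base intensities
    A       : (C, C) infectivity matrix; A[c, cp] scales the impact of type-cp events on type c
    """
    lam = mu.astype(float).copy()
    for t_i, c_i in history:
        if t_i < t:
            lam += A[:, c_i] * beta * np.exp(-beta * (t - t_i))
    return lam
```

Each past event adds an exponentially decaying bump to the intensity of every type it can excite, on top of the constant base rate.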
As an extension of the Hawkes process, the mixture model of Hawkes processes (MixHP) is capable of describing the clustering structure of different event sequences and capturing the dependency among events within each cluster. MixHP has been used to model real-world event sequences, e.g., patient admissions (Xu & Zha, 2017), social behaviors (Yang & Zha, 2013), and user logs of information systems (Luo et al., 2015). Suppose that the sequences in $\mathcal{S} = \{s_n\}_{n=1}^{N}$ are generated via $K$ different Hawkes processes, i.e.,
(1) $s_n \sim \mathrm{HP}(\theta_{k_n}), \quad k_n \sim \mathrm{Categorical}(\boldsymbol{\pi}),$

where $\boldsymbol{\pi} = [\pi_1, \dots, \pi_K]$ represents the distribution of the Hawkes processes. Accordingly, the likelihood of a sequence is represented as

(2) $p(s_n; \Theta) = \sum_{k=1}^{K} \pi_k\, p(s_n \mid \theta_k), \quad \Theta = \{\pi_k, \theta_k\}_{k=1}^{K}.$

Here, $p(s_n \mid \theta_k)$ is the likelihood of the sequence conditioned on the $k$-th Hawkes process $\mathrm{HP}(\theta_k)$.
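The mixture likelihood in (2) can be computed from per-component Hawkes likelihoods. The sketch below again assumes exponential impact functions (the text leaves the kernel general) and uses the standard point-process log-likelihood: the sum of log-intensities at the event times minus the integrated intensity over the observation window $[0, T]$. All function and variable names are illustrative.

```python
import numpy as np

def hawkes_loglik(seq, T, mu, A, beta=1.0):
    """log p(seq | theta) for one Hawkes process with exponential kernels
    phi_{cc'}(s) = A[c, c'] * beta * exp(-beta * s) (an assumed kernel form).
    seq : list of (t_i, c_i) pairs sorted by time; T : observation horizon."""
    ll = 0.0
    for i, (t_i, c_i) in enumerate(seq):
        lam = mu[c_i]
        for t_j, c_j in seq[:i]:
            lam += A[c_i, c_j] * beta * np.exp(-beta * (t_i - t_j))
        ll += np.log(lam)
    # Compensator: integral of every lambda_c over [0, T].
    ll -= mu.sum() * T
    for t_j, c_j in seq:
        ll -= A[:, c_j].sum() * (1.0 - np.exp(-beta * (T - t_j)))
    return ll

def mixture_loglik(seq, T, pis, thetas, beta=1.0):
    """log p(seq; Theta) = log sum_k pi_k p(seq | theta_k), computed stably
    via the max-shift trick. thetas is a list of (mu, A) pairs."""
    logs = np.array([np.log(pk) + hawkes_loglik(seq, T, mu, A, beta)
                     for pk, (mu, A) in zip(pis, thetas)])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())
```

With `A` set to zero the model degenerates to independent Poisson processes, which gives an easy sanity check on the compensator term.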
Given observed sequences, we can apply maximum likelihood estimation (MLE) to learn the target mixture model, as shown in Figure 1(a). However, in practice this learning strategy often suffers from insufficient data. For example, in the admission record dataset MIMIC-III (Johnson et al., 2016), most patients have only a few admissions in ten years, while there are over 600 kinds of diseases (i.e., types of events). Learning from such short sequences leads to serious overfitting. For a single Hawkes process, this problem can be mitigated by various data augmentation techniques, e.g., randomly stitching (Xu et al., 2017) or superposing (Xu et al., 2018b) the original sequences. Unfortunately, for mixture models of Hawkes processes, these techniques cannot be applied directly, because in general the stitching/superposing result of two sequences from different clusters does not obey a Hawkes process, which may cause serious model misspecification.
To overcome the challenges above, we propose a novel adversarial self-paced learning (ASPL) method, trained iteratively to robustly learn mixture models of Hawkes processes. As shown in Figure 1(c), in each iteration we actively generate candidate "easy" sequences via data augmentation (i.e., random superposition and stitching). Then, based on MLE with an adversarial self-paced (Bengio et al., 2009; Kumar et al., 2010) regularizer, we use these candidate sequences to learn the target model and select "easy" sequences for the next iteration.
The proposed learning method is based on two facts. First, the MixHP model learned from the augmented sequences is always misspecified to some degree, because most of the augmented sequences are not drawn from a Hawkes process. As a result, the augmented sequences that do obey Hawkes processes are adversarial samples of the misspecified model. Second, the easiness of a sample depends on the model imposed on it — an easy sample of the target MixHP model can be an adversarial one of the misspecified model. Accordingly, our method iteratively selects the adversarial sequences of the current misspecified model to construct the easy sequence set for the target model. As the iterations proceed, the potential easy sequences become dominant in the training set, and the misspecified model is revised and approaches the target one.
2 Adversarial Self-Paced Learning
2.1 Data augmentation and model misspecification
For a single Hawkes process, the overfitting problem caused by insufficient data can be mitigated based on data augmentation techniques. In particular, the Hawkes process has an interesting superposition property:
Proposition 2.1 (Xu et al., 2018b).
Given $K$ independent Hawkes processes with shared impact functions, i.e., $N_k \sim \mathrm{HP}(\mu_k, \{\phi_{cc'}\})$ for $k = 1, \dots, K$, their superposition $N = \sum_{k=1}^{K} N_k$ is a single Hawkes process, i.e., $N \sim \mathrm{HP}(\mu, \{\phi_{cc'}\})$ with $\mu = \sum_{k=1}^{K} \mu_k$.
Additionally, for a stationary Hawkes process, the impact function satisfies $\int_0^{\infty} \phi_{cc'}(t)\,dt < \infty$, implying that the infectivity of a historical event on the current one decays rapidly with respect to the time interval between them, i.e., $\phi_{cc'}(t) \rightarrow 0$ as $t \rightarrow \infty$. Therefore, given two short sequences belonging to the same Hawkes process, superposing or stitching them generates a denser or longer sequence for the target Hawkes process model. These two data augmentation strategies have been applied to learn Hawkes processes from imperfect observations (Xu et al., 2017; 2018a;b), and indeed improve learning results.
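The two augmentation strategies can be sketched in a few lines. Here a sequence is represented as a list of `(timestamp, type)` pairs, an assumed encoding; the function names are illustrative.

```python
def superpose(seq1, seq2):
    """Superposition: merge two sequences on a shared timeline.
    Valid for sequences from the same Hawkes process (Proposition 2.1)."""
    return sorted(seq1 + seq2)

def stitch(seq1, T1, seq2):
    """Stitching: append seq2 after seq1, shifting its timestamps by the
    observation horizon T1 of the first sequence. Justified when the impact
    function decays quickly, so cross-boundary infectivity is negligible."""
    return seq1 + [(t + T1, c) for t, c in seq2]
```

Superposition yields a denser sequence over the same window, while stitching yields a longer sequence over a doubled window.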
However, as shown in Figure 1(d), when sequences are generated by different Hawkes processes with different impact functions, their superposition/stitching result no longer obeys a Hawkes process. Therefore, most of the augmented sequences do not obey the target mixture model of Hawkes processes, and learning from them leads to a misspecified MixHP model. Meanwhile, the augmented sequences that do obey Hawkes processes are a minority; they will be ignored and treated as adversarial samples (Lowd & Meek, 2005; Barreno et al., 2006; Liu & Chawla, 2009; Huang et al., 2011) of the misspecified model.
2.2 The easiness of a sequence
Although directly applying traditional data augmentation techniques (e.g., superposing and stitching) is not helpful for learning mixture models of Hawkes processes, the augmented sequences have different levels of fitness with respect to the misspecified model, which provides a reasonable measurement of the easiness of the sequences and makes self-paced learning possible. Specifically, the likelihood of a sequence under a model reflects the fitness of the model to the sequence. Given a sequence $s$ with $I$ events, we define the easiness of the sequence with respect to the model $\Theta$ as
(3) $e(s; \Theta) = \frac{1}{I} \max_{k} \log p(s \mid \theta_k).$

Eq. (3) indicates that an easy sample of a mixture model needs to fit one of the clustering components with high probability (even if the probability of the component itself is low). The higher the likelihood is, the easier the sequence is under the given model. Dividing by the number of events $I$, the easiness of sequences with different lengths becomes comparable. Because (3) is not differentiable, in practice we use the "LogSumExp" operation to achieve a smooth maximum. Accordingly, (3) can be rewritten as

(4) $e(s; \Theta) = \frac{1}{I} \log \sum_{k=1}^{K} \exp\big(\log p(s \mid \theta_k)\big).$
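A minimal implementation of the smoothed, length-normalized easiness, using the standard max-shift trick for numerical stability (the function name is illustrative):

```python
import numpy as np

def easiness(component_logliks, num_events):
    """Easiness of a sequence under a K-component mixture: the LogSumExp
    (smooth maximum) of the per-component log-likelihoods, divided by the
    number of events so that sequences of different lengths are comparable.
    component_logliks : (K,) array of log p(s | theta_k)."""
    m = component_logliks.max()
    # Subtracting the max before exponentiating avoids overflow/underflow.
    return (m + np.log(np.exp(component_logliks - m).sum())) / num_events
```

The result is always at least the hard maximum of (3) and approaches it when one component dominates.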
In this case, we define the adversarial sequences of a model as those with the lowest easiness.
2.3 Proposed learning algorithm
The key idea of our learning method is that the easy sequences of the target mixture model can be the adversarial ones of the currently estimated model. When learning a mixture model with a potential risk of misspecification from augmented sequences, we want to find its adversarial sequences and add them to the "easy" sequence set of the target mixture model. These easy sequences are then used in the next training iteration to revise the misspecified model.
In the $m$-th learning iteration, given the augmented sequences $\mathcal{A}^{(m)}$ and the easy sequence set $\mathcal{E}^{(m-1)}$ generated in the previous iteration, we update the current mixture model and select new easy sequences from $\mathcal{A}^{(m)}$ simultaneously, by solving the following max-min optimization problem:
(5) $\max_{\Theta} \min_{\boldsymbol{a}} \sum_{s \in \mathcal{E}^{(m-1)}} \log p(s; \Theta) + \gamma \sum_{n=1}^{|\mathcal{A}^{(m)}|} a_n\, e(s_n; \Theta), \quad \text{s.t. } \boldsymbol{a} \in \{0, 1\}^{|\mathcal{A}^{(m)}|},\ \|\boldsymbol{a}\|_1 = N_e.$

Here, $\boldsymbol{a} = [a_n]$ is a binary vector, whose element $a_n$ indicates whether $s_n \in \mathcal{A}^{(m)}$ is an easy sequence of the proposed model. The first term represents the log-likelihood of the current model given the whole easy sequence set, while the second term is the proposed adversarial self-paced regularizer, which measures the easiness of each $s_n$ and selects the adversarial sequences of the current model as the easy sequences of the target model. The hyperparameter $\gamma$ controls the significance of the proposed regularizer, and $N_e$ controls the acceptance rate of easy sequences.

We decompose (5) into two subproblems and solve them via alternating optimization. In each learning iteration, we solve the following two subproblems:
1) Update the current model:

(6) $\Theta^{(m)} = \arg\max_{\Theta} \sum_{s \in \mathcal{E}^{(m-1)}} \log p(s; \Theta) + \gamma \sum_{n=1}^{|\mathcal{A}^{(m)}|} a_n\, e(s_n; \Theta),$

where $\boldsymbol{a}$ is the indicator vector learned in the previous iteration.

2) Select new easy sequences for the target model:

(7) $\boldsymbol{a}^{(m)} = \arg\min_{\boldsymbol{a} \in \{0, 1\}^{|\mathcal{A}^{(m)}|},\ \|\boldsymbol{a}\|_1 = N_e} \sum_{n=1}^{|\mathcal{A}^{(m)}|} a_n\, e(s_n; \Theta^{(m)}).$
Maximizing the easiness term in (6) makes the model fit the selected easy sequences and suppresses the influence of the "non-Hawkes" sequences. When selecting new easy sequences, in contrast, we keep the sequences with low easiness with respect to the current model for the following learning iterations. Algorithm 1 shows the scheme of the proposed learning method.
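One iteration of the alternating scheme can be sketched as follows. Here `update_model` (the MLE step of subproblem (6)) and `easiness` (eq. (4)) are placeholders passed in as callables, since the text does not fix a particular solver; the function and argument names are illustrative.

```python
import numpy as np

def aspl_iteration(augmented, easy_set, model, n_easy, update_model, easiness):
    """One alternating-optimization step:
    1) refit the model on the current easy set (subproblem (6)),
    2) pick the n_easy augmented sequences with the LOWEST easiness under
       the refit model as the next easy set (subproblem (7))."""
    model = update_model(model, easy_set, augmented)           # subproblem (6)
    scores = np.array([easiness(s, model) for s in augmented])
    idx = np.argsort(scores)[:n_easy]                          # subproblem (7)
    return model, [augmented[i] for i in idx]
```

Selecting by *lowest* easiness is what distinguishes the adversarial variant from ordinary self-paced learning, which would take the highest-easiness sequences instead.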
As mentioned in line 8 of Algorithm 1, we set $N_e$ according to the learning result of (6). Given the current mixture model $\Theta^{(m)}$ and the learned distribution of clustering components $\hat{\boldsymbol{\pi}}$, we first sort $\{e(s_n; \Theta^{(m)})\}$ in ascending order, and then select the top-$N_e$ augmented sequences. Because the proportion of adversarial sequences in $\mathcal{A}^{(m)}$ — i.e., augmented sequences whose two source sequences come from the same cluster — can be approximated as $\sum_{k=1}^{K} \pi_k^2$, the number of easy sequences should not be larger than $\sum_k \pi_k^2\, |\mathcal{A}^{(m)}|$. We use $\hat{\boldsymbol{\pi}}$ to estimate the real $\boldsymbol{\pi}$ and set $N_e = \lfloor \sum_k \hat{\pi}_k^2\, |\mathcal{A}^{(m)}| \rfloor$. Accordingly, $\mathcal{E}^{(m)} = \{s_{(n)} \mid n \leq N_e\}$, where $s_{(n)}$ is the $n$-th sorted sequence.
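The easy-set size can then be computed in one line. The sketch below assumes that the fraction of augmented sequences whose two source sequences come from the same cluster is $\sum_k \pi_k^2$ — the probability that two independent draws from the mixture weights land in the same component — which is our reading of the bound described above; the function name is illustrative.

```python
import numpy as np

def num_easy(pi_hat, num_augmented):
    """Estimated easy-set size N_e: a superposed/stitched pair remains a
    Hawkes process only when both source sequences come from the same
    cluster, which happens with probability sum_k pi_k^2 under mixture
    weights pi. Plugging in the learned weights pi_hat gives the bound."""
    return int(np.floor(np.sum(pi_hat ** 2) * num_augmented))
```

For uniform weights over $K$ clusters this gives $|\mathcal{A}|/K$, so the easy set shrinks as the number of clusters grows.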
Table 1: Average testing log-likelihood (with 95% confidence intervals) on three real-world datasets.

| Dataset  | #Train / #Test | Duration | #Event types | #Clusters | MMHP       | DMHP       | DMHP (Stitch) | SPL-MixHP  | ASPL-MixHP (Stitch) | ASPL-MixHP (Superpose) |
|----------|----------------|----------|--------------|-----------|------------|------------|---------------|------------|---------------------|------------------------|
| MIMIC-III| 903 / 226      | 10 yrs   | 8            | 10        | −3.46±0.71 | −2.85±0.29 | −2.90±0.20    | −2.66±0.12 | −2.24±0.10          | −2.07±0.08             |
| IPTV     | 15103 / 15103  | 24 hrs   | 16           | 10        | 0.53±0.13  | 1.38±0.11  | 1.25±0.07     | 1.37±0.09  | 1.45±0.03           | 1.44±0.02              |
| LinkedIn | 1220 / 1219    | 15 yrs   | 82           | 5         | −7.39±0.33 | −4.69±0.20 | −4.92±0.14    | −4.64±0.16 | −3.97±0.12          | −4.02±0.14             |
2.4 Complexity
Given $N$ sequences with $I$ events each, the computational complexity of learning a mixture model of $K$ Hawkes processes is $\mathcal{O}(KNI^2)$, because evaluating the likelihood of one sequence under one Hawkes process requires $\mathcal{O}(I^2)$ operations. Applying the proposed learning strategy, we need to update the model based on various augmented sequence sets in different iterations, and each augmented sequence may have up to $2I$ events. Denoting the maximum number of iterations as $M$, the computational complexity of our method is $\mathcal{O}(MKN(2I)^2) = \mathcal{O}(4MKNI^2)$. Fortunately, the proposed learning method is mainly designed for the case of short sequences, whose numbers of events are often very small. Given the improvements in learning results brought by our method, which will be shown in the following section, the increase in computational complexity appears to be tolerable.
3 Experiments
We denote our adversarial self-paced learning method for mixture models of Hawkes processes as ASPL-MixHP. To demonstrate its effectiveness, we compare our method with state-of-the-art methods on three real-world datasets. In particular, we consider four competitive alternatives to our method. 1) MMHP: The multi-task multi-dimensional Hawkes process (Luo et al., 2015), which learns one Hawkes process per sequence and applies $k$-means to cluster the learned processes of all sequences. 2) DMHP: The Dirichlet mixture model of Hawkes processes (Xu & Zha, 2017), which learns the proposed mixture model directly from observed sequences based on variational inference. 3) DMHP-Stitch: The DMHP model learned from the augmented sequences generated by random stitching. 4) SPL-MixHP: Self-paced learning of the mixture model of Hawkes processes, which applies the original self-paced learning strategy (Kumar et al., 2010) — i.e., in each iteration, we select the sequences with the highest likelihood per event for the next learning iteration — to learn the target mixture model via MLE, as shown in Figure 1(b). For our ASPL-MixHP method, both random superposition and random stitching are applied as feasible data augmentation methods. The hyperparameter $\gamma$ is set empirically in the following experiments.
After learning models on the training sequences, we evaluate the performance of the various methods on testing sequences, calculating the average log-likelihood of the testing sequences:

(8) $\bar{\mathcal{L}} = \frac{1}{|\mathcal{S}_{\text{test}}|} \sum_{s \in \mathcal{S}_{\text{test}}} \log p(s; \hat{\Theta}).$
This measurement reflects the fitness of a trained model to the testing samples. Each method is tested in 15 trials. In each trial, the sequences are randomly divided into training and testing sets. The model trained on the training set is applied to the testing set. The averaged testing loglikelihood and its 95% confidence interval are calculated.
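A normal-approximation version of this protocol — the mean over trials plus a 1.96-standard-error half-width for the 95% confidence interval — can be sketched as follows; the exact CI construction used in the experiments is not specified in the text, and the function name is illustrative.

```python
import numpy as np

def mean_ci95(values):
    """Average testing log-likelihood over trials and the half-width of a
    normal-approximation 95% confidence interval (1.96 * standard error)."""
    v = np.asarray(values, dtype=float)
    m = v.mean()
    half = 1.96 * v.std(ddof=1) / np.sqrt(len(v))  # ddof=1: sample std
    return m, half
```

With 15 trials the normal approximation is rough; a Student-t multiplier would be slightly wider but follows the same pattern.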
We apply our method to 1) cluster LinkedIn users according to their job-hopping behaviors (Xu et al., 2018a), 2) cluster patients according to their admissions (Xu et al., 2017), and 3) cluster IPTV users according to their daily viewing records (Luo et al., 2014). These three datasets suffer from data sparsity — generally, each sequence contains only a few events. For the LinkedIn dataset, in each trial the job-hopping behaviors of a subset of LinkedIn users are used to train a mixture model, and the records of the remaining users are used for testing. These records involve IT companies and universities, which are treated as the event types in the model. For the MIMIC-III dataset, the diseases in patients' admissions are categorized into broader classes. For each patient, his/her admissions over ten years form an observed event sequence, and the sequences are modeled by a mixture model of Hawkes processes; in each trial, the sequences are split into training and testing sets. For the IPTV data, we use daily viewing records of various kinds of TV programs to train a mixture model of Hawkes processes, and the records of the following days are used for testing the model.
Table 1 lists the results of the various methods on the three real-world datasets. Experimental results show that our ASPL-MixHP method works well, obtaining higher testing log-likelihood than the other methods.
4 Conclusion and Future Work
We propose an adversarial self-paced learning method for mixture models of Hawkes processes. Our method combines data augmentation techniques with a self-paced learning strategy, iteratively generating and selecting easy sequences for the target model from the adversarial sequences of a potentially misspecified model. We test our method on real-world datasets and demonstrate its potential to improve learning results in cases with short training sequences. In the future, we plan to further reduce its computational complexity and improve its scalability to imbalanced large-scale clustering problems. Beyond mixture models of Hawkes processes, we will extend the proposed method to mixture models of other temporal point processes.
Acknowledgments This research was supported in part by DARPA, DOE, NIH, ONR and NSF.
References

Barreno, M., Nelson, B., Sears, R., Joseph, A. D., and Tygar, J. D. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pp. 16–25. ACM, 2006.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.

Hawkes, A. G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I., and Tygar, J. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pp. 43–58. ACM, 2011.

Johnson, A. E., Pollard, T. J., Shen, L., Liwei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197, 2010.

Liu, W. and Chawla, S. A game theoretical model for adversarial learning. In 2009 IEEE International Conference on Data Mining Workshops, pp. 25–30. IEEE, 2009.

Lowd, D. and Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 641–647. ACM, 2005.

Luo, D., Xu, H., Zha, H., Du, J., Xie, R., Yang, X., and Zhang, W. You are what you watch and when you watch: Inferring household structures from IPTV viewing data. IEEE Transactions on Broadcasting, 60(1):61–72, 2014.

Luo, D., Xu, H., Zhen, Y., Ning, X., Zha, H., Yang, X., and Zhang, W. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3685–3691. AAAI Press, 2015.

Xu, H. and Zha, H. A Dirichlet mixture model of Hawkes processes for event sequence clustering. In Advances in Neural Information Processing Systems, pp. 1354–1363, 2017.

Xu, H., Luo, D., and Zha, H. Learning Hawkes processes from short doubly-censored event sequences. In International Conference on Machine Learning, pp. 3831–3840, 2017.

Xu, H., Carin, L., and Zha, H. Learning registered point processes from idiosyncratic observations. In International Conference on Machine Learning, 2018a.

Xu, H., Luo, D., Chen, X., and Carin, L. Benefits from superposed Hawkes processes. In International Conference on Artificial Intelligence and Statistics, pp. 623–631, 2018b.

Yang, S.-H. and Zha, H. Mixture of mutually exciting processes for viral diffusion. In International Conference on Machine Learning, pp. 1–9, 2013.