I. Introduction
Online information is generated at an unprecedented rate. Every minute, 500,000 comments are posted on Facebook, 400 hours of video are uploaded on YouTube, and 500,000 tweets are published on Twitter. A possible approach to make sense of this mass of information is to cluster publication events together. Grouping similar publications helps identify topics of interest or generate summaries of daily news. Many clustering algorithms are based on text similarity, that is, on how similar the words of two published documents are [Blei2003LDA, Rathore2018, Bahdanau2015NeuralMT]. Another relevant variable for grouping information is the time of publication [Du2012KernelCascade, Blei2006DynamicTopicModel]. For example, two news articles about forest fires might be unrelated if the second article was published years after the first one, despite a close lexical similarity. Conversely, imagine a news website that publishes a series about history every day at midday. Temporal dynamics would help understand that a publication the next day at midday is likely to be related to previous publications, even if the story (and thus the vocabulary) is different.
Many models that aim at understanding the temporal dynamics of clusters work by selecting a subset of observations according to a temporal sampling function [Amr2008RCRP, Blei2006DynamicTopicModel, Yin2018ShortTextDHP]. However, sampling observations in time implies defining a sampling function that might not correctly model the temporal dynamics at stake. Besides, these works are based on a Dirichlet prior (DP) for clustering. The DP takes counts as a parameter, where a document always counts for 1. It has been argued that such modeling is not fit to account for the arrival of documents in continuous-time settings. In [Du2015DHP], the authors combine techniques of standard textual clustering with point processes. The idea is to infer the parameters of the time-sampling function jointly with the rest of the model. Explicitly, they derive the Dirichlet-Hawkes process (DHP) prior for documents' cluster allocation, which takes time as a parameter and yields non-integer counts. It has been argued that this method cannot handle limiting cases where text is less informative (e.g., short texts, overlapping vocabularies) [Yin2018ShortTextDHP].
Our present work develops the Powered Dirichlet-Hawkes process (PDHP) as a means to handle this case. Besides, we highlight other limiting cases for which DHP fails whereas PDHP yields good results, for instance when temporal information conveys little information (overlapping Hawkes intensities, few observations). We also show there are cases where documents within a textual cluster do not follow the same temporal dynamics, which the DHP is not designed to handle. For instance, an article published by a popular newspaper is unlikely to have the same influence on subsequent similar articles (temporal dynamics) as the same article published by a less popular newspaper. We overcome all these limitations with the Powered Dirichlet-Hawkes process, which yields better results than DHP on every dataset considered (up to +0.3 NMI). It also allows us to distinguish textual clusters from temporal clusters (documents that follow the same dynamics independently of their content).
Our contributions are listed below:

We highlight and explain the limitations of the DHP prior: it handles neither weakly informative temporal information nor weakly informative textual information, and it is not designed to account for different dynamics between text and time.

We derive the Powered Dirichlet-Hawkes process (PDHP) as a new prior in Bayesian nonparametrics for the temporal clustering of a stream of textual documents; it generalizes both the Dirichlet-Hawkes process (DHP) and the Uniform process (UP).

We show that the PDHP prior performs better than the DHP and UP priors through a thorough evaluation and comparison on several synthetic datasets and real-world datasets from Reddit.

We show that the PDHP prior allows selecting the information clusters are based on: we can favor cluster generation according to documents' textual content, their temporal dynamics, or a mix of the two.
II. Background
II-A. How publication times carry valuable information
Before reviewing existing methods incorporating a temporal dimension into text clustering, we detail how this information is relevant to the task. Recent works on the online spread of textual documents have highlighted several key properties regarding the link between textual content and date of publication.
Firstly, it has been shown that textual documents do not get published independently of one another. Often, the arrival of a document is conditional on the publication of earlier documents. A straightforward illustration is that a new research paper builds on previous publications and is likely to treat a similar topic; the present article exists because of all the references it cites. A 2012 research paper highlights the critical role played by interactions in the republication of a tweet on Twitter [Myers2012CoC]. The authors claim the probability of retweets varies by 71% on average when considering temporal interactions. More recent works find that although the interaction between publications plays a significant role in later publications, the interaction matrix is often sparse [Poux2021interactions]: an article on textual clustering is more likely to appear conditionally on publications about NLP, whose vocabulary is only a small subset of the scientific literature's. This highlights the need to cluster words together to retrieve the temporal interactions relevant to a textual clustering problem. In this context, a cluster should carry information about the interaction between the documents it contains.

Secondly, a problem that arises is the temporal aspect of interaction. It has been shown that the influence of online information decays quickly with time [Cao2019AdsDataset]. Although the rate at which this influence decays depends on the dataset, it seems to fade rapidly for most online spreading processes [Haralabopoulos2014LifespanInfo]. To keep the temporal information relevant, clusters must depend on time. For example, two series of news articles about vaccines might not be related (one might not trigger the other) if one was published in 2010 and the other in 2021; they form two different clusters since each obeys its own dynamics, although their vocabularies are similar.
II-B. Temporal clustering of textual documents
The use of the temporal dimension in document clustering has been studied on many occasions; a notable spike of interest happened in 2006. Many authors tackled the problem of inferring time-dependent clusters with models based on LDA [Blei2006DynamicTopicModel, Wang2006TopicsOverTime, Iwata2009]. However, most of these models are parametric, meaning the number of clusters is fixed at the beginning of the algorithm. Depending on the considered time range and the dataset, the number of clusters needs to be fine-tuned over several independent runs, making these models hardly usable for many real-world applications. In all three references cited, the authors mention that a nonparametric version of the model might be derivable.
In 2008, A. Ahmed et al. proposed the Recurrent Chinese Restaurant Process (RCRP) as an answer to this problem [Amr2008RCRP]. Instead of considering a fixed-size dataset, this model can handle a stream of documents arriving in chronological order, and the number of clusters is automatically updated. In this model, time is split into episodes to capture the temporal aspect of cluster formation; it considers an integer count of publications within a given time window. A later version of the model from 2010, the Distance-Dependent Chinese Restaurant Process (DDCRP), tries to alleviate this approximation by replacing fixed-time episodes with a continuous-time sampling function [Blei2010DDCRP]. However, the model still considers integer counts, with only their distribution over time changing. Thus, the model is not designed to exploit all the temporal information available in a continuous-time setting.
In 2015, N. Du et al. addressed this problem by combining the Dirichlet process with the Hawkes process, which models the appearance of events in a continuous-time setting. The key idea is to replace the counts of a Dirichlet process with the intensity function of the Hawkes process. The resulting Dirichlet-Hawkes process (DHP) is then used as a prior for clustering documents appearing in a continuous-time stream. The inference is realized with a Sequential Monte-Carlo (SMC) algorithm. Following DHP, two articles have extended the idea: the Hierarchical Dirichlet-Hawkes process (HDHP) [Valera2017HDHP] in 2016 and the Indian Buffet Hawkes process in 2018 [Tan2018IBHP]. Another work proposed an EM algorithm for the inference in 2017 [Xu2017EMDHP]; it uses a heuristic method to update the number of clusters and cannot handle a stream of documents.
A common feature of all the models we mentioned is that they use a nonparametric Dirichlet process (DP) prior or variations built on it, such as DHP and HDHP. Yet, on several occasions, it has been pointed out that there are no specific reasons to use this process in particular and that alternative forms might work better depending on the dataset. In [Welling2006AlterDP], the author relaxes several conditions associated with DP and shows that alternative priors are an equally valid choice in Bayesian modeling. In [Wallach2010UnifP], the authors derive the Uniform process (UP) and show that it performs better on a document clustering task. In [Poux2021PDP], the authors generalize UP and DP within a more general framework, the Powered Dirichlet process (PDP), and show it performs better than DP on several datasets.
Moreover, it has recently been highlighted that DHP does not work well when the textual content of documents conveys little information, that is, when the text is short [Yin2018ShortTextDHP] or when vocabularies overlap significantly. To answer this problem, the authors develop an approach based on Dirichlet process mixtures, which is not designed for continuous-time document streams: the temporal aspect comes from a sampling function, as in [Amr2008RCRP, Blei2010DDCRP]. There are other limiting cases for DHP, for instance when temporal information conveys little information (few observations, overlapping temporal intensities) or when documents within textual clusters do not follow the same temporal dynamics. To overcome those limitations, we develop the Powered Dirichlet-Hawkes process in the next section.
III. Model and algorithm
III-A. Dirichlet prior and alternatives
We briefly recall the definition of a Dirichlet prior. A Dirichlet prior for clustering implements the assumption that the more populated a cluster is, the more likely a new observation is to belong to it (the "rich-get-richer" property). Besides, there is always a chance that a new observation gets assigned to a newly created cluster. It is often expressed using a metaphor, the Chinese Restaurant process (CRP), which goes as follows: when a client arrives in a Chinese restaurant, they sit at one of the already occupied tables with a probability proportional to the number of persons already seated at this table. They can also sit alone at a new table with a probability inversely proportional to the total number of clients in the restaurant. Once their choice is made, the next client arrives, and the process is repeated. Let $C_n$ be the cluster (table) chosen by the $n^{th}$ customer, $C_{1:n-1}$ the table assignments of the previous customers, $N_c$ the population of table $c$, $K$ the number of already occupied tables and $\alpha_0$ the concentration parameter. The process can be written formally as:
$$P(C_n = c \mid C_{1:n-1}) = \begin{cases} \dfrac{N_c}{n-1+\alpha_0} & \text{if } c \leq K \\[4pt] \dfrac{\alpha_0}{n-1+\alpha_0} & \text{if } c = K+1 \end{cases} \qquad (1)$$
The Uniform process [Wallach2010UnifP] has been proposed as an alternative to the DP prior. In this context, a new customer entering the restaurant has an identical chance to sit at any of the occupied tables, and a chance to sit at an empty table inversely proportional to the number of occupied tables. Formally:
$$P(C_n = c \mid C_{1:n-1}) = \begin{cases} \dfrac{1}{K+\alpha_0} & \text{if } c \leq K \\[4pt] \dfrac{\alpha_0}{K+\alpha_0} & \text{if } c = K+1 \end{cases} \qquad (2)$$
Finally, the Powered Dirichlet process [Poux2021PDP] generalizes the two above, stating that the probability for a new client to sit at a given table depends on an arbitrary power of the number of customers already seated at this table:
$$P(C_n = c \mid C_{1:n-1}) = \begin{cases} \dfrac{N_c^r}{\sum_{c'=1}^{K} N_{c'}^r + \alpha_0} & \text{if } c \leq K \\[4pt] \dfrac{\alpha_0}{\sum_{c'=1}^{K} N_{c'}^r + \alpha_0} & \text{if } c = K+1 \end{cases} \qquad (3)$$
where $r$ is a hyperparameter. Varying $r$ allows giving more or less importance to the "rich-get-richer" hypothesis of the DP. Note that $\mathrm{PDP}_{r=1} = \mathrm{DP}$ and that $\mathrm{PDP}_{r=0} = \mathrm{UP}$. We will use this more general form in the rest of this work and vary $r$ to compare those priors in the experimental section.
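The three priors differ only in the weight they give to existing tables. The following minimal sketch (ours, not the authors' implementation; the function name and normalized return value are our choices) makes this concrete: setting $r=1$ or $r=0$ in the PDP recovers the DP and the UP, respectively.

```python
import numpy as np

def pdp_probabilities(counts, alpha=1.0, r=1.0):
    """Powered Dirichlet process prior over K occupied tables plus one
    new table (Eq. 3). r=1 recovers the DP/CRP (Eq. 1), r=0 the UP (Eq. 2)."""
    counts = np.asarray(counts, dtype=float)
    weights = np.concatenate([counts ** r, [alpha]])  # last entry: new table
    return weights / weights.sum()

# Table populations after 9 customers: [5, 3, 1]
print(pdp_probabilities([5, 3, 1], r=1.0))  # DP: rich-get-richer
print(pdp_probabilities([5, 3, 1], r=0.0))  # UP: uniform over occupied tables
print(pdp_probabilities([5, 3, 1], r=2.0))  # rich-get-more-richer
```

With $r=1$, the normalization reduces to $n-1+\alpha_0$ of Eq. 1, since the counts sum to the number of previous customers.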
III-B. Hawkes processes
A Hawkes process is a self-exciting temporal point process. It is used to determine the probability of an event happening given the realization of all previous events in a continuous-time setting. Point processes are fully characterized by their intensity function $\lambda(t)$, which is related to the probability of an event happening between $t$ and $t+dt$ by $P(\text{event in } [t, t+dt]) = \lambda(t)dt$. In the case of Hawkes processes, $\lambda(t)$ is defined conditionally on all the events that happened at times earlier than $t$. In our setup, we define one Hawkes process for each cluster, independent from the others. The intensity of the Hawkes process associated with cluster $c$ is defined as:
$$\lambda_c(t) = \sum_{t_i^{(c)} \in \mathcal{H}_t^{(c)}} \vec{\alpha}_c \cdot \vec{\kappa}\left(t - t_i^{(c)}\right) \qquad (4)$$
where $t_i^{(c)}$ is the time of the $i^{th}$ event observed in cluster $c$, $\mathcal{H}_t^{(c)}$ is the history of events in cluster $c$ up to $t$, $\vec{\alpha}_c$ is a vector of coefficients, $\vec{\kappa}$ is a vector of kernel functions with the same dimension as $\vec{\alpha}_c$, and $\cdot$ represents the dot product. The kernel functions are set in stone; we will later infer the weight vector $\vec{\alpha}_c$ to determine which entries of the kernel vector are the most relevant in a given situation. This technique has become standard in Hawkes process modeling and has been used on several occasions [Du2012KernelCascade, Yu2017]. Finally, we consider an additional time-independent Hawkes process (that is, a Poisson process) of intensity $\lambda_0$. This process is used as the Dirichlet-Hawkes equivalent of the concentration parameter of a Dirichlet process (see Eq. 1): it translates the probability of opening a new cluster into the realization of a Poisson process. In the same way that in the DP no observation is assigned to the concentration parameter but instead to a new cluster, no observation will be associated with the Poisson process but instead to a new Hawkes process.

Finally, the likelihood of a combination of independent Hawkes processes can be written:
$$\mathcal{L} = e^{-\lambda_0 T} \prod_c \left( \prod_{t_i^{(c)} \in \mathcal{H}_T^{(c)}} \lambda_c\left(t_i^{(c)}\right) \right) e^{-\int_0^T \lambda_c(s)\,ds} \qquad (5)$$
where $T$ is the upper bound of the considered observation window, which goes from $0$ to $T$. Note that the Poisson process only contributes the constant factor $e^{-\lambda_0 T}$ because no event will be assigned to it.
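As a concrete illustration, here is a small self-contained sketch (our own, using the Gaussian kernels later employed in the experiments; the integral in Eq. 5 is approximated numerically by a midpoint rule) of the intensity of Eq. 4 and of one cluster's factor in the likelihood of Eq. 5:

```python
import numpy as np

def intensity(t, event_times, weights, means, sigma=0.5):
    """Hawkes intensity lambda_c(t) (Eq. 4): dot product between the
    kernel weights alpha_c and Gaussian kernels evaluated on every
    past event of the cluster."""
    lam = 0.0
    for t_i in event_times:
        if t_i < t:  # only events strictly before t contribute
            dt = t - t_i
            kernels = np.exp(-(dt - np.asarray(means)) ** 2 / (2 * sigma ** 2))
            lam += float(np.dot(weights, kernels))
    return lam

def cluster_log_likelihood(event_times, weights, means, T, sigma=0.5, n_grid=2000):
    """Log of one cluster's factor in Eq. 5: sum of log-intensities at the
    events minus the integrated intensity over [0, T] (midpoint rule).
    The first event is skipped, as its self-excited intensity is zero
    (in the full model it is accounted for by the lambda_0 process)."""
    ll = sum(np.log(intensity(t_i, event_times, weights, means, sigma))
             for t_i in event_times[1:])
    mid = (np.arange(n_grid) + 0.5) * T / n_grid
    ll -= sum(intensity(s, event_times, weights, means, sigma) for s in mid) * T / n_grid
    return ll
```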
III-C. Powered Dirichlet-Hawkes process
Following the reasoning in [Du2015DHP], we substitute the counts of the PDP with the inferred Hawkes intensities, resulting in the following form for the Powered Dirichlet-Hawkes prior:
$$P(C_n = c \mid t_n, C_{1:n-1}) \propto \begin{cases} \lambda_c(t_n)^r & \text{if } c \leq K \\ \lambda_0 & \text{if } c = K+1 \end{cases} \qquad (6)$$
where $t_n$ is the arrival time of document $n$. We reformulated the Dirichlet-Hawkes process so as to allow a nonlinear dependence (through $r$) on the non-integer counts ($\lambda_c(t_n)$).
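In code, the PDHP prior of Eq. 6 has the same structure as the PDP of Eq. 3, with counts replaced by intensities. The sketch below is ours; the intensity values and $\lambda_0$ are arbitrary illustrative numbers:

```python
import numpy as np

def pdhp_prior(intensities, lambda0, r=1.0):
    """Powered Dirichlet-Hawkes prior (Eq. 6): the PDP counts are
    replaced by each cluster's Hawkes intensity at the document's
    arrival time; lambda0 is the rate of opening a new cluster."""
    lam = np.asarray(intensities, dtype=float)
    weights = np.concatenate([lam ** r, [lambda0]])
    return weights / weights.sum()

# Two active clusters with intensities 2.0 and 0.5 at time t_n:
print(pdhp_prior([2.0, 0.5], lambda0=0.1, r=1.0))  # DHP (r=1)
print(pdhp_prior([2.0, 0.5], lambda0=0.1, r=0.0))  # Dirichlet-Uniform (r=0)
print(pdhp_prior([2.0, 0.5], lambda0=0.1, r=3.0))  # temporal differences exacerbated
```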
III-D. Textual modeling
We choose to model the textual content of documents as the result of a Dirichlet-Multinomial distribution. This model is purposely simple to ease understanding, but it can easily be replaced by a more complex one; a more complete textual model is out of the scope of this work, which aims to highlight the efficiency of the PDHP. Here, a document is associated with a given cluster according to the word counts in every cluster and the word counts in the document. The generative process is as follows:
$$\theta_c \sim \mathrm{Dirichlet}(\theta_0) \qquad w_n^{(v)} \sim \mathrm{Multinomial}(\theta_{C_n}) \qquad (7)$$
where $C_n$ is the cluster of document $n$, and $w_n^{(v)}$ is the $v^{th}$ word of document $n$. Let $C_{1:n-1}$ be the marginal joint distribution of every document's cluster allocation up to the $(n-1)^{th}$ one. The likelihood of the $n^{th}$ document $d_n$ belonging to cluster $c$ can then be expressed as:

$$P(d_n \mid C_n = c, C_{1:n-1}, D_{1:n-1}) = \frac{\Gamma(N_c + V\theta_0)}{\Gamma(N_c + N_n + V\theta_0)} \prod_{v=1}^{V} \frac{\Gamma\left(n_c^{(v)} + n_n^{(v)} + \theta_0\right)}{\Gamma\left(n_c^{(v)} + \theta_0\right)} \qquad (8)$$
where $N_c$ is the total number of words in cluster $c$ from observations previous to $n$, $N_n$ is the total number of words in document $n$, $n_c^{(v)}$ the count of word $v$ in cluster $c$, $n_n^{(v)}$ the count of word $v$ in document $n$, and $V$ the size of the vocabulary.
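Since Eq. 8 is a ratio of Gamma functions, it is numerically safer to evaluate in log space. The following sketch is ours (the symmetric prior value `theta0=0.01` is an arbitrary placeholder), using only the standard library:

```python
from math import lgamma

def log_textual_likelihood(cluster_counts, doc_counts, theta0=0.01):
    """Log Dirichlet-Multinomial marginal likelihood (Eq. 8) of a
    document's word counts given the word counts already present in a
    cluster. Both arguments are length-V lists of per-word counts."""
    V = len(cluster_counts)
    Nc, Nd = sum(cluster_counts), sum(doc_counts)
    ll = lgamma(Nc + V * theta0) - lgamma(Nc + Nd + V * theta0)
    for ncv, ndv in zip(cluster_counts, doc_counts):
        ll += lgamma(ncv + ndv + theta0) - lgamma(ncv + theta0)
    return ll

# A document reusing a cluster's dominant words scores higher than the
# same document scored against an empty cluster (n_c^(v) = 0 for all v):
print(log_textual_likelihood([10, 5, 0], [2, 1, 0]))
print(log_textual_likelihood([0, 0, 0], [2, 1, 0]))
```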
III-E. Posterior distribution
The resulting posterior distribution of the $n^{th}$ document over clusters is calculated using Bayes' theorem. It is proportional to the product of the textual likelihood (Eq. 8) and the temporal Powered Dirichlet-Hawkes prior (Eq. 6):

$$P(C_n = c \mid d_n, t_n, C_{1:n-1}, D_{1:n-1}) \propto P(d_n \mid C_n = c, C_{1:n-1}, D_{1:n-1}) \; P(C_n = c \mid t_n, C_{1:n-1}) \qquad (9)$$
We recall that $\lambda_c$ is defined in Eq. 4. The textual likelihood of a new cluster $c = K+1$ is computed by setting $n_c^{(v)} = 0$ for every word $v$.
III-F. Algorithm and changes induced by PDHP
We use an algorithm similar to the one in [Du2015DHP]. Briefly, it is a sequential Monte-Carlo (SMC) algorithm that takes one document at a time, in order of arrival. The algorithm starts with a fixed number of particles of equal weight $\omega_i$, each of which keeps track of a hypothesis on documents' cluster allocations. After a few iterations, particles that contain unlikely allocation hypotheses are discarded and replaced by more likely ones. The likeliness of a hypothesis is encoded in the weight $\omega_i$ of each particle.
For each particle, when a new document arrives, (1) the cluster of the document is sampled from a categorical distribution over all clusters, whose weights are determined by Eq. 9. After the cluster of the new document has been sampled, (2) the kernel weights $\vec{\alpha}_c$ from Eq. 4 are updated using Eq. 5; for efficiency purposes, we infer $\vec{\alpha}_c$ by Gibbs sampling from a set of precomputed vectors. We finally (3) update the weight of each particle according to the posterior of Eq. 9. If the weight of a particle falls below a threshold value $\omega_{th}$, the particle is discarded and replaced by another existing one with sufficient weight. The whole process is illustrated in Fig. 1. By incrementally updating the likelihood associated with each of the precomputed sample vectors, the algorithm treats each new observation in constant time.
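The three steps can be sketched as follows. This is a toy version of our own: `posterior_fn` stands in for the full posterior of Eq. 9, the kernel-weight update of step (2) is omitted, and degenerate particles are simply replaced by copies of the best one:

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_step(particles, posterior_fn, omega_th=1e-3):
    """One SMC step on one incoming document. Each particle is a dict
    holding a cluster-allocation hypothesis and a weight omega."""
    for p in particles:
        probs = np.asarray(posterior_fn(p), dtype=float)
        probs = probs / probs.sum()
        c = int(rng.choice(len(probs), p=probs))  # (1) sample the cluster
        p["clusters"].append(c)
        p["weight"] *= probs[c]                   # (3) update the weight
    total = sum(p["weight"] for p in particles)
    for p in particles:
        p["weight"] /= total
    for i, p in enumerate(particles):             # resample degenerate particles
        if p["weight"] < omega_th:
            donor = max(particles, key=lambda q: q["weight"])
            particles[i] = {"clusters": list(donor["clusters"]),
                            "weight": donor["weight"]}
    return particles

# Two particles, one incoming document with a fixed toy posterior:
particles = [{"clusters": [], "weight": 0.5} for _ in range(2)]
particles = smc_step(particles, lambda p: [0.7, 0.3])
```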
The task of updating the kernel coefficients (2) is the same as in any Hawkes process, and the task of updating particle weights and resampling them (3) is common to any SMC algorithm. The change induced by the PDHP compared to the DHP happens at step (1). First of all, we note that for $r=1$ the PDHP prior is identical to the DHP prior. From [Poux2021PDP], lowering the value of $r$ reduces the "rich-get-richer" aspect of the PDP ("rich-get-less-richer"), whereas increasing it leads to a "rich-get-more-richer" effect. These metaphors translate as follows in our temporal context: for lower values of $r$, the relative difference between clusters' temporal intensities plays a less important role in cluster selection, whereas higher values of $r$ tend to exacerbate these differences and make the temporal aspect decisive in the choice of a cluster. In other words, tuning the value of $r$ gives more or less importance to the temporal aspect of the clustering. This is illustrated in Fig. 2, where we plot the probability for various clusters to be chosen (which is directly proportional to the posterior distribution, Eq. 9) as a function of $r$ when their textual likelihood and Hawkes process intensity are known. Note that for $r=0$, the probability for any cluster to be chosen is directly proportional to its textual likelihood (Dirichlet-Uniform process), whereas as $r$ increases, the probability of being chosen gets closer to a selection based only on the temporal aspect.
This is the main interest of the PDHP model. Tuning the parameter $r$ allows one to choose whether inferred clusters are based on textual or temporal considerations. It generalizes several state-of-the-art works, which are special cases of the PDHP for particular values of $r$: the DHP [Du2015DHP] is equivalent to PDHP for $r=1$, and the UP [Wallach2010UnifP] is equivalent to PDHP for $r=0$. In the following sections, we show how fine-tuning $r$ systematically yields significantly better results than setting it to 0 or 1 (up to a gain of 0.3 on our experiments' normalized mutual information metric). We also show how varying it allows recovering one kind of clustering or the other (textual or temporal) with high accuracy, and we see how it affects clustering results on several real-world datasets.
IV. Experiments
IV-A. Synthetic data generation
We simulate a case where only two clusters are considered. Each cluster has its own vocabulary distribution over 1,000 words and its own kernel weights $\vec{\alpha}_c$, with Gaussian Hawkes kernel functions of parameters $(\mu, \sigma) = (3, 0.5)$, $(7, 0.5)$ and $(11, 0.5)$ (see Eq. 4). Finally, we set the value of $\lambda_0$. We first simulate one independent Hawkes process per cluster using the Tick Python library [Bacry2017Tick]. The processes are stopped at time $T$, which makes a rough average of 7,000 events per run. Then we associate each simulated observation with a sample of 20 words drawn from the corresponding cluster's word distribution. Inference has been performed on a laptop with an 8-core processor (i7-7700HQ) and 8 GB of RAM, which underlines how scalable the algorithm is. As stated before, the algorithm treats each new document in constant time, which ranged from 0.05s on synthetic data to at most 1s on real-world data. Note that this number is directly proportional to the number of active inferred clusters and thus depends strongly on the dataset.
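The paper uses the Tick library for simulation; for readers without it, an equivalent simulation can be written with Ogata's thinning algorithm. The sketch below is ours (the baseline rate, kernel weights and time horizon are illustrative values, not the paper's); it exploits the fact that each Gaussian kernel is bounded by its weight:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_intensity(t, events, weights, means, sigma=0.5):
    """Self-exciting part of the intensity, built from Gaussian kernels."""
    lam = 0.0
    for t_i in events:
        lam += sum(w * np.exp(-(t - t_i - m) ** 2 / (2 * sigma ** 2))
                   for w, m in zip(weights, means))
    return lam

def simulate_hawkes(weights, means, T, lam0=0.3, sigma=0.5):
    """Ogata's thinning: propose candidate times from an upper bound on
    the intensity and accept each with probability lambda(t)/bound.
    Each past event contributes at most sum(weights), so
    lam0 + len(events) * sum(weights) is a valid bound."""
    events, t, wsum = [], 0.0, sum(weights)
    while True:
        bound = lam0 + len(events) * wsum
        t += rng.exponential(1.0 / bound)
        if t >= T:
            return events
        lam = lam0 + gaussian_intensity(t, events, weights, means, sigma)
        if rng.random() < lam / bound:
            events.append(t)

events = simulate_hawkes(weights=[0.3, 0.2, 0.1], means=[3, 7, 11], T=30.0)
```

With these toy weights the branching ratio is $\sum_j w_j \,\sigma\sqrt{2\pi} \approx 0.75 < 1$, so the simulated process is stable.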
We generate ten such datasets for every considered value of vocabulary overlap and Hawkes intensity overlap, which leaves us with 200 datasets. Overlap is defined as the common area of two distributions, normalized by the total area under the distributions. For example, if the vocabulary of one cluster ranges from words "1" to "100" with uniform distribution, and the vocabulary of another cluster from words "50" to "150" with uniform distribution, the overlap equals 50%. We define the overlap of Hawkes process intensities in the same way. If the triggering Hawkes kernel of one cluster is a Gaussian function with one associated observation at a given time, and the triggering kernel of the other is a shifted Gaussian function with an associated observation at the same time, the overlap equals 32% (see Fig. 3). When computing the Hawkes intensity overlap, every observation within a cluster and its associated timestamp are considered. The definition of overlaps is illustrated in Fig. 3. To enforce a given vocabulary overlap (Fig. 3-right), we shift the word distributions of the clusters from which events' vocabulary is sampled. To enforce a given Hawkes intensity overlap (Fig. 3-left), we shift the event times of every event in one of the clusters until we reach the desired overlap.

Note that we consider ten different datasets instead of ten runs per dataset for two reasons. Firstly, the generation of Hawkes processes is highly stochastic, so a model might perform significantly better on a single dataset only by chance. Secondly, given the way the SMC algorithm works, the standard deviation between runs is small: at each iteration, as many clustering hypotheses are tested as there are particles, which is equivalent to running a single clustering algorithm that many times. We heuristically set the number of particles, as we observe no significant improvement using more of them. The other parameters used for clustering synthetic data are held fixed across runs, with Gaussian kernel functions.¹

¹ All codes and implementations are available at https://github.com/GaelPouxMedard/PDHP
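The overlap measure above can be computed as a histogram intersection once both curves are sampled on a common grid. A minimal sketch of ours, reproducing a uniform-vocabulary example shifted by half its support:

```python
import numpy as np

def overlap(p, q):
    """Common area of two curves sampled on the same grid, normalized by
    the mean of their total areas (identical curves overlap at 100%).
    For normalized densities this is the histogram intersection."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 2.0 * np.minimum(p, q).sum() / (p.sum() + q.sum())

# Uniform vocabulary over 100 words, shifted by half its support:
p = np.zeros(150); p[:100] = 1 / 100    # first cluster's word distribution
q = np.zeros(150); q[50:150] = 1 / 100  # second cluster, shifted by 50 words
print(round(overlap(p, q), 6))  # 0.5
```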
We vary both vocabulary and intensity overlaps to exhibit the limits of DHP and how PDHP overcomes them. Note that in the synthetic data experiments of [Du2015DHP] (Figs. 3a and 3b), the intensity overlap is almost null, which makes the task easier for the Hawkes part of the algorithm. The primary metric we use throughout the experimental section is the normalized mutual information (NMI). During the experiments, we also considered the adjusted Rand index and the V-measure, which are well suited to evaluating clustering results when the number of inferred clusters differs from the true number of clusters. The trends observed with these other metrics are identical to the ones observed for NMI; therefore, we report only the NMI for clarity. These additional measurements are provided in the linked repository along with the code and datasets.
IV-B. PDHP yields better results as vocabulary overlap increases
We report our results when the intensity overlap is null, for varying $r$ and vocabulary overlap, in Fig. 5a. Because we consider ten different datasets for each set of overlap parameters, it makes little sense to report the absolute average NMI, since it can vary greatly from one dataset to the other. Instead, we plot the relative NMI difference between PDHP and DHP, which we expect to be less dependent on the datasets we consider. However, to give an idea of the typical performance for some parameters, we also provide raw results for one run in Fig. 4.
There is a clear correlation between performance, vocabulary overlap and $r$, with a substantial NMI gain over DHP. As stated at the end of the "Model" section, this result was expected: the more the vocabulary overlap grows, the less the textual content carries valuable information for clustering the documents. This observation supports the concerns raised in [Yin2018ShortTextDHP] about the efficiency of DHP for clustering short text documents. However, the Hawkes intensity overlap being null, the arrival time of events carries highly valuable information when the textual content does not allow distinguishing clusters well. Therefore, PDHP provides a way to tackle the problem raised in [Yin2018ShortTextDHP] without the need to sample observations.
Conversely, when the vocabulary overlap is null, the textual content provides enough information to distinguish clusters correctly. The temporal dimension only allows refining the results, with no significant improvement for any value of $r$.
Finally, we can see that the Dirichlet-Uniform process (DUP, $r=0$) consistently yields worse performance under these settings. Once again, this is expected since, in this synthetic experiment, the Hawkes intensities carry valuable information about events' clustering; DUP only considers textual information and therefore misses valuable clues.
IV-C. PDHP yields similar results for null vocabulary overlap
We report similar results in Fig. 5b. Here, we consider a null vocabulary overlap for various values of $r$ and of Hawkes intensity overlap. The situation is now the opposite: the textual content always carries valuable information about clusters, whereas the temporal aspect does not. We observe the same trend as in Fig. 5a (note that the color scale is the same). Varying the value of $r$ does not significantly change the clustering performance, meaning the textual content always carries enough information. This plot shows that PDHP can handle greater intensity overlaps without collapsing into unrealistic clustering. Since in most real-case applications many clusters with various dynamics coexist simultaneously, it is comforting that PDHP can also handle this case.
IV-D. PDHP yields better results in more realistic situations
We finally report the results for intermediate values of intensity and vocabulary overlaps in Fig. 5c. In real-world applications, it seldom happens that topics' vocabularies do not overlap at all. For example, a quick analysis of The Gutenberg Webster's Unabridged Dictionary by Project Gutenberg shows that 22% of English words are associated with more than one definition. A more detailed analysis would need to consider the usage frequency of words to get correct statistics; still, this number provides an estimate of the effective vocabulary overlap in real-world situations.
In Fig. 5c, we present the results for a fixed vocabulary overlap of 0.5, for various values of $r$ and intensity overlap. Once again, we see that, on average, using PDHP can increase the NMI over DHP by up to +20% of the maximum possible value.
IV-E. PDHP finds textual or temporal clusters depending on $r$
We now slightly modify our experimental setup. Instead of considering that textual clusters and Hawkes intensities are perfectly correlated, we consider a decorrelated case: a document whose vocabulary is drawn from one cluster can now follow the temporal dynamics of another cluster. If we imagine a dataset of news articles published online, it is clear why this might happen frequently. If a popular newspaper such as the New York Times or Reuters publishes an article on some topic at time $t$, it is likely to trigger snowball publications of similar articles from less popular journals. "Popularity" is chosen as an indicator in this example, but it may be any other external parameter (centrality in news networks, support of publications, etc.). In this case, the article's textual content allows uncovering a "story of publication", that is, how the article has been spread, when publication spikes occur, etc. The temporal information, however, helps understand the dynamics of publication interactions: which reduced set of articles triggered the publication of subsequent ones.
In [Du2015DHP], it is assumed that every document within a cluster follows the same dynamics. We relax this hypothesis in our datasets as follows. For null textual and temporal overlaps, after a dataset has been generated, we resample the textual clusters of a fraction of randomly selected events, as well as the words associated with these events. Doing so, we decorrelate temporal and textual clusters. Therefore, an event is now described by two cluster indicators: a temporal cluster (which Hawkes intensity made the event appear where it is) and a textual cluster (which vocabulary has been used to sample the words the event contains).
For completeness, we also show the results for various decorrelations for one run in Fig. 6. To better understand the tendency of the NMIs with respect to $r$, we plot the average difference between the NMI of the textual clustering and the NMI of the temporal clustering over all the datasets. The results are reported in Fig. 7.
As anticipated at the end of the "Model" section, varying $r$ allows retrieving one clustering or the other. Note that the value of $r$ at which the transition from text-based to time-based clustering happens depends directly on the dataset considered: number of words sampled, vocabulary size, overlaps, etc.
IV-F. PDHP efficiently infers the temporal dynamics of each cluster
Finally, we mention that PDHP correctly infers the kernel parameters in every situation where events are correctly assigned to their temporal cluster. We looked at the mean absolute error (MAE) between the vector $\vec{\alpha}_c$ used to generate the dataset and the inferred one. When documents are correctly classified, the MAE on the true $\vec{\alpha}_c$ entries is systematically lower than 0.1. We do not discuss this metric further because it is directly correlated with the NMI metric. If documents are correctly classified, the inferred intensity function is based on the correct observations and corresponds to the one used for data generation. If documents are not correctly classified, the inferred intensity function is close to the optimal one (in terms of MAE) given the available information, but it may be far from the one used to generate the datasets because events are misclassified.

IV-G. Real-world application on Reddit
We use the PDHP prior to model real streams of textual documents. We consider three Reddit datasets² about different topics. The News dataset is made of 73,000 titles extracted from the subreddits inthenews, neutralnews, news, nottheonion, offbeat, open_news, qualitynews, truenews and worldnews, from April 2019. We chose this month because of the wide variety of events that happened then (for instance, the Sri Lanka Easter bombings, Julian Assange's arrest, the first direct picture of a black hole, the Notre-Dame cathedral fire). We also consider 15,000 post titles of the subreddit TodayILearned (TIL dataset) and 13,000 post titles of the subreddit AskScience (AskScience dataset), both from January 2019. We extracted the nouns, verbs, adjectives and symbols from the textual data. We run the experiments using a fixed set of hyperparameters across datasets. The kernel vector $\vec{\kappa}$ is made of Gaussian functions with means located at 0.5, 1, 4, 8, 12, 24, 48, 72, 96, 120, 144 and 168 hours. Their variances are set to 1, 1, 3, 8, 12, 12, 24, 24, 24, 24, 24 and 24 hours. The algorithm then infers the weights $\vec{\alpha}_c$ associated with each entry of the kernel vector for each cluster.

² Available for download at https://files.pushshift.io/reddit/submissions/

IV-G1. PDHP recovers meaningful stories
As an illustrative example, we consider the inferred clusters most related to the Sri Lanka Easter bombings of April 21, 2019 in Fig. 8. We plot the triggering kernels on a log-log scale, because most of the intensity is concentrated at small times: the dynamics of information spread are bursty [Karsai2018BurstyHumanDynamics]. We see that the inferred dynamics change with $r$, which is expected since the clusters do not contain the same documents. For $r=0$, the Uniform process forms clusters based on textual information only; the triggering kernel is inferred afterward. For large $r$, on the contrary, clusters are formed based on the triggering kernel, and textual information follows; we see from the right plot that this cluster captures publications exhibiting a daily intensity cycle. Given the intensity spikes on April 21, 22, and 23, it is not surprising that articles about the Sri Lanka bombings are also part of this cluster. Note that the more $r$ increases, the more intense the triggering kernel is around 24h. We see from Fig. 8 that DHP is a specific case of the dynamics one can retrieve using temporal information.
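The triggering kernel of each cluster can be written as a weighted sum of the Gaussian basis functions described above. The following minimal sketch illustrates this construction; the function names and the one-hot weight handling in the usage example are our own illustration, not the paper's implementation:

```python
import numpy as np

# Gaussian basis means (hours) and spreads, as listed in the text.
MEANS = np.array([0.5, 1, 4, 8, 12, 24, 48, 72, 96, 120, 144, 168], dtype=float)
SIGMAS = np.array([1, 1, 3, 8, 12, 12, 24, 24, 24, 24, 24, 24], dtype=float)

def kernel_basis(dt):
    """Evaluate the vector of Gaussian basis functions at time lag dt (hours)."""
    return np.exp(-0.5 * ((dt - MEANS) / SIGMAS) ** 2) / (SIGMAS * np.sqrt(2 * np.pi))

def triggering_kernel(weights, dt):
    """Triggering kernel of one cluster: weighted sum of the basis functions."""
    return float(np.dot(weights, kernel_basis(dt)))
```

For instance, a weight vector concentrated on the sixth entry (mean 24h) yields a kernel peaking around one day, matching the daily publication cycle retrieved for large $r$.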
IV-G2 PDHP favors temporal or textual clustering depending on $r$
We report the values of the log-likelihoods for every dataset and various values of $r$ in Fig. 9. The textual likelihood is defined in Eq. 8, and the likelihood of a Hawkes process is defined in Eq. 5. Note that $r$ does not appear in either of these expressions: these likelihoods evaluate how well the textual or temporal aspect of the dataset is modeled, without considering the PDHP process itself. As expected, varying $r$ makes the model more sensitive to either textual or temporal data: a low $r$ favors text-based clustering, whereas a high $r$ favors time-based clustering.
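To make the temporal likelihood concrete: a Hawkes log-likelihood sums the log-intensity at each event time and subtracts the integral of the intensity (the compensator). The sketch below uses an exponential kernel for tractability rather than the paper's Gaussian basis, so it illustrates the structure of such a likelihood, not Eq. 5 itself:

```python
import numpy as np

def hawkes_loglik(event_times, mu, alpha, beta, T):
    """Log-likelihood of a univariate Hawkes process with baseline mu and
    exponential kernel alpha * beta * exp(-beta * dt), observed on [0, T].
    A generic sketch, not the paper's exact Eq. 5."""
    event_times = np.asarray(event_times, dtype=float)
    ll = 0.0
    for i, t in enumerate(event_times):
        dts = t - event_times[:i]                       # lags to past events
        lam = mu + np.sum(alpha * beta * np.exp(-beta * dts))
        ll += np.log(lam)                               # log-intensity at event
    # Compensator: integral of the intensity over [0, T].
    ll -= mu * T + np.sum(alpha * (1.0 - np.exp(-beta * (T - event_times))))
    return ll
```

With excitation switched off (alpha = 0), this reduces to the log-likelihood of a homogeneous Poisson process, n·log(mu) − mu·T.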
IV-G3 PDHP infers sharper textual clusters for low $r$
We evaluate how meaningful textual clusters are using entropy. We would like the set of words within a cluster to carry meaningful information. A way to measure this is to look at how spread out the vocabulary of a cluster is. Let $n_{c,w}$ be the count of word $w$ in cluster $c$, $N_c = \sum_w n_{c,w}$ the total word count of the cluster, and $V$ the vocabulary size. The normalized Shannon entropy of a cluster is defined as:

$$H_c = -\frac{1}{\log V} \sum_{w=1}^{V} \frac{n_{c,w}}{N_c} \log \frac{n_{c,w}}{N_c}$$   (10)
An entropy of 0 means the vocabulary of the cluster is concentrated on a single word; an entropy of 1 means that every word of the vocabulary is present to the same extent. In Fig. 10, we plot the mean entropy for various values of $r$ for all the datasets, along with the standard error over the clusters. The results show that the vocabulary is more concentrated within clusters for low values of $r$. The inflection point of the curves corresponds to what was previously observed with the likelihoods in Fig. 9. On the contrary, higher values of $r$ lead to clusters that comprise a less dense vocabulary. This is expected: as $r$ increases, the textual information is no longer the most relevant data for cluster formation.
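The entropy computation above is straightforward to sketch. The snippet below, a minimal illustration with our own function name, takes a mapping from words to counts and the vocabulary size, and normalizes by log of the vocabulary size so the result lies in [0, 1]:

```python
import math

def normalized_entropy(word_counts, vocab_size):
    """Normalized Shannon entropy of a cluster's word distribution.
    0 = all mass on a single word; 1 = uniform over the whole vocabulary."""
    total = sum(word_counts.values())
    h = -sum((n / total) * math.log(n / total)
             for n in word_counts.values() if n > 0)  # zero counts contribute 0
    return h / math.log(vocab_size)
```

For example, a cluster whose counts are concentrated on one word has entropy 0, while counts spread evenly over the full vocabulary give entropy 1.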
V Conclusion
We built the Powered Dirichlet-Hawkes process as a generalization of the Dirichlet-Hawkes process and the Uniform process, and showed how it improves performance on various datasets. Whether textual information conveys little information, temporal information conveys little information, or both are informative, our model correctly retrieves the original clusters used in the generation process with high accuracy. A central consideration in document clustering is that there are no "right" clusters. For instance, we illustrated how textual content and temporal dynamics can be decorrelated in real-life applications. The framework we developed is flexible enough to let users choose the weight they wish to give to temporal or textual information depending on the situation; when textual and temporal clusters are decorrelated, the model allows one to choose which of those to infer.
Many future extensions of PDHP are possible. For instance, it would be interesting to develop its hierarchical version (PHDHP), as has already been done with HDHP for DHP. Another interesting perspective would be a version based on multivariate Hawkes processes, to study how textual clusters' dynamics relate to each other. Given that several recent works are based on the regular Dirichlet-Hawkes process, it would be insightful to study how their results vary when using the Powered Dirichlet-Hawkes process instead. A study of the influence of the language model used along with PDHP would also be interesting, since the text model we used here was deliberately simple (our focus being on the PDHP prior and not on the model it is associated with).
Finally, it would be interesting to see how this model performs in other contexts where temporal and textual information are intertwined. For instance, in latent social network inference, we may be able to create clusters according to the observed temporal dynamics of publications, according to the textual information shared between users, or according to a combination of both.