
Modeling sequences and temporal networks with dynamic community structures

by Tiago P. Peixoto, et al.
University of Bath

In evolving complex systems such as air traffic and social organizations, collective effects emerge from their many components' dynamic interactions. While the dynamic interactions can be represented by temporal networks with nodes and links that change over time, they remain highly complex. It is therefore often necessary to use methods that extract the temporal networks' large-scale dynamic community structure. However, such methods are subject to overfitting or suffer from effects of arbitrary, a priori imposed timescales, which should instead be extracted from data. Here we simultaneously address both problems and develop a principled data-driven method that determines relevant timescales and identifies patterns of dynamics that take place on networks as well as shape the networks themselves. We base our method on an arbitrary-order Markov chain model with community structure, and develop a nonparametric Bayesian inference framework that identifies the simplest such model that can explain temporal interaction data.





I Introduction

To reveal the mechanisms that form complex systems, community-detection methods describe large-scale patterns in their networks of interactions Fortunato (2010). Traditionally, these methods describe only static network structures, without taking into account that networks form dynamically over time. Only recently have researchers proposed methods that incorporate higher-order temporal structures to capture the dynamics of networks Holme and Saramäki (2012); Holme (2015); Mucha et al. (2010); Rosvall and Bergstrom (2010); Ronhovde et al. (2011); Bassett et al. (2013); Pfitzner et al. (2013); Bazzi et al. (2014); Sarzynska et al. (2014); Gauvin et al. (2014); Yang et al. (2010); Xu and Hero III (2013); Xu and Hero (2014); Xu (2014); Peel and Clauset (2015); Peixoto (2015a); MacMahon and Garlaschelli (2015); Ghasemian et al. (2015); Zhang et al. (2016). Another class of methods attempts to describe the dynamics that take place on a network, which can also be used as a proxy for the network structure. Traditionally, these methods use memoryless Markov chains Rosvall and Bergstrom (2008); Delvenne et al. (2010); Lambiotte et al. (2014), but researchers have recently shown that incorporating memory effects can reveal crucial information about the system Rosvall et al. (2014); Lambiotte et al. (2015); De Domenico et al. (2015). However, both avenues of research have encountered central limitations: On the one hand, methods that use memory to describe dynamics on networks rely on extrinsic methods to detect the appropriate memory order Rosvall et al. (2014); Xu et al. (2016). On the other hand, methods that attempt to describe the dynamics of networks adapt static descriptions by aggregating time windows into discrete layers Kivelä et al. (2014), and completely ignore the dynamics within the time windows.

Both for dynamics on and of networks, the methods require or impose ad hoc time scales that should instead be determined solely from data, or avoided entirely. Furthermore, since most descriptions also depend on a large number of additional degrees of freedom from all possible network partitions, they are prone to overfitting, in which random fluctuations in high-dimensional data are mistaken for actual structure Guimerà et al. (2004). The imposed ad hoc time scales exacerbate this problem, since they further increase the number of degrees of freedom in the descriptions. Without a principled methodology to counteract this increasing complexity, it becomes difficult to separate meaningful effects from artifacts.

We present a general and principled approach to this problem by tackling dynamics on and of networks simultaneously (see Fig. 1). In contrast to approaches that incorporate temporal layers in methods for static network descriptions, we build our approach on describing the actual dynamics. We first formulate a generative model of discrete temporal processes based on arbitrary-order hidden Markov chains with community structure Baum et al. (1970); Rabiner and Juang (1986); Jääskinen et al. (2014); Xiong et al. (2016). Since our model generates arbitrary sequences of events, it does not require aggregation in time windows, and hence needs no a priori imposition of time scales. This model can be used to describe dynamics taking place on network systems that take into account memory effects Rosvall et al. (2014); Lambiotte et al. (2015) of arbitrary order. We then use this model as the basis for the description of temporal networks, where the sequence of events represents the occurrence of edges in the network Pfitzner et al. (2013). In both cases, we employ a nonparametric Bayesian inference framework that allows us to select the most parsimonious model according to the statistical evidence available, and hence is able to detect the most appropriate Markov order, as well as other properties such as the number of communities, while avoiding overfitting. In particular, if there is no structure in the data — either via a fully random dynamics or a lack of large-scale structure in the network — our method is able to identify this. As we also show, the model can be used to predict future network dynamics and evolution from past observations.

Our model builds on the original idea of the stochastic block model (SBM) Holland et al. (1983); Karrer and Newman (2011), where the nodes are divided into groups, and the edge placement probabilities depend on the node memberships. In Sec. II we adapt this idea to discrete-time Markov chains that operate on conditional probabilities in sequences of tokens, where both the individual tokens and their preceding subsequences are divided into groups. Then, in Sec. III, we extend this model to a community-based temporal network model that operates on the sequence of added edges in the evolution of a network. In both cases, we illustrate the use of the model with empirical data. In Sec. IV we discuss extensions of the approach to dynamical systems with continuous time, as well as nonstationary Markov chains. We finalize in Sec. V with a conclusion.


Figure 1: Our modeling framework allows simultaneously for the description of (a) arbitrary dynamics taking place on networks, represented as sequence of arbitrary “tokens” that are associated with nodes, and (b) dynamics of networks themselves, where the tokens are edges (pairs of nodes) that appear in sequence.

II Inference of Markov chains

Here we consider general time series composed of a sequence of discrete tokens $\{x_t\}$, where $x_t$ is a single token from an alphabet of size $N$ observed at discrete time $t$, and $\vec{x}_t = (x_{t-1}, \dots, x_{t-n})$ are the previous $n$ tokens at time $t$. An $n$th-order Markov chain with transition probabilities $p(x_t|\vec{x}_t)$ generates such a sequence with probability

$$P(\{x_t\}|p) = \prod_t p(x_t|\vec{x}_t) = \prod_{x,\vec{x}} p(x|\vec{x})^{a_{x\vec{x}}}, \qquad (1)$$

where $a_{x\vec{x}}$ is the number of transitions $\vec{x}\to x$ in $\{x_t\}$. Given a specific sequence $\{x_t\}$, we want to infer the transition probabilities $p(x|\vec{x})$. The simplest approach is to compute the maximum-likelihood estimate, i.e.

$$\hat{p}(x|\vec{x}) = \underset{p}{\operatorname{argmax}}\; P(\{x_t\}|p) = \frac{a_{x\vec{x}}}{a_{\vec{x}}}, \qquad (2)$$

where $a_{\vec{x}} = \sum_x a_{x\vec{x}}$, which amounts simply to the frequency of observed transitions. Putting this back into the likelihood of Eq. 1, we have

$$\ln P(\{x_t\}|\hat{p}) = \sum_{x\vec{x}} a_{x\vec{x}} \ln\frac{a_{x\vec{x}}}{a_{\vec{x}}}. \qquad (3)$$

This can be expressed via the conditional entropy $H(x|\vec{x}) = -\sum_{x\vec{x}} (a_{x\vec{x}}/E) \ln(a_{x\vec{x}}/a_{\vec{x}})$ as $\ln P(\{x_t\}|\hat{p}) = -E\,H(x|\vec{x})$, where $E = \sum_{x\vec{x}} a_{x\vec{x}}$ is the total number of observed transitions. Hence, the maximization of the likelihood of Eq. 1 yields, seemingly, the transition probabilities which most compress the sequence. There is, however, an important caveat with this approach: it cannot be used when we are interested in determining the most appropriate order $n$ of the model. This is because if we increase $n$, the maximum likelihood of Eq. 3 will also invariably increase: as the number of memories becomes larger while the total number of transitions remains fixed, the conditional entropy estimated from the frequency of transitions is bound to decrease. Hence, for some large enough value of $n$ there will be only one observed transition conditioned on every memory, yielding a zero conditional entropy and a maximum likelihood of one. This would be an extreme case of overfitting, where by increasing the number of degrees of freedom of the model one is no longer able to distinguish actual structure from stochastic fluctuations. Moreover, this approach does not yield true compression of the data, since it omits the description of the model (which grows with $n$), and is thus crucially incomplete. To address this problem, one must use a Bayesian formulation, and maximize instead the complete evidence

$$P(\{x_t\}) = \int P(\{x_t\}|p)\,P(p)\,\mathrm{d}p, \qquad (4)$$

which consists in the sum over all possible models, weighted according to prior probabilities $P(p)$ that encode our a priori assumptions. This approach has been considered before by Strelioff et al. Strelioff et al. (2007), where it has been shown to yield the correct model order for data sampled from Markov chains — as long as there are enough statistics — as well as meaningful values when this is not the case, balancing the structure present in the data with its statistical weight.
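To make the contrast concrete, the following minimal sketch (hypothetical helper names; independent Dirichlet priors of unit strength per memory, as in the direct parametrization of Strelioff et al.) compares the maximum-likelihood score with the integrated evidence on synthetic sequences:

```python
import random
from collections import Counter
from math import lgamma, log

def count_transitions(seq, n):
    """Count a_{x,mem} (trans) and a_{mem} (mem) for an n-th order chain."""
    trans, mem = Counter(), Counter()
    for t in range(n, len(seq)):
        m = tuple(seq[t - n:t])
        trans[(m, seq[t])] += 1
        mem[m] += 1
    return trans, mem

def mle_loglik(seq, n):
    """Plug the observed frequencies back into the likelihood (Eq. 3)."""
    trans, mem = count_transitions(seq, n)
    return sum(a * log(a / mem[m]) for (m, x), a in trans.items())

def log_evidence(seq, n, N, alpha=1.0):
    """Integrated likelihood (Eq. 4) with a Dirichlet(alpha) prior over the
    outgoing distribution of each memory (direct parametrization)."""
    trans, mem = count_transitions(seq, n)
    le = sum(lgamma(N * alpha) - lgamma(c + N * alpha) for c in mem.values())
    le += sum(lgamma(a + alpha) - lgamma(alpha) for _, a in trans.items())
    return le

random.seed(1)
iid = [random.choice("ab") for _ in range(2000)]  # memoryless data
alt = list("ab") * 1000                           # strictly first-order data

# The maximum likelihood never decreases with n, even for memoryless data,
# whereas the evidence prefers n=0 for `iid` and n=1 for `alt`.
mle_gain = mle_loglik(iid, 3) - mle_loglik(iid, 0)
evidence_gap_iid = log_evidence(iid, 0, 2) - log_evidence(iid, 3, 2)
evidence_gap_alt = log_evidence(alt, 1, 2) - log_evidence(alt, 0, 2)
```

The maximum likelihood never penalizes superfluous memories, while the evidence selects the smallest order compatible with the data.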

Although this Bayesian approach satisfactorily addresses the overfitting problem, it misses opportunities to detect structures in the data. As we show below, it is possible to extend this model in such a way as to make a direct connection to the problem of finding communities in networks, yielding stronger explanatory power when modeling sequences, and serving as a basis for a model where the sequence itself represents a temporal network.

II.1 Markov chains with communities







Figure 2: Schematic representation of the Markov model with communities for a sequence of tokens. The nodes are either memories (top row) or tokens (bottom row), and the directed edges represent observed transitions. (a) A partition of the tokens and memories for an $n=1$ model. (b) A “unified” formulation of an $n=1$ model, where the tokens and memories have the same partition, and hence can be represented as a single set of nodes. (c) A partition of the tokens and memories for an $n=2$ model.

Instead of directly inferring the transition probabilities of Eq. 1, we propose an alternative formulation: We assume that both the memories and tokens are distributed in disjoint groups (see Fig. 2). That is, $b=\{b_x\}$ and $c=\{c_{\vec{x}}\}$ are the group memberships of the tokens and memories, respectively, such that the transition probabilities can be parametrized as

$$p(x|\vec{x}) = \theta_x\,\lambda_{b_x c_{\vec{x}}}. \qquad (5)$$

Here $\theta_x$ is the relative probability at which token $x$ is selected among those that belong to the same group (playing a role analogous to degree correction in the SBM Karrer and Newman (2011); this degree-correction feature, as well as different choices of prior probabilities in the Bayesian description, are the main differences from the sparse Markov chains developed in Refs. Jääskinen et al. (2014); Xiong et al. (2016), which are otherwise similar, but not identical, to the approach proposed here), and $\lambda_{rs}$ is the overall transition probability from memory group $s$ to token group $r$. In the case $n=1$, for example, each token appears twice in the model, both as token and as memory. (An alternative and often useful approach for $n=1$ is to consider a single unified partition for both tokens and memories, as shown in Fig. 2b and described in detail in Appendix B.) The maximum likelihood estimates for the parameters are

$$\hat{\lambda}_{rs} = \frac{e_{rs}}{e_s}, \qquad \hat{\theta}_x = \frac{k_x}{e_{b_x}}, \qquad (6)$$

where $e_{rs}$ is the number of observed transitions from memory group $s$ to token group $r$, $e_s = \sum_r e_{rs}$ (resp. $e_r = \sum_s e_{rs}$) is the total number of outgoing (resp. incoming) transitions of a group, and $k_x$ is the total number of occurrences of token $x$. Putting this back in the likelihood, we have

$$\ln P(\{x_t\}|\hat{\theta},\hat{\lambda},b,c) = \sum_{rs} e_{rs}\ln\frac{e_{rs}}{e_r e_s} + \sum_x k_x\ln k_x. \qquad (7)$$

This is almost the same as the maximum likelihood of the degree-corrected stochastic block model (DCSBM) Karrer and Newman (2011), where $e_{rs}$ plays the role of the adjacency matrix of a bipartite multigraph connecting tokens and memories. The only differences are constant terms that do not alter the position of the maximum with respect to the node partition. This implies that, in certain situations, there is no difference between inferring the structure of a network directly from its topology or from dynamical processes taking place on it (see Appendix D for more details).
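The grouped estimate can be written down directly from the transition counts. The sketch below (a hypothetical helper; partitions supplied as plain dicts) evaluates the grouped maximum-likelihood expression, and reduces exactly to the ungrouped maximum likelihood of Eq. 3 when every token and memory sits in its own group:

```python
from collections import Counter
from math import log

def grouped_markov_loglik(seq, n, token_group, memory_group):
    """Grouped maximum likelihood: sum_rs e_rs ln(e_rs / (e_r e_s))
    plus sum_x k_x ln k_x, illustrating the parametrization
    p(x|mem) = theta_x * lambda_{b_x, c_mem}."""
    e, k = Counter(), Counter()
    for t in range(n, len(seq)):
        m = tuple(seq[t - n:t])
        e[(token_group[seq[t]], memory_group[m])] += 1
        k[seq[t]] += 1
    e_r, e_s = Counter(), Counter()
    for (r, s), c in e.items():
        e_r[r] += c
        e_s[s] += c
    return (sum(c * log(c / (e_r[r] * e_s[s])) for (r, s), c in e.items())
            + sum(c * log(c) for c in k.values()))
```

Coarser partitions can only lower this value, which is why the Bayesian priors introduced next are needed to decide when grouping is statistically justified.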

Figure 3: Part of the US Air flights itineraries during 2011. Only itineraries containing Atlanta or Las Vegas are shown. Edges incident on memories involving Atlanta are colored in red, whereas those involving Las Vegas are colored in blue. The node colors and overlaid hierarchical division correspond to part of the model inferred for the whole dataset.

As before, this maximum likelihood approach cannot be used if we do not know the order of the Markov chain, otherwise it will overfit. In fact, this problem is now aggravated by the larger number of parameters this model has. Therefore, we employ a Bayesian formulation and construct a generative process for the model parameters themselves. We do this by introducing prior probability densities $P(\theta|\alpha)$ and $P(\lambda|\beta)$ for the parameters $\theta$ and $\lambda$, with hyperparameter sets $\alpha$ and $\beta$, and computing the integrated likelihood

$$P(\{x_t\}|b,c) = \int P(\{x_t\}|\theta,\lambda,b,c)\,P(\theta|\alpha)\,P(\lambda|\beta)\,\mathrm{d}\theta\,\mathrm{d}\lambda. \qquad (8)$$

Now, instead of inferring the hyperparameters, we can make a noninformative choice for $\alpha$ and $\beta$ that reflects our a priori lack of preference towards any particular model Jaynes (2003). Doing so in this case yields a likelihood (see Appendix A for details)

$$P(\{x_t\}|b,c) = P(\{x_t\}|e,k,b,c)\,P(k|e,b,c)\,P(e|b,c), \qquad (9)$$

with

$$P(\{x_t\}|e,k,b,c) = \frac{\prod_{rs}e_{rs}!}{\prod_s e_s!}\times\frac{\prod_x k_x!}{\prod_r e_r!}, \qquad (10)$$

$$P(k|e,b,c) = \prod_r\binom{e_r+N_r-1}{N_r-1}^{-1}, \qquad (11)$$

$$P(e|b,c) = \prod_s\binom{e_s+B-1}{B-1}^{-1}, \qquad (12)$$

where $N_r$ is the number of distinct tokens in group $r$ and $B$ is the number of token groups. The expression above has the following combinatorial interpretation: $P(\{x_t\}|e,k,b,c)$ corresponds to the likelihood of a microcanonical model Peixoto (2016) where a random sequence is produced with exactly $e_{rs}$ total transitions between groups $r$ and $s$, and with each token $x$ occurring exactly $k_x$ times (see Appendix A for a proof). The remaining likelihoods are the prior probabilities on the discrete parameters $e=\{e_{rs}\}$ and $k=\{k_x\}$, which are uniform distributions of the type $P=1/\Omega$, where $\Omega$ is the total number of possibilities given the imposed constraints. (In the microcanonical model the marginal likelihood is identical to the joint probability of the data and discrete parameters, i.e. $P(\{x_t\}|b,c)=P(\{x_t\},e,k|b,c)$, where the parameters on the right-hand side are the only ones compatible with the memory/token partition and the observed chain.)

Table 1: Description length (in bits), as well as inferred numbers of token and memory groups, for the US Air Flights, War and Peace, Taxi movements, and RockYou password list datasets, for different Markov order $n$ (see Appendix F for dataset descriptions). One row corresponds to the direct Bayesian parametrization of Markov chains of Ref. Strelioff et al. (2007). Values in grey correspond to the minimum of each column. At the bottom is shown the compression obtained with gzip and LZMA, two popular variations of Lempel-Ziv Ziv and Lempel (1977, 1978), for the same datasets.

If we apply Stirling’s factorial approximation to Eq. 10, we obtain $\ln P(\{x_t\}|e,k,b,c) \approx \ln P(\{x_t\}|\hat{\theta},\hat{\lambda},b,c)$, with the right-hand side given by Eq. 7. Hence, the remaining terms (Eqs. 11 and 12) serve as a penalty to the maximum likelihood estimate that prevents overfitting as the size of the model increases via the Markov order or the number of groups Peixoto (2013, 2014a, 2015b); Rosvall and Bergstrom (2007).

To make the above model fully nonparametric, we include priors for the remaining discrete parameters: the node partitions $b$ and $c$, as well as the memory group counts, so that the complete joint likelihood becomes

$$P(\{x_t\},b,c) = P(\{x_t\}|b,c)\,P(b)\,P(c). \qquad (13)$$

This can be done in the same manner as for the SBM Peixoto (2013, 2014a, 2015b). In particular, to avoid underfitting the model Peixoto (2013) and to reveal hierarchical modular structures Peixoto (2014a), the uniform priors of Eqs. 11 and 12 can be replaced by multilevel Bayesian hierarchies (i.e. a sequence of priors and hyperpriors). In order to fit the model above we need to find the partitions $b$ and $c$ that maximize the joint likelihood $P(\{x_t\},b,c)$, or, equivalently, minimize the description length $\Sigma = -\ln P(\{x_t\},b,c)$ Grünwald (2007); Rosvall and Bergstrom (2007). This latter quantity has a straightforward but important information-theoretic interpretation: it corresponds to the amount of information necessary to describe both the data and the model simultaneously. Because of its complete character, minimizing it indeed amounts to achieving true compression of the data, differently from the parametric maximum likelihood approach mentioned earlier. Because the whole process is functionally equivalent to inferring the SBM for networks, the same algorithms can be used Peixoto (2014b) (see Appendix A for a summary of the inference details).
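As an illustration of the fitting procedure, here is a toy agglomerative heuristic (hypothetical names; it groups only the memories, and a crude BIC-style penalty stands in for the exact description length — the actual method uses the same MCMC algorithms as SBM inference):

```python
import random
from collections import Counter
from math import log

def objective(transitions, memory_group, N):
    """Grouped maximum likelihood of the chain minus a BIC-style penalty
    of (N - 1)/2 * ln(E) free parameters per memory group -- a crude
    stand-in for the exact description length used in the text."""
    e, e_s = Counter(), Counter()
    for (m, x), a in transitions.items():
        s = memory_group[m]
        e[(x, s)] += a
        e_s[s] += a
    E = sum(e_s.values())
    ll = sum(a * log(a / e_s[s]) for (x, s), a in e.items())
    B = len(set(memory_group.values()))
    return ll - 0.5 * B * (N - 1) * log(E)

def greedy_merge(seq, n, N):
    """Start with every memory in its own group and merge any pair of
    groups that improves the objective, until no merge helps."""
    transitions = Counter()
    for t in range(n, len(seq)):
        transitions[(tuple(seq[t - n:t]), seq[t])] += 1
    groups = {m: i for i, m in enumerate(sorted({m for m, _ in transitions}))}
    best, improved = objective(transitions, groups, N), True
    while improved:
        improved = False
        labels = sorted(set(groups.values()))
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                trial = {m: labels[i] if g == labels[j] else g
                         for m, g in groups.items()}
                score = objective(transitions, trial, N)
                if score > best:
                    best, groups, improved = score, trial, True
    return groups

# Chain where memories 'a' and 'b' have identical outgoing statistics:
# a -> c, b -> c, c -> d, and d -> a or b uniformly at random.
random.seed(2)
seq, x = [], "a"
for _ in range(1500):
    seq.append(x)
    x = {"a": "c", "b": "c", "c": "d"}.get(x) or random.choice("ab")
groups = greedy_merge(seq, 1, 4)
```

On this example the two memories with identical outgoing statistics are merged, while the remaining memories stay in their own groups.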

This Markov chain model with communities succeeds in providing a better description for a variety of empirical sequences when compared with the common Markov chain parametrization (see Table 1). Not only do we systematically observe a smaller description length, but we also find evidence for higher-order memory in all examples. We emphasize that we are protected against overfitting: If we randomly shuffle the order of the tokens in each dataset, we always infer a fully random model, with $n=0$ and a single group of tokens and memories. Therefore, we are not susceptible to the spurious results of nonstatistical methods Guimerà et al. (2004).

To illustrate the effects of community structure on the Markov dynamics, we use the US air flight itineraries as an example (Fig. 3). In this dataset, the itineraries of passengers were recorded, and here each airport stop is treated as a token in a sequence (see Appendix F for more details). When we infer our model, the itinerary memories are grouped together if their destination probabilities are similar. As a result, it becomes possible, for example, to distinguish “transit hubs” from “destination hubs” Rosvall et al. (2014). We use Atlanta and Las Vegas to illustrate: Many roundtrip routes transit through Atlanta from the origin to the final destination and return to it two legs later on the way back to the origin. On the other hand, Las Vegas often is the final destination of a roundtrip such that the stop two legs later represents a more diverse set of origins (Fig. 3). This pattern is captured in our model by the larger number of memory groups that involve Las Vegas than those that involve Atlanta. It is indeed these equivalence classes (i.e. memories and tokens that belong to the same group) that effectively reduce the overall complexity of the data, and provide better compression at a higher memory order, when compared to the traditional Markov chain formulation. Consequently, the community-based Markov model can capture patterns of higher-order memory that conventional methods obscure.

As we show in the next section, this model can also be used as the basis for a temporal network model, where the tokens in the sequence are edges placed in time.

III Temporal networks

Table 2: Description length (in nats), as well as inferred numbers of node, token, and memory groups, for the high school proximity, Enron email, Internet AS, APS citations, loans, chess moves, hospital contacts, Infectious Sociopatterns, and Reality Mining datasets, for different Markov order $n$ (see Appendix F). Also shown is a lower bound on the predictive likelihood of the validation set (corresponding to half of the entire sequence) given the training set and its best parameter estimate.

A general model for temporal networks consists in treating the edge sequence as a time series Holme and Saramäki (2012); Holme (2015); Scholtes et al. (2014). We can in principle use the model above without any modification by considering the observed edges as tokens in the Markov chain, i.e. $x_t = (i,j)_t$, where $i$ and $j$ are the endpoints of the edge observed at time $t$ (see Fig. 1b). However, this can be suboptimal if the networks are sparse, i.e. if only a relatively small subset of all possible edges occurs, and thus there are insufficient data to reliably fit the model. Therefore, we adapt the model above by including an additional generative layer between the Markov chain and the observed edges. We do so by partitioning the nodes of the network into $B$ groups, i.e. $b_i$ determines the membership of node $i$ in one of the $B$ groups, such that each edge $(i,j)$ is associated with a label $(b_i,b_j)$. The idea then is to define a Markov chain for the sequence of edge labels, with the actual edges being sampled conditioned only on the labels. Since this reduces the number of possible tokens from $O(N^2)$ to $B^2$, it yields a more controllable number of parameters that can better match the sparsity of the data. We further assume that, given the node partitions, the edges themselves are sampled in a degree-corrected manner, conditioned on the edge labels,

$$P((i,j)|(r,s)) = \theta_i\theta_j\,\delta_{b_i,r}\,\delta_{b_j,s}, \qquad (14)$$

where $\theta_i$ is the probability of node $i$ being selected inside its group, with $\sum_{i\in r}\theta_i=1$. The total likelihood conditioned on the label sequence becomes

$$P(\{(i,j)_t\}|\{(r,s)_t\},\theta,b) = \prod_i\theta_i^{k_i}\prod_t\delta_{b_{i_t},r_t}\,\delta_{b_{j_t},s_t}, \qquad (15)$$

where $k_i$ is the degree of node $i$, and $e_{rs}$ is the total number of edges between groups $r$ and $s$. Performing maximum likelihood, one obtains

$$\hat{\theta}_i = \frac{k_i}{e_{b_i}}, \qquad (16)$$

with $e_r=\sum_s e_{rs}$. But since we want to avoid overfitting the model, we once more use noninformative priors, this time on $\theta$, and integrate over them, obtaining (omitting henceforth the trivial Kronecker delta term above)

$$P(\{(i,j)_t\}|\{(r,s)_t\},b) = \prod_r\frac{(N_r-1)!\prod_{i\in r}k_i!}{(e_r+N_r-1)!}, \qquad (17)$$

with $N_r$ the number of nodes in group $r$. Combining this with the likelihood of the label sequence given by Eq. 9, we have the complete likelihood of the temporal network

$$P(\{(i,j)_t\}|b,c) = P(\{(i,j)_t\}|\{(r,s)_t\},b)\,P(\{(r,s)_t\}|c). \qquad (18)$$

This likelihood can be rewritten in a way that makes clear that it is composed of one purely static and one purely dynamic part,

$$P(\{(i,j)_t\}|b,c) = P(\bm{A}|b)\times P(\{(r,s)_t\}|c,\bm{A}). \qquad (19)$$

The first term of Eq. 19 is precisely the nonparametric likelihood of the static DCSBM that generates the aggregated graph with adjacency matrix $\bm{A}$ given the node partition $b$, which itself is given by

$$\ln P(\bm{A}|b) \approx \sum_{rs}e_{rs}\ln\frac{e_{rs}}{e_r e_s} + \sum_i k_i\ln k_i + \text{const.} \qquad (20)$$

if Stirling’s approximation is used. The second term in Eq. 19 is the likelihood of the Markov chain of edge labels given by Eq. 9, with the $B^2$ possible labels $(r,s)$ playing the role of the tokens. This model, therefore, is a direct generalization of the static DCSBM, with a likelihood composed of two separate static and dynamic terms. One recovers the static DCSBM exactly by choosing $n=0$ — making the state transitions memoryless — so that the second term in Eq. 19 above contributes only a trivial constant to the overall likelihood. Equivalently, we can view the DCSBM as a special case with $n=0$ of this temporal network model.
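The two-layer construction can be made concrete with a small sketch (hypothetical helper names): the node partition maps each observed edge to a group label, and the maximum-likelihood node propensities normalize within each group:

```python
from collections import Counter

def edge_label_sequence(edges, b):
    """Replace each observed edge (i, j) by its group label (b_i, b_j);
    the label sequence is then modeled by the Markov chain of Sec. II."""
    return [(b[i], b[j]) for i, j in edges]

def theta_mle(edges, b):
    """theta_i = k_i / e_{b_i}: node i's share of the edge endpoints
    that fall in its group."""
    k, e = Counter(), Counter()
    for i, j in edges:
        k[i] += 1
        k[j] += 1
        e[b[i]] += 1
        e[b[j]] += 1
    return {i: k[i] / e[b[i]] for i in k}

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (0, 3)]
b = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
labels = edge_label_sequence(edges, b)
theta = theta_mle(edges, b)
```

By construction, the propensities sum to one within each group, as required by the normalization of Eq. 14.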

To make the model nonparametric, we again include the same prior as before for the node partition $b$, in addition to the token/memory partition $c$, such that the total nonparametric joint likelihood, $P(\{(i,j)_t\},b,c)$, is maximized. In this way we are again protected against overfitting, and we can infer not only the number of memory and token groups, as before, but also the number of groups $B$ in the temporal network itself. If, for example, the temporal network is completely random — i.e. the edges are placed randomly both in the aggregated network as well as in time — we always infer a single group, $B=1$.

Figure 4: Inferred temporal model for a proximity network of high-school students Mastrandrea et al. (2015). (a) The static part of the model divides the students into groups (square nodes) that almost match the known classes (text labels). (b) The dynamic part of the model divides the directed multigraph group pairs in (a) into groups (grey circles). The model corresponds to a unified Markov chain on the edge labels, where the memory and tokens have identical partitions (see Appendix B for details).

We employ this model in a variety of dynamic network datasets from different domains (see Table 2, and Appendix F for dataset descriptions). In all cases, we infer models that identify many groups for the tokens and memories, meaning that the model succeeds in capturing temporal structures. In most cases, models with $n=1$ best describe the data, implying there is no sufficient evidence for higher-order memory, with the exception of the network of chess moves, which is best described by a higher-order model. To illustrate how the model characterizes the temporal structure of these systems, we focus on the proximity network of high school students, which corresponds to the voluntary tracking of students over a period of several days Mastrandrea et al. (2015). Whenever the distance between two students fell below a threshold, an edge between them was recorded at that time. In Fig. 4 we see the best-fitting model for this data. The groups inferred for the aggregated network correspond exactly to the known division into 9 classes, as indicated in the figure, with the exception of the PC class, which was divided into two groups. The groups show a clear assortative structure, where most connections occur within each class. The clustering of the edge labels in the second part of the model reveals the temporal dynamics. We observe that the edges connecting nodes of the same group cluster either in single-node or small groups, with a high incidence of self-loops. This means that if an edge that connects two students of the same class appears in the sequence, the next edge is most likely also inside the same class, indicating that the students of the same class are clustered in space and time. The remaining edges between students of different classes are separated into two large groups. This division indicates that the different classes meet each other at different times.
Indeed, the classes are located in different parts of the school building and typically go to lunch separately Mastrandrea et al. (2015).

III.1 Temporal prediction

A direct advantage of being able to extract such temporal patterns is that they can be used to make predictions. This is in particular true of the Bayesian approach, since it can even be used to predict tokens and memories not previously observed. We demonstrate this by dividing a sequence into two equal-sized contiguous parts: a training set $\bm{x}_{\text{train}}$ and a validation set $\bm{x}_{\text{val}}$. If we observe only the training set, a lower bound on the likelihood of the validation set conditioned on it is given by

$$P(\bm{x}_{\text{val}}|\bm{x}_{\text{train}}) \ge e^{-\Delta\Sigma}, \qquad (21)$$

where $\Delta\Sigma = \Sigma(\bm{x}) - \Sigma(\bm{x}_{\text{train}})$ is the difference in the description length between the entire data and the training set (see Appendix C for a proof). This lower bound becomes tight when both the validation and training sets become large, and hence can be used as an asymptotic approximation of the predictive likelihood. Table 2 shows empirical values for the same datasets as considered before, where the case $n=0$ corresponds to using only the static DCSBM to predict the edges, ignoring any time structure. The temporal network model provides better prediction in all cases.
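To illustrate the idea of predicting held-out data, the sketch below uses the direct (non-community) Bayesian parametrization, whose posterior-predictive probability of each validation token is available in closed sequential form; this is a simplified stand-in for the description-length bound used in the text:

```python
from collections import Counter
from math import log

def predictive_loglik(train, valid, n, alphabet, alpha=1.0):
    """Sequential posterior-predictive log-likelihood of `valid` given
    `train`, for an n-th order chain with a Dirichlet(alpha) prior per
    memory.  Unseen tokens and memories get nonzero probability through
    the prior, so events never observed in training can be predicted."""
    trans, mem = Counter(), Counter()
    for t in range(n, len(train)):
        m = tuple(train[t - n:t])
        trans[(m, train[t])] += 1
        mem[m] += 1
    N = len(alphabet)
    seq, ll = train + valid, 0.0
    for t in range(len(train), len(seq)):
        m = tuple(seq[t - n:t])
        ll += log((trans[(m, seq[t])] + alpha) / (mem[m] + alpha * N))
        trans[(m, seq[t])] += 1  # absorb each token after predicting it
        mem[m] += 1
    return ll

phrase = "the quick brown fox jumps over the lazy dog "
train, valid = phrase * 10, phrase * 10
alphabet = sorted(set(train))
ll1 = predictive_loglik(train, valid, 1, alphabet)
ll0 = predictive_loglik(train, valid, 0, alphabet)  # memoryless baseline
```

On this strongly structured text, the first-order predictor assigns a much higher probability to the held-out half than the memoryless one.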

IV Model extensions

IV.1 Continuous time

So far, we have considered sequences and temporal networks that evolve discretely in time. Although this is the appropriate description for many types of data, such as text, flight itineraries and chess moves, in many other cases events happen instead in real time. In this case, the time series can be represented — without any loss of generality — by an “embedded” sequence of tokens $\{x_t\}$ placed in discrete time, together with an additional sequence of waiting times $\{\Delta_t\}$, where $\Delta_t$ is the real time difference between tokens $x_t$ and $x_{t+1}$. Employing a continuous-time Markov chain description, the data likelihood can be written as

$$P(\{x_t\},\{\Delta_t\}|p,\mu) = P(\{x_t\}|p)\,P(\{\Delta_t\}|\{x_t\},\mu), \qquad (22)$$

with $P(\{x_t\}|p)$ given by Eq. 1, and

$$P(\Delta|x,\mu) = \mu_x e^{-\mu_x\Delta} \qquad (23)$$

is a maximum-entropy distribution governing the waiting times, according to the frequency $\mu_x$. Substituting this in Eq. 22, we have

$$P(\{\Delta_t\}|\{x_t\},\mu) = \prod_x \mu_x^{k_x} e^{-\mu_x S_x}, \qquad (24)$$

where $k_x$ is the number of occurrences of token $x$ and $S_x$ is the sum of the waiting times that follow it. To compute the nonparametric Bayesian evidence, we need a conjugate prior for the frequencies,

$$P(\mu_x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\mu_x^{\alpha-1}e^{-\beta\mu_x}, \qquad (25)$$

where $\alpha$ and $\beta$ are hyperparameters, interpreted, respectively, as the number and sum of prior observations. A fully noninformative choice would entail $\alpha\to0$ and $\beta\to0$, which would yield the so-called Jeffreys prior Jeffreys (1946),

$$P(\mu_x) \propto \frac{1}{\mu_x}. \qquad (26)$$

Unfortunately, this prior is improper, i.e. it is not a normalized distribution. In order to avoid this, we use instead $\alpha=1$ and $\beta=\langle\Delta\rangle$, taking into account the global average. Using this prior, we obtain the Bayesian evidence for the waiting times as

$$P(\{\Delta_t\}|\{x_t\}) = \prod_x \frac{\beta^\alpha}{\Gamma(\alpha)}\,\frac{\Gamma(k_x+\alpha)}{(S_x+\beta)^{k_x+\alpha}}. \qquad (27)$$
Hence, if we employ the Bayesian parametrization with communities for the discrete embedded model as we did previously, we have

$$P(\{x_t\},\{\Delta_t\},b,c) = P(\{x_t\},b,c)\,P(\{\Delta_t\}|\{x_t\}), \qquad (28)$$

with $P(\{x_t\},b,c)$ given by Eq. 13.

Since the partition of memories and tokens only influences the first term of Eq. 28, corresponding to the embedded discrete-time Markov chain, $P(\{x_t\},b,c)$, the outcome of the inference for any particular Markov order will not take into account the distribution of waiting times — although the preferred Markov order might be influenced by it. We can change this by modifying the model above, assuming that the waiting times are conditioned on the group membership of the memories,

$$P(\Delta|\vec{x},\mu) = \mu_{c_{\vec{x}}} e^{-\mu_{c_{\vec{x}}}\Delta}, \qquad (29)$$

where $\mu_s$ is a frequency associated with memory group $s$. The Bayesian evidence is computed in the same manner, integrating over $\mu_s$ with the noninformative prior of Eq. 25, yielding

$$P(\{\Delta_t\}|\{x_t\},c) = \prod_s \frac{\beta^\alpha}{\Gamma(\alpha)}\,\frac{\Gamma(e_s+\alpha)}{(S_s+\beta)^{e_s+\alpha}}, \qquad (30)$$

where $S_s$ is the sum of the waiting times that follow memories in group $s$, and $e_s$ their number.
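The integral behind this evidence is the standard Gamma–exponential conjugate marginal; a minimal sketch (hypothetical helper name) evaluates it in log space:

```python
from math import exp, lgamma, log

def waiting_time_log_evidence(delays, alpha=1.0, beta=1.0):
    """ln of  int_0^inf  mu^n e^(-mu S) * Gamma(mu; alpha, beta) dmu
    for n i.i.d. exponential waiting times with total S = sum(delays):
      alpha ln(beta) - lnGamma(alpha) + lnGamma(alpha + n)
        - (alpha + n) ln(beta + S).
    This is the standard Gamma-exponential conjugate result; the choice
    alpha = 1, beta = <Delta> matches the prior discussed above."""
    n, S = len(delays), sum(delays)
    return (alpha * log(beta) - lgamma(alpha)
            + lgamma(alpha + n) - (alpha + n) * log(beta + S))
```

Summing this quantity over the memory groups (each with its own set of ensuing waiting times) gives the second factor of the continuous-time likelihood.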

As an example of the use of this model variation, we consider a piano reduction of Beethoven’s fifth symphony (extracted in MIDI format from the Mutopia project), represented as a sequence of notes. We consider both model variants, where the timings between notes are discarded, and where they are included. If individual notes occur simultaneously as part of a chord, we consider them to have occurred in relative alphabetical order, with a short interval between them. The results of the inference can be seen in Table 3. The discrete-time model favors a lower-order Markov chain, whereas the continuous-time model favors a higher order. This is an interesting result that shows that the timings alone can influence the most appropriate Markov order. We can see in more detail why by inspecting the typical waiting times conditioned on the memory groups, as shown in Fig. 5. For the discrete-time model, the actual continuous waiting times (which are not used during inference) are only weakly correlated with the memory groups. On the other hand, for the continuous-time model we find that the memories are divided in such a way that they are very strongly correlated with the waiting times: There is a group of memories for which the ensuing waiting times are always very short — corresponding to note combinations that are always associated with chords. The remaining memories are divided into further groups that display at least two distinct time scales, i.e. short and long pauses between notes. These statistically significant patterns are only visible for the higher-order model.

Table 3: Joint likelihoods for the discrete-time and continuous-time Markov models, inferred for a piano reduction of Beethoven’s fifth symphony, as described in the text, for different Markov orders $n$. Values in grey correspond to the maximum likelihood of each column.




Figure 5: Average waiting time per memory group for the Markov models inferred from Beethoven’s fifth symphony. (a) Discrete-time model, ignoring the waiting times between notes. (b) Continuous-time model, with waiting times incorporated into the inference.

IV.1.1 Bursty dynamics

In the above model the waiting times are distributed according to the exponential distribution of Eq. 23, which has a typical time scale given by $1/\lambda$. However, one often encounters processes where the dynamics is bursty, i.e. the waiting times between events lack any characteristic scale, and are thus distributed according to a power law

$$P(\Delta t \,|\, \lambda) = \frac{\lambda\, \Delta t_m^{\lambda}}{\Delta t^{\lambda+1}} \qquad (31)$$

for $\Delta t \ge \Delta t_m$, and $P(\Delta t \,|\, \lambda) = 0$ otherwise. One could in principle repeat the above calculations with this distribution to obtain the inference procedure for this alternative model. However, this is in fact not necessary, since by making the transformation of variables

$$\Delta t \to \Delta t' = \ln(\Delta t / \Delta t_m) \qquad (32)$$

we obtain for Eq. 31

$$P(\Delta t' \,|\, \lambda) = \lambda e^{-\lambda \Delta t'}, \qquad (33)$$

which is the same exponential distribution of Eq. 23. Hence, to use the bursty model variant we need only perform the transformation of Eq. 32 on the waiting times prior to inference, while maintaining the exact same algorithm.
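This change of variables can be checked numerically. In the sketch below, the rate `lam` and lower cutoff `dt_min` are illustrative assumptions; power-law waiting times are drawn by inverse-transform sampling and then mapped to the log scale:

```python
import math
import random

random.seed(42)
lam, dt_min = 2.0, 0.1  # illustrative rate and lower cutoff (assumptions)

# Sample power-law (Pareto) waiting times via inverse transform:
# P(dt) = lam * dt_min^lam / dt^(lam+1) for dt >= dt_min.
samples = [dt_min * (1.0 - random.random()) ** (-1.0 / lam)
           for _ in range(100_000)]

# The transformation dt' = ln(dt / dt_min) maps these to an exponential
# distribution with rate lam, so the same inference machinery applies.
transformed = [math.log(dt / dt_min) for dt in samples]

mean = sum(transformed) / len(transformed)  # should be close to 1/lam
```

The sample mean of the transformed waiting times approaches $1/\lambda$, as expected for the exponential distribution of Eq. 23.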

iv.2 Nonstationarity





Figure 6: Markov model fit for a concatenated text composed of “War and peace”, by Leo Tolstoy, and “À la recherche du temps perdu”, by Marcel Proust. Edge endpoints in red and blue correspond to tokens and memories, respectively, that involve letters exclusive to French. (a) Version of the model where no distinction is made between the same token occurring in the different novels. (b) Version where each token is annotated with the novel in which it occurs, which yields a smaller description length.

An underlying assumption of the proposed Markov model is that the same transition probabilities are used for the whole duration of the sequence, i.e. the Markov chain is stationary. Generalizations of the model can be considered where these probabilities change over time. Perhaps the simplest generalization is to assume that the dynamics is divided into discrete “epochs”, such that one replaces each token $x$ by a pair $(x, y)$, where $y$ represents the epoch in which token $x$ was observed. In fact, $y$ does not need to be associated with a temporal variable; it could be any arbitrary covariate that describes additional aspects of the data. By incorporating this type of “annotation” into the tokens, one can use a stationary Markov chain describing the augmented tokens that in fact corresponds to a nonstationary one once the variable $y$ is omitted from the token descriptors. This effectively allows for arbitrary extensions of the model by simply incorporating appropriate covariates, without requiring any modification to the inference algorithm.

Another consequence of this extension is that the same token $x$ can belong to different groups if it is associated with two or more different covariates, e.g. as the pairs $(x, y)$ and $(x, y')$. Therefore, this inherently models a situation where the group memberships of tokens and memories vary in time.

As an illustration of this application of the model, we consider two literary texts: an English translation of “War and peace”, by Leo Tolstoy, and the French original of “À la recherche du temps perdu”, by Marcel Proust. First, we concatenate both novels, treating them as a single text. If we fit our Markov model to it, we obtain the model shown in Fig. 6a. In that figure, we have highlighted tokens and memories that involve letters exclusive to the French language, which thus occur predominantly in the second novel. We observe that the model essentially finds a mixture between English and French. If, however, we indicate in each token the novel to which it belongs, we obtain the model of Fig. 6b. In this case, the model is forced to separate the two novels, and the French patterns are indeed learned separately from the English ones. Since this “nonstationary” model possesses a larger number of memories and tokens, one would expect a larger description length. However, in this case it has a smaller description length than the mixed alternative, indicating that the two patterns are sufficiently different to warrant separate descriptions. Therefore, this approach is capable of uncovering change points Peel and Clauset (2015), where the rules governing the dynamics change significantly from one period to another.
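The token-annotation scheme can be illustrated with a toy sketch. The two strings below merely stand in for the novels, and the covariate labels "en"/"fr" are arbitrary tags:

```python
from collections import Counter

# Toy stand-ins for the two texts (illustrative strings, not the novels)
english = "the war and peace"
french = "le temps perdu"

# Annotate each token (here, a character) with a covariate marking its
# source, turning a nonstationary chain into a stationary one over the
# augmented tokens.
augmented = [(c, "en") for c in english] + [(c, "fr") for c in french]

# First-order transition counts over the augmented alphabet
transitions = Counter(zip(augmented, augmented[1:]))

# The same character now appears as two distinct augmented tokens, so it
# can be placed in different groups in the two epochs.
tokens = set(augmented)
```

Because the annotated tokens form an ordinary alphabet, the exact same inference algorithm applies without modification, as described above.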

V Conclusion

We presented a dynamical variation of the degree-corrected stochastic block model that can capture long pathways or large-scale structures in sequences and temporal networks. The model is based on a nonparametric variable-order hidden Markov chain, and can be used not only to infer the most appropriate Markov order but also the number of groups in the model. Because of its nonparametric nature, it requires no a priori imposition of time scales, which are instead inferred from the statistical evidence available.

We showed that the model successfully finds meaningful large-scale temporal structures in empirical systems, and that it predicts their temporal evolution. We also demonstrated the versatility of the approach by showing how the model can be easily extended to situations where the dynamics occurs in continuous time or is nonstationary.

Our approach provides a principled and more powerful alternative to approaches that force the division of the network-formation dynamics into discrete time windows, and require the appropriate amount of dynamical memory to be known in advance.


Appendix A Bayesian Markov chains with communities

As described in the main text, a Bayesian formulation of the Markov model consists in specifying prior probabilities for the model parameters, and integrating over them. In doing so, we convert the problem from one of parametric inference (i.e. when the model parameters need to be specified or fitted from data), to a nonparametric one (i.e. no parameters need to be specified or fitted). In this way, the approach possesses intrinsic regularization, where the order of the model can be inferred from data alone, without overfitting Jaynes (2003); MacKay (2003).
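The intrinsic regularization provided by integrating over the parameters can be illustrated with a minimal Dirichlet-multinomial sketch. This is a generic single-multinomial example with a uniform prior, not the paper's full grouped model:

```python
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """Log-probability of a specific sequence with the given transition
    counts, with the multinomial parameters integrated out under a
    uniform Dirichlet prior (alpha = 1, the noninformative choice)."""
    N, K = sum(counts), len(counts)
    out = lgamma(K * alpha) - lgamma(N + K * alpha)
    out += sum(lgamma(n + alpha) - lgamma(alpha) for n in counts)
    return out

# A strongly peaked count vector is better explained than a near-uniform
# one once the parameters are integrated out: the marginal likelihood
# automatically rewards structure without any fitted parameters.
peaked, flat = [18, 1, 1], [7, 7, 6]
assert log_marginal(peaked) > log_marginal(flat)
```

Because no parameters are fitted, adding spurious structure (e.g. an unnecessarily high Markov order) cannot inflate this score, which is the essence of the regularization discussed above.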

To accomplish this, we rewrite the model likelihood (using Eqs. 1 and 5) as


and observe that the model parameters obey the appropriate normalization constraints. Since this is just a product of multinomials, we can choose conjugate Dirichlet prior probability densities, which allows us to exactly compute the integrated likelihood,


We recover the Bayesian version of the common Markov chain formulation (see Ref. Strelioff et al. (2007)) if we put each memory and token in their own groups. This remains a parametric distribution, since we need to specify or infer the hyperparameters. However, in the absence of prior information it is more appropriate to make a noninformative choice that encodes our a priori lack of knowledge or preference towards any particular model, which simplifies the above in a way that can be written exactly as Eqs. 6-9 in the main text. As was mentioned there, the likelihood of Eq. 7 corresponds to a microcanonical model that generates random sequences with hard constraints. In order to see this, consider a chain where there are only a fixed number of transitions in total between each token group and memory group, and each token occurs exactly