I Introduction
To reveal the mechanisms that form complex systems, community-detection methods describe large-scale patterns in their networks of interactions Fortunato (2010). Traditionally, these methods describe only static network structures, without taking into account that networks form dynamically over time. Only recently have researchers proposed methods that incorporate higher-order temporal structures to capture the dynamics of networks Holme and Saramäki (2012); Holme (2015); Mucha et al. (2010); Rosvall and Bergstrom (2010); Ronhovde et al. (2011); Bassett et al. (2013); Pfitzner et al. (2013); Bazzi et al. (2014); Sarzynska et al. (2014); Gauvin et al. (2014); Yang et al. (2010); Xu and Iii (2013); Xu and Hero (2014); Xu (2014); Peel and Clauset (2015); Peixoto (2015a); MacMahon and Garlaschelli (2015); Ghasemian et al. (2015); Zhang et al. (2016). Another class of methods attempts to describe the dynamics that take place on a network, which can also be used as a proxy for the network structure. Traditionally, these methods use memoryless Markov chains Rosvall and Bergstrom (2008); Delvenne et al. (2010); Lambiotte et al. (2014), but researchers have recently shown that incorporating memory effects can reveal crucial information about the system Rosvall et al. (2014); Lambiotte et al. (2015); De Domenico et al. (2015). However, both avenues of research have encountered central limitations: On the one hand, methods that use memory to describe dynamics on networks rely on extrinsic methods to detect the appropriate memory order Rosvall et al. (2014); Xu et al. (2016). On the other hand, methods that attempt to describe the dynamics of networks adapt static descriptions by aggregating time windows into discrete layers Kivelä et al. (2014), and completely ignore the dynamics within the time windows.
Both for dynamics on and of networks, the methods require or impose ad hoc time scales that should instead be determined solely from data, or avoided entirely. Furthermore, since most descriptions also depend on a large number of additional degrees of freedom from all possible network partitions, they are prone to overfitting, when random fluctuations in high-dimensional data are mistaken for actual structure Guimerà et al. (2004). The imposed ad hoc time scales exacerbate this problem, since they further increase the number of degrees of freedom in the descriptions. Without a principled methodology to counteract this increasing complexity, it becomes difficult to separate meaningful effects from artifacts.

We present a general and principled approach to this problem by tackling dynamics on and of networks simultaneously (see Fig. 1). In contrast to approaches that incorporate temporal layers in methods for static network descriptions, we build our approach on describing the actual dynamics. We first formulate a generative model of discrete temporal processes based on arbitrary-order hidden Markov chains with community structure Baum et al. (1970); Rabiner and Juang (1986); Jääskinen et al. (2014); Xiong et al. (2016). Since our model generates arbitrary sequences of events, it does not require aggregation into time windows, and hence needs no a priori imposition of time scales. This model can be used to describe dynamics taking place on network systems that take into account memory effects Rosvall et al. (2014); Lambiotte et al. (2015) of arbitrary order. We then use this model as the basis for the description of temporal networks, where the sequence of events represents the occurrence of edges in the network Pfitzner et al. (2013). In both cases, we employ a nonparametric Bayesian inference framework that allows us to select the most parsimonious model according to the statistical evidence available, and hence is able to detect the most appropriate Markov order, as well as other properties such as the number of communities, while avoiding overfitting. In particular, if there is no structure in the data, either via fully random dynamics or a lack of large-scale structure in the network, our method is able to identify this. As we also show, the model can be used to predict future network dynamics and evolution from past observations.
Our model builds on the original idea of the stochastic block model (SBM) Holland et al. (1983); Karrer and Newman (2011)
, where the nodes are divided into groups, and the edge placement probabilities depend on the node memberships. In Sec.
II we adapt this idea to discrete-time Markov chains that operate on conditional probabilities in sequences of tokens, where both the individual tokens and their preceding subsequences are divided into groups. Then, in Sec. III, we extend this model to a community-based temporal network model that operates on the sequence of added edges in the evolution of a network. In both cases, we illustrate the use of the model with empirical data. In Sec. IV we discuss extensions of the approach to dynamical systems with continuous time, as well as nonstationary Markov chains. We conclude in Sec. V.

II Inference of Markov chains
Here we consider general time series composed of a sequence of discrete tokens $\{x_t\}$, where $x_t$ is a single token from an alphabet of size $N$ observed at discrete time $t$, and $\bm{x}_t$ is the memory composed of the $n$ previous tokens at time $t$. An $n$th-order Markov chain with transition probabilities $p(x_t|\bm{x}_t)$ generates such a sequence with probability
$$P(\{x_t\}|\bm{p}) = \prod_t p(x_t|\bm{x}_t) = \prod_{x,\bm{x}} p(x|\bm{x})^{a_{x\bm{x}}}, \qquad (1)$$
where $a_{x\bm{x}}$ is the number of transitions $\bm{x}\to x$ in $\{x_t\}$. Given a specific sequence $\{x_t\}$, we want to infer the transition probabilities $p(x|\bm{x})$. The simplest approach is to compute the maximum-likelihood estimate, i.e.
$$\hat{p}(x|\bm{x}) = \frac{a_{x\bm{x}}}{a_{\bm{x}}}, \qquad (2)$$
where $a_{\bm{x}} = \sum_x a_{x\bm{x}}$, which amounts simply to the frequency of observed transitions. Putting this back into the likelihood of Eq. 1, we have
$$P(\{x_t\}|\hat{\bm{p}}) = \prod_{x,\bm{x}} \left(\frac{a_{x\bm{x}}}{a_{\bm{x}}}\right)^{a_{x\bm{x}}}. \qquad (3)$$
This can be expressed via the conditional entropy $H(x|\bm{x})$ as $\ln P(\{x_t\}|\hat{\bm{p}}) = -E\,H(x|\bm{x})$, where $E$ is the total number of observed transitions. Hence, the maximization of the likelihood of Eq. 1 yields, seemingly, the transition probabilities that most compress the sequence. There is, however, an important caveat with this approach. Namely, it cannot be used when we are interested in determining the most appropriate order of the model. This is because if we increase $n$, the maximum likelihood of Eq. 3 will also invariably increase: As the number of memories becomes larger, while the total number of transitions remains fixed, the conditional entropy estimated from the frequency of transitions is bound to decrease. Hence, for some large enough value of $n$ there will be only one observed transition conditioned on every memory, yielding a zero conditional entropy and a maximum likelihood of one. This would be an extreme case of overfitting, where by increasing the number of degrees of freedom of the model one is not able to distinguish actual structure from stochastic fluctuations. Also, this approach does not yield true compression of the data, since it omits the description of the model (which becomes larger with $n$), and thus is crucially incomplete. To address this problem, one must use a Bayesian formulation, and maximize instead the complete evidence
$$P(\{x_t\}|n) = \int P(\{x_t\}|\bm{p})\,P(\bm{p}|n)\,\mathrm{d}\bm{p}, \qquad (4)$$
which consists in the sum over all possible models, weighted according to the prior probabilities $P(\bm{p}|n)$ that encode our a priori assumptions. This approach has been considered before by Strelioff et al. (2007), where it has been shown to yield the correct model order for data sampled from Markov chains, as long as there is enough statistics, as well as meaningful values when this is not the case, balancing the structure present in the data against its statistical weight.

Although this Bayesian approach addresses the overfitting problem satisfactorily, it misses opportunities to detect structures in data. As we show below, it is possible to extend this model in such a way as to make a direct connection to the problem of finding communities in networks, yielding a stronger explanatory power when modeling sequences, and serving as a basis for a model where the sequence itself represents a temporal network.
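These two regimes, the unbounded growth of the maximum likelihood and the peaked behavior of the Bayesian evidence, can be illustrated with a short numerical sketch. The following Python snippet is not part of the original analysis; it assumes independent symmetric Dirichlet priors on each row of transition probabilities (a simplification of the priors of Strelioff et al. (2007)) and compares both criteria on a toy sequence:

```python
import random
from collections import Counter
from math import lgamma, log

def markov_log_likelihood(seq, n):
    """Maximized log-likelihood of an nth-order chain (Eq. 3):
    sum over transitions of a * ln(a / a_mem)."""
    trans, totals = Counter(), Counter()
    for t in range(n, len(seq)):
        mem = tuple(seq[t - n:t])      # the n preceding tokens
        trans[(mem, seq[t])] += 1      # a_{x,mem}
        totals[mem] += 1               # a_mem
    return sum(a * log(a / totals[mem]) for (mem, _), a in trans.items())

def markov_log_evidence(seq, n, alpha=1.0):
    """Integrated likelihood of Eq. 4 with an independent symmetric
    Dirichlet prior (concentration alpha) on each transition distribution."""
    N = len(set(seq))                  # alphabet size
    counts = {}                        # memory -> Counter of following tokens
    for t in range(n, len(seq)):
        counts.setdefault(tuple(seq[t - n:t]), Counter())[seq[t]] += 1
    logev = 0.0
    for c in counts.values():          # Dirichlet-multinomial, one memory at a time
        tot = sum(c.values())
        logev += lgamma(N * alpha) - lgamma(N * alpha + tot)
        logev += sum(lgamma(alpha + a) - lgamma(alpha) for a in c.values())
    return logev

seq = list("abracadabra" * 20)

# The maximized likelihood never decreases with n: pure overfitting.
lls = [markov_log_likelihood(seq, n) for n in range(5)]
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))

# The evidence instead peaks at an intermediate order...
evs = [markov_log_evidence(seq, n) for n in range(5)]
best = max(range(5), key=lambda n: evs[n])
assert evs[best] > evs[0] and evs[best] > evs[4]

# ...and selects the memoryless model once the order of tokens is shuffled.
random.seed(0)
shuffled = random.sample(seq, len(seq))
assert max(range(3), key=lambda n: markov_log_evidence(shuffled, n)) == 0
```

Under these simplified priors the evidence already exhibits the qualitative behavior described in the text; the full model of the following sections replaces the factorized Dirichlet priors with a community-based microcanonical formulation.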
ii.1 Markov chains with communities
Figure 2: Schematic representation of the Markov model with communities for a sequence of tokens. The nodes are either memories (top row) or tokens (bottom row), and the directed edges represent observed transitions. (a) A partition of the tokens and memories for an $n=1$ model. (b) A “unified” formulation of an $n=1$ model, where the tokens and memories have the same partition, and hence can be represented as a single set of nodes. (c) A partition of the tokens and memories for an $n=2$ model.

Instead of directly inferring the transition probabilities of Eq. 1, we propose an alternative formulation: We assume that both the memories and tokens are distributed in disjoint groups (see Fig. 2). That is, $b_x$ and $b_{\bm{x}}$ are the group memberships of the tokens and memories, respectively, such that the transition probabilities can be parametrized as
$$p(x|\bm{x}) = \theta_x \lambda_{b_x b_{\bm{x}}}. \qquad (5)$$
Here $\theta_x$ is the relative probability at which token $x$ is selected among those that belong to the same group (playing a role analogous to degree-correction in the SBM Karrer and Newman (2011)) [Footnote 1: The degree-correction feature, as well as the different choice of prior probabilities in the Bayesian description, are the main differences from the sparse Markov chains developed in Refs. Jääskinen et al. (2014); Xiong et al. (2016), which are otherwise similar, but not identical, to the approach proposed here.], and $\lambda_{rs}$ is the overall transition probability from memory group $s$ to token group $r$. In the case $n=1$, for example, each token appears twice in the model, both as token and memory. (An alternative and often useful approach for $n=1$ is to consider a single unified partition for both tokens and memories, as shown in Fig. 2b and described in detail in Appendix B.) The maximum-likelihood estimates for the parameters are
$$\hat{\theta}_x = \frac{k_x}{e_{b_x}}, \qquad \hat{\lambda}_{rs} = \frac{e_{rs}}{e_s}, \qquad (6)$$
where $e_{rs}$ is the number of observed transitions between groups $r$ and $s$, $e_r$ ($e_s$) is the total number of incoming (outgoing) transitions of token group $r$ (memory group $s$), and $k_x$ is the total number of occurrences of token $x$. Putting this back in the likelihood, we have
$$P(\{x_t\}|\hat{\bm{\theta}},\hat{\bm{\lambda}},\bm{b}) = \frac{\prod_{rs} e_{rs}^{e_{rs}}}{\prod_s e_s^{e_s}} \times \frac{\prod_x k_x^{k_x}}{\prod_r e_r^{e_r}}. \qquad (7)$$
This is almost the same as the maximum likelihood of the degree-corrected stochastic block model (DCSBM) Karrer and Newman (2011), where $e_{rs}$ plays the role of the adjacency matrix of a bipartite multigraph connecting tokens and memories. The only differences are constant terms that do not alter the position of the maximum with respect to the node partition. This implies that, in certain situations, there is no difference between inferring the structure of a network directly from its topology or from dynamical processes taking place on it (see Appendix D for more details).
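To make the correspondence with Eq. 7 concrete, the following sketch (illustrative Python, not from the original work) evaluates the grouped likelihood directly from the transition counts of the bipartite memory-token multigraph; with every token and memory in its own group, it reduces exactly to the unclustered maximum likelihood of Eq. 3:

```python
from collections import Counter
from math import isclose, log

def grouped_log_likelihood(seq, n, b_tok, b_mem):
    """ln of Eq. 7, computed from the group-level transition counts e_rs of
    the bipartite memory-token multigraph (the DCSBM analogy)."""
    e_rs, e_s, e_r, k = Counter(), Counter(), Counter(), Counter()
    for t in range(n, len(seq)):
        mem, x = tuple(seq[t - n:t]), seq[t]
        r, s = b_tok[x], b_mem[mem]
        e_rs[(r, s)] += 1   # transitions from memory group s to token group r
        e_s[s] += 1         # outgoing transitions of memory group s
        e_r[r] += 1         # incoming transitions of token group r
        k[x] += 1           # occurrences of token x
    def h(c):
        return sum(v * log(v) for v in c.values())
    return h(e_rs) - h(e_s) + h(k) - h(e_r)

seq, n = list("abracadabra" * 20), 1
mems = {tuple(seq[t - n:t]) for t in range(n, len(seq))}

# Singleton partitions: each token and each memory in its own group.
b_tok = {x: ("tok", x) for x in set(seq)}
b_mem = {m: ("mem", m) for m in mems}
ll_grouped = grouped_log_likelihood(seq, n, b_tok, b_mem)

# Direct unclustered maximum likelihood (Eq. 3) for comparison:
trans, tot = Counter(), Counter()
for t in range(n, len(seq)):
    trans[(tuple(seq[t - n:t]), seq[t])] += 1
    tot[tuple(seq[t - n:t])] += 1
ll_direct = sum(a * log(a / tot[m]) for (m, _), a in trans.items())
assert isclose(ll_grouped, ll_direct)
```

Coarser partitions trade likelihood for a smaller model, which is precisely the trade-off the Bayesian formulation below arbitrates.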
As before, this maximum-likelihood approach cannot be used if we do not know the order of the Markov chain, otherwise it will overfit. In fact, this problem is now aggravated by the larger number of parameters this model has. Therefore, we employ a Bayesian formulation and construct a generative process for the model parameters themselves. We do this by introducing prior probability densities for the parameters $\bm{\theta}$ and $\bm{\lambda}$, with hyperparameter sets $\alpha$ and $\beta$, and computing the integrated likelihood

$$P(\{x_t\}|\bm{b},n) = \int P(\{x_t\}|\bm{\theta},\bm{\lambda},\bm{b})\,P(\bm{\theta}|\alpha)\,P(\bm{\lambda}|\beta)\,\mathrm{d}\bm{\theta}\,\mathrm{d}\bm{\lambda}. \qquad (8)$$
Now, instead of inferring the hyperparameters, we can make a noninformative choice for $\alpha$ and $\beta$ that reflects our a priori lack of preference towards any particular model Jaynes (2003). Doing so in this case yields a likelihood (see Appendix A for details),
$$P(\{x_t\}|\bm{b}) = P(\{x_t\}|\bm{e},\bm{k},\bm{b})\,P(\bm{k}|\bm{e},\bm{b})\,P(\bm{e}|\bm{b}), \qquad (9)$$

where

$$P(\{x_t\}|\bm{e},\bm{k},\bm{b}) = \frac{\prod_{rs} e_{rs}!\,\prod_x k_x!}{\prod_s e_s!\,\prod_r e_r!}, \qquad (10)$$

$$P(\bm{e}|\bm{b}) = \prod_s \left(\!\!\binom{B}{e_s}\!\!\right)^{-1}, \qquad (11)$$

$$P(\bm{k}|\bm{e},\bm{b}) = \prod_r \left(\!\!\binom{n_r}{e_r}\!\!\right)^{-1}, \qquad (12)$$
with $\left(\!\binom{m}{k}\!\right) = \binom{m+k-1}{k}$ denoting the number of $k$-multisets from $m$ elements, $B$ the number of token groups, and $n_r$ the number of tokens in group $r$. The expression above has the following combinatorial interpretation: Eq. 10 corresponds to the likelihood of a microcanonical model Peixoto (2016) where a random sequence is produced with exactly $e_{rs}$ total transitions between groups $r$ and $s$, and with each token occurring exactly $k_x$ times (see Appendix A for a proof). The remaining likelihoods are the prior probabilities on the discrete parameters $\bm{e}$ and $\bm{k}$, which are uniform distributions of the type $P = 1/\Omega$, where $\Omega$ is the total number of possibilities given the imposed constraints [Footnote 2: In the microcanonical model the marginal likelihood is identical to the joint probability of the data and parameters, i.e. $P(\{x_t\}|\bm{b}) = P(\{x_t\},\bm{e},\bm{k}|\bm{b})$, where the parameters on the right-hand side are the only ones compatible with the memory/token partition and the observed chain.].

Table 1: Description lengths for empirical sequences (US Air Flights, War and Peace, Taxi movements, RockYou password list), compared with gzip and LZMA compression. (Numerical values not recoverable from the source.)
If we apply Stirling’s factorial approximation to Eq. 10 we obtain $\ln P(\{x_t\}|\bm{e},\bm{k},\bm{b}) \approx \ln P(\{x_t\}|\hat{\bm{\theta}},\hat{\bm{\lambda}},\bm{b})$, with the right-hand side given by Eq. 7. Hence, the remaining terms (Eqs. 11 and 12) serve as a penalty to the maximum-likelihood estimate that prevents overfitting as the size of the model increases via the number of groups or the Markov order $n$ Peixoto (2013, 2014a, 2015b); Rosvall and Bergstrom (2007).
To make the above model fully nonparametric, we include priors for the remaining discrete parameters: the partitions $\bm{b}$ of tokens and memories, as well as the memory group counts, so that the complete joint likelihood becomes

$$P(\{x_t\},\bm{b}) = P(\{x_t\}|\bm{b})\,P(\bm{b}). \qquad (13)$$
This can be done in the same manner as for the SBM Peixoto (2013, 2014a, 2015b). In particular, to avoid underfitting the model Peixoto (2013) and to reveal hierarchical modular structures Peixoto (2014a), the uniform priors of Eqs. 11 and 12 can be replaced by multilevel Bayesian hierarchies (i.e. a sequence of priors and hyperpriors). In order to fit the model above we need to find the partitions of tokens and memories that maximize the joint likelihood $P(\{x_t\},\bm{b})$, or, equivalently, minimize the description length $\Sigma = -\ln P(\{x_t\},\bm{b})$ Grünwald (2007); Rosvall and Bergstrom (2007). This latter quantity has a straightforward but important information-theoretical interpretation: it corresponds to the amount of information necessary to describe both the data and the model simultaneously. Because of its complete character, minimizing it indeed amounts to achieving true compression of the data, differently from the parametric maximum-likelihood approach mentioned earlier. Because the whole process is functionally equivalent to inferring the SBM for networks, the same algorithms can be used Peixoto (2014b) (see Appendix A for a summary of the inference details).

This Markov chain model with communities succeeds in providing a better description for a variety of empirical sequences when compared with the common Markov chain parametrization (see Table 1). Not only do we observe a systematically smaller description length, but we also find evidence for higher-order memory in all examples. We emphasize that we are protected against overfitting: If we randomly shuffle the order of the tokens in each dataset, we always infer a fully random model, with a single group and no memory structure. Therefore, we are not susceptible to the spurious results of nonstatistical methods Guimerà et al. (2004).
To illustrate the effects of community structure on the Markov dynamics, we use the US air flight itineraries as an example (Fig. 3). In this dataset, the itineraries of passengers were recorded, and here each airport stop is treated as a token in a sequence (see Appendix F for more details). When we infer our model, the itinerary memories are grouped together if their destination probabilities are similar. As a result, it becomes possible, for example, to distinguish “transit hubs” from “destination hubs” Rosvall et al. (2014). We use Atlanta and Las Vegas to illustrate: Many round-trip routes transit through Atlanta from the origin to the final destination and return to it two legs later on the way back to the origin. Las Vegas, on the other hand, is often the final destination of a round trip, such that the stop two legs later represents a more diverse set of origins (Fig. 3). This pattern is captured in our model by the larger number of memory groups that involve Las Vegas than those that involve Atlanta. It is indeed these equivalence classes (i.e. memories and tokens that belong to the same group) that effectively reduce the overall complexity of the data, and provide better compression at a higher memory order, when compared to the traditional Markov chain formulation. Consequently, the community-based Markov model can capture patterns of higher-order memory that conventional methods obscure.
As we show in the next section, this model can also be used as the basis for a temporal network model, where the tokens in the sequence are edges placed in time.
III Temporal networks
Table 2: Inference results for empirical temporal networks: High school proximity, Enron email, Internet AS, APS citations, prosper.com loans, Chess moves, Hospital contacts, Infectious Sociopatterns, and Reality Mining. (Numerical values not recoverable from the source.)
A general model for temporal networks consists in treating the edge sequence as a time series Holme and Saramäki (2012); Holme (2015); Scholtes et al. (2014). We can in principle use the model above without any modification by considering the observed edges as tokens in the Markov chain, i.e., $x_t = (i,j)_t$, where $i$ and $j$ are the endpoints of the edge at time $t$ (see Fig. 1b). However, this can be suboptimal if the networks are sparse, i.e., if only some relatively small subset of all possible edges occurs, and thus there are insufficient data to reliably fit the model. Therefore, we adapt the model above by including an additional generative layer between the Markov chain and the observed edges. We do so by partitioning the nodes of the network into $B$ groups, i.e. $b_i$ determines the membership of node $i$ in one of the $B$ groups, such that each edge $(i,j)$ is associated with a label $(r,s) = (b_i, b_j)$. The idea then is to define a Markov chain for the sequence of edge labels, with the actual edges being sampled conditioned only on the labels. Since this reduces the number of possible tokens from $O(N^2)$ to $O(B^2)$, it has a more controllable number of parameters that can better match the sparsity of the data. We further assume that, given the node partitions, the edges themselves are sampled in a degree-corrected manner, conditioned on the edge labels,
$$P((i,j)|(r,s)) = \theta_i \theta_j\, \delta_{b_i r}\, \delta_{b_j s}, \qquad (14)$$
where $\theta_i$ is the probability of node $i$ being selected inside its group, with $\sum_{i \in r} \theta_i = 1$. The total likelihood conditioned on the label sequence becomes

$$P(\{(i,j)_t\}|\bm{\theta},\{(r,s)_t\}) = \prod_i \theta_i^{k_i} \prod_t \delta_{b_{i_t} r_t}\, \delta_{b_{j_t} s_t}, \qquad (15)$$
where $k_i$ is the degree of node $i$, and $e_{rs}$ is the total number of edges between groups $r$ and $s$. Performing maximum likelihood, one obtains $\hat{\theta}_i = k_i / e_{b_i}$. But since we want to avoid overfitting the model, we once more use noninformative priors, but this time on $\bm{\theta}$, and integrate over them, obtaining (omitting henceforth the trivial Kronecker delta term above)
$$P(\{(i,j)_t\}|\bm{b},\{(r,s)_t\}) = \prod_r \frac{(n_r - 1)!\,\prod_{i\in r} k_i!}{(e_r + n_r - 1)!}, \qquad (16)$$

with $e_r = \sum_{i\in r} k_i$ the total degree of group $r$ and $n_r$ the number of nodes in group $r$. Combining this with Eq. 9 applied to the label sequence $\{(r,s)_t\}$, we have the complete likelihood of the temporal network
$$P(\{(i,j)_t\},\{(r,s)_t\}|\bm{b}) = P(\{(i,j)_t\}|\bm{b},\{(r,s)_t\})\,P(\{(r,s)_t\}) \qquad (17)$$

$$= \prod_r \frac{(n_r-1)!\,\prod_{i\in r} k_i!}{(e_r+n_r-1)!} \times P(\{(r,s)_t\}). \qquad (18)$$
This likelihood can be rewritten in a way that makes clear that it is composed of one purely static and one purely dynamic part,

$$P(\{(i,j)_t\},\{(r,s)_t\}|\bm{b}) = P(\bm{A}|\bm{b}) \times P(\{(r,s)_t\}|\bm{b}). \qquad (19)$$

The first term of Eq. 19 is precisely the nonparametric likelihood of the static DCSBM that generates the aggregated graph with adjacency matrix $\bm{A}$ given the node partition $\bm{b}$, which itself is given by

(20)

if Stirling’s approximation is used. The second term in Eq. 19 is the likelihood of the Markov chain of edge labels given by Eq. 9, applied to the label sequence. This model, therefore, is a direct generalization of the static DCSBM, with a likelihood composed of separate static and dynamic terms. One recovers the static DCSBM exactly by making the label transitions memoryless, in which case the second term in Eq. 19 contributes only a trivial constant to the overall likelihood. Equivalently, we can view the DCSBM as the special case of this temporal network model without temporal structure.
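The projection of the edge sequence onto group labels can be sketched as follows (illustrative Python, with a made-up node partition); the Markov chain machinery of Sec. II is then fit to the label sequence, while Eq. 16 accounts for the choice of endpoints within groups:

```python
def edge_labels(edges, b):
    """Project a temporal edge sequence onto its group-label sequence
    (r, s)_t = (b_i, b_j)_t, shrinking the token alphabet from O(N^2)
    node pairs to O(B^2) group pairs."""
    return [(b[i], b[j]) for i, j in edges]

# Toy temporal network on 6 nodes in B = 2 groups.
b = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges = [(0, 1), (3, 4), (1, 2), (4, 5), (0, 2), (3, 5)]
labels = edge_labels(edges, b)
B = len(set(b.values()))

assert len(set(labels)) <= B * B       # at most B^2 distinct tokens
assert labels == [(0, 0), (1, 1)] * 3  # within-group edges alternate between the groups
```

Even in this toy case the label sequence exposes a temporal pattern (strict alternation between the two groups) that is invisible in the aggregated graph alone.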
To make the model nonparametric, we again include the same prior as before for the node partition $\bm{b}$, in addition to the token/memory partition of the label chain, such that the total nonparametric joint likelihood is maximized. In this way we are again protected against overfitting, and we can infer not only the number of memory and token groups, as before, but also the number of groups $B$ in the temporal network itself. If, for example, the temporal network is completely random, i.e. the edges are placed randomly both in the aggregated network and in time, we always infer a single group, $B=1$.
We employ this model in a variety of dynamic network datasets from different domains (see Table 2, and Appendix F for dataset descriptions). In all cases, we infer models that identify many groups for the tokens and memories, meaning that the model succeeds in capturing temporal structures. In most cases, models with $n=1$ best describe the data, implying that there is not sufficient evidence for higher-order memory, with the exception of the network of chess moves, which is best described by a higher-order model. To illustrate how the model characterizes the temporal structure of these systems, we focus on the proximity network of high school students, which corresponds to the voluntary tracking of students over a period of several days Mastrandrea et al. (2015). Whenever the distance between two students fell below a threshold, an edge between them was recorded at that time. In Fig. 4 we see the best-fitting model for this data. The groups inferred for the aggregated network correspond exactly to the known division into 9 classes, as indicated in the figure, with the exception of the PC class, which was divided into two groups. The groups show a clear assortative structure, where most connections occur within each class. The clustering of the edge labels in the second part of the model reveals the temporal dynamics. We observe that the edges connecting nodes of the same group cluster either in single-node or small groups, with a high incidence of self-loops. This means that if an edge that connects two students of the same class appears in the sequence, the next edge is most likely also inside the same class, indicating that the students of the same class are clustered in space and time. The remaining edges between students of different classes are separated into two large groups. This division indicates that the different classes meet each other at different times.
Indeed, the classes are located in different parts of the school building and typically go to lunch separately Mastrandrea et al. (2015).
iii.1 Temporal prediction
A direct advantage of being able to extract such temporal patterns is that they can be used to make predictions. This is in particular true of the Bayesian approach, since it can even be used to predict tokens and memories not previously observed. We demonstrate this by dividing a sequence into two equal-sized contiguous parts, i.e. a training set $\bm{x}'$ and a validation set $\bm{x}''$. If we observe only the training set, a lower bound on the likelihood of the validation set conditioned on it is given by $\ln P(\bm{x}''|\bm{x}') \ge -\Delta\Sigma$, where $\Delta\Sigma$ is the difference in the description length between the entire data and the training set (see Appendix C for a proof). This lower bound becomes tight when both the validation and training sets become large, and hence can be used as an asymptotic approximation of the predictive likelihood. Table 2 shows empirical values for the same datasets as considered before, where the memoryless case corresponds to using only the static DCSBM to predict the edges, ignoring any time structure. The temporal network model provides better prediction in all cases.
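The connection between prediction and description length can be illustrated in miniature. For a memoryless chain with Dirichlet priors and no communities (a deliberately simplified stand-in for the full model), the predictive likelihood of a validation set equals the difference of log-evidences exactly, by the chain rule of probability:

```python
from collections import Counter
from math import isclose, lgamma, log

def log_evidence_iid(seq, alphabet, alpha=1.0):
    """Dirichlet-multinomial evidence of a memoryless (n = 0) chain; its
    negation plays the role of the description length Sigma here."""
    c = Counter(seq)
    N = len(alphabet)
    return (lgamma(N * alpha) - lgamma(N * alpha + len(seq))
            + sum(lgamma(alpha + c[x]) - lgamma(alpha) for x in alphabet))

alphabet = list("abcd")
seq = list("abcdabcadbcaabcd" * 10)
half = len(seq) // 2
train, valid = seq[:half], seq[half:]

# Predictive likelihood of the validation set, token by token, using the
# posterior predictive P(x|past) = (a_x + alpha) / (T + N*alpha):
counts, logp = Counter(train), 0.0
for i, x in enumerate(valid):
    T = len(train) + i
    logp += log((counts[x] + 1.0) / (T + len(alphabet)))
    counts[x] += 1

# Chain rule: ln P(valid|train) = Sigma(train) - Sigma(full data).
d_sigma = log_evidence_iid(seq, alphabet) - log_evidence_iid(train, alphabet)
assert isclose(logp, d_sigma)
```

In the full model the description length also involves a maximization over partitions, which turns this equality into the lower bound quoted in the text.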
IV Model extensions
iv.1 Continuous time
So far, we have considered sequences and temporal networks that evolve discretely in time. Although this is the appropriate description for many types of data, such as text, flight itineraries and chess moves, in many other cases events happen instead in real time. In this case, the time series can be represented, without any loss of generality, by an “embedded” sequence of tokens placed in discrete time, together with an additional sequence of waiting times $\{\Delta_t\}$, where $\Delta_t$ is the real time difference between tokens $x_{t-1}$ and $x_t$. Employing a continuous-time Markov chain description, the data likelihood can be written as
$$P(\{x_t\},\{\Delta_t\}|\bm{p},\bm{\lambda}) = P(\{x_t\}|\bm{p})\,P(\{\Delta_t\}|\{x_t\},\bm{\lambda}), \qquad (21)$$

with $P(\{x_t\}|\bm{p})$ given by Eq. 1, and

$$P(\{\Delta_t\}|\{x_t\},\bm{\lambda}) = \prod_t P(\Delta_t|\lambda_{x_t}), \qquad (22)$$

where

$$P(\Delta|\lambda) = \lambda e^{-\lambda\Delta} \qquad (23)$$

is a maximum-entropy distribution governing the waiting times, according to the frequency $\lambda$. Substituting this in Eq. 22, we have

$$P(\{\Delta_t\}|\{x_t\},\bm{\lambda}) = \prod_x \lambda_x^{k_x} e^{-\lambda_x \Delta_x}, \qquad (24)$$
where $k_x$ is the number of occurrences of token $x$ and $\Delta_x = \sum_t \delta_{x_t,x}\Delta_t$ is the total waiting time associated with token $x$. To compute the nonparametric Bayesian evidence, we need a conjugate prior for the frequencies $\bm{\lambda}$,

$$P(\lambda|\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\beta\lambda}, \qquad (25)$$
where $\alpha$ and $\beta$ are hyperparameters, interpreted, respectively, as the number and sum of prior observations. A fully noninformative choice would entail $\alpha \to 0$ and $\beta \to 0$, which would yield the so-called Jeffreys prior Jeffreys (1946), $P(\lambda) \propto 1/\lambda$. Unfortunately, this prior is improper, i.e. it is not a normalized distribution. In order to avoid this, we use instead $\alpha = 1$ and $\beta = \langle\Delta\rangle$, taking into account the global average waiting time. Using this prior, we obtain the Bayesian evidence for the waiting times as

$$P(\{\Delta_t\}|\{x_t\}) = \prod_x \int P(\{\Delta_t\}|x,\lambda_x)\,P(\lambda_x|\alpha,\beta)\,\mathrm{d}\lambda_x \qquad (26)$$

$$= \prod_x \frac{\beta^{\alpha}\,\Gamma(k_x+\alpha)}{\Gamma(\alpha)\,(\Delta_x+\beta)^{k_x+\alpha}}. \qquad (27)$$
Hence, if we employ the Bayesian parametrization with communities for the discrete embedded model as we did previously, we have

$$P(\{x_t\},\{\Delta_t\},\bm{b}) = P(\{x_t\},\bm{b})\,P(\{\Delta_t\}|\{x_t\}), \qquad (28)$$

with $P(\{x_t\},\bm{b})$ given by Eq. 13.
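A sketch of the waiting-time evidence (illustrative Python; `waiting_time_log_evidence` is a hypothetical helper, not from the original work): the closed form of Eqs. 26-27 for a single token can be cross-checked against the product of sequential posterior predictive densities, which for a gamma prior are Lomax distributions:

```python
import random
from math import isclose, lgamma, log

def waiting_time_log_evidence(deltas, alpha, beta):
    """Marginal likelihood of exponential waiting times with a
    Gamma(alpha, beta) prior on the rate (Eqs. 25-27, single token)."""
    k, total = len(deltas), sum(deltas)
    return (alpha * log(beta) - lgamma(alpha)
            + lgamma(alpha + k) - (alpha + k) * log(beta + total))

random.seed(2)
deltas = [random.expovariate(3.0) for _ in range(50)]
alpha, beta = 1.0, sum(deltas) / len(deltas)  # beta set to the average, as in the text

# Consistency check via the chain rule: the evidence equals the product of
# sequential posterior predictives.
logp, a, b = 0.0, alpha, beta
for d in deltas:
    logp += log(a) + a * log(b) - (a + 1) * log(b + d)  # Lomax(a, b) density at d
    a, b = a + 1, b + d                                  # posterior update
assert isclose(logp, waiting_time_log_evidence(deltas, alpha, beta))
```

The same closed form applies per token (or, below, per memory group) by replacing the counts and waiting-time totals accordingly.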
Since the partition of memories and tokens only influences the first term of Eq. 28, corresponding to the embedded discrete-time Markov chain, the outcome of the inference for any particular Markov order will not take into account the distribution of waiting times, although the preferred Markov order might be influenced by it. We can change this by modifying the model above, assuming that the waiting times are conditioned on the group membership of the memories,
$$P(\{\Delta_t\}|\{x_t\},\bm{\lambda}) = \prod_t P(\Delta_t|\lambda_{b_{\bm{x}_t}}), \qquad (29)$$

where $\lambda_r$ is a frequency associated with memory group $r$. The Bayesian evidence is computed in the same manner, integrating over $\bm{\lambda}$ with the noninformative prior of Eq. 25, yielding

$$P(\{\Delta_t\}|\{x_t\},\bm{b}) = \prod_r \frac{\beta^{\alpha}\,\Gamma(e_r+\alpha)}{\Gamma(\alpha)\,(\Delta_r+\beta)^{e_r+\alpha}}, \qquad (30)$$

where $\Delta_r = \sum_t \delta_{b_{\bm{x}_t},r}\,\Delta_t$ is the total waiting time associated with memory group $r$, and $e_r$ is the number of transitions from that group.
As an example of the use of this model variation, we consider a piano reduction of Beethoven’s fifth symphony [Footnote 3: Extracted in MIDI format from the Mutopia project at http://www.mutopiaproject.org.], represented as a sequence of notes from a finite alphabet. We consider both model variants, where the timings between notes are discarded, and where they are included. If individual notes occur simultaneously as part of a chord, we consider them to have occurred in relative alphabetical order, with a short fixed interval between them. The results of the inference can be seen in Table 3. The discrete-time model favors an $n=1$ Markov chain, whereas the continuous-time model favors $n=2$. This is an interesting result that shows that the timings alone can influence the most appropriate Markov order. We can see in more detail why by inspecting the typical waiting times conditioned on the memory groups, as shown in Fig. 5. For the discrete-time model, the actual continuous waiting times (which are not used during inference) are only weakly correlated with the memory groups. On the other hand, for the continuous-time model we find that the memories are divided in such a way that they are very strongly correlated with the waiting times: There is a group of memories for which the ensuing waiting times are always minimal, corresponding to note combinations that are always associated with chords. The remaining memories are divided into further groups that display at least two distinct time scales, i.e. short and long pauses between notes. These statistically significant patterns are only visible for the higher-order model.
Table 3: Description lengths of the discrete-time and continuous-time models for Beethoven’s fifth symphony. (Numerical values not recoverable from the source.)

Figure 5: Waiting times conditioned on the memory groups, for the $n=1$ discrete-time model (left) and the $n=2$ continuous-time model (right).
iv.1.1 Bursty dynamics
In the above model the waiting times are distributed according to the exponential distribution of Eq. 23, which has a typical time scale given by $1/\lambda$. However, one often encounters processes where the dynamics is bursty, i.e. the waiting times between events lack any characteristic scale, and are thus distributed according to a power law

$$P(\Delta|\lambda) = \frac{\lambda}{\Delta_{\min}}\left(\frac{\Delta}{\Delta_{\min}}\right)^{-\lambda-1} \qquad (31)$$
for $\Delta \ge \Delta_{\min}$, and $P(\Delta|\lambda) = 0$ otherwise. One could in principle repeat the above calculations with this distribution to obtain the inference procedure for this alternative model. However, this is in fact not necessary, since by making the transformation of variables

$$\mu_t = \ln\left(\frac{\Delta_t}{\Delta_{\min}}\right) \qquad (32)$$

we obtain for Eq. 31

$$P(\mu|\lambda) = \lambda e^{-\lambda\mu}, \qquad (33)$$

which is the same exponential distribution as Eq. 23. Hence, to use the bursty model variant we need only apply the transformation of Eq. 32 to the waiting times prior to inference, while maintaining the exact same algorithm.
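The transformation can be sketched as follows (illustrative Python, not from the original work). Power-law waiting times sampled by inverse-CDF are mapped through Eq. 32 and become exponential with rate $\lambda$, so the machinery of the previous section applies unchanged:

```python
import random
from math import log

def to_exponential(deltas, delta_min):
    """Map power-law waiting times onto exponential ones via
    mu = ln(Delta / Delta_min) (Eq. 32)."""
    return [log(d / delta_min) for d in deltas]

random.seed(1)
lam, delta_min = 2.0, 0.5
# Inverse-CDF sampling of the power law P(Delta) ~ Delta^{-(lam+1)}, Delta >= delta_min:
deltas = [delta_min * random.random() ** (-1.0 / lam) for _ in range(20000)]

mus = to_exponential(deltas, delta_min)
assert all(m >= 0 for m in mus)          # transformed values are non-negative
mean = sum(mus) / len(mus)
assert abs(mean - 1.0 / lam) < 0.02      # exponential with rate lam: mean ~ 1/lam
```

Since $\mu = -\lambda^{-1}\ln U$ for a uniform variate $U$, the transformed samples are exactly exponential, confirming that the bursty variant requires no change to the inference algorithm.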
iv.2 Nonstationarity
Figure 6: Inferred Markov model for the concatenated text of “War and Peace” and “À la recherche du temps perdu”, (a) without and (b) with epoch annotations distinguishing the two novels.
An underlying assumption of the proposed Markov model is that the same transition probabilities are used for the whole duration of the sequence, i.e. the Markov chain is stationary. Generalizations of the model can be considered where these probabilities change over time. Perhaps the simplest generalization is to assume that the dynamics is divided into discrete “epochs”, such that one replaces the tokens $x_t$ by pairs $(x_t, \tau_t)$, where $\tau_t$ represents the epoch in which token $x_t$ was observed. In fact, $\tau_t$ does not need to be associated with a temporal variable: it could be any arbitrary covariate that describes additional aspects of the data. By incorporating this type of “annotation” into the tokens, one can use a stationary Markov chain describing the augmented tokens that in fact corresponds to a nonstationary one if one omits the variable $\tau_t$ from the token descriptors. This effectively allows for arbitrary extensions of the model by simply incorporating appropriate covariates, and without requiring any modification to the inference algorithm.

Another consequence of this extension is that the same token can belong to different groups if it is associated with two or more different covariates, $(x, \tau_1)$ and $(x, \tau_2)$. Therefore, this inherently models a situation where the group memberships of tokens and memories vary in time.
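The mechanics of the annotation can be sketched as follows (illustrative Python; the Dirichlet-evidence helper is a simplified stand-in for the full nonparametric model). Two epochs with conflicting deterministic rules are better described by the annotated, stationary chain over augmented tokens, despite its larger alphabet:

```python
from collections import Counter
from math import lgamma

def chain_log_evidence(seq, n, alpha=1.0):
    """Dirichlet-multinomial evidence of an nth-order chain over the
    observed alphabet (uniform priors; a simplified stand-in for Eq. 4)."""
    N = len(set(seq))
    counts = {}
    for t in range(n, len(seq)):
        counts.setdefault(tuple(seq[t - n:t]), Counter())[seq[t]] += 1
    ev = 0.0
    for c in counts.values():
        tot = sum(c.values())
        ev += lgamma(N * alpha) - lgamma(N * alpha + tot)
        ev += sum(lgamma(alpha + a) - lgamma(alpha) for a in c.values())
    return ev

# Two epochs with conflicting deterministic rules: a->b in epoch 1, a->c in epoch 2.
epoch1, epoch2 = list("ab" * 100), list("ac" * 100)
mixed = epoch1 + epoch2
annotated = [(x, 1) for x in epoch1] + [(x, 2) for x in epoch2]

# The annotated chain wins despite its larger alphabet: the conflicting
# rules no longer have to be averaged into a single stochastic transition.
assert chain_log_evidence(annotated, 1) > chain_log_evidence(mixed, 1)
```

This mirrors the literary example discussed next, where annotating each token with the novel it belongs to yields a smaller description length than a mixed description.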
As an illustration of this application of the model, we consider two literary texts: an English translation of “War and Peace”, by Leo Tolstoy, and the French original of “À la recherche du temps perdu”, by Marcel Proust. First, we concatenate both novels, treating them as a single text. If we fit our Markov model to it, we obtain the model shown in Fig. 6a. In that figure, we have highlighted tokens and memories that involve letters that are exclusive to the French language, and thus occur predominantly in the second novel. We observe that the model essentially finds a mixture between English and French. If, however, we indicate in each token to which novel it belongs, e.g. $(x_t, 1)$ and $(x_t, 2)$, we obtain the model of Fig. 6b. In this case, the model is forced to separate the two novels, and one indeed learns the French patterns differently from the English ones. Since this “nonstationary” model possesses a larger number of memories and tokens, one would expect a larger description length. However, in this case it has a smaller description length than the mixed alternative, indicating that both patterns are sufficiently different to warrant a separate description. Therefore, this approach is capable of uncovering change points Peel and Clauset (2015), where the rules governing the dynamics change significantly from one period to another.
V Conclusion
We presented a dynamical variation of the degree-corrected stochastic block model that can capture long pathways and large-scale structures in sequences and temporal networks. The model is based on a nonparametric variable-order hidden Markov chain, and can be used not only to infer the most appropriate Markov order but also the number of groups in the model. Because of its nonparametric nature, it requires no a priori imposition of time scales, which are instead inferred according to the statistical evidence available.
We showed that the model successfully finds meaningful large-scale temporal structures in empirical systems, and that it predicts their temporal evolution. We also demonstrated the versatility of the approach by showing how the model can be easily extended to situations where the dynamics occurs in continuous time, or is nonstationary.
Our approach provides a principled and more powerful alternative to approaches that force the division of the network-formation dynamics into discrete time windows, and that require the appropriate amount of dynamical memory to be known in advance.
References
 Fortunato (2010) Santo Fortunato, “Community detection in graphs,” Physics Reports 486, 75–174 (2010).
 Holme and Saramäki (2012) Petter Holme and Jari Saramäki, “Temporal networks,” Physics Reports 519, 97–125 (2012).
 Holme (2015) Petter Holme, “Modern temporal network theory: A colloquium,” arXiv:1508.01303 [physics] (2015).
 Mucha et al. (2010) Peter J. Mucha, Thomas Richardson, Kevin Macon, Mason A. Porter, and Jukka-Pekka Onnela, “Community Structure in Time-Dependent, Multiscale, and Multiplex Networks,” Science 328, 876–878 (2010).
 Rosvall and Bergstrom (2010) Martin Rosvall and Carl T. Bergstrom, “Mapping Change in Large Networks,” PLoS ONE 5, e8694 (2010).
 Ronhovde et al. (2011) P. Ronhovde, S. Chakrabarty, M. Sahu, K. F. Kelton, N. A. Mauro, K. K. Sahu, and Z. Nussinov, “Detecting hidden spatial and spatiotemporal structures in glasses and complex physical systems by multiresolution network clustering,” arXiv:1102.1519 (2011).
 Bassett et al. (2013) Danielle S. Bassett, Mason A. Porter, Nicholas F. Wymbs, Scott T. Grafton, Jean M. Carlson, and Peter J. Mucha, “Robust detection of dynamic community structure in networks,” Chaos: An Interdisciplinary Journal of Nonlinear Science 23, 013142 (2013).
 Pfitzner et al. (2013) René Pfitzner, Ingo Scholtes, Antonios Garas, Claudio J. Tessone, and Frank Schweitzer, “Betweenness Preference: Quantifying Correlations in the Topological Dynamics of Temporal Networks,” Phys. Rev. Lett. 110, 198701 (2013).
 Bazzi et al. (2014) Marya Bazzi, Mason A. Porter, Stacy Williams, Mark McDonald, Daniel J. Fenn, and Sam D. Howison, “Community detection in temporal multilayer networks, and its application to correlation networks,” arXiv:1501.00040 [nlin, physics:physics, q-fin] (2014).
 Sarzynska et al. (2014) Marta Sarzynska, Elizabeth A. Leicht, Gerardo Chowell, and Mason A. Porter, “Null Models for Community Detection in Spatially-Embedded, Temporal Networks,” arXiv:1407.6297 [nlin, physics:physics, q-bio] (2014).

 Gauvin et al. (2014) Laetitia Gauvin, André Panisson, and Ciro Cattuto, “Detecting the Community Structure and Activity Patterns of Temporal Networks: A Non-Negative Tensor Factorization Approach,” PLoS ONE 9, e86028 (2014).
 Yang et al. (2010) Tianbao Yang, Yun Chi, Shenghuo Zhu, Yihong Gong, and Rong Jin, “Detecting communities and their evolutions in dynamic social networks—a Bayesian approach,” Mach Learn 82, 157–189 (2010).
 Xu and Iii (2013) Kevin S. Xu and Alfred O. Hero III, “Dynamic Stochastic Blockmodels: Statistical Models for Time-Evolving Networks,” in Social Computing, Behavioral-Cultural Modeling and Prediction, Lecture Notes in Computer Science No. 7812, edited by Ariel M. Greenberg, William G. Kennedy, and Nathan D. Bos (Springer Berlin Heidelberg, 2013) pp. 201–210.
 Xu and Hero (2014) K. S. Xu and A. O. Hero, “Dynamic Stochastic Blockmodels for Time-Evolving Social Networks,” IEEE Journal of Selected Topics in Signal Processing 8, 552–562 (2014).
 Xu (2014) Kevin S. Xu, “Stochastic Block Transition Models for Dynamic Networks,” arXiv:1411.5404 [physics, stat] (2014).
 Peel and Clauset (2015) Leto Peel and Aaron Clauset, “Detecting Change Points in the Large-Scale Structure of Evolving Networks,” in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).
 Peixoto (2015a) Tiago P. Peixoto, “Inferring the mesoscale structure of layered, edge-valued, and time-varying networks,” Phys. Rev. E 92, 042807 (2015a).
 MacMahon and Garlaschelli (2015) Mel MacMahon and Diego Garlaschelli, “Community Detection for Correlation Matrices,” Phys. Rev. X 5, 021006 (2015).
 Ghasemian et al. (2015) Amir Ghasemian, Pan Zhang, Aaron Clauset, Cristopher Moore, and Leto Peel, “Detectability thresholds and optimal algorithms for community structure in dynamic networks,” arXiv:1506.06179 [cond-mat, physics:physics, stat] (2015).
 Zhang et al. (2016) Xiao Zhang, Cristopher Moore, and M. E. J. Newman, “Random graph models for dynamic networks,” arXiv:1607.07570 [physics] (2016).
 Rosvall and Bergstrom (2008) Martin Rosvall and Carl T. Bergstrom, “Maps of random walks on complex networks reveal community structure,” PNAS 105, 1118–1123 (2008).
 Delvenne et al. (2010) J.C. Delvenne, S. N. Yaliraki, and M. Barahona, “Stability of graph communities across time scales,” PNAS 107, 12755–12760 (2010).
 Lambiotte et al. (2014) R. Lambiotte, J. C. Delvenne, and M. Barahona, “Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks,” IEEE Transactions on Network Science and Engineering 1, 76–90 (2014).
 Rosvall et al. (2014) Martin Rosvall, Alcides V. Esquivel, Andrea Lancichinetti, Jevin D. West, and Renaud Lambiotte, “Memory in network flows and its effects on spreading dynamics and community detection,” Nature Communications 5 (2014), 10.1038/ncomms5630.
 Lambiotte et al. (2015) Renaud Lambiotte, Vsevolod Salnikov, and Martin Rosvall, “Effect of memory on the dynamics of random walks on networks,” J. Complex Netw. 3, 177–188 (2015).
 De Domenico et al. (2015) Manlio De Domenico, Andrea Lancichinetti, Alex Arenas, and Martin Rosvall, “Identifying Modular Flows on Multilayer Networks Reveals Highly Overlapping Organization in Interconnected Systems,” Phys. Rev. X 5, 011027 (2015).
 Xu et al. (2016) Jian Xu, Thanuka L. Wickramarathne, and Nitesh V. Chawla, “Representing higher-order dependencies in networks,” Science Advances 2, e1600028 (2016).
 Kivelä et al. (2014) Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P. Gleeson, Yamir Moreno, and Mason A. Porter, “Multilayer networks,” J. Complex Netw. 2, 203–271 (2014).
 Guimerà et al. (2004) Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral, “Modularity from fluctuations in random graphs and complex networks,” Phys. Rev. E 70, 025101 (2004).
 Baum et al. (1970) Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss, “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains,” Ann. Math. Statist. 41, 164–171 (1970).

 Rabiner and Juang (1986) L. Rabiner and B. H. Juang, “An introduction to hidden Markov models,” IEEE ASSP Magazine 3, 4–16 (1986).
 Jääskinen et al. (2014) Väinö Jääskinen, Jie Xiong, Jukka Corander, and Timo Koski, “Sparse Markov Chains for Sequence Data,” Scand J Statist 41, 639–655 (2014).
 Xiong et al. (2016) Jie Xiong, Väinö Jääskinen, and Jukka Corander, “Recursive Learning for Sparse Markov Models,” Bayesian Anal. 11, 247–263 (2016).
 Holland et al. (1983) Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt, “Stochastic blockmodels: First steps,” Social Networks 5, 109–137 (1983).
 Karrer and Newman (2011) Brian Karrer and M. E. J. Newman, “Stochastic blockmodels and community structure in networks,” Phys. Rev. E 83, 016107 (2011).
 Strelioff et al. (2007) Christopher C. Strelioff, James P. Crutchfield, and Alfred W. Hübler, “Inferring Markov chains: Bayesian estimation, model comparison, entropy rate, and out-of-class modeling,” Phys. Rev. E 76, 011106 (2007).
 (37) The degree-correction feature, as well as different choices of prior probabilities in the Bayesian description, are the main differences from the sparse Markov chains developed in Refs. Jääskinen et al. (2014); Xiong et al. (2016), which are otherwise similar, but not identical, to the approach proposed here.
 Jaynes (2003) E. T. Jaynes, Probability Theory: The Logic of Science, edited by G. Larry Bretthorst (Cambridge University Press, Cambridge, UK; New York, NY, 2003).
 Peixoto (2016) Tiago P. Peixoto, “Nonparametric Bayesian inference of the microcanonical stochastic block model,” arXiv:1610.02703 [physics, stat] (2016).
 (40) In the microcanonical model the marginal likelihood is identical to the joint probability of the data and parameters, where the parameters on the right-hand side are the only ones which are compatible with the memory/token partition and the observed chain.
 Ziv and Lempel (1977) J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory 23, 337–343 (1977).
 Ziv and Lempel (1978) J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory 24, 530–536 (1978).
 Peixoto (2013) Tiago P. Peixoto, “Parsimonious Module Inference in Large Networks,” Phys. Rev. Lett. 110, 148701 (2013).
 Peixoto (2014a) Tiago P. Peixoto, “Hierarchical Block Structures and High-Resolution Model Selection in Large Networks,” Phys. Rev. X 4, 011047 (2014a).
 Peixoto (2015b) Tiago P. Peixoto, “Model Selection and Hypothesis Testing for LargeScale Network Models with Overlapping Groups,” Phys. Rev. X 5, 011033 (2015b).
 Rosvall and Bergstrom (2007) Martin Rosvall and Carl T. Bergstrom, “An information-theoretic framework for resolving community structure in complex networks,” PNAS 104, 7327–7331 (2007).
 Grünwald (2007) Peter D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007).

 Peixoto (2014b) Tiago P. Peixoto, “Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models,” Phys. Rev. E 89, 012804 (2014b).
 Scholtes et al. (2014) Ingo Scholtes, Nicolas Wider, René Pfitzner, Antonios Garas, Claudio J. Tessone, and Frank Schweitzer, “Causality-driven slowdown and speedup of diffusion in non-Markovian temporal networks,” Nat Commun 5 (2014), 10.1038/ncomms6024.
 Mastrandrea et al. (2015) Rossana Mastrandrea, Julie Fournet, and Alain Barrat, “Contact Patterns in a High School: A Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys,” PLoS ONE 10, e0136497 (2015).
 Jeffreys (1946) Harold Jeffreys, “An Invariant Form for the Prior Probability in Estimation Problems,” Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 186, 453–461 (1946).
 (52) Extracted in MIDI format from the Mutopia project at http://www.mutopiaproject.org.
 MacKay (2003) David J. C. MacKay, Information Theory, Inference and Learning Algorithms, first edition ed. (Cambridge University Press, 2003).
 Persson et al. (2016) Christian Persson, Ludvig Bohlin, Daniel Edler, and Martin Rosvall, “Maps of sparse Markov chains efficiently reveal community structure in network flows with memory,” arXiv:1606.08328 [physics] (2016).
 (55) http://www.transtats.bts.gov/.
 (56) Extracted verbatim from https://www.gutenberg.org/cache/epub/2600/pg2600.txt.
 (57) Retrieved from http://www.infochimps.com/datasets/uber-anonymized-gps-logs, also available at https://github.com/dima42/uber-gps-analysis.
 (58) Retrieved from http://downloads.skullsecurity.org/passwords/rockyou-withcount.txt.bz2.
 Klimt and Yang (2004) Bryan Klimt and Yiming Yang, “The Enron Corpus: A New Dataset for Email Classification Research,” in Machine Learning: ECML 2004, Lecture Notes in Computer Science No. 3201, edited by Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi (Springer Berlin Heidelberg, 2004) pp. 217–226.
 (60) Retrieved from http://www.caida.org.
 (61) Retrieved from http://journals.aps.org/datasets.
 (62) Retrieved from http://konect.uni-koblenz.de/networks/prosper-loans.
 (63) Retrieved from http://ficsgames.org/download.html.
 Vanhems et al. (2013) Philippe Vanhems, Alain Barrat, Ciro Cattuto, Jean-François Pinton, Nagham Khanafer, Corinne Régis, Byeul-a Kim, Brigitte Comte, and Nicolas Voirin, “Estimating Potential Infection Transmission Routes in Hospital Wards Using Wearable Proximity Sensors,” PLoS ONE 8, e73970 (2013).
 Isella et al. (2011) Lorenzo Isella, Juliette Stehlé, Alain Barrat, Ciro Cattuto, Jean-François Pinton, and Wouter Van den Broeck, “What’s in a crowd? Analysis of face-to-face behavioral networks,” Journal of Theoretical Biology 271, 166–180 (2011).
 Eagle and (Sandy) Pentland (2006) Nathan Eagle and Alex (Sandy) Pentland, “Reality Mining: Sensing Complex Social Systems,” Personal Ubiquitous Comput. 10, 255–268 (2006).
Appendix A Bayesian Markov chains with communities
As described in the main text, a Bayesian formulation of the Markov model consists in specifying prior probabilities for the model parameters, and integrating over them. In doing so, we convert the problem from one of parametric inference (i.e. when the model parameters need to be specified or fitted from data), to a nonparametric one (i.e. no parameters need to be specified or fitted). In this way, the approach possesses intrinsic regularization, where the order of the model can be inferred from data alone, without overfitting Jaynes (2003); MacKay (2003).
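As a minimal sketch of this intrinsic regularization (an ordinary Markov chain without communities, not the full model), the closed-form marginal likelihood obtained with uniform Dirichlet priors can be used to select the Markov order from the data alone. The generator and parameter values below are invented for illustration.

```python
import math
import random
from collections import Counter

def log_marginal_likelihood(seq, order, alphabet):
    """log P(x | order): transition probabilities integrated over
    uniform Dirichlet priors (all hyperparameters equal to 1)."""
    k = len(alphabet)
    counts = Counter()  # (memory, token) transition counts
    totals = Counter()  # total transitions out of each memory
    for t in range(order, len(seq)):
        mem = tuple(seq[t - order:t])
        counts[(mem, seq[t])] += 1
        totals[mem] += 1
    logp = 0.0
    for mem, n in totals.items():
        # Dirichlet-multinomial marginal: Gamma(k) * prod_x n_x! / Gamma(n + k)
        logp += math.lgamma(k) - math.lgamma(n + k)
        for x in alphabet:
            logp += math.lgamma(counts[(mem, x)] + 1)
    return logp

random.seed(1)
alphabet = ["a", "b"]
seq, state = [], "a"
for _ in range(2000):
    # order-1 dynamics: stay with probability 0.1, otherwise flip
    state = state if random.random() < 0.1 else ("b" if state == "a" else "a")
    seq.append(state)

scores = {n: log_marginal_likelihood(seq, n, alphabet) for n in (0, 1, 2)}
best = max(scores, key=scores.get)
print(best)  # order 1 should win: order 0 underfits, order 2 pays for redundant memories
```

The higher-order model fits the data no better than the true order, so the extra Dirichlet integrals act as an automatic complexity penalty, with no parameters tuned by hand.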
To accomplish this, we rewrite the model likelihood (using Eqs. 1 and 5) as
\[
P(\boldsymbol{x}|\{\theta\},\{\lambda\},\{b\}) = \prod_x \theta_x^{k_x} \prod_{rs} \lambda_{rs}^{e_{rs}}, \tag{34}
\]
and observe the normalization constraints $\sum_{x\in r}\theta_x = 1$ and $\sum_r \lambda_{rs} = 1$. Since this is just a product of multinomials, we can choose conjugate Dirichlet prior probability densities $P(\{\theta\}|\{\beta\})$ and $P(\{\lambda\}|\{\alpha\})$, which allows us to exactly compute the integrated likelihood,
\[
P(\boldsymbol{x}|\{b\}) = \prod_s \frac{\Gamma\!\left(\sum_r \alpha_{rs}\right)}{\Gamma\!\left(e_s + \sum_r \alpha_{rs}\right)} \prod_r \frac{\Gamma(e_{rs} + \alpha_{rs})}{\Gamma(\alpha_{rs})} \times \prod_r \frac{\Gamma\!\left(\sum_{x\in r} \beta_x\right)}{\Gamma\!\left(e_r + \sum_{x\in r} \beta_x\right)} \prod_{x\in r} \frac{\Gamma(k_x + \beta_x)}{\Gamma(\beta_x)}, \tag{35}
\]
where $e_s = \sum_r e_{rs}$ and $e_r = \sum_s e_{rs}$. We recover the Bayesian version of the common Markov chain formulation (see Ref. Strelioff et al. (2007)) if we put each memory and token in its own group. This remains a parametric distribution, since we need to specify or infer the hyperparameters $\{\alpha\}$ and $\{\beta\}$. However, in the absence of prior information it is more appropriate to make a noninformative choice that encodes our a priori lack of knowledge or preference towards any particular model, $\alpha_{rs} = 1$ and $\beta_x = 1$, which simplifies the above in a way that can be written exactly as Eqs. 6–9 in the main text. As was mentioned there, the likelihood of Eq. 7 corresponds to a microcanonical model that generates random sequences with hard constraints. In order to see this, consider a chain where there are exactly $e_{rs}$ transitions in total between token group $r$ and memory group $s$, and each token occurs exactly