Markov chains  are a cornerstone in the modelling of categorical sequence data. They model a sequence of discrete states or symbols by means of a discrete-time, stochastic process that has the Markov property, i.e. the next state only depends on the current one. Markov 
first applied Markov chains to the sequence of letters in the poem “Eugene Onegin” by Alexander Pushkin. Today, aside from natural language processing and speech recognition, Markov chain models are used to model sequences in biology, finance, computer science, and network analysis. Many sequential data in those applications do not satisfy the Markov assumption, i.e. the next state depends not only on the current state but rather on a longer history of states. To model such data, we can relax the Markov assumption by using Markov chains of higher order, where the order determines the number of previous states that transitions can depend on.
The use of higher-order Markov models raises a long-standing question: What is the optimal order to model a given data set? Underestimating the optimal order hinders the modelling of relevant patterns in the sequence, reducing the accuracy of predictions and limiting the performance of compression. Increasing the order
exponentially increases the size of the parameter space of the model, which leads to models that are subject to the curse of dimensionality. Apart from scalability issues, such models come with a high risk to overfit, to not generalize to new data, and to exhibit poor out-of-sample prediction performance. Over the last decades, researchers have developed various techniques to address model inference and selection for Markov chains. However, standard approaches do not incorporate knowledge aboutconstraints on transitions, i.e. specific sequences of states that are not possible due to the underlying process. An important application where such constraints are crucial is the use of Markov chains to model data on paths in a network, i.e. sequences of nodes with variable lengths that are traversed in a known graph. Examples include traces of information propagating in social networks, travel itineraries of passengers in transportation networks, or users navigating hyperlinked pages on the Web. In such sequences, transitions from one state to other states are constrained by the topology of the network. Due to recent discoveries of non-Markovian characteristics of paths in social networks, transportation, or information systems, the use of higher-order models has become a key approach to develop new network analytic methods for time series data . The need for techniques to reliably learn the optimal order of Markov chain models of paths in limited amounts of data has been identified as a key challenge in this area.
Despite a large body of inference and model selection techniques for unconstrained categorical sequences [4, 14, 2, 6, 8, 1, 3, 9, 10, 11, 12, 16, 23, 24, 26, 28, 27, 33, 34, 35, 36, 38, 42], ignoring that paths are subject to constraints leads to under-fitting and limits data efficiency 
. The reason why existing methods under-fit can be best understood for techniques relying on AIC and BIC, which account for the degrees of freedom to penalize the complexity of higher-order models. The topology of a network constrains possible state transitions, which reduces the degrees of freedoms for larger orders. Standard methods thus over-penalize model complexity and are likely to underestimate the optimal order.
In summary, although the detection of the optimal Markov order for categorical sequential data is a well-studied problem, data on paths pose additional challenges that have not been solved satisfactorily. Addressing this gap, we use a multi-order modelling framework to derive frequentist and Bayesian model selection techniques for path data. We experimentally validate them in data with known ground truth and evaluate their data efficiency, i.e. how much data they need to select the correct order. We repeat our experiment for different maximum model orders, sample sizes as well as for situations where we have only partial knowledge on the network topology. We find that the Bayesian approach proposed in our work outperforms all other methods for all scenarios, offering substantial improvements in estimation performance and data efficiency. Regarding a recently published method to address this problem based on the likelihood-ratio test
, we correct the degree of freedom calculation and discover issues that question the Chi-Square approximation of the test statistic.
2 Background and problem formulation
We first introduce order estimation techniques for higher-order Markov models for categorical sequences. We then explain additional challenges that occur when modelling paths in networks and specify the learning problem that we address.
Order Estimation for Higher-order Markov chains.
Consider a categorical sequence of length , where each symbol or state takes values from alphabet . A higher-order Markov chain of order assumes that each symbol is independent of all except the last symbols, i.e.
where for we recover an ordinary memoryless Markov chain . We use the term “zeroth-order Markov chain” to refer to a model where symbols are i.i.d. For a given -th order Markov chain with a specific choice of parameters that we denote as
, the conditional probability of symbolfor a given history of symbols is
where consists of all transition probabilities . Organizing by histories
, we can bring them in vector-form as:
For each history , is a point in a probability simplex in -dimensional space, which has degrees of freedom due to the normalization. The parameter space of a -th order Markov chain is then the Cartesian product of over all histories of length , ():
Therefore, a Markov chain of order has degrees of freedom.
For a given sequence and fixed order we can, e.g., use likelihood maximization to learn the parameters of the -th order Markov chain that best “explains” patterns in the sequence . Higher-order Markov chains with different orders are nested, i.e. for any choice of parameters of a -th order model, a point in the parameter space of a model with order exists such that the likelihoods of the two models are identical. This implies that a maximization of model likelihood trivially chooses the largest order available, since a model with smaller order can never have larger likelihood. Markov order detection is thus an exemplary instance of a model selection problem in which need to account both for the goodness-of-fit (e.g. expressed in the likelihood) and model parsimony (e.g. expressed in the degrees of freedom). A number of methods have been developed for this long-standing problem, including methods based on various estimators [3, 27, 11, 42, 12], surrogate data [38, 28, 9], generalizations of a likelihood ratio test , or mutual information [25, 26].
Some of the earliest works have applied a likelihood-ratio test [4, 2, 6, 8] to determine the optimal Markov order of a sequence. Considering that higher-order Markov chains with orders and are nested models, and assuming that the sequence is sufficiently long, Wilks’ theorem allows us to approximate the distribution of the test statistic based on the -distribution, where the parameter of the distribution is the difference between the degrees of freedom of a - and a -th order model. We can use this to calculate a
-value of the null hypothesis that the fitted parameteris in the subset of the model parameters which correspond to a -th order Markov chain. This approach naturally accounts for the expected increase in model likelihood that is due to the larger degrees of freedom of a Markov chain with order . Tong  applied Akaike’s information criterion (AIC)  to Markov order detection. This criterion balances the degrees of freedom of a -th order Markov chain with the goodness-of-fit captured in the likelihood . Schwarz et al.  proposed the Bayesian information criterion (BIC), which is based on the asymptotic behavior of Bayes estimators. Similar to the AIC, it incorporates both the goodness-of-fit and the degrees of freedom of the model. Katz  showed that AIC is an inconsistent estimator, prone to over-fitting even at large sample sizes, while BIC has been proven to be consistent 
. Most relevant for our work is the Bayes factor method that is based on Bayes rule. It was used by Liu and Lawrence , Strelioff et al. , Singer et al.  to determine the optimal order of Markov chain models for general categorical sequences.
Different from the works above, we address the problem of learning the optimal Markov order based on data of paths in a known network. Examples for such data include click streams generated by multiple sessions of users that navigate a hyperlinked document graph, collections of ticket data that capture the flight itineraries of passengers in an airline network, or information propagating along the edges of a social network. We assume that the data is an unordered multiset of constrained, variable-length sequences that correspond to paths in a given network with nodes and edges . A path in a network is a sequence of nodes , , where consecutive nodes are connected by an edge, i.e. .111We do not require nodes or edges to be unique, i.e. we do not distinguish between paths, trail, or walks . We denote the set of successors of node as and successors of a path as . We denote the set of all paths of length in a network as . To model such data, researchers in network science have studied higher-order network models , which combine higher-order Markov chains with network-based models for paths. Like higher-order Markov chains, a higher-order network (HON) with order assumes a -th order Markov property (Eq. 1). However, for inference and model selection tasks, it explicitly accounts for those state transitions that correspond to a possible path in a given network . For a given -order HON on with parameters that we denote as , the transition probabilities are
Again grouping parameters by histories, we obtain:
The difference in modelling a large number of (typically) short paths and a single long sequence might look small, but it poses a major issue. Neither Markov chains nor HONs of order capture probabilities for histories shorter than . For a single long sequence omitting the first symbols might be negligible. However, for data with a large number of short paths we would omit the first symbols of each path. Instead of a single HON, Scholtes  proposes to combine multiple HONs of orders as “layers” of a “multi-order network” (MON) model . Similar to a Markov chain of order , a MON with maximal order assumes a -th order Markov property (Eq. 1), modelling all transition probabilities for histories longer than with the -th order layer. Transition probabilities for histories of length are modelled with the -th layer. The probability of a sequence for a MON model is:
Since it combines multiple HONs, a MON model has degrees of freedom. With for the empty history , its parameter space is:
In the remainder of this work we address the problem of learning the optimal maximum order of a multi-order network model, given paths and a network .
3 Learning the Markov order of paths
We introduce four model selection techniques to learn the optimal maximum Markov order for a multi-order network model of paths. We first derive a Bayesian method to learn both the optimal parameters of a multi-order network model as well as the Markov order of paths using Bayes factors . We then adapt three MLE-based Markov order estimation techniques to our problem.
Bayesian learning of model parameters and optimal maximum Markov order.
We say that, in a path every node represents one observation of a transition from history to node , where . A single path of length contains transitions for histories with multiple lengths, from length zero (with empty history ) to . We denote the number of observed transitions from history to node in a data set of paths as
For each given maximum order , the Bayesian approach keeps track of the probability density over the parameter space of a multi-order model. Model parameters consist of transition probabilities to nodes from for all histories . We can model the probability density of by a Dirichlet distribution [18, 35] with concentration (hyper)parameters ( denotes the well-known multivariate Beta function):
We denote the prior with and use
for a so-called ”flat” prior, where the probability density is constant over the parameter space. This corresponds to choosing all hyperparameters as one, i.e.
From Bayes rule, we obtain the following update rule for the distribution of model parameters with maximum order on a network given path data and prior :
For given parameters of a MON model of order on , the likelihood does not depend on the choice of the prior, i.e. . Using the -th order Markov property (Eq. 1), we can compute the likelihood of parameters given data as
where is the number of permutations of paths in , is the maximal length of paths in , and counts the transitions in the data assuming the -th order Markov property as follows:
We organize as vectors . We then use Eq. 15
, and the moments of the Dirichlet distribution to derive the marginal likelihood (derivation inAppendix A):
Substituting back into Eq. 14, we arrive at the simple update rule:
This yields a Bayesian method to infer parameters of a multi-order model based on a large collection of paths and a fixed maximum order . To learn the optimal order we again apply Bayes rule. We first choose the orders that we want to compare, e.g., up to a maximum value of . We denote the choice of the prior over as . For the inference of the optimal order for -th order Markov chain models of unconstrained sequences (rather than multi-order models of paths), Strelioff et al.  mention two common priors: a uniform prior and a prior that additionally penalizes more complex models. Due to its superior performance for our problem, we limit our discussion to the uniform prior (see results for the latter prior in Appendix B):
For a data set and order
, we use Bayes rule to calculate posterior probabilities:
which we calculate analytically from Eq. 17 since the model evidence is equal to marginal likelihood. The ratio of probabilities is the Bayes factor. To facilitate the comparison between Bayesian and MLE-based model selection, we use hypothesis tests from  to output a single order, instead of assigning probabilities to each order. We chose the maximal that is significantly more likely than all models with . We use significance levels from , i.e. we find “positive” evidence in favor of over iff and “very strong” evidence iff . Since likelihoods are marginalized over the parameter space, larger do not necessarily have larger likelihood. This naturally introduces Occam’s razor  and avoids overfitting.
MLE-based Markov order estimation
Apart from the Bayesian method above, we can use maximum likelihood estimation (MLE) to infer transition probabilities based on the paths observed in a data set . We obtain an MLE of parameters as a ratio of transition frequencies
To learn the optimal order we can use information criteria like AIC or BIC, which for a multi-order model with estimated parameters are given as:
where is the total number of transitions on all paths in and are the degrees of freedom of a multi-order model. Note that depends on the maximum order and the topology of the network , where sparser networks typically lead to smaller degrees of freedom. While we omit it due to space constraints and inferior performance, in Appendix C we additionally adapt the EDC  to multi-order models.
We finally discuss a method to learn the optimal maximum order of a multi-order network model based on a likelihood-ratio test, which was previously adapted in . For two candidate orders , we define the test statistic based on the ratio of model likelihoods as:
Due to the nestedness of models (Section 2),
approximately follows a Chi-square distribution, i.e.
where are the degrees of freedom of a multi-order model with maximum order . As explained in Section 2, we can learn the optimal order by calculating a -value of the null hypothesis that the observed increase in likelihood is due to the additional complexity of a model with larger oder . While the same idea was used in , we highlight two important differences: First, for the degrees of freedom  does not distinguish between the underlying network and observed transitions in paths, i.e. it relies on the assumption that every possible edge in the network is traversed at least once. Second, despite a correct explanation in the text, there is a mistake in Eq. (8) and (9) of  that leads to an overestimation of the degrees of freedom, which we corrected in Section 2.
4 Experimental analysis
We now experimentally evaluate the model selection techniques introduced above in synthetic paths with known ground truth order. For a ground truth maximum order , we first generate a random undirected and unweighted graph . We then choose a multi-order network (MON) model uniformly at random from the space of possible models (Eq. 9). We finally use that model to generate a set of paths with a given size . Assuming that we have access to the path data and the network constraining paths, we construct MON models with different maximum orders , fit them to the paths , and determine the optimal maximum order using each of the four model selection techniques discussed in Section 3. We repeat the procedure above for random networks and determine the rates at which the methods select each order as optimal. We repeat this experiment for different data sizes . In Fig. 1 we present results for a random graph with nodes in edges and paths with ground truth order (analogous figure for in Appendix D). The figure shows the frequencies (-axis) at which the methods select order (four curves) for data with varying size (-axis). Shaded areas in Fig. 1 indicate the range of data sizes where the correct order was detected in all
experiments (AIC, BIC, BF) or, for the likelihood ratio test, where the observed type-I error rate was smaller than the significance threshold.
The results in Fig. 1 indicate a strong difference between the methods in terms of data efficiency, i.e the minimal number of observations needed to reliably detect the ground truth order. To further highlight this aspect of data efficiency, in Fig. 2 we compare the ranges of data sizes for which the four methods determine the correct order. Fig. 2 (a) shows these ranges for (i.e. the same experiments presented in Fig. 1), while Fig. 2 (b) shows the results for .
All experiments above assume that we have full information on the network that constrains which paths are possible. However, in real data we often have partial information about the underlying network. For instance, we might have access to the network of streets in a city, but lack information on road signs that indicate forbidden turns and one-way lanes. To simulate this situation, we add random edges to the network used to generate paths, i.e. we use to constrain the model, where denote transitions that are apparently allowed, but cannot be realized in the actual network . We show the ranges of data size where each method detects the correct order for the partially known constraint in panels (c) and (d) in Fig. 2. These experiments are performed with the same parameters as above, adding additional directed edges to the network.
In Fig. 1 we note that the likelihood ratio test overfits the Markov order of paths. More precisely, using the -approximation of the test statistic, the type-I error rate converges to the chosen significance threshold from above. The type-I error rate is zero for small data sizes, exceeds the significance threshold around observations, and as the data-size increases, slowly approaches . For moderate data sizes, our results in Fig. 1 further indicate that the likelihood ratio test is prone to make type-I errors with a probability that is orders of magnitude larger than . The curve of shows the frequency with which order three is chosen rather than the correct order , which is a lower bound for the type-I error rate. For the chosen significance threshold , we find that the probability to make a type I error exceeds , while for it exceeds . Despite having a small network and a large number of observations, for and ground truth the likelihood ratio test even chose order with frequency larger than , which we expect to happen with probability . This tendency of the likelihood ratio test to overfit did not show in the experiments with partial knowledge of the constraint. We suspect that the overestimation of the degrees of freedom that is due to the partial knowledge masks the tendency to overfit. Due to this overcounting, the -distribution (with too large degrees of freedom) is not a good approximation for the distribution of the test statistic even in the limit of large data size. In summary, we cannot know whether the likelihood ratio test over- or underfits, or chooses the correct order, even if the network is small ( nodes edges), the ground truth order is small () and the number of observed transitions is large ().
Of the four methods discussed in our work, we find that the Bayesian method performed best. It lacks the tendency to overfit exhibited by the likelihood ratio test, and it is considerably more data efficient compared to AIC and BIC. In Fig. 1 (a), for with full knowledge of the network, we find that the likelihood ratio test needs more than times more data, AIC needs almost times more data, and BIC needs more than times more data compared to the Bayesian method where we demand ”very strong” evidence. For with full knowledge on the constraint (Fig. 1 (b)), these differences are even larger: The likelihood ratio test needs more than times more data than Bayesian method demanding ”very strong evidence”, while AIC needs almost and BIC needs more than times more data. We find that the likelihood ratio test needed marginally fewer data only in the case for the partially known constraint with (Fig. 1 (c)), where it needs of the data needed using Bayes factors. For (Fig. 1 (d)) it is again less data-efficient, demanding more data than the Bayesian approach. We highlight that the proposed Bayesian method is based on an analytical expression for the model evidence, i.e. we do not require Monte Carlo methods for the integration. Hence, the observed improvements over MLE-based techniques are gained despite the fact that the Bayesian method has the same computational complexity.
We finally comment on potential threats to the validity of our work. We first note that the overfitting of the likelihood ratio test shown in our experiments was not observed in . We suspect that this is due to the mistake in the degrees of freedom calculation, which we mentioned and corrected in Section 3. This mistake tends to additionally penalize higher orders, thus masking the tendency of the likelihood ratio test to over-fit for moderate data sizes. Furthermore, we used the argument of Scholtes  to justify the use of MON models as opposed to unconstrained Markov chains. To ensure that the mistake mentioned above does not challenge the advantages of multi-order models over unconstrained Markov chains, we experimentally reproduced the finding of  that multi-order models are more data-efficient. The interested reader can find those results in the Section E.1. Another potential threat to validity could be that our findings depend on the networks that we used as underlying constraints, and which determine the degrees of freedom of the model. In Section E.2 we checked whether our results qualitatively change under different network constraints and found no evidence for it. Finally, to illustrate the application of our method to real data on paths (where, however, the ground truth Markov order is unknown), in the Appendix F we report the detected Markov order for several data sets that were previously analyzed in .
An increasing volume of categorical sequence data contains large numbers of short, variable-length paths observed in a network, e.g., web users navigating through networks of linked documents, information propagating in social networks, or passengers travelling through transportation networks. The analysis of such path data requires higher-order network models that capture sequential patterns, while accounting for the constraints imposed by the underlying network. Learning the optimal order of such higher-order network models is an open problem , which we tackle in this work. We correct a recently proposed solution based on the likelihood ratio test, and adapt AIC and BIC-based estimators of the Markov order to our specific problem. We further derive a Bayesian approach to learn an optimal multi-order network model for paths. Our experimental results show that the proposed Bayesian method considerably outperforms MLE-based methods. We find that it more reliably learns the correct Markov order of paths, requires less data, and is robust in situations where we have only partial knowledge about the constraints imposed by the underlying network.
Our work opens several perspectives for future research: First, our formulation allows to include higher-order constraints such as, e.g., non-backtracking walks or sequences constrained to complex “path motifs”, which could improve model selection. Second, while we have focussed on a flat prior (Eq. 13), it is an open question how different choices of the prior affect Bayesian model selection. Third, different from MLE-based approaches, our Bayesian method assigns probabilities to multi-order models with different maximum orders , which can be leveraged for ensemble learning techniques. Finally, it would be interesting to adapt other Markov order detection techniques mentioned in Section 2 to multi-order models of paths, e.g. techniques based on surrogate data [9, 28, 38] or conditional mutual information . We believe that our work serves as a baseline for the development of further methods to learn the Markov order of paths.
The broader impact of our work is due to the growing interest in the role of higher-order interactions in complex networks, i.e. non-dyadic relations that not only involve pairs of nodes directly connected via edges. Data on paths observed in social, technical, and biological networks have become an important source of information on non-dyadic, indirect interactions in complex networks. Neglecting such higher-order interactions limits our ability to analyse rich time-stamped data on networks, and hinders the modelling of complex systems. Higher-order interactions influence if and how nodes can indirectly influence each other, thus fundamentally challenging our understanding of the causal topology of complex systems. Consequently, the development of higher- and multi-order models for paths in networked systems has become a major focus of the interdisciplinary network science community [17, 5].
Recent works in this field have highlighted opportunities resulting from higher-order generalizations of network analytic methods such as, e.g. random walk models, community detection, centrality measures, node embedding, anomaly detection, or link prediction[29, 32, 31, 40, 41, 22, 30]. However, a major stumbling block is the lack of methods suitable to determine optimal higher-order models for paths that are constrained to a known network topology. Combining higher-order models of paths studied in network science with model selection and statistical learning, we seek to close this gap. Apart from enabling network scientists to learn the optimal Markov order of paths in a network, our work contributes to answering the important question for which systems standard network models (of first-order) are sufficient, and in which cases such models are likely to underfit interaction patterns.
Luka Petrović and Ingo Scholtes acknowledge support by the Swiss National Science Foundation, grant 176938. Ingo Scholtes acknowledges support by the project bergisch.smart.mobility, funded by the German state of North Rhine-Westphalia.
- Akaike  H. Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
- Anderson and Goodman  T. W. Anderson and L. A. Goodman. Statistical inference about markov chains. The Annals of Mathematical Statistics, pages 89–110, 1957.
- Baigorri et al.  A. Baigorri, C. Gonçalves, and P. Resende. Markov chain order estimation and 2-divergence measure. arXiv preprint arXiv:0910.0264, 2009.
- Bartlett  M. S. Bartlett. The frequency goodness of fit test for probability chains. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 47, pages 86–95. Cambridge University Press, 1951.
- Battiston et al.  F. Battiston, G. Cencetti, I. Iacopini, V. Latora, M. Lucas, A. Patania, J.-G. Young, and G. Petri. Networks beyond pairwise interactions: structure and dynamics. arXiv e-prints, art. arXiv:2006.01764, June 2020.
- Billingsley  P. Billingsley. Statistical methods in markov chains. The Annals of Mathematical Statistics, pages 12–40, 1961.
- Bollobás  B. Bollobás. Modern graph theory, volume 184. Springer Science & Business Media, 2013.
- Chatfield  C. Chatfield. Statistical inference regarding markov chain models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 22(1):7–20, 1973.
- Corrêa et al.  D. C. Corrêa, J. M. Moore, T. Jüngling, and M. Small. Constrained markov order surrogates. Physica D: Nonlinear Phenomena, page 132437, 2020.
- Csiszár et al.  I. Csiszár, P. C. Shields, et al. The consistency of the bic markov order estimator. The Annals of Statistics, 28(6):1601–1619, 2000.
- Dalevi and Dubhashi  D. Dalevi and D. Dubhashi. The peres-shields order estimator for fixed and variable length markov models with applications to dna sequence similarity. In International Workshop on Algorithms in Bioinformatics, pages 291–302. Springer, 2005.
- Dorea  C. C. Y. Dorea. Optimal penalty term for edc markov chain order estimator. Annales de l’ISUP, 2008. ISSN 1626-1607.
- for London  T. for London. Rolling Origin and Destination Survey (RODS) database, 2014. URL http://www.tfl.gov.uk/info-for/open-data-users/our-feeds.
- Hoel  P. G. Hoel. A test for markoff chains. Biometrika, 41(3/4):430–433, 1954.
- Kass and Raftery  R. E. Kass and A. E. Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.
- Katz  R. W. Katz. On some criteria for estimating the order of a markov chain. Technometrics, 23(3):243–249, 1981.
- Lambiotte et al.  R. Lambiotte, M. Rosvall, and I. Scholtes. From networks to optimal higher-order models of complex systems. Nature physics, page 1, 2019.
- Liu and Lawrence  J. S. Liu and C. E. Lawrence. Bayesian inference on biopolymer models. Bioinformatics (Oxford, England), 15(1):38–52, 1999.
- MacKay and Mac Kay  D. J. MacKay and D. J. Mac Kay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
- Markov  A. Markov. Extension of law of big numbers on variables, depending from each other. Izvestiya Fiziko-Matematicheskogo Obschestva pri Kazanskom Universitete, 2:135–156, 1906.
- Markov  A. Markov. Example of statistical research on text of “eugene onegin”, illustrating interconnection of trials in chain. Izvestiya Akademii Nauk SPb, 6:153–162, 1913.
- Mellor  A. Mellor. The temporal event graph. Journal of Complex Networks, 6(4):639–659, 2018.
- Menéndez et al.  M. Menéndez, L. Pardo, M. Pardo, and K. Zografos. Testing the order of markov dependence in dna sequences. Methodology and computing in applied probability, 13(1):59–74, 2011.
- Moore et al.  J. M. Moore, D. C. Corrêa, and M. Small. Is bach’s brain a markov chain? recurrence quantification to assess markov order for short, symbolic, musical compositions. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(8):085715, 2018.
- Papapetrou and Kugiumtzis  M. Papapetrou and D. Kugiumtzis. Markov chain order estimation with conditional mutual information. Physica A: Statistical Mechanics and its Applications, 392(7):1593–1601, 2013.
- Papapetrou and Kugiumtzis  M. Papapetrou and D. Kugiumtzis. Markov chain order estimation with parametric significance tests of conditional mutual information. Simulation Modelling Practice and Theory, 61:1–13, 2016.
- Peres and Shields  Y. Peres and P. Shields. Two new markov order estimators. arXiv preprint math/0506080, 2005.
- Pethel and Hahs  S. Pethel and D. Hahs. Exact significance test for markov order. Physica D: Nonlinear Phenomena, 269:42–47, 2014.
- Rosvall et al.  M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West, and R. Lambiotte. Memory in network flows and its effects on spreading dynamics and community detection. Nature communications, 5:4630, 2014.
- Saebi et al.  M. Saebi, G. L. Ciampaglia, L. M. Kaplan, and N. V. Chawla. Honem: Network embedding using higher-order patterns in sequential data. arXiv preprint arXiv:1908.05387, 2019.
- Scholtes  I. Scholtes. When is a network a network?: Multi-order graphical model selection in pathways and temporal networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1037–1046. ACM, 2017.
- Scholtes et al.  I. Scholtes, N. Wider, and A. Garas. Higher-order aggregate networks in the analysis of temporal networks: path structures and centralities. The European Physical Journal B, 89(3):61, 2016.
- Schwarz et al.  G. Schwarz et al. Estimating the dimension of a model. The annals of statistics, 6(2):461–464, 1978.
- Singer et al.  P. Singer, D. Helic, B. Taraghi, and M. Strohmaier. Detecting memory and structure in human navigation patterns using markov chain models of varying order. PloS one, 9(7):e102070, 2014.
- Strelioff et al.  C. C. Strelioff, J. P. Crutchfield, and A. W. Hübler. Inferring markov chains: Bayesian estimation, model comparison, entropy rate, and out-of-class modeling. Physical Review E, 76(1):011106, 2007.
- Tong  H. Tong. Determination of the order of a markov chain by akaike’s information criterion. Journal of applied probability, 12(3):488–497, 1975.
- TransStat  R. TransStat. Origin and Destination Survey database, 2014. URL http://www.transtats.bts.gov/Tables.asp?DB_ID=125.
- Van der Heyden et al.  M. J. Van der Heyden, C. G. Diks, B. P. Hoekstra, and J. DeGoede. Testing the order of discrete markov chains using surrogate data. Physica D: Nonlinear Phenomena, 117(1-4):299–313, 1998.
- West and Leskovec  R. West and J. Leskovec. Human wayfinding in information networks. In Proceedings of the 21st international conference on World Wide Web, pages 619–628, 2012.
- Xu et al.  J. Xu, T. L. Wickramarathne, and N. V. Chawla. Representing higher-order dependencies in networks. Science advances, 2(5):e1600028, 2016.
- Xu et al.  J. Xu, M. Saebi, B. Ribeiro, L. M. Kaplan, and N. V. Chawla. Detecting anomalies in sequential data with higher-order networks. arXiv preprint arXiv:1712.09658, 2017.
- Zhao et al.  L. Zhao, C. Dorea, and C. Gonçalves. On determination of the order of a markov chain. Statistical inference for stochastic processes, 4(3):273–282, 2001.
Appendix A Marginal likelihood: step by step derivation
In this section we show the step by step derivation of the marginal likelihood, shown in Eq. 17.
First we write the definition of the marginal likelihood:
We use the fact that random variablesfor different are independent to pull the products outside of the integral.
We note that the integral above is, by definition, a moment of the Dirichlet distribution.
Therefore we substitute the formula for the moment of the Dirichlet distribution, ( denotes the well-known multivariate beta function)
which is the final result.
Appendix B Exponential penalization of higher orders
The other apriori distribution of the orders of the Markov chain that some researchers use  is the prior that additionally penalizes higher orders based on the number of the degrees of freedom. We applied this intuition to the MON models:
We use the same experimental setup as for the Fig. 1 in the main paper: nodes, edges, paths with ground truth markov order , and run independent experiments for each data-size. In Fig. 3 we show the frequencies of choosing the order with the Bayes factor method. We show in the upper row the uniform prior with either “positive” or “very strong” evidence requirement, and in the lower we show the prior that penalizes the number of degrees of freedom exponentially, as defined in Eq. 31. We obtain that the exponential prior needs more than 50 times more data to detect the correct order.
This behaviour is expected, because the prior is biased towards smaller orders, and we need a lot of evidence to overcome this bias. In our other experiments, we saw similar behaviour for , and the partially known constraint.
Appendix C Efficient determination criterion (EDC)
For the sake of transparency, we show here our attempt to adapt the efficient determination criterion to the problem of determining the Markov order of paths. As the results will show, this attempt did not yield a good estimator of the Markov order, which is why we did not include it in the comparison in the main work.
Regarding the EDC for general Markov chains, Zhao et al.  introduced it and proved its consistency. Similarly to AIC and BIC, it weighs the log-likelihood of a model with its degrees of freedom.
While Zhao et al.  gave boundaries for the penalty term for the degrees of freedom, the optimal
was derived by Dorea .
We adapted this estimator to variable-length paths in a network by simply substituting the Markov chain model for the multi-order network model. We therefore (re)defined EDC for the context of paths in the following manner:
We tested this version of the EDC in the same experiment where we have tested the other methods Fig. 1, and obtained the results that we show in Fig. 4. This adaptation of EDC clearly over-fitted, because with increasing evidence, the detected order also increases. We leave the further investigation of EDC in the context of paths in a network for future work.
Appendix D Other results
In this section, we show results of experiments identical to the one we presented in the Fig. 1 (random network as the underlying topology of the process with nodes and edges, and independent experiments per data-size), albeit with the ground truth order . We can see the results in Fig. 5. These results are also shown, in the form of data ranges, in Fig. 2 in the main manuscript. They do not qualitatively differ from the results from : Bayes factors with the demand of very strong evidence performed best.
Appendix E Validity
At the end of the discussion of the main manuscript we mentioned two threats to validity of our work that we alleviated by additional experiments. The first issue is related to the justification of constrained models as opposed to unconstrained models. The second issue is that we only used a single choice of parameters with nodes and edges in our experiment, which raises the concern whether our results generalize to other topologies. Addressing these issues, here we present the results of additional experiments.
e.1 Constrained vs unconstrained models
First, we note that, our work relies on the assumption that we do not need to consider models without constraints, which is the main result of Scholtes . However, we pointed out a flaw in , in the formula for the degrees of freedom. Thus, we checked whether a model selection on unconstrained models showed better data-efficiency than the model selection on constrained models, and present the results in Fig. 6. In the top row of Fig. 6 we show the results from Fig. 1, and in the bottom, we show the results with the setup that is exactly the same aside that models are unconstrained (as a constraint, we used a fully connected topology, which allows every transition).
We can see in Fig. 6, as expected from the argument given in the introduction of the main manuscript, that the constrained models are more sensitive (data-efficient) than the unconstrained models that do not assume any apriori knowledge of impossible transitions.
e.2 Effect of network topology on data efficiency
In this section we address the question whether our results may be sensitive to the network chosen for the ground truth path-generation.
We vary the number of edges in the Erdös-Rényi model that generates the underlying topology of the ground truth model that generates paths. We ran experiments like we have shown in the Fig. 1. in the main manuscript with different numbers of edges: . For each case, we determine the lower boundary of the data range where the method guessed the correct Markov order of paths every time i.e. the lower boundary of the light-green area in Fig. 1 of the main manuscript, and plot the dependency of the lower boundary to the degrees of freedom of the model. We found the number of degrees of freedom to be the most meaningful to show, because it encompasses the number of nodes and edges of the underlying topology, and the ground truth Markov order of the paths, all at the same time. The dependency is shown in Fig. 7. The results show no evidence that the evidence should be rejected.
Appendix F Real-world data
To demonstrate the application of the methods in practical scenarios, we show the Markov order detected using all methods discussed in the paper in three different empirical data sets.
The first data set represents the click stream paths in the Wikispeedia game (WIKI) . Users are instructed to navigate from a given starting article to a given goal using only the hyperlinks connecting the pages. For the network constraint , we used the hyperlink network between the Wikipedia articles that the users navigated on. This graph has 4592 nodes and 239 764 directed edges. There are in total 76 192 observed paths, with 475 889 observations of transitions.
Second data-set consists of flight routes in the network of US airports measured in 2001 (FLY) . For the network constraint , we assumed that every possible connection between the airports had at least one passenger using it. This graph has 175 nodes and 1598 edges. There are in total 286 810 observed paths, with 1.2 million observations of transitions.
Third data set captures passenger trajectories in the network of London subway stations (TUBE) . For the network constraint , we took the map of the London tube. The paths were inferred from the origin-destination data, as the shortest path between the origin and the destination. This graph has 276 nodes and 666 directed edges. There are around 4.3 million observed paths, with 34 million observations of transitions.
|Data||Max order tested||AIC||BIC||LR test||BF ”very strong”|
We notice that the detected order for WIKI data differs from the one detected in , and attribute this difference to the fact that they used paths between only the 100 most visited nodes, ignoring the rest of the articles. Thus they have used only the part of the network for which there is enough evidence to assert higher-order correlations. The reason why we did not follow their procedure is two-fold: first, we wanted to see whether there would be enough evidence to detect the higher-order correlations even when we consider the whole network, and second, we wanted to test whether we could use the methods in a larger real-world network with a reasonable processing time. While there wasn’t enough evidence to assert a higher order, we find that the methods are sufficiently scalable that we needed a few hours on a consumer-grade machine to test up to fourth order.
The orders, detected with likelihood ratio test, in the other two data-sets coincided with  despite the mistake in the calculation of the degrees of freedom in . We notice that the Bayes factor is indeed more sensitive, detecting larger optimal orders than the other methods. These results confirm that our method is relevant to infer the optimal Markov order in real-world path data.